[GH-ISSUE #6492] Models drastically quality drop on chat/completions gateway #4086

Closed
opened 2026-04-12 14:59:30 -05:00 by GiteaMirror · 7 comments
Owner

Originally created by @yaroslavyaroslav on GitHub (Aug 24, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6492

What is the issue?

Folks raised the following issue with my app (a frontend for ollama): https://github.com/yaroslavyaroslav/OpenAI-sublime-text/issues/57

In short, models respond with very low quality through my app. Long story short:

  1. I ran `export OLLAMA_DEBUG=1 && ollama serve`.
  2. Ran `ollama run qwen2:1.5b --verbose --nowordwrap` with the prompt below and got quite a fine answer.
  3. Then I ran:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_LOCAL_API_KEY" \
  -d '{
    "model": "qwen2:1.5b",
    "messages": [{"role": "user", "content": "create sublime text plugin that takes selected text and convert it by applying base64encoding and replacing selected text with the conversion"}]
  }'

and got this mess:

{"id":"chatcmpl-521","object":"chat.completion","created":1724524318,"model":"qwen2:1.5b","system_fingerprint":"fp_ollama","choices":[{"index":0,"message":{"role":"assistant","content":"To create a Sublime Text plugin that converts selected text to base64 encoding, replaces it in place, and wraps the entire process in a command-line interface (CLI), you can use a combination of file handling and scripting. Here's how to implement this feature:\n\n1. Start by creating a new project in Sublime Text.\n2. Add the following package to your `Package Info` -\u003e `Packages/User/...` subfolder:\n```\n//sublime-text-commands\n{\n  // Your custom CLI commands here\n}\n``` \n3. Copy and paste the contents of this code snippet into the newly created `.sublime-package` file:\n\n```json\n{\n  \"name\": \"Custom CLI Commands\",\n  \"description\": \"Commands for a Sublime Text plugin\",\n  \"版本\": 1,\n  \"dependencies: {\n    // ... your npm packages here\n    \"command-line-encoder\": \"^2.3.0\"\n  },\n  \"cmd\": [\n      \"perl -e 'print Encode::b64_encode(\\$arg2));'\"\n  ]\n}\n```\n\n4. Save the file and restart Sublime Text.\n\nNow, you should be able to see your plugin under the `\"commands\"` menu when launching the Sublime Text command palette:\n\n1. Choose `View-\u003e Find -\u003e Replace with` \u003e `\u003cPackage name\u003e`.\n2. Select any line in the current document that contains text.\n3. Press `Enter` to apply the above code.\n4. Choose an item from the output list on the right (you should see your plugin's CLI).\n\nTo convert selected text, use a combination of the following commands:\n\n- `\u003cPackage name\u003e`: Open the Sublime Text command palette and type in `\u003cPackage name\u003e`.\n  - Then press `Enter`.\n  - Check the box next to `\"Find\"` to search for any specific text.\n\nHere's an example of how you can do this with regular expressions and the new plugin:\n\n1. Add a custom regular expression to find any line that matches the input:\n- Search: `'(?s)^\\s+'\n- Replace: `'`\n  - The `\\s+` captures one or more whitespace characters before any text.\n\n2. Press `Enter`.\n\n3. To replace selected text with base64 encoding and wrap it in a command prompt and press `Enter`. \n\n```json\n\"cmd\": [ \n    \"perl -e 'print Encode::b64_encode($arg2);'\",\n    \"(perl -e 'print Encode::b64_encode(\\$arg2));'\"\n]\n```\n\n4. Repeat the process by pressing `Enter` a few times for multiple lines."},"finish_reason":"stop"}],"usage":{"prompt_tokens":32,"completion_tokens":541,"total_tokens":573}}

On the server side I noticed that `ollama run` goes through a different gateway than `chat/completions`, and that the request that appears in the logs is far larger than the one produced by the `curl` call.

I haven't dug into this very deeply, but my guess is that some additional setup happens when calling `ollama run`.
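For comparison, here is what a direct request to ollama's native `/api/chat` endpoint (the route `ollama run` uses, as the logs below show) would look like with the same prompt. This is only an illustration, not something from the original report; `"stream": false` is set just to get a single JSON response back:

```bash
# Illustrative only: the same prompt sent to the native /api/chat endpoint,
# which is what `ollama run` goes through. "stream": false returns one
# JSON object instead of a stream of chunks.
curl http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2:1.5b",
    "stream": false,
    "messages": [{"role": "user", "content": "create sublime text plugin that takes selected text and convert it by applying base64encoding and replacing selected text with the conversion"}]
  }'
```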

Here are the logs.

2024/08/24 20:18:58 routes.go:1125: INFO server config env="map[OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/path-to-ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR:]"
time=2024-08-24T20:18:58.262+02:00 level=INFO source=images.go:782 msg="total blobs: 5"
time=2024-08-24T20:18:58.263+02:00 level=INFO source=images.go:790 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:   export GIN_MODE=release
 - using code:  gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST   /api/embed                --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET    /api/ps                   --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST   /v1/completions           --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST   /v1/embeddings            --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models                --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models/:model         --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (6 handlers)
[GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-08-24T20:18:58.263+02:00 level=INFO source=routes.go:1172 msg="Listening on 127.0.0.1:11434 (version 0.3.6)"
time=2024-08-24T20:18:58.267+02:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/var/folders/gc/v8tx0lzx4qg7tt1rl88wzgwr0000gn/T/ollama4145360689/runners
time=2024-08-24T20:18:58.267+02:00 level=DEBUG source=payload.go:182 msg=extracting variant=metal file=build/darwin/arm64/metal/bin/ggml-common.h.gz
time=2024-08-24T20:18:58.267+02:00 level=DEBUG source=payload.go:182 msg=extracting variant=metal file=build/darwin/arm64/metal/bin/ggml-metal.metal.gz
time=2024-08-24T20:18:58.267+02:00 level=DEBUG source=payload.go:182 msg=extracting variant=metal file=build/darwin/arm64/metal/bin/ollama_llama_server.gz
time=2024-08-24T20:18:58.290+02:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/var/folders/gc/v8tx0lzx4qg7tt1rl88wzgwr0000gn/T/ollama4145360689/runners/metal/ollama_llama_server
time=2024-08-24T20:18:58.290+02:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [metal]"
time=2024-08-24T20:18:58.290+02:00 level=DEBUG source=payload.go:45 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-08-24T20:18:58.290+02:00 level=DEBUG source=sched.go:105 msg="starting llm scheduler"
time=2024-08-24T20:18:58.338+02:00 level=INFO source=types.go:105 msg="inference compute" id=0 library=metal compute="" driver=0.0 name="" total="10.7 GiB" available="10.7 GiB"
time=2024-08-24T20:19:07.474+02:00 level=DEBUG source=sched.go:181 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=0x100950390 gpu_count=1
time=2024-08-24T20:19:07.489+02:00 level=DEBUG source=sched.go:219 msg="loading first model" model=/path-to-ollama/.ollama/models/blobs/sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e
time=2024-08-24T20:19:07.489+02:00 level=DEBUG source=memory.go:101 msg=evaluating library=metal gpu_count=1 available="[10.7 GiB]"
time=2024-08-24T20:19:07.490+02:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=/path-to-ollama/.ollama/models/blobs/sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e gpu=0 parallel=4 available=11453251584 required="1.9 GiB"
time=2024-08-24T20:19:07.490+02:00 level=DEBUG source=server.go:101 msg="system memory" total="16.0 GiB" free="4.0 GiB" free_swap="0 B"
time=2024-08-24T20:19:07.490+02:00 level=DEBUG source=memory.go:101 msg=evaluating library=metal gpu_count=1 available="[10.7 GiB]"
time=2024-08-24T20:19:07.490+02:00 level=INFO source=memory.go:309 msg="offload to metal" layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[10.7 GiB]" memory.required.full="1.9 GiB" memory.required.partial="1.9 GiB" memory.required.kv="224.0 MiB" memory.required.allocations="[1.9 GiB]" memory.weights.total="927.4 MiB" memory.weights.repeating="744.8 MiB" memory.weights.nonrepeating="182.6 MiB" memory.graph.full="299.8 MiB" memory.graph.partial="299.8 MiB"
time=2024-08-24T20:19:07.491+02:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/var/folders/gc/v8tx0lzx4qg7tt1rl88wzgwr0000gn/T/ollama4145360689/runners/metal/ollama_llama_server
time=2024-08-24T20:19:07.491+02:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/var/folders/gc/v8tx0lzx4qg7tt1rl88wzgwr0000gn/T/ollama4145360689/runners/metal/ollama_llama_server
time=2024-08-24T20:19:07.492+02:00 level=INFO source=server.go:393 msg="starting llama server" cmd="/var/folders/gc/v8tx0lzx4qg7tt1rl88wzgwr0000gn/T/ollama4145360689/runners/metal/ollama_llama_server --model /path-to-ollama/.ollama/models/blobs/sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 29 --verbose --parallel 4 --port 54635"
time=2024-08-24T20:19:07.492+02:00 level=DEBUG source=server.go:410 msg=subprocess environment="[PATH=/opt/homebrew/opt/ruby/bin:/path-to-ollama/.mint/bin:/Applications/Sublime Merge.app/Contents/SharedSupport/bin:/Applications/Sublime Text.app/Contents/SharedSupport/bin:/opt/homebrew/bin:/opt/homebrew/sbin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin:/Library/Apple/usr/bin:/Applications/Little Snitch.app/Contents/Components:/path-to-ollama/.cargo/bin:/Applications/kitty.app/Contents/MacOS LD_LIBRARY_PATH=/var/folders/gc/v8tx0lzx4qg7tt1rl88wzgwr0000gn/T/ollama4145360689/runners/metal:/var/folders/gc/v8tx0lzx4qg7tt1rl88wzgwr0000gn/T/ollama4145360689/runners]"
time=2024-08-24T20:19:07.493+02:00 level=INFO source=sched.go:445 msg="loaded runners" count=1
time=2024-08-24T20:19:07.493+02:00 level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
time=2024-08-24T20:19:07.494+02:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=3535 commit="1e6f6554" tid="0x1e9306940" timestamp=1724523548
INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="0x1e9306940" timestamp=1724523548 total_threads=10
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="9" port="54635" tid="0x1e9306940" timestamp=1724523548
llama_model_loader: loaded meta data with 21 key-value pairs and 338 tensors from /path-to-ollama/.ollama/models/blobs/sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.name str              = Qwen2-1.5B-Instruct
llama_model_loader: - kv   2:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 1536
llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 8960
llama_model_loader: - kv   6:                 qwen2.attention.head_count u32              = 12
llama_model_loader: - kv   7:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv   8:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv   9:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  12:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  17:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  19:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_0:  196 tensors
llama_model_loader: - type q6_K:    1 tensors
time=2024-08-24T20:19:08.250+02:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 293
llm_load_vocab: token to piece cache size = 0.9338 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 1536
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 12
llm_load_print_meta: n_head_kv        = 2
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 6
llm_load_print_meta: n_embd_k_gqa     = 256
llm_load_print_meta: n_embd_v_gqa     = 256
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8960
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 1.54 B
llm_load_print_meta: model size       = 885.97 MiB (4.81 BPW) 
llm_load_print_meta: general.name     = Qwen2-1.5B-Instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.30 MiB
ggml_backend_metal_log_allocated_size: allocated buffer, size =   885.97 MiB, (  886.03 / 10922.67)
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:        CPU buffer size =   182.57 MiB
llm_load_tensors:      Metal buffer size =   885.97 MiB
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 11453.25 MB
llama_kv_cache_init:      Metal KV buffer size =   224.00 MiB
llama_new_context_with_model: KV self size  =  224.00 MiB, K (f16):  112.00 MiB, V (f16):  112.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     2.34 MiB
llama_new_context_with_model:      Metal compute buffer size =   299.75 MiB
llama_new_context_with_model:        CPU compute buffer size =    19.01 MiB
llama_new_context_with_model: graph nodes  = 986
llama_new_context_with_model: graph splits = 2
time=2024-08-24T20:19:08.501+02:00 level=DEBUG source=server.go:638 msg="model load progress 1.00"
DEBUG [initialize] initializing slots | n_slots=4 tid="0x1e9306940" timestamp=1724523548
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=0 tid="0x1e9306940" timestamp=1724523548
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=1 tid="0x1e9306940" timestamp=1724523548
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=2 tid="0x1e9306940" timestamp=1724523548
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=3 tid="0x1e9306940" timestamp=1724523548
INFO [main] model loaded | tid="0x1e9306940" timestamp=1724523548
DEBUG [update_slots] all slots are idle and system prompt is empty, clear the KV cache | tid="0x1e9306940" timestamp=1724523548
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=0 tid="0x1e9306940" timestamp=1724523548
time=2024-08-24T20:19:08.755+02:00 level=INFO source=server.go:632 msg="llama runner started in 1.26 seconds"
time=2024-08-24T20:19:08.755+02:00 level=DEBUG source=sched.go:458 msg="finished setting up runner" model=/path-to-ollama/.ollama/models/blobs/sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=1 tid="0x1e9306940" timestamp=1724523548
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=54639 status=200 tid="0x16ba43000" timestamp=1724523548
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=2 tid="0x1e9306940" timestamp=1724523548
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=54640 status=200 tid="0x16bacf000" timestamp=1724523548
time=2024-08-24T20:19:08.777+02:00 level=DEBUG source=routes.go:1363 msg="chat request" images=0 prompt="<|im_start|>user\n\"create sublime text plugin that takes selected text and convert it by applying base64encoding and replacing selected text with the conversion\"<|im_end|>\n<|im_start|>assistant\nCreating a Sublime Text plugin to perform Base64 encoding on selected text and replace it with converted data is a complex task as you are asking for more than one operation. However, I can provide an outline of how such a feature might be implemented in Sublime Text.\n\nHere's the step-by-step guide to creating such a plugin:\n\n1. **Define Keybinds:** First, you need to define the key bindings that trigger the conversion when the user selects text and presses a specific key.\n\n2. **Create the Plugin:** Create a new Sublime Text plugin file (like `sublime_text_plugin.py`). This file should include the necessary functions for handling command execution, event listeners, etc.\n\n3. **Implement Conversion Function:** In this function, you need to convert the selected text using Base64 encoding. You can use libraries like `base64` in Python to do this.\n\n4. **Insert or Replace Selected Text:** Once the conversion is complete, you need to either insert the converted text into the user's selection or replace it if the user previously typed something there.\n5. **Check for Keybinds to Continue:** If the user presses a key to continue, check whether `execute_command` has been called and if not, call it with the correct parameters.\n\n6. **Event Listening:** Add event listeners in Sublime Text itself so that when changes are made to the selected text (e.g., typed characters), they can trigger the conversion.\n\n7. **Error Handling:** Include error handling for situations where the Base64 encoding process fails or if something else goes wrong during the execution of the command.\n8. **Testing:** Ensure your plugin works as expected by testing it with different scenarios and edge cases, such as when there's no text selected in Sublime Text.\n\nPlease note that this is a high-level overview of creating a Sublime Text plugin. The specifics will depend on the programming language you're using for the plugin (in this case, Python), and how you choose to implement the features described above. For full documentation, follow your chosen platform's official documentation or look up examples online.\n\nRemember that creating plugins like these can be a significant commitment, especially if they are complex and need thorough testing before release.<|im_end|>\n<|im_start|>user\n\"create sublime text plugin that takes selected text and convert it by applying base64encoding and replacing selected text with the conversion\"<|im_end|>\n<|im_start|>assistant\n"
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=3 tid="0x1e9306940" timestamp=1724523548
DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=4 tid="0x1e9306940" timestamp=1724523548
DEBUG [update_slots] slot progression | ga_i=0 n_past=0 n_past_se=0 n_prompt_tokens_processed=523 slot_id=0 task_id=4 tid="0x1e9306940" timestamp=1724523548
DEBUG [update_slots] kv cache rm [p0, end) | p0=0 slot_id=0 task_id=4 tid="0x1e9306940" timestamp=1724523548
DEBUG [print_timings] prompt eval time     =     473.73 ms /   523 tokens (    0.91 ms per token,  1104.00 tokens per second) | n_prompt_tokens_processed=523 n_tokens_second=1103.9997298050373 slot_id=0 t_prompt_processing=473.732 t_token=0.9057973231357553 task_id=4 tid="0x1e9306940" timestamp=1724523553
DEBUG [print_timings] generation eval time =    4179.91 ms /   322 runs   (   12.98 ms per token,    77.04 tokens per second) | n_decoded=322 n_tokens_second=77.0351330446988 slot_id=0 t_token=12.981090062111802 t_token_generation=4179.911 task_id=4 tid="0x1e9306940" timestamp=1724523553
DEBUG [print_timings]           total time =    4653.64 ms | slot_id=0 t_prompt_processing=473.732 t_token_generation=4179.911 t_total=4653.643 task_id=4 tid="0x1e9306940" timestamp=1724523553
DEBUG [update_slots] slot released | n_cache_tokens=845 n_ctx=8192 n_past=844 n_system_tokens=0 slot_id=0 task_id=4 tid="0x1e9306940" timestamp=1724523553 truncated=false
DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=54640 status=200 tid="0x16bacf000" timestamp=1724523553
[GIN] 2024/08/24 - 20:19:13 | 200 |  5.980737958s |       127.0.0.1 | POST     "/api/chat"
time=2024-08-24T20:19:13.432+02:00 level=DEBUG source=sched.go:462 msg="context for request finished"
time=2024-08-24T20:19:13.432+02:00 level=DEBUG source=sched.go:334 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/path-to-ollama/.ollama/models/blobs/sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e duration=5m0s
time=2024-08-24T20:19:13.432+02:00 level=DEBUG source=sched.go:352 msg="after processing request finished event" modelPath=/path-to-ollama/.ollama/models/blobs/sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e refCount=0




time=2024-08-24T20:21:08.524+02:00 level=DEBUG source=sched.go:571 msg="evaluating already loaded" model=/path-to-ollama/.ollama/models/blobs/sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=329 tid="0x1e9306940" timestamp=1724523668
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=330 tid="0x1e9306940" timestamp=1724523668
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=54664 status=200 tid="0x16bb5b000" timestamp=1724523668
time=2024-08-24T20:21:08.527+02:00 level=DEBUG source=routes.go:1363 msg="chat request" images=0 prompt="<|im_start|>user\n\"create sublime text plugin that takes selected text and convert it by applying base64encoding and replacing selected text with the conversion\"<|im_end|>\n<|im_start|>assistant\n"
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=331 tid="0x1e9306940" timestamp=1724523668
DEBUG [prefix_slot] slot with common prefix found | 0=["slot_id",0,"characters",193]
DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=332 tid="0x1e9306940" timestamp=1724523668
DEBUG [update_slots] slot progression | ga_i=0 n_past=34 n_past_se=0 n_prompt_tokens_processed=34 slot_id=0 task_id=332 tid="0x1e9306940" timestamp=1724523668
DEBUG [update_slots] we have to evaluate at least 1 token to generate logits | slot_id=0 task_id=332 tid="0x1e9306940" timestamp=1724523668
DEBUG [update_slots] kv cache rm [p0, end) | p0=33 slot_id=0 task_id=332 tid="0x1e9306940" timestamp=1724523668
DEBUG [print_timings] prompt eval time     =     166.62 ms /    34 tokens (    4.90 ms per token,   204.05 tokens per second) | n_prompt_tokens_processed=34 n_tokens_second=204.0546866560238 slot_id=0 t_prompt_processing=166.622 t_token=4.90064705882353 task_id=332 tid="0x1e9306940" timestamp=1724523677
DEBUG [print_timings] generation eval time =    8888.16 ms /   596 runs   (   14.91 ms per token,    67.06 tokens per second) | n_decoded=596 n_tokens_second=67.0554910065198 slot_id=0 t_token=14.913021812080537 t_token_generation=8888.161 task_id=332 tid="0x1e9306940" timestamp=1724523677
DEBUG [print_timings]           total time =    9054.78 ms | slot_id=0 t_prompt_processing=166.622 t_token_generation=8888.161 t_total=9054.783 task_id=332 tid="0x1e9306940" timestamp=1724523677
DEBUG [update_slots] slot released | n_cache_tokens=630 n_ctx=8192 n_past=629 n_system_tokens=0 slot_id=0 task_id=332 tid="0x1e9306940" timestamp=1724523677 truncated=false
DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=54664 status=200 tid="0x16bb5b000" timestamp=1724523677
[GIN] 2024/08/24 - 20:21:17 | 200 |   9.10044125s |       127.0.0.1 | POST     "/v1/chat/completions"
time=2024-08-24T20:21:17.584+02:00 level=DEBUG source=sched.go:403 msg="context for request finished"
time=2024-08-24T20:21:17.584+02:00 level=DEBUG source=sched.go:334 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/path-to-ollama/.ollama/models/blobs/sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e duration=5m0s
time=2024-08-24T20:21:17.584+02:00 level=DEBUG source=sched.go:352 msg="after processing request finished event" modelPath=/path-to-ollama/.ollama/models/blobs/sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e refCount=0

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

ollama version is 0.3.6

Originally created by @yaroslavyaroslav on GitHub (Aug 24, 2024). Original GitHub issue: https://github.com/ollama/ollama/issues/6492 ### What is the issue? Folks raised the following issue on my side (frontend for ollama) https://github.com/yaroslavyaroslav/OpenAI-sublime-text/issues/57 In short it's about that models response with very low quality through my app. Long story short. 1. I've ran the `export OLLAMA_DEBUG=1 && ollama serve` 2. run `ollama run qwen2:1.5b --verbose --nowordwrap` with the prompt from below and got quite fine answer. 3. then I run ```bash curl http://localhost:11434/v1/chat/completions \ -H "Content-Type: application/json" \ -H "Authorization: Bearer YOUR_LOCAL_API_KEY" \ -d '{ "model": "qwen2:1.5b", "messages": [{"role": "user", "content": "create sublime text plugin that takes selected text and convert it by applying base64encoding and replacing selected text with the conversion"}] }' ``` and got this mess: ``` {"id":"chatcmpl-521","object":"chat.completion","created":1724524318,"model":"qwen2:1.5b","system_fingerprint":"fp_ollama","choices":[{"index":0,"message":{"role":"assistant","content":"To create a Sublime Text plugin that converts selected text to base64 encoding, replaces it in place, and wraps the entire process in a command-line interface (CLI), you can use a combination of file handling and scripting. Here's how to implement this feature:\n\n1. Start by creating a new project in Sublime Text.\n2. Add the following package to your `Package Info` -\u003e `Packages/User/...` subfolder:\n```\n//sublime-text-commands\n{\n // Your custom CLI commands here\n}\n``` \n3. Copy and paste the contents of this code snippet into the newly created `.sublime-package` file:\n\n```json\n{\n \"name\": \"Custom CLI Commands\",\n \"description\": \"Commands for a Sublime Text plugin\",\n \"版本\": 1,\n \"dependencies: {\n // ... your npm packages here\n \"command-line-encoder\": \"^2.3.0\"\n },\n \"cmd\": [\n \"perl -e 'print Encode::b64_encode(\\$arg2));'\"\n ]\n}\n```\n\n4. Save the file and restart Sublime Text.\n\nNow, you should be able to see your plugin under the `\"commands\"` menu when launching the Sublime Text command palette:\n\n1. Choose `View-\u003e Find -\u003e Replace with` \u003e `\u003cPackage name\u003e`.\n2. Select any line in the current document that contains text.\n3. Press `Enter` to apply the above code.\n4. Choose an item from the output list on the right (you should see your plugin's CLI).\n\nTo convert selected text, use a combination of the following commands:\n\n- `\u003cPackage name\u003e`: Open the Sublime Text command palette and type in `\u003cPackage name\u003e`.\n - Then press `Enter`.\n - Check the box next to `\"Find\"` to search for any specific text.\n\nHere's an example of how you can do this with regular expressions and the new plugin:\n\n1. Add a custom regular expression to find any line that matches the input:\n- Search: `'(?s)^\\s+'\n- Replace: `'`\n - The `\\s+` captures one or more whitespace characters before any text.\n\n2. Press `Enter`.\n\n3. To replace selected text with base64 encoding and wrap it in a command prompt and press `Enter`. \n\n```json\n\"cmd\": [ \n \"perl -e 'print Encode::b64_encode($arg2);'\",\n \"(perl -e 'print Encode::b64_encode(\\$arg2));'\"\n]\n```\n\n4. 
Repeat the process by pressing `Enter` a few times for multiple lines."},"finish_reason":"stop"}],"usage":{"prompt_tokens":32,"completion_tokens":541,"total_tokens":573}} ``` On the server side I noticed that `ollama run` triggers another gateway than `chat/completions` and that the request appeared in logs are far greater than the one appeared on `curl` call. Not that I dug this any deep enough but my shot is that there's some additional setup happening when calling `ollama run`. here's the logs. ```log 2024/08/24 20:18:58 routes.go:1125: INFO server config env="map[OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/path-to-ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR:]" time=2024-08-24T20:18:58.262+02:00 level=INFO source=images.go:782 msg="total blobs: 5" time=2024-08-24T20:18:58.263+02:00 level=INFO source=images.go:790 msg="total unused blobs removed: 0" [GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached. [GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production. - using env: export GIN_MODE=release - using code: gin.SetMode(gin.ReleaseMode) [GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers) [GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers) [GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers) [GIN-debug] POST /api/embed --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (5 handlers) [GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers) [GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers) [GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers) [GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers) [GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers) [GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers) [GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers) [GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers) [GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers) [GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers) [GIN-debug] POST /v1/completions --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers) [GIN-debug] POST /v1/embeddings --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers) [GIN-debug] GET /v1/models --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (6 handlers) [GIN-debug] GET /v1/models/:model --> 
github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (6 handlers) [GIN-debug] GET / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers) [GIN-debug] GET /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers) [GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers) [GIN-debug] HEAD / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers) [GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers) [GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers) time=2024-08-24T20:18:58.263+02:00 level=INFO source=routes.go:1172 msg="Listening on 127.0.0.1:11434 (version 0.3.6)" time=2024-08-24T20:18:58.267+02:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/var/folders/gc/v8tx0lzx4qg7tt1rl88wzgwr0000gn/T/ollama4145360689/runners time=2024-08-24T20:18:58.267+02:00 level=DEBUG source=payload.go:182 msg=extracting variant=metal file=build/darwin/arm64/metal/bin/ggml-common.h.gz time=2024-08-24T20:18:58.267+02:00 level=DEBUG source=payload.go:182 msg=extracting variant=metal file=build/darwin/arm64/metal/bin/ggml-metal.metal.gz time=2024-08-24T20:18:58.267+02:00 level=DEBUG source=payload.go:182 msg=extracting variant=metal file=build/darwin/arm64/metal/bin/ollama_llama_server.gz time=2024-08-24T20:18:58.290+02:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/var/folders/gc/v8tx0lzx4qg7tt1rl88wzgwr0000gn/T/ollama4145360689/runners/metal/ollama_llama_server time=2024-08-24T20:18:58.290+02:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [metal]" time=2024-08-24T20:18:58.290+02:00 level=DEBUG source=payload.go:45 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY" time=2024-08-24T20:18:58.290+02:00 level=DEBUG source=sched.go:105 msg="starting llm scheduler" time=2024-08-24T20:18:58.338+02:00 level=INFO source=types.go:105 msg="inference compute" id=0 library=metal compute="" driver=0.0 name="" total="10.7 GiB" available="10.7 GiB" time=2024-08-24T20:19:07.474+02:00 level=DEBUG source=sched.go:181 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=0x100950390 gpu_count=1 time=2024-08-24T20:19:07.489+02:00 level=DEBUG source=sched.go:219 msg="loading first model" model=/path-to-ollama/.ollama/models/blobs/sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e time=2024-08-24T20:19:07.489+02:00 level=DEBUG source=memory.go:101 msg=evaluating library=metal gpu_count=1 available="[10.7 GiB]" time=2024-08-24T20:19:07.490+02:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=/path-to-ollama/.ollama/models/blobs/sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e gpu=0 parallel=4 available=11453251584 required="1.9 GiB" time=2024-08-24T20:19:07.490+02:00 level=DEBUG source=server.go:101 msg="system memory" total="16.0 GiB" free="4.0 GiB" free_swap="0 B" time=2024-08-24T20:19:07.490+02:00 level=DEBUG source=memory.go:101 msg=evaluating library=metal gpu_count=1 available="[10.7 GiB]" time=2024-08-24T20:19:07.490+02:00 level=INFO source=memory.go:309 msg="offload to metal" layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[10.7 GiB]" memory.required.full="1.9 GiB" memory.required.partial="1.9 GiB" memory.required.kv="224.0 MiB" memory.required.allocations="[1.9 GiB]" 
memory.weights.total="927.4 MiB" memory.weights.repeating="744.8 MiB" memory.weights.nonrepeating="182.6 MiB" memory.graph.full="299.8 MiB" memory.graph.partial="299.8 MiB" time=2024-08-24T20:19:07.491+02:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/var/folders/gc/v8tx0lzx4qg7tt1rl88wzgwr0000gn/T/ollama4145360689/runners/metal/ollama_llama_server time=2024-08-24T20:19:07.491+02:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/var/folders/gc/v8tx0lzx4qg7tt1rl88wzgwr0000gn/T/ollama4145360689/runners/metal/ollama_llama_server time=2024-08-24T20:19:07.492+02:00 level=INFO source=server.go:393 msg="starting llama server" cmd="/var/folders/gc/v8tx0lzx4qg7tt1rl88wzgwr0000gn/T/ollama4145360689/runners/metal/ollama_llama_server --model /path-to-ollama/.ollama/models/blobs/sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 29 --verbose --parallel 4 --port 54635" time=2024-08-24T20:19:07.492+02:00 level=DEBUG source=server.go:410 msg=subprocess environment="[PATH=/opt/homebrew/opt/ruby/bin:/path-to-ollama/.mint/bin:/Applications/Sublime Merge.app/Contents/SharedSupport/bin:/Applications/Sublime Text.app/Contents/SharedSupport/bin:/opt/homebrew/bin:/opt/homebrew/sbin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin:/Library/Apple/usr/bin:/Applications/Little Snitch.app/Contents/Components:/path-to-ollama/.cargo/bin:/Applications/kitty.app/Contents/MacOS LD_LIBRARY_PATH=/var/folders/gc/v8tx0lzx4qg7tt1rl88wzgwr0000gn/T/ollama4145360689/runners/metal:/var/folders/gc/v8tx0lzx4qg7tt1rl88wzgwr0000gn/T/ollama4145360689/runners]" time=2024-08-24T20:19:07.493+02:00 level=INFO source=sched.go:445 msg="loaded runners" count=1 time=2024-08-24T20:19:07.493+02:00 level=INFO source=server.go:593 msg="waiting for llama runner to start responding" time=2024-08-24T20:19:07.494+02:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error" INFO [main] build info | build=3535 commit="1e6f6554" tid="0x1e9306940" timestamp=1724523548 INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="0x1e9306940" timestamp=1724523548 total_threads=10 INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="9" port="54635" tid="0x1e9306940" timestamp=1724523548 llama_model_loader: loaded meta data with 21 key-value pairs and 338 tensors from /path-to-ollama/.ollama/models/blobs/sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
llama_model_loader: - kv 0: general.architecture str = qwen2 llama_model_loader: - kv 1: general.name str = Qwen2-1.5B-Instruct llama_model_loader: - kv 2: qwen2.block_count u32 = 28 llama_model_loader: - kv 3: qwen2.context_length u32 = 32768 llama_model_loader: - kv 4: qwen2.embedding_length u32 = 1536 llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 8960 llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 12 llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 2 llama_model_loader: - kv 8: qwen2.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 10: general.file_type u32 = 2 llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2 llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 151645 llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 151643 llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 151643 llama_model_loader: - kv 19: tokenizer.chat_template str = {% for message in messages %}{% if lo... llama_model_loader: - kv 20: general.quantization_version u32 = 2 llama_model_loader: - type f32: 141 tensors llama_model_loader: - type q4_0: 196 tensors llama_model_loader: - type q6_K: 1 tensors time=2024-08-24T20:19:08.250+02:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model" llm_load_vocab: special tokens cache size = 293 llm_load_vocab: token to piece cache size = 0.9338 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = qwen2 llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 151936 llm_load_print_meta: n_merges = 151387 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 32768 llm_load_print_meta: n_embd = 1536 llm_load_print_meta: n_layer = 28 llm_load_print_meta: n_head = 12 llm_load_print_meta: n_head_kv = 2 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 6 llm_load_print_meta: n_embd_k_gqa = 256 llm_load_print_meta: n_embd_v_gqa = 256 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 8960 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 32768 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = ?B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 1.54 B llm_load_print_meta: model size = 
885.97 MiB (4.81 BPW) llm_load_print_meta: general.name = Qwen2-1.5B-Instruct llm_load_print_meta: BOS token = 151643 '<|endoftext|>' llm_load_print_meta: EOS token = 151645 '<|im_end|>' llm_load_print_meta: PAD token = 151643 '<|endoftext|>' llm_load_print_meta: LF token = 148848 'ÄĬ' llm_load_print_meta: EOT token = 151645 '<|im_end|>' llm_load_print_meta: max token length = 256 llm_load_tensors: ggml ctx size = 0.30 MiB ggml_backend_metal_log_allocated_size: allocated buffer, size = 885.97 MiB, ( 886.03 / 10922.67) llm_load_tensors: offloading 28 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 29/29 layers to GPU llm_load_tensors: CPU buffer size = 182.57 MiB llm_load_tensors: Metal buffer size = 885.97 MiB llama_new_context_with_model: n_ctx = 8192 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 1000000.0 llama_new_context_with_model: freq_scale = 1 ggml_metal_init: allocating ggml_metal_init: found device: Apple M1 Pro ggml_metal_init: picking default device: Apple M1 Pro ggml_metal_init: using embedded metal library ggml_metal_init: GPU name: Apple M1 Pro ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007) ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003) ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001) ggml_metal_init: simdgroup reduction support = true ggml_metal_init: simdgroup matrix mul. support = true ggml_metal_init: hasUnifiedMemory = true ggml_metal_init: recommendedMaxWorkingSetSize = 11453.25 MB llama_kv_cache_init: Metal KV buffer size = 224.00 MiB llama_new_context_with_model: KV self size = 224.00 MiB, K (f16): 112.00 MiB, V (f16): 112.00 MiB llama_new_context_with_model: CPU output buffer size = 2.34 MiB llama_new_context_with_model: Metal compute buffer size = 299.75 MiB llama_new_context_with_model: CPU compute buffer size = 19.01 MiB llama_new_context_with_model: graph nodes = 986 llama_new_context_with_model: graph splits = 2 time=2024-08-24T20:19:08.501+02:00 level=DEBUG source=server.go:638 msg="model load progress 1.00" DEBUG [initialize] initializing slots | n_slots=4 tid="0x1e9306940" timestamp=1724523548 DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=0 tid="0x1e9306940" timestamp=1724523548 DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=1 tid="0x1e9306940" timestamp=1724523548 DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=2 tid="0x1e9306940" timestamp=1724523548 DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=3 tid="0x1e9306940" timestamp=1724523548 INFO [main] model loaded | tid="0x1e9306940" timestamp=1724523548 DEBUG [update_slots] all slots are idle and system prompt is empty, clear the KV cache | tid="0x1e9306940" timestamp=1724523548 DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=0 tid="0x1e9306940" timestamp=1724523548 time=2024-08-24T20:19:08.755+02:00 level=INFO source=server.go:632 msg="llama runner started in 1.26 seconds" time=2024-08-24T20:19:08.755+02:00 level=DEBUG source=sched.go:458 msg="finished setting up runner" model=/path-to-ollama/.ollama/models/blobs/sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=1 tid="0x1e9306940" timestamp=1724523548 DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=54639 status=200 
tid="0x16ba43000" timestamp=1724523548 DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=2 tid="0x1e9306940" timestamp=1724523548 DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=54640 status=200 tid="0x16bacf000" timestamp=1724523548 time=2024-08-24T20:19:08.777+02:00 level=DEBUG source=routes.go:1363 msg="chat request" images=0 prompt="<|im_start|>user\n\"create sublime text plugin that takes selected text and convert it by applying base64encoding and replacing selected text with the conversion\"<|im_end|>\n<|im_start|>assistant\nCreating a Sublime Text plugin to perform Base64 encoding on selected text and replace it with converted data is a complex task as you are asking for more than one operation. However, I can provide an outline of how such a feature might be implemented in Sublime Text.\n\nHere's the step-by-step guide to creating such a plugin:\n\n1. **Define Keybinds:** First, you need to define the key bindings that trigger the conversion when the user selects text and presses a specific key.\n\n2. **Create the Plugin:** Create a new Sublime Text plugin file (like `sublime_text_plugin.py`). This file should include the necessary functions for handling command execution, event listeners, etc.\n\n3. **Implement Conversion Function:** In this function, you need to convert the selected text using Base64 encoding. You can use libraries like `base64` in Python to do this.\n\n4. **Insert or Replace Selected Text:** Once the conversion is complete, you need to either insert the converted text into the user's selection or replace it if the user previously typed something there.\n5. **Check for Keybinds to Continue:** If the user presses a key to continue, check whether `execute_command` has been called and if not, call it with the correct parameters.\n\n6. **Event Listening:** Add event listeners in Sublime Text itself so that when changes are made to the selected text (e.g., typ ed characters), they can trigger the conversion.\n\n7. **Error Handling:** Include error handling for situations where the Base64 encoding process fails or if something else goes wrong during the execution of the command.\n8. **Testing:** Ensure your plugin works as expected by testing it with different scenarios and edge cases, such as when there's no text selected in Sublime Text.\n\nPlease note that this is a high-level overview of creating a Sublime Text plugin. The specifics will depend on the programming language you're using for the plugin (in this case, Python), and how you choose to implement the features described above. 
For full documentation, follow your chosen platform's official documentation or look up examples online.\n\nRemember that creating plugins like these can be a significant commitment, especially if they are complex and need thorough testing before release.<|im_end|>\n<|im_start|>user\n\"create sublime text plugin that takes selected text and convert it by applying base64encoding and replacing selected text with the conversion\"<|im_end|>\n<|im_start|>assistant\n" DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=3 tid="0x1e9306940" timestamp=1724523548 DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=4 tid="0x1e9306940" timestamp=1724523548 DEBUG [update_slots] slot progression | ga_i=0 n_past=0 n_past_se=0 n_prompt_tokens_processed=523 slot_id=0 task_id=4 tid="0x1e9306940" timestamp=1724523548 DEBUG [update_slots] kv cache rm [p0, end) | p0=0 slot_id=0 task_id=4 tid="0x1e9306940" timestamp=1724523548 DEBUG [print_timings] prompt eval time = 473.73 ms / 523 tokens ( 0.91 ms per token, 1104.00 tokens per second) | n_prompt_tokens_processed=523 n_tokens_second=1103.9997298050373 slot_id=0 t_prompt_processing=473.732 t_token=0.9057973231357553 task_id=4 tid="0x1e9306940" timestamp=1724523553 DEBUG [print_timings] generation eval time = 4179.91 ms / 322 runs ( 12.98 ms per token, 77.04 tokens per second) | n_decoded=322 n_tokens_second=77.0351330446988 slot_id=0 t_token=12.981090062111802 t_token_generation=4179.911 task_id=4 tid="0x1e9306940" timestamp=1724523553 DEBUG [print_timings] total time = 4653.64 ms | slot_id=0 t_prompt_processing=473.732 t_token_generation=4179.911 t_total=4653.643 task_id=4 tid="0x1e9306940" timestamp=1724523553 DEBUG [update_slots] slot released | n_cache_tokens=845 n_ctx=8192 n_past=844 n_system_tokens=0 slot_id=0 task_id=4 tid="0x1e9306940" timestamp=1724523553 truncated=false DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=54640 status=200 tid="0x16bacf000" timestamp=1724523553 [GIN] 2024/08/24 - 20:19:13 | 200 | 5.980737958s | 127.0.0.1 | POST "/api/chat" time=2024-08-24T20:19:13.432+02:00 level=DEBUG source=sched.go:462 msg="context for request finished" time=2024-08-24T20:19:13.432+02:00 level=DEBUG source=sched.go:334 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/path-to-ollama/.ollama/models/blobs/sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e duration=5m0s time=2024-08-24T20:19:13.432+02:00 level=DEBUG source=sched.go:352 msg="after processing request finished event" modelPath=/path-to-ollama/.ollama/models/blobs/sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e refCount=0 time=2024-08-24T20:21:08.524+02:00 level=DEBUG source=sched.go:571 msg="evaluating already loaded" model=/path-to-ollama/.ollama/models/blobs/sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=329 tid="0x1e9306940" timestamp=1724523668 DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=330 tid="0x1e9306940" timestamp=1724523668 DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=54664 status=200 tid="0x16bb5b000" timestamp=1724523668 time=2024-08-24T20:21:08.527+02:00 level=DEBUG source=routes.go:1363 msg="chat request" images=0 prompt="<|im_start|>user\n\"create sublime text plugin that 
takes selected text and convert it by applying base64encoding and replacing selected text with the conversion\"<|im_end|>\n<|im_start|>assistant\n" DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=331 tid="0x1e9306940" timestamp=1724523668 DEBUG [prefix_slot] slot with common prefix found | 0=["slot_id",0,"characters",193] DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=332 tid="0x1e9306940" timestamp=1724523668 DEBUG [update_slots] slot progression | ga_i=0 n_past=34 n_past_se=0 n_prompt_tokens_processed=34 slot_id=0 task_id=332 tid="0x1e9306940" timestamp=1724523668 DEBUG [update_slots] we have to evaluate at least 1 token to generate logits | slot_id=0 task_id=332 tid="0x1e9306940" timestamp=1724523668 DEBUG [update_slots] kv cache rm [p0, end) | p0=33 slot_id=0 task_id=332 tid="0x1e9306940" timestamp=1724523668 DEBUG [print_timings] prompt eval time = 166.62 ms / 34 tokens ( 4.90 ms per token, 204.05 tokens per second) | n_prompt_tokens_processed=34 n_tokens_second=204.0546866560238 slot_id=0 t_prompt_processing=166.622 t_token=4.90064705882353 task_id=332 tid="0x1e9306940" timestamp=1724523677 DEBUG [print_timings] generation eval time = 8888.16 ms / 596 runs ( 14.91 ms per token, 67.06 tokens per second) | n_decoded=596 n_tokens_second=67.0554910065198 slot_id=0 t_token=14.913021812080537 t_token_generation=8888.161 task_id=332 tid="0x1e9306940" timestamp=1724523677 DEBUG [print_timings] total time = 9054.78 ms | slot_id=0 t_prompt_processing=166.622 t_token_generation=8888.161 t_total=9054.783 task_id=332 tid="0x1e9306940" timestamp=1724523677 DEBUG [update_slots] slot released | n_cache_tokens=630 n_ctx=8192 n_past=629 n_system_tokens=0 slot_id=0 task_id=332 tid="0x1e9306940" timestamp=1724523677 truncated=false DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=54664 status=200 tid="0x16bb5b000" timestamp=1724523677 [GIN] 2024/08/24 - 20:21:17 | 200 | 9.10044125s | 127.0.0.1 | POST "/v1/chat/completions" time=2024-08-24T20:21:17.584+02:00 level=DEBUG source=sched.go:403 msg="context for request finished" time=2024-08-24T20:21:17.584+02:00 level=DEBUG source=sched.go:334 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/path-to-ollama/.ollama/models/blobs/sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e duration=5m0s time=2024-08-24T20:21:17.584+02:00 level=DEBUG source=sched.go:352 msg="after processing request finished event" modelPath=/path-to-ollama/.ollama/models/blobs/sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e refCount=0
```

### OS

macOS

### GPU

Apple

### CPU

Apple

### Ollama version

ollama version is 0.3.6
GiteaMirror added the bug label 2026-04-12 14:59:30 -05:00

@rick-github commented on GitHub (Aug 24, 2024):

Not a sublime user, so I can't evaluate the quality of the answer. It would have been useful to see the fine answer from the `ollama run` attempt. Note, however, that the logs show that `ollama run` was a multi-turn conversation, so the quality of the output may have been improved by the context from the previous answer.

`ollama run` uses ollama's `/api/chat` API, while `curl` here uses the OpenAI API compatibility endpoint, `/v1/chat/completions`. The two endpoints have different default values for the `temperature` and `top_p` parameters. If you control for those and `seed`, both endpoints return the same answer. Note that ollama doubles the value of `temperature` when it is passed through the OpenAI-compatible endpoint, i.e. to get the `/api/chat` default of 0.8 you pass a `temperature` of 0.4 to `/v1/chat/completions`.
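A minimal sketch of that arithmetic (the function name is mine and purely illustrative; the actual conversion lives in ollama's `openai.go`):

```python
def effective_temperature(openai_request_temperature: float) -> float:
    """What ollama actually samples with when a request arrives via
    /v1/chat/completions, assuming the plain doubling described above."""
    return openai_request_temperature * 2.0

# 0.4 sent to /v1/chat/completions matches the /api/chat default of 0.8
assert effective_temperature(0.4) == 0.8
# the OpenAI-style default of 1.0 becomes 2.0, i.e. maximum randomness
assert effective_temperature(1.0) == 2.0
```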

OpenAI API response to the prompt:

```
curl -s localhost:11434/v1/chat/completions -d '{
  "model":"qwen2:1.5b",
  "seed":0,
  "temperature":0.4,
  "top_p":0.9,
  "messages":[{
    "role":"user",
    "content":"create sublime text plugin that takes selected text and convert it by applying base64encoding and replacing selected text with the conversion"
  }]
}' > result.openai
```

Ollama response to the prompt:

```
curl -s localhost:11434/api/chat -d '{
  "model":"qwen2:1.5b",
  "options":{                              
    "seed":0
  },              
  "stream":false,                                                                                                                                            
  "messages":[{
    "role":"user",
    "content":"create sublime text plugin that takes selected text and convert it by applying base64encoding and replacing selected text with the conversion"
  }]
}' > result.ollama
```

Comparing the output, we see that the `content` is the same:

```
$ sdiff -b <(jq -r . result.openai) <(jq -r . result.ollama)
{                                                               {
  "id": "chatcmpl-933",                                       <
  "object": "chat.completion",                                <
  "created": 1724539470,                                      <
  "model": "qwen2:1.5b",                                          "model": "qwen2:1.5b",
  "system_fingerprint": "fp_ollama",                          |   "created_at": "2024-08-24T22:43:27.102258036Z",
  "choices": [                                                <
    {                                                         <
      "index": 0,                                             <
      "message": {                                                "message": {
        "role": "assistant",                                        "role": "assistant",
        "content": "Here's an example of a Sublime Text plugi       "content": "Here's an example of a Sublime Text plugin th
      },                                                          },
      "finish_reason": "stop"                                 |   "done_reason": "stop",
    }                                                         |   "done": true,
  ],                                                          |   "total_duration": 1696781939,
  "usage": {                                                  |   "load_duration": 18291392,
    "prompt_tokens": 32,                                      |   "prompt_eval_count": 32,
    "completion_tokens": 280,                                 |   "prompt_eval_duration": 19763000,
    "total_tokens": 312                                       |   "eval_count": 280,
  }                                                           |   "eval_duration": 1616478000
}
```

Ditto for `ollama run`:

```
$ script -c 'ollama run qwen2:1.5b --nowordwrap'
Script started on 2024-08-25 00:51:43+02:00 [TERM="xterm-256color" TTY="/dev/pts/1" COLUMNS="211" LINES="41"]
>>> /set parameter seed 0
Set parameter 'seed' to '0'
>>> create sublime text plugin that takes selected text and convert it by applying base64encoding and replacing selected text with the conversion
Here's an example of a Sublime Text plugin that converts selected text to base64 encoding and replaces it with the converted value:
...
```

If we then compare the output of `ollama run` and the `curl` result, we see they are the same:

```
$ sdiff <(jq -r '.choices[0].message.content' result.openai) <(ansifilter typescript)
Here's an example of a Sublime Text plugin that converts sele | Script started on 2024-08-25 00:51:43+02:00 [TERM="xterm-256c
                                                              > >>> Send a message (/? for help)Send a message (/? for help)S
                                                              > Set parameter 'seed' to '0'
                                                              > >>> Send a message (/? for help)Send a message (/? for help)S
                                                              > ⠋ Here's an example of a Sublime Text plugin that converts se

```python                                                       ```python
// In your Sublime Text preferences, create a new folder call   // In your Sublime Text preferences, create a new folder call
// Alternatively, you can edit this file directly from the Su   // Alternatively, you can edit this file directly from the Su

package = require("sublime-package");                           package = require("sublime-package");

module.exports = {                                              module.exports = {
  init: function() {                                              init: function() {

    var plugin = {};                                                var plugin = {};
                                                                    
    plugin.exec = function(editor) {                                plugin.exec = function(editor) {
      editor.commands.executeCommand("repl.text.edit", "", nu         editor.commands.executeCommand("repl.text.edit", "", nu
      editor.commands.executeCommand("repl.text.replace", "",         editor.commands.executeCommand("repl.text.replace", "",
    };                                                              };

    return plugin;                                                  return plugin;
  }                                                               }
};                                                              };
```                                                             ```

This plugin defines a `init` function that runs when the plug   This plugin defines a `init` function that runs when the plug

To use this plugin, go to the "Preferences" > "Package Contro   To use this plugin, go to the "Preferences" > "Package Contro
                                                              >
                                                              > >>> Send a message (/? for help)Send a message (/? for help)S
                                                              >
                                                              > Script done on 2024-08-25 00:51:54+02:00 [COMMAND_EXIT_CODE="
```

The upshot is that controlling the `seed`, `temperature`, and `top_p` parameters will produce the same results from the different endpoints that ollama provides.
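For completeness, here is a sketch of the same controlled request from Python using the `openai` client pointed at the local ollama server (the `api_key` value is arbitrary for a local ollama instance, and `temperature=0.4` is the halved value discussed above; treat this as an illustration rather than ollama documentation):

```python
from openai import OpenAI

# ollama exposes its OpenAI-compatible API under /v1 on the default port
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="qwen2:1.5b",
    seed=0,
    temperature=0.4,  # effectively 0.8 after ollama's doubling
    top_p=0.9,
    messages=[{
        "role": "user",
        "content": "create sublime text plugin that takes selected text "
                   "and convert it by applying base64encoding and "
                   "replacing selected text with the conversion",
    }],
)
print(response.choices[0].message.content)
```

With `seed`, `temperature`, and `top_p` pinned like this, the reply should match the `/api/chat` and `ollama run` results shown above.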


@yaroslavyaroslav commented on GitHub (Aug 24, 2024):

By saying that it produces a mess, I meant it 😅 If you unfold the response I posted right before the full logs, you'll see raw Unicode escapes, Chinese characters, and other artifacts, which can hardly be considered an acceptable answer. It's not a matter of correctness; the output is simply broken.

I had also suspected it was somehow related to the model setup, but wasn't sure.

Thanks for highlighting the different default settings of those two APIs. Is there a way to see the full default config for both of them?


@rick-github commented on GitHub (Aug 25, 2024):

https://github.com/ollama/ollama/blob/69be940bf6d2816f61c79facfa336183bc882720/api/types.go#L585

https://github.com/ollama/ollama/blob/69be940bf6d2816f61c79facfa336183bc882720/openai/openai.go#L454


@yaroslavyaroslav commented on GitHub (Aug 25, 2024):

Thank you for the answer, it helped a lot. I was able to resolve the issue with the response: the default `temperature` in my plugin is 1, so it becomes 2 on ollama's side.

So may I ask the reason for doubling `temperature` on the `/v1/chat/completions` endpoint?


@rick-github commented on GitHub (Aug 25, 2024):

It's to try to match the different scales used for temperature between the different APIs. OpenAI uses a scale from [0 to 2](https://platform.openai.com/docs/api-reference/chat/create#chat-create-temperature) (minimum randomness to maximum randomness) for temperature, while ollama considers 1 to be maximum randomness. Given those scales, though, it should really be /2, not *2.

TBH, I don't think temperature is well defined. Looking at the source code for llama.cpp, there's no explanation of the possible range for temperature, other than a single comment that 1.5 is "more creative", implying that the range is not [0..1]. Even OpenAI's API documentation switches between [0..1] and [0..2] for different endpoints.
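In the meantime, a client that targets ollama's gateway can compensate for the current behaviour by halving the configured value before sending it, so the doubling lands back on the intended temperature. A minimal sketch (the helper name and the backend flag are hypothetical, purely illustrative):

```python
def request_temperature(configured: float, ollama_openai_gateway: bool) -> float:
    # ollama currently doubles temperature on its OpenAI-compatible endpoint,
    # so halve it there to preserve the value the user actually configured.
    return configured / 2.0 if ollama_openai_gateway else configured

# OpenAI-style default of 1.0 -> send 0.5, which ollama doubles back to 1.0
assert request_temperature(1.0, ollama_openai_gateway=True) == 0.5
```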


@yaroslavyaroslav commented on GitHub (Aug 25, 2024):

> TBH, I don't think temperature is well defined.

Yeah, this is my actual concern. My frontend for ollama, which uses the OpenAI endpoint, ships with a predefined config whose default values mirror OpenAI's own API defaults.

But this breaks in ollama's case, because the default `temperature=1` effectively becomes `temperature=2`, which makes the model go off the rails.

Although I've added a note to the FAQ on my side, I'm pretty sure more issues like this will be raised in the future, because this behaviour is too confusing to figure out on your own.

Hence my question: would you consider a PR that removes this temperature multiplication?

PS: llama.cpp has no such modification on its side, and worked reliably with OpenAI's default config values the last time I tested it.


@rick-github commented on GitHub (Aug 25, 2024):

You're more than welcome to submit a PR, but it's up to the ollama team (of which I am not a part) to merge it.

Reference: github-starred/ollama#4086