[GH-ISSUE #14186] Ollama embedding api does not truncate input, regardless of "truncate" option #35005

Open
opened 2026-04-22 19:07:07 -05:00 by GiteaMirror · 6 comments

Originally created by @salivian on GitHub (Feb 10, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14186

What is the issue?

Since 0.13.5 (the last version that works is 0.13.4), the Ollama embedding API always returns an error when the input exceeds the model context size.

The "truncate" parameter has no effect.

This contradicts the documentation at https://docs.ollama.com/api/embed, which states:

truncate boolean default:true

Relevant log output

```shell
v0.15.6

$ curl http://localhost:11434/api/embed --json '{
    "model": "all-minilm",
    "input": "A week ago a friend invited a couple of other couples over for dinner. Eventually, the food (but not the wine) was cleared off the table for what turned out to be some fierce Scrabbling. Heeding the strategy of going for the shorter, more valuable word over the longer cheaper word, our final play was “Bon,” which–as luck would have it!–happens to be a Japanese Buddhist festival, and not, as I had originally asserted while laying the tiles on the board, one half of a chocolate-covered cherry treat. Anyway, the strategy worked. My team only lost by 53 points instead of 58.\nJust the day before, our host had written of the challenges of writing short. In journalism–my friend’s chosen trade, and mostly my own, too–Mark Twain’s observation undoubtedly applies: “I didn’t have time to write a short letter, so I wrote a long one instead.” The principle holds across genres, in letters, reporting, and other writing. It’s harder to be concise than to blather. (Full disclosure, this blog post will clock in at a blather-esque 803 words.) Good writing is boiled down, not baked full of air like a soufflé. No matter how yummy soufflés may be. Which they are. Yummy like a Grisham novel.", "truncate":true}'
{"error":"the input length exceeds the context length"}
```

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.15.6

GiteaMirror added the bug label 2026-04-22 19:07:07 -05:00

@BurakBebek1 commented on GitHub (Feb 12, 2026):

I tested this on the latest main branch and it seems the issue is already resolved. Long inputs are now correctly truncated to the context limit (256 tokens in my test) instead of returning an error.


@salivian commented on GitHub (Feb 13, 2026):

I have just cloned the main branch, compiled it, pulled all-minilm fresh, and tested on a Mac.

It still returns

{"error":"the input length exceeds the context length"}

From the server-side messages:

time=2026-02-13T00:13:02.870-05:00 level=DEBUG source=ggml.go:300 msg="key with type not found" key=general.alignment default=32
time=2026-02-13T00:13:02.871-05:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=101
time=2026-02-13T00:13:02.871-05:00 level=DEBUG source=vocabulary.go:61 msg="adding eos token to prompt" id=102
time=2026-02-13T00:13:02.871-05:00 level=INFO source=server.go:1751 msg="llm embedding error: the input length exceeds the context length"
time=2026-02-13T00:13:02.871-05:00 level=DEBUG source=ggml.go:300 msg="key with type not found" key=bert.add_bos_token default=true
time=2026-02-13T00:13:02.871-05:00 level=DEBUG source=ggml.go:300 msg="key with type not found" key=bert.add_eos_token default=true
time=2026-02-13T00:13:02.871-05:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=101
time=2026-02-13T00:13:02.871-05:00 level=DEBUG source=vocabulary.go:61 msg="adding eos token to prompt" id=102
time=2026-02-13T00:13:02.872-05:00 level=INFO source=server.go:1751 msg="llm embedding error: the input length exceeds the context length"
[GIN] 2026/02/13 - 00:13:02 | 400 | 14.992083ms | 127.0.0.1 | POST "/api/embed"
time=2026-02-13T00:13:02.872-05:00 level=DEBUG source=sched.go:405 msg="context for request finished" runner.name=registry.ollama.ai/library/all-minilm:latest runner.inference="[{ID:0 Library:Metal}]" runner.size="71.1 MiB" runner.vram="71.1 MiB" runner.parallel=1 runner.pid=7026 runner.model=/Users/horace/.ollama/models/blobs/sha256-797b70c4edf85907fe0a49eb85811256f65fa0f7bf52166b147fd16be2be4662 runner.num_ctx=256
time=2026-02-13T00:13:02.872-05:00 level=DEBUG source=sched.go:310 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/all-minilm:latest runner.inference="[{ID:0 Library:Metal}]" runner.size="71.1 MiB" runner.vram="71.1 MiB" runner.parallel=1 runner.pid=7026 runner.model=/Users/horace/.ollama/models/blobs/sha256-797b70c4edf85907fe0a49eb85811256f65fa0f7bf52166b147fd16be2be4662 runner.num_ctx=256 duration=5m0s
time=2026-02-13T00:13:02.872-05:00 level=DEBUG source=sched.go:328 msg="after processing request finished event" runner.name=registry.ollama.ai/library/all-minilm:latest runner.inference="[{ID:0 Library:Metal}]" runner.size="71.1 MiB" runner.vram="71.1 MiB" runner.parallel=1 runner.pid=7026 runner.model=/Users/horace/.ollama/models/blobs/sha256-797b70c4edf85907fe0a49eb85811256f65fa0f7bf52166b147fd16be2be4662 runner.num_ctx=256 refCount=0


@salivian commented on GitHub (Feb 13, 2026):

I found that the detokenized truncated string at https://github.com/ollama/ollama/blob/main/server/routes.go#L760
can sometimes produce more tokens than ctxLen when re-tokenized (maybe the truncated string gets tokenized differently?). This leads to the error. Does that make sense at all?

In my test cases, the token count of the truncated string exceeded ctxLen and triggered the error.

I have added a loop to ensure the truncated string produces no more tokens than ctxLen, but is there a better way?

```go
for {
	truncatedTokens, err = r.Tokenize(ctx, truncated)
	if err != nil {
		return err
	}
	if len(truncatedTokens) <= ctxLen {
		break
	}
	// Re-tokenizing grew past the limit: cut back to ctxLen tokens
	// and detokenize again before the next check.
	truncatedTokens = truncatedTokens[:ctxLen]
	truncated, err = r.Detokenize(ctx, truncatedTokens)
	if err != nil {
		return err
	}
}
```
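
One caveat with the loop as written: if detokenizing the first ctxLen tokens yields a string that re-tokenizes to the exact same over-long token list (a fixed point), the loop could in principle cycle forever. A minimal sketch of one way to guarantee termination, assuming the same r.Tokenize/r.Detokenize helpers as above (this is an illustration, not a confirmed Ollama fix): shrink the cut point by one on every failing pass, so each retry makes strict progress.

```go
// Sketch only: assumes the same r.Tokenize/r.Detokenize helpers as above.
limit := ctxLen
for {
	truncatedTokens, err = r.Tokenize(ctx, truncated)
	if err != nil {
		return err
	}
	if len(truncatedTokens) <= ctxLen || limit == 1 {
		// Either the text now fits, or we have shrunk as far as
		// possible; in the degenerate single-token case, give up
		// rather than spin.
		break
	}
	limit-- // strict progress: each failing pass cuts one token shorter
	truncatedTokens = truncatedTokens[:limit]
	truncated, err = r.Detokenize(ctx, truncatedTokens)
	if err != nil {
		return err
	}
}
```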

@BurakBebek1 commented on GitHub (Feb 13, 2026):

UPDATE: I have to retract my previous 'resolved' comment. I performed a more rigorous stress test using multilingual input (Turkish, Japanese, and emojis) and it failed with the same error: {"error":"the input length exceeds the context length"}.

Test environment: Windows/WSL2, latest main branch.

Even with truncate: true, the re-tokenization of special characters/emojis seems to push the token count over the limit after the initial truncation. @salivian was right: the truncation logic is not strictly enforcing the limit for complex encodings.
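
For intuition only, here is a self-contained toy (not Ollama's actual tokenizer) showing one way a detokenize/re-tokenize round trip can exceed the limit again: special tokens such as BOS/EOS, which the vocabulary.go log lines above show being added, get re-added on every tokenize call.

```go
package main

import (
	"fmt"
	"strings"
)

// Toy whitespace tokenizer that, like the vocabulary.go log lines above,
// adds BOS/EOS markers on every call. Not Ollama's real tokenizer.
func tokenize(s string) []string {
	toks := append([]string{"[BOS]"}, strings.Fields(s)...)
	return append(toks, "[EOS]")
}

// detokenize drops the special markers and rejoins the words.
func detokenize(toks []string) string {
	var words []string
	for _, t := range toks {
		if t != "[BOS]" && t != "[EOS]" {
			words = append(words, t)
		}
	}
	return strings.Join(words, " ")
}

func main() {
	const ctxLen = 4
	input := "one two three four five"

	toks := tokenize(input)                // 7 tokens: [BOS] + 5 words + [EOS]
	truncated := detokenize(toks[:ctxLen]) // cut to ctxLen, back to a string

	// Re-tokenizing re-adds [BOS]/[EOS], so the count exceeds ctxLen again.
	fmt.Println(len(tokenize(truncated))) // prints 5, which is > 4
}
```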


@salivian commented on GitHub (Feb 13, 2026):

Thanks for taking care of this


@antoninbas commented on GitHub (Mar 5, 2026):

This is still happening.
Any chance of #14230 getting merged soon?

Reference: github-starred/ollama#35005