ollama

mirror of https://github.com/ollama/ollama.git synced 2026-05-05 23:53:43 -05:00

Author	SHA1	Message	Date
Daniel Hiltgen	ec9b4e9e47	tokenizer: fix multi-regex BPE offset handling (#15844 ) Use the current fragment offset when emitting unmatched spans during multi-regex BPE splitting. This avoids duplicating earlier prompt text and inflating token counts for multi-stage BPE tokenizers.	2026-04-27 14:14:27 -07:00
Daniel Hiltgen	de9673ac3f	tokenizer: add byte fallback for SentencePiece BPE encoding (#15232 ) * tokenizer: add byte fallback for SentencePiece BPE encoding When BPE merging produces tokens not in the vocabulary, fall back to encoding each UTF-8 byte as <0xHH> byte tokens instead of silently dropping the character. Also teach Decode to convert <0xHH> tokens back to raw bytes. Fixes #15229, fixes #15231 * tokenizer fixes	2026-04-02 13:04:45 -07:00
Daniel Hiltgen	cb0033598e	tokenizer: add SentencePiece-style BPE support (#15162 ) * tokenizer: add SentencePiece-style BPE support Add WithSentencePieceNormalizer option to BytePairEncoding for models that use BPE with SentencePiece-style space markers (space to/from U+2581). NewBytePairEncoding is unchanged; the new NewBytePairEncodingWithOptions constructor accepts BPEOption functions. Decoding handles the reverse mapping of U+2581 back to spaces. * review comments	2026-03-31 17:00:36 -07:00
Michael Yang	f1373193dc	move tokenizers to separate package (#13825 )	2026-02-05 17:44:11 -08:00