Use the current fragment offset when emitting unmatched spans during multi-regex BPE splitting. This avoids duplicating earlier prompt text and inflating token counts for multi-stage BPE tokenizers.
* tokenizer: add byte fallback for SentencePiece BPE encoding
When BPE merging produces tokens not in the vocabulary, fall back to
encoding each UTF-8 byte as <0xHH> byte tokens instead of silently
dropping the character. Also teach Decode to convert <0xHH> tokens
back to raw bytes.
Fixes#15229, fixes#15231
* tokenizer fixes
* tokenizer: add SentencePiece-style BPE support
Add WithSentencePieceNormalizer option to BytePairEncoding for models
that use BPE with SentencePiece-style space markers (space to/from
U+2581).
NewBytePairEncoding is unchanged; the new NewBytePairEncodingWithOptions
constructor accepts BPEOption functions. Decoding handles the reverse
mapping of U+2581 back to spaces.
* review comments