[GH-ISSUE #14533] BERTimbau, Sabiá #9428

Closed
opened 2026-04-12 22:21:16 -05:00 by GiteaMirror · 1 comment

Originally created by @guicybercode on GitHub (Mar 1, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14533

[Feature Request] Add Brazilian Portuguese (PT-BR) native LLM models to the Ollama Library

Summary

The Ollama official library currently has zero natively trained Brazilian Portuguese models. With 260M+ Portuguese speakers worldwide (215M+ in Brazil alone) and a growing ecosystem of open-source PT-BR LLMs, adding these models would significantly expand Ollama's reach in Latin America's largest AI market.

The Problem

While models like Llama 3.2/3.3/4, Gemma, and Qwen include Portuguese as a "supported language," they are not optimized for Brazilian Portuguese. This leads to:

  • Poor tokenizer efficiency: Models using English-centric tokenizers (e.g., Llama 2's BPE) encode Portuguese text very inefficiently — often 2-3x more tokens than necessary for the same content (research from the Tucano paper; see the sketch at the end of this section).
  • Cultural blindspots: Mainstream models struggle with Brazilian-specific concepts (PIX payment system, CPF/CNPJ, ENEM, vestibular, regionalismos, gírias).
  • Lack of local compliance awareness: Brazilian enterprises need models that understand LGPD (Brazil's data protection law) and local regulatory frameworks.

Currently, the only PT-BR options on Ollama are community-uploaded models (e.g., splitpierre/bode-alpaca-pt-br, brunoconterato/Gemma-3-Gaia-PT-BR-4b-it, RecognaNLP/chatbode) — none are in the official ollama.com/library.
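
To make the tokenizer-efficiency point above concrete, here is a minimal sketch that counts tokens for the same PT-BR sentence with an English-centric BPE and with Tucano's native tokenizer. The repo ids ("gpt2" as an English-centric stand-in and "TucanoBR/Tucano-2b4") are illustrative assumptions, not an official benchmark.

```python
# Minimal sketch: compare token counts for one PT-BR sentence.
# Assumes the Hugging Face hub is reachable; "gpt2" stands in for an
# English-centric BPE and "TucanoBR/Tucano-2b4" for a PT-BR-native
# tokenizer (repo ids are illustrative).
from transformers import AutoTokenizer

text = (
    "O vestibular e o ENEM são as principais portas de entrada "
    "para as universidades brasileiras."
)

for repo in ("gpt2", "TucanoBR/Tucano-2b4"):
    tok = AutoTokenizer.from_pretrained(repo)
    n_tokens = len(tok.encode(text, add_special_tokens=False))
    print(f"{repo}: {n_tokens} tokens")
```

The exact ratio depends on which models are compared, but a large gap here is what the Tucano paper's efficiency claim refers to.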

Proposed Models for Inclusion

Tier 1 — Strong candidates (peer-reviewed, Apache 2.0, active development)

| Model | Params | License | HuggingFace | Notes |
|-------|--------|---------|-------------|-------|
| Tucano (160m, 630m, 1.1B, 2.4B) | 160M–2.4B | Apache 2.0 | TucanoBR | Natively pre-trained on GigaVerbo (200B tokens PT-BR). Published in Patterns (Cell Press). Includes instruct variants and ViTucano (vision). |
| TeenyTinyLlama | 160M, 460M | Apache 2.0 | nicholasKluge | First open-source tiny LLMs natively trained in PT-BR. Same team as Tucano. |
| GAIA | 4B (Gemma 3 base) | Open | Google DeepMind / CEIA-UFG | Google DeepMind + Universidade Federal de Goiás. Already piloted by Brazilian government (TCM-GO) and Unimed. |

Tier 2 — Established models (Llama-based fine-tunes)

| Model | Params | License | HuggingFace | Notes |
|-------|--------|---------|-------------|-------|
| Canarim (7B, 7B-Instruct) | 7B | Llama 2 Community | dominguesm | Extended training on 16B tokens from Portuguese Common Crawl. Vestibulaide variant for Brazilian entrance exams. |
| Cabrita | 3B | Apache 2.0 | 22h/cabrita-lora-v0-1 | Fine-tuned OpenLLaMA 3B with custom PT-BR tokenizer. |
| Bode | 7B, 13B | Llama 2 Community | recogna-nlp | LoRA fine-tune of Llama 2 on translated Alpaca dataset. |

Tier 3 — Notable but limited availability

| Model | Notes |
|-------|-------|
| Sabiá (7B, Sabiá-2, Sabiá-3) | By maritaca.ai (Brazilian AI startup). Sabiá-3 is commercial API only. Sabiá-7B is Llama-based and open. |
| Gervásio (7B) | European + Brazilian Portuguese variants. Llama 2 derivative. |
| Mula | Sparse MoE architecture, natively trained in PT-BR. Early stage. |

Why This Matters

  1. Market size: Brazil is the 7th most populous country in the world and the largest economy in Latin America. Portuguese is roughly the 6th most spoken language globally by native speakers.
  2. Growing AI ecosystem: Brazil's national AI plan (2024–2028) is actively investing in local AI development. The state of Goiás became the first Brazilian state to create an AI regulatory framework.
  3. Developer demand: The Open-PT-LLM-Leaderboard on HuggingFace has 100+ entries specializing in Portuguese, showing massive community interest.
  4. Edge/local deployment: Brazilian enterprises (banks, healthcare, government) increasingly need local LLM deployment for LGPD compliance — exactly Ollama's sweet spot.
  5. Tokenizer efficiency: Natively trained models like Tucano use optimized PT-BR tokenizers, meaning faster inference and lower memory usage for Portuguese text compared to multilingual models.

Suggested Starting Point

The Tucano series would be the ideal first addition:

  • Apache 2.0 license
  • Peer-reviewed publication in Patterns (Cell Press, July 2025)
  • Multiple sizes (160M to 2.4B) — perfect for edge deployment
  • Instruct variants available (Tucano-1b1-Instruct, Tucano-2b4-Instruct)
  • Vision variant (ViTucano)
  • Full open-source training code + evaluation harness on GitHub
  • Custom Portuguese tokenizer with superior compression
  • Already outperforms Llama-3.2-1B on Portuguese benchmarks

References

  • Corrêa et al. (2025). "Tucano: Advancing Neural Text Generation for Portuguese." Patterns, Cell Press. arXiv:2411.07854
  • Corrêa et al. (2024). "TeenyTinyLlama: open-source tiny language models trained in Brazilian Portuguese." SoftwareX. DOI:10.1016/j.softx.2024.101724
  • Assis et al. (2025). "Exploring Brazil's LLM Fauna." Journal of the Brazilian Computer Society, 31(1). DOI:10.5753/jbcs.2025.5814
  • GAIA announcement: Google DeepMind + CEIA-UFG collaboration (June 2025)

Labels: feature, models, multilingual

I'm happy to help with GGUF conversions, Modelfile creation, or testing if the team is interested in moving forward with any of these models.
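
For concreteness, and assuming the weights are first converted to GGUF (for example via llama.cpp's convert_hf_to_gguf.py), a minimal Modelfile sketch for a Tucano instruct build could look like the following. The GGUF path is a placeholder and the ChatML-style template is an assumption; the authoritative chat template should be taken from the model's tokenizer_config.json before anything is published.

```
# Minimal Modelfile sketch (not an official recipe).
# The GGUF path is a placeholder, and the ChatML-style template is an
# assumption; verify the real chat template on the model card first.
FROM ./tucano-2b4-instruct-f16.gguf

TEMPLATE """<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

PARAMETER stop "<|im_end|>"
PARAMETER temperature 0.7

SYSTEM """Você é um assistente que responde em português do Brasil."""
```

It could then be built and tried locally with `ollama create tucano-2b4-instruct -f Modelfile` followed by `ollama run tucano-2b4-instruct`.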


@rick-github commented on GitHub (Mar 1, 2026):

The Tucano series is based on the Llama 2 architecture and should in theory be easy to import. However, both the Ollama and llama.cpp import paths produce models that generate poor output.

ollama import:

$ ollama run frob/tucano:2b4-fp16 hello
h[UNK_BYTE_0xe29681▁você]voc[UNK_BYTE_0xe29681▁me]mesA[UNK_BYTE_0xe29681▁Você]Voc
[UNK_BYTE_0xe29681▁o]o[UNK_BYTE_0xe29681▁o]o[UNK_BYTE_0xe29681▁eu]eu
[UNK_BYTE_0xe29681▁Eu]Eu.[UNK_BYTE_0xe29681▁Este]Esteart[h[UNK_BYTE_0xe29681▁você]voc
[UNK_BYTE_0xe29681▁me]mesA[UNK_BYTE_0xe29681▁Você]Voc[UNK_BYTE_0xe29681▁o]o
[UNK_BYTE_0xe29681▁o]

This looks like unmapped codepoints in the tokenizer.

llama.cpp import:

$ ollama run frob/tucano:2b4-fp16 hello
h "helsin O aada euEu Eu eu A - você "Eu gostaria A é a é para O.: Resposta O que O artigoUm
o'Reseu: EuVocêQ "Oeremos</>:<pad> tem:</Resposta: "A: Analcreva Você é
</No.Resposta:"Res: Escreva um teste: CenaEu estava planejTipo: "Oi, eu eu tenho "Eu quero
você pode me encontre esta pergunta: Quem sãouitasarei:O que pode produzir ser um bom
amigo de um amigo? Por favor, responda à seguinte pergunta: "O que os alunos fazem em
uma sala? Uma pessoa quer fazer um bolo de chocolate para sua festa de aniversário Para
cada 2 xícaras de chocolate em um bolo, quanto açúcar deve ser adicionado?
</instruction: Você está preparando uma festa de aniversário e deseja fazer um bolo de
chocolate com 2 xícaras de chocolate e 1/2 xícara de açúcar, mas não quer usar 2 xícaras
de chocolate. Em seguida, você adiciona mais açúcar do que o necessário para a receita.
Quantas colheres de açúcar você precisará adicionar?

I'm not familiar with Portuguese but the existence of </ sequences leads me to believe
that this is not coherent.
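
A quick way to narrow down where the mangling happens would be to round-trip accented PT-BR text through the upstream Hugging Face tokenizer; if that round-trip is clean, the problem is more likely in the GGUF conversion than in the original tokenizer. A minimal sketch, assuming the public TucanoBR/Tucano-2b4 repo:

```python
# Minimal sketch: round-trip accented PT-BR text through the upstream
# tokenizer. A clean round-trip would point at the GGUF conversion
# (vocab/byte mapping) rather than the original tokenizer.
# The repo id "TucanoBR/Tucano-2b4" is assumed from the issue text.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("TucanoBR/Tucano-2b4")
text = "Olá, você poderia me explicar o que é o PIX?"

ids = tok.encode(text, add_special_tokens=False)
decoded = tok.decode(ids)

print(ids)
print(decoded)
print("round-trip ok:", decoded == text)
```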

Reference: github-starred/ollama#9428