[GH-ISSUE #15619] Feature: Multi-model tier flags for `ollama launch claude` (`--large`, `--medium`, `--small`) #56478

Open
opened 2026-04-29 10:52:57 -05:00 by GiteaMirror · 4 comments

Originally created by @jyxtn on GitHub (Apr 16, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15619

Originally assigned to: @ParthSareen on GitHub.

# Feature: Multi-model tier flags for `ollama launch claude` (`--large`, `--medium`, `--small`)

## Problem

When using `ollama launch claude --model <model>`, all four of Claude Code's model tiers (opus, sonnet, haiku, subagent) are mapped to the same model:

```go
func (c *Claude) modelEnvVars(model string) []string {
    env := []string{
        "ANTHROPIC_DEFAULT_OPUS_MODEL=" + model,
        "ANTHROPIC_DEFAULT_SONNET_MODEL=" + model,
        "ANTHROPIC_DEFAULT_HAIKU_MODEL=" + model,
        "CLAUDE_CODE_SUBAGENT_MODEL=" + model,
    }
    // ... remainder of the function elided in the original issue ...
    return env
}
```

This means the main conversation, subagents, and the built-in Explore agent all hit the same model. For users with multiple models available (e.g., a large model for primary reasoning, a medium model for daily work, a small model for quick tasks), there's no way to take advantage of that tiering through `ollama launch`.

Standard Claude Code uses three tiers by default (Opus/Sonnet/Haiku), and each serves a distinct purpose: the primary conversation uses the most capable model, subagents use a balanced model, and the Explore agent uses a small/fast model for search tasks. The current `ollama launch` behavior loses this architectural benefit.

## Proposed Solution

Add `--large`, `--medium`, and `--small` flags to `ollama launch claude` that map to the corresponding Claude Code model tiers:

```bash
# Today: one model for everything
ollama launch claude --model qwen3:32b

# Proposed: different models for different tiers
ollama launch claude --large qwen3:32b --medium qwen3:14b --small qwen3:4b
```

Flag mapping:

| Flag | Claude Code tier | Env var set |
|------|------------------|-------------|
| `--large` | Opus (primary conversation) | `ANTHROPIC_DEFAULT_OPUS_MODEL` |
| `--medium` | Sonnet (subagents, daily work) | `ANTHROPIC_DEFAULT_SONNET_MODEL`, `CLAUDE_CODE_SUBAGENT_MODEL` |
| `--small` | Haiku (Explore agent, background) | `ANTHROPIC_DEFAULT_HAIKU_MODEL` |

Backward compatibility: `--model` continues to work as today (sets all tiers to one model). If `--large` is specified without `--medium` or `--small`, the unspecified tiers fall back to `--large`'s value (or to `--model`, if provided). This avoids requiring all three flags every time.
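To make the fallback rule concrete, here is a minimal sketch of how resolution could work; `tierFlags` and `resolveTiers` are hypothetical names for illustration, not existing Ollama code:

```go
package main

import "fmt"

// tierFlags holds the proposed CLI flag values; an empty string means
// the flag was not passed. Type and field names are hypothetical.
type tierFlags struct {
	Model, Large, Medium, Small string
}

// resolveTiers applies the proposed fallback rule: --large falls back
// to --model, and each unspecified tier falls back to the resolved --large.
func resolveTiers(f tierFlags) (large, medium, small string) {
	large = f.Large
	if large == "" {
		large = f.Model
	}
	medium, small = f.Medium, f.Small
	if medium == "" {
		medium = large
	}
	if small == "" {
		small = large
	}
	return large, medium, small
}

func main() {
	// --large qwen3:32b with --medium and --small left unset.
	l, m, s := resolveTiers(tierFlags{Large: "qwen3:32b"})
	fmt.Println(l, m, s) // qwen3:32b qwen3:32b qwen3:32b
}
```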

## Why this matters

1. **Cost/performance optimization.** Users with limited compute (especially local GPU users) benefit from routing lightweight tasks (Explore agent searches, subagent file reads) to a small model while reserving their largest model for the primary conversation.

2. **Cloud model users.** Users of Ollama Cloud with access to multiple model sizes can match each tier to the right model without manual env var configuration.

3. **Parity with standard Claude Code.** The three-tier model is how Claude Code is designed to work. `ollama launch` should make it easy to use this architecture, not flatten it.

## Prior art

PR #14099 introduced the tier env var overrides and an alias system. The alias system was later removed in PR #14810, with @drifkin noting: *"I could imagine bringing it back in the future, but I'd want to think through the semantics a bit more."* This proposal offers clear semantics: three CLI flags with explicit tier mapping and a simple fallback rule.

Happy to take a stab at the implementation if this aligns with the team's thinking.


@ParthSareen commented on GitHub (Apr 17, 2026):

I think the simplest approach would be to just detect whether the env vars are already set and then not override those. There could be some behavior where the user has stale models set from some config, so it may not be as cut and dried. I'll try out a couple of options and see what is feasible.
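A minimal sketch of that detect-and-skip behavior, which also covers the logging suggestion below; `tierEnvVars` is a hypothetical helper, and `os.LookupEnv` reports whether a variable is already set in the environment:

```go
package main

import (
	"fmt"
	"os"
)

// tierEnvVars returns KEY=value pairs for the given model, skipping any
// variable the user has already set so existing config is not clobbered.
// Hypothetical helper, not existing Ollama code.
func tierEnvVars(model string, keys []string) []string {
	var env []string
	for _, key := range keys {
		if existing, ok := os.LookupEnv(key); ok {
			// Surface the skip so users can see their own config is active.
			fmt.Fprintf(os.Stderr, "keeping existing %s=%s\n", key, existing)
			continue
		}
		env = append(env, key+"="+model)
	}
	return env
}

func main() {
	keys := []string{
		"ANTHROPIC_DEFAULT_OPUS_MODEL",
		"ANTHROPIC_DEFAULT_SONNET_MODEL",
		"ANTHROPIC_DEFAULT_HAIKU_MODEL",
	}
	fmt.Println(tierEnvVars("qwen3:32b", keys))
}
```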


@jyxtn commented on GitHub (Apr 17, 2026):

The "don't override" approach makes sense as a baseline; it respects user config and prevents clobbering. I'd suggest also logging when it skips an override, so users can see that their existing config is active rather than wondering why the launch command didn't take effect.

To clarify where I was heading with the original proposal: the three tier flags are the main ask. The "don't override" behavior is how those flags should interact with existing user config.

The current launch command forces all three (actually four, counting `CLAUDE_CODE_SUBAGENT_MODEL`) model tiers to the same model. For anyone mixing providers, e.g. running a local model for the primary conversation but routing fast tiers to an API, the only way to configure this today is to know which env vars to set manually. It's not hard, just another step to replicate the usual tiered-model Claude Code behavior.

The UX:

`ollama launch claude --large MODEL --medium MODEL --small MODEL`

Three flags, mapping to the three model tiers that Claude Code already routes between:

- `--large` → `ANTHROPIC_DEFAULT_OPUS_MODEL`
- `--medium` → `ANTHROPIC_DEFAULT_SONNET_MODEL`
- `--small` → `ANTHROPIC_DEFAULT_HAIKU_MODEL`

Unset flags fall back to `--model` or to the current behavior. This preserves Claude Code's built-in per-agent-type routing (Explore agents use the small model, Plan and Verification agents inherit the large model, and so on) rather than overriding it.

I'm deliberately not including a flag for `CLAUDE_CODE_SUBAGENT_MODEL`. That's a global override that bypasses all per-agent-type routing. Users who need it can set the env var manually, but it shouldn't be a first-class flag in the launch command.

One thing worth flagging: the launch command sets these env vars process-locally on the spawned `claude` child, not in the parent shell. Assuming that's the current behavior, the configuration is already ephemeral by default. A user could run Ollama-backed `ollama launch claude --large glm5.1 --medium qwen2.5 --small llama3.2` in one terminal and a native Anthropic-subscription `claude` in another, simultaneously, with no conflict or restore step.
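For reference, process-local env vars in Go follow this pattern; a sketch of the mechanism, not the actual launch code:

```go
package main

import (
	"os"
	"os/exec"
)

func main() {
	cmd := exec.Command("claude")
	// cmd.Env applies only to the spawned child; the parent shell's
	// environment is untouched, so other terminals are unaffected.
	cmd.Env = append(os.Environ(),
		"ANTHROPIC_DEFAULT_OPUS_MODEL=glm5.1",
		"ANTHROPIC_DEFAULT_SONNET_MODEL=qwen2.5",
		"ANTHROPIC_DEFAULT_HAIKU_MODEL=llama3.2",
	)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		os.Exit(1)
	}
}
```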


@lawcontinue commented on GitHub (Apr 17, 2026):

Nice proposal. The three-tier model mapping aligns well with how we approach model routing in our local setup.

**1. Fallback semantics**

The proposed fallback (unspecified tiers fall back to `--large`) is sensible. One consideration: when running locally with limited RAM (e.g., 16 GB), loading three different models simultaneously may not be feasible. It might be worth documenting that `--small` tier models should ideally fit in memory alongside `--large` without causing OOM.

**2. Related: local model routing**

We built a [local router](https://github.com/lawcontinue/hippo) that dispatches to different models based on intent detection (coding → DeepSeek-R1 8B, tool-calling → Gemma3-Tools 12B). The tier approach here is complementary: routing by task type vs. routing by model capability tier.

**3. Suggestion: profile-based configuration**

Instead of CLI flags only, consider a config file approach (e.g., `~/.ollama/tiers.json`):

```json
{
  "tiers": {
    "large": "qwen3:32b",
    "medium": "qwen3:14b",
    "small": "qwen3:4b"
  },
  "profiles": {
    "coding": {"large": "deepseek-r1:32b", "medium": "deepseek-r1:8b"},
    "general": {"large": "qwen3:32b", "medium": "qwen3:14b", "small": "qwen3:4b"}
  }
}
```

This lets users define profiles once and switch with `ollama launch claude --profile coding`. The CLI flags (`--large`, `--medium`, `--small`) would override profile defaults, giving both convenience and flexibility.
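A sketch of that merge order under the suggested file layout; the `config` type, `mergeTiers`, and the `--profile` flag are all hypothetical here, not an existing feature:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

type config struct {
	Tiers    map[string]string            `json:"tiers"`
	Profiles map[string]map[string]string `json:"profiles"`
}

// mergeTiers resolves the final tier map: explicit CLI flags beat the
// selected profile, which beats the top-level defaults.
func mergeTiers(cfg config, profile string, flags map[string]string) map[string]string {
	out := map[string]string{}
	for k, v := range cfg.Tiers {
		out[k] = v
	}
	for k, v := range cfg.Profiles[profile] {
		out[k] = v
	}
	for k, v := range flags {
		if v != "" {
			out[k] = v
		}
	}
	return out
}

func main() {
	home, _ := os.UserHomeDir()
	data, err := os.ReadFile(filepath.Join(home, ".ollama", "tiers.json"))
	if err != nil {
		return
	}
	var cfg config
	if err := json.Unmarshal(data, &cfg); err != nil {
		return
	}
	// --profile coding with an explicit --small flag override.
	fmt.Println(mergeTiers(cfg, "coding", map[string]string{"small": "llama3.2:3b"}))
}
```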

**4. Memory awareness**

For local users, it would be helpful if `ollama launch` could warn when the combined VRAM requirements of all tiers exceed available memory. Even a simple check (`ollama show <model> --size` for each tier) would prevent confusing OOM crashes mid-session.
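As a rough approximation with the existing HTTP API, `GET /api/tags` reports each local model's on-disk size, a lower bound on the memory needed to keep it loaded. A sketch, leaving out VRAM detection and the warning threshold:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type tagsResponse struct {
	Models []struct {
		Name string `json:"name"`
		Size int64  `json:"size"`
	} `json:"models"`
}

// sumTierSizes adds up the on-disk sizes of the selected tier models as a
// rough lower bound on the memory needed to hold them all at once.
func sumTierSizes(tiers []string) (int64, error) {
	resp, err := http.Get("http://localhost:11434/api/tags")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	var tags tagsResponse
	if err := json.NewDecoder(resp.Body).Decode(&tags); err != nil {
		return 0, err
	}
	var total int64
	for _, m := range tags.Models {
		for _, t := range tiers {
			if m.Name == t {
				total += m.Size
			}
		}
	}
	return total, nil
}

func main() {
	total, err := sumTierSizes([]string{"qwen3:32b", "qwen3:14b", "qwen3:4b"})
	if err != nil {
		fmt.Println("size check failed:", err)
		return
	}
	fmt.Printf("combined tier size: %.1f GiB\n", float64(total)/(1<<30))
}
```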


@PureBlissAK commented on GitHub (Apr 18, 2026):

## 🤖 Automated Triage & Analysis Report

**Issue**: #15619
**Analyzed**: 2026-04-18T18:19:44.113181

### Analysis

- **Type**: unknown
- **Severity**: medium
- **Components**: unknown

### Implementation Plan

- **Effort**: medium
- **Steps**:

*This issue has been triaged and marked for implementation.*


Reference: github-starred/ollama#56478