[GH-ISSUE #11282] Persist Custom System Prompt for a Model Instance (Like ChatGPT Custom Instructions) #7440

Closed
opened 2026-04-12 19:31:20 -05:00 by GiteaMirror · 4 comments

Originally created by @AlwaleedAlduies on GitHub (Jul 3, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11282

Hi Ollama team 👋,

I'm currently using Ollama to serve a local LLM (phi4, specifically), and I use it for a single, focused task in my application.

Right now, I have to prepend a long system prompt to every generation request in order to guide the model's behavior. This system prompt never changes, and the model is always used for the same purpose. Repeating it every time adds unnecessary token overhead and latency.

✅ Feature Request:
I'd love a way to persist a custom system prompt for a model instance — similar to ChatGPT's "Custom Instructions" — so that the model always behaves in a specific way without needing to resend the full prompt each time.

Something like:

```shell
ollama run phi4 --system "Always respond in JSON format without explanation..."
```

Or via Modelfile:

```
FROM phi4
SYSTEM "You are a task-specific assistant. Follow these formatting rules..."
```

Then, during inference, only the user message would be required in the payload.
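For illustration, a request to the existing /api/chat endpoint could then carry only the user turn. The model name phi4-custom and the user message below are placeholders, a sketch rather than a working setup:

```shell
# Sketch: the SYSTEM prompt is assumed to be baked into "phi4-custom",
# so the request body only needs the user message.
curl http://localhost:11434/api/chat -d '{
  "model": "phi4-custom",
  "messages": [
    { "role": "user", "content": "List all overdue invoices." }
  ],
  "stream": false
}'
```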
🙌 Benefits:

  • Reduces token usage per request
  • Improves performance and latency
  • Simplifies integration for dedicated or fixed-purpose bots

🔧 Current Workaround:
I currently prepend the system prompt manually for each call — which works — but built-in support for persistent instructions would be cleaner and more efficient.
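For comparison, the workaround roughly looks like this over /api/chat; the system prompt and user message are shortened and invented for the sketch:

```shell
# Sketch of the current workaround: the full system prompt is resent on every call.
curl http://localhost:11434/api/chat -d '{
  "model": "phi4",
  "messages": [
    { "role": "system", "content": "Always respond in JSON format without explanation..." },
    { "role": "user",   "content": "List all overdue invoices." }
  ],
  "stream": false
}'
```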

Is this something that’s planned or under consideration?

Thanks for building such a great tool — Ollama has been fantastic to work with!

Best regards,
Alwaleed Alduais

GiteaMirror added the feature request label 2026-04-12 19:31:20 -05:00

@cwallen commented on GitHub (Jul 3, 2025):

Making a custom Modelfile seems like it would work: https://github.com/ollama/ollama/blob/main/docs/modelfile.md


@rick-github commented on GitHub (Jul 3, 2025):

```console
$ echo FROM phi4 > Modelfile
$ echo SYSTEM Talk like a pirate >> Modelfile
$ ollama create phi4:pirate
$ ollama run phi4:pirate hello
Ahoy there! What be on yer mind this fine day, matey? Be ye in need of guidance through
treacherous waters or just wishin' to parley about the sea's mysteries? Spill the beans, and
let's chart a course together! ⚓🏴‍☠️
```

@AlwaleedAlduies commented on GitHub (Jul 3, 2025):

Thanks! I actually followed that approach and created a custom model based on phi4 using a tailored Modelfile.

🔗 Here’s the model I published on Ollama: https://ollama.com/Alwaleed98/walphi-sql

It works great for generating strict SQL Server queries — however, it didn’t improve response time compared to dynamically injecting the prompt each time. So while it's helpful for consistency and reusability, latency remains the same.

Still, a great way to encapsulate a specific persona/task!


@rick-github commented on GitHub (Jul 3, 2025):

The prompt will be cached so subsequent calls will have slightly lower latency. If you increase OLLAMA_NUM_PARALLEL that will add additional cache slots, so that prompt+schema will be cached for later use. However, the latency reduction from prompt caching will be dwarfed by the completion time.
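For reference, a minimal sketch of raising the slot count, assuming the server is started by hand (a systemd unit or the desktop app would set the variable in its own configuration instead):

```shell
# Assumption: ollama serve is launched manually from a shell.
# Each parallel slot keeps its own prompt cache, per the comment above.
OLLAMA_NUM_PARALLEL=4 ollama serve
```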

Reference: github-starred/ollama#7440