[GH-ISSUE #728] Dummy model for API testing #337

Closed

Originally created by @S1M0N38 on GitHub (Oct 7, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/728

Hi, I'm trying to develop a piece of software that interacts with the Ollama API by querying `api/generate`. I'd like to test my HTTP requests without having to load a real model in memory. It would be fine for me to send a request like

```bash
curl -X POST http://localhost:11434/api/generate -d '{
  "model": "dummy-model",
  "prompt": "Why is the sky blue?"
}'
```

and get back

{"model": "dummy-model", "created_at": "...", "response": "Why", "done": false}
{"model": "dummy-model", "created_at": "...", "response": "is", "done": false}
{"model": "dummy-model", "created_at": "...", "response": "the", "done": false}
{"model": "dummy-model", "created_at": "...", "response": "sky", "done": false}
{"model": "dummy-model", "created_at": "...", "response": "blue", "done": false}
{"model": "dummy-model", "created_at": "...", "response": "?", "done": false}
{"model": "dummy-model", "created_at": "...", "done": true, ...}

So in this case the "model tokenizer" splits the prompt on whitespace and the "predicted word" is just the input word. This way, calls to the Ollama endpoint could be tested without loading a full LLM into memory (which helps with speed, low-spec systems, and CI). Of course, the "hacking option" is to re-implement the Ollama API in a simple HTTP server that mimics it, but that could be error-prone and would need to be constantly updated to track the most recent version of the API.

Is there a way to define such a "dummy-model"? Or do you have any other suggestions for testing external code that queries the Ollama API?

Right now I'm using [this gist](https://gist.github.com/S1M0N38/f861ca42e2899b198168e2724fadc1d8) just to test calls to `/api/generate`.
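
For illustration, here is a minimal sketch of such a stub (hypothetical, not part of Ollama): a tiny Flask server that fakes `/api/generate` by echoing the whitespace-split prompt tokens back as newline-delimited JSON.

```python
# dummy_server.py — hypothetical stand-in for the requested "dummy-model".
# Requires Flask (pip install flask); the port matches Ollama's default.
import json

from flask import Flask, Response, request

app = Flask(__name__)

@app.post('/api/generate')
def generate():
  body = request.get_json()

  def chunks():
    # "Tokenize" on whitespace; the "predicted word" is just the input word.
    for word in body['prompt'].split():
      yield json.dumps({'model': body['model'], 'response': word, 'done': False}) + '\n'
    yield json.dumps({'model': body['model'], 'done': True}) + '\n'

  return Response(chunks(), mimetype='application/x-ndjson')

if __name__ == '__main__':
  app.run(port=11434)
```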


@TahaScripts commented on GitHub (Oct 7, 2023):

In pseudo-code terms, your goal should be to keep accepting responses from the LLM until you get a JSON object where `"done": true`, then join all of the response chunks into a single string.

In JavaScript with LangChain, a simple loop typically solves this:

```javascript
import { Ollama } from "langchain/llms/ollama";

const ollama = new Ollama({
    baseUrl: "http://localhost:11434", // default value
    model: "llama2", // default value
});

export async function inject(prompt) {
    const stream = await ollama.stream(prompt);
    const chunks = [];
    for await (const chunk of stream) {
        chunks.push(chunk);
    }
    return chunks.join("");
}

let response = await inject("Ask an existential and philosophical, yet scientific, question.");
console.log(response); // prints something like "Why is the sky blue?"
```
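
The same accumulate-until-done loop can also be written against the raw HTTP API. Here is a sketch in Python (the `requests` dependency and the default local server address are assumptions):

```python
import json

import requests

def generate(prompt, model='llama2'):
  # Stream /api/generate and accumulate chunks until "done" is true.
  r = requests.post(
    'http://localhost:11434/api/generate',
    json={'model': model, 'prompt': prompt},
    stream=True,
  )
  r.raise_for_status()

  chunks = []
  for line in r.iter_lines():
    part = json.loads(line)
    if part.get('done'):
      break
    chunks.append(part['response'])
  return ''.join(chunks)
```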

@ishaan-jaff commented on GitHub (Oct 8, 2023):

@S1M0N38 our litellm proxy allows you to pass mock_completion https://docs.litellm.ai/docs/proxy_server


@mxyng commented on GitHub (Oct 11, 2023):

The best solution is to create a mock or stub that returns expected results. This approach allows faster iteration since it doesn't need to call out to any external service.

Here's an example with pytest and pytest-httpserver:

```python
# test_ollama.py
import json
import requests

from werkzeug.wrappers import Response


def stream(host, port, model, prompt):
  # Stream /api/generate and yield one parsed JSON object per line.
  r = requests.post(
    f'http://{host}:{port}/api/generate',
    json={
      'model': model,
      'prompt': prompt,
    },
    stream=True,
  )
  r.raise_for_status()

  for line in r.iter_lines():
    yield json.loads(line)


def test_generate(httpserver):
  expected = ['a', 'b', 'c']

  def responses():
    # Mimic the newline-delimited JSON stream the real server produces.
    for a in ['a', 'b', 'c']:
      yield json.dumps({'response': a}) + '\n'

  def handler(request):
    return Response(responses())

  httpserver.expect_request('/api/generate').respond_with_handler(handler)

  actual = []
  for line in stream(httpserver.host, httpserver.port, 'llama2', 'hi'):
    actual.append(line['response'])

  assert actual == expected
```

`test_generate` uses the `httpserver` fixture from `pytest-httpserver`, sets up some mock stream outputs, then calls `stream`, which calls the Ollama `/api/generate` endpoint. Finally, it checks that the actual response matches the expectation. There are additional assertions you can add to check the input data sent to `/api/generate`.
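
As a sketch of that last point, `pytest-httpserver` can also match on the request body itself, so a wrong model or prompt fails the test (the test name and expected values below are illustrative):

```python
def test_generate_sends_expected_input(httpserver):
  httpserver.expect_request(
    '/api/generate',
    method='POST',
    json={'model': 'llama2', 'prompt': 'hi'},  # match the exact request body
  ).respond_with_data(json.dumps({'response': 'ok', 'done': True}) + '\n')

  actual = [line['response']
            for line in stream(httpserver.host, httpserver.port, 'llama2', 'hi')]
  assert actual == ['ok']
```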


@mxyng commented on GitHub (Oct 11, 2023):

FWIW, your stub server works just as well as the mock above.
