[GH-ISSUE #3552] /api/generate gets hung that can be steadily reproduced #64230

Closed
opened 2026-05-03 16:41:04 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @peter-gz on GitHub (Apr 9, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3552

What is the issue?

I've been struggling with ollama hanging from time to time when running codellama:13b for code completion. A few issues have already reported ollama hanging, e.g. #1863, #1901, #2225, but they haven't been fixed.

Now I have a test case that steadily reproduces the issue: with my prompt, the hang occurs whenever num_predict is set to 160 or above (e.g. 200). I hope this helps the maintainers debug the problem.

I am running ollama 0.1.30 with a V100 GPU on Linux.

What did you expect to see?

No response

Steps to reproduce

Run this command:
curl -s localhost:11434/api/generate -d @test_hung.json

Put the following text into the test_hung.json file. I also hardcoded seed and temperature to make the run more reproducible.

{
  "model": "codellama:13b",
  "options": {
    "num_predict": 160,
    "stop": [
      "<END>",
      "<EOD>",
      "<EOT>"
    ],
    "seed":999,
    "temperature": 0.0
  },
  "prompt": "<PRE> # Language: Shell\n# Path: /Users/panwh24/work/code/xxxxx-xxxxx-xxxxx/gateway/monitor.sh\n# a script to keep sending health check requests to a local http server with curl\n# if the healtch check timed-out in 10 seconds, kill the process\n\n#!/bin/bash\n\nrequest_body='{\"model\":\"codellama:13b\",\"messages\":[{\"role\":\"system\",\"content\":\"You are a helpful assistant. You can help me by answering my questions. You can also ask me questions.\"},{\"role\":\"user\",\"content\":\"test\"}]}'\n\nwhile true; do\n    echo \"checking...\"\n    curl --max-time 10 localhost:11434/api/chat -d request_body > /dev/null\n    if [ $? != 0 ]; then\n        cpu_util=$(ps -p `pgrep ollama` -o pcpu | grep -v CPU)\n        echo \"server is down. cpu util is $cpu_util\"\n        # kill the process\n        # pkill -9 ollama\n        # ./start.sh &\n    fi\n    sleep 30\ndoneFlask==3.0.3\nFlask_Cors==4.0.0\nRequests==2.31.0\nulid_py==1.1.0\n# Language: Python\n# Path: /Users/panwh24/work/code/xxxxx-copilot-vscode/gateway/gateway.py\n# -----------------------------------------------------------------------------\n# An API gateway written in Python3.\n# It proxies chat/fim API request to Ollama server, and records all requests/response into log files and provide metrics for debugging purpose.\n# Command arguments include 1) port to listen on 2) target server's hostname:port\n#\n# Author: panwh24@xxxxx.com\n# -----------------------------------------------------------------------------\n\nimport sys\nimport json\nimport io\nfrom flask import Flask, request, Response, stream_with_context, g\nfrom flask_cors import CORS\nimport requests\nimport time\nimport logging\nfrom logging.handlers import TimedRotatingFileHandler\nimport ulid\n\napp = Flask(__name__)\nCORS(app)\n\n# -----------------------------------------------------------------------------\n# Set up access log file\naccess_log = logging.getLogger('werkzeug')\naccess_log.setLevel(logging.INFO)\naccess_log_file = 'access.log'\naccess_log_handler = TimedRotatingFileHandler('access.log', when='midnight')\n# access_log_formatter = logging.Formatter('%(asctime)s %(levelname)s: %(message)s [in %(pathname)s:%(lineno)d]')\n# access_log_handler.setFormatter(access_log_formatter)\naccess_log.addHandler(access_log_handler)\n\n# Set up gateway log files\nchat_log = logging.getLogger('chat')\nchat_log.setLevel(logging.INFO)\nchat_log_handler = TimedRotatingFileHandler('chat.log', when='midnight')\nchat_log_formatter = logging.Formatter('[%(asctime)s] [%(levelname)s] %(message)s')\nchat_log_handler.setFormatter(chat_log_formatter)\nchat_log.addHandler(chat_log_handler)\n\ngenerate_log = logging.getLogger('generate')\ngenerate_log.setLevel(logging.INFO)\ngenerate_log_handler = TimedRotatingFileHandler('generate.log', when='midnight')\ngenerate_log_formatter = logging.Formatter('[%(asctime)s] [%(levelname)s] %(message)s')\ngenerate_log_handler.setFormatter(generate_log_formatter)\ngenerate_log.addHandler(generate_log_handler)\n\n# -----------------------------------------------------------------------------\n\n@app.route('/', methods=['GET'])\ndef index():\n    response = requests.get('http://{}/'.format(target))\n    return Response(response=response.text, status=response.status_code)\n\n@app.route('/api/tags', methods=['GET'])\ndef tags():\n    response = requests.get('http://{}/api/tags'.format(target))\n    return Response(response=response.text, status=response.status_code)\n\n@app.route('/api/chat', methods=['POST'])\ndef chat_api():\n    # 
fields for logging\n    remote = f\"{request.remote_addr}:{request.environ['REMOTE_PORT']}\"\n    g.remote = remote\n    request_id = request.headers.get('X-Request-Id', ulid.new())\n    g.request_id = request_id\n    g.api = 'chat'\n    g.start_time = time.time()\n    g.data = []\n\n    chat_log.info('[{}] [{}] > Request size:{}\\n{}'.\n                  format(remote, request_id, request.content_length, request.data.decode('utf-8')))\n    response = requests.post('http://{}/api/chat'.format(target), json=request.json, headers={'Content-Type': 'application/json'}, stream=True)\n\n    def generate():\n        # read the response in chunks. when hits a newline char, yield\n        for line in response.iter_lines():\n            if line:\n                # record the content\n                try:\n                    linedata = json.loads(line)\n                except json.decoder.JSONDecodeError as e:\n                    print('decode error: ' + line)\n                    continue\n                g.data.append(linedata['message']['content'])\n                if linedata['done']:\n                    g.stats = line.decode()\n                # return in stream\n                yield line + b'\\n'\n\n        # end of stream\n    \n    return stream_with_context(generate())\n\n@app.route('/api/generate', methods=['POST'])\ndef generate_api():\n    # fields for logging\n    remote = f\"{request.remote_addr}:{request.environ['REMOTE_PORT']}\"\n    request_id = request.headers.get('X-Request-Id', ulid.new())\n    client_info = request.headers.get('X-Client-Info', '')\n\n    g.remote = remote\n    g.request_id = request_id\n    g.api = 'generate'\n    g.start_time = time.time()\n    g.data = []\n\n    generate_log.info('[{}] [{}] > Request size:{}\\n> Client info: {}\\n> Request:\\n{}'.\n                      format(remote, request_id, request.content_length, client_info, request.data.decode('utf-8')))\n    response = requests.post('http://{}/api/generate'.format(target), json=request.json, headers={'Content-Type': 'application/json'}, stream=True)\n\n    def generate():\n        # read the response in chunks. 
when hits a newline char, yield\n        for line in response.iter_lines():\n            if line:\n                # record the content\n                # print('>', time.time(), line.decode())\n                try:\n                    linedata = json.loads(line)\n                except json.decoder.JSONDecodeError as e:\n                    print('decode error: ' + line)\n                    continue\n                g.data.append(linedata['response'])\n\n                if linedata['done']:\n                    g.stats = line.decode()\n                # return in stream\n                yield line + b'\\n'\n\n        # end of stream\n        # print('>>> end of stream')\n        # end = time.time()\n        # result = buf.getvalue()\n        # buf.close()\n        # generate_log.info('[{}] [{}] > Response cost:{}ms, tokens:{}, size:{}, status:{}\\n> Result:\\n{}\\n> Stats: {}'.\n        #               format(remote, request_id, int((end - start)*1000), count, len(result), response.status_code, result, stats))\n    \n    # 打印请求的body\n    print('>>> body', request.data)\n     <SUF> \n\n    return stream_with_context(generate())\n\n@app.route('/api/complete', methods=['POST'])\ndef complete_api():\n    # fields for logging\n    remote = f\"{request.remote_addr}:{request.environ['REMOTE_PORT']}\"\n    request_id = request.headers.get('X-Request-Id', ulid.new())\n    client_info = request.headers.get('X-Client-Info', '')\n    \n    \n    return stream_with_context(generate())\n\n@app.teardown_request\ndef log_result(exception=None):\n    if not hasattr(g, 'request_id') or not hasattr(g, 'api') or not hasattr(g, 'data'):\n        return\n    \n    print('>>> end of request', g.request_id, \"api=\"+g.api, \"exception=\"+str(exception))\n\n    result = ''.join(g.data)\n    if g.api == 'chat':\n        logger = chat_log\n    elif g.api == 'generate':\n        logger = generate_log\n    else:\n        return\n    \n    start = g.start_time\n    end = time.time()\n    logger.info('[{}] [{}] > Response cost:{}ms, tokens:{}, size:{}\\n> Result:\\n{}\\n> Stats: {}'.\n                format(g.remote, g.request_id, int((end - start)*1000), len(g.data), len(result), result, g.get('stats', 'none')))\n\n\n# 上报提示成功\n@app.route(\"/prompt\", methods=['GET'])\ndef prompt():\n    # fields for logging\n    remote = f\"{request.remote_addr}:{request.environ['REMOTE_PORT']}\"\n    request_id = request.headers.get('X-Request-Id', ulid.new())\n    client_info = request.headers.get('X-Client-Info', '')\n\n    generate_log.info('[{}] [{}] inline completion prompted\\n> Client info: {}'.format(remote, request_id, client_info))\n    return 'ok'\n\n# 上报用户接受提示\n@app.route(\"/accept\", methods=['GET'])\ndef accept():\n    # fields for logging\n    remote = f\"{request.remote_addr}:{request.environ['REMOTE_PORT']}\"\n    request_id = request.headers.get('X-Request-Id', ulid.new())\n    client_info = request.headers.get('X-Client-Info', '')\n\n    generate_log.info('[{}] [{}] [{}] user accepted\\n> Client info: {}'.format(remote, request_id, client_info))\n    return 'ok'\n\n# -----------------------------------------------------------------------------\n\nif __name__ == '__main__':\n    if len(sys.argv) != 3:\n        print('Usage: python gateway.py <port> <target hostname:port>')\n        sys.exit()\n\n    port = int(sys.argv[1])\n    target = sys.argv[2]\n    app.run(host='0.0.0.0', port=port, debug=True)\n\n    # http://10.16.112.219:8001/\n\n    # TODO: add graceful shutdown with SIGTERM signal handler\n    # TODO: 
add metrics for request/response time, error rate, etc.\n    # TODO: add authentication and authorization for requests to /chat/fim\n    # TODO: add logging for all requests and responses, including error messages\n    # TODO: add support for multiple target servers with load balancing\n    # TODO: add support for request throttling\n    # TODO: add support for response caching\n    # TODO: add support for metrics and monitoring\n    # TODO: add support for tracing and debugging\n <MID>",
  "raw": true
}
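
For scripted runs, adding a client-side timeout to the repro request makes the hang easy to detect without watching the stream. Below is a minimal sketch, assuming the hang shows up as the stream stalling past a generous limit; the 120-second budget and the out.jsonl filename are arbitrary choices, not part of my original setup.

```
#!/bin/bash
# Send the repro request; treat a stalled stream as a hang after the time budget.
curl -s --max-time 120 localhost:11434/api/generate -d @test_hung.json -o out.jsonl
status=$?
if [ $status -eq 0 ]; then
    echo "request completed; last streamed chunk:"
    tail -n 1 out.jsonl
else
    echo "curl exited with $status (28 = timeout): ollama is likely hung"
fi
```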

The output stream gets stuck here, and I have to pkill -9 ollama to recover.
![image](https://github.com/ollama/ollama/assets/40975524/69ed618f-b807-4ee1-8251-67efc15db9fe)

When stuck, the CPU utilization of the ollama process is 100%, while GPU usage is 0%.
![image](https://github.com/ollama/ollama/assets/40975524/ddac14d3-6cc9-4ea3-be2c-3e1239979740)
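
To capture that symptom as text instead of screenshots, something like the following can poll the ollama process and the GPU while the request above is hung. This is only a rough sketch: the 5-second interval is arbitrary, and it assumes nvidia-smi and pgrep are available.

```
#!/bin/bash
# Poll ollama's CPU usage (ps) and GPU utilization (nvidia-smi) once per interval.
while true; do
    cpu_util=$(ps -p "$(pgrep -o ollama)" -o pcpu= 2>/dev/null)
    gpu_util=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits)
    echo "$(date '+%H:%M:%S') cpu=${cpu_util:-n/a}% gpu=${gpu_util}%"
    sleep 5
done
```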

Everything works fine if I change num_predict to 150 in the request.
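
Since the boundary sits somewhere between 150 and 160, a small sweep over num_predict can narrow down the exact threshold. A hedged sketch, assuming jq is available to rewrite the option in test_hung.json; the value range and the 120-second timeout are my guesses, not something I measured precisely.

```
#!/bin/bash
# Re-run the repro with increasing num_predict until a request times out.
# Note: after a hang, ollama itself may need pkill -9 before the next attempt.
for n in 150 152 154 156 158 160; do
    jq ".options.num_predict = $n" test_hung.json > /tmp/test_$n.json
    if curl -s --max-time 120 localhost:11434/api/generate -d @/tmp/test_$n.json -o /dev/null; then
        echo "num_predict=$n: completed"
    else
        echo "num_predict=$n: timed out / hung"
        break
    fi
done
```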

Are there any recent changes that introduced the issue?

No response

OS

Linux

Architecture

amd64

Platform

No response

Ollama version

0.1.30

GPU

Nvidia

GPU info

v100

CPU

No response

Other software

No response

GiteaMirror added the bug, needs more info labels 2026-05-03 16:41:04 -05:00
Author
Owner

@dhiltgen commented on GitHub (Oct 23, 2024):

Please give the new 0.4.0 RC a try and see how it behaves in your scenario. We've changed the way we cache, which should improve performance and reliability and may resolve this hang.

https://github.com/ollama/ollama/releases

Author
Owner

@dhiltgen commented on GitHub (Oct 30, 2024):

If you still see the hang on 0.4.0 let us know and we'll reopen the issue. Please share updated server logs if so.

Reference: github-starred/ollama#64230