[GH-ISSUE #4680] JSON mode significantly decreases GPU usage #49456

Closed
opened 2026-04-28 11:52:47 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @LaetLanf on GitHub (May 28, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4680

What is the issue?

I am running Ollama Llama3:70b-instruct on an Azure Linux A100 VM.

I ran a test with and without JSON mode, using the exact same prompt and Python code. The only change was adding format='json' to the chat call.

WITHOUT JSON mode, I reached:
22-25 TPS for one chat call
The GPU monitoring (see attached) clearly shows that the GPU is well utilized

WITH JSON mode:
6 TPS for one chat call
The GPU monitoring (see attached) clearly shows that the GPU is NOT fully utilized

![Capture d’écran 2024-05-28 à 10 54 30](https://github.com/ollama/ollama/assets/131473617/0419ae0d-96d1-4ac0-8359-bfd4abc2bb41)
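For reference, a minimal sketch of the A/B test described above. It assumes the official `ollama` Python package and a local server with the model already pulled; the `tokens_per_second` helper is mine, not part of the ollama API, and uses the `eval_count` and `eval_duration` (nanoseconds) fields that the chat response includes.

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation throughput: tokens generated divided by generation time in seconds."""
    return eval_count / (eval_duration_ns / 1e9)


if __name__ == "__main__":
    import ollama  # pip install ollama

    for fmt in (None, "json"):
        kwargs = {"format": fmt} if fmt else {}  # only difference between the two runs
        resp = ollama.chat(
            model="llama3:70b-instruct",
            messages=[{"role": "user", "content": "List three fruits."}],
            **kwargs,
        )
        print(fmt or "plain", tokens_per_second(resp["eval_count"], resp["eval_duration"]))
```

With this, the reported numbers correspond to roughly 22-25 tokens/s without `format='json'` and about 6 tokens/s with it on the same prompt.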

OS

Linux

GPU

Nvidia

CPU

Other

Ollama version

0.1.38

GiteaMirror added the bug label 2026-04-28 11:52:48 -05:00
Author
Owner

@pdevine commented on GitHub (May 28, 2024):

Dupe of #3851

cc @jmorganca

<!-- gh-comment-id:2136070638 -->

Reference: github-starred/ollama#49456