[GH-ISSUE #13495] Token generation speed #34658

Closed
opened 2026-04-22 18:24:15 -05:00 by GiteaMirror · 7 comments

Originally created by @Eb7CAPJi on GitHub (Dec 16, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/13495

Could you explain why the token generation speed changes from version to version for the same model and the same question? I keep a whole collection of these releases, and the difference is quite noticeable. Specifically, this version has, to put it mildly, a low token generation speed compared to the previous one.
ollama version is 0.13.4

GiteaMirror added the feature request and needs more info labels 2026-04-22 18:24:15 -05:00

@rick-github commented on GitHub (Dec 16, 2025):

Data?


@pdevine commented on GitHub (Dec 16, 2025):

There is a `bench` command in `cmd/bench` which you can run to get token generation speeds. We do have integration tests that check performance before we release.

Is there a particular model that you're seeing a slowdown with, and can you describe the system you're using?
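
For reference, a quick way to capture comparable numbers is the `--verbose` flag on `ollama run`, which prints timing statistics (including an eval rate in tokens/s) after each response; the model name below is just a placeholder. The in-tree benchmark can presumably be run from a source checkout with `go run`, though its exact flags may vary by version:

```shell
# Print timing stats (load duration, prompt eval rate, eval rate)
# after the response. "llama3" is a placeholder model name.
ollama run llama3 --verbose "Describe yourself"

# From a source checkout, check the benchmark's help output for the
# flags your version supports before running it.
go run ./cmd/bench --help
```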


@Eb7CAPJi commented on GitHub (Dec 17, 2025):

I installed a new version of Ollama alongside the old one. When I run a simple prompt, “Describe yourself,” the new version generates tokens slowly, whereas the old version is so fast that there isn't enough time to read the output before it finishes.
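
One sketch of how such a side-by-side comparison could be made quantitative, assuming each build runs its own server on its own port via `OLLAMA_HOST` (the binary names `ollama-old` and `ollama-new` and the model name are hypothetical):

```shell
# Hypothetical side-by-side setup: each build serves on its own port.
OLLAMA_HOST=127.0.0.1:11434 ollama-old serve &
OLLAMA_HOST=127.0.0.1:11435 ollama-new serve &

# Run the same prompt against each server and compare the "eval rate"
# (tokens/s) that --verbose reports; the stats go to stderr.
for port in 11434 11435; do
  echo "== port $port =="
  OLLAMA_HOST=127.0.0.1:$port ollama run llama3 --verbose \
    "Describe yourself" 2>&1 | grep -E "eval (rate|duration|count)"
done
```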


@Eb7CAPJi commented on GitHub (Dec 17, 2025):

Another caveat is that token generation has been moved to the CPU instead of the GPU. This was evident from the CPU utilization graph.
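
A claim like this can be verified directly while the model is loaded: the PROCESSOR column of `ollama ps` reports how the weights are split between CPU and GPU. The output below is illustrative (made-up ID, size, and split), not taken from this report:

```console
$ ollama ps
NAME            ID              SIZE     PROCESSOR          UNTIL
llama3:latest   365c0bd3c000    5.4 GB   48%/52% CPU/GPU    4 minutes from now
```

A fully GPU-resident model shows `100% GPU` here; any `CPU/GPU` split means part of the model was offloaded to system memory.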


@pdevine commented on GitHub (Dec 17, 2025):

Can you post the logs? What model are you trying to use? What kind of GPU? What is the output of `ollama ps`?
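
For anyone gathering the requested information, a sketch of where to look; the log paths follow the project's troubleshooting docs and vary by platform and install method:

```shell
# Server logs:
journalctl -u ollama --no-pager | tail -n 200   # Linux, systemd install
tail -n 200 ~/.ollama/logs/server.log           # macOS app

# Runtime state:
ollama ps      # loaded models and the CPU/GPU split
nvidia-smi     # GPU model, driver version, VRAM usage (NVIDIA only)
```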


@Eb7CAPJi commented on GitHub (Dec 18, 2025):

The result is independent of the model type; the computation has shifted to burden the CPU instead of the GPU.

This is very noticeable on 34B models, which are the largest I can afford.
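
One plausible explanation for 34B models in particular: at that size the weights may no longer fit in VRAM, so some layers fall back to the CPU and generation slows sharply. A sketch for checking this, assuming an NVIDIA GPU:

```shell
# Watch VRAM and GPU utilization while the model generates. Nearly-full
# VRAM with low GPU utilization suggests layers were offloaded to the CPU.
watch -n 1 nvidia-smi

# The server log records how many layers were offloaded to the GPU;
# the exact wording varies by version.
journalctl -u ollama --no-pager | grep -i "offload" | tail -n 20
```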


@Eb7CAPJi commented on GitHub (Dec 18, 2025):

Honestly, I have neither the time nor the particular desire to dive into the code and figure out the root causes of the problems. All I can say is that Ollama is an amazing tool; just wrap it in tests to minimize the bugs.

Reference: github-starred/ollama#34658