[PR #15552] app/ui: reduce chat loading time for long conversations #20422

Open
opened 2026-04-16 07:37:21 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/15552
Author: @matteocelani
Created: 4/13/2026
Status: 🔄 Open

Base: main ← Head: fix/chat-loading-performance


📝 Commits (8)

  • 628c619 app/ui: reduce unnecessary work during chat loading
  • dc46b18 Merge branch 'main' into fix/chat-loading-performance
  • fc8509b app/ui: fix scroll-to-bottom not triggering when switching chats
  • 2e2fb85 app/ui: tokenize only the active theme in code blocks
  • 67293cc app/ui: minor performance fixes across chat rendering components
  • 880325d app/ui: defer Shiki tokenization for code blocks outside viewport
  • 86926f5 app/ui: defer markdown rendering for messages outside viewport
  • 42cb240 ci: retrigger tests

📊 Changes

8 files changed (+282 additions, -335 deletions)

View changed files

📝 app/ui/app/src/components/Chat.tsx (+2 -2)
📝 app/ui/app/src/components/Message.tsx (+41 -180)
📝 app/ui/app/src/components/MessageList.tsx (+11 -7)
📝 app/ui/app/src/components/StreamingMarkdownContent.tsx (+186 -127)
📝 app/ui/app/src/components/Thinking.tsx (+18 -16)
➕ app/ui/app/src/hooks/useTheme.ts (+18 -0)
📝 app/ui/app/src/lib/highlighter.ts (+5 -2)
📝 app/ui/app/src/routes/c.$chatId.tsx (+1 -1)

📄 Description

Switching to a long chat (400+ messages with code) blocks the main thread for ~1259ms. The primary bottleneck is Shiki syntax highlighting, which runs TextMate grammar regex for every code block on mount. Streamdown markdown parsing adds to this for every message. Both run for all messages including those outside the viewport.

After this change, the long task drops to ~285ms (77% reduction).

Changes

StreamingMarkdownContent.tsx: added a deferRendering prop. When true, the component renders plain text instead of running Streamdown. An IntersectionObserver upgrades it to full markdown when the element enters the viewport. Only the last 10 messages (EAGER_RENDER_COUNT in MessageList.tsx) render Streamdown immediately. All DOM nodes stay mounted and only their internal content changes, so useMessageAutoscroll and scroll position are unaffected.
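
A minimal sketch of the deferral pattern (hypothetical names; the import path and Streamdown usage are assumptions, not this PR's literal code):

```tsx
import { useEffect, useRef, useState } from 'react'
import { Streamdown } from 'streamdown' // assumed import

// Render plain text until the element scrolls into view, then upgrade to
// full markdown. The wrapper div stays mounted either way, so scroll
// position and autoscroll observers are unaffected.
function DeferredMarkdown({ content, deferRendering }: { content: string; deferRendering: boolean }) {
  const ref = useRef<HTMLDivElement>(null)
  const [upgraded, setUpgraded] = useState(!deferRendering)

  useEffect(() => {
    const el = ref.current
    if (upgraded || !el) return
    const observer = new IntersectionObserver(([entry]) => {
      if (entry.isIntersecting) setUpgraded(true)
    })
    observer.observe(el)
    return () => observer.disconnect()
  }, [upgraded])

  return <div ref={ref}>{upgraded ? <Streamdown>{content}</Streamdown> : content}</div>
}
```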

StreamingMarkdownContent.tsx: code blocks now tokenize only the active theme (prefers-color-scheme) instead of both light and dark. A per-theme cache stores both results lazily so theme switches after the first are instant. Code blocks outside the viewport defer Shiki tokenization via IntersectionObserver using a callback ref. Unregistered languages are skipped before calling codeToTokensBase to avoid throwing per block. Streamdown's internal code component is overridden to prevent duplicate Shiki highlighting.
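
A hedged sketch of the per-theme cache and the unregistered-language guard (identifiers and theme ids are illustrative assumptions, not this PR's actual code):

```ts
import { createHighlighter, type ThemedToken } from 'shiki'

// One highlighter instance for the app; theme and language ids are assumed.
const highlighterPromise = createHighlighter({
  themes: ['github-light', 'github-dark'],
  langs: ['typescript', 'python'],
})

// Tokens cached per theme, so only the active theme is tokenized up front
// and a later theme switch pays the cost once.
const tokenCache = new Map<string, ThemedToken[][]>()

export async function tokenizeForTheme(code: string, lang: string, theme: string) {
  const key = `${theme}\u0000${lang}\u0000${code}`
  const hit = tokenCache.get(key)
  if (hit) return hit

  const highlighter = await highlighterPromise
  // codeToTokensBase throws for languages that were never loaded, so skip them.
  if (!highlighter.getLoadedLanguages().includes(lang)) return null

  const tokens = highlighter.codeToTokensBase(code, { lang, theme })
  tokenCache.set(key, tokens)
  return tokens
}
```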

useTheme.ts: new hook using useSyncExternalStore to read the system theme.
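
A minimal sketch of such a hook, assuming the system theme is read via prefers-color-scheme (names are illustrative):

```ts
import { useSyncExternalStore } from 'react'

const QUERY = '(prefers-color-scheme: dark)'

function subscribe(onChange: () => void) {
  const mql = window.matchMedia(QUERY)
  mql.addEventListener('change', onChange)
  return () => mql.removeEventListener('change', onChange)
}

function getSnapshot(): 'light' | 'dark' {
  return window.matchMedia(QUERY).matches ? 'dark' : 'light'
}

// Re-renders subscribers only when the OS theme actually flips.
export function useTheme() {
  return useSyncExternalStore(subscribe, getSnapshot)
}
```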

highlighter.ts: exported THEME_LIGHT/THEME_DARK constants, used in theme definitions.

Chat.tsx: prevChatIdRef initialized as null instead of chatId so scroll-to-bottom works on page refresh. Ref update moved inside the scroll branch to fix a race condition when messages load asynchronously.
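
A hypothetical reconstruction of the fixed scroll logic, written as a standalone hook (scrollToBottom and the messages argument are assumptions, not the PR's literal code):

```tsx
import { useEffect, useRef } from 'react'

function useScrollOnChatSwitch(chatId: string, messages: unknown[], scrollToBottom: () => void) {
  // Start as null rather than chatId, so a fresh page load also counts as
  // a chat switch and triggers the scroll.
  const prevChatIdRef = useRef<string | null>(null)

  useEffect(() => {
    // Update the ref only inside the scroll branch: if messages arrive
    // asynchronously after chatId changes, marking the chat as seen too
    // early would skip the scroll entirely.
    if (prevChatIdRef.current !== chatId && messages.length > 0) {
      scrollToBottom()
      prevChatIdRef.current = chatId
    }
  }, [chatId, messages, scrollToBottom])
}
```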

c.$chatId.tsx: added await to ensureQueryData in route loader.

Message.tsx: added deferMarkdown prop, passed through to StreamingMarkdownContent. parsedArgs in ToolCallDisplay wrapped in useMemo. Removed unreachable code after early returns. Added lastToolQuery and deferMarkdown to memo comparator.
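
For the parsedArgs change, a sketch of the memoized parse (the props shape is hypothetical; the point is parsing once per distinct string instead of on every re-render during streaming):

```tsx
import { useMemo } from 'react'

function ToolCallDisplay({ args }: { args: string }) {
  // Re-parse only when the argument string itself changes; fall back to
  // null on malformed JSON instead of throwing mid-render.
  const parsedArgs = useMemo(() => {
    try {
      return JSON.parse(args) as Record<string, unknown>
    } catch {
      return null
    }
  }, [args])

  return <pre>{parsedArgs ? JSON.stringify(parsedArgs, null, 2) : args}</pre>
}
```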

MessageList.tsx: EAGER_RENDER_COUNT constant controls how many messages from the bottom render full markdown. lastToolQueries useMemo uses messages.length instead of messages to avoid recalculation during streaming. Typed browserToolResult prop.
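
A sketch of how the eager/deferred split might be wired, assuming a Message component with a deferMarkdown prop (illustrative, not the PR's exact code):

```tsx
const EAGER_RENDER_COUNT = 10

// Hypothetical message shape; only the index arithmetic is the point.
type ChatMessage = { id: string; content: string }

function MessageList({ messages }: { messages: ChatMessage[] }) {
  return (
    <>
      {messages.map((message, index) => (
        <Message
          key={message.id}
          message={message}
          // Everything before the last 10 messages starts as plain text
          // and upgrades when scrolled into view.
          deferMarkdown={index < messages.length - EAGER_RENDER_COUNT}
        />
      ))}
    </>
  )
}
```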

Thinking.tsx: ResizeObserver effect uses [] deps instead of [thinking]. StreamingMarkdownContent is not rendered when collapsed and finished thinking.
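
A sketch of the observer effect with empty deps (refs and state names are hypothetical):

```tsx
import { useEffect, useRef, useState } from 'react'

function Thinking({ thinking }: { thinking: string }) {
  const contentRef = useRef<HTMLDivElement>(null)
  const [height, setHeight] = useState(0)

  useEffect(() => {
    const el = contentRef.current
    if (!el) return
    // One observer for the component's lifetime: ResizeObserver already
    // fires as the streamed content grows, so re-creating it on every
    // `thinking` update was wasted work.
    const observer = new ResizeObserver(() => setHeight(el.scrollHeight))
    observer.observe(el)
    return () => observer.disconnect()
  }, [])

  return (
    <div style={{ maxHeight: height ? `${height}px` : undefined }}>
      <div ref={contentRef}>{thinking}</div>
    </div>
  )
}
```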

Benchmark

Measured on Chrome using PerformanceObserver longtask entries, switching from a short chat. Each value is the median of 3 runs.
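
For reference, a minimal version of this kind of long-task capture (illustrative, not the exact harness used for the numbers below):

```ts
// Long-task entries fire for main-thread tasks over 50ms; summing the
// durations observed right after a chat switch gives the figures below.
const observer = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    console.log(`longtask: ${Math.round(entry.duration)}ms`)
  }
})
observer.observe({ entryTypes: ['longtask'] })
```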

| Chat type | Messages | Before | After | Reduction |
|---|---|---|---|---|
| Code (Python, TypeScript) | 430 | ~1350ms | ~262ms | -81% |
| Web search with code | ~80 | ~780ms | ~166ms | -79% |
| PDF attachments (no code) | 184 | ~143ms | ~143ms | no change |

The PDF chat is 3.7x heavier in data (52MB of attachments vs 14MB) but 9.4x faster to load. This confirms the bottleneck is Shiki/Streamdown rendering of code blocks, not data transfer or DOM size. The improvement scales with the amount of code in the conversation. Also tested on the macOS desktop app (WebKit), where the improvement is visually noticeable.

Investigated but not included

Following the review feedback from @hoyyeva on #15265:

Attachment byte loading (ui.go): tested ChatWithOptions(cid, false); API time dropped from 213ms/70MB to 4ms/16KB for a 52MB attachment chat. Not included because it breaks image previews. A 67MB attachment chat loads in ~130ms while a lighter 18MB code chat takes ~1259ms, so the bottleneck is rendering, not data transfer.

Tool result duplication (ui.go): message.content and message.tool_result contain the same data for tool messages. The duplication is intentional because content is sent to the LLM as conversation context (ui.go:1743). The frontend already ignores content when tool_result exists. Omitting content from the chat API response would require backend serialization changes.

Batch DB queries (database.go): rewrote getMessages with JOIN queries, reducing 861 queries to 3 for 430 messages. Impact: 5ms down to 4ms. With local SQLite on an SSD in a single-user desktop context, the gain is negligible.

Progressive message loading: tried mounting only the last N messages with batch loading in idle frames. Conflicted with useMessageAutoscroll (ResizeObserver, MutationObserver, scroll compensation). The deferred rendering approach achieves similar results by keeping all nodes mounted.

Also implemented from the same review: lazy render for collapsed tool results (skip JSON.stringify when hidden) and await in the route loader.
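
The collapsed tool-result change amounts to gating the stringify on visibility, roughly as follows (component and prop names are hypothetical):

```tsx
import { useState } from 'react'

function ToolResult({ result }: { result: unknown }) {
  const [expanded, setExpanded] = useState(false)
  return (
    <div>
      <button onClick={() => setExpanded((v) => !v)}>
        {expanded ? 'Hide result' : 'Show result'}
      </button>
      {/* JSON.stringify is skipped entirely while collapsed */}
      {expanded && <pre>{JSON.stringify(result, null, 2)}</pre>}
    </div>
  )
}
```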


Fixes #12959


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-16 07:37:21 -05:00