[GH-ISSUE #10085] Feature Request: Add gRPC API Support #6610

Open
opened 2026-04-12 18:16:29 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @matusbielik on GitHub (Apr 2, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10085

Description

I'd like to request adding a gRPC API alongside the existing REST API in Ollama. This would provide significant benefits for microservice architectures and high-performance applications that need to work with embedding models.

Use Cases

  • Microservice architectures where services communicate via gRPC
  • Applications requiring high-throughput embedding generation
  • Systems that need to minimize latency and overhead when working with embeddings
  • Services that need to stream large batches of embedding requests

Benefits of gRPC for Ollama

Performance Improvements

  • Binary Protocol: gRPC uses Protocol Buffers (protobuf) which is a binary format, significantly reducing payload size compared to JSON, especially for embedding vectors which contain hundreds or thousands of floating-point values
  • HTTP/2: gRPC leverages HTTP/2 for multiplexing, header compression, and streaming
  • Connection Reuse: Maintains persistent connections between client and server, reducing handshake overhead

Developer Experience

  • Strongly Typed Interfaces: Automatic client code generation from .proto definitions
  • Bidirectional Streaming: Allows for efficient streaming of requests and responses
  • Language Agnostic: Clients can be generated for many programming languages

Specific to Embeddings

  • Efficient Vector Transfer: Embedding vectors can be transferred as native binary data rather than being serialized to strings and back
  • Reduced Parsing Overhead: No need to parse JSON for large embedding arrays
  • Batch Processing Support: gRPC streaming would enable efficient batch processing patterns
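The payload-size point above can be sketched with the standard library alone. The 768-dimensional vector below is an assumption for illustration; real embedding sizes depend on the model:

```python
import json
import random
import struct

# Illustrative 768-dimensional embedding vector (dimension is an
# assumption; real sizes depend on the model).
random.seed(0)
vec = [random.uniform(-1.0, 1.0) for _ in range(768)]

# REST today: each float is serialized as decimal text in a JSON array.
json_payload = json.dumps(vec).encode("utf-8")

# Protobuf `repeated float` is packed on the wire as raw little-endian
# 32-bit floats (plus a few bytes of tag/length framing, ignored here).
binary_payload = struct.pack(f"<{len(vec)}f", *vec)

print(len(json_payload), len(binary_payload))
```

The binary form is a fixed 4 bytes per value, while the JSON form spends roughly 20 bytes of decimal text per value, so the gap grows with vector dimension and batch size.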

Proposed Implementation

The gRPC API could mirror the existing REST API endpoints but with the advantages of Protocol Buffers:

syntax = "proto3";
package ollama;

service Ollama {
  // Generate embeddings for the provided input text
  rpc Embed(EmbedRequest) returns (EmbedResponse) {}
  
  // Generate completions for the provided prompt
  rpc Generate(GenerateRequest) returns (stream GenerateResponse) {}
  
  // Chat completion endpoint
  rpc Chat(ChatRequest) returns (stream ChatResponse) {}
  
  // Model management endpoints
  rpc ListModels(ListModelsRequest) returns (ListModelsResponse) {}
  rpc PullModel(PullModelRequest) returns (stream PullModelResponse) {}
  // Other endpoints...
}

message EmbedRequest {
  string model = 1;
  repeated string input = 2;
}

message EmbedResponse {
  message Embeddings {
    repeated float values = 1;
  }
  repeated Embeddings embeddings = 1;
}

// Other message definitions...
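To make the binary-format claim concrete, here is a hand-rolled wire encoding of the `EmbedRequest` message defined above, using only the standard library (a real client would use `protoc`-generated code; the model name is a placeholder):

```python
def encode_len_delimited(field_number: int, data: bytes) -> bytes:
    # Wire type 2 = length-delimited (strings, bytes, sub-messages).
    # For simplicity this only handles payloads under 128 bytes, so the
    # length varint always fits in a single byte.
    assert len(data) < 128
    return bytes([(field_number << 3) | 2, len(data)]) + data

def encode_embed_request(model: str, inputs: list[str]) -> bytes:
    msg = encode_len_delimited(1, model.encode())   # string model = 1;
    for text in inputs:                             # repeated string input = 2;
        msg += encode_len_delimited(2, text.encode())
    return msg

wire = encode_embed_request("all-minilm", ["hello", "world"])
print(wire.hex())
# -> 0a0a616c6c2d6d696e696c6d120568656c6c6f1205776f726c64
```

The entire two-input request fits in 26 bytes, with no quoting, escaping, or text parsing on either side.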

Additional Considerations

  • The gRPC API could run alongside the existing REST API (e.g., on a different port)
  • It would be valuable to support batch embedding operations in the gRPC API
  • Authentication mechanisms should be consistent with the REST API
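The batch-embedding consideration could look like the sketch below on the client side: split a large corpus into fixed-size chunks, where each chunk would populate the `repeated string input` field of one `EmbedRequest` (the helper name and batch size are illustrative, not part of any existing API):

```python
from itertools import islice

def batched(texts, size):
    """Split a large input list into fixed-size batches; each batch would
    become one EmbedRequest, since `input` is a `repeated string` field."""
    it = iter(texts)
    while chunk := list(islice(it, size)):
        yield chunk

docs = [f"document {i}" for i in range(10)]
print([len(b) for b in batched(docs, 4)])  # -> [4, 4, 2]
```

With a client-streaming or bidirectional-streaming RPC, these batches could be sent over a single persistent HTTP/2 connection instead of one REST request per batch.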

This feature would make Ollama even more versatile for production deployments and high-performance applications, particularly those working with embedding models at scale.

GiteaMirror added the feature request label 2026-04-12 18:16:29 -05:00
Author
Owner

@akshaymishra-xavg commented on GitHub (Aug 21, 2025):

Yes. Currently all my microservices are just facade layers of Python and FastAPI that internally call different models in Ollama over the REST API. If I get a gRPC endpoint in Ollama, I will not need any of these microservices: I can just install and use Ollama (maybe even Ollama Turbo for paid users, so that I pay only for the GPU compute) and call it from my web services, eliminating the need for all my facade-layer microservices.

Author
Owner

@zhaohan-dong commented on GitHub (Jan 22, 2026):

We do have a gRPC API at xAI. Would anyone be open to adopting these protos as the protocol? If so, we could probably discuss and work something out.
https://github.com/xai-org/xai-proto


Reference: github-starred/ollama#6610