Using and Consuming Ollama Server API

What is the Ollama API?

The Ollama API is a RESTful interface that provides programmatic access to Ollama's large language model capabilities. Running on port 11434 by default, this HTTP-based API allows developers to integrate Ollama's local LLM functionality directly into their applications. The API supports text generation, chat completions, embeddings, and model management, making it easy to build AI-powered applications that run entirely on your own infrastructure without cloud dependencies.

Getting Started

  • API Base URL

    http://localhost:11434 by default, or your server's address for remote instances (a quick connectivity check follows this list)

  • Authentication

    None built in; restrict access through network controls such as a firewall or reverse proxy

  • Request Format

    JSON payloads with Content-Type: application/json

  • Response Format

    JSON responses, with streaming delivered as newline-delimited JSON over HTTP chunked encoding

  • Cross-Origin

    Browser origins beyond localhost are blocked by default; allow them with the OLLAMA_ORIGINS environment variable

  • Rate Limiting

    No artificial limits; throughput is bounded by your local hardware
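
A quick way to confirm the server is reachable is to list the locally installed models with GET /api/tags. A minimal sketch in Python, assuming the default base URL and the requests library:

import requests

# Query the local Ollama server for its installed models (GET /api/tags).
# Assumes the default base URL; change it if your server runs elsewhere.
BASE_URL = "http://localhost:11434"

resp = requests.get(f"{BASE_URL}/api/tags", timeout=5)
resp.raise_for_status()

for model in resp.json().get("models", []):
    print(model["name"])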

API Endpoints

Generate text completions from your model with fine-grained control via POST /api/generate:

  • Send prompts and receive completions with streaming support (a streaming sketch follows this list)
  • Control temperature, top_p, and other generation parameters
  • Set maximum token limits for responses
  • Include system prompts for context setting
  • Stream responses for real-time display
  • Format: POST with JSON body containing prompt and parameters
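
As a sketch of the streaming behavior noted above: when "stream": true is set, Ollama returns one JSON object per line, each carrying a partial response and a final object marked "done": true. The example below assumes a local server and an installed llama3 model:

import json
import requests

# Stream a completion from POST /api/generate and print tokens as they arrive.
payload = {"model": "llama3", "prompt": "Explain streaming in one sentence.", "stream": True}

with requests.post("http://localhost:11434/api/generate", json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)          # each line is a standalone JSON object
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):             # final object signals the end of the stream
            print()
            break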

Create conversational interactions with chat-optimized models via POST /api/chat:

  • Send and receive messages in a conversational format
  • Maintain conversation history with message arrays (see the multi-turn sketch after this list)
  • Distinguish between system, user, and assistant messages
  • Control response characteristics with temperature settings
  • Stream responses for interactive chat interfaces
  • Format: POST with JSON body containing messages array
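
Conversation history is the client's responsibility: append each assistant reply to the messages array and resend the whole list on the next turn. A minimal multi-turn sketch, again assuming a local llama3 model:

import requests

URL = "http://localhost:11434/api/chat"
messages = [{"role": "system", "content": "You are a helpful assistant."}]

# Each turn appends the user prompt and the assistant reply, so the model
# sees the whole conversation on the next request.
for user_input in ["What is Ollama?", "How do I call its chat endpoint?"]:
    messages.append({"role": "user", "content": user_input})
    resp = requests.post(URL, json={"model": "llama3", "messages": messages, "stream": False})
    resp.raise_for_status()
    reply = resp.json()["message"]
    messages.append(reply)
    print(f"assistant: {reply['content']}")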

Ollama with Third-Party Applications

Connect Ollama to Chatbox for an enhanced chat interface:

  • Setup: Go to Settings > Custom API > Add Custom
  • Set Name to "Ollama" and Base URL to "http://localhost:11434"
  • Set Model Field to "model" and enable streaming
  • Click Save and select your Ollama model from the dropdown
  • Enjoy advanced UI features while using local Ollama models

Integrate Ollama into your development workflow:

  • • Install "Continue" extension for VS Code
  • • Configure extension to use Ollama URL (http://localhost:11434)
  • • Access Ollama models directly within VS Code
  • • Get code completions, explanations, and refactoring suggestions
  • • Use slash commands in comments for contextual assistance
  • • Configure model preferences in the extension settings
  • • Maintain privacy with code never leaving your machine

Example Code

Basic Generation Request (JavaScript)

fetch('http://localhost:11434/api/generate', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'llama3',
    prompt: 'Explain how to consume the Ollama API',
    stream: false
  })
})
.then(response => response.json())
.then(data => console.log(data.response))

Chat Completion with History (Python)

import requests

url = "http://localhost:11434/api/chat"
payload = {
    "model": "llama3",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, how do I use the Ollama API?"}
    ],
    "stream": False
}

response = requests.post(url, json=payload)
print(response.json()["message"]["content"])

Remote Access Configuration

By default, Ollama only accepts connections from localhost. To use Ollama with remote applications or devices, you need to configure it to accept external connections (a quick verification sketch follows the platform examples below):

Linux and macOS

# Bind the server to all network interfaces and start it
OLLAMA_HOST=0.0.0.0 ollama serve

Windows

:: Set the environment variable and start Ollama (Command Prompt)
set OLLAMA_HOST=0.0.0.0
ollama serve

Docker

docker run -d -p 11434:11434 -e OLLAMA_HOST=0.0.0.0 -v ollama:/root/.ollama ollama/ollama
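
Once external connections are allowed, a quick check from another machine confirms the server is reachable. The sketch below assumes a placeholder address of 192.168.1.50; substitute your server's actual IP or hostname:

import requests

# Verify that a remote Ollama server is reachable from this machine.
# 192.168.1.50 is a placeholder; substitute your server's address.
REMOTE_URL = "http://192.168.1.50:11434"

resp = requests.get(f"{REMOTE_URL}/api/version", timeout=5)
resp.raise_for_status()
print("Connected to Ollama", resp.json().get("version", "(unknown version)"))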

Best Practices

  • Streaming Responses: Enable streaming for real-time feedback on longer generations by setting stream: true and processing the chunked HTTP response.
  • Error Handling: Implement robust error handling for cases where the model is not loaded or the server is under heavy load.
  • Connection Management: For high-throughput applications, implement connection pooling and retry logic to handle occasional timeouts (a retry sketch follows this list).
  • Resource Monitoring: Track GPU/CPU usage and memory consumption when making API calls to optimize performance.
  • Parameter Tuning: Experiment with temperature, top_p, and other parameters to achieve the desired balance between creativity and determinism.
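
As a sketch of the error-handling and retry points above, the snippet below retries transient failures with a simple backoff; the retry count, backoff, and timeout values are illustrative assumptions rather than recommendations:

import time
import requests

# Retry transient failures a few times with a short backoff before giving up.
def generate(prompt, retries=3, backoff=2.0):
    for attempt in range(retries):
        try:
            resp = requests.post(
                "http://localhost:11434/api/generate",
                json={"model": "llama3", "prompt": prompt, "stream": False},
                timeout=120,  # generation can be slow on modest hardware
            )
            resp.raise_for_status()
            return resp.json()["response"]
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * (attempt + 1))

print(generate("Say hello in one word."))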

The Ollama API provides a straightforward way to integrate locally running large language models into applications and third-party tools. With its simple RESTful interface, developers can quickly connect their favorite applications like Chatbox, VS Code, or Obsidian to leverage powerful AI capabilities while maintaining full control over data privacy and infrastructure. Whether you're using off-the-shelf applications or building custom solutions, Ollama's standardized API makes it easy to incorporate state-of-the-art language models into any workflow while keeping all processing on your own hardware.