
Self-Hosted AI Assistant with Multiple LLM Providers on the Backend

Requirements

  1. Linux-based system or WSL installed
  2. Docker installed

Problem Statement

The rise of large language models (LLMs) has enabled numerous applications ranging from automated customer support to content generation. However, leveraging multiple LLM providers and managing traffic, cost, and model selection has become increasingly complex. Organizations often need a unified interface to route queries intelligently, enforce request limits, and aggregate responses from various LLMs. Without such a system, developers must hardcode integrations, handle retries manually, and monitor multiple endpoints, making scaling challenging.

Summary

This article explores a practical approach to building a unified LLM chatbot helper using LiteLLM Gateway, a lightweight middleware designed to abstract interactions with multiple LLM providers. We'll examine traffic routing, request capping, and model orchestration, highlighting how LiteLLM can centralize operations. We'll also discuss advantages, limitations, and alternatives, along with a practical checklist to ensure smooth deployment.

Scope of the Article

This article focuses on:

  1. Unifying LLM access behind a single API.
  2. Implementing traffic routing and load balancing across multiple LLMs.
  3. Enforcing request limits and cost controls.
  4. Demonstrating a small-scale chatbot helper agent as a use case.
  5. Comparing gateway logic with direct API integrations.

Note: It does not cover advanced LLM fine-tuning or custom model training, but it lays the foundation for scalable multi-provider LLM deployment.

Introduction to LiteLLM

LiteLLM is a lightweight API gateway designed to manage multiple LLM providers. It allows you to:

  • Serve multiple models behind a unified endpoint.
  • Enforce request limits and routing rules.
  • Track usage and optionally persist responses in a database.

LiteLLM is provider-agnostic and supports popular models like OpenAI GPT, LLaMA variants, and others through standardized interfaces.

Self-Hosted AI Assistant using LiteLLM

For this guide, start by opening Visual Studio Code.

Warning: All commands and Docker workflows in this guide assume you are running inside a Linux environment (on Windows, use WSL). If you are using VS Code, ensure it is opened in WSL mode. For other IDEs, run them against the same WSL/Linux context.
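
If working from a WSL terminal, one way to open VS Code in WSL mode is to launch it from the project directory (this assumes the VS Code WSL extension is installed):

code .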

  1. Create a config file for LiteLLM at config/config.yml as follows:

Important: Due to an existing bug, Nebius models need to be prefixed with an extra openai/.

model_list:
  
  # --- Server ---
  - model_name: llm
    litellm_params:
      model: openai/gpt-oss:20b
      api_base: http://10.3.0.42:11434/v1
      api_key: EMPTY
      drop_params: true
      "weight": 1
      "max_parallel_requests": 10

  # --- Nebius ---
  - model_name: llm
    litellm_params:
      model: openai/openai/gpt-oss-20b
      api_base: https://api.tokenfactory.nebius.com/v1
      api_key: os.environ/NEBIUS_API_KEY
      temperature: 0.2
      "weight": 2
      "max_parallel_requests": 10



general_settings:
  master_key: sk-litellm-admin
  drop_params: true
  usage_limits:
    sk-litellm-admin:
      max_tokens: 1000
      max_requests: 3

For this guide, the model used is “gpt-oss-20b” from OpenAI, served through two different providers: one hosted on a local server (10.3.0.42) and the other from Nebius. Both entries share the model_name “llm”, so LiteLLM balances requests between them according to their weights.

  2. Next, create a Docker Compose file, docker-compose.yaml:

All Docker images are up to date as of the date of this article.

services:
  litellm:
    image: ghcr.io/berriai/litellm:v1.80.10.rc.3
    container_name: litellm-proxy
    ports:
      - "4000:4000"
      - "4001:4001"
    volumes:
      - ./config/config.yml:/app/config.yaml
    environment:
      NEBIUS_API_KEY: ${NEBIUS_API_KEY}
      DATABASE_URL: postgresql://litellm:litellm@postgres:5432/litellm
      REDIS_URL: redis://redis:6379
      STORE_MODEL_IN_DB: "True"
      LITELLM_MASTER_KEY: sk-litellm-admin
      # UI_ENABLED: "true"  # --- NEEDED FOR UI BUT NOT IMPORTANT FOR THIS GUIDE
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    command:
      - "--config"
      - "/app/config.yaml"
      - "--port"
      - "4000"
      # - "--port"  # --- NEEDED FOR UI BUT NOT IMPORTANT FOR THIS GUIDE
      # - "4001"  # --- NEEDED FOR UI BUT NOT IMPORTANT FOR THIS GUIDE
    restart: unless-stopped

  postgres:
    image: postgres:15
    environment:
      POSTGRES_DB: litellm
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: litellm
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U litellm"]
      interval: 5s
      timeout: 5s
      retries: 10

  redis:
    image: redis:7
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 10

  openwebui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: openwebui
    ports:
      - "3000:8080"
    environment:
      - OPENAI_API_BASE_URL=http://litellm:4000/v1
      - OPENAI_API_KEY=sk-litellm-admin
    depends_on:
      - litellm
    restart: unless-stopped

volumes:
  postgres_data:

This docker-compose.yaml defines LiteLLM, Postgres, Redis, and OpenWebUI for a lightweight, unified local LLM workflow.

  • LiteLLM handles model routing, request parallelism, and usage limits.
  • Postgres and Redis provide state persistence and caching.
  • OpenWebUI gives a web interface to interact with LLMs.
  • The setup supports both local and remote models, with clear separation between backend (LiteLLM), storage (Postgres/Redis), and frontend (OpenWebUI).

A more detailed explanation of each part of the docker-compose.yaml follows.

LiteLLM service

  • Image & container: ghcr.io/berriai/litellm:v1.80.10.rc.3, container named litellm-proxy.
  • Ports: 4000 exposes the API; 4001 is also mapped for the optional UI, which is not used in this guide.
  • Configuration: Mounted from ./config/config.yml to /app/config.yaml to allow easy editing.
  • Environment variables:
    • NEBIUS_API_KEY for accessing Nebius models.
    • DATABASE_URL and REDIS_URL point to Postgres and Redis services for storage, caching, and parallel request handling.
    • STORE_MODEL_IN_DB ensures model metadata is persisted.
    • LITELLM_MASTER_KEY is the master authentication key for API access.
  • Dependencies: LiteLLM waits for both Postgres and Redis to be healthy before starting (depends_on with health checks).
  • Command: Uses --config to load the YAML file and runs on port 4000. Restart policy is unless-stopped for reliability.
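
To confirm that LiteLLM picked up the mounted config, one option is to tail the proxy logs (assuming the service name litellm from the compose file above):

docker compose logs -f litellm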

Postgres service

  • Image: postgres:15 with a dedicated database litellm.
  • Credentials: Username and password are both litellm.
  • Volumes: Persists data in postgres_data to survive container restarts.
  • Healthcheck: Uses pg_isready to ensure readiness before LiteLLM starts.

Redis service

  • Image: redis:7 for fast in-memory caching, often used for request queuing and parallelism.
  • Healthcheck: Simple ping test, interval 5s, timeout 3s, retries 10.
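
Both health checks can also be run manually against the running containers, assuming the service names postgres and redis from the compose file above:

docker compose exec postgres pg_isready -U litellm
docker compose exec redis redis-cli ping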

OpenWebUI service

  • Image: ghcr.io/open-webui/open-webui:main provides a web-based interface for LLM interaction.
  • Ports: Maps container 8080 to host 3000.
  • Environment: Configured to use LiteLLM as the backend API with the master key for authentication.
  • Dependency: Starts only after LiteLLM is running. Restart policy is unless-stopped.

Volumes

  • postgres_data ensures database persistence across container restarts.
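
To verify the volume after the first start, it can be listed with the command below; note that Docker Compose prefixes the volume name with the project (directory) name, so the exact name will vary:

docker volume ls | grep postgres_data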



  3. Next, make sure you have a correct .env file:

Make sure that the API Keys used have access to the models that will be called

# --- Providers ---
NEBIUS_API_KEY=

# --- OpenAI-compatible Base URL ---
NEBIUS_BASE_URL=https://api.tokenfactory.nebius.com/v1/

Under Providers, add API keys for any OpenAI-compatible LLM providers, and add the provider URL under Base URL.
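
To check that Docker Compose actually picks up the variables from .env, the resolved configuration can be rendered; this is a quick sanity check, assuming .env sits next to docker-compose.yaml:

docker compose config | grep NEBIUS_API_KEY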

  4. Run the solution from the terminal with:

Detached mode (-d) is important: everything runs in the background and the current terminal session stays interactive.

docker compose up -d
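
Once the command returns, a quick way to confirm that all four services are up and healthy is:

docker compose ps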

Once the services are up, OpenWebUI starts and can be reached at http://localhost:3000/

Here, OpenWebUI loads, and the “models” dropdown is populated with the model name “llm”, taken from LiteLLM’s config.

The LiteLLM endpoint (http://litellm:4000/v1 from inside the Compose network, or http://localhost:4000/v1 from the host) can also be used in custom applications just as an OpenAI endpoint would be. Instead of a generated API key, the LiteLLM master key defined in docker-compose.yaml is used.

To test this endpoint, execute the following cURL command:

curl http://localhost:4000/health \
  -H "Authorization: Bearer sk-litellm-admin"

This should return a JSON response with the status:

{"status":"ok"}

To list the available models, execute the following cURL command:

curl http://localhost:4000/v1/models \
  -H "Authorization: Bearer sk-litellm-admin"

A minimal version is available in our GitHub repository: https://github.com/krafteq/krafteq.dev/tree/main/self-hosted-ai-gateway-intro

After cloning the repository, open it with Visual Studio Code.

If the operating system is Windows, ensure that Visual Studio Code is opened from within the Ubuntu (WSL) environment.

Go to the root directory of the project that contains docker-compose.yaml and run it in detached mode:

docker compose up -d
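
When finished, the stack can be stopped and removed with the standard Compose teardown; adding -v would also delete the Postgres volume, so omit it to keep the data:

docker compose down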

Comparison to Other Available Solutions

Gateway    | Cost                 | Notes
LiteLLM    | Low / Self-hosted    | Lightweight Python gateway, flexible for multiple LLMs
Ollama     | Medium / Proprietary | Proprietary, local-first models
Nebius     | Medium / Cloud       | Cloud-first provider, multiple models
OpenRouter | Low / Open-source    | Open-source cloud gateway for multiple providers
Helicone   | Medium-High / Cloud  | Cloud-first, usage analytics and cost tracking
Bifrost    | Medium / Self-hosted | Flexible Python gateway, designed for multi-provider orchestration
Portkey    | Medium / Self-hosted | Focused on orchestration, parallelism, and cost monitoring
Tensor0    | Medium / Self-hosted | Python-first gateway with parallelism, usage capping, and multi-provider support

Advantages and Disadvantages

Advantages of Using a Gateway

  • Centralized Management: One API endpoint to rule them all.
  • Provider Abstraction: Swap providers without changing client code.
  • Monitoring & Analytics: Track usage, latency, and costs across all models.
  • Traffic Control: Implement rate-limits, capping, and retries.

Disadvantages / Considerations

  • Single Point of Failure: If the gateway goes down, all requests fail.
  • Latency Overhead: Additional network hop may add a few milliseconds per request.
  • Configuration Complexity: Requires proper mapping of models, APIs, and keys.
  • Provider Limitations: Some providers may not be fully compatible, requiring careful testing.

Conclusion

A unified LLM based assistant using LiteLLM Gateway provides a scalable, maintainable, and extensible solution to manage multiple models, providers, and usage limits. While gateways add a small latency and configuration overhead, the benefits of centralized control, monitoring, and traffic orchestration outweigh the drawbacks in production systems.

References and Further Reading

  1. LiteLLM GitHub – official repository.
  2. Brown, T. et al., Language Models are Few-Shot Learners, 2020. arXiv:2005.14165
  3. Vaswani, A. et al., Attention Is All You Need, 2017. arXiv:1706.03762
  4. OpenAI GPT-4 Technical Report, 2023.
  5. “Scalable Multi-Provider LLM Orchestration,” Internal LiteLLM Documentation.