Self-Hosted AI Assistant with Multiple LLM Providers on the Backend
Requirements
Problem Statement
The rise of large language models (LLMs) has enabled numerous applications ranging from automated customer support to content generation. However, leveraging multiple LLM providers and managing traffic, cost, and model selection has become increasingly complex. Organizations often need a unified interface to route queries intelligently, enforce request limits, and aggregate responses from various LLMs. Without such a system, developers must hardcode integrations, handle retries manually, and monitor multiple endpoints, making scaling challenging.
Summary
This article explores a practical approach to building a unified LLM chatbot helper using LiteLLM Gateway, a lightweight middleware designed to abstract interactions with multiple LLM providers. We'll examine traffic routing, request capping, and model orchestration, highlighting how LiteLLM can centralize operations. We'll also discuss advantages, limitations, and alternatives, along with a practical checklist to ensure smooth deployment.
Scope of the Article
This article focuses on:
Unifying LLM access behind a single API.
Implementing traffic routing and load balancing across multiple LLMs.
Enforcing request limits and cost controls.
Demonstrating a small-scale chatbot helper agent as a use case.
Comparing gateway logic with direct API integrations.
Note: It does not cover advanced LLM fine-tuning or custom model training, but it lays the foundation for scalable multi-provider LLM deployment.
Introduction to LiteLLM
LiteLLM is a lightweight API gateway designed to manage multiple LLM providers. It allows you to:
Serve multiple models behind a unified endpoint.
Enforce request limits and routing rules.
Track usage and optionally persist responses in a database.
LiteLLM is provider-agnostic and supports popular models like OpenAI GPT, LLaMA variants, and others through standardized interfaces.
Self-Hosted AI Assistant using LiteLLM
For this guide, start by opening Visual Studio Code.
Warning: All commands and Docker workflows in this guide assume you are running inside a Linux environment (WSL on Windows). If you are using VS Code, ensure it is opened in WSL mode. For other IDEs, run them against the same WSL/Linux context.
Next, create a config file for LiteLLM as follows:
config.yml
Important: Due to an existing bug, Nebius models need to be prefixed with an extra openai/.
```yaml
model_list:
  # --- Server ---
  - model_name: llm
    litellm_params:
      model: openai/gpt-oss:20b
      api_base: http://10.3.0.42:11434/v1
      api_key: EMPTY
      drop_params: true
      weight: 1
      max_parallel_requests: 10
  # --- Nebius ---
  - model_name: llm
    litellm_params:
      model: openai/openai/gpt-oss-20b
      api_base: https://api.tokenfactory.nebius.com/v1
      api_key: os.environ/NEBIUS_API_KEY
      temperature: 0.2
      weight: 2
      max_parallel_requests: 10

general_settings:
  master_key: sk-litellm-admin
  drop_params: true
  usage_limits:
    sk-litellm-admin:
      max_tokens: 1000
      max_requests: 3
```

For this guide, the model used is "gpt-oss-20b" from OpenAI, served through two different providers: one hosted on a server (10.3.0.42) and the other from the provider Nebius.
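Because both deployments share the model_name "llm", LiteLLM load-balances between them, and the weight values bias traffic roughly 1:2 toward Nebius. The following is an illustrative sketch of weight-proportional selection — not LiteLLM's actual routing code — using the same weights as the config above:

```python
import random
from collections import Counter

# Illustrative sketch of weighted routing, mirroring the "weight" values
# from config.yml above (not LiteLLM's actual implementation).
deployments = [
    {"name": "local-server", "weight": 1},
    {"name": "nebius", "weight": 2},
]

def pick_deployment(rng: random.Random) -> str:
    """Pick a deployment with probability proportional to its weight."""
    names = [d["name"] for d in deployments]
    weights = [d["weight"] for d in deployments]
    return rng.choices(names, weights=weights, k=1)[0]

# Over many requests, roughly two thirds should land on Nebius.
rng = random.Random(42)
counts = Counter(pick_deployment(rng) for _ in range(3000))
```

Over 3000 simulated requests, the Nebius deployment receives roughly twice as many as the local server, matching the configured weights.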
Next, create a Docker Compose file:
docker-compose.yaml
All Docker images are up to date as of the date of this article.
```yaml
services:
  litellm:
    image: ghcr.io/berriai/litellm:v1.80.10.rc.3
    container_name: litellm-proxy
    ports:
      - "4000:4000"
      - "4001:4001"
    volumes:
      - ./config/config.yml:/app/config.yaml
    environment:
      NEBIUS_API_KEY: ${NEBIUS_API_KEY}
      DATABASE_URL: postgresql://litellm:litellm@postgres:5432/litellm
      REDIS_URL: redis://redis:6379
      STORE_MODEL_IN_DB: "True"
      LITELLM_MASTER_KEY: sk-litellm-admin
      # UI_ENABLED: "true"  # --- NEEDED FOR UI BUT NOT IMPORTANT FOR THIS GUIDE
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    command:
      - "--config"
      - "/app/config.yaml"
      - "--port"
      - "4000"
      # - "--port"  # --- NEEDED FOR UI BUT NOT IMPORTANT FOR THIS GUIDE
      # - "4001"    # --- NEEDED FOR UI BUT NOT IMPORTANT FOR THIS GUIDE
    restart: unless-stopped

  postgres:
    image: postgres:15
    environment:
      POSTGRES_DB: litellm
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: litellm
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U litellm"]
      interval: 5s
      timeout: 5s
      retries: 10

  redis:
    image: redis:7
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 10

  openwebui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: openwebui
    ports:
      - "3000:8080"
    environment:
      - OPENAI_API_BASE_URL=http://litellm:4000/v1
      - OPENAI_API_KEY=sk-litellm-admin
    depends_on:
      - litellm
    restart: unless-stopped

volumes:
  postgres_data:
```

This docker-compose.yaml defines LiteLLM, Postgres, Redis, and OpenWebUI for a lightweight, unified local LLM workflow.
LiteLLM handles model routing, request parallelism, and usage limits.
Postgres and Redis provide state persistence and caching.
OpenWebUI gives a web interface to interact with LLMs.
The setup supports both local and remote models, with clear separation between backend (LiteLLM), storage (Postgres/Redis), and frontend (OpenWebUI).
More explanation of the docker-compose:
LiteLLM service
Image & container: ghcr.io/berriai/litellm:v1.80.10.rc.3, container named litellm-proxy.
Ports: 4000 exposes the API (4001 is reserved for the optional UI).
Configuration: Mounted from ./config/config.yml to /app/config.yaml to allow easy editing.
Environment variables: NEBIUS_API_KEY for accessing Nebius models. DATABASE_URL and REDIS_URL point to the Postgres and Redis services for storage, caching, and parallel request handling. STORE_MODEL_IN_DB ensures model metadata is persisted. LITELLM_MASTER_KEY is the master authentication key for API access.
Dependencies: LiteLLM waits for both Postgres and Redis to be healthy before starting (depends_on with health checks).
Command: Uses --config to load the YAML file and runs on port 4000. The restart policy is unless-stopped for reliability.
Postgres service
Image: postgres:15 with a dedicated database litellm.
Credentials: Username and password are both litellm.
Volumes: Persists data in postgres_data to survive container restarts.
Healthcheck: Uses pg_isready to ensure readiness before LiteLLM starts.
Redis service
Image: redis:7 for fast in-memory caching, often used for request queuing and parallelism.
Healthcheck: Simple ping test (interval 5s, timeout 3s, retries 10).
OpenWebUI service
Image: ghcr.io/open-webui/open-webui:main provides a web-based interface for LLM interaction.
Ports: Maps container 8080 to host 3000.
Environment: Configured to use LiteLLM as the backend API, with the master key for authentication.
Dependency: Starts only after LiteLLM is running. Restart policy is unless-stopped.
Volumes
postgres_data ensures database persistence across container restarts.
Next, make sure you have a correct .env file:
.env
Make sure that the API keys used have access to the models that will be called.
```shell
# --- Providers ---
NEBIUS_API_KEY=

# --- OpenAI-compatible Base URL ---
NEBIUS_BASE_URL=https://api.tokenfactory.nebius.com/v1/
```

Under Providers, add API keys from any OpenAI-compatible LLM provider, and add the provider URL under Base URL.
Run the solution in the terminal.
Detached mode (-d) is important: everything runs in the background and the current terminal session stays interactive.

```shell
docker compose up -d
```

Once the services are up, OpenWebUI starts; it can be checked by going to http://localhost:3000/
Here, OpenWebUI loads, and the "models" dropdown is populated with the model name "llm" taken from LiteLLM's config.
LiteLLM endpoints (http://litellm:4000/v1) can also be used in custom applications, just as OpenAI endpoints would be. Instead of a generated API key, the LiteLLM master key defined in docker-compose.yaml is used.
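As a sketch of such a custom application, here is a minimal Python client (standard library only) that talks to the gateway the way an OpenAI client would. The base URL and master key are the ones from this guide; the commented-out call at the bottom only succeeds with the compose stack running:

```python
import json
import urllib.request

LITELLM_BASE = "http://localhost:4000/v1"  # gateway address from this guide
MASTER_KEY = "sk-litellm-admin"            # master key from docker-compose.yaml

def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for the LiteLLM proxy."""
    payload = {
        "model": "llm",  # the model_name declared in config.yml
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{LITELLM_BASE}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {MASTER_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# With the compose stack running, send the request and print the reply:
# with urllib.request.urlopen(build_chat_request("Hello!")) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Swapping in the official openai Python package works the same way: point its base_url at the gateway and pass the master key as the API key.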
To test this endpoint, execute the following cURL:

```shell
curl http://localhost:4000/health \
  -H "Authorization: Bearer sk-litellm-admin"
```

This should return a JSON with status:

```json
{"status":"ok"}
```

To see which models are available, execute the following cURL:

```shell
curl http://localhost:4000/v1/models \
  -H "Authorization: Bearer sk-litellm-admin"
```

A minimal version is available in our GitHub: https://github.com/krafteq/krafteq.dev/tree/main/self-hosted-ai-gateway-intro
After cloning the repository, open it with Visual Studio Code.
If the operating system is Windows, ensure that Visual Studio Code is opened from the Ubuntu (WSL) environment.
Go to the root directory of the project that contains docker-compose.yaml and run it in detached mode:
```shell
docker compose up -d
```

Comparison to other available solutions
Advantages and Disadvantages
Advantages of Using a Gateway
Centralized Management: One API endpoint to rule them all.
Provider Abstraction: Swap providers without changing client code.
Monitoring & Analytics: Track usage, latency, and costs across all models.
Traffic Control: Implement rate-limits, capping, and retries.
Disadvantages / Considerations
Single Point of Failure: If the gateway goes down, all requests fail.
Latency Overhead: Additional network hop may add a few milliseconds per request.
Configuration Complexity: Requires proper mapping of models, APIs, and keys.
Provider Limitations: Some providers may not be fully compatible, requiring careful testing.
Conclusion
A unified LLM-based assistant using LiteLLM Gateway provides a scalable, maintainable, and extensible way to manage multiple models, providers, and usage limits. While gateways add small latency and configuration overhead, the benefits of centralized control, monitoring, and traffic orchestration outweigh the drawbacks in production systems.
References and Further Reading
LiteLLM GitHub – official repository.
Brown, T. et al., Language Models are Few-Shot Learners, 2020. arXiv:2005.14165
Vaswani, A. et al., Attention Is All You Need, 2017. arXiv:1706.03762
OpenAI GPT-4 Technical Report, 2023.
“Scalable Multi-Provider LLM Orchestration,” Internal LiteLLM Documentation.