Self-Hosted AI Assistant with Multiple LLM Providers on the Backend
Requirements
Problem Statement
The rise of large language models (LLMs) has enabled numerous applications ranging from automated customer support to content generation. However, leveraging multiple LLM providers and managing traffic, cost, and model selection has become increasingly complex. Organizations often need a unified interface to route queries intelligently, enforce request limits, and aggregate responses from various LLMs. Without such a system, developers must hardcode integrations, handle retries manually, and monitor multiple endpoints, making scaling challenging.
Summary
This article explores a practical approach to building a unified LLM chatbot helper using LiteLLM Gateway, a lightweight middleware designed to abstract interactions with multiple LLM providers. We'll examine traffic routing, request capping, and model orchestration, highlighting how LiteLLM can centralize operations. We'll also discuss advantages, limitations, and alternatives, along with a practical checklist to ensure smooth deployment.
Scope of the Article
This article focuses on:
Unifying LLM access behind a single API.
Implementing traffic routing and load balancing across multiple LLMs.
Enforcing request limits and cost controls.
Demonstrating a small-scale chatbot helper agent as a use case.
Comparing gateway logic with direct API integrations.
Note: It does not cover advanced LLM fine-tuning or custom model training, but it lays the foundation for scalable multi-provider LLM deployment.
Introduction to LiteLLM
LiteLLM is a lightweight API gateway designed to manage multiple LLM providers. It allows you to:
Serve multiple models behind a unified endpoint.
Enforce request limits and routing rules.
Track usage and optionally persist responses in a database.
LiteLLM is provider-agnostic and supports popular models like OpenAI GPT, LLaMA variants, and others through standardized interfaces.
Self-Hosted AI Assistant using LiteLLM
For this guide, start by opening Visual Studio Code.
Warning: All commands and Docker workflows in this guide assume you are running inside a Linux environment (WSL on Windows). If you are using VS Code, ensure it is opened in WSL mode. For other IDEs, run them against the same WSL/Linux context.
Next, create a config file for LiteLLM as follows:
config.yml
Important: Due to an existing bug, Nebius models need to be prefixed with an extra openai/.
```yaml
model_list:
  # --- Server ---
  - model_name: llm
    litellm_params:
      model: openai/gpt-oss:20b
      api_base: http://10.3.0.42:11434/v1
      api_key: EMPTY
      drop_params: true
      weight: 1
      max_parallel_requests: 10
  # --- Nebius ---
  - model_name: llm
    litellm_params:
      model: openai/openai/gpt-oss-20b
      api_base: https://api.tokenfactory.nebius.com/v1
      api_key: os.environ/NEBIUS_API_KEY
      temperature: 0.2
      weight: 2
      max_parallel_requests: 10

general_settings:
  master_key: sk-litellm-admin
  drop_params: true
  usage_limits:
    sk-litellm-admin:
      max_tokens: 1000
      max_requests: 3
```

For this guide, the model used is "gpt-oss-20b" from OpenAI, served through two different providers: one hosted on a server (10.3.0.42) and the other from the provider Nebius.
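Because both deployments share the model_name "llm", LiteLLM load-balances between them, and the weight values bias traffic roughly 1:2 toward Nebius. The following is an illustrative sketch of weight-proportional selection — not LiteLLM's actual routing code — using the same weights as the config above:

```python
import random
from collections import Counter

# Illustrative sketch of weighted routing, mirroring the "weight" values
# from config.yml above (not LiteLLM's actual implementation).
deployments = [
    {"name": "local-server", "weight": 1},
    {"name": "nebius", "weight": 2},
]

def pick_deployment(rng: random.Random) -> str:
    """Pick a deployment with probability proportional to its weight."""
    names = [d["name"] for d in deployments]
    weights = [d["weight"] for d in deployments]
    return rng.choices(names, weights=weights, k=1)[0]

# Over many requests, roughly two thirds should land on Nebius.
rng = random.Random(42)
counts = Counter(pick_deployment(rng) for _ in range(3000))
```

Over 3000 simulated requests, the Nebius deployment receives roughly twice as many as the local server, matching the configured weights.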
Next, create a Docker Compose file:
docker-compose.yaml
All Docker images are up to date as of the date of this article.
```yaml
services:
  litellm:
    image: ghcr.io/berriai/litellm:v1.80.10.rc.3
    container_name: litellm-proxy
    ports:
      - "4000:4000"
      - "4001:4001"
    volumes:
      - ./config/config.yml:/app/config.yaml
    environment:
      NEBIUS_API_KEY: ${NEBIUS_API_KEY}
      DATABASE_URL: postgresql://litellm:litellm@postgres:5432/litellm
      REDIS_URL: redis://redis:6379
      STORE_MODEL_IN_DB: "True"
      LITELLM_MASTER_KEY: sk-litellm-admin
      # UI_ENABLED: "true"  # --- NEEDED FOR UI BUT NOT IMPORTANT FOR THIS GUIDE
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    command:
      - "--config"
      - "/app/config.yaml"
      - "--port"
      - "4000"
      # - "--port"  # --- NEEDED FOR UI BUT NOT IMPORTANT FOR THIS GUIDE
      # - "4001"    # --- NEEDED FOR UI BUT NOT IMPORTANT FOR THIS GUIDE
    restart: unless-stopped

  postgres:
    image: postgres:15
    environment:
      POSTGRES_DB: litellm
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: litellm
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U litellm"]
      interval: 5s
      timeout: 5s
      retries: 10

  redis:
    image: redis:7
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 10

  openwebui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: openwebui
    ports:
      - "3000:8080"
    environment:
      - OPENAI_API_BASE_URL=http://litellm:4000/v1
      - OPENAI_API_KEY=sk-litellm-admin
    depends_on:
      - litellm
    restart: unless-stopped

volumes:
  postgres_data:
```

This docker-compose.yaml defines LiteLLM, Postgres, Redis, and OpenWebUI for a lightweight, unified local LLM workflow.
LiteLLM handles model routing, request parallelism, and usage limits.
Postgres and Redis provide state persistence and caching.
OpenWebUI gives a web interface to interact with LLMs.
The setup supports both local and remote models, with clear separation between backend (LiteLLM), storage (Postgres/Redis), and frontend (OpenWebUI).
More explanation of the docker-compose:
LiteLLM service
Image & container: ghcr.io/berriai/litellm:v1.80.10.rc.3, container named litellm-proxy.
Ports: 4000 exposes the API (4001 is reserved for the optional UI).
Configuration: Mounted from ./config/config.yml to /app/config.yaml to allow easy editing.
Environment variables: NEBIUS_API_KEY for accessing Nebius models. DATABASE_URL and REDIS_URL point to the Postgres and Redis services for storage, caching, and parallel request handling. STORE_MODEL_IN_DB ensures model metadata is persisted. LITELLM_MASTER_KEY is the master authentication key for API access.
Dependencies: LiteLLM waits for both Postgres and Redis to be healthy before starting (depends_on with health checks).
Command: Uses --config to load the YAML file and runs on port 4000. The restart policy is unless-stopped for reliability.
Postgres service
Image: postgres:15 with a dedicated database litellm.
Credentials: Username and password are both litellm.
Volumes: Persists data in postgres_data to survive container restarts.
Healthcheck: Uses pg_isready to ensure readiness before LiteLLM starts.
Redis service
Image: redis:7 for fast in-memory caching, often used for request queuing and parallelism.
Healthcheck: Simple ping test (interval 5s, timeout 3s, retries 10).
OpenWebUI service
Image: ghcr.io/open-webui/open-webui:main provides a web-based interface for LLM interaction.
Ports: Maps container 8080 to host 3000.
Environment: Configured to use LiteLLM as the backend API, with the master key for authentication.
Dependency: Starts only after LiteLLM is running. Restart policy is unless-stopped.
Volumes
postgres_data ensures database persistence across container restarts.
Next, make sure you have a correct .env file:
.env
Make sure that the API keys used have access to the models that will be called.
```shell
# --- Providers ---
NEBIUS_API_KEY=

# --- OpenAI-compatible Base URL ---
NEBIUS_BASE_URL=https://api.tokenfactory.nebius.com/v1/
```

Under Providers, add API keys from any OpenAI-compatible LLM provider, and add the provider URL under Base URL.
Run the solution in the terminal.
Detached mode (-d) is important: everything runs in the background and the current terminal session stays interactive.

```shell
docker compose up -d
```

Once the services are up, OpenWebUI starts; it can be checked by going to http://localhost:3000/
Here, OpenWebUI loads, and the "models" dropdown is populated with the model name "llm" taken from LiteLLM's config.
LiteLLM endpoints (http://litellm:4000/v1) can also be used in custom applications, just as OpenAI endpoints would be. Instead of a generated API key, the LiteLLM master key defined in docker-compose.yaml is used.
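As a sketch of such a custom application, here is a minimal Python client (standard library only) that talks to the gateway the way an OpenAI client would. The base URL and master key are the ones from this guide; the commented-out call at the bottom only succeeds with the compose stack running:

```python
import json
import urllib.request

LITELLM_BASE = "http://localhost:4000/v1"  # gateway address from this guide
MASTER_KEY = "sk-litellm-admin"            # master key from docker-compose.yaml

def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for the LiteLLM proxy."""
    payload = {
        "model": "llm",  # the model_name declared in config.yml
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{LITELLM_BASE}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {MASTER_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# With the compose stack running, send the request and print the reply:
# with urllib.request.urlopen(build_chat_request("Hello!")) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Swapping in the official openai Python package works the same way: point its base_url at the gateway and pass the master key as the API key.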
To test this endpoint, execute the following cURL:

```shell
curl http://localhost:4000/health \
  -H "Authorization: Bearer sk-litellm-admin"
```

This should return a JSON with status:

```json
{"status":"ok"}
```

To see which models are available, execute the following cURL:

```shell
curl http://localhost:4000/v1/models \
  -H "Authorization: Bearer sk-litellm-admin"
```

A minimal version is available in our GitHub: https://github.com/krafteq/krafteq.dev/tree/main/self-hosted-ai-gateway-intro
After cloning the repository, open it with Visual Studio Code.
If the operating system is Windows, ensure that Visual Studio Code is opened from the Ubuntu (WSL) environment.
Go to the root directory of the project that contains docker-compose.yaml and run it in detached mode:
```shell
docker compose up -d
```

Comparison to other available solutions
Advantages and Disadvantages
Advantages of Using a Gateway
Centralized Management: One API endpoint to rule them all.
Provider Abstraction: Swap providers without changing client code.
Monitoring & Analytics: Track usage, latency, and costs across all models.
Traffic Control: Implement rate-limits, capping, and retries.
Disadvantages / Considerations
Single Point of Failure: If the gateway goes down, all requests fail.
Latency Overhead: Additional network hop may add a few milliseconds per request.
Configuration Complexity: Requires proper mapping of models, APIs, and keys.
Provider Limitations: Some providers may not be fully compatible, requiring careful testing.
Conclusion
A unified LLM-based assistant using LiteLLM Gateway provides a scalable, maintainable, and extensible way to manage multiple models, providers, and usage limits. While gateways add small latency and configuration overhead, the benefits of centralized control, monitoring, and traffic orchestration outweigh the drawbacks in production systems.
References and Further Reading
LiteLLM GitHub – official repository.
Brown, T. et al., Language Models are Few-Shot Learners, 2020. arXiv:2005.14165
Vaswani, A. et al., Attention Is All You Need, 2017. arXiv:1706.03762
OpenAI GPT-4 Technical Report, 2023.
“Scalable Multi-Provider LLM Orchestration,” Internal LiteLLM Documentation.