Self-Hosted AI Assistant with Multiple LLM Providers on the Backend
Requirements
Problem Statement
The rise of large language models (LLMs) has enabled numerous applications ranging from automated customer support to content generation. However, leveraging multiple LLM providers and managing traffic, cost, and model selection has become increasingly complex. Organizations often need a unified interface to route queries intelligently, enforce request limits, and aggregate responses from various LLMs. Without such a system, developers must hardcode integrations, handle retries manually, and monitor multiple endpoints, making scaling challenging.
Summary
This article explores a practical approach to building a unified LLM chatbot helper using LiteLLM Gateway, a lightweight middleware designed to abstract interactions with multiple LLM providers. We'll examine traffic routing, request capping, and model orchestration, highlighting how LiteLLM can centralize operations. We'll also discuss advantages, limitations, and alternatives, along with a practical checklist to ensure smooth deployment.
Scope of the Article
This article focuses on:
- Unifying LLM access behind a single API.
- Implementing traffic routing and load balancing across multiple LLMs.
- Enforcing request limits and cost controls.
- Demonstrating a small-scale chatbot helper agent as a use case.
- Comparing gateway logic with direct API integrations.
Note: It does not cover advanced LLM fine-tuning or custom model training, but it lays the foundation for scalable multi-provider LLM deployment.
Introduction to LiteLLM
LiteLLM is a lightweight API gateway designed to manage multiple LLM providers. It allows you to:
- Serve multiple models behind a unified endpoint.
- Enforce request limits and routing rules.
- Track usage and optionally persist responses in a database.
LiteLLM is provider-agnostic and supports popular models like OpenAI GPT, LLaMA variants, and others through standardized interfaces.
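As a quick illustration of that standardized interface, the sketch below calls two different backends through the same `litellm.completion()` call from the `litellm` Python package. This is only a minimal sketch: the model names, the self-hosted `api_base` (the same server used later in this guide), and the keys are assumptions for demonstration; substitute providers you actually have credentials for.

```python
# Sketch: LiteLLM's provider-agnostic interface via the litellm Python package.
# Model names, endpoints, and keys are illustrative -- use providers you have access to.
from litellm import completion

messages = [{"role": "user", "content": "Explain what an API gateway does in one sentence."}]

# Hosted provider: resolved via the OPENAI_API_KEY environment variable.
hosted = completion(model="openai/gpt-4o-mini", messages=messages)

# Self-hosted OpenAI-compatible server (same server assumed later in this guide).
local = completion(
    model="openai/gpt-oss:20b",
    api_base="http://10.3.0.42:11434/v1",
    api_key="EMPTY",
    messages=messages,
)

# Both responses follow the same OpenAI-style schema.
print(hosted.choices[0].message.content)
print(local.choices[0].message.content)
```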
Self-Hosted AI Assistant using LiteLLM
For this guide, start by opening Visual Studio Code.
Warning: All commands and Docker workflows in this guide assume you are running inside a Linux environment (WSL on Windows). If you are using VS Code, ensure it is opened in WSL mode. For other IDEs, run them against the same WSL/Linux context.
1. Create a config file for LiteLLM as follows:
`config.yml`
Important: Due to an existing bug, Nebius models need to be prefixed with an extra `openai/`.
```yaml
model_list:
  # --- Server ---
  - model_name: llm
    litellm_params:
      model: openai/gpt-oss:20b
      api_base: http://10.3.0.42:11434/v1
      api_key: EMPTY
      drop_params: true
      weight: 1
      max_parallel_requests: 10
  # --- Nebius ---
  - model_name: llm
    litellm_params:
      model: openai/openai/gpt-oss-20b
      api_base: https://api.tokenfactory.nebius.com/v1
      api_key: os.environ/NEBIUS_API_KEY
      temperature: 0.2
      weight: 2
      max_parallel_requests: 10

general_settings:
  master_key: sk-litellm-admin
  drop_params: true
  usage_limits:
    sk-litellm-admin:
      max_tokens: 1000
      max_requests: 3
```
For this guide, the model used is gpt-oss-20b from OpenAI, and it is served by two different providers: one self-hosted on a server (10.3.0.42) and the other provided by Nebius.
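LiteLLM also exposes this routing logic as a Python library. The sketch below mirrors the two deployments above with the `Router` class; it is only an illustration of how the weighted `llm` alias works, not part of the gateway deployment. The endpoints, keys, and weights are the ones assumed in this guide, and the exact spread of requests depends on the router's routing strategy (simple shuffle by default).

```python
# Sketch: the two-deployment routing from config.yml expressed with LiteLLM's Python Router.
# Endpoints and keys mirror the assumptions made elsewhere in this guide.
import os
from litellm import Router

model_list = [
    {   # self-hosted OpenAI-compatible server
        "model_name": "llm",
        "litellm_params": {
            "model": "openai/gpt-oss:20b",
            "api_base": "http://10.3.0.42:11434/v1",
            "api_key": "EMPTY",
            "weight": 1,
        },
    },
    {   # Nebius (note the doubled openai/ prefix workaround)
        "model_name": "llm",
        "litellm_params": {
            "model": "openai/openai/gpt-oss-20b",
            "api_base": "https://api.tokenfactory.nebius.com/v1",
            "api_key": os.environ["NEBIUS_API_KEY"],
            "weight": 2,
        },
    },
]

router = Router(model_list=model_list)

# Requests to the "llm" alias are spread across both deployments,
# with the Nebius deployment favoured roughly 2:1 by weight.
response = router.completion(
    model="llm",
    messages=[{"role": "user", "content": "Hello from the router"}],
)
print(response.choices[0].message.content)
```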
2. Next, create a Docker Compose file:
`docker-compose.yaml`
All Docker images are up to date as of the time of writing.
```yaml
services:
  litellm:
    image: ghcr.io/berriai/litellm:v1.80.10.rc.3
    container_name: litellm-proxy
    ports:
      - "4000:4000"
      - "4001:4001"
    volumes:
      - ./config/config.yml:/app/config.yaml
    environment:
      NEBIUS_API_KEY: ${NEBIUS_API_KEY}
      DATABASE_URL: postgresql://litellm:litellm@postgres:5432/litellm
      REDIS_URL: redis://redis:6379
      STORE_MODEL_IN_DB: "True"
      LITELLM_MASTER_KEY: sk-litellm-admin
      # UI_ENABLED: "true"  # --- NEEDED FOR UI BUT NOT IMPORTANT FOR THIS GUIDE
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    command:
      - "--config"
      - "/app/config.yaml"
      - "--port"
      - "4000"
      # - "--port"  # --- NEEDED FOR UI BUT NOT IMPORTANT FOR THIS GUIDE
      # - "4001"    # --- NEEDED FOR UI BUT NOT IMPORTANT FOR THIS GUIDE
    restart: unless-stopped

  postgres:
    image: postgres:15
    environment:
      POSTGRES_DB: litellm
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: litellm
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U litellm"]
      interval: 5s
      timeout: 5s
      retries: 10

  redis:
    image: redis:7
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 10

  openwebui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: openwebui
    ports:
      - "3000:8080"
    environment:
      - OPENAI_API_BASE_URL=http://litellm:4000/v1
      - OPENAI_API_KEY=sk-litellm-admin
    depends_on:
      - litellm
    restart: unless-stopped

volumes:
  postgres_data:
```
This docker-compose.yaml defines LiteLLM, Postgres, Redis, and OpenWebUI for a lightweight, unified local LLM workflow.
- LiteLLM handles model routing, request parallelism, and usage limits.
- Postgres and Redis provide state persistence and caching.
- OpenWebUI gives a web interface to interact with LLMs.
- The setup supports both local and remote models, with clear separation between backend (LiteLLM), storage (Postgres/Redis), and frontend (OpenWebUI).
A more detailed explanation of the docker-compose services:
LiteLLM service
- Image & container: `ghcr.io/berriai/litellm:v1.80.10.rc.3`, container named `litellm-proxy`.
- Ports: 4000 exposes the API; 4001 is also mapped for the optional UI (not used in this guide).
- Configuration: Mounted from `./config/config.yml` to `/app/config.yaml` to allow easy editing.
- Environment variables: `NEBIUS_API_KEY` for accessing Nebius models. `DATABASE_URL` and `REDIS_URL` point to the Postgres and Redis services for storage, caching, and parallel request handling. `STORE_MODEL_IN_DB` ensures model metadata is persisted. `LITELLM_MASTER_KEY` is the master authentication key for API access.
- Dependencies: LiteLLM waits for both Postgres and Redis to be healthy before starting (`depends_on` with health checks).
- Command: Uses `--config` to load the YAML file and runs on port 4000. Restart policy is `unless-stopped` for reliability.
Postgres service
- Image: `postgres:15` with a dedicated database `litellm`.
- Credentials: Username and password are both `litellm`.
- Volumes: Persists data in `postgres_data` to survive container restarts.
- Healthcheck: Uses `pg_isready` to ensure readiness before LiteLLM starts.
Redis service
- Image: `redis:7` for fast in-memory caching, often used for request queuing and parallelism.
- Healthcheck: Simple `redis-cli ping` test, interval 5s, timeout 3s, retries 10.
OpenWebUI service
- Image: `ghcr.io/open-webui/open-webui:main` provides a web-based interface for LLM interaction.
- Ports: Maps container port 8080 to host port 3000.
- Environment: Configured to use LiteLLM as the backend API with the master key for authentication.
- Dependency: Starts only after LiteLLM is running. Restart policy is `unless-stopped`.
Volumes
- `postgres_data` ensures database persistence across container restarts.
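Because `LITELLM_MASTER_KEY` and the Postgres database are wired together in this stack, the proxy can also mint scoped "virtual" keys once it is running (step 4 below), so client applications do not need the master key. The sketch below assumes LiteLLM's `/key/generate` endpoint (available when the proxy has a database configured); the request fields shown are illustrative, so check the LiteLLM documentation for the options supported by your version.

```python
# Sketch: mint a scoped virtual key from the running proxy instead of sharing the master key.
# Assumes LiteLLM's /key/generate endpoint; request fields are illustrative.
import requests

MASTER_KEY = "sk-litellm-admin"           # from docker-compose.yaml
GATEWAY = "http://localhost:4000"

resp = requests.post(
    f"{GATEWAY}/key/generate",
    headers={"Authorization": f"Bearer {MASTER_KEY}"},
    json={
        "models": ["llm"],                # restrict the key to the "llm" alias
        "duration": "30d",                # assumed option: key lifetime
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json().get("key"))             # hand this key to the client application
```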
3. Next, make sure you have a correct .env file:
.env
Make sure that the API Keys used have access to the models that will be called
```bash
# --- Providers ---
NEBIUS_API_KEY=
# --- OpenAI-compatible Base URL ---
NEBIUS_BASE_URL=https://api.tokenfactory.nebius.com/v1/
```
Under Providers, add API keys from any OpenAI-compatible LLM provider, and add the provider URL under Base URL.
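Before starting the stack, it can help to verify the `.env` file is complete. A minimal pre-flight sketch, assuming the `python-dotenv` package is installed; the required-variable list simply mirrors the two entries above.

```python
# Pre-flight check: ensure the .env file defines the values the stack expects.
# Assumes the python-dotenv package (pip install python-dotenv).
from dotenv import dotenv_values

REQUIRED = ["NEBIUS_API_KEY", "NEBIUS_BASE_URL"]

env = dotenv_values(".env")               # parses .env without touching os.environ
missing = [key for key in REQUIRED if not env.get(key)]

if missing:
    raise SystemExit(f"Missing or empty values in .env: {', '.join(missing)}")
print("All required .env values are present.")
```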
4. Run the solution in the terminal.
Detached mode (-d) runs everything in the background so the current terminal session stays interactive.
```bash
docker compose up -d
```
Once the services are up, OpenWebUI starts and can be checked by going to http://localhost:3000/.
Here, OpenWebUI loads, and the "Models" dropdown is populated with the model name "llm" taken from LiteLLM's config.
The LiteLLM endpoint (`http://litellm:4000/v1` inside the Compose network, or `http://localhost:4000/v1` from the host) can also be used in custom applications just as an OpenAI endpoint would be. Instead of a generated API key, the LiteLLM master key defined in docker-compose.yaml is used.
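For example, a minimal Python client only needs the base URL and key swapped. This sketch assumes the official `openai` package (1.x) and uses the `llm` alias from `config.yml`.

```python
# Sketch: calling the LiteLLM gateway from a custom application with the OpenAI SDK (openai>=1.0 assumed).
# Only the base_url and api_key differ from a regular OpenAI integration.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",   # LiteLLM gateway from docker-compose.yaml (host perspective)
    api_key="sk-litellm-admin",            # master key; prefer a generated virtual key in production
)

response = client.chat.completions.create(
    model="llm",                           # alias defined in config.yml
    messages=[{"role": "user", "content": "Write one sentence about API gateways."}],
)
print(response.choices[0].message.content)
```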
To test the gateway's health endpoint, execute the following cURL:
```bash
curl http://localhost:4000/health \
  -H "Authorization: Bearer sk-litellm-admin"
```
This should return a JSON with the status:
```json
{"status":"ok"}
```
To see which models are available, execute the following cURL:
```bash
curl http://localhost:4000/v1/models \
  -H "Authorization: Bearer sk-litellm-admin"
```
A minimal version of this setup is available in our GitHub repository: https://github.com/krafteq/krafteq.dev/tree/main/self-hosted-ai-gateway-intro
After cloning the repository, open it with Visual Studio Code.
If the operating system is Windows, ensure that Visual Studio Code is opened from the Ubuntu (WSL) environment.
Go to the root directory of the project that contains docker-compose.yaml and run it in detached mode:
```bash
docker compose up -d
```
Comparison to other available solutions
| Gateway | Open Source | Multi-Provider | Traffic Routing | Usage Limits | Request Capping | Local Hosting | Parallelism | DB Logging | UI Support | Cost | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LiteLLM | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Low / Self-hosted | Lightweight Python gateway, flexible for multiple LLMs |
| Ollama | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ✅ | Free / Open-source | Open-source, local-first model runner |
| Nebius | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | Medium / Cloud | Cloud-first provider, multiple models |
| OpenRouter | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | Low / Cloud | Hosted cloud gateway for multiple providers |
| Helicone | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | Medium-High / Cloud | Cloud-first, usage analytics and cost tracking |
| Bifrost | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Medium / Self-hosted | Flexible Python gateway, designed for multi-provider orchestration |
| Portkey | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Medium / Self-hosted | Focused on orchestration, parallelism, and cost monitoring |
| Tensor0 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Medium / Self-hosted | Python-first gateway with parallelism, usage capping, and multi-provider support |
Advantages and Disadvantages
Advantages of Using a Gateway
- Centralized Management: One API endpoint to rule them all.
- Provider Abstraction: Swap providers without changing client code.
- Monitoring & Analytics: Track usage, latency, and costs across all models.
- Traffic Control: Implement rate limits, capping, and retries (see the sketch after this list).
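To make the traffic-control point concrete: with the `usage_limits` from `config.yml` (`max_requests: 3`), a client that exceeds the cap should expect HTTP 429 responses and retry with backoff. Below is a minimal client-side sketch; the exact status code and headers returned depend on the LiteLLM version and limit configuration.

```python
# Sketch: client-side retry with exponential backoff when the gateway enforces request caps.
# Assumes the gateway returns HTTP 429 on rate limiting; adjust to your LiteLLM version.
import time

import requests

URL = "http://localhost:4000/v1/chat/completions"
HEADERS = {"Authorization": "Bearer sk-litellm-admin"}
PAYLOAD = {"model": "llm", "messages": [{"role": "user", "content": "ping"}]}

def chat_with_retry(max_retries: int = 5) -> dict:
    delay = 1.0
    for _ in range(max_retries):
        resp = requests.post(URL, headers=HEADERS, json=PAYLOAD, timeout=30)
        if resp.status_code != 429:        # not rate limited: raise on other errors, else return
            resp.raise_for_status()
            return resp.json()
        time.sleep(delay)                   # capped: back off and try again
        delay *= 2
    raise RuntimeError("Still rate limited after retries")

print(chat_with_retry()["choices"][0]["message"]["content"])
```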
Disadvantages / Considerations
- Single Point of Failure: If the gateway goes down, all requests fail.
- Latency Overhead: Additional network hop may add a few milliseconds per request.
- Configuration Complexity: Requires proper mapping of models, APIs, and keys.
- Provider Limitations: Some providers may not be fully compatible, requiring careful testing.
Conclusion
A unified LLM-based assistant using LiteLLM Gateway provides a scalable, maintainable, and extensible way to manage multiple models, providers, and usage limits. While a gateway adds a small latency and configuration overhead, the benefits of centralized control, monitoring, and traffic orchestration outweigh the drawbacks in production systems.

