
Self-Hosted LLM Council Architecture for Evaluating Code-Generating LLM Models

Shraddha Bhojane · January 19, 2026 · 18 min read

1. Requirements

  1. Linux based system or WSL installed

  2. Docker installed

2. Problem Statement

As the number of high-quality large language models (LLMs) continues to grow, choosing which model to trust for a given task has become increasingly difficult. Raw benchmarks are often insufficient, provider-specific APIs complicate comparisons, and subjective quality varies significantly across prompts.

Evaluating LLMs dedicated to code generation is particularly important. Being able to evaluate language-specific models with language-specific tasks and evaluation strategies proves crucial in choosing the right model for any task.

This article presents a council-based evaluation framework: a system that queries multiple LLMs in parallel using a unified HTTP interface, collects their responses, and evaluates them using structured criteria or a dedicated judge model. This approach enables systematic, provider-agnostic model evaluation under real-world conditions.

3. Summary

This article presents a self-hosted LLM Council for evaluating code-generating AI, turning multiple perspectives into reliable, high-quality solutions. Instead of relying on one model, the council queries several LLMs in parallel, has them critique and rank each other, and uses a “Chairman” model to synthesize the final answer. Docker-based and provider-agnostic, it ensures correctness, robustness, and maintainability, showcasing how ensemble intelligence outperforms any single code model in real-world software development.

4. Scope of the Article

This article focuses on:

  1. Implementing an LLM Council architecture inspired by Andrej Karpathy

  2. Configuring a multi-provider council

  3. A quick introduction to prompt engineering

  4. Designing code-evaluation prompts for the council

  5. Testing and validating generated code

  6. Exploring other applications of the LLM Council

Note: This article does not demonstrate training a new LLM; it uses already available open-source models.

5. Introduction to LLM Council Architecture

The LLM Council, proposed by Andrej Karpathy, is an inference-time orchestration pattern that improves LLM outputs by consulting multiple models instead of relying on one. The same prompt is sent to a group of independent LLMs, whose responses are then compared, evaluated, and synthesized into a final answer. By leveraging diversity across models and using LLMs themselves as critics and editors, the council exposes disagreement, reduces blind spots, and produces more robust results. The approach is lightweight, model-agnostic, and particularly effective for tasks where correctness matters, such as code generation and technical reasoning.
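The fan-out step at the heart of this pattern can be sketched in a few lines of asyncio. This is a minimal illustration with stand-in model functions (the names `ask_model` and `fan_out` are our own, not part of the repository):

```python
import asyncio

# Minimal sketch of the council's fan-out step: the same prompt is sent to
# every member in parallel and the answers are collected for later review.
# The "models" here are stand-in functions, not real LLM calls.

async def ask_model(name: str, prompt: str) -> dict:
    # In a real council this would be an HTTP call to an LLM provider.
    await asyncio.sleep(0)  # simulate I/O
    return {"model": name, "answer": f"{name}'s answer to: {prompt}"}

async def fan_out(prompt: str, members: list[str]) -> list[dict]:
    # Create one task per council member and await them all concurrently.
    tasks = [ask_model(m, prompt) for m in members]
    return await asyncio.gather(*tasks)

if __name__ == "__main__":
    results = asyncio.run(fan_out("Reverse a string in Python", ["model-a", "model-b"]))
    for r in results:
        print(r["model"])
```

The comparison, ranking, and synthesis stages then operate on the collected `results` list, which is exactly how the repository's three-stage pipeline is structured.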

6. Why Use a Council Instead of a Single Code-Tuned LLM?

Relying on a single LLM for code generation can limit both reliability and creativity. Even advanced models may overlook edge cases or produce solutions that are incomplete or suboptimal. A council of models—multiple LLMs generating, evaluating, and synthesizing responses—offers a way to harness the collective intelligence of several systems to improve code quality, clarity, and usability.

Key advantages of a council approach:

  • Diverse Perspectives – Each model brings its own reasoning style, heuristics, and training biases. Combining outputs increases the chance of discovering the best solution.

  • Peer Evaluation – Models can review and rank each other's outputs, providing structured feedback that improves the final answer.

  • Goal Alignment – By aggregating multiple responses, the council ensures the synthesized code better meets the user's intent rather than just producing technically "clean" code.

  • Enhanced Creativity – Different models may approach the same problem in unique ways, encouraging innovative or optimized solutions.

  • Synthesis of Best Practices – A dedicated "Chairman" model consolidates insights, producing a final output that merges correctness, clarity, and efficiency.

  • Cross-Language Flexibility – When working with multiple programming languages, council-based evaluation can balance strengths across languages, ensuring robust solutions regardless of syntax or conventions.

In essence, a council leverages ensemble intelligence to generate code that is not only syntactically correct but also functional, maintainable, and aligned with the intended problem-solving goals. By combining multiple expert perspectives, it creates a more robust, reliable, and high-quality coding assistant than any single model alone.

7. Introduction to Prompt Engineering

Prompt engineering is the practice of designing structured inputs that guide large language models (LLMs) toward producing correct, reliable, and useful outputs. In the context of code generation, prompt engineering is especially important because small ambiguities can lead to syntactic errors, incorrect logic, insecure patterns, or code that fails to meet requirements.

Modern code LLMs are highly capable, but they are probabilistic systems. They do not inherently understand intent, constraints, or correctness unless these are clearly communicated. Prompt engineering bridges this gap by turning informal human intent into precise instructions the model can follow.

Effective prompts for coding typically consist of several components:

Context – Sets the environment and assumptions.

  • Example: "You are coding in Python 3.11, using only standard libraries."

Task Specification – Clearly states what to do.

  • Example: "Write a function that checks if a string is a palindrome."

Role Definition – Assigns a persona to guide style and rigor.

  • Example: "You are an experienced software engineer specializing in clean, efficient code."

Constraints & Formatting Rules – Enforces correctness and style.

  • Example: "Return only valid Python code. Follow PEP8. Do not include explanations."

Examples / Tests – Provides concrete input-output pairs to anchor behaviour.

is_palindrome("racecar") -> True
is_palindrome("hello") -> False

Optional Reasoning Guidance – Suggests how the model should approach complex tasks.

  • Example: "Break the problem into helper functions if necessary."
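The components above can be composed into a single structured prompt. The helper below is a hypothetical illustration (the function name and section layout are our own, not from any library):

```python
# Hypothetical helper that assembles the prompt components described above
# (role, context, task, constraints, example tests) into one structured prompt.

def build_code_prompt(role: str, context: str, task: str,
                      constraints: list[str], tests: list[str]) -> str:
    sections = [
        role,
        context,
        f"Task: {task}",
        "Constraints:\n" + "\n".join(f"- {c}" for c in constraints),
        "Examples:\n" + "\n".join(tests),
    ]
    # Blank lines between sections keep the prompt easy for models to parse.
    return "\n\n".join(sections)

prompt = build_code_prompt(
    role="You are an experienced software engineer specializing in clean, efficient code.",
    context="You are coding in Python 3.11, using only standard libraries.",
    task="Write a function that checks if a string is a palindrome.",
    constraints=["Return only valid Python code.", "Follow PEP8.", "Do not include explanations."],
    tests=['is_palindrome("racecar") -> True', 'is_palindrome("hello") -> False'],
)
print(prompt)
```

Keeping the components in separate arguments rather than one hand-written string makes it easy to vary a single component (say, the constraints) while holding the rest fixed when comparing models.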

8. Self-Hosted LLM Council Architecture for Evaluating Code-Generating LLM Models

Code generation is one of the most demanding and practically important uses of LLMs. Unlike generic natural language tasks, generating correct code requires:

  • Syntactic correctness — code must compile/run

  • Semantic correctness — code must logically solve the problem

  • Cross-language understanding — ability to generate multi-language solutions

  • Testability — ability to pass unit tests like HumanEval, MBPP or BigCodeBench

A council designed for code generation evaluation must therefore score models on compilability, correctness, and robustness, not just NLP-style fluency.
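As a minimal sketch of what execution-based scoring can look like, the harness below compiles a candidate solution and runs it against a small test suite. The candidate source and scoring scheme are illustrative assumptions; real harnesses such as those behind HumanEval sandbox execution rather than calling exec directly:

```python
# Illustrative harness for execution-based scoring: a candidate solution
# (as a council member might return it) is compiled and run against tests.
# The candidate below is a stand-in example.

candidate_source = '''
def is_palindrome(s):
    return s == s[::-1]
'''

def score_candidate(source: str, tests: list[tuple]) -> float:
    namespace: dict = {}
    try:
        # Syntactic correctness: does the code even compile?
        exec(compile(source, "<candidate>", "exec"), namespace)
    except SyntaxError:
        return 0.0
    fn = namespace.get("is_palindrome")
    if fn is None:
        return 0.0
    # Semantic correctness: fraction of test cases passed.
    passed = 0
    for args, expected in tests:
        try:
            if fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # runtime errors count as failures
    return passed / len(tests)

score = score_candidate(candidate_source, [(("racecar",), True), (("hello",), False)])
print(score)  # 1.0 for this candidate
```

A per-model score like this can be fed back into the council's rankings alongside the peer reviews, grounding the evaluation in actual execution rather than text comparison alone.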

This article will demonstrate a small example of how the LLM council architecture can be implemented as a first step. The purpose of this article is to give you, the reader, full control of the code and unlock further possibilities.

Follow the steps below to implement this tutorial from scratch; start by opening Visual Studio Code.

Warning: All commands and Docker workflows in this guide assume you are running inside a Linux environment (WSL on Windows). If you are using VS Code, ensure it is opened in WSL mode. For other IDEs, run them against the same WSL/Linux context.

1. Clone the repository

Karpathy's LLM Council

git clone https://github.com/karpathy/llm-council.git

2. Open in Visual Studio Code

Launch VS Code and go to File → Open Folder… and select the cloned llm-council folder.

VS Code will automatically detect the project folder.

3. Create two Dockerfiles, one for the backend and one for the frontend

Dockerfile.backend

FROM python:3.11-slim

# System deps
RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*

# Install uv
RUN pip install --no-cache-dir uv

WORKDIR /app

# Copy only dependency files first (better caching)
COPY pyproject.toml uv.lock ./

# Install dependencies
RUN uv sync --frozen

# Copy backend code
COPY backend ./backend

EXPOSE 8000

CMD ["uv", "run", "python", "-m", "backend.main"]

We use python:3.11-slim to get a lightweight, up-to-date Python 3.11 environment with minimal bloat. The Dockerfile sets up system dependencies, installs the project's Python packages via Astral's uv for reproducible environments, copies the backend code, exposes port 8000, and runs the backend app.

Dockerfile.frontend

FROM node:20-slim

WORKDIR /app

# Install deps first (cache-friendly)
COPY frontend/package.json ./
RUN npm install

# Copy source
COPY frontend .

EXPOSE 5173

CMD ["npm", "run", "dev", "--", "--host"]

We use node:20-slim to get a lightweight, up-to-date Node.js 20 environment. The Dockerfile installs frontend dependencies in a cache-friendly way, copies the source code, exposes port 5173, and runs the Vite development server.

4. Create a docker-compose

docker-compose.yml

Note: All Docker images are up to date as of the date of this article.

version: "3.9"

services:
  backend:
    build:
      context: .
      dockerfile: Dockerfile.backend
    env_file:
      - .env
    ports:
      - "8001:8001"
    volumes:
      - ./backend:/app/backend
      - ./data/conversations:/app/data/conversations
    restart: unless-stopped

  frontend:
    build:
      context: .
      dockerfile: Dockerfile.frontend
    ports:
      - "5173:5173"
    depends_on:
      - backend
    restart: unless-stopped

This docker-compose.yml defines two services: backend and frontend. The backend builds from Dockerfile.backend, maps environment variables, exposes port 8001, mounts local code and data for live updates, and restarts automatically. The frontend builds from Dockerfile.frontend, exposes port 5173, depends on the backend, and also restarts automatically. This setup enables a fully containerized development environment where code changes are reflected immediately, ports are mapped for host access, and services are orchestrated together.

5. Modify config.py

This is where all the configuration for the LLM Council is defined: the council models and the Chairman model.

Warning: Modifying this code may break the LLM Council workflow — proceed with caution!

config.py

import os
from dotenv import load_dotenv

load_dotenv()

# OpenRouter API key
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")
# OpenRouter API endpoint
OPENROUTER_API_URL = "https://openrouter.ai/api/v1/chat/completions"

# Nebius API key
NEBIUS_API_KEY = os.getenv("NEBIUS_API_KEY")
# Nebius API endpoint
NEBIUS_API_URL = "https://api.tokenfactory.nebius.com/v1/chat/completions"

# Local API key
LOCAL_API_KEY = os.getenv("LOCAL_API_KEY")
# Local API endpoint
LOCAL_API_URL = "http://host.docker.internal:11434/v1/chat/completions"

# Council members - list of model definitions (provider + model identifier)
COUNCIL_MODELS = [
    {
        "id": "nebius-Qwen3-Coder-480B-A35B-Instruct",
        "provider": "nebius",
        "model": "Qwen/Qwen3-Coder-480B-A35B-Instruct",
    },
    {
        "id": "local-deepseek-coder:6.7b",
        "provider": "local",
        "model": "deepseek-coder:6.7b",
    },
]

# Chairman model - synthesizes final response
CHAIRMAN_MODEL = {
    "id": "gpt-oss:20b",
    "provider": "nebius",
    "model": "openai/gpt-oss-120b",
}

# Data directory for conversation storage
DATA_DIR = "data/conversations"

We replaced the original fixed list of model strings with flexible dictionaries for council members and the chairman. This allows specifying the provider, model ID, and other metadata, making the system capable of using any provider or local model. The change enables seamless switching between OpenRouter, Nebius, Ollama, or locally hosted LLMs while keeping the council orchestration logic consistent.
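Adding another council member is then just a matter of appending one more dictionary. The entry below is illustrative; the model name assumes a model pulled into a local Ollama instance:

```python
# Illustrative: appending another locally hosted model to the council.
# The model name is an example; any model reachable through the configured
# provider endpoints can be added the same way.

COUNCIL_MODELS = [
    # ... existing members from config.py ...
]

COUNCIL_MODELS.append({
    "id": "local-qwen2.5-coder:7b",   # unique label for this council seat
    "provider": "local",              # routes the request to LOCAL_API_URL
    "model": "qwen2.5-coder:7b",      # model name as the provider expects it
})
```

As long as each entry carries `id`, `provider`, and `model`, the routing logic in the next step can dispatch it without further changes.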

6. Modify and rename openrouter.py

Warning: Modifying this code may break the LLM Council workflow — proceed with caution!

openrouter.py to router.py

import httpx
from typing import List, Dict, Any, Optional
import asyncio

from .config import NEBIUS_API_KEY, NEBIUS_API_URL, LOCAL_API_KEY, LOCAL_API_URL


async def query_model(
    model: Dict[str, Any],
    messages: List[Dict[str, str]],
    timeout: float = 120.0
) -> Optional[Dict[str, Any]]:
    if model["provider"] == "local":
        headers = {
            "Authorization": f"Bearer {LOCAL_API_KEY}",
            "Content-Type": "application/json",
        }
        payload = {
            "model": model["model"],
            "messages": messages,
        }
        try:
            async with httpx.AsyncClient(timeout=timeout) as client:
                response = await client.post(
                    LOCAL_API_URL,
                    headers=headers,
                    json=payload
                )
                response.raise_for_status()
                data = response.json()
                message = data['choices'][0]['message']
                return {
                    'content': message.get('content'),
                    'reasoning_details': message.get('reasoning_details')
                }
        except Exception as e:
            model_name = model["model"]
            print(f"Error querying model {model_name}: {e}")
            return None

    if model["provider"] == "nebius":
        headers = {
            "Authorization": f"Bearer {NEBIUS_API_KEY}",
            "Content-Type": "application/json",
        }
        payload = {
            "model": model["model"],
            "messages": messages,
        }
        try:
            async with httpx.AsyncClient(timeout=timeout) as client:
                response = await client.post(
                    NEBIUS_API_URL,
                    headers=headers,
                    json=payload
                )
                response.raise_for_status()
                data = response.json()
                message = data['choices'][0]['message']
                return {
                    'content': message.get('content'),
                    'reasoning_details': message.get('reasoning_details')
                }
        except Exception as e:
            model_name = model["model"]
            print(f"Error querying model {model_name}: {e}")
            return None


async def query_models_parallel(
    models: List[Dict[str, Any]],
    messages: List[Dict[str, str]]
) -> Dict[str, Optional[Dict[str, Any]]]:
    # Create tasks for all models
    tasks = [query_model(model, messages) for model in models]
    # Wait for all to complete
    responses = await asyncio.gather(*tasks)
    # Map models to their responses
    return {model["model"]: response for model, response in zip(models, responses)}

We added provider-based branching so the query logic dynamically selects the correct API endpoint and authentication for each model. This allows the same function to handle local models, Nebius, or any future provider, making the system fully flexible while keeping a single unified interface for sending messages and retrieving responses.

For a new provider, add a new if block as follows:

if model["provider"] == "Provider_Name":
    headers = {
        "Authorization": f"Bearer {PROVIDER_API_KEY}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": model["model"],
        "messages": messages,
    }
    try:
        async with httpx.AsyncClient(timeout=timeout) as client:
            response = await client.post(
                PROVIDER_API_URL,  # the new provider's endpoint, not LOCAL_API_URL
                headers=headers,
                json=payload
            )
            response.raise_for_status()
            data = response.json()
            message = data['choices'][0]['message']
            return {
                'content': message.get('content'),
                'reasoning_details': message.get('reasoning_details')
            }
    except Exception as e:
        print(f"Error querying model {model['model']}: {e}")
        return None

7. Modify council.py

Warning: Modifying this code may break the LLM Council workflow — proceed with caution!

council.py

from typing import List, Dict, Any, Tuple
from .router import query_models_parallel, query_model
import re
from .config import COUNCIL_MODELS, CHAIRMAN_MODEL


async def stage1_collect_responses(user_query: str) -> List[Dict[str, Any]]:
    messages = [{"role": "user", "content": user_query}]
    # Query all models in parallel
    responses = await query_models_parallel(COUNCIL_MODELS, messages)
    # Format results
    stage1_results = []
    for model, response in responses.items():
        if response is not None:  # Only include successful responses
            stage1_results.append({
                "model": model,
                "response": response.get('content', '')
            })
    return stage1_results


async def stage2_collect_rankings(
    user_query: str,
    stage1_results: List[Dict[str, Any]]
) -> Tuple[List[Dict[str, Any]], Dict[str, str]]:
    # Create anonymized labels for responses (Response A, Response B, etc.)
    labels = [chr(65 + i) for i in range(len(stage1_results))]  # A, B, C, ...
    # Create mapping from label to model name
    label_to_model = {
        f"Response {label}": result['model']
        for label, result in zip(labels, stage1_results)
    }
    # Build the ranking prompt
    responses_text = "\n\n".join([
        f"Response {label}:\n{result['response']}"
        for label, result in zip(labels, stage1_results)
    ])
    ranking_prompt = f"""You are an expert software engineer.
You write correct, efficient, and idiomatic code.
Prefer clarity over cleverness.
Do not include explanations unless asked.

You are evaluating different responses to the following question:

Question: {user_query}

Here are the responses from different models (anonymized):

{responses_text}

Your task:
1. First, evaluate each response individually. For each response, explain what it does well and what it does poorly.
2. Then, at the very end of your response, provide a final ranking.

IMPORTANT: Your final ranking MUST be formatted EXACTLY as follows:
- Start with the line "FINAL RANKING:" (all caps, with colon)
- Then list the responses from best to worst as a numbered list
- Each line should be: number, period, space, then ONLY the response label (e.g., "1. Response A")
- Do not add any other text or explanations in the ranking section

Example of the correct format for your ENTIRE response:

Response A provides good detail on X but misses Y...
Response B is accurate but lacks depth on Z...
Response C offers the most comprehensive answer...

FINAL RANKING:
1. Response C
2. Response A
3. Response B

Now provide your evaluation and ranking:"""
    messages = [{"role": "user", "content": ranking_prompt}]
    # Get rankings from all council models in parallel
    responses = await query_models_parallel(COUNCIL_MODELS, messages)
    # Format results
    stage2_results = []
    for model, response in responses.items():
        if response is not None:
            full_text = response.get('content', '')
            parsed = parse_ranking_from_text(full_text)
            stage2_results.append({
                "model": model,
                "ranking": full_text,
                "parsed_ranking": parsed
            })
    return stage2_results, label_to_model


async def stage3_synthesize_final(
    user_query: str,
    stage1_results: List[Dict[str, Any]],
    stage2_results: List[Dict[str, Any]]
) -> Dict[str, Any]:
    # Build comprehensive context for chairman
    stage1_text = "\n\n".join([
        f"Model: {result['model']}\nResponse: {result['response']}"
        for result in stage1_results
    ])
    stage2_text = "\n\n".join([
        f"Model: {result['model']}\nRanking: {result['ranking']}"
        for result in stage2_results
    ])
    chairman_prompt = f"""You are the Chairman of an LLM Council of Expert Software Engineers. Multiple AI models have provided responses to a user's question, and then ranked each other's responses.

Original Question: {user_query}

STAGE 1 - Individual Responses:
{stage1_text}

STAGE 2 - Peer Rankings:
{stage2_text}

Your task as Chairman is to synthesize all of this information into a single, comprehensive, accurate answer to the user's original question. Consider:
- The individual responses and their insights
- The peer rankings and what they reveal about response quality
- Any patterns of agreement or disagreement

Provide a clear, well-reasoned final answer that represents the council's collective wisdom:"""
    messages = [{"role": "user", "content": chairman_prompt}]
    # Query the chairman model
    response = await query_model(CHAIRMAN_MODEL, messages)
    if response is None:
        # Fallback if chairman fails
        return {
            "model": CHAIRMAN_MODEL["model"],
            "response": "Error: Unable to generate final synthesis."
        }
    return {
        "model": CHAIRMAN_MODEL["model"],
        "response": response.get('content', '')
    }


def parse_ranking_from_text(ranking_text: str) -> List[str]:
    # Look for "FINAL RANKING:" section
    if "FINAL RANKING:" in ranking_text:
        # Extract everything after "FINAL RANKING:"
        parts = ranking_text.split("FINAL RANKING:")
        if len(parts) >= 2:
            ranking_section = parts[1]
            # Try to extract numbered list format (e.g., "1. Response A")
            # This pattern looks for: number, period, optional space, "Response X"
            numbered_matches = re.findall(r'\d+\.\s*Response [A-Z]', ranking_section)
            if numbered_matches:
                # Extract just the "Response X" part
                return [re.search(r'Response [A-Z]', m).group() for m in numbered_matches]
            # Fallback: Extract all "Response X" patterns in order
            matches = re.findall(r'Response [A-Z]', ranking_section)
            return matches
    # Fallback: try to find any "Response X" patterns in order
    matches = re.findall(r'Response [A-Z]', ranking_text)
    return matches


def calculate_aggregate_rankings(
    stage2_results: List[Dict[str, Any]],
    label_to_model: Dict[str, str]
) -> List[Dict[str, Any]]:
    from collections import defaultdict
    # Track positions for each model
    model_positions = defaultdict(list)
    for ranking in stage2_results:
        ranking_text = ranking['ranking']
        # Parse the ranking from the structured format
        parsed_ranking = parse_ranking_from_text(ranking_text)
        for position, label in enumerate(parsed_ranking, start=1):
            if label in label_to_model:
                model_name = label_to_model[label]
                model_positions[model_name].append(position)
    # Calculate average position for each model
    aggregate = []
    for model, positions in model_positions.items():
        if positions:
            avg_rank = sum(positions) / len(positions)
            aggregate.append({
                "model": model,
                "average_rank": round(avg_rank, 2),
                "rankings_count": len(positions)
            })
    # Sort by average rank (lower is better)
    aggregate.sort(key=lambda x: x['average_rank'])
    return aggregate


async def generate_conversation_title(user_query: str) -> str:
    title_prompt = f"""Generate a very short title (3-5 words maximum) that summarizes the following question.
The title should be concise and descriptive. Do not use quotes or punctuation in the title.

Question: {user_query}

Title:"""
    messages = [{"role": "user", "content": title_prompt}]
    # Use the chairman model for title generation
    response = await query_model(CHAIRMAN_MODEL, messages, timeout=30.0)
    if response is None:
        # Fallback to a generic title
        return "New Conversation"
    title = response.get('content', 'New Conversation').strip()
    # Clean up the title - remove quotes, limit length
    title = title.strip('"\'')
    # Truncate if too long
    if len(title) > 50:
        title = title[:47] + "..."
    return title


async def run_full_council(user_query: str) -> Tuple[List, List, Dict, Dict]:
    # Stage 1: Collect individual responses
    stage1_results = await stage1_collect_responses(user_query)
    # If no models responded successfully, return error
    if not stage1_results:
        return [], [], {
            "model": "error",
            "response": "All models failed to respond. Please try again."
        }, {}
    # Stage 2: Collect rankings
    stage2_results, label_to_model = await stage2_collect_rankings(user_query, stage1_results)
    # Calculate aggregate rankings
    aggregate_rankings = calculate_aggregate_rankings(stage2_results, label_to_model)
    # Stage 3: Synthesize final answer
    stage3_result = await stage3_synthesize_final(
        user_query,
        stage1_results,
        stage2_results
    )
    # Prepare metadata
    metadata = {
        "label_to_model": label_to_model,
        "aggregate_rankings": aggregate_rankings
    }
    return stage1_results, stage2_results, stage3_result, metadata

Karpathy's three-stage LLM Council was adapted for code generation by customizing the prompts at each stage. In Stage 1, models receive a clear, structured user query. Stage 2 uses anonymized peer-review prompts that explicitly instruct models to evaluate each response and produce a strictly formatted ranking, ensuring consistent outputs. Stage 3 uses a Chairman prompt that synthesizes the individual responses and peer rankings into a final, comprehensive answer. By structuring the prompts carefully rather than leaving them as hard-coded text, the system becomes more robust, reproducible, and adaptable to any coding task or model type.
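To see why the strict "FINAL RANKING:" format matters, here is a condensed, self-contained version of the ranking parser together with a sample Stage 2 response. It mirrors the logic of parse_ranking_from_text above but omits the numbered-list fallback:

```python
import re

# Condensed ranking parser: splitting on "FINAL RANKING:" means the free-form
# evaluation text (which also mentions labels) is ignored, and only the
# machine-readable ranking section is parsed.

def parse_ranking(text: str) -> list[str]:
    section = text.split("FINAL RANKING:")[-1] if "FINAL RANKING:" in text else text
    return re.findall(r'Response [A-Z]', section)

sample = """Response A is concise but misses an edge case...
Response B handles empty input correctly...

FINAL RANKING:
1. Response B
2. Response A"""

print(parse_ranking(sample))  # ['Response B', 'Response A']
```

Without the delimiter, the labels mentioned in the prose evaluation would pollute the parsed order, which is exactly why Stage 2 insists on the exact format.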

8. Configure the .env file

.env

Warning: Make sure that the API Keys used have access to the models that will be called

# === Providers ===
NEBIUS_API_KEY=
LOCAL_API_KEY=dummy

# Nebius OpenAI-compatible base URL
NEBIUS_BASE_URL=https://api.tokenfactory.nebius.com/v1/chat/completions
LOCAL_BASE_URL=http://host.docker.internal:11434/v1/chat/completions

Under Providers, add API keys for any LLM providers you use, and add each provider's endpoint under the corresponding base URL variable.

9. Build the solution in terminal

Note: No Cache mode (--no-cache) forces Docker to rebuild every layer from scratch, ignoring any previously cached layers

docker compose build --no-cache

10. Run the solution in terminal

Note: Detached mode (-d) is important to run everything in the background and current terminal session stays interactive

docker compose up -d

Once the services are up, the frontend is available at http://localhost:5173/


A minimal version is available in our repository: GitHub - krafteq/llm-council (LLM Council works together to answer your hardest questions)

After cloning the repository, open it with Visual Studio Code.

If the operating system is Windows, ensure that Visual Studio Code is opened from within the Ubuntu (WSL) environment.

Go to the root directory of the project that contains docker-compose.yml, then build and run it in detached mode.

To build:

docker compose build --no-cache

To run:

docker compose up -d

9. Challenges for the Future

Evaluating code isn't just about comparing text tokens. Key requirements include:

  • Execution correctness — verify via test suites

  • Edge case handling — does the model solve corner cases?

  • Readability & maintainability — code quality matters

  • Performance — complexity and optimization

Many benchmarks like HumanEval, MBPP, and BigCodeBench have emerged to measure these properties in standardized ways. Recent academic work also highlights that benchmarks aligned to real code repositories produce more meaningful evaluations than toy problems used in early research.

Creating an LLM Council for code generation involves several challenges:

  • Correctness vs. Style: Different models may produce valid but stylistically different solutions.

  • Language-Specific Issues: Models may handle Python, Java, or C++ differently, introducing language-specific bugs or quirks.

  • Goal Misalignment: A model may generate clean, efficient code that misses the intended functionality or objective.

  • Evaluation Complexity: Peer-review prompts may miss subtle bugs; testing and linting are often required.

  • Efficiency: Measuring runtime or memory efficiency is difficult across different environments.

  • Vulnerability & Security: Models may not reliably detect security flaws or support robust risk management.

  • Resource Constraints: Running multiple large models in parallel consumes significant compute and memory.

  • Synthesis Challenges: Combining diverse outputs risks incompatibility or hallucinated code.

  • Prompt Standardization & Reproducibility: Small changes can drastically alter outputs and rankings.

10. Conclusion

Using a council of code-generating LLMs turns multiple perspectives into better, safer, and more reliable code. Peer evaluation plus a Chairman synthesis ensures outputs are correct, clear, and goal-aligned. Ensemble reasoning beats any single model—making collaborative AI coding a practical reality.

11. References and Further Reading

  1. LLM Council, GitHub - karpathy/llm-council: LLM Council works together to answer your hardest questions

  2. The Prompt Report: A Systematic Survey of Prompt Engineering Techniques, arXiv:2406.06608

  3. EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories, arXiv:2404.00599

  4. LLMSurvey, GitHub - RUCAIBox/LLMSurvey: The official GitHub page for the survey paper "A Survey of Large Language Models".