Category: Core | Version: 4.1 | Last Updated: 2026-03-11
A provider-agnostic LLM abstraction layer that switches between a locally-running Ollama instance and the Google Gemini cloud API. The switch is controlled by a single env var (GEMINI_ENABLED) and takes effect without restarting the server.
LlmFactory.create_llm(agent_type)
│
├── GEMINI_ENABLED=true → GeminiLlm → Google Gemini API (cloud)
└── GEMINI_ENABLED=false → OllamaLlm → http://localhost:11434 (local)
Critical: Set in backend/.env. Do NOT wrap in quotes.
Correct: GEMINI_ENABLED=true
Incorrect: GEMINI_ENABLED='true' ← Python reads the quotes as part of the value
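To illustrate why the quoted form fails, here is a minimal sketch of a flag reader, assuming the flag is read from the process environment via os.environ. The helper name read_bool_flag and the set of accepted truthy spellings are hypothetical and not taken from the codebase. The sketch tolerates stray quotes, which the plain string comparison at the end does not.

```python
import os

# Assumption: these spellings count as "enabled"; the real code may accept only "true".
_TRUE_VALUES = {"true", "1", "yes", "on"}


def read_bool_flag(name: str, default: bool = False) -> bool:
    """Read a boolean env var, tolerating surrounding quotes and whitespace."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    # Strip whitespace first, then any single or double quotes left in the value.
    cleaned = raw.strip().strip("'\"").strip().lower()
    return cleaned in _TRUE_VALUES


# A naive comparison treats the quoted value as disabled:
os.environ["GEMINI_ENABLED"] = "'true'"
naive_result = os.environ["GEMINI_ENABLED"] == "true"
tolerant_result = read_bool_flag("GEMINI_ENABLED")
```

Keeping the value unquoted in backend/.env remains the simpler fix; a tolerant reader like this is only a safety net.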
OllamaLlm (backend/llms/ollama_llm.py)
Extends google.adk.models.BaseLlm. Sends prompts to Ollama’s /api/generate REST endpoint over HTTP using a shared async client.
| Optimisation | Before | After | Latency Saved |
|---|---|---|---|
| HTTP client | New httpx.AsyncClient per request | Singleton _shared_client with connection pooling | −0.5s/call |
| Context window | num_ctx=4096 | num_ctx=2048 | −2–5s |
| Max output tokens | num_predict=2000 | num_predict=400 | −2–8s |
| Model auto-pull | Runs on every request (blocks 3–10s) | Removed; startup warm-up only | −3–10s |
| Conversation history | Unlimited turns | Truncated to last 10 turns | −1–3s |
| Retry delays | 1s → 2s → 4s | 0.5s → 1s | −3–7s on failure |
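The shortened retry schedule in the last row (0.5s, then 1s) can be pictured as a fixed list of delays walked between attempts. The sketch below is an illustration under assumptions: retry_with_delays is a hypothetical name, every exception is treated as retryable, and the injectable sleep parameter exists only to make the behaviour easy to test. The real adapter may filter errors differently.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


async def retry_with_delays(
    func: Callable[[], Awaitable[T]],
    delays: Sequence[float] = (0.5, 1.0),
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Try func once, then once more after each delay, until it succeeds."""
    last_error: Optional[Exception] = None
    # The leading 0.0 represents the first attempt, which runs without waiting.
    for delay in (0.0, *delays):
        if delay:
            await sleep(delay)
        try:
            return await func()
        except Exception as exc:  # assumption: every error is retryable
            last_error = exc
    assert last_error is not None
    raise last_error
```

With the default schedule a call that keeps failing waits 1.5s in total before the final error is raised.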
| Method | Purpose |
|---|---|
| _get_shared_client(base_url) | Returns or creates singleton httpx.AsyncClient with keep-alive |
| generate_content_async() | Main generation method. Builds Ollama payload, calls API, truncates response at sentence boundary |
| warm_up_model() | Sends 1-token request at startup to pre-load model into GPU VRAM |
| _extract_text_from_contents() | Delegates to llm_utils.extract_text_from_contents() (shared with GeminiLlm) |
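The sentence-boundary truncation mentioned for generate_content_async() could look roughly like the sketch below. It is not the adapter's actual code: the helper name truncate_at_sentence is hypothetical, and it assumes a character limit and the terminators ".", "!" and "?", whereas the real method may work on tokens or use other rules.

```python
def truncate_at_sentence(text: str, max_chars: int) -> str:
    """Cut text to at most max_chars, backing up to the last sentence end."""
    if len(text) <= max_chars:
        return text
    window = text[:max_chars]
    # Position of the last sentence terminator inside the window, or -1 if none.
    cut = max(window.rfind(ch) for ch in ".!?")
    if cut == -1:
        # No complete sentence fits, so fall back to a hard cut.
        return window.rstrip()
    return window[: cut + 1]
```

This avoids returning a reply that stops mid-sentence when the output limit cuts generation short.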
{
"model": "gemma3:4b",
"prompt": "<system>...</system>\n<turn>...</turn>",
"stream": false,
"options": {
"num_ctx": 2048,
"num_predict": 400,
"temperature": 0.7
}
}
GeminiLlm (backend/llms/gemini_llm.py)
Extends google.adk.models.BaseLlm. Uses the google.genai SDK with streaming enabled to reduce first-token latency.
Defaults (__init__):
max_output_tokens: int = 800 # optimised from 2048
timeout: float = 45.0 # reduced from 60s
max_retries: int = 2 # reduced from 3
temperature: float = 0.7
top_p: float = 0.95
top_k: int = 40
| Method | Purpose |
|---|---|
| _generate_with_stream(prompt) | Calls generate_content_stream(), collects all chunks, joins to final text |
| _retry_with_backoff(func) | Exponential backoff retry for 429 / 503 / timeout errors |
| _get_generation_config() | Returns GenerateContentConfig with all params |
| _get_safety_settings() | Disables all safety filters (required for scam simulation content) |
All safety categories are set to BLOCK_NONE. This is intentional — the platform simulates scam conversations which would otherwise be blocked by default Gemini safety filters.
genai.Client is created once in __init__ and reused across all calls (stored as _client PrivateAttr). API key is read from GEMINI_API_KEY env var.
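The chunk-collection step described for _generate_with_stream(prompt) can be sketched without the SDK. The example assumes each streamed chunk exposes a text attribute that may be empty; FakeChunk and collect_stream are hypothetical stand-ins, not SDK or project names.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FakeChunk:
    """Hypothetical stand-in for one streamed response chunk."""
    text: Optional[str]


def collect_stream(chunks: Iterable[FakeChunk]) -> str:
    """Join the text of every chunk, skipping chunks that carry no text."""
    return "".join(chunk.text for chunk in chunks if chunk.text)
```

The caller still waits for the full text, so the benefit of streaming here is a faster first token from the API rather than incremental output to the user.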
llm_utils.py
backend/llms/llm_utils.py provides extract_text_from_contents(contents), which was previously duplicated in both ollama_llm.py and gemini_llm.py (DUP-001 fix). Both adapters now import it from here.
# Both adapters use:
from llms.llm_utils import extract_text_from_contents
Converts ADK Content list (with role + parts) into a flat prompt string, truncated to the last 10 conversation turns.
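A rough sketch of that flattening step follows. It uses plain dataclasses in place of the ADK types, treats each Content item as one turn, and renders turns as "role: text" lines. All three are assumptions made for illustration, as is the name flatten_contents; the real helper's output format may differ.

```python
from dataclasses import dataclass, field
from typing import List

MAX_TURNS = 10  # limit stated in the description above


@dataclass
class Part:
    """Hypothetical stand-in for an ADK content part."""
    text: str


@dataclass
class Content:
    """Hypothetical stand-in for an ADK Content item (role + parts)."""
    role: str
    parts: List[Part] = field(default_factory=list)


def flatten_contents(contents: List[Content], max_turns: int = MAX_TURNS) -> str:
    """Render the most recent turns as one prompt string, one line per turn."""
    if max_turns <= 0:
        return ""
    lines = []
    # Keep only the tail of the conversation so the prompt stays bounded.
    for content in contents[-max_turns:]:
        text = " ".join(part.text for part in content.parts if part.text)
        if text:
            lines.append(f"{content.role}: {text}")
    return "\n".join(lines)
```

For a 12-turn conversation only turns 3 through 12 reach the model, so the prompt stays small regardless of conversation length.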
LlmFactory (backend/llms/llm_factory.py)
Static factory class. Single entry point for all agent LLM creation.
| Method | Signature | Purpose |
|---|---|---|
| create_llm | (agent_type, scam_type, context, use_gemini) | Creates and returns OllamaLlm or GeminiLlm with injected RAG context |
| get_rag_context | (scam_type, context) | Queries ChromaDB, returns formatted context string |
| get_current_provider | () | Returns 'gemini' or 'ollama' |
| get_provider_info | () | Returns full config dict (model names, URLs, token limits) |
| validate_gemini_config | () | Checks GEMINI_API_KEY is set and model IDs are valid |
Different models can be assigned per agent type via env vars:
# Gemini per-agent overrides
GEMINI_MODEL_SCAMMER=gemini-2.0-flash-exp
GEMINI_MODEL_EXPERT=gemini-2.0-flash-exp
# Ollama per-agent overrides (after Tools Center fine-tune)
AGENT_MODEL_SCAMMER=scammer-sim
AGENT_MODEL_EXPERT=anti-fraud-expert
AGENT_MODEL=gemma3:4b # default for all others
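One way the Ollama overrides could be resolved is sketched below. It assumes the variable suffix is the upper-cased agent type and that gemma3:4b is the final fallback when AGENT_MODEL is also unset. The function name resolve_ollama_model is hypothetical, and the factory's real lookup may differ.

```python
import os
from typing import Mapping, Optional


def resolve_ollama_model(
    agent_type: str, env: Optional[Mapping[str, str]] = None
) -> str:
    """Pick the per-agent model if set, else the shared default."""
    env = os.environ if env is None else env
    # Assumption: AGENT_MODEL_<AGENT_TYPE> uses the upper-cased agent type.
    override = env.get(f"AGENT_MODEL_{agent_type.upper()}")
    if override:
        return override
    return env.get("AGENT_MODEL", "gemma3:4b")
```

Passing a plain dict as env makes the lookup easy to exercise without touching the real environment.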
POST /api/model/switch (model_switch_routes.py) updates the GEMINI_ENABLED flag at runtime. No restart required. The change is reflected immediately in the next LlmFactory.create_llm() call.
See model-switch-api.md for full API reference.
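The no-restart behaviour follows if the flag is re-read on every factory call instead of being cached at import time. The sketch below assumes the flag lives in the process environment. The standalone get_current_provider function only mirrors the factory method's name, switch_provider is a hypothetical helper, and the real route may store or persist the flag differently.

```python
import os


def get_current_provider() -> str:
    """Re-read the flag on each call so a runtime change is picked up at once."""
    enabled = os.environ.get("GEMINI_ENABLED", "false").strip().lower() == "true"
    return "gemini" if enabled else "ollama"


def switch_provider(use_gemini: bool) -> str:
    """Update the flag in the running process and report the active provider."""
    # Assumption: the switch endpoint changes the in-process environment only.
    os.environ["GEMINI_ENABLED"] = "true" if use_gemini else "false"
    return get_current_provider()
```

Under this assumption the change lasts only for the life of the process unless backend/.env is updated as well.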
backend/llms/
ollama_llm.py Ollama adapter
gemini_llm.py Gemini adapter
llm_factory.py Provider selection + RAG injection
llm_utils.py Shared extract_text_from_contents()
rag_integration.py GeminiRAGHelper singleton
rag_diagnostics.py RAG health check
gemini_file_manager.py Gemini file upload (unused in current HK deployment)
backend/.env GEMINI_ENABLED, API keys, model names