Intelligent Model Routing

Open Model Prism routes incoming requests to the optimal model automatically when clients send "model": "auto". This document describes the full routing pipeline, all signal sources, override rules, and how classifier context limits are handled.

Overview

Incoming Request  ("model": "auto")
        │
        ▼
┌───────────────────────┐
│  Signal Extractor     │  0ms — no LLM call
│  Token count          │
│  Keywords / patterns  │
│  System prompt role   │
│  Content type         │
│  Conversation turns   │
└──────────┬────────────┘
           │ signals
           ▼
┌───────────────────────┐
│  Override Rules       │  0ms — rule engine
│  Vision upgrade       │
│  Domain gate          │
│  Security escalation  │
│  Budget cap           │
│  Confidence fallback  │
└──────────┬────────────┘
           │ category candidate + confidence
           ▼
    confidence >= threshold?
      YES ─────────────────────────────────────→ Model Selector
      NO
           │
           ▼
┌───────────────────────┐
│  LLM Classifier       │  ~200-800ms — only called when needed
│  Prompt summary       │
│  Context truncation   │
│  JSON output          │
└──────────┬────────────┘
           │ category + metadata
           ▼
┌───────────────────────┐
│  Model Selector       │  0ms
│  Category -> model    │
│  Capability matching  │
│  Context window check │
└──────────┬────────────┘
           │
           ▼
       Target Model

The goal is to call the expensive LLM classifier as infrequently as possible. Many requests can be pre-classified from structural signals alone.

Cost Tiers

Every routing category belongs to one of four cost tiers. The tier determines which class of model handles the request.

Tier      Use cases                                          Example categories
minimal   Translation, formatting, simple Q&A, smalltalk     translation, smalltalk_simple, format_convert
low       Drafting, summarisation, function calls            summarization_short, instruction_following, function_calling
medium    Analysis, long context, data extraction            data_analysis, long_context_processing, api_integration
high      Formal reasoning, security review, agentic coding  reasoning_formal, code_security_review, swe_agentic

Each tenant maps tiers to specific models via their routing category configuration. A category's defaultModel field overrides the tier default.

Signal Extraction

Before the LLM classifier is invoked, structural signals are extracted from the raw request. These signals are cheap (microseconds, no API calls) and often sufficient to make a routing decision.

Token Count

The total token count of all messages (system + history + current user message) is the strongest single signal for tier selection.

< 500 tokens    ->  minimal tier candidate
500-2 000       ->  low tier candidate
2 000-15 000    ->  medium tier candidate
> 15 000        ->  high tier candidate (long_context_processing)
> 50 000        ->  always high tier, regardless of other signals

Token count is estimated locally with a character-based heuristic (~3.5 chars/token, code-aware), so no tokenizer dependency is required.
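
The token-to-tier mapping above can be sketched as a pure function. The ~3.5 chars/token ratio comes from the text; the function names and the flat (non-code-aware) ratio are illustrative assumptions, not the real implementation:

```python
def estimate_tokens(text: str) -> int:
    """Character-based token estimate (~3.5 chars/token).

    Illustrative only: the real heuristic is described as code-aware,
    here we apply the flat ratio.
    """
    return max(1, round(len(text) / 3.5))

def tier_candidate(tokens: int) -> str:
    """Map an estimated token count to a cost-tier candidate."""
    if tokens > 15_000:   # > 50k is always high, regardless of other signals
        return "high"
    if tokens > 2_000:
        return "medium"
    if tokens >= 500:
        return "low"
    return "minimal"
```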

System Prompt Role Detection

The system prompt defines the "mode" of the entire session and overrides most other signals.

"You are a senior security auditor..."     ->  code_security_review, high tier
"You are a customer support agent..."      ->  customer_support, low tier
"You are a legal compliance advisor..."    ->  legal_analysis, domain=legal, medium tier min
"You are a data scientist..."              ->  data_analysis, medium tier

If the system prompt matches a known role pattern, the category is set directly and the LLM classifier is skipped.
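
Role detection reduces to matching the system prompt against a pattern table. A minimal sketch, assuming a static table (the patterns shown mirror the examples above but are hypothetical; the real set would be configurable):

```python
import re

# Hypothetical role patterns modelled on the examples above.
ROLE_PATTERNS = [
    (re.compile(r"security auditor", re.I), ("code_security_review", "high")),
    (re.compile(r"customer support agent", re.I), ("customer_support", "low")),
    (re.compile(r"legal compliance advisor", re.I), ("legal_analysis", "medium")),
    (re.compile(r"data scientist", re.I), ("data_analysis", "medium")),
]

def match_role(system_prompt: str):
    """Return (category, tier) if the system prompt matches a known role,
    else None, in which case later pipeline stages decide."""
    for pattern, result in ROLE_PATTERNS:
        if pattern.search(system_prompt):
            return result
    return None
```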

Keyword Rules (configurable)

Keyword rules scan the full message content for domain-specific terms. Each rule specifies:

  • keywords — list of strings to search for (case-insensitive)
  • match — any (one keyword sufficient) or all (all must appear)
  • minMatches — minimum number of keyword hits required
  • effect — what happens: override category, set a minimum tier, set a domain flag

Built-in examples:

Security Escalation:
  keywords: [private key, jwt, secret, vulnerability, CVE, exploit, crypto]
  match: any, minMatches: 2
  effect: category=code_security_review, tierMin=high

Legal Domain:
  keywords: [GDPR, NDA, liability, compliance, contract, Article]
  match: any, minMatches: 1
  effect: tierMin=medium, domain=legal

Medical Domain:
  keywords: [diagnosis, ICD, treatment, medication, symptoms, clinical]
  match: any, minMatches: 1
  effect: tierMin=medium, domain=medical

Keyword rules are stored in the database and fully editable via the admin UI — no deployment required.
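
Evaluating a single rule against message content might look like this. The field names (keywords, match, minMatches) follow the bullet list above; the function itself is an illustrative sketch, not the stored rule engine:

```python
def evaluate_rule(rule: dict, content: str) -> bool:
    """Check one keyword rule against message content (case-insensitive).

    rule = {"keywords": [...], "match": "any" | "all", "minMatches": int}
    Assumed schema based on the documented rule fields.
    """
    text = content.lower()
    hits = sum(1 for kw in rule["keywords"] if kw.lower() in text)
    if rule["match"] == "all" and hits < len(rule["keywords"]):
        return False
    return hits >= rule.get("minMatches", 1)
```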

Code Language Detection

The content is scanned for language-specific patterns to enable capability-based model selection:

Patterns detected:  .sol, pragma solidity, contract   ->  blockchain
                    def , import , .py                ->  python
                    SELECT, JOIN, CREATE TABLE         ->  sql
                    BEGIN CERTIFICATE, -----BEGIN     ->  crypto/certificates
                    func , go mod, :=                 ->  go

A detected language influences which model is selected within the target tier — for example, routing Python code tasks to codestral or deepseek-coder-v2 instead of a general-purpose model of the same tier.
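
A naive version of this scan, assuming plain substring matching against a static pattern table (the table is a subset of the examples above and purely illustrative):

```python
# Hypothetical pattern table modelled on the examples above.
LANGUAGE_PATTERNS = {
    "blockchain": ["pragma solidity", ".sol"],
    "python":     ["def ", "import ", ".py"],
    "sql":        ["SELECT", "JOIN", "CREATE TABLE"],
    "go":         ["func ", "go mod", ":="],
}

def detect_languages(content: str) -> list[str]:
    """Return every language whose patterns appear in the content.
    Deliberately naive: substring checks, case-sensitive for SQL keywords,
    as in the pattern table."""
    return [lang for lang, pats in LANGUAGE_PATTERNS.items()
            if any(p in content for p in pats)]
```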

Content Type

images in messages          ->  vision model required (see Vision Upgrade override)
tool_calls in messages      ->  function-calling capable model required
structured output schema    ->  JSON-mode capable model preferred
streaming: false            ->  no streaming constraint — any model eligible

Conversation Turn Count

Longer conversations carry accumulated context and often involve follow-up complexity:

turns 1-3   ->  no effect
turns 4-7   ->  +1 tier upgrade (configurable)
turns 8+    ->  +1 tier upgrade, long_context_processing flag

Turn-based upgrades are suppressed for categories that are inherently stateless (e.g. summarization_short, translation).
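
The turn-based upgrade, sketched with an assumed four-step tier ladder and a hypothetical stateless-category set:

```python
TIERS = ["minimal", "low", "medium", "high"]
STATELESS = {"summarization_short", "translation"}  # exempt from turn upgrades

def apply_turn_upgrade(tier: str, turns: int, category: str) -> str:
    """+1 tier from turn 4 onward, capped at high; stateless categories exempt."""
    if category in STATELESS or turns <= 3:
        return tier
    return TIERS[min(TIERS.index(tier) + 1, len(TIERS) - 1)]
```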

Override Rules

After signal extraction, a set of override rules adjusts the routing result. Overrides are applied in order; the first matching override wins (or they can stack — configurable).

Override                   Condition                                         Effect
Vision Upgrade             Images present, category doesn't require vision   Upgrade tier by 1
Security Escalation        Security keywords >= threshold                    Force code_security_review, high tier
Domain Gate                Domain = legal / medical / finance                Tier minimum medium
Confidence Fallback        Classifier confidence < threshold (default 0.65)  Force medium tier
Conversation Turn Upgrade  Turns >= 4                                        +1 tier
Frustration Upgrade        User frustration signal detected                  +1 tier
Output Length Upgrade      Estimated output = long, tier = minimal           Upgrade to low
Budget Cap                 Tenant daily spend >= alert threshold             Downgrade to configured max tier

All overrides are individually toggleable and threshold-adjustable per tenant via the admin UI.
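
One way the first-match-wins versus stacking behaviour could be modelled, with conditions and effects as callables. The real engine evaluates stored rule definitions; this is only a sketch:

```python
def apply_overrides(decision: dict, signals: dict, overrides: list,
                    stack: bool = False) -> dict:
    """Apply ordered (condition, effect) override rules.
    First match wins unless stacking is enabled."""
    for condition, effect in overrides:
        if condition(signals, decision):
            decision = effect(decision)
            if not stack:
                break
    return decision

# Usage: security escalation fires first, so the turn upgrade is skipped.
overrides = [
    (lambda s, d: s.get("security_keywords", 0) >= 2,
     lambda d: {**d, "category": "code_security_review", "tier": "high"}),
    (lambda s, d: s.get("turns", 1) >= 4,
     lambda d: {**d, "tier": "medium"}),
]
routed = apply_overrides({"category": "qa_simple", "tier": "low"},
                         {"security_keywords": 3, "turns": 6}, overrides)
```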

LLM Classifier

When pre-routing signals produce a confidence below the configured threshold (default: 0.65), the LLM classifier is called. It receives a structured summary of the request — not the full content — and returns a JSON routing decision.

What the classifier receives

[System]
You are a precise model router. Classify the request into one of:
- code_generation [low] — Examples: write function, implement class, ...
- data_analysis [medium] — Examples: analyze dataset, find patterns, ...
- reasoning_formal [high] — Examples: prove theorem, formal logic, ...
... (all 45 categories with tier and examples)

Reply with ONLY valid JSON, no markdown:
{"category":"...","confidence":0-1,"complexity":"simple|medium|complex",
 "has_image":bool,"language":"en|de|other","estimated_output_length":"short|medium|long",
 "domain":"general|legal|medical|finance|tech|science","conversation_turn":int,
 "user_frustration_signal":bool,"cost_tier":"minimal|low|medium|high","reasoning":"..."}

[User]
[System]: You are a coding assistant...  (truncated to 500 chars)
[user]: Analyse this repository and identify security vulnerabilities
[CONTEXT_SIGNALS: tokens=82000, languages=[python,yaml], security_keywords=3, turns=1]

Note that context signals are injected as metadata — the classifier never sees the full file contents. This keeps the classifier call small and fast regardless of how large the actual payload is.
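
Since the classifier is instructed to reply with JSON only, malformed output still has to be handled somewhere. A defensive parse with a conservative fallback might look like this; the fallback category and shape are assumptions, not the documented wire format:

```python
import json

def parse_classifier_output(raw: str) -> dict:
    """Parse the classifier's JSON reply; on malformed output fall back to a
    conservative medium-tier decision with zero confidence (so the
    confidence-fallback override applies downstream)."""
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError:
        return {"category": "question_answering_complex",
                "confidence": 0.0, "cost_tier": "medium"}
    decision.setdefault("confidence", 0.0)
    return decision
```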

Classifier Context Limit Handling

Different classifier models have vastly different context windows:

Model             Context window    Recommended strategy
gpt-4o-mini       128,000 tokens    truncate
claude-haiku-4-5  200,000 tokens    truncate
llama-3.1-8b      8,192 tokens      metadata_only
gemini-flash-2.0  1,000,000 tokens  truncate
phi-4             16,384 tokens     summary

Three strategies are supported:

metadata_only — the classifier never sees message content, only extracted metadata. Safest for small-context models (< 16k tokens). Lowest classification quality for ambiguous prompts.

truncate — the last user message and a truncated system prompt are included up to contextLimit x 0.6 tokens (leaving headroom for the category list and output). Best trade-off for most models.

summary — a cheap, fast model first generates a 200-token summary of the full context, then the classifier receives that summary. Highest quality for long inputs, but adds one extra API hop.

The strategy and context limit are configured per tenant in the Routing Config UI.
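
The per-strategy content budget could be computed with a hypothetical helper like the one below; only the contextLimit x 0.6 rule and the ~200-token summary figure come from the text above:

```python
def classifier_payload_budget(context_limit: int, strategy: str) -> int:
    """Token budget for message content in the classifier prompt.

    truncate:      contextLimit x 0.6, leaving headroom for the
                   category list and JSON output
    metadata_only: no message content at all
    summary:       a fixed ~200-token summary replaces the content
    """
    if strategy == "metadata_only":
        return 0
    if strategy == "summary":
        return 200
    return int(context_limit * 0.6)  # truncate
```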

Routing Categories

Open Model Prism ships with 45 built-in routing categories across four cost tiers. Categories are stored in MongoDB and fully editable — add, remove, rename, or adjust defaults without code changes.

Minimal tier (simple, fast, cheap)

Key                  Description
smalltalk_simple     Greetings, casual conversation
translation          Language translation
format_convert       Convert between formats (Markdown → HTML, JSON → YAML, etc.)
brainstorming        Quick idea generation, simple lists
proofreading         Grammar and spelling correction
summarization_short  Short text summarisation (< 2 pages)

Low tier (standard tasks)

Key                        Description
summarization_long         Long document summarisation
instruction_following      Step-by-step task completion
function_calling           Tool use, function call generation
qa_simple                  Simple factual Q&A
classification_extraction  Entity extraction, labelling
creative_writing           Stories, poems, marketing copy
sentiment_analysis         Tone and sentiment detection
devops_infrastructure      Infrastructure scripts, CI/CD, Docker, Kubernetes
qa_testing                 Test case generation, QA scenarios

Medium tier (complex tasks)

Key                         Description
code_generation             Write code in any language
code_review                 Review and critique code
code_debugging              Identify and fix bugs
data_analysis               Analyse datasets, find patterns
api_integration             API client code, integration logic
long_context_processing     Tasks requiring large context windows (> 15k tokens)
stem_science                Science, engineering, technical calculations
question_answering_complex  Multi-step reasoning Q&A
customer_support            Support ticket handling, escalation
document_understanding      PDF, contract, or report comprehension
research_synthesis          Summarise multiple sources

High tier (hardest tasks)

Key                   Description
reasoning_formal      Mathematical proofs, formal logic
code_security_review  Security audit, vulnerability analysis
swe_agentic           Agentic software engineering, multi-step coding
legal_analysis        Legal document analysis, compliance
medical_analysis      Clinical/medical content (escalated by domain gate)
system_design         Architecture, high-level design docs
multimodal_analysis   Image + text combined analysis

Preset Profiles

Preset profiles are named bundles of routing categories. When applied, they automatically assign the best available model (ranked by benchmark score for the category's primary capability axis) to each category in the bundle.

The assignment is non-destructive: categories that already have a defaultModel configured are skipped.

Profile               Category focus
software_development  Code generation, debugging, refactoring, security review, DevOps
customer_support      FAQ, sentiment analysis, summarisation, instruction following
research_analysis     Data analysis, STEM, long context, formal reasoning
creative_content      Brainstorming, copywriting, proofreading, format conversion
data_operations       SQL, data transformation, API integration, QA testing
agentic_workflows     Agentic SWE, function calling, multi-step tool use
enterprise_general    All 45 categories — full coverage

Profiles are selectable in the Setup Wizard (step 2 of 4) and re-applicable at any time via POST /api/prism/admin/categories/apply-preset.

Model Selection

Once a category and tier are determined, the model is selected as follows:

  1. Category defaultModel — if set, use it directly (highest priority)
  2. Capability matching — within the target tier, rank models by their benchmark score for the category's primary capability axis (e.g. coding for code_generation, math for reasoning_formal)
  3. Tenant routing defaultModel — fallback if no category default and no benchmark data
  4. Context window check — if the selected model's context window is smaller than the estimated token count, escalate to the next larger model (findLargerContextModel)

Response Enrichment

Every auto-routed response includes routing metadata in the response body:

{
  "choices": [...],
  "auto_routing": {
    "category": "code_security_review",
    "confidence": 0.91,
    "complexity": "complex",
    "cost_tier": "high",
    "model_id": "claude-opus-4-6",
    "override_applied": "security_escalation",
    "analysis_time_ms": 312,
    "domain": "tech",
    "reasoning": "Request contains 3 security-related keyword patterns and 82k token codebase"
  },
  "cost_info": {
    "actual_cost": 0.0187,
    "baseline_cost": 0.0210,
    "saved": 0.0023,
    "input_tokens": 82140,
    "output_tokens": 1820
  }
}

If a context fallback occurred (model was upgraded due to context overflow), a context_fallback field is also included:

"context_fallback": {
  "original_model": "claude-sonnet-4-6",
  "fallback_model": "claude-opus-4-6",
  "reason": "context_overflow"
}