Intelligent Model Routing

Open Model Prism routes incoming requests to the optimal model automatically when clients send "model": "auto". This document describes the full routing pipeline, all signal sources, override rules, and how classifier context limits are handled.

Overview

Incoming Request  ("model": "auto")
        │
        ▼
┌───────────────────────┐
│  Signal Extractor     │  0ms — no LLM call
│  Token count          │
│  Keywords / patterns  │
│  System prompt role   │
│  Content type         │
│  Conversation turns   │
└──────────┬────────────┘
           │ signals
           ▼
┌───────────────────────┐
│  Override Rules       │  0ms — rule engine
│  Vision upgrade       │
│  Domain gate          │
│  Security escalation  │
│  Budget cap           │
│  Confidence fallback  │
└──────────┬────────────┘
           │ category candidate + confidence
           ▼
    confidence >= threshold?
      YES ─────────────────────────────────────→ Model Selector
      NO
           │
           ▼
┌───────────────────────┐
│  LLM Classifier       │  ~200-800ms — only called when needed
│  Prompt summary       │
│  Context truncation   │
│  JSON output          │
└──────────┬────────────┘
           │ category + metadata
           ▼
┌───────────────────────┐
│  Model Selector       │  0ms
│  Category -> model    │
│  Capability matching  │
│  Context window check │
└──────────┬────────────┘
           │
           ▼
       Target Model

The goal is to call the expensive LLM classifier as infrequently as possible. Many requests can be pre-classified from structural signals alone.

Cost Tiers

Every routing category belongs to one of four cost tiers. The tier determines which class of model handles the request.

Tier      Use cases                                          Example categories
minimal   Translation, formatting, simple Q&A, smalltalk     translation, smalltalk_simple, format_convert
low       Drafting, summarisation, function calls            summarization_short, instruction_following, function_calling
medium    Analysis, long context, data extraction            data_analysis, long_context_processing, api_integration
high      Formal reasoning, security review, agentic coding  reasoning_formal, code_security_review, swe_agentic

Each tenant maps tiers to specific models via their routing category configuration. A category's defaultModel field overrides the tier default.

Signal Extraction

Before the LLM classifier is invoked, structural signals are extracted from the raw request. These signals are cheap (microseconds, no API calls) and often sufficient to make a routing decision.

Token Count

The total token count of all messages (system + history + current user message) is the strongest single signal for tier selection.

< 500 tokens    ->  minimal tier candidate
500-2 000       ->  low tier candidate
2 000-15 000    ->  medium tier candidate
> 15 000        ->  high tier candidate (long_context_processing)
> 50 000        ->  always high tier, regardless of other signals

Token count is estimated locally with a character-based heuristic (~3.5 chars/token, code-aware), so no tokenizer dependency is required.
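
The token-to-tier mapping above can be sketched as a pure function. The ~3.5 chars/token ratio comes from the text; the function names and the flat (non-code-aware) ratio are illustrative assumptions, not the real implementation:

```python
def estimate_tokens(text: str) -> int:
    """Character-based token estimate (~3.5 chars/token).

    Illustrative only: the real heuristic is described as code-aware,
    here we apply the flat ratio.
    """
    return max(1, round(len(text) / 3.5))

def tier_candidate(tokens: int) -> str:
    """Map an estimated token count to a cost-tier candidate."""
    if tokens > 15_000:   # > 50k is always high, regardless of other signals
        return "high"
    if tokens > 2_000:
        return "medium"
    if tokens >= 500:
        return "low"
    return "minimal"
```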

System Prompt Role Detection

The system prompt defines the "mode" of the entire session and overrides most other signals.

"You are a senior security auditor..."     ->  code_security_review, high tier
"You are a customer support agent..."      ->  customer_support, low tier
"You are a legal compliance advisor..."    ->  legal_analysis, domain=legal, medium tier min
"You are a data scientist..."              ->  data_analysis, medium tier

If the system prompt matches a known role pattern, the category is set directly and the LLM classifier is skipped.
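
Role detection reduces to matching the system prompt against a pattern table. A minimal sketch, assuming a static table (the patterns shown mirror the examples above but are hypothetical; the real set would be configurable):

```python
import re

# Hypothetical role patterns modelled on the examples above.
ROLE_PATTERNS = [
    (re.compile(r"security auditor", re.I), ("code_security_review", "high")),
    (re.compile(r"customer support agent", re.I), ("customer_support", "low")),
    (re.compile(r"legal compliance advisor", re.I), ("legal_analysis", "medium")),
    (re.compile(r"data scientist", re.I), ("data_analysis", "medium")),
]

def match_role(system_prompt: str):
    """Return (category, tier) if the system prompt matches a known role,
    else None, in which case later pipeline stages decide."""
    for pattern, result in ROLE_PATTERNS:
        if pattern.search(system_prompt):
            return result
    return None
```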

Keyword Rules (configurable)

Keyword rules scan the full message content for domain-specific terms. Each rule specifies:

  • keywords — list of strings to search for (case-insensitive)
  • match — any (one keyword sufficient) or all (all must appear)
  • minMatches — minimum number of keyword hits required
  • effect — what happens: override category, set a minimum tier, set a domain flag

Built-in examples:

Security Escalation:
  keywords: [private key, jwt, secret, vulnerability, CVE, exploit, crypto]
  match: any, minMatches: 2
  effect: category=code_security_review, tierMin=high

Legal Domain:
  keywords: [GDPR, NDA, liability, compliance, contract, Article]
  match: any, minMatches: 1
  effect: tierMin=medium, domain=legal

Medical Domain:
  keywords: [diagnosis, ICD, treatment, medication, symptoms, clinical]
  match: any, minMatches: 1
  effect: tierMin=medium, domain=medical

Keyword rules are stored in the database and fully editable via the admin UI — no deployment required.
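
Evaluating a single rule against message content might look like this. The field names (keywords, match, minMatches) follow the bullet list above; the function itself is an illustrative sketch, not the stored rule engine:

```python
def evaluate_rule(rule: dict, content: str) -> bool:
    """Check one keyword rule against message content (case-insensitive).

    rule = {"keywords": [...], "match": "any" | "all", "minMatches": int}
    Assumed schema based on the documented rule fields.
    """
    text = content.lower()
    hits = sum(1 for kw in rule["keywords"] if kw.lower() in text)
    if rule["match"] == "all" and hits < len(rule["keywords"]):
        return False
    return hits >= rule.get("minMatches", 1)
```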

Code Language Detection

The content is scanned for language-specific patterns to enable capability-based model selection:

Patterns detected:  .sol, pragma solidity, contract   ->  blockchain
                    def , import , .py                ->  python
                    SELECT, JOIN, CREATE TABLE         ->  sql
                    BEGIN CERTIFICATE, -----BEGIN     ->  crypto/certificates
                    func , go mod, :=                 ->  go

A detected language influences which model is selected within the target tier — for example, routing Python code tasks to codestral or deepseek-coder-v2 instead of a general-purpose model of the same tier.
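
A naive version of this scan, assuming plain substring matching against a static pattern table (the table is a subset of the examples above and purely illustrative):

```python
# Hypothetical pattern table modelled on the examples above.
LANGUAGE_PATTERNS = {
    "blockchain": ["pragma solidity", ".sol"],
    "python":     ["def ", "import ", ".py"],
    "sql":        ["SELECT", "JOIN", "CREATE TABLE"],
    "go":         ["func ", "go mod", ":="],
}

def detect_languages(content: str) -> list[str]:
    """Return every language whose patterns appear in the content.
    Deliberately naive: substring checks, case-sensitive for SQL keywords,
    as in the pattern table."""
    return [lang for lang, pats in LANGUAGE_PATTERNS.items()
            if any(p in content for p in pats)]
```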

Content Type

images in messages          ->  vision model required (see Vision Upgrade override)
tool_calls in messages      ->  function-calling capable model required
structured output schema    ->  JSON-mode capable model preferred
streaming: false            ->  no streaming constraint — any model eligible

Conversation Turn Count

Longer conversations carry accumulated context and often involve follow-up complexity:

turns 1-3   ->  no effect
turns 4-7   ->  +1 tier upgrade (configurable)
turns 8+    ->  +1 tier upgrade, long_context_processing flag

Turn-based upgrades are suppressed for categories that are inherently stateless (e.g. summarization_short, translation).
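
The turn-based upgrade, sketched with an assumed four-step tier ladder and a hypothetical stateless-category set:

```python
TIERS = ["minimal", "low", "medium", "high"]
STATELESS = {"summarization_short", "translation"}  # exempt from turn upgrades

def apply_turn_upgrade(tier: str, turns: int, category: str) -> str:
    """+1 tier from turn 4 onward, capped at high; stateless categories exempt."""
    if category in STATELESS or turns <= 3:
        return tier
    return TIERS[min(TIERS.index(tier) + 1, len(TIERS) - 1)]
```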

Override Rules

After signal extraction, a set of override rules adjusts the routing result. Overrides are applied in order; the first matching override wins (or they can stack — configurable).

Override                   Condition                                         Effect
Vision Upgrade             Images present, category doesn't require vision   Upgrade tier by 1
Security Escalation        Security keywords >= threshold                    Force code_security_review, high tier
Domain Gate                Domain = legal / medical / finance                Tier minimum medium
Confidence Fallback        Classifier confidence < threshold (default 0.65)  Force medium tier
Conversation Turn Upgrade  Turns >= 4                                        +1 tier
Frustration Upgrade        User frustration signal detected                  +1 tier
Output Length Upgrade      Estimated output = long, tier = minimal           Upgrade to low
Budget Cap                 Tenant daily spend >= alert threshold             Downgrade to configured max tier

All overrides are individually toggleable and threshold-adjustable per tenant via the admin UI.
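
One way the first-match-wins versus stacking behaviour could be modelled, with conditions and effects as callables. The real engine evaluates stored rule definitions; this is only a sketch:

```python
def apply_overrides(decision: dict, signals: dict, overrides: list,
                    stack: bool = False) -> dict:
    """Apply ordered (condition, effect) override rules.
    First match wins unless stacking is enabled."""
    for condition, effect in overrides:
        if condition(signals, decision):
            decision = effect(decision)
            if not stack:
                break
    return decision

# Usage: security escalation fires first, so the turn upgrade is skipped.
overrides = [
    (lambda s, d: s.get("security_keywords", 0) >= 2,
     lambda d: {**d, "category": "code_security_review", "tier": "high"}),
    (lambda s, d: s.get("turns", 1) >= 4,
     lambda d: {**d, "tier": "medium"}),
]
routed = apply_overrides({"category": "qa_simple", "tier": "low"},
                         {"security_keywords": 3, "turns": 6}, overrides)
```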

LLM Classifier

When pre-routing signals produce a confidence below the configured threshold (default: 0.65), the LLM classifier is called. It receives a structured summary of the request — not the full content — and returns a JSON routing decision.

What the classifier receives

[System]
You are a precise model router. Classify the request into one of:
- code_generation [low] — Examples: write function, implement class, ...
- data_analysis [medium] — Examples: analyze dataset, find patterns, ...
- reasoning_formal [high] — Examples: prove theorem, formal logic, ...
... (all 45 categories with tier and examples)

Reply with ONLY valid JSON, no markdown:
{"category":"...","confidence":0-1,"complexity":"simple|medium|complex",
 "has_image":bool,"language":"en|de|other","estimated_output_length":"short|medium|long",
 "domain":"general|legal|medical|finance|tech|science","conversation_turn":int,
 "user_frustration_signal":bool,"cost_tier":"minimal|low|medium|high","reasoning":"..."}

[User]
[System]: You are a coding assistant...  (truncated to 500 chars)
[user]: Analyse this repository and identify security vulnerabilities
[CONTEXT_SIGNALS: tokens=82000, languages=[python,yaml], security_keywords=3, turns=1]

Note that context signals are injected as metadata — the classifier never sees the full file contents. This keeps the classifier call small and fast regardless of how large the actual payload is.
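
Since the classifier is instructed to reply with JSON only, malformed output still has to be handled somewhere. A defensive parse with a conservative fallback might look like this; the fallback category and shape are assumptions, not the documented wire format:

```python
import json

def parse_classifier_output(raw: str) -> dict:
    """Parse the classifier's JSON reply; on malformed output fall back to a
    conservative medium-tier decision with zero confidence (so the
    confidence-fallback override applies downstream)."""
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError:
        return {"category": "question_answering_complex",
                "confidence": 0.0, "cost_tier": "medium"}
    decision.setdefault("confidence", 0.0)
    return decision
```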

Classifier Context Limit Handling

Different classifier models have vastly different context windows:

Model             Context window    Recommended strategy
gpt-4o-mini       128,000 tokens    truncate
claude-haiku-4-5  200,000 tokens    truncate
llama-3.1-8b      8,192 tokens      metadata_only
gemini-flash-2.0  1,000,000 tokens  truncate
phi-4             16,384 tokens     summary

Three strategies are supported:

metadata_only — the classifier never sees message content, only extracted metadata. Safest for small-context models (< 16k tokens). Lowest classification quality for ambiguous prompts.

truncate — the last user message and a truncated system prompt are included up to contextLimit x 0.6 tokens (leaving headroom for the category list and output). Best trade-off for most models.

summary — a cheap, fast model first generates a 200-token summary of the full context, then the classifier receives that summary. Highest quality for long inputs, but adds one extra API hop.

The strategy and context limit are configured per tenant in the Routing Config UI.
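
The per-strategy content budget could be computed with a hypothetical helper like the one below; only the contextLimit x 0.6 rule and the ~200-token summary figure come from the text above:

```python
def classifier_payload_budget(context_limit: int, strategy: str) -> int:
    """Token budget for message content in the classifier prompt.

    truncate:      contextLimit x 0.6, leaving headroom for the
                   category list and JSON output
    metadata_only: no message content at all
    summary:       a fixed ~200-token summary replaces the content
    """
    if strategy == "metadata_only":
        return 0
    if strategy == "summary":
        return 200
    return int(context_limit * 0.6)  # truncate
```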

Routing Categories

Open Model Prism ships with 45 built-in routing categories across four cost tiers. Categories are stored in MongoDB and fully editable — add, remove, rename, or adjust defaults without code changes.

Minimal tier (simple, fast, cheap)

Key                  Description
smalltalk_simple     Greetings, casual conversation
translation          Language translation
format_convert       Convert between formats (Markdown → HTML, JSON → YAML, etc.)
brainstorming        Quick idea generation, simple lists
proofreading         Grammar and spelling correction
summarization_short  Short text summarisation (< 2 pages)

Low tier (standard tasks)

Key                        Description
summarization_long         Long document summarisation
instruction_following      Step-by-step task completion
function_calling           Tool use, function call generation
qa_simple                  Simple factual Q&A
classification_extraction  Entity extraction, labelling
creative_writing           Stories, poems, marketing copy
sentiment_analysis         Tone and sentiment detection
devops_infrastructure      Infrastructure scripts, CI/CD, Docker, Kubernetes
qa_testing                 Test case generation, QA scenarios

Medium tier (complex tasks)

Key                         Description
code_generation             Write code in any language
code_review                 Review and critique code
code_debugging              Identify and fix bugs
data_analysis               Analyse datasets, find patterns
api_integration             API client code, integration logic
long_context_processing     Tasks requiring large context windows (> 15k tokens)
stem_science                Science, engineering, technical calculations
question_answering_complex  Multi-step reasoning Q&A
customer_support            Support ticket handling, escalation
document_understanding      PDF, contract, or report comprehension
research_synthesis          Summarise multiple sources

High tier (hardest tasks)

Key                   Description
reasoning_formal      Mathematical proofs, formal logic
code_security_review  Security audit, vulnerability analysis
swe_agentic           Agentic software engineering, multi-step coding
legal_analysis        Legal document analysis, compliance
medical_analysis      Clinical/medical content (escalated by domain gate)
system_design         Architecture, high-level design docs
multimodal_analysis   Image + text combined analysis

Preset Profiles

Preset profiles are named bundles of routing categories. When applied, they automatically assign the best available model (ranked by benchmark score for the category's primary capability axis) to each category in the bundle.

The assignment is non-destructive: categories that already have a defaultModel configured are skipped.

Profile               Category focus
software_development  Code generation, debugging, refactoring, security review, DevOps
customer_support      FAQ, sentiment analysis, summarisation, instruction following
research_analysis     Data analysis, STEM, long context, formal reasoning
creative_content      Brainstorming, copywriting, proofreading, format conversion
data_operations       SQL, data transformation, API integration, QA testing
agentic_workflows     Agentic SWE, function calling, multi-step tool use
enterprise_general    All 45 categories — full coverage

Profiles are selectable in the Setup Wizard (step 2 of 4) and re-applicable at any time via POST /api/prism/admin/categories/apply-preset.

Model Selection

Once a category and tier are determined, the model is selected as follows:

  1. Category defaultModel — if set, use it directly (highest priority)
  2. Capability matching — within the target tier, rank models by their benchmark score for the category's primary capability axis (e.g. coding for code_generation, math for reasoning_formal)
  3. Tenant routing defaultModel — fallback if no category default and no benchmark data
  4. Context window check — if the selected model's context window is smaller than the estimated token count, escalate to the next larger model (findLargerContextModel)

Response Enrichment

Every auto-routed response includes routing metadata in the response body:

{
  "choices": [...],
  "auto_routing": {
    "category": "code_security_review",
    "confidence": 0.91,
    "complexity": "complex",
    "cost_tier": "high",
    "model_id": "claude-opus-4-6",
    "override_applied": "security_escalation",
    "analysis_time_ms": 312,
    "domain": "tech",
    "reasoning": "Request contains 3 security-related keyword patterns and 82k token codebase"
  },
  "cost_info": {
    "actual_cost": 0.0187,
    "baseline_cost": 0.0210,
    "saved": 0.0023,
    "input_tokens": 82140,
    "output_tokens": 1820
  }
}

If a context fallback occurred (model was upgraded due to context overflow), a context_fallback field is also included:

"context_fallback": {
  "original_model": "claude-sonnet-4-6",
  "fallback_model": "claude-opus-4-6",
  "reason": "context_overflow"
}