The open-source gap has nearly closed. Open-weight models now trail proprietary frontier models by only ~3 months on average. DeepSeek, Llama, Qwen, and Mistral all match or beat GPT-5 and Claude on specific benchmarks.
Reasoning powerhouse
671B MoE model (37B active). R1 scored 97.3% on MATH-500 — highest of any open model. Gold medal at IMO 2025, IOI 2025, ICPC World Finals. Trained at dramatically lower cost than industry norms. V3.2 is the first model to integrate thinking directly into tool-use workflows.
Best For
Complex reasoning, math, competitive coding, legal analysis
Risks / Gaps
Geopolitical concerns for US enterprise; verbose outputs; China-based data handling questions
Open-weight ecosystem king
Scout: 10M token context window (industry-leading — 7,500+ pages). Maverick: 400B MoE for quality. Largest open-source community. Runs on consumer to enterprise hardware with multiple size options.
Best For
Self-hosted enterprise, long-document analysis, privacy-first deployments, fine-tuning
Risks / Gaps
700M MAU license cap and EU restrictions; requires infrastructure expertise to self-host
Multilingual & cost champion
1T+ parameters via MoE. Supports 119–201 languages. 92.3% on AIME25. Integrates vision and language for unified multimodal reasoning. Scores 74.1% on LiveCodeBench v6. 8.6×–19× higher decoding throughput vs previous gen.
Best For
Multilingual deployments, RAG pipelines, coding, global enterprise summarization
Risks / Gaps
Smaller Western dev community; compliance concerns for US-regulated industries; API stability
EU data sovereignty & speed
80+ languages. Small delivers 90% of Large's performance at 1/8 the cost. Building "Mistral Sovereign" for air-gapped government networks. Fastest inference at 7B scale. Strong French, German, Spanish NLP.
Best For
EU-regulated industries, real-time chatbots, GDPR-compliant deployments, European NLP
Risks / Gaps
Larger compute than Llama at comparable sizes; smaller global community; niche positioning
Citation-first AI search
Not a single LLM — an AI-native search platform combining multiple models with retrieval systems. Every answer includes citations and source links. Strong adoption among knowledge workers.
Best For
Competitive research, academic analysis, journalism, market intelligence, fact-checking
Risks / Gaps
Can't fine-tune or self-host; dependent on underlying model providers; not a standalone LLM
Agentic coding at insane efficiency
Outperforms DeepSeek-V3.2 and Kimi-K2 on SWE benchmarks at 1/2–1/3 the parameters. 150 tok/sec. $0.10/M input tokens. Trained for agentic tool-calling via Multi-Teacher Online Policy Distillation (MOPD).
Best For
High-throughput code agents, tool-use workflows, cost-sensitive production inference
Risks / Gaps
Very new entrant; limited enterprise track record; China-origin concerns for regulated buyers
Vertical alignment matters more than benchmarks. Domain-specific APIs in banking and healthcare are already displacing generic models by reducing hallucination risk and easing compliance.
Regulated environment demands safety alignment + reasoning depth. Healthcare LLM market growing at 25.95% CAGR. Constitutional AI and enterprise compliance are table stakes.
Multi-step reasoning for risk analysis, compliance docs. Domain-specific APIs reducing hallucination risk. DeepSeek's math prowess relevant for quantitative analysis.
Long-context document analysis (10M tokens for Llama Scout). Claude's reasoning for contract review and case law. Privacy-first self-hosting critical for client data.
Codex agent, Claude Code, DeepSeek's competitive coding dominance. Agentic workflows are the differentiator. MiMo emerging as efficiency leader.
GPT's brand + Gemini's multimodal power. Integration with Workspace/Copilot for content at scale. Gemini's video understanding opens new creative workflows.
Apache 2.0/self-hosted. Mistral Sovereign designed for air-gapped government networks. EU AI Act compliance requires regionally trained or on-premise models.
Qwen covers 119–201 languages. Mistral strong in European languages. Both open-weight with permissive licensing for global deployment.
Grok's real-time X data for sentiment and trends. Perplexity's citation-first research for journalism and competitive intelligence.
Agentic AI governance, MCP protocol for agent interop, self-hosted options for IDPs. The diversity of models is exactly why governance platforms matter.
Different roles need different models. When you're in a deal, you're not selling to "an industry" — you're selling to a VP of Engineering, a CISO, a CMO, and they each evaluate AI through a completely different lens.
Cares about agentic workflows, code quality, and developer productivity. Claude Code and Codex are the key differentiators. Llama 4 for self-hosted control and fine-tuning. Evaluates on: benchmark performance, context window, tool-use reliability, and MCP support.
Constitutional AI and safety alignment are table stakes. Mistral Sovereign for air-gapped gov networks. Llama for on-prem, no-data-leaves-the-building deployments. Evaluates on: data residency, PII handling, audit logging, and compliance certifications.
Content at scale: copywriting, campaign generation, multimodal assets. Gemini's video/image understanding opens new creative workflows. GPT-5 has the strongest brand and broadest integration with marketing tools via Copilot. Evaluates on: creative quality, multimodal support, speed, and integration with existing stack.
Fine-tuning and self-hosting are non-negotiable for serious ML teams. DeepSeek's reasoning dominates math/science benchmarks. Qwen's Apache 2.0 license means zero restrictions. Evaluates on: fine-tuning flexibility, benchmark performance on domain tasks, cost per experiment, and open weights.
Coding benchmarks matter here: Claude Code, Codex, and DeepSeek all lead. MCP protocol support is becoming a differentiator for agent interop. The IDP/AEP layer governs which models developers can access and how they're routed. Evaluates on: code generation quality, IDE integration, agentic tool-use, and API reliability.
Long-context document analysis is critical — Llama 4 Scout's 10M tokens can process entire case files. Claude's multi-step reasoning excels at contract review and regulatory analysis. Evaluates on: reasoning accuracy, hallucination rate, context window, and privacy controls.
Executives want synthesized intelligence, not raw output. Perplexity's citation-first research is ideal for board prep and competitive intel. GPT-5 and Claude for strategic synthesis across long documents. Evaluates on: accuracy of synthesis, source attribution, speed, and polish of output.
How token pricing works: LLM APIs charge per million tokens (1M tokens is roughly 750K words). Every price shown is input / output: what you send vs. what the model generates. Output is always more expensive because sequential generation takes more compute. The cost driver usually isn't the model's answer; it's the context you send with every request.
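The arithmetic behind every tier below fits in a few lines. In this sketch the token counts and prices are illustrative placeholders, not quotes from any provider:

```python
def request_cost(tokens_in: int, tokens_out: int,
                 price_in: float, price_out: float) -> float:
    """Cost of one request, with prices quoted per million tokens."""
    return tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out

# Illustrative: a RAG request stuffing 50K tokens of context to get a
# 500-token answer, at an assumed $3/M input and $15/M output.
cost = request_cost(50_000, 500, 3.00, 15.00)
print(f"${cost:.4f}")  # input dominates: $0.15 of context vs ~$0.008 of answer
```

Note how the context, not the answer, dominates the bill: cutting the prompt in half saves far more than shortening the output.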
Claude Opus 4.6: $5 in / $25 out · GPT-5.4 Pro: $30 in / $180 out · GPT-5.2: $1.75 in / $14 out
Use sparingly — complex reasoning, mission-critical analysis, high-stakes legal/medical review. A single Opus conversation with full 200K context costs ~$4.50. Run that 100x/day = $13,500/month on one workflow.
Claude Sonnet 4.6: $3 in / $15 out · GPT-5.4: $2.50 in / $10 out · GPT-4o: $2.50 in / $10 out · Gemini 3.1 Pro: $2 in / $12 out · Grok 3: $3 in / $15 out
The production workhorse tier. Handles 80%+ of real enterprise tasks — coding, analysis, content generation, customer support.
Claude Haiku 4.5: $1 in / $5 out · GPT-5.4 Mini: $0.75 in / $3 out · Gemini 3 Flash: $0.50 in / $3 out · Mistral Small: $0.20 in / $0.60 out
High-volume workloads: chatbots, classification, content moderation, bulk processing. Flash and Haiku deliver 90% of frontier quality at 10% of the cost.
DeepSeek V3.2: $0.14 in / $0.28 out · Gemini Flash-Lite: $0.075 in / $0.30 out · Gemini 2.0 Flash: $0.10 in / $0.40 out
100x cheaper than frontier. Genuine alternatives for RAG, summarization, and routine queries. DeepSeek is MIT licensed; Gemini has a generous free tier.
Llama 4, DeepSeek, Qwen, Mistral — free to download, but GPU infrastructure costs $0.50–$5/hour depending on model size. Running Llama 70B requires 2x A100 GPUs (~$2,160/month). Eliminates per-token fees at scale; 5–25x cheaper than API equivalents for high-volume use cases.
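Using the GPU figure above, the break-even point against an API is back-of-envelope arithmetic. The blended API price here is an assumption chosen for illustration, not a published rate:

```python
# ~$2,160/month for 2x A100 (figure from above) vs. an assumed blended
# API price of ~$6 per million tokens across input and output.
gpu_monthly = 2_160.0
blended_api_price = 6.00  # $/M tokens, assumption

breakeven_millions = gpu_monthly / blended_api_price
print(f"Self-hosting pays off past ~{breakeven_millions:.0f}M tokens/month")
```

Below that volume, the GPUs sit partially idle and the API is cheaper; well above it, the per-token fees dwarf the fixed infrastructure cost.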
💡 The 70/20/10 rule
Best practice at scale: route 70% of queries to budget models, 20% to mid-tier, and 10% to frontier. This tiered approach reduces average per-query cost by 60–80%. Stack with prompt caching (~90% savings on repeated inputs) and batch APIs (50% off async workloads) for maximum efficiency. Realistic enterprise budgets should be ~1.7x base token cost to account for growth, infrastructure overhead, and experimentation.
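The blended savings can be checked with a quick calculation. The per-query costs at each tier below are assumed, illustrative numbers, not measurements:

```python
# Average cost per query at each tier (assumed, illustrative).
tiers = {"budget": 0.01, "mid": 0.05, "frontier": 0.15}
weights = {"budget": 0.70, "mid": 0.20, "frontier": 0.10}

blended = sum(weights[t] * tiers[t] for t in tiers)
savings = 1 - blended / tiers["frontier"]
print(f"blended ${blended:.4f}/query, {savings:.0%} cheaper than all-frontier")
```

With these placeholder prices the tiered route lands right around the top of the 60–80% range; the cheaper the budget tier relative to frontier, the larger the saving.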
LLM gateways are the new enterprise infrastructure layer. They sit between your applications and LLM providers, handling intelligent routing, automatic failover, cost governance, and security — so teams don't call APIs directly.
Routing
Classify task complexity, send simple queries to cheap models, complex ones to frontier — automatically
Failover
If OpenAI goes down, traffic switches to Anthropic or Google instantly — 99.99% uptime
Cost controls
Per-team budgets, dollar-based quotas, semantic caching to eliminate redundant calls
Governance
RBAC, PII sanitization, prompt templates, audit logs, compliance automation
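The routing step can be sketched as a classifier in front of a model map. This toy length-and-keyword heuristic and the model names are illustrative; production gateways use trained classifiers or LLM judges:

```python
def classify(prompt: str) -> str:
    """Toy complexity heuristic; real gateways use trained classifiers."""
    hard_markers = ("prove", "analyze", "multi-step", "architecture")
    if len(prompt) > 2_000 or any(m in prompt.lower() for m in hard_markers):
        return "frontier"
    return "budget" if len(prompt) < 200 else "mid"

ROUTES = {  # illustrative model choices per tier
    "budget": "gemini-flash-lite",
    "mid": "claude-sonnet",
    "frontier": "claude-opus",
}

def route(prompt: str) -> str:
    return ROUTES[classify(prompt)]

print(route("What's our office wifi password?"))         # lands on budget tier
print(route("Analyze the failure modes of this design"))  # lands on frontier
```

The key design point is that applications only ever see `route()`; swapping which model backs a tier never touches application code.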
Gateway landscape
Multi-modal LLM support with fine-tuning
Enterprise-grade AI gateway connecting developers to multiple LLMs. Intelligent routing, failover, cost optimization. Supports text, image, audio, and vision models. Active cost management across providers.
Full-stack AI platform for enterprises
Combines orchestration, governance, and scalability. MCP and Agents Registry for centralized tool management. GPU orchestration with up to 80% higher utilization. Prompt lifecycle management with versioning and monitoring.
Semantic routing + PII sanitization
Extends Kong's API management platform to LLM routing. Semantic caching saves on redundant prompts. Built-in PII stripping and prompt guards. Pre-built dashboards for AI-specific analytics. Best for teams already on Kong.
11μs overhead per request
Open-source AI gateway built in Go. 20+ providers through a single OpenAI-compatible API. MCP integration, Prometheus metrics, distributed tracing. Drop-in replacement — switch from direct SDK calls with one line of code.
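The "one line" claim rests on the OpenAI-compatible API surface: point an existing client at the gateway's base URL and nothing else changes. A minimal sketch of that idea; the gateway host and port here are assumptions, not documented defaults:

```python
import json
from urllib.request import Request

# Direct-to-provider vs. through-the-gateway: with an OpenAI-compatible
# gateway, the request body and code path stay identical. Only the URL
# constant changes. Gateway host/port below are illustrative assumptions.
DIRECT_URL = "https://api.openai.com/v1/chat/completions"
GATEWAY_URL = "http://localhost:8080/v1/chat/completions"

def chat_request(url: str, model: str, prompt: str) -> Request:
    """Build an OpenAI-compatible chat-completions request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return Request(url, data=body, headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer <your-key>",
    })

# Same function, same payload; swapping one constant reroutes all traffic:
direct = chat_request(DIRECT_URL, "gpt-4o", "hello")
via_gateway = chat_request(GATEWAY_URL, "gpt-4o", "hello")
```

The same pattern is why any OpenAI-compatible gateway can claim a one-line migration: clients never learn a new request schema, only a new address.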
Global edge + unified billing
Leverages Cloudflare's global edge network. Request caching, rate limiting, usage analytics. Unified billing across OpenAI, Anthropic, and Google. No infrastructure setup required — dashboard-managed.
Fastest path to multi-model
OpenRouter: single API key for 100+ models, automatic fallback, unified billing. Best for quick multi-model access. LiteLLM: open-source Python proxy translating all providers to OpenAI-compatible format. Virtual keys, spend tracking per team. Best for Python-centric prototyping.
🏛 The IDP layer: governance on top
The LLM gateway is infrastructure. The Internal Developer Portal is the governance and developer experience layer on top. It's what turns a multi-model mess into a managed platform: model catalogs, RBAC policies, self-service workflows, compliance audit trails, budget controls by team, and MCP/agent governance. Teams can switch models, introduce fine-tuned versions, or enforce new governance rules — without modifying application code. The diversity of models is exactly why the governance platform matters.