Cloud2 / OPEA Enterprise-Inference
Enterprise inference infrastructure + open source contributions
Cloud2 Labs
Problem
Deploying LLMs in enterprise environments involves more than running a model: it requires high-throughput serving, cost-optimized model routing, full observability, repeatable build pipelines, and integration with existing enterprise auth and networking stacks. Cloud2 Labs needed an inference platform that could serve multiple models, track token usage per tenant, and route each query to a model tier based on its complexity and cost.
Approach
I built and maintain the inference infrastructure stack at Cloud2 Labs, contributing upstream to Intel's OPEA (Open Platform for Enterprise AI) project. The stack uses Docker/BuildKit for reproducible multi-stage image builds, vLLM as the primary high-throughput transformer serving backend (continuous batching, PagedAttention), and TGI as a secondary serving backend. RouteLLM routes queries between model tiers based on complexity classification, Langfuse provides LLM observability and trace analysis, and Grafana hosts the infrastructure metrics dashboards. My contributions to OPEA Enterprise-Inference include pipeline components and deployment configurations.
Architecture
[Figure: Cloud2 / OPEA Enterprise-Inference system diagram]
Key Technical Decisions
RouteLLM for cost-optimized routing
Not every query needs the most capable model. RouteLLM classifies queries by complexity and routes simple queries to a cheaper/faster model tier, reserving the full-capability model for complex reasoning tasks. This reduced inference costs significantly without measurable quality degradation for the majority of queries.
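The routing idea can be illustrated with a minimal sketch. This is not RouteLLM's API: RouteLLM uses learned routers, whereas the scorer below is a toy heuristic standing in for one. The tier names, the hint list, and the 0.4 threshold are all hypothetical values chosen for illustration.

```python
# Hypothetical tier names; the real model identifiers are not given in the text.
WEAK_MODEL = "small-model"
STRONG_MODEL = "large-model"

# Hypothetical phrases that suggest a query needs multi-step reasoning.
REASONING_HINTS = ("why", "prove", "derive", "compare", "step by step", "explain")


def complexity_score(query: str) -> float:
    """Toy complexity score in [0, 1]; a stand-in for a learned router."""
    text = query.lower()
    # Longer queries score higher, saturating at 50 words.
    length_part = min(len(text.split()) / 50.0, 1.0)
    # Any reasoning hint contributes the full hint weight.
    hint_part = 1.0 if any(hint in text for hint in REASONING_HINTS) else 0.0
    return 0.5 * length_part + 0.5 * hint_part


def route(query: str, threshold: float = 0.4) -> str:
    """Send the query to the strong tier only when its score reaches the threshold."""
    return STRONG_MODEL if complexity_score(query) >= threshold else WEAK_MODEL


if __name__ == "__main__":
    print(route("What is the capital of France?"))
    print(route("Explain step by step why quicksort is O(n log n) on average."))
```

The threshold is the cost/quality dial: raising it sends more traffic to the cheaper tier, and lowering it favors the stronger model.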
vLLM + PagedAttention for throughput
vLLM's continuous batching and PagedAttention memory management achieve 3–5× higher throughput than naive Hugging Face Transformers serving for concurrent requests. This is critical in a multi-tenant enterprise environment where request bursts are common.
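To show why continuous batching helps when output lengths vary, here is a toy simulation that counts decode steps under two scheduling policies. It is a simplified model, not vLLM's scheduler: it assumes each running sequence emits one token per step, every request needs at least one token, and it ignores prefill and memory limits. The function names and the workload are made up for illustration.

```python
def static_batching_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest sequence finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Continuous batching: a freed slot is refilled before the next decode step."""
    queue = list(reversed(lengths))  # pop() from the end yields FIFO order
    active = []                      # tokens still to generate per running sequence
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots.
        while queue and len(active) < batch_size:
            active.append(queue.pop())
        # One decode step: every running sequence emits one token.
        steps += 1
        active = [remaining - 1 for remaining in active if remaining - 1 > 0]
    return steps


if __name__ == "__main__":
    # Hypothetical workload: two long generations mixed with short ones.
    workload = [8, 1, 1, 1, 8, 1, 1, 1]
    print(static_batching_steps(workload, batch_size=4))      # 16
    print(continuous_batching_steps(workload, batch_size=4))  # 9
```

In this workload the static policy spends 16 decode steps because short requests wait on the long one in their batch, while the continuous policy finishes in 9. The ratio here is specific to the toy workload and is not a measurement of the production system.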
Langfuse for LLM-specific observability
Traditional APM tools don't understand token counts, prompt templates, or model version differences. Langfuse provides trace-level visibility into every LLM call (input/output tokens, latency by model, cost per trace), which is essential for both debugging and cost attribution.
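The cost attribution described above amounts to aggregating per-call token counts by tenant. The sketch below shows that aggregation on plain records. It does not use the Langfuse SDK: the `TraceRecord` fields, the model names, and the prices (integer micro-USD per token) are hypothetical placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in micro-USD per token; real prices depend on the deployment.
PRICE_MICRO_USD_PER_TOKEN = {
    "small-model": {"input": 1, "output": 2},
    "large-model": {"input": 10, "output": 30},
}


@dataclass(frozen=True)
class TraceRecord:
    """One LLM call, reduced to the fields needed for cost attribution."""
    tenant: str
    model: str
    input_tokens: int
    output_tokens: int


def cost_micro_usd(record: TraceRecord) -> int:
    """Cost of one call: token counts times the per-token price of its model."""
    price = PRICE_MICRO_USD_PER_TOKEN[record.model]
    return (record.input_tokens * price["input"]
            + record.output_tokens * price["output"])


def cost_by_tenant(records) -> dict:
    """Sum call costs per tenant."""
    totals = defaultdict(int)
    for record in records:
        totals[record.tenant] += cost_micro_usd(record)
    return dict(totals)


if __name__ == "__main__":
    records = [
        TraceRecord("acme", "small-model", 100, 50),
        TraceRecord("acme", "large-model", 200, 100),
        TraceRecord("globex", "small-model", 10, 10),
    ]
    print(cost_by_tenant(records))  # {'acme': 5200, 'globex': 30}
```

Integer micro-USD prices keep the totals exact; grouping by `model` instead of `tenant` gives the per-model view with the same loop.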
Results
- ✓ 3–5× throughput improvement over naive serving via vLLM continuous batching
- ✓ Cost reduction via RouteLLM complexity-based routing across model tiers
- ✓ Full trace-level observability on every inference call via Langfuse
- ✓ Reproducible builds via Docker/BuildKit multi-stage pipeline
- ✓ Open source contributions to Intel's OPEA Enterprise-Inference project
Tech Stack
Docker/BuildKit, vLLM, TGI, RouteLLM, Langfuse, Grafana