Infrastructure

Cloud2 / OPEA Enterprise-Inference

Enterprise inference infrastructure + open source contributions

Cloud2 Labs

8 technologies
3 key decisions
5 results

Problem

Deploying LLMs in enterprise environments involves more than running a model: it requires high-throughput serving, cost-optimized model routing, full observability, repeatable build pipelines, and integration with existing enterprise auth and networking stacks. Cloud2 Labs needed an inference platform that could serve multiple models, track token usage per tenant, and route each query to a model tier based on its complexity and cost.

Approach

I built and maintain the inference infrastructure stack at Cloud2 Labs and contribute upstream to Intel's OPEA (Open Platform for Enterprise AI) project. The stack uses:

  • Docker/BuildKit for reproducible multi-stage image builds
  • vLLM for high-throughput transformer serving (continuous batching, PagedAttention)
  • TGI as a secondary serving backend
  • RouteLLM for routing queries between model tiers based on complexity classification
  • Langfuse for LLM observability and trace analysis
  • Grafana for infrastructure metrics dashboards

Contributions to OPEA Enterprise-Inference include pipeline components and deployment configurations.

Architecture

Cloud2 / OPEA Enterprise-Inference: system diagram

  • Client / API Gateway
  • RouteLLM Router
  • vLLM (primary)
  • TGI (secondary)
  • Langfuse (traces)
  • Grafana (metrics)
  • Docker / BuildKit

Key Technical Decisions

01

RouteLLM for cost-optimized routing

Not every query needs the most capable model. RouteLLM classifies queries by complexity and routes simple queries to a cheaper/faster model tier, reserving the full-capability model for complex reasoning tasks. This reduced inference costs significantly without measurable quality degradation for the majority of queries.
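The routing idea can be illustrated with a minimal sketch. This is not RouteLLM's API or its learned router: the tier names, the hint words, the scoring heuristic, and the 0.4 threshold are all hypothetical stand-ins chosen only to show the shape of complexity-based routing.

```python
import re

# Hypothetical tier names; the actual model identifiers are not given here.
STRONG_MODEL = "strong-tier"
WEAK_MODEL = "weak-tier"

# Hypothetical words that suggest a query needs multi-step reasoning.
REASONING_HINTS = {"why", "prove", "derive", "compare", "explain", "analyze"}


def complexity_score(query: str) -> float:
    """Toy stand-in for a learned complexity classifier; returns a value in [0, 1]."""
    tokens = re.findall(r"[a-z']+", query.lower())
    # Longer queries score higher, saturating at 50 tokens.
    length_term = min(len(tokens) / 50.0, 1.0)
    # Any reasoning hint word contributes a fixed bonus.
    hint_term = 1.0 if any(tok in REASONING_HINTS for tok in tokens) else 0.0
    return 0.5 * length_term + 0.5 * hint_term


def route(query: str, threshold: float = 0.4) -> str:
    """Send the query to the strong tier only when its score reaches the threshold."""
    return STRONG_MODEL if complexity_score(query) >= threshold else WEAK_MODEL


if __name__ == "__main__":
    for q in (
        "What is the capital of France?",
        "Explain why continuous batching improves throughput.",
    ):
        print(f"{route(q):12s} <- {q}")
```

Raising the threshold sends more traffic to the cheaper tier and lowers cost at the risk of weaker answers; in a real deployment that trade-off would be tuned against quality measurements rather than fixed by hand.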

02

vLLM + PagedAttention for throughput

vLLM's continuous batching and PagedAttention memory management achieve 3–5× higher throughput than naive Hugging Face Transformers serving under concurrent requests. This is critical in a multi-tenant enterprise environment where request bursts are common.
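To show why continuous batching helps under mixed request lengths, here is a toy simulation that counts decode steps for static versus continuous scheduling. It is a simplified model under stated assumptions, not vLLM's scheduler: it ignores prefill, KV-cache memory limits, and PagedAttention, and the request lengths are made up. It does not reproduce the throughput figure above.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes before the next batch starts."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Freed slots are refilled from the queue at every decode step."""
    waiting = deque(lengths)
    running: List[int] = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots (first come, first served).
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step produces one token for every running request.
        steps += 1
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


if __name__ == "__main__":
    # Hypothetical output lengths (in tokens) alternating short and long requests.
    lengths = [2, 8, 2, 8, 2, 8, 2, 8]
    print("static:    ", static_batching_steps(lengths, batch_size=2))
    print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

In the static case every short request holds its slot idle until the long request in the same batch completes, which is the waste that iteration-level scheduling removes.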

03

Langfuse for LLM-specific observability

Traditional APM tools don't understand token counts, prompt templates, or model version differences. Langfuse provides trace-level visibility into every LLM call (input/output tokens, latency by model, cost per trace), which is essential for both debugging and cost attribution.
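The cost attribution that trace-level token counts make possible can be sketched as follows. This does not use the Langfuse SDK: the `TraceRecord` fields, the tier names, and the per-token prices are hypothetical, chosen only to show how per-trace cost rolls up to a per-tenant total.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable

# Hypothetical USD prices per 1,000 tokens as (input, output); not real pricing.
PRICE_PER_1K = {
    "weak-tier": (0.0005, 0.0015),
    "strong-tier": (0.01, 0.03),
}


@dataclass(frozen=True)
class TraceRecord:
    """Hypothetical shape of one exported trace; not a Langfuse type."""
    tenant: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float


def trace_cost(trace: TraceRecord) -> float:
    """Cost of one call, priced separately for input and output tokens."""
    in_price, out_price = PRICE_PER_1K[trace.model]
    return (trace.input_tokens / 1000) * in_price + (trace.output_tokens / 1000) * out_price


def cost_by_tenant(traces: Iterable[TraceRecord]) -> Dict[str, float]:
    """Sum per-trace cost for each tenant."""
    totals: Dict[str, float] = defaultdict(float)
    for trace in traces:
        totals[trace.tenant] += trace_cost(trace)
    return dict(totals)


if __name__ == "__main__":
    traces = [
        TraceRecord("acme", "strong-tier", 1000, 1000, 850.0),
        TraceRecord("acme", "weak-tier", 2000, 1000, 120.0),
        TraceRecord("globex", "weak-tier", 1000, 0, 40.0),
    ]
    for tenant, total in cost_by_tenant(traces).items():
        print(f"{tenant}: ${total:.4f}")
```

The same records could be grouped by model instead of tenant to compare latency or spend across tiers.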

Results

  • 3–5× throughput improvement over naive serving via vLLM continuous batching
  • Cost reduction via RouteLLM complexity-based routing across model tiers
  • Full trace-level observability on every inference call via Langfuse
  • Reproducible builds via Docker/BuildKit multi-stage pipeline
  • Open source contributions to Intel's OPEA Enterprise-Inference project

Tech Stack

Docker, BuildKit, vLLM, TGI, Langfuse, Grafana, RouteLLM, Python
