Empty server rack glowing red with one isolated AI accelerator card, representing IBM RITS and vLLM inference infrastructure.

IBM Puts vLLM at the Heart of RITS, Serving 1,300 Researchers

IBM Research has rebuilt the engine room of its internal AI work around vLLM, the PyTorch Foundation’s open-source inference and serving engine, turning its Research Inference and Tuning Service into a centralized, vendor-neutral hub. As of April 2026, the RITS Platform hosts more than 100 large language models at any given moment and serves over 1,300 active researchers across IBM, with vLLM acting as the default serving runtime, OpenAI-compatible API layer, and the route by which IBM’s own Spyre accelerator gets first-class support.

The choice is not just technical plumbing. It is IBM’s clearest signal yet that the inference layer of enterprise AI, the part that actually serves answers to users, will be open source by default and heterogeneous by design.

What RITS Actually Is, In Numbers

RITS is a shared platform that lets any IBM researcher hit a model endpoint without standing up their own GPU stack. According to the PyTorch Foundation’s April 2026 case study on the platform, RITS launched in mid-November 2024 and now exposes more than 100 concurrent model endpoints to a community of 1,300-plus active users.

The plumbing underneath is specific. vLLM serves every model. Red Hat OpenShift AI plus KServe handles orchestration. IBM’s Turbonomic product manages 1-to-n autoscaling, while serverless logic handles cold starts from zero. Custom metrics like vLLM’s “Requests Waiting” counter, rather than basic requests-per-second, drive scaling decisions, because GPU jobs do not behave like web traffic.

The core editorial point: RITS is not a research toy. It is a production-shaped platform that IBM is using to standardize how its own teams ship AI, and the design choices map almost one to one onto what mid-sized enterprises will face in the next 18 months.

Why IBM Picked vLLM Over a Proprietary Stack

The math made it hard to choose anything else. vLLM’s PagedAttention algorithm cuts wasted KV cache memory from the 60 to 80 percent typical of older serving stacks down to under 4 percent, which translates directly into bigger batch sizes and lower cost per token. A November 2025 arXiv comparative study of vLLM versus HuggingFace TGI found vLLM delivered up to 24 times higher throughput under high-concurrency loads.

Priya Nagpurkar, Vice President for AI Platform at IBM Research, framed the choice in community terms.

“The vLLM community is vibrant and responsive, and with collaborative expertise, we are able to do great things both upstream and internally by leveraging and contributing to this project. vLLM has been critical to democratizing access to our research community to the latest and greatest LLMs as they release,” said Priya Nagpurkar, Vice President, AI Platform, IBM Research.

That last phrase carries weight. Inside RITS, a researcher who wants to run a 70B parameter model that was released this morning does not have to file a hardware ticket. The model lands as a vLLM endpoint, behind an API key, behind a gateway, often the same day.

The Cost Story Other Enterprises Already Know

Stripe’s machine-learning platform team has reported a 73 percent cut in inference costs after migrating from Hugging Face Transformers to vLLM, processing the same 50 million daily API calls on roughly a third of the prior GPU fleet. LinkedIn has reported a 7 percent improvement in time-per-output-token after a similar move.

Industry estimates put more than 400,000 GPUs worldwide running vLLM concurrently as of early 2026. That is the install base IBM is now betting its internal research throughput on.

The Spyre Card Hidden In the Stack

The most strategic line in IBM’s RITS announcement has nothing to do with throughput. It is the sentence committing Spyre, IBM’s in-house AI accelerator, to vLLM as a first-class backend.

By using vLLM as the abstraction, IBM gets to slot its own silicon under the same API surface that already serves Nvidia and AMD GPUs. The community-maintained vllm-spyre plugin on the vLLM project’s GitHub already exposes paged attention on Spyre hardware, with deeper torch.compile and multi-card collective work scheduled for the first half of 2026.

The implication is concrete. Every workload IBM eventually shifts from rented Nvidia capacity to owned Spyre capacity is a workload that does not need a code rewrite, because vLLM hides the hardware boundary. That is the same playbook AMD has used to make ROCm a credible Nvidia alternative on Instinct GPUs.

A Kubernetes Donation That Reshapes the Field

RITS does not stand alone. At KubeCon Europe 2026 in Amsterdam, IBM Research, Red Hat, and Google Cloud jointly donated llm-d to the Cloud Native Computing Foundation as a sandbox project, joined by NVIDIA, CoreWeave, AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI.

llm-d is the cluster-scale layer that sits above vLLM. It splits inference into separate prefill and decode pods, routes requests by KV-cache state and pod load, and handles disaggregated scheduling across heterogeneous accelerators. The llm-d project’s own April 2026 release notes claim a 40 percent reduction in per-output-token latency for DeepSeek V3.1 served on Nvidia H200 GPUs, with new disaggregation support for Intel XPU and Google TPU silicon.

Read together, the picture is coherent. vLLM is the engine. llm-d is the cluster scheduler. RITS is IBM’s reference deployment. And the CNCF donation pushes the whole thing out of any one vendor’s gravity well.

The Open-Source Power Play Hyperscalers Now Have to Answer

Microsoft, Google, and Amazon all sell managed inference. Each has a clear interest in keeping enterprise customers inside a single cloud’s billing surface. The vLLM plus llm-d combination is, in effect, a portable replacement for that lock-in.

Inferact, the company spun out of the vLLM core team, raised 150 million dollars in seed funding in January 2026 to build a commercial layer on top of the project, on a deliberate Linux-style open-core model. That gives risk-averse procurement teams a paid support contract while keeping the runtime itself vendor-neutral.

Thomas Parnell, a Principal Research Scientist at IBM, captured the portability thesis bluntly in IBM Research’s October 2025 update on its open-source AI work: “You can write a Triton kernel and it will work on Nvidia GPUs, AMD GPUs, Intel GPUs.” That sentence is the entire enterprise pitch for vLLM compressed into 16 words.

The competitive context is broader than IBM. Coverage of the Big Nine technology firms shaping the AI stack has tended to frame Microsoft, Google, and AWS as fixed pillars. The RITS architecture is one of the cleanest signs yet that the inference layer underneath those pillars is becoming commodity software.

The Cost Math Buyers Are Quietly Running

For a CIO weighing whether to copy IBM’s pattern, the math is becoming concrete. Three numbers anchor the decision.

  • 73 percent. Reported Stripe inference cost reduction after the move to vLLM, on the same 50 million daily API calls.
  • 40 percent. Per-output-token latency reduction for DeepSeek V3.1 on H200 GPUs under llm-d v0.4 disaggregated serving, per the project’s April 2026 release notes.
  • 1,300+. Active internal users IBM Research is already serving from a single shared RITS deployment as of April 2026.

The cost story is what makes the governance story urgent.

Where the Plan Could Still Break

Centralized inference platforms create centralized risk. If one shared endpoint serves a thousand researchers, a single misconfigured model can leak training data to a thousand sessions. RITS handles part of this with self-managed API keys and gateway-level rate limits, but the harder problem is policy.

Three pressure points stand out for any enterprise considering the RITS pattern.

  • Model provenance. When a research team can spin up an endpoint on a model that was published last week, audit logs and license review have to keep up.
  • Data residency. A shared inference fleet that spans regions has to enforce where prompts and KV caches actually sit on disk.
  • Cost attribution. Centralized GPU pools get expensive fast if no one team owns the bill. Turbonomic-style autoscaling helps, but chargeback discipline still has to come from finance, not the platform.

None of these are theoretical. They are the same governance failures that hit early shared Hadoop clusters a decade ago, and the same ones that will decide whether the RITS pattern becomes the default or stays a research-team novelty.

What To Watch Next

Three near-term milestones will tell whether the IBM bet is generalizing. First, the depth of Spyre support inside vLLM through the second half of 2026, especially multi-card collective ops. Second, llm-d’s CNCF graduation track and whether AWS or Azure ship managed offerings on top of it. Third, the arrival of commercial vLLM support from Inferact and Red Hat AI in the same enterprise procurement cycles.

The same week IBM’s RITS case study landed, Anthropic’s Claude Opus 4.7 release pushed coding benchmarks to new highs, a reminder that the model layer is moving faster than ever. The serving layer underneath has to keep pace, and IBM’s bet is that an open one will.

vLLM is the engine. llm-d is the cluster scheduler. RITS is IBM’s reference deployment.

Frequently Asked Questions

What is the IBM RITS Platform?

RITS, the Research Inference and Tuning Service, is IBM Research’s centralized internal platform for hosting LLM inference and tuning endpoints. It launched in November 2024 and serves more than 1,300 active IBM researchers across over 100 hosted models as of April 2026.

Why did IBM choose vLLM as the RITS serving engine?

vLLM offers PagedAttention memory efficiency, OpenAI-compatible APIs, continuous batching, and a hardware-agnostic runtime that can target Nvidia, AMD, Intel, and IBM Spyre silicon. Reported enterprise gains include up to 24 times higher throughput than older serving stacks and 73 percent cost reductions in production cases like Stripe’s.

How does llm-d relate to vLLM and RITS?

llm-d is a Kubernetes-native distributed inference layer that uses vLLM as its default model server. IBM Research, Red Hat, and Google Cloud donated llm-d to the Cloud Native Computing Foundation as a sandbox project at KubeCon Europe 2026.

What is the IBM Spyre accelerator’s role here?

Spyre is IBM’s purpose-built AI accelerator. By integrating it through the vLLM serving runtime via the vllm-spyre plugin, IBM can shift workloads from Nvidia GPUs to its own silicon without rewriting application code, which is the strategic point of the entire RITS architecture.

Can other enterprises copy the RITS model?

Yes, the components are open source. Any organization can stand up vLLM behind a Kubernetes gateway, layer in llm-d for distributed scheduling, and add commercial support through Red Hat AI or Inferact. The harder work is governance: API key management, model-license review, and cost attribution across teams.