metaKube

KubeRay

Running Ray at Scale on Kubernetes, Made Simple

As AI and machine learning workloads grow in complexity, teams need infrastructure that scales without turning operations into a second full-time job. This is where KubeRay comes in.

Source code: ray-project/kuberay on GitHub, a toolkit to run Ray applications on Kubernetes.

KubeRay is an open-source Kubernetes operator purpose-built to simplify how you deploy, run, and manage Ray workloads on Kubernetes. Whether you’re training models, serving LLMs, or running large-scale batch inference, KubeRay bridges the gap between Ray’s distributed compute model and Kubernetes’ cloud-native orchestration.

What Is KubeRay?

At its core, KubeRay provides Kubernetes-native abstractions for running Ray applications. Instead of manually wiring pods, services, autoscaling rules, and lifecycle logic, KubeRay lets you define everything declaratively using Custom Resource Definitions (CRDs).
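Because these are ordinary Kubernetes custom resources, they are addressed like any other API object. The sketch below builds the REST path for each of the three kinds this article covers. It assumes the `ray.io/v1` API version served by recent KubeRay releases; the helper name `resource_path` is made up for illustration.

```python
from typing import Optional

# Assumption: recent KubeRay releases serve their CRDs as ray.io/v1.
RAY_API_GROUP = "ray.io"
RAY_API_VERSION = "v1"

# kind -> plural resource name, as used in API paths and `kubectl get <plural>`
KUBERAY_PLURALS = {
    "RayCluster": "rayclusters",
    "RayJob": "rayjobs",
    "RayService": "rayservices",
}


def resource_path(kind: str, namespace: str, name: Optional[str] = None) -> str:
    """Return the Kubernetes API path for a namespaced KubeRay resource.

    Namespaced custom resources live under
    /apis/<group>/<version>/namespaces/<namespace>/<plural>[/<name>].
    """
    plural = KUBERAY_PLURALS[kind]  # KeyError for kinds KubeRay does not define
    base = f"/apis/{RAY_API_GROUP}/{RAY_API_VERSION}/namespaces/{namespace}/{plural}"
    return f"{base}/{name}" if name else base


if __name__ == "__main__":
    for kind in KUBERAY_PLURALS:
        print(kind, "->", resource_path(kind, "default"))
```

The same group, version, and plural values are what a generic Kubernetes client needs in order to create or list these objects.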

KubeRay consists of two main parts:

  • KubeRay Core – the fully maintained, production-ready foundation
  • The KubeRay Ecosystem – optional tools that improve usability and visibility

KubeRay Core: The Foundation

KubeRay Core introduces three CRDs that cover most Ray workloads:

RayCluster

A RayCluster represents a full Ray cluster running on Kubernetes. KubeRay manages the entire lifecycle for you:

  • Cluster creation and deletion
  • Autoscaling workers up and down
  • Fault tolerance and recovery

This is ideal for long-running or shared compute environments.
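As a rough illustration, a RayCluster with an autoscaled worker group can be described as a plain manifest. The sketch below builds one as a Python dictionary and prints it as JSON, which `kubectl apply -f -` accepts. The cluster name, resource sizes, and image tag are placeholder assumptions, not recommendations; pick an image that matches your Ray version.

```python
import json


def pod_template(container_name: str, image: str, cpu: str, memory: str) -> dict:
    """Pod template with one Ray container and matching requests/limits."""
    resources = {"cpu": cpu, "memory": memory}
    return {
        "spec": {
            "containers": [
                {
                    "name": container_name,
                    "image": image,
                    "resources": {"requests": dict(resources), "limits": dict(resources)},
                }
            ]
        }
    }


def build_ray_cluster(
    name: str,
    min_workers: int,
    max_workers: int,
    image: str = "rayproject/ray:2.41.0",  # assumption: example tag only
) -> dict:
    """Return a RayCluster manifest with one head and one autoscaled worker group."""
    if not 0 <= min_workers <= max_workers:
        raise ValueError("need 0 <= min_workers <= max_workers")
    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayCluster",
        "metadata": {"name": name},
        "spec": {
            # Ask KubeRay to run the Ray autoscaler alongside the head node.
            "enableInTreeAutoscaling": True,
            "headGroupSpec": {
                "rayStartParams": {},
                "template": pod_template("ray-head", image, "1", "4Gi"),
            },
            "workerGroupSpecs": [
                {
                    "groupName": "workers",
                    "replicas": min_workers,      # initial size
                    "minReplicas": min_workers,   # autoscaler lower bound
                    "maxReplicas": max_workers,   # autoscaler upper bound
                    "rayStartParams": {},
                    "template": pod_template("ray-worker", image, "2", "8Gi"),
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_ray_cluster("demo-cluster", 1, 5), indent=2))
```

Saved under a hypothetical name such as `cluster.py`, the output could be piped straight to the cluster with `python cluster.py | kubectl apply -f -`.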


RayJob

A RayJob is designed for batch and offline workloads.

With a single resource definition, KubeRay will:

  1. Create a RayCluster
  2. Wait until it’s ready
  3. Submit your Ray job
  4. (Optionally) tear the cluster down when the job finishes

It’s a clean, Kubernetes-native way to run one-off or scheduled workloads without manual cleanup.
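The four steps above map onto a handful of fields in the RayJob spec. The following is a minimal sketch, not a production manifest: the job name, entrypoint script path, and image tag are hypothetical, and the embedded cluster is deliberately tiny.

```python
import json


def build_ray_job(
    name: str,
    entrypoint: str,
    cleanup: bool = True,
    ttl_seconds: int = 60,
    image: str = "rayproject/ray:2.41.0",  # assumption: example tag only
) -> dict:
    """Return a RayJob manifest that creates its own cluster and runs `entrypoint`."""
    container = {"name": "ray", "image": image}
    spec = {
        # Step 3: the command KubeRay submits once the cluster is ready.
        "entrypoint": entrypoint,
        # Step 4: delete the RayCluster when the job finishes.
        "shutdownAfterJobFinishes": cleanup,
        # Steps 1-2: the cluster KubeRay creates and waits for.
        "rayClusterSpec": {
            "headGroupSpec": {
                "rayStartParams": {},
                "template": {"spec": {"containers": [dict(container)]}},
            },
            "workerGroupSpecs": [
                {
                    "groupName": "workers",
                    "replicas": 1,
                    "minReplicas": 1,
                    "maxReplicas": 1,
                    "rayStartParams": {},
                    "template": {"spec": {"containers": [dict(container)]}},
                }
            ],
        },
    }
    if cleanup:
        # Grace period before teardown; only meaningful when cleanup is enabled.
        spec["ttlSecondsAfterFinished"] = ttl_seconds
    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayJob",
        "metadata": {"name": name},
        "spec": spec,
    }


if __name__ == "__main__":
    # Hypothetical script path inside the container image.
    job = build_ray_job("nightly-batch", "python /home/ray/jobs/batch_inference.py")
    print(json.dumps(job, indent=2))
```

Setting `cleanup=False` keeps the cluster around after the job ends, which can help when debugging a failed run.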


RayService

RayService is built for production-grade model serving.

It combines:

  • A RayCluster
  • A Ray Serve deployment graph

RayService supports:

  • Zero-downtime upgrades
  • High availability
  • Safer rollouts for model and infrastructure changes

This makes it well-suited for online inference and API-driven ML services.
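In manifest form, the two parts show up as two fields: `rayClusterConfig` for the cluster and `serveConfigV2` for the Ray Serve applications, the latter passed as a YAML string. The sketch below assembles a minimal RayService under several assumptions: the service name, the `import_path` (`my_models.app:deployment`), the deployment name, and the image tag are all hypothetical.

```python
import json

# Ray Serve config, embedded in the manifest as a YAML string.
# Assumption: `my_models.app:deployment` is a hypothetical import path
# that must exist inside the container image.
SERVE_CONFIG = """\
applications:
  - name: text_app
    import_path: my_models.app:deployment
    route_prefix: /
    deployments:
      - name: Model
        num_replicas: 2
"""


def build_ray_service(
    name: str,
    serve_config: str = SERVE_CONFIG,
    image: str = "rayproject/ray:2.41.0",  # assumption: example tag only
) -> dict:
    """Return a RayService manifest: a Ray cluster plus the Serve apps to run on it."""
    container = {"name": "ray", "image": image}
    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayService",
        "metadata": {"name": name},
        "spec": {
            # Part 2: what to serve.
            "serveConfigV2": serve_config,
            # Part 1: where to serve it.
            "rayClusterConfig": {
                "headGroupSpec": {
                    "rayStartParams": {},
                    "template": {"spec": {"containers": [dict(container)]}},
                },
                "workerGroupSpecs": [
                    {
                        "groupName": "serve-workers",
                        "replicas": 2,
                        "minReplicas": 2,
                        "maxReplicas": 4,
                        "rayStartParams": {},
                        "template": {"spec": {"containers": [dict(container)]}},
                    }
                ],
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_ray_service("text-service"), indent=2))
```

Keeping the Serve config and the cluster config as separate fields is what lets model changes and infrastructure changes be rolled out independently.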


The KubeRay Ecosystem

Beyond the core operator, KubeRay includes optional components that improve developer experience and operations.

kubectl Ray Plugin (Beta)

Starting with KubeRay v1.3.0, the kubectl ray plugin simplifies common workflows—especially for teams new to Kubernetes. It abstracts away much of the complexity involved in managing Ray resources.


KubeRay API Server (Alpha)

The KubeRay API Server provides a simplified configuration layer on top of KubeRay resources. Some organizations use it internally to power custom UIs for Ray cluster and job management.


KubeRay Dashboard (Experimental)

Introduced in v1.4.0, the KubeRay Dashboard offers visibility into clusters, jobs, and services. While not yet production-ready, it’s a promising step toward better observability and management.


Built for the Kubernetes Ecosystem

KubeRay doesn’t operate in isolation. It integrates with the broader Kubernetes ecosystem, including:

  • Observability: Prometheus, Grafana, py-spy
  • Scheduling & Queuing: Volcano, Apache YuniKorn, Kueue
  • Ingress & Networking: NGINX and other controllers

This makes it easier to fit Ray into existing platform architectures.
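Taking queuing as one example, these integrations are often wired up through metadata on the Ray resource rather than through separate tooling. The sketch below adds Kueue's queue-name label to an existing manifest. It assumes Kueue is installed with its RayJob integration enabled and that a LocalQueue called `team-a-queue` (a hypothetical name) exists in the namespace; the helper `with_kueue_queue` is likewise made up for illustration.

```python
import copy
import json

# Label Kueue reads to decide which LocalQueue a workload belongs to.
KUEUE_QUEUE_LABEL = "kueue.x-k8s.io/queue-name"


def with_kueue_queue(manifest: dict, queue_name: str) -> dict:
    """Return a copy of `manifest` labelled for admission through a Kueue LocalQueue."""
    labelled = copy.deepcopy(manifest)  # leave the caller's manifest untouched
    labels = labelled.setdefault("metadata", {}).setdefault("labels", {})
    labels[KUEUE_QUEUE_LABEL] = queue_name
    return labelled


if __name__ == "__main__":
    # Skeleton RayJob; the spec is omitted here to keep the focus on metadata.
    ray_job = {
        "apiVersion": "ray.io/v1",
        "kind": "RayJob",
        "metadata": {"name": "nightly-batch"},
        "spec": {},
    }
    print(json.dumps(with_kueue_queue(ray_job, "team-a-queue"), indent=2))
```

Because the helper works on any manifest dictionary, the same pattern could carry other scheduler-specific labels or annotations if your platform uses them.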


Real-World Adoption at Scale

KubeRay and Ray are already powering production systems across the industry. Companies like Google, Spotify, DoorDash, Reddit, Airbnb, Apple, and AWS use Ray and KubeRay for workloads ranging from distributed training to large-scale inference and model serving.

These deployments suggest that KubeRay is more than a convenience layer: it is infrastructure that has been run at scale.

About the author

metaKube

Kubernetes Operators for metaClusters
