Description

Learn how to move AI and data workloads from development to production with Kubernetes. This session covers reproducibility, resource scheduling, and managing large datasets—giving you practical strategies to build scalable, resilient systems.

You've built your models and pipelines, perfected your algorithms, and proven your hypotheses. Now, what's the most reliable and efficient way to move your work from a local environment to a production system that handles real-world scale? This is where a background in cloud infrastructure and site reliability engineering (SRE) becomes essential. In this tutorial, I'll share my perspective on how to operationalize your data and AI workloads using Kubernetes. We'll focus on the infrastructure challenges that often block projects in their final stages: ensuring reproducibility, securing dedicated hardware, and managing data and model artifacts at scale.

This session will show you how core Kubernetes concepts directly solve your operational pain points:

- Reproducibility: Solving the "it works on my machine" problem to guarantee your environment is consistent from development to production.
- Scheduling & Resources: How to declare your need for specific resources, like GPUs, and ensure your jobs are placed on the right hardware, every time.
- Stateful Workloads: How to properly manage and persist the large datasets your models depend on for training and inference.
- Extending Kubernetes: Understanding how purpose-built tools and controllers can automate the deployment and management of complex frameworks with a single command.

This talk provides a clear mental model for how complex applications are operationalized. Attendees will leave with a practical understanding of the architectural principles and operational best practices needed to build and manage resilient, production-ready systems.
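The scheduling and stateful-workload ideas above can be sketched in a single minimal manifest. This is an illustrative example, not material from the session itself: the Pod name, image, and PVC name are hypothetical placeholders, and the `nvidia.com/gpu` resource key assumes the NVIDIA device plugin is installed on the cluster.

```yaml
# Hypothetical training Pod: requests one GPU so the scheduler
# places it on GPU hardware, and mounts a PersistentVolumeClaim
# so the training dataset persists independently of the Pod.
apiVersion: v1
kind: Pod
metadata:
  name: train-job                         # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: my-registry/trainer:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1               # requires the NVIDIA device plugin
      volumeMounts:
        - name: training-data
          mountPath: /data
  volumes:
    - name: training-data
      persistentVolumeClaim:
        claimName: dataset-pvc            # illustrative PVC name
```

Declaring the GPU as a resource limit lets the Kubernetes scheduler, rather than the operator, guarantee placement on suitable hardware; the PVC decouples the dataset's lifecycle from any single Pod.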

Details

October 3, 2025

9:35 am - 10:55 am

Cypress 1

Track: Tutorial

Level: Intermediate

Presenters

Shefali Victors
Strategic Architect
Rackspace