Releases: GoogleCloudPlatform/ai-on-gke
v1.8
This release includes a number of new features, improvements and bug fixes.
New Features
- Add HTTP streaming support for JetStream: Added functionality for streaming responses via HTTP, enhancing real-time data processing. (#877)
- SkyPilot tutorial: Created a tutorial to demonstrate using SkyPilot to launch batch workloads across regions. (#887)
- Define ephemeral-storage in ClusterQueue: Introduced support for workloads requesting ephemeral storage. (#872)
- Slurm on GKE Guide: Published a guide for deploying Slurm clusters on GKE for AI/ML workloads. (#864)
Improvements
- Benchmarking improvements:
  - Ray job optimization: Explicitly shut down Ray jobs after completing vector embedding tasks to avoid unnecessary runtime. (#735)
- Update permissions for guides:
  - Added storage.objectViewer permission to boot disk guide to resolve access issues. (#893)
- Update Jupyter Notebook image: Introduced a new tag prefix to mitigate internal vulnerability checks. (#886)
- Use TPU network optimizer image: Shifted to an image-based approach for network optimization, improving maintainability. (#870)
Bug Fixes
- Fix vLLM PodMonitoring: Addressed issues related to vLLM monitoring configurations. (#889)
- Fix jupyter hub helm chart version: Pinned the JupyterHub helm chart version to mitigate server spawning errors. (#879)
- TF version dependency mismatch: Resolved TensorFlow version mismatch issues. (#885)
- Ray TPU webhook image update: Bumped image version to fix inconsistent PodInformer updates for large TPU slices. (#891)
New Contributors
- @dsafdsa1 made their first contribution in #872
- @danielmarzini made their first contribution in #864
- @darinpeetz made their first contribution in #870
Full Changelog: v1.7...v1.8
v1.7
This release includes a number of new features, improvements and bug fixes.
New Features
- Added a benchmarking tool for measuring data loading performance with gcsfuse. (#863)
- Added a Prometheus server to the Latency Profile Generator (LPG) running on port 9090, along with new metrics for prompt_length, response_length, and time_per_output_token. (#857)
- Added support for Google Cloud Monitoring and Managed Collection for the gke-batch-refarch. (#856)
- Added a tutorial on packaging models and low-rank adapters (LoRA) from Hugging Face as images, pushing them to Artifact Registry, and deploying them in GKE. (#855)
Improvements
- Updated outdated references to the Text Generation Inference (TGI) container to use the Hugging Face Deep Learning Containers (DLCs) hosted on Google Cloud's Artifact Registry. (#816)
- Added the ability to benchmark multiple models concurrently in the LPG. (#850)
- Added support for "inf" (infinity) request rate and number of prompts in the LPG. (#847)
- Fixed the latency_throughput_curve.sh script to correctly parse non-integer request rates and added "errors" to the benchmark results. (#850)
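As a rough illustration of the two request-rate items above, a load generator commonly treats an "inf" rate as "send every request immediately" and otherwise spaces requests out with random gaps. The sketch below is not the LPG's actual code: parse_request_rate and inter_arrival_delays are hypothetical names, and the Poisson arrival model (exponentially distributed gaps) is an assumption.

```python
import math
import random


def parse_request_rate(text: str) -> float:
    """Parse a request rate given as text, e.g. "5", "0.5", or "inf"."""
    rate = float(text)  # float() accepts "inf" as well as non-integer values
    if math.isnan(rate) or rate <= 0:
        raise ValueError(f"request rate must be positive, got {text!r}")
    return rate


def inter_arrival_delays(rate: float, num_requests: int, seed: int = 0) -> list[float]:
    """Return the delay in seconds to wait before sending each request."""
    if math.isinf(rate):
        # Infinite rate: no waiting, all requests are issued back to back.
        return [0.0] * num_requests
    rng = random.Random(seed)
    # Assumed Poisson arrivals: gaps are exponential with mean 1 / rate.
    return [rng.expovariate(rate) for _ in range(num_requests)]
```

Parsing with float() rather than int() is what lets a value such as "0.5" through, which is the kind of non-integer rate the script fix refers to.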
Bug Fixes
- Fixed an issue where the README was not rendering correctly. (#862)
New Contributors
- @alvarobartt made their first contribution in #816
- @liu-cong made their first contribution in #850
- @coolkp made their first contribution in #855
- @JamesDuncanNz made their first contribution in #856
Full Changelog: v1.6...v1.7
v1.6
New Features
- JetStream Checkpoint Converter Support: Added support for JetStream checkpoint conversion for Llama models on MaxText. (#840)
- Automatic Deployment of CMSA and Autoscaling Config: Enabled automatic deployment of the Custom Metrics Stackdriver Adapter (CMSA) and autoscaling configurations for custom metrics with vLLM. (#825)
- Network Optimization DaemonSet: Introduced a new DaemonSet that applies network optimizations to improve performance, including IRQ spreading, TCP settings, and larger GVE driver packet buffers. (#805)
- Server-Side Metrics Scraping: Added initial implementation for scraping server-side metrics for analysis, with support for vLLM and Jetstream. (#804)
- Pod Label Copying to Node: The TPU Provisioner can now be configured to copy specific Pod labels to Node labels at Node Pool creation time. (#788)
- Configurable Prompt Dataset: The prompt dataset is now configurable, allowing you to customize the prompts used for benchmarking and analysis. (#844)
Improvements
- Benchmarking Script Enhancements:
  - The benchmarking script now uses data gathered directly from the script instead of relying on Prometheus, resulting in more accurate and user-relevant metrics. (#836)
  - Added request_rate to the summary statistics generated by the benchmarking script. (#837)
  - Made the benchmark time configurable and increased the default time to 2 minutes for improved steady-state analysis. (#833)
  - Included additional metrics in the Latency Profile Generator (LPG) script output for more comprehensive analysis. (#832)
  - Ensured that the LPG script output can be collected by changing the LPG to a Deployment and enabling --save-json-results by default. (#811)
- MLFlow Fixes: Resolved issues with multiple experiment versions, duplicate model registrations, and missing system metrics in multi-node scenarios. (#813)
Bug Fixes
- Fixed Internal Links: Updated internal links to publicly accessible Cloud Console links. (#843)
- Removed Unavailable Jetstream Metrics: Removed unavailable Jetstream metrics from the monitoring system. (#841)
- Fixed Throughput Metric: Corrected the throughput metric to be in output tokens per second. (#839)
- Fixed Missing JSON Fields: Added missing JSON fields to the benchmarking script output. (#835)
- Fixed Single GPU Training Job Example: Corrected an issue in the single GPU training job example where the model directory was not being created. (#831)
- Re-enabled CI and Fixed Flaky Tests: Re-enabled continuous integration and addressed issues with OOMKill in the fuse sidecar and database connection flakiness. (#827)
- Removed GKE Cluster Version: Removed the specified GKE cluster version to allow the Terraform configuration to use the latest version in the REGULAR channel. (#817)
- Updated Pause Container Image Location: Updated the location of the pause container image to a more stable and accessible source. (#814)
- Upgraded GKE Module: Upgraded the GKE module to avoid compatibility issues with Terraform Provider Google v5.44.0 on Autopilot. (#806)
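To make the unit in the throughput fix (#839) concrete: output tokens per second is the total number of generated tokens divided by the wall-clock duration of the run. This is a minimal sketch, not the benchmarking script's code; the function name and the sample numbers are illustrative.

```python
def output_tokens_per_second(output_token_counts: list[int], duration_seconds: float) -> float:
    """Total generated tokens across all requests, divided by run duration."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return sum(output_token_counts) / duration_seconds


# Three requests that produced 128, 256, and 64 output tokens over a 4-second run.
throughput = output_tokens_per_second([128, 256, 64], 4.0)  # 448 tokens / 4 s = 112.0
```

Only output tokens are counted here; prompt (input) tokens are deliberately left out of the numerator.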
Other Changes
v1.5
Ray
Ray on GKE Terraform now uses the GKE Ray Add-on when creating GKE clusters (#781)
GKE image builder
Add mirror.gcr.io in containerd configuration to reduce docker rate limiting (#764)
Benchmarks
Add latency profile generator (#775)
Decrease scrape interval of metrics from TGI and DCGM to 15s (#772)
Enable Pod monitoring for vLLM (#796)
Testing
Add e2e tests for Hugging Face TGI tutorial (#780)
v1.4
Quick start solutions:
Ray
- Released v1.2.0, supporting autoscaling RayClusters (#740) and adding reliability improvements (#723)
- Added a helm-chart (#745)
- Bump Ray TPU webhook image (#763)
Rag
- Update RAG frontend docker image (#762)
TPU
- Add HuggingFace support for automated inference checkpoint conversion (#712)
- Jetstream Maxtext Deployment Module: All scale rules now in a single HPA (#730)
- Update pip in JetStream Pytorch and checkpoint Dockerfiles (#739)
- Fix faulty HPA in Jetstream Maxtext module (#741)
- Correct tokenizer for Jetstream Module (#742)
- Make image names optional in Jetstream Maxtext module (#744)
- Terraform modules cleanup (#758)
- TPU Metrics Improvements (#727, #761, #770)
Benchmark
- Update main README.md quickstart guide (#734)
- Add Quantization support for TGI (#757)
- Update README with the latest input variables (#759)
Tutorials and Examples
- Update image url for gemma finetune yaml (#729)
- NIM on GKE Tutorial (#737)
- Add Kueue exemplary setup for reservation and DWS (#746)
Full Changelog: v1.3...v1.4
v1.3
Quick start solutions
- Add finetuning gemma on GKE with L4 GPUs example (#697)
- Jetstream autoscaling guide (#703)
- Enable Ray autoscaler for RAG example application (#722)
- Fix GKE training tutorial (#706)
- Update Kueue to 0.7.0 (#707)
ML Platform release (#715)
- Documentation
- Add notebook packaging guide to docs (#690)
- Added enhancements to the data processing use cases
- Infrastructure
- Added H100 and A100 40GB DWS node pools
- Moved cpu node pool from n2 to n4 machines
- Updated Kueue to 0.7.0
- Added initial test harness
- Configuration Management
- ConfigSync git repository name allows for easier use of multiple environments. Standardized GitOps scripts
- Added a GitLab project module and allowed users to choose between GitHub and GitLab
- Observability
- Added NVIDIA DCGM
- Added environment_name to the Ray dashboard endpoint
- Added Config Controller Terraform module
- Security
- Added an allow rule for the KubeRay Operator to the namespace network policy
- Added Secret Manager add-on to the cluster
TPU Provisioner
- Add admission label selectors and e2e test script (#702)
Benchmarking
General
Add custom metrics stackdriver adapter Terraform module (#718)
Add prometheus adapter Terraform module (#716)
Add Jetstream MaxText Terraform module (#719)
v.1.2
Quick start solutions
Ray
- Enabled TPU webhook on GKE Autopilot (#585)
- Support Multi-slice TPU groups (#453)
- Support multiple worker groups requesting TPUs (#467)
- Added unit tests for Ray TPU webhook (#578)
- Fix GMP on GKE Standard (#689)
RAG
ML Platform
- Intended for platform admins who want a multi-tenant AI/ML platform running on GKE
- Initial release! (#690)
TPU Provisioner
- Add fixes relating to interacting with JobSets (#645)
- Allow forcing use of on-demand nodes and disable auto upgrade for node pools (#656)
- Support location hint label (#666)
- Update usage instructions (#684)
Benchmarking
v1.1.2
Highlights
- RAG, Ray & Jupyter terraform solutions now support GKE Autopilot as the default cluster type #635
- The RAG solution has improved test coverage to (1) validate the notebook that generates vector embeddings as part of the E2E tests #524 (2) validate prompt responses from the LLM with context #511
What's Changed
- Cherrypick AP cloud build stockout mitigation onto release-1.1 by @artemvmin in #580
- Jupyter notebook cherry pick by @chiayi in #600
- quick fix or rag prompt test output by @chiayi in #612
- Fetch the cached weights for Mistral-7B-Instruct-v0.1 from GCS bucket… by @gongmax in #621
- Cherry-pick #599 and #618 to release-1.1 by @roberthbailey in #627
- Cherry-pick #631 to release-1.1 branch by @roberthbailey in #632
- Cherry-pick #635 to release-1.1 branch by @roberthbailey in #637
Full Changelog: v1.1.0...v1.1.2
v1.1.0
We are excited to announce the release of AI on GKE v1.1! This release brings several new features, improvements, and bug fixes to enhance your experience with running AI workloads on Google Kubernetes Engine (GKE).
Highlights
AI on GKE Quick Starts
Get started with popular AI frameworks and tools using new quickstart guides for RAG, Ray and Jupyter notebooks on GKE.
RAG
Retrieval Augmented Generation (RAG) is a technique used to give Large Language Models (LLMs) additional context related to a prompt. RAG has many benefits including providing external information (e.g. from knowledge repositories) and introducing “grounding”, which helps the LLM generate an appropriate response.
The new quick start deploys a RAG stack on a new or existing GKE cluster using open source tools and frameworks such as Ray, LangChain, HuggingFace TGI, and Jupyter notebooks. The model used for inference is Mistral-7B. The solution uses the GCS FUSE driver to load the input dataset quickly and the Cloud SQL pgvector extension to store the generated vector embeddings for RAG. It includes features such as authenticated access for your application via Identity-Aware Proxy, sensitive data protection, and text moderation. See the README to get started.
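The retrieve-then-augment idea behind RAG can be sketched in a few lines. This toy example is not the quick start's implementation: it stands in a bag-of-words count vector for a real embedding model and an in-memory list for the pgvector store, and every name in it (embed, cosine, retrieve, build_prompt) is hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the question."""
    q = embed(question)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def build_prompt(question: str, documents: list[str]) -> str:
    """Prepend the retrieved context to the question before calling the LLM."""
    context = "\n".join(retrieve(question, documents))
    return f"Context:\n{context}\n\nQuestion: {question}"


docs = [
    "Ray scales Python applications across multiple nodes",
    "Cloud SQL can store vector embeddings with the pgvector extension",
]
prompt = build_prompt("Where are vector embeddings stored?", docs)
```

The resulting prompt carries the Cloud SQL sentence as context, which is the "grounding" step: the model answers from retrieved text rather than from its weights alone.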
Ray
Ray is an open-source framework to easily scale up Python applications across multiple nodes in a cluster. Ray provides a simple API for building distributed, parallelized applications, especially for machine learning.
KubeRay enables Ray to be deployed on Kubernetes. You get the wonderful Pythonic unified experience delivered by Ray, and the enterprise reliability and scale of GKE managed Kubernetes. Together, they offer scalability, fault tolerance, and ease of use for building, deploying, and managing distributed applications.
The new quick start deploys KubeRay on a new or existing GKE cluster along with a sample Ray cluster. See the README to get started.
Jupyter
JupyterHub is a powerful, multi-tenant server-based web application that allows users to interact with and collaborate on Jupyter notebooks. Users can create custom computing environments with custom images and computational resources in which to run their notebooks. “Zero to JupyterHub with Kubernetes” (z2jh) is a Helm chart that you can use to install JupyterHub on Kubernetes and that provides numerous configurations for complex user scenarios.
The new quick start solution sets up JupyterHub on GKE. Running your Jupyter notebooks and JupyterHub on Google Kubernetes Engine (GKE) provides a way to prototype your distributed, compute-intensive ML applications with security and scalability built-in as core elements of the platform. See the README to get started.
Ray on GKE guide
Dive deeper into running Ray workloads on GKE with comprehensive guides and tutorials covering various use cases and best practices. See the Ray on GKE README to get started. We’ve also included a new user guide specifically for leveraging TPU Multihost and Multislice Support with Ray.
Inference Benchmarks
Evaluate and compare the performance of different AI models and frameworks on GKE using newly added inference benchmarks. It supports benchmarking popular LLMs like Gemma, Llama 2, Falcon and other models available in Hugging Face. It supports different model servers like Text Generation Inference and Triton with TensorRT-LLM. You can measure the performance of these models and model servers on various GPU types in GKE. To get started, refer to the README.
Guides, Tutorials and Examples
LLM Guides
We’ve introduced the following guides for serving LLMs on GKE:
- Guide to Serving Mistral 7B-Instruct v0.1 on GKE Utilizing Nvidia L4-GPUs
- Guide to Serving Mixtral 8x7 Model on GKE Utilizing Nvidia L4-GPUs
- RAG with Weaviate and Vertex AI
GKE ML Platform
Introducing the first MVP in the GKE ML Platform Solution, featuring:
- Opinionated GKE Platform for AI/ML workloads
- Comes with a sample deployment of Ray
- Infrastructure automated through Terraform and GitOps for cluster configuration management
- Parallel data processing using Ray, accelerating the notebook to cluster experience
- Includes a sample data processing script for a publicly available dataset using Ray.
- Resources:
- Automated Deployment via Terraform: github.com/GoogleCloudPlatform/ai-on-gke/tree/main/best-practices/ml-platform
TPU Provisioner
This release introduces the TPU Provisioner, a controller that automatically provisions new TPU node pools based on the requirements of pending pods, then deprovisions them when they are no longer in use. See the README for how to get started.
Bug fixes and improvements
- Reorganized folders in the ai-on-gke repo
- E2E tests for all quick start deployments are now running on Google Cloud Build
- Introduced the modules directory containing commonly used terraform modules used across our different deployments
- Renamed the gke-platform directory to infrastructure with additional features and capabilities
v1.0.2
Ray Serve
- Introduced support for Ray on Autopilot with 3 predefined worker groups - small (only CPU), medium (1 GPU), and large (8 GPUs): 7082b13
Ray on GKE Storage
#87 provides examples for Ray on GKE storage solutions:
- One-click deploy setup for a GCS bucket plus KubeRay access control
- Leveraging GKE GCS Fuse CSI to access GCS Buckets as a shared filesystem and use standard file semantics (thereby eliminating the need to use specialized fsspec libraries)
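In practice, "standard file semantics" means application code can use ordinary file I/O against the directory where the bucket is mounted, with no storage-specific client library. The sketch below is illustrative only: write_and_read is a hypothetical helper, and a temporary directory stands in for the mount point so the example runs anywhere. In a Pod, the path would be wherever the GCS FUSE CSI volume is mounted.

```python
import tempfile
from pathlib import Path


def write_and_read(mount: Path, name: str, text: str) -> str:
    """Write a file with plain filesystem calls, then read it back."""
    target = mount / name
    target.write_text(text)
    return target.read_text()


# A temporary directory stands in for the bucket mount path (assumption for this demo).
with tempfile.TemporaryDirectory() as mount_dir:
    result = write_and_read(Path(mount_dir), "hello.txt", "written with standard file calls")
```

The same function would work unchanged against a mounted bucket, which is the point of the second item above.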
Ray Data
The Ray data API tutorial with stable diffusion e2e finetuning example (PR) deploys a Ray training job from a Jupyter notebook to a Ray cluster on GKE, and illustrates the following:
- Caching the HuggingFace StableDiffusion model checkpoint in a GCS bucket and mounting it to Ray workers in the Ray cluster hosted on GKE
- Using RayData APIs to perform batch inference to generate regularization images needed for the fine-tuning
- Using RayTrain framework for distributed training with multiple GPUs in a multi-node GKE cluster setup
Kuberay
- Pin Kuberay version to v0.6.0 and helm chart version to v0.6.1
- Install Kuberay operator in a dedicated namespace (ray-system)
Jupyter Notebooks
- Secure authentication via Identity-Aware Proxy (IAP) is now enabled by default for JupyterHub, for both Standard and Autopilot clusters. Here is the sample user guide to configure the IAP client in your JupyterHub installation. This ensures the JupyterHub endpoint is no longer exposed to the public internet.
Distributed training of PyTorch CNN
- JobSet example for distributed training of PyTorch CNN handwritten digit classification model using the MNIST dataset.
- Indexed Job example for distributed training of a PyTorch CNN handwritten digit classification model using the MNIST dataset on NVIDIA T4 GPUs.
Inferencing using Saxml and an HTTP Server
- Example to deploy an HTTP Server to handle HTTP requests to Sax, which has support for features such as model publishing, listing, updating, unpublishing, and generating predictions. With an HTTP server, interaction with Sax can also extend beyond the VM level. For example, integration with GKE and load balancing will enable requests to Sax from inside and outside the GKE cluster.
Finetuning and Serving Llama on L4 GPUs
- Example for finetuning Llama 7B model on GKE using 8 x L4 GPUs
- Example for serving Llama 70B model on GKE with 2 L4 GPUs
Validation of Changes to Ray on GKE Templates
- Pull requests now trigger Cloud Build tests to detect breaking changes made to the GKE platform and Kuberay solution templates.