v1.9

@roberthbailey released this 14 Jan 22:15
768bd8a

This release includes a number of new features, improvements, and bug fixes.

New Features

  • Automated Security Scanning: Introduced a Cloud Build workflow that automatically runs Helm scans on the ai-on-gke repository using the Shipshape validation service, and added documentation on how to handle the security violations that Shipshape identifies. (#918, #920)
  • 65k Node GKE Benchmark: Added a GKE benchmark that runs a simulated AI workload using Terraform and ClusterLoader2 and supports clusters with 65,000 nodes. (#898)

Improvements

  • DWS-Kueue Example: Bumped Kueue quotas to 1 billion per resource to provide almost unlimited admission in examples. (#911)
  • Infrastructure Module Queued Provisioning: Added queued_provisioning setting on all node pools for compatibility with DWS. (#909)
  • TCP Receive Buffer Limit Configuration: Added a DaemonSet that configures TCP receive buffer limits for improved DCN performance. (#906)
  • SkyPilot Tutorial: Improved formatting and fixed typos in the SkyPilot tutorial. (#905, #907)
  • Documentation Clarity: Clarified instructions on obtaining the billing account ID and folder ID. (#914)

Bug Fixes

  • E2E Test Flakiness: Increased timeout to mitigate intermittent failures in E2E tests caused by delays in Ray pod startup. (#919)
  • Slurm-on-GKE Container Image: Fixed shebang line position in Slurm-on-GKE container image. (#917)
  • Directory Path: Fixed an incorrect directory path in the Slurm guide. (#912)

New Contributors

Full Changelog: v1.8...v1.9