WIP: Hotswap hero train job #1580

TarasRudko · 2024-12-30T07:28:45Z

Description

This PR....

Tasks

The contributing guide has been read and followed.
The samples added / modified have been fully tested.
Workflow files have been added / modified, if applicable.
Region tags have been properly added, if new samples.
Editable variables have been used, where applicable.
All dependencies are set to up-to-date versions, as applicable.
Merge this pull-request for me once it is approved.

code-review-assist

Hi TarasRudko and reviewers,

I'm currently reviewing this pull request and will post my detailed review in a few minutes. In the meantime, here's a quick summary of the changes for you and other reviewers to get up to speed:

This pull request aims to create a hot-swappable hero train job. The changes introduce a new sample that demonstrates how to deploy and manage high-priority and low-priority jobs on a Google Kubernetes Engine (GKE) cluster, leveraging TPUs for accelerated training. The sample uses Terraform to provision the GKE cluster and JobSets to manage the jobs.

Here's a breakdown of the changes:

ai-ml/hotswap-hero-train-job/deploy.sh: This script sets up the environment, creates a terraform.tfvars file, initializes and applies the Terraform configuration to create the GKE cluster, installs JobSets, and creates the high and low-priority jobs. The script is entirely new (lines 1-48).
ai-ml/hotswap-hero-train-job/gke-platform/main.tf: This file defines the Terraform configuration for the GKE cluster. It includes the creation of a service account, granting necessary permissions, and using modules for GKE Autopilot and standard GKE cluster configurations (lines 1-73).
ai-ml/hotswap-hero-train-job/gke-platform/modules/gke_autopilot/main.tf: This module configures a GKE Autopilot cluster (lines 1-96).
ai-ml/hotswap-hero-train-job/gke-platform/modules/gke_autopilot/output.tf: This file defines the outputs of the GKE Autopilot module (lines 1-38).
ai-ml/hotswap-hero-train-job/gke-platform/modules/gke_autopilot/variables.tf: This file defines the input variables of the GKE Autopilot module (lines 1-64).
ai-ml/hotswap-hero-train-job/gke-platform/modules/gke_standard/main.tf: This module configures a standard GKE cluster with TPU node pools (lines 1-173).
ai-ml/hotswap-hero-train-job/gke-platform/modules/gke_standard/output.tf: This file defines the outputs of the standard GKE module (lines 1-38).
ai-ml/hotswap-hero-train-job/gke-platform/modules/gke_standard/variables.tf: This file defines the input variables of the standard GKE module (lines 1-104).
ai-ml/hotswap-hero-train-job/gke-platform/variables.tf: This file defines variables used across the Terraform configuration (lines 1-92).
ai-ml/hotswap-hero-train-job/gke-platform/versions.tf: This file specifies the required Terraform providers and their versions (lines 1-26).
ai-ml/hotswap-hero-train-job/remove.sh: This script tears down the GKE cluster (lines 1-5).
ai-ml/hotswap-hero-train-job/workloads/high-priority-job-autopilot.yaml: This file defines a high-priority job for GKE Autopilot (lines 1-65).
ai-ml/hotswap-hero-train-job/workloads/high-priority-job.yaml: This file defines a high-priority job for standard GKE (lines 1-69).
ai-ml/hotswap-hero-train-job/workloads/low-priority-job-autopilot.yaml: This file defines a low-priority job for GKE Autopilot (lines 1-67).
ai-ml/hotswap-hero-train-job/workloads/low-priority-job.yaml: This file defines a low-priority job for standard GKE (lines 1-69).
ai-ml/hotswap-hero-train-job/workloads/priority.yaml: This file defines priority classes for high and low-priority jobs (lines 1-31).
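
For reviewers unfamiliar with priority classes, the following is a minimal sketch of what a pair of high and low PriorityClass manifests could look like, written as a shell helper so it fits alongside deploy.sh. The class names, numeric values, and descriptions are assumptions for illustration and are not taken from workloads/priority.yaml in this PR; only the `scheduling.k8s.io/v1` PriorityClass fields themselves are standard Kubernetes.

```shell
#!/usr/bin/env bash
set -eu

# Print two PriorityClass manifests to stdout.
# The names, values, and descriptions below are illustrative assumptions,
# not the contents of workloads/priority.yaml in this PR.
render_priority_classes() {
  cat <<'EOF'
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority        # hypothetical name
value: 1000000
globalDefault: false
description: "Hero training job; may preempt lower-priority pods."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority         # hypothetical name
value: 1000
globalDefault: false
description: "Best-effort job; a candidate for preemption."
EOF
}
```

One way to use it would be `render_priority_classes | kubectl apply -f -`, after which a job's pod template can opt in through `priorityClassName`. When capacity is short, the scheduler can preempt pods of the lower class to place pods of the higher one, which is presumably the mechanism behind the hot swap described above.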

I haven't fully tested the functionality yet, but the code looks well-structured and comprehensive. I'll provide more detailed feedback shortly.

Here's a little poem to lighten the mood:

Code flows like a river,
Through branches, a path it carves,
Reviews make it clear.

code-review-assist

This pull request introduces a deployment script and Terraform configuration for setting up a GKE cluster with TPUs and running prioritized workloads. The use of JobSets and PriorityClasses is a good practice for managing these types of workloads. The Terraform configuration is generally well-structured. However, there are several areas for improvement in terms of script robustness, error handling, and adherence to style guides. I've summarized the style guide elements I've referenced below:

Shell Script Style Guide: I've focused on variable naming, quoting, and error handling.
Terraform Style Guide: I've emphasized variable naming conventions and module structure.
Kubernetes Resource Style Guide: I've focused on resource naming and YAML formatting.
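
To make the shell points concrete, here is a rough sketch of the kind of quoting and error handling meant above, applied to the step where deploy.sh writes terraform.tfvars. It is not the script from this PR: the function names `die` and `write_tfvars` and the tfvars keys `project_id` and `region` are hypothetical, chosen only for illustration.

```shell
#!/usr/bin/env bash
# Exit on errors and on use of unset variables.
set -eu
# pipefail is missing from some older shells, so enable it only where supported.
if (set -o pipefail) 2>/dev/null; then set -o pipefail; fi

# Print a message to stderr and abort.
die() {
  printf 'error: %s\n' "$*" >&2
  exit 1
}

# Write a minimal terraform.tfvars file.
# The keys project_id and region are hypothetical, not read from the PR.
write_tfvars() {
  # ${n:?msg} aborts with msg when an argument is missing or empty.
  local project_id="${1:?project_id is required}"
  local region="${2:?region is required}"
  local out_file="${3:?output path is required}"

  # Quote every expansion so paths with spaces survive intact.
  [ -d "$(dirname -- "$out_file")" ] || die "directory for ${out_file} does not exist"

  cat > "$out_file" <<EOF
project_id = "${project_id}"
region     = "${region}"
EOF
}
```

A call such as `write_tfvars "my-project" "us-central2" gke-platform/terraform.tfvars` (placeholder values) then fails fast with a readable message instead of letting a later `terraform apply` run against a half-written file.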

I recommend addressing the following points before merging this PR:

code-review-assist · 2024-12-30T07:31:05Z