-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Hotswap hero train job #1580
base: main
Are you sure you want to change the base?
WIP: Hotswap hero train job #1580
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi TarasRudko and reviewers,
I'm currently reviewing this pull request and will post my detailed review in a few minutes. In the meantime, here's a quick summary of the changes for you and other reviewers to get up to speed:
This pull request aims to create a hot-swappable hero train job. The changes introduce a new sample that demonstrates how to deploy and manage high-priority and low-priority jobs on a Google Kubernetes Engine (GKE) cluster, leveraging TPUs for accelerated training. The sample uses Terraform to provision the GKE cluster and JobSets to manage the jobs.
Here's a breakdown of the changes:
ai-ml/hotswap-hero-train-job/deploy.sh
: This script sets up the environment, creates aterraform.tfvars
file, initializes and applies the Terraform configuration to create the GKE cluster, installs JobSets, and creates the high and low-priority jobs. The script is entirely new (lines 1-48).ai-ml/hotswap-hero-train-job/gke-platform/main.tf
: This file defines the Terraform configuration for the GKE cluster. It includes the creation of a service account, granting necessary permissions, and using modules for GKE Autopilot and standard GKE cluster configurations (lines 1-73).ai-ml/hotswap-hero-train-job/gke-platform/modules/gke_autopilot/main.tf
: This module configures a GKE Autopilot cluster (lines 1-96).ai-ml/hotswap-hero-train-job/gke-platform/modules/gke_autopilot/output.tf
: This module defines outputs for the GKE Autopilot cluster (lines 1-38).ai-ml/hotswap-hero-train-job/gke-platform/modules/gke_autopilot/variables.tf
: This module defines variables for the GKE Autopilot cluster (lines 1-64).ai-ml/hotswap-hero-train-job/gke-platform/modules/gke_standard/main.tf
: This module configures a standard GKE cluster with TPU node pools (lines 1-173).ai-ml/hotswap-hero-train-job/gke-platform/modules/gke_standard/output.tf
: This module defines outputs for the standard GKE cluster (lines 1-38).ai-ml/hotswap-hero-train-job/gke-platform/modules/gke_standard/variables.tf
: This module defines variables for the standard GKE cluster (lines 1-104).ai-ml/hotswap-hero-train-job/gke-platform/variables.tf
: This file defines variables used across the Terraform configuration (lines 1-92).ai-ml/hotswap-hero-train-job/gke-platform/versions.tf
: This file specifies the required Terraform providers and their versions (lines 1-26).ai-ml/hotswap-hero-train-job/remove.sh
: This script tears down the GKE cluster (lines 1-5).ai-ml/hotswap-hero-train-job/workloads/high-priority-job-autopilot.yaml
: This file defines a high-priority job for GKE Autopilot (lines 1-65).ai-ml/hotswap-hero-train-job/workloads/high-priority-job.yaml
: This file defines a high-priority job for standard GKE (lines 1-69).ai-ml/hotswap-hero-train-job/workloads/low-priority-job-autopilot.yaml
: This file defines a low-priority job for GKE Autopilot (lines 1-67).ai-ml/hotswap-hero-train-job/workloads/low-priority-job.yaml
: This file defines a low-priority job for standard GKE (lines 1-69).ai-ml/hotswap-hero-train-job/workloads/priority.yaml
: This file defines priority classes for high and low-priority jobs (lines 1-31).
I haven't fully tested the functionality yet, but the code looks well-structured and comprehensive. I'll provide more detailed feedback shortly.
Here's a little poem to lighten the mood:
Code flows like a river,
Through branches, a path it carves,
Reviews make it clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pull request introduces a deployment script and Terraform configuration for setting up a GKE cluster with TPUs and running prioritized workloads. The use of JobSets and PriorityClasses is a good practice for managing these types of workloads. The Terraform configuration is generally well-structured. However, there are several areas for improvement in terms of script robustness, error handling, and adherence to style guides. I've summarized the style guide elements I've referenced below:
- Shell Script Style Guide: I've focused on variable naming, quoting, and error handling.
- Terraform Style Guide: I've emphasized variable naming conventions and module structure.
- Kubernetes Resource Style Guide: I've focused on resource naming and YAML formatting.
I recommend addressing the following points before merging this PR:
|
||
# Set up env variables values | ||
|
||
# export PROJECT_ID= |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PROJECT_ID
variable is commented out and not assigned a value. This will cause the script to fail. Please uncomment and replace <YOUR_PROJECT_ID>
with the actual project ID. Also, per the Shell Style Guide, use uppercase for variable names. For consistency with Google Cloud samples, consider renaming this to GOOGLE_CLOUD_PROJECT
.
# export PROJECT_ID= | |
export GOOGLE_CLOUD_PROJECT=<YOUR_PROJECT_ID> |
export TPU_NODE_LOCATION=us-west4-a | ||
|
||
|
||
PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PROJECT_NUMBER
variable is calculated but not used. Either remove it or add a comment explaining its intended future use. If you keep it, ensure the variable name GOOGLE_CLOUD_PROJECT
is used correctly here, and quote the variable.
PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)") | |
# The following line calculates the project number, to be used later. | |
PROJECT_NUMBER=$(gcloud projects describe "$GOOGLE_CLOUD_PROJECT" --format="value(projectNumber)") |
--project=$PROJECT_ID | ||
|
||
# Create terraform.tfvars file | ||
cat <<EOF >gke-platform/terraform.tfvars |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
terraform -chdir=gke-platform init | ||
terraform -chdir=gke-platform apply |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add error checking after each command to ensure the script exits if a command fails. This is important for robust scripting. See the Shell Style Guide.
terraform -chdir=gke-platform init | |
terraform -chdir=gke-platform apply | |
terraform -chdir=gke-platform init | |
if [[ $? -ne 0 ]]; then | |
echo "Terraform init failed" | |
exit 1 | |
fi | |
terraform -chdir=gke-platform apply | |
if [[ $? -ne 0 ]]; then | |
echo "Terraform apply failed" | |
exit 1 | |
fi |
--project=$PROJECT_ID | ||
|
||
# Install JobSets | ||
kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/v0.7.0/manifests.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use kubectl apply -f <file>
instead of kubectl apply --server-side -f <url>
. Downloading the YAML locally allows for version control and avoids potential issues with network connectivity during deployment.
kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/v0.7.0/manifests.yaml | |
kubectl apply -f jobset.yaml |
# Install JobSets | ||
kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/v0.7.0/manifests.yaml | ||
|
||
sleep 60 # wait for jobset to install |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of a fixed sleep, use kubectl wait
to ensure the JobSet controller is running before creating the workloads. This makes the script more reliable. See the Kubernetes Resource Style Guide.
sleep 60 # wait for jobset to install | |
kubectl wait --for=condition=Available deployment/jobset-controller --timeout=5m |
variable "project_id" { | ||
type = string | ||
description = "GCP project id" | ||
default = "<your project>" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
location = var.region | ||
node_locations = var.tpu_node_location | ||
cluster = var.enable_autopilot == false && var.enable_tpu ? google_container_cluster.ml_cluster[0].name : null | ||
initial_node_count = var.num_nodes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
variable "tpu_node_pools_number" { | ||
description = "Number of TPU node pools. " | ||
default = 1 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a variable for the initial node count of TPU node pools. This avoids hardcoding values in main.tf
and improves module flexibility. Per the Terraform Style Guide, use lowercase and underscores for variable names.
} | |
variable "tpu_initial_node_count" { | |
description = "Initial node count for TPU node pools. " | |
type = number | |
default = 1 | |
} | |
Description
This PR....
Tasks