Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Hotswap hero train job #1580

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

TarasRudko
Copy link
Contributor

Description

This PR....

Tasks

  • The contributing guide has been read and followed.
  • The samples added / modified have been fully tested.
  • Workflow files have been added / modified, if applicable.
  • Region tags have been properly added, if new samples.
  • Editable variables have been used, where applicable.
  • All dependencies are set to up-to-date versions, as applicable.
  • Merge this pull-request for me once it is approved.

Copy link

@code-review-assist code-review-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi TarasRudko and reviewers,

I'm currently reviewing this pull request and will post my detailed review in a few minutes. In the meantime, here's a quick summary of the changes for you and other reviewers to get up to speed:

This pull request aims to create a hot-swappable hero train job. The changes introduce a new sample that demonstrates how to deploy and manage high-priority and low-priority jobs on a Google Kubernetes Engine (GKE) cluster, leveraging TPUs for accelerated training. The sample uses Terraform to provision the GKE cluster and JobSets to manage the jobs.

Here's a breakdown of the changes:

  • ai-ml/hotswap-hero-train-job/deploy.sh: This script sets up the environment, creates a terraform.tfvars file, initializes and applies the Terraform configuration to create the GKE cluster, installs JobSets, and creates the high and low-priority jobs. The script is entirely new (lines 1-48).
  • ai-ml/hotswap-hero-train-job/gke-platform/main.tf: This file defines the Terraform configuration for the GKE cluster. It includes the creation of a service account, granting necessary permissions, and using modules for GKE Autopilot and standard GKE cluster configurations (lines 1-73).
  • ai-ml/hotswap-hero-train-job/gke-platform/modules/gke_autopilot/main.tf: This module configures a GKE Autopilot cluster (lines 1-96).
  • ai-ml/hotswap-hero-train-job/gke-platform/modules/gke_autopilot/output.tf: This module defines outputs for the GKE Autopilot cluster (lines 1-38).
  • ai-ml/hotswap-hero-train-job/gke-platform/modules/gke_autopilot/variables.tf: This module defines variables for the GKE Autopilot cluster (lines 1-64).
  • ai-ml/hotswap-hero-train-job/gke-platform/modules/gke_standard/main.tf: This module configures a standard GKE cluster with TPU node pools (lines 1-173).
  • ai-ml/hotswap-hero-train-job/gke-platform/modules/gke_standard/output.tf: This module defines outputs for the standard GKE cluster (lines 1-38).
  • ai-ml/hotswap-hero-train-job/gke-platform/modules/gke_standard/variables.tf: This module defines variables for the standard GKE cluster (lines 1-104).
  • ai-ml/hotswap-hero-train-job/gke-platform/variables.tf: This file defines variables used across the Terraform configuration (lines 1-92).
  • ai-ml/hotswap-hero-train-job/gke-platform/versions.tf: This file specifies the required Terraform providers and their versions (lines 1-26).
  • ai-ml/hotswap-hero-train-job/remove.sh: This script tears down the GKE cluster (lines 1-5).
  • ai-ml/hotswap-hero-train-job/workloads/high-priority-job-autopilot.yaml: This file defines a high-priority job for GKE Autopilot (lines 1-65).
  • ai-ml/hotswap-hero-train-job/workloads/high-priority-job.yaml: This file defines a high-priority job for standard GKE (lines 1-69).
  • ai-ml/hotswap-hero-train-job/workloads/low-priority-job-autopilot.yaml: This file defines a low-priority job for GKE Autopilot (lines 1-67).
  • ai-ml/hotswap-hero-train-job/workloads/low-priority-job.yaml: This file defines a low-priority job for standard GKE (lines 1-69).
  • ai-ml/hotswap-hero-train-job/workloads/priority.yaml: This file defines priority classes for high and low-priority jobs (lines 1-31).

I haven't fully tested the functionality yet, but the code looks well-structured and comprehensive. I'll provide more detailed feedback shortly.

Here's a little poem to lighten the mood:

Code flows like a river,
Through branches, a path it carves,
Reviews make it clear.

Copy link

@code-review-assist code-review-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pull request introduces a deployment script and Terraform configuration for setting up a GKE cluster with TPUs and running prioritized workloads. The use of JobSets and PriorityClasses is a good practice for managing these types of workloads. The Terraform configuration is generally well-structured. However, there are several areas for improvement in terms of script robustness, error handling, and adherence to style guides. I've summarized the style guide elements I've referenced below:

  • Shell Script Style Guide: I've focused on variable naming, quoting, and error handling.
  • Terraform Style Guide: I've emphasized variable naming conventions and module structure.
  • Kubernetes Resource Style Guide: I've focused on resource naming and YAML formatting.

I recommend addressing the following points before merging this PR:


# Set up env variables values

# export PROJECT_ID=

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The PROJECT_ID variable is commented out and not assigned a value. This will cause the script to fail. Please uncomment and replace <YOUR_PROJECT_ID> with the actual project ID. Also, per the Shell Style Guide, use uppercase for variable names. For consistency with Google Cloud samples, consider renaming this to GOOGLE_CLOUD_PROJECT.

Suggested change
# export PROJECT_ID=
export GOOGLE_CLOUD_PROJECT=<YOUR_PROJECT_ID>

export TPU_NODE_LOCATION=us-west4-a


PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The PROJECT_NUMBER variable is calculated but not used. Either remove it or add a comment explaining its intended future use. If you keep it, ensure the variable name GOOGLE_CLOUD_PROJECT is used correctly here, and quote the variable.

Suggested change
PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")
# The following line calculates the project number, to be used later.
PROJECT_NUMBER=$(gcloud projects describe "$GOOGLE_CLOUD_PROJECT" --format="value(projectNumber)")

--project=$PROJECT_ID

# Create terraform.tfvars file
cat <<EOF >gke-platform/terraform.tfvars

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Quote the here-doc delimiter EOF to prevent variable expansion within the here-doc. This is crucial for predictable behavior. See the Shell Style Guide.

Suggested change
cat <<EOF >gke-platform/terraform.tfvars
cat <<"EOF" >gke-platform/terraform.tfvars

Comment on lines +31 to +32
terraform -chdir=gke-platform init
terraform -chdir=gke-platform apply

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Add error checking after each command to ensure the script exits if a command fails. This is important for robust scripting. See the Shell Style Guide.

Suggested change
terraform -chdir=gke-platform init
terraform -chdir=gke-platform apply
terraform -chdir=gke-platform init
if [[ $? -ne 0 ]]; then
echo "Terraform init failed"
exit 1
fi
terraform -chdir=gke-platform apply
if [[ $? -ne 0 ]]; then
echo "Terraform apply failed"
exit 1
fi

--project=$PROJECT_ID

# Install JobSets
kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/v0.7.0/manifests.yaml

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Use kubectl apply -f <file> instead of kubectl apply --server-side -f <url>. Downloading the YAML locally allows for version control and avoids potential issues with network connectivity during deployment.

Suggested change
kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/v0.7.0/manifests.yaml
kubectl apply -f jobset.yaml

# Install JobSets
kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/v0.7.0/manifests.yaml

sleep 60 # wait for jobset to install

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Instead of a fixed sleep, use kubectl wait to ensure the JobSet controller is running before creating the workloads. This makes the script more reliable. See the Kubernetes Resource Style Guide.

Suggested change
sleep 60 # wait for jobset to install
kubectl wait --for=condition=Available deployment/jobset-controller --timeout=5m

variable "project_id" {
type = string
description = "GCP project id"
default = "<your project>"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low

Per the Terraform Style Guide, use lowercase and underscores for variable names. Change project_id to google_cloud_project.

Suggested change
default = "<your project>"
variable "google_cloud_project" {

location = var.region
node_locations = var.tpu_node_location
cluster = var.enable_autopilot == false && var.enable_tpu ? google_container_cluster.ml_cluster[0].name : null
initial_node_count = var.num_nodes

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The variable num_nodes is not defined within this module. Define it in variables.tf and pass it as an input to the module. This is crucial for proper Terraform usage.

Suggested change
initial_node_count = var.num_nodes
initial_node_count = var.tpu_initial_node_count

variable "tpu_node_pools_number" {
description = "Number of TPU node pools. "
default = 1
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Add a variable for the initial node count of TPU node pools. This avoids hardcoding values in main.tf and improves module flexibility. Per the Terraform Style Guide, use lowercase and underscores for variable names.

Suggested change
}
variable "tpu_initial_node_count" {
description = "Initial node count for TPU node pools. "
type = number
default = 1
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant