Cherry-pick #635 to release-1.1 branch (#637)
Update RAG to use Autopilot by default (#635)

Remove DNS troubleshooting information, as this has been patched.

Co-authored-by: artemvmin <[email protected]>
roberthbailey and artemvmin authored Apr 30, 2024
1 parent 422073b commit e7b191a
Showing 4 changed files with 19 additions and 25 deletions.
36 changes: 15 additions & 21 deletions applications/rag/README.md
@@ -31,20 +31,17 @@ Install the following on your computer:

### Bring your own cluster (optional)

By default, this tutorial creates a Standard cluster on your behalf. We highly recommend following the default settings.
By default, this tutorial creates a cluster on your behalf. We highly recommend following the default settings.

If you prefer to manage your own cluster, set `create_cluster = false` in the [Installation section](#installation). Creating a long-running cluster may be better for development, allowing you to iterate on Terraform components without recreating the cluster every time.

Use the provided infrastructue module to create a cluster:

1. `cd ai-on-gke/infrastructure`

2. Edit `platform.tfvars` to set your project ID, location and cluster name. The other fields are optional. Ensure you create an L4 nodepool as this tutorial requires it.

3. Run `terraform init`

4. Run `terraform apply --var-file workloads.tfvars`
Use gcloud to create a GKE Autopilot cluster. Note that RAG requires the latest Autopilot features, available on the latest versions of 1.28 and 1.29.

```
gcloud container clusters create-auto rag-cluster \
--location us-central1 \
--cluster-version 1.28
```
### Bring your own VPC (optional)

By default, this tutorial creates a new network on your behalf with [Private Service Connect](https://cloud.google.com/vpc/docs/private-service-connect) already enabled. We highly recommend following the default settings.
@@ -64,10 +61,11 @@ This section sets up the RAG infrastructure in your GCP project using Terraform.
1. `cd ai-on-gke/applications/rag`

2. Edit `workloads.tfvars` to set your project ID, location, cluster name, and GCS bucket name. Ensure the `gcs_bucket` name is globally unique (add a random suffix). Optionally, make the following changes:
* (Optional) Set a custom `kubernetes_namespace` where all k8s resources will be created.
* (Recommended) [Enable authenticated access](#configure-authenticated-access-via-iap) for JupyterHub, frontend chat and Ray dashboard services.
* (Not recommended) Set `create_cluster = false` if you bring your own cluster. If using a GKE Standard cluster, ensure it has an L4 nodepool with autoscaling and node autoprovisioning enabled.
* (Not recommended) Set `create_network = false` if you bring your own VPC. Ensure your VPC has Private Service Connect enabled as described above.
* (Optional) Set a custom `kubernetes_namespace` where all k8s resources will be created.
* (Optional) Set `autopilot_cluster = false` to deploy using GKE Standard.
* (Optional) Set `create_cluster = false` if you are bringing your own cluster. If using a GKE Standard cluster, ensure it has an L4 nodepool with autoscaling and node autoprovisioning enabled. You can simplify setup by following the Terraform instructions in [`infrastructure/README.md`](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/infrastructure/README.md).
* (Optional) Set `create_network = false` if you are bringing your own VPC. Ensure your VPC has Private Service Connect enabled as described above.

3. Run `terraform init`

@@ -193,17 +191,13 @@ Connect to the GKE cluster:
gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${CLUSTER_LOCATION}
```

1. Troubleshoot JupyterHub job failures:
- If the JupyterHub job fails to start the proxy with error code 599, it is likely an known issue with Cloud DNS, which occurs when a cluster is quickly deleted and recreated with the same name.
- Recreate the cluster with a different name or wait several minutes after running `terraform destroy` before running `terraform apply`.

2. Troubleshoot Ray job failures:
1. Troubleshoot Ray job failures:
- If the Ray actors fail to be scheduled, it could be due to a stockout or quota issue.
- Run `kubectl get pods -n ${NAMESPACE} -l app.kubernetes.io/name=kuberay`. There should be a Ray head and Ray worker pod in `Running` state. If your ray pods aren't running, it's likely due to quota or stockout issues. Check that your project and selected `cluster_location` have L4 GPU capacity.
- Often, retrying the Ray job submission (the last cell of the notebook) helps.
- The Ray job may take 15-20 minutes to run the first time due to environment setup.

3. Troubleshoot IAP login issues:
2. Troubleshoot IAP login issues:
- Verify the cert is Active:
- For JupyterHub `kubectl get managedcertificates jupyter-managed-cert -n ${NAMESPACE} --output jsonpath='{.status.domainStatus[0].status}'`
- For the frontend: `kubectl get managedcertificates frontend-managed-cert -n rag --output jsonpath='{.status.domainStatus[0].status}'`
@@ -213,14 +207,14 @@ gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${CLUSTER_L
- Org error:
- The [OAuth Consent Screen](https://developers.google.com/workspace/guides/configure-oauth-consent#configure_oauth_consent) has `User type` set to `Internal` by default, which means principals external to the org your project is in cannot log in. To add external principals, change `User type` to `External`.

4. Troubleshoot `terraform apply` failures:
3. Troubleshoot `terraform apply` failures:
- Inference server (`mistral`) fails to deploy:
- This usually indicates a stockout/quota issue. Verify your project and chosen `cluster_location` have L4 capacity.
- GCS bucket already exists:
- GCS bucket names have to be globally unique, pick a different name with a random suffix.
- Cloud SQL instance already exists:
- Ensure the `cloudsql_instance` name doesn't already exist in your project.

5. Troubleshoot `terraform destroy` failures:
4. Troubleshoot `terraform destroy` failures:
- Network deletion issue:
- `terraform destroy` fails to delete the network due to a known issue in the GCP provider. For now, the workaround is to manually delete it.
4 changes: 2 additions & 2 deletions applications/rag/metadata.yaml
@@ -28,8 +28,8 @@ spec:
varType: string
defaultValue: "created-by=gke-ai-quick-start-solutions,ai.gke.io=rag"
- name: autopilot_cluster
varType: string
defaultValue: false
varType: bool
defaultValue: true
- name: iap_consent_info
description: Configure the <a href="https://developers.google.com/workspace/guides/configure-oauth-consent#configure_oauth_consent"><i>OAuth Consent Screen</i></a> for your project. Ensure <b>User type</b> is set to <i>Internal</i>. Note that by default, only users within your organization can be allowlisted. To add external users, change the <b>User type</b> to <i>External</i> after the application is deployed.
varType: bool
2 changes: 1 addition & 1 deletion applications/rag/variables.tf
@@ -319,7 +319,7 @@ variable "private_cluster" {

variable "autopilot_cluster" {
type = bool
default = false
default = true
}

variable "cloudsql_instance" {
2 changes: 1 addition & 1 deletion applications/rag/workloads.tfvars
@@ -20,7 +20,7 @@ subnetwork_cidr = "10.100.0.0/16"
create_cluster = true # Creates a GKE cluster in the specified network.
cluster_name = "<cluster-name>"
cluster_location = "us-central1"
autopilot_cluster = false
autopilot_cluster = true
private_cluster = false

## GKE environment variables
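
Since this commit flips the `autopilot_cluster` default to `true`, anyone who relied on the previous Standard behavior now has to opt out explicitly. A minimal sketch of that override in `workloads.tfvars`, using only the variable names that appear in the diff above (the surrounding values are the placeholders from the file, not recommendations):

```hcl
# workloads.tfvars (fragment): keep deploying on GKE Standard after this change.
create_cluster    = true            # let Terraform create the cluster
cluster_name      = "<cluster-name>"
cluster_location  = "us-central1"
autopilot_cluster = false           # default is now true; false selects GKE Standard
```

Per the updated README, a Standard cluster you bring yourself (`create_cluster = false`) still needs an L4 nodepool with autoscaling and node autoprovisioning enabled.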