Skip to content

Latest commit

 

History

History
347 lines (238 loc) · 12 KB

File metadata and controls

347 lines (238 loc) · 12 KB

🏷 Continuous Annotation with Label Studio

Data annotation / labeling is a core part of MLOps that is frequently left out of the conversation. ZenML offers a way to build continuous annotation (combining training and annotation into a loop) with Label Studio. This uses a combination of user-defined steps as well as some built-in steps that ZenML provides.

🗺 Overview

NOTE: This example currently only runs with a cloud artifact store. This guide by Label Studio contains more details if you have trouble understanding the instructions below.

For a full video walkthrough on how to run this example, please check out our community hour demo in which we show how to run the various pipelines, in what order, all using an AWS stack.

🖥 Run the example

In order to run this example, you need to install and initialize ZenML and Label Studio.

pip install "zenml[server]" torchvision

# clone the ZenML repository
git clone https://github.com/zenml-io/zenml-projects.git
cd label_studio_annotation

# Initialize ZenML repo
zenml init

# Start the ZenServer to enable dashboard access
zenml up

You will need to install and start Label Studio locally:

zenml integration install label_studio

# start label-studio locally
label-studio start -p 8093

To connect ZenML to your local Label Studio server, you also need an API key. This will give you access to the web annotation interface.

First order of business is to sign up if you haven't done so already, using an email and a password. Then visit http://localhost:8093/ to log in, and then visit http://localhost:8093/user/account and get your Label Studio API key (from the upper right hand corner). You will need it for later. Keep the Label Studio server running, because the ZenML Label Studio annotator will use it as the backend.

👣 Step-by-Step Infrastructure Setup

Pick your preferred infrastructure provider from Azure, GCP and AWS. Then follow the relevant guide below for how to set up the infrastructure required by this example.

🥞 Set up your stack for Microsoft Azure

You should install the relevant integrations:

zenml integration install pytorch azure pillow

Some setup for your stack is required.

A resource group will allow you to clean up any associated resources easily, so either create one with the following command, or reference an existing one in what follows. The same applies to the storage account and container that we'll create and use:

az group create --name <YOUR_RESOURCE_GROUP_NAME> --location northeurope

az storage account create --name <YOUR_STORAGE_ACCOUNT_NAME> --resource-group <YOUR_RESOURCE_GROUP_NAME> --location northeurope --sku Standard_ZRS --encryption-services blob

az storage container create \
    --account-name <YOUR_STORAGE_ACCOUNT_NAME> \
    --name <YOUR_CONTAINER_NAME> \
    --auth-mode login

At this point you'll want to get a storage account key with the following command:

az storage account keys list --account-name <YOUR_STORAGE_ACCOUNT_NAME>

You can pick either of the 'value' properties to use for the --account-key value for the following command that adds CORS permissions to your storage account:

az storage cors add --origins '*' --methods GET --allowed-headers '*' --exposed-headers 'Access-Control-Allow-Origin' --max-age 3600 --services blob --account-key <YOUR_STORAGE_ACCOUNT_KEY> --account-name <YOUR_STORAGE_ACCOUNT_NAME>

Now we can set up all the necessary ZenML components:

zenml stack copy default <ANNOTATION_STACK_NAME>

zenml stack set <ANNOTATION_STACK_NAME>

zenml secret create <YOUR_AZURE_AUTH_SECRET_NAME> --account_name="<YOUR_STORAGE_ACCOUNT_NAME>" --account_key="<YOUR_STORAGE_ACCOUNT_KEY>"

zenml artifact-store register <YOUR_CLOUD_ARTIFACT_STORE> -f azure --path="az://<YOUR_CONTAINER_NAME>" --authentication_secret="<YOUR_AZURE_AUTH_SECRET_NAME>"

zenml stack update <ANNOTATION_STACK_NAME> -a <YOUR_CLOUD_ARTIFACT_STORE>

zenml secret create <LABEL_STUDIO_SECRET_NAME> --api_key="<YOUR_API_KEY>"

zenml annotator register <YOUR_LABEL_STUDIO_ANNOTATOR> --flavor label_studio --authentication_secret="<LABEL_STUDIO_SECRET_NAME>"

zenml stack update <ANNOTATION_STACK_NAME> -an <YOUR_LABEL_STUDIO_ANNOTATOR>

At this point you should stop (CTRL-C) Label Studio, set the following two environment variables (or do it wherever is appropriate if you have a deployed Label Studio instance) before restarting Label Studio:

export AZURE_BLOB_ACCOUNT_NAME="<YOUR_STORAGE_ACCOUNT_NAME>"
export AZURE_BLOB_ACCOUNT_KEY="<YOUR_STORAGE_ACCOUNT_KEY>"

🥞 Set up your stack for GCP

You should install the relevant integrations:

zenml integration install label_studio pytorch gcp pillow

This setup guide assumes that you have installed and are able to use the gcloud CLI to run commands.

Create your artifact store and set up the necessary permissions and service accounts:

gsutil mb -p <YOUR_GCP_PROJECT_NAME> -l <YOUR_GCP_REGION_NAME> -c standard "gs://<YOUR_BUCKET_NAME>"

gcloud iam service-accounts create <YOUR_SERVICE_ACCOUNT_NAME>

gcloud projects add-iam-policy-binding <YOUR_GCP_PROJECT_NAME> --member="serviceAccount:<YOUR_SERVICE_ACCOUNT_NAME>@<YOUR_GCP_PROJECT_NAME>.iam.gserviceaccount.com" --role="roles/storage.admin"

gcloud projects add-iam-policy-binding <YOUR_GCP_PROJECT_NAME> --member="serviceAccount:<YOUR_SERVICE_ACCOUNT_NAME>@<YOUR_GCP_PROJECT_NAME>.iam.gserviceaccount.com" --role="roles/iam.serviceAccountTokenCreator"

gcloud iam service-accounts keys create ls-annotation-credentials.json --iam-account=<YOUR_SERVICE_ACCOUNT_NAME>@<YOUR_GCP_PROJECT_NAME>.iam.gserviceaccount.com

gsutil cors set cloud_config/gcp/cors-config.json gs://<YOUR_BUCKET_NAME>

Now you have a credentials json file that you can use to authenticate with Label Studio. Make sure to note down the path where this was created.

Now you can set up the rest of your stack:

zenml stack copy default <YOUR_ANNOTATION_STACK_NAME>

zenml stack set <YOUR_ANNOTATION_STACK_NAME>

zenml secret create <YOUR_GCP_AUTH_SECRETS_NAME> --token="@PATH/TO/JSON/FILE/CREATED/ABOVE"

zenml artifact-store register gcp_artifact_store -f gcp --path="gs://<YOUR_BUCKET_NAME>" --authentication_secret="<YOUR_GCP_AUTH_SECRETS_NAME>"

zenml stack update <YOUR_ANNOTATION_STACK_NAME> -a <YOUR_CLOUD_ARTIFACT_STORE>

zenml secret create <LABEL_STUDIO_SECRET_NAME> --api_key="<YOUR_API_KEY>"

zenml annotator register <YOUR_LABEL_STUDIO_ANNOTATOR> --flavor label_studio --authentication_secret="<LABEL_STUDIO_SECRET_NAME>"

zenml stack update <YOUR_ANNOTATION_STACK_NAME> -an <YOUR_LABEL_STUDIO_ANNOTATOR>

Be sure also to set the environment variable for your GCP credentials and then restart Label Studio:

export GOOGLE_APPLICATION_CREDENTIALS="<PATH_TO_YOUR_CREDENTIALS>"
label-studio start -p 8093

🥞 Set up your stack for AWS

You should install the relevant integrations:

# assuming label_studio was installed earlier
zenml integration install pytorch s3 pillow

Create your basic S3 bucket via CLI command:

REGION=<YOUR_DESIRED_AWS_REGION>
S3_BUCKET_NAME=<YOUR_DESIRED_S3_BUCKET_NAME>

aws s3api create-bucket --bucket=$S3_BUCKET_NAME --region=$REGION --create-bucket-configuration=LocationConstraint=$REGION

Label Studio also needs you to set up cross-origin resource sharing (CORS) access to your bucket, using a policy that allows GET access from the same host name as your Label Studio deployment. You can use the CORS file we have pre-populated for you inside the examples repo as follows:

cd cloud_config/aws
aws s3api put-bucket-cors --bucket $S3_BUCKET_NAME --cors-configuration file://cors.json
cd ../..

You will also need to assign a policy to your IAM user that allows it to access the S3 bucket in the way that Label Studio needs. You can do this using the instructions described in the Label Studio documentation (step two, specifically).

Now you can get to the fun part of setting up Label Studio and working with ZenML:

YOUR_AWS_ZENML_STACK_NAME=<YOUR_AWS_ZENML_STACK_NAME>
YOUR_ARTIFACT_STORE_NAME=<YOUR_ARTIFACT_STORE_NAME>

zenml stack copy default $YOUR_AWS_ZENML_STACK_NAME

zenml stack set $YOUR_AWS_ZENML_STACK_NAME

# use your standard access key id and secret access key from ~/.aws/credentials here
zenml secret create <YOUR_AWS_SECRET_NAME> --aws_access_key_id="<YOUR_ACCESS_KEY_ID>" --aws_secret_access_key="<YOUR_SECRET_ACCESS_KEY>"

zenml artifact-store register $YOUR_ARTIFACT_STORE_NAME --flavor=s3 --path=s3://$S3_BUCKET_NAME --authentication_secret="<YOUR_AWS_SECRET_NAME>"

zenml stack update $YOUR_AWS_ZENML_STACK_NAME -a $YOUR_ARTIFACT_STORE_NAME

zenml secret create <LABEL_STUDIO_SECRET_NAME> --api_key="<YOUR_API_KEY>"

zenml annotator register <YOUR_LABEL_STUDIO_ANNOTATOR> --flavor label_studio --authentication_secret="<LABEL_STUDIO_SECRET_NAME>"

zenml stack update $YOUR_AWS_ZENML_STACK_NAME -an <YOUR_LABEL_STUDIO_ANNOTATOR>

▶️ Run the Code

There are several parts to running the pipeline. Start with:

python run.py --train

On the first time you run this, this will fetch a pretrained model and amend the head so that it is able to perform binary image classification. It does not get finetuned at this point, however.

You can use the CLI command to view our newly created dataset:

zenml annotator dataset list

It will show that we now have an aria_detector dataset.

Then you can use this model to perform inference on some images (located in data/batch_1). These 'pre-annotations' will then be synchronized with Label Studio.

python run.py --inference

At this point you should do some annotation using the Label Studio web interface, making sure to annotate all the ten images. You can launch the annotation interface with the CLI:

zenml annotator dataset launch aria_detector

When you're done, you can get the stats on the annotation work you did by using the ZenML CLI:

zenml annotator dataset stats aria_detector

This will tell you how many labeled and unlabeled tasks you have for your Label Studio dataset.

With the annotations complete, you should rerun the training step. This time instead of just pulling a pretrained model, the step will finetune our previously-downloaded pretrained model using the ten annotations we just made. This will improve our model's performance (a bit).

python run.py --train --rerun

Now we can rerun the inference step, though using a different set of data (data/batch_2).

python run.py --inference --rerun

Once we've rerun this step, we can inspect the predictions our newly fine-tuned model made within Label Studio.

CLI Commands for the Label Studio Integration

Once you've run the pipeline for the first time, you'll be able to use some ZenML CLI commands to interact with your Label Studio annotations and the dataset:

# the obvious ones to try
zenml annotator describe
zenml annotator dataset list
zenml annotator dataset <YOUR_DATASET_NAME> stats

🧽 Clean up

In order to clean up, delete any relevant cloud infrastructure that was created above (the cloud storage bucket is the main one, along with any permissions and whatever secrets or secrets vaults you created) and the remaining ZenML references. If you created assets under a GCP project, or an Azure resource group, then the newly created assets should be easy to find. With Amazon, the IAM rules and the S3 bucket are what you need to delete.

📜 Learn more

If you want to learn more about annotation in general or about how to use your own annotation tool in ZenML check out our docs.

To explore other available CLI commands related to the annotator stack components, check out the CLI docs annotator section.