- Introduction
- Project Overview
- Airflow DAG Details
- Dataset
- Model Fine-Tuning
- Inference
- Project Structure
- Prerequisites
- Installation
- Usage
- Testing
- Future Work
- Contributing
- License
This project demonstrates advanced use of Apache Airflow to orchestrate a machine learning pipeline for fine-tuning a Transformer model using Hugging Face. We focus on an innovative use case: fine-tuning a Transformer to detect emotions in text data from the Emotion Dataset. This pipeline showcases deep competency in Airflow, including custom operators, task dependencies, and integration with external libraries.
The goal is to build an end-to-end automated pipeline that:
- Ingests and Preprocesses Data: Downloads the Emotion Dataset and prepares it for training.
- Fine-Tunes a Transformer Model: Uses Hugging Face's `bert-base-uncased` model for emotion classification.
- Evaluates the Model: Assesses performance using metrics like accuracy and F1-score.
- Saves the Model: Stores the fine-tuned model locally or on cloud storage.
- Performs Inference: Uses the fine-tuned model to predict emotions on new text data.
- Versions Data with DVC: Automatically versions processed data using Data Version Control (DVC).
- Notifies Upon Completion: Sends an email or Slack notification when the pipeline finishes.
By integrating these steps into an Airflow DAG, we demonstrate how to manage complex workflows, ensure data consistency, automate machine learning tasks, and maintain data versioning in a production-like environment.
The Airflow DAG (`dags/emotion_detection_dag.py`) orchestrates the following tasks:

- Start: `DummyOperator` signaling the DAG's initiation.
- Data Download: `PythonOperator` that downloads the Emotion Dataset.
- Data Preprocessing: Cleans and tokenizes text data, encoding labels.
- Data Versioning: `PythonOperator` that versions the processed data using DVC.
- Model Fine-Tuning: Fine-tunes `bert-base-uncased` for emotion detection.
- Model Evaluation: Evaluates the model using the test set.
- Model Saving: Saves the trained model and tokenizer.
- Notification: `EmailOperator` or `SlackAPIPostOperator` sends completion alerts.
- End: `DummyOperator` signaling the DAG's completion.
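For orientation, the skeleton below sketches how these tasks could be wired together in `dags/emotion_detection_dag.py`. It is a minimal sketch, not the file's exact contents: the callables imported from `scripts/` and the `default_args` values are assumptions.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.email import EmailOperator
from airflow.operators.python import PythonOperator

# Assumes scripts/ is importable (e.g., on PYTHONPATH); names are illustrative.
from scripts.download_data import download_data
from scripts.preprocess_data import preprocess_data
from scripts.data_versioning import version_data
from scripts.train_model import train_model
from scripts.evaluate_model import evaluate_model
from scripts.save_model import save_model

default_args = {
    "owner": "airflow",
    "retries": 2,                           # retry failed tasks twice
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="emotion_detection_dag",
    start_date=datetime(2023, 1, 1),        # placeholder start date
    schedule_interval=None,                 # triggered manually
    catchup=False,
    default_args=default_args,
) as dag:
    start = DummyOperator(task_id="start")
    download = PythonOperator(task_id="download_data", python_callable=download_data)
    preprocess = PythonOperator(task_id="preprocess_data", python_callable=preprocess_data)
    version = PythonOperator(task_id="version_data", python_callable=version_data)
    train = PythonOperator(task_id="train_model", python_callable=train_model)
    evaluate = PythonOperator(task_id="evaluate_model", python_callable=evaluate_model)
    save = PythonOperator(task_id="save_model", python_callable=save_model)
    notify = EmailOperator(
        task_id="notify",
        to="[email protected]",           # placeholder recipient
        subject="Emotion detection pipeline finished",
        html_content="The emotion detection pipeline completed successfully.",
    )
    end = DummyOperator(task_id="end")

    # Linear dependency chain matching the task list above.
    start >> download >> preprocess >> version >> train >> evaluate >> save >> notify >> end
```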
Advanced Airflow Features Demonstrated:
- Custom Operators: Specialized tasks for model training, evaluation, and data versioning.
- XComs: Pass small pieces of data, such as file paths and metrics, between tasks.
- Error Handling: Retries and alerting on failures.
- Dynamic Task Mapping: For scalable and parallel data processing.
- Integration with External Libraries: Seamless use of Hugging Face within Airflow tasks.
- Data Versioning with DVC: Ensures data consistency and reproducibility.
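To make the custom-operator and XCom points concrete, here is a hypothetical `ModelEvaluationOperator` that returns its metrics (Airflow pushes an operator's return value to XCom automatically) and a downstream callable that pulls them. The class name and metric values are illustrative, not code from this repository.

```python
from airflow.models.baseoperator import BaseOperator


class ModelEvaluationOperator(BaseOperator):
    """Hypothetical custom operator: evaluates a saved model and reports metrics."""

    def __init__(self, model_dir: str, **kwargs):
        super().__init__(**kwargs)
        self.model_dir = model_dir

    def execute(self, context):
        # Real logic would load the model from self.model_dir and score the test set.
        metrics = {"accuracy": 0.92, "f1": 0.91}  # placeholder values
        # Returning a value from execute() pushes it to XCom automatically.
        return metrics


def notify_with_metrics(ti):
    # A downstream PythonOperator can pull the metrics from XCom by task_id.
    metrics = ti.xcom_pull(task_ids="evaluate_model")
    print(f"Model accuracy: {metrics['accuracy']:.2%}, F1: {metrics['f1']:.2%}")
```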
We utilize the Emotion Dataset, containing 20,000 English Twitter messages labeled with six emotions: anger, fear, joy, love, sadness, and surprise.
- Source: Emotion Dataset on Hugging Face Datasets
- Purpose: Enables multi-class emotion detection, providing a nuanced understanding beyond binary sentiment analysis.
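Loading the dataset is a one-liner with the `datasets` library. A quick sketch (the dataset id is `emotion`, published as `dair-ai/emotion` on newer Hub versions):

```python
from datasets import load_dataset

# Downloads and caches the Emotion Dataset from the Hugging Face Hub.
dataset = load_dataset("emotion")

print(dataset)              # DatasetDict with train/validation/test splits
print(dataset["train"][0])  # e.g. {'text': '...', 'label': 0}

# The integer labels decode to the six emotion names.
print(dataset["train"].features["label"].names)
```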
We fine-tune the `bert-base-uncased` Transformer model for emotion classification. Each step is encapsulated in scripts within the `scripts/` directory and executed via Airflow tasks, ensuring modularity and ease of maintenance.
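As a reference for what `scripts/train_model.py` involves, the sketch below follows the standard Hugging Face `Trainer` recipe; the hyperparameters and output paths are illustrative assumptions, not the project's exact settings.

```python
import numpy as np
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

dataset = load_dataset("emotion")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Pad/truncate so every example fits BERT's fixed input shape.
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=128)

encoded = dataset.map(tokenize, batched=True)

# Six output classes: sadness, joy, love, anger, fear, surprise.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=6)

def compute_metrics(eval_pred):
    # Accuracy and weighted F1, matching the metrics named above.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="weighted"),
    }

args = TrainingArguments(
    output_dir="models/checkpoints",      # illustrative checkpoint path
    per_device_train_batch_size=16,       # illustrative hyperparameters
    num_train_epochs=2,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    compute_metrics=compute_metrics,
)
trainer.train()

# Matches the project layout: model and tokenizer land in models/saved_model/.
trainer.save_model("models/saved_model")
tokenizer.save_pretrained("models/saved_model")
```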
Once the model has been fine-tuned and saved, you can use it to predict emotions on new text data using the `scripts/example_data.py` script.
Running Inference:

1. Ensure the Fine-Tuned Model is Saved: Make sure the model has been trained and saved in the `models/saved_model/` directory by running the Airflow pipeline.

2. Run the Inference Script:

   ```bash
   python scripts/example_data.py
   ```

   This script will load the fine-tuned model and tokenizer and perform emotion prediction on a set of sample sentences.
Sample Output:

```text
Text: I'm so happy to hear from you!
Predicted Emotion: joy

Text: This is the worst day ever.
Predicted Emotion: sadness

Text: I can't wait to see you again.
Predicted Emotion: love

Text: I'm feeling very anxious about the meeting.
Predicted Emotion: fear

Text: You did an amazing job!
Predicted Emotion: joy

Text: I'm utterly disappointed.
Predicted Emotion: sadness
```
Custom Input: You can modify the `sample_texts` list in `scripts/example_data.py` to include your own sentences for emotion prediction.
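A stripped-down version of the inference flow in `scripts/example_data.py` might look like this; the label order matches the Emotion Dataset's integer classes, while the variable names are assumptions:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_DIR = "models/saved_model"
# Label order of the Emotion Dataset's integer classes.
LABELS = ["sadness", "joy", "love", "anger", "fear", "surprise"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
model.eval()

sample_texts = [
    "I'm so happy to hear from you!",
    "This is the worst day ever.",
]

# Tokenize the batch and run a forward pass without tracking gradients.
inputs = tokenizer(sample_texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

for text, idx in zip(sample_texts, logits.argmax(dim=-1).tolist()):
    print(f"Text: {text}\nPredicted Emotion: {LABELS[idx]}\n")
```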
```text
transformer_airflow_project/
├── dags/
│   ├── emotion_detection_dag.py
│   └── example_dag.py          # Added example DAG
├── data/
│   ├── raw/
│   └── processed/
├── models/
│   └── saved_model/
├── scripts/
│   ├── download_data.py
│   ├── preprocess_data.py
│   ├── train_model.py
│   ├── evaluate_model.py
│   ├── save_model.py
│   ├── example_data.py         # Added inference script
│   └── data_versioning.py      # Added data versioning script
├── tests/
│   ├── __init__.py
│   └── test_functions.py
├── logs/
├── .github/
│   └── workflows/
│       └── ci.yml
├── .gitignore
├── requirements.txt
├── README.md
├── CONTRIBUTING.md
└── LICENSE
```
- Python 3.7+
- Apache Airflow 2.x
- Hugging Face Transformers 4.21.0
- Datasets Library 2.3.2
- PyTorch 1.11.0
- Scikit-learn 0.24.2
- Pandas 1.3.0
- NumPy 1.21.0
- Slack API Token (if using Slack notifications)
- DVC 2.7.3 (for data versioning)
- Git
- Docker (if using Docker)
- GitHub Actions (for CI/CD)
1. Clone the Repository:

   ```bash
   git clone https://github.com/your_username/transformer_airflow_project.git
   cd transformer_airflow_project
   ```

2. Set Up a Virtual Environment:

   ```bash
   python3 -m venv venv
   source venv/bin/activate
   pip install --upgrade pip
   pip install -r requirements.txt
   ```

3. Initialize Airflow:

   ```bash
   export AIRFLOW_HOME=$(pwd)/airflow
   airflow db init
   airflow users create \
       --username admin \
       --firstname YOUR_FIRST_NAME \
       --lastname YOUR_LAST_NAME \
       --role Admin \
       --email YOUR_EMAIL
   ```
4. Set Up Airflow Configuration:
   - Ensure the `dags_folder` in `airflow.cfg` points to the `dags/` directory in your project.
   - Set any required environment variables or connections.
5. Configure Connections and Variables:
   - Slack: Add a Slack connection in Airflow with your API token.
   - AWS/GCP: Set up connections if saving models to cloud storage.
   - Environment Variables: Set any required environment variables.
6. Install Airflow Providers:

   ```bash
   pip install apache-airflow-providers-slack
   ```

   The `EmailOperator` ships with core Airflow 2.x, so only the Slack provider needs a separate install.
7. Initialize DVC:

   ```bash
   dvc init
   dvc remote add -d s3remote s3://your-bucket-name/path
   dvc remote modify s3remote endpointurl https://s3.amazonaws.com
   ```

   Replace `your-bucket-name/path` with your actual S3 bucket and path.
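Within the pipeline, `scripts/data_versioning.py` presumably shells out to these same DVC commands. A minimal sketch, assuming the processed data lives in `data/processed/`:

```python
import subprocess


def version_data(data_path: str = "data/processed") -> None:
    """Track the processed data with DVC and push it to the configured remote."""
    # Stage the data directory; DVC writes a .dvc pointer file for Git to track.
    subprocess.run(["dvc", "add", data_path], check=True)
    # Commit the pointer file so the data version is tied to the code version.
    subprocess.run(["git", "add", f"{data_path}.dvc"], check=True)
    subprocess.run(["git", "commit", "-m", f"Version data in {data_path}"], check=True)
    # Upload the actual data to the S3 remote configured above.
    subprocess.run(["dvc", "push"], check=True)
```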
Start the Airflow scheduler and webserver in separate terminal windows or as background processes:
```bash
airflow scheduler &
airflow webserver -p 8080 &
```
Open your web browser and navigate to http://localhost:8080 to access the Airflow UI.
- Via UI:
  - Locate the `emotion_detection_dag` in the DAGs list.
  - Click the toggle to activate the DAG if it's not already active.
  - Click the Trigger DAG button to start the pipeline.
- Via CLI:

  ```bash
  airflow dags trigger emotion_detection_dag
  ```
The `example_dag.py` is a simple DAG included for demonstration and testing purposes.

- Purpose: To verify that your Airflow setup is functioning correctly by running a minimal DAG with no real tasks.
- Via UI:
  - Locate the `example_dag` in the DAGs list.
  - Click the toggle to activate the DAG if it's not already active.
  - Click the Trigger DAG button to run the example DAG.
- Via CLI:

  ```bash
  airflow dags trigger example_dag
  ```

Note: The `example_dag` contains only `DummyOperator` tasks (`start` and `end`) and serves as a template or a sanity check for your Airflow installation.
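A DAG of that shape amounts to little more than the following (the `start_date` and schedule here are placeholder assumptions; the shipped `dags/example_dag.py` may differ in details):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="example_dag",
    start_date=datetime(2023, 1, 1),  # placeholder
    schedule_interval=None,           # run only when triggered
    catchup=False,
) as dag:
    # Two no-op tasks: enough to confirm scheduling, execution, and logging work.
    start = DummyOperator(task_id="start")
    end = DummyOperator(task_id="end")

    start >> end
```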
- Airflow UI:
  - Use the Graph View or Tree View to monitor task progress.
  - Click on individual tasks to view logs and troubleshoot if necessary.
- Logs Directory:
  - Check the `logs/` directory for detailed logs of each task execution.
- Emotion Detection Pipeline:
  - The fine-tuned model is saved in `models/saved_model/`.
  - Evaluation metrics are available in the task logs and can be accessed via the Airflow UI.
- Example DAG:
  - Since it uses `DummyOperator`, it won't produce any outputs but confirms that Airflow is operational.
After training and saving the model, perform inference using the `example_data.py` script:

```bash
python scripts/example_data.py
```

This script will output predicted emotions for the sample sentences provided.
Run unit tests to ensure each component functions correctly.
```bash
# Ensure you're in the virtual environment
source venv/bin/activate

# Run unit tests
python -m unittest discover tests
```
Note: Ensure that the Airflow services are not running during testing to prevent conflicts, especially if using local SQLite databases.
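For reference, a test in `tests/test_functions.py` could look like the sketch below; `clean_text` is a hypothetical helper, since the actual function names depend on `scripts/preprocess_data.py`:

```python
import unittest


def clean_text(text: str) -> str:
    """Hypothetical stand-in for a helper in scripts/preprocess_data.py."""
    return " ".join(text.lower().split())


class TestPreprocessing(unittest.TestCase):
    def test_clean_text_lowercases_and_trims(self):
        self.assertEqual(clean_text("  Hello   WORLD "), "hello world")

    def test_clean_text_handles_empty_input(self):
        self.assertEqual(clean_text(""), "")


if __name__ == "__main__":
    unittest.main()
```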
- Data Version Control (DVC): Extend DVC to version model artifacts in addition to the processed data.
- Hyperparameter Optimization: Use Optuna or Ray Tune for automated hyperparameter tuning.
- Model Serving: Deploy the model using TensorFlow Serving or TorchServe.
- Kubernetes Executor: Scale Airflow tasks with Kubernetes for distributed execution.
- CI/CD Integration: Set up GitHub Actions or Jenkins for continuous integration and deployment.
- Dockerization: Containerize the application for easier deployment and scalability.
- Model Registry: Use tools like MLflow for model tracking and management.
- Advanced Monitoring: Integrate Prometheus and Grafana for real-time monitoring and alerting.
- Security Enhancements: Implement authentication and authorization mechanisms for model access.
- Automated Data Validation: Add tasks to validate data quality before processing.
- Scheduled Model Retraining: Automate periodic retraining, using Airflow sensors to detect new data and trigger runs.
- Data Augmentation: Implement techniques to enhance the dataset and improve model robustness.
- Multi-Language Support: Extend the pipeline to handle datasets in multiple languages.

Feel free to explore these ideas and contribute to the project's growth!
We use the Emotion Dataset, but you can experiment with other text datasets:
- Amazon Reviews: For sentiment analysis or product feature extraction.
- Twitter Sentiment Analysis: Real-time sentiment tracking on specific hashtags.
- News Articles: Topic modeling or fake news detection.
- Customer Support Tickets: Intent classification to route tickets.
Note: Remember to update the `CONTRIBUTING.md` and `LICENSE` files if any changes have been made.

Contributions are welcome! Please read CONTRIBUTING.md for guidelines.

This project is licensed under the MIT License - see the LICENSE file for details.