DeepCAT+: A Low-Cost and Transferrable Online Configuration Auto-Tuning Approach for Big Data Frameworks(ICPP22,TPDS24)
Big data frameworks usually provide a large number of performance-related parameters. Online auto-tuning these parameters based on deep reinforcement learning (DRL) to achieve a better performance has shown their advantages over search-based and machine learning-based approaches. Unfortunately, the time cost during the online tuning phase of conventional DRL-based methods is still heavy, especially for big data applications. To reduce the total online tuning cost and increase the adaptability: 1) DeepCAT+ utilizes the TD3 algorithm instead of DDPG to alleviate value overestimation; 2) DeepCAT+ modifies the conventional experience replay to fully utilize the rare but valuable transitions via a novel reward-driven prioritized experience replay mechanism; 3) DeepCAT+ designs a Twin-Q Optimizer to estimate the execution time of each action without the costly configuration evaluation and optimize the sub-optimal ones to achieve a low-cost exploration-exploitation tradeoff; 4) Furthermore, DeepCAT+ also implements an Online Continual Learner module based on Progressive Neural Networks to transfer knowledge from historical tuning experiences.
New features in DeepCAT+ beyond ICPP22 Paper DeepCAT
Progressive Neural Networks (PNN) based Online Continual Learner to enhance the adaptability for dynamic workloads and hardware environments changes.
- Log-based workload features extraction
- PNN-based knowledge transfer
- Install Hadoop distributed environment and file system.
- install the Spark computing framework.
- Install and compile the hibench testing framework.
- Install Ansible Playbook for batch configuration and automated deployment.
- Data collection: collect offline exploration data, including cluster metric states, configuration values, rewards. The interaction between Python programs and clusters is conducted through
Ansible
tools, check target/target_spark/readme.md for more details. - Use the data to form
memory pool
for offline training and save the model, seeoffline_train()
function inDeepCAT.py
. - Use the model to tune configuration for big data frameworks using
tune()
inDeepCAT.py
. Note there are two polcies:- if the workload is known, DeepCAT+ will direct conduct optimization, details in
DeepCAT.py
. - if the workload is unknown, DeepCAT+ will use Progressive Neural Networks for continual learning to enhence it's adaptability, details in
DeepCAT_with_PNN.py
.
- if the workload is known, DeepCAT+ will direct conduct optimization, details in
- Compare DeepCAT with CDBTune, OtterTune and Qtune baselines.
- Hadoop 2.7.3
- Spark 2.2.2
- Hibench 7.0
- Ansible
pip install -r requirements.txt
we use 9 worklaods with different input data sizes form Hibench The HiBench Benchmark Suite: Characterization of the MapReduce-Based Data Analysis
- WordCount (WC)
- TeraSort (TS)
- PageRank (PR)
- KMeans (KM)
- Gradient Boosted Trees(GBT)
- Nweight (NW)
- Principal Component Analysis (PCA)
- Aggregation (AGG)
- WordCount(for streaming)
- The data collected based on the local 3-node Spark cluster includes the execution time of 4 spark workloads under different configuration values in the
dataset
, check dataset for more details. - For reinforcement learning training, memory pools consist of
transitions
(s,a,r,s') is in test_kit/ultimate/memory
- Description of the performance-critical parameters From Spark, YARN and HDFS
- For experiments on Flink, check test_kit/ultimate/flink-experimental/readme.md for more details.
Seven experimental clusters to extensively evaluate the effectiveness of DeepCAT/DeepCAT+ and its robustness to various hardware environments.
Cluster | Nodes | Cluster types | BD frameworks | Evaluation |
---|---|---|---|---|
Cluster_A | 3 | Physical machines | Spark | Effectiveness |
Cluster_B | 3 | VMs_1 | Spark | Adaptability |
Cluster_C | 6 | Physical machines | Flink | Other BD frameworks |
Cluster_D | 5 | Physical machines + VMs_2 | Spark | Heterogeneous clusters |
Cluster_E | 8 | VMs_3 | Spark | Large-scale clusters |
Cluster_F | 10 | VMs_3 | Spark | Large-scale clusters |
Cluster_G | 12 | VMs_3 | Spark | Large-scale clusters |
- Physical machines: 8 cores, 16GB memory
- VMs_1: 8 cores, 8GB memory
- VMs_2: 12 cores, 8GB memory
- VMs_3: 8 cores, 16GB memory