[Draft] Distributed tuning #1435

Open
brianchooou wants to merge 2 commits into develop
Conversation

@brianchooou (Contributor) commented Dec 10, 2024

[Draft] TensileParallel Documentation

Overview

TensileParallel is an enhancement to the original Tensile tuning tool that enables parallel tuning across multiple GPU devices. It distributes the tuning workload across the available GPUs, significantly reducing the total tuning time.

Features

  • Multi-GPU support for parallel tuning
  • Automatic workload distribution and load balancing
  • Fallback mechanism to standard Tensile execution
  • Comprehensive logging and error handling
  • Automatic results merging from multiple devices

Prerequisites

  • ROCm environment with hipBLASLt installed
  • Python 3.x
  • Multiple AMD GPU devices (optional)
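
To verify the environment before tuning, a quick check might look like the following (a minimal sketch; the exact rocm-smi output depends on the installed ROCm version, and its device IDs typically correspond to the DeviceList entries used later):

# List the AMD GPUs visible to ROCm
rocm-smi

# Confirm a Python 3 interpreter is available
python3 --version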

Installation

No additional installation is required beyond the standard hipBLASLt setup.

Usage

Basic Command

cd /hipBLASLt/tensilelite
./Tensile/bin/TensileParallel <config.yaml> <output_path>
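
For example, with a tuning configuration stored locally (the file and directory names below are placeholders, not files shipped with the repository):

cd /hipBLASLt/tensilelite
./Tensile/bin/TensileParallel ./configs/gemm_tuning.yaml ./tuning_output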

Configuration for Device Selection

Modify your config.yaml to specify GPU devices using the DeviceList parameter under GlobalParameters:

  1. Specific Devices (a complete configuration sketch follows this list)
GlobalParameters:
  ...
  DeviceList: [1, 2, 3]  # Use GPUs 1, 2, and 3
  2. All Available Devices
GlobalParameters:
  ...
  DeviceList: [-1]  # Use all available GPUs
  3. Default Behavior
  • If DeviceList is not specified or is empty, TensileParallel falls back to standard Tensile execution
  • If the specified devices are unavailable, TensileParallel falls back to standard Tensile execution
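
Putting this together, a sketch of a configuration file with device selection is shown below; the sections other than GlobalParameters are the usual Tensile tuning sections and are elided here:

GlobalParameters:
  DeviceList: [0, 1]       # Tune on GPUs 0 and 1 in parallel
  # ... other global parameters as in a standard Tensile config ...

BenchmarkProblems:
  # ... problem definitions as in a standard Tensile config ...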

Execution Flow

  1. Configuration Loading
  • Validates input configuration
  • Checks device availability
  1. Workload Distribution
  • Analyzes problem sizes
  • Distributes workload based on complexity
  • Generates device-specific configurations
  1. Parallel Execution
  • Runs tuning processes on selected devices
  • Monitors progress and handles errors
  1. Results Processing
  • Merges results from all devices
  • Generates execution summary
  • Creates consolidated output
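
The distribution heuristic is internal to TensileParallel. Purely as an illustration of the idea in step 2, a greedy scheme that balances problems across devices by an estimated cost could look like the sketch below; the function name and the M*N*K cost model are assumptions for this example, not the tool's actual logic:

def distribute_problems(problem_sizes, device_ids):
    """problem_sizes: list of (M, N, K) tuples; device_ids: e.g. [1, 2, 3]."""
    buckets = {dev: [] for dev in device_ids}   # problems assigned to each GPU
    load = {dev: 0 for dev in device_ids}       # accumulated estimated cost
    # Hand out the largest problems first so they spread evenly across devices.
    for size in sorted(problem_sizes, key=lambda s: s[0] * s[1] * s[2], reverse=True):
        dev = min(load, key=load.get)           # currently least-loaded device
        buckets[dev].append(size)
        load[dev] += size[0] * size[1] * size[2]
    return buckets

# Example: split four GEMM sizes across GPUs 1 and 2.
print(distribute_problems([(4096, 4096, 4096), (2048, 2048, 2048),
                           (1024, 1024, 1024), (512, 512, 512)], [1, 2]))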

Output Structure

output_path/
├── config_gpu_*.yaml        # Device-specific configurations
├── outputs/
│   ├── gpu_0/              # Results from GPU 0
│   ├── gpu_1/              # Results from GPU 1
│   └── ...
└── merged_output/          # Final merged results
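
Result merging is handled automatically by TensileParallel. As a rough sketch of the idea only (the copy-based strategy and the *.csv pattern are assumptions, not the tool's actual merge logic), collecting per-GPU result files might look like this:

import shutil
from pathlib import Path

def collect_results(output_path):
    # Gather result files from each outputs/gpu_*/ directory into merged_output/,
    # prefixing file names with the source GPU directory to avoid collisions.
    out = Path(output_path)
    merged = out / "merged_output"
    merged.mkdir(exist_ok=True)
    for gpu_dir in sorted((out / "outputs").glob("gpu_*")):
        for result in gpu_dir.rglob("*.csv"):
            shutil.copy(result, merged / f"{gpu_dir.name}_{result.name}")

collect_results("./tuning_output")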

@vin-huang added the noCI label (Disable testing on supported CI systems; math libraries CI has this feature enabled) on Dec 10, 2024
@brianchooou changed the title from "Distributed tuning" to "[Draft] Distributed tuning" on Dec 10, 2024