[Draft] Distributed tuning #1435

Open
brianchooou wants to merge 2 commits into develop
Conversation

@brianchooou (Contributor) commented Dec 10, 2024

[Draft] TensileParallel Documentation

Overview

TensileParallel is an enhancement to the original Tensile tuning tool that enables parallel tuning across multiple GPU devices. It distributes the tuning workload across the available GPUs, significantly reducing the total tuning time.

Features

  • Multi-GPU support for parallel tuning
  • Automatic workload distribution and load balancing
  • Fallback mechanism to standard Tensile execution
  • Comprehensive logging and error handling
  • Automatic results merging from multiple devices

Prerequisites

  • ROCm environment with hipBLASLt installed
  • Python 3.x
  • Multiple AMD GPU devices (optional)
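
To verify the environment before tuning, a quick check might look like the following (a minimal sketch; the exact rocm-smi output depends on the installed ROCm version, and its device IDs typically correspond to the DeviceList entries used later):

# List the AMD GPUs visible to ROCm
rocm-smi

# Confirm a Python 3 interpreter is available
python3 --version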

Installation

No additional installation is required beyond the standard hipBLASLt setup.

Usage

Basic Command

cd /hipBLASLt/tensilelite
./Tensile/bin/TensileParallel <config.yaml> <output_path>
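
For example, with a tuning configuration stored locally (the file and directory names below are placeholders, not files shipped with the repository):

cd /hipBLASLt/tensilelite
./Tensile/bin/TensileParallel ./configs/gemm_tuning.yaml ./tuning_output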

Configuration for Device Selection

Modify your config.yaml to specify GPU devices using the DeviceList parameter under GlobalParameters:

  1. Specific Devices (a complete configuration sketch follows this list)
GlobalParameters:
  ...
  DeviceList: [1, 2, 3]  # Use GPUs 1, 2, and 3
  2. All Available Devices
GlobalParameters:
  ...
  DeviceList: [-1]  # Use all available GPUs
  3. Default Behavior
  • If DeviceList is not specified or is empty, TensileParallel falls back to standard Tensile execution
  • If the specified devices are unavailable, TensileParallel falls back to standard Tensile execution
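
Putting this together, a sketch of a configuration file with device selection is shown below; the sections other than GlobalParameters are the usual Tensile tuning sections and are elided here:

GlobalParameters:
  DeviceList: [0, 1]       # Tune on GPUs 0 and 1 in parallel
  # ... other global parameters as in a standard Tensile config ...

BenchmarkProblems:
  # ... problem definitions as in a standard Tensile config ...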

Execution Flow

  1. Configuration Loading
  • Validates input configuration
  • Checks device availability
  1. Workload Distribution
  • Analyzes problem sizes
  • Distributes workload based on complexity
  • Generates device-specific configurations
  1. Parallel Execution
  • Runs tuning processes on selected devices
  • Monitors progress and handles errors
  1. Results Processing
  • Merges results from all devices
  • Generates execution summary
  • Creates consolidated output
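
The distribution heuristic is internal to TensileParallel. Purely as an illustration of the idea in step 2, a greedy scheme that balances problems across devices by an estimated cost could look like the sketch below; the function name and the M*N*K cost model are assumptions for this example, not the tool's actual logic:

def distribute_problems(problem_sizes, device_ids):
    """problem_sizes: list of (M, N, K) tuples; device_ids: e.g. [1, 2, 3]."""
    buckets = {dev: [] for dev in device_ids}   # problems assigned to each GPU
    load = {dev: 0 for dev in device_ids}       # accumulated estimated cost
    # Hand out the largest problems first so they spread evenly across devices.
    for size in sorted(problem_sizes, key=lambda s: s[0] * s[1] * s[2], reverse=True):
        dev = min(load, key=load.get)           # currently least-loaded device
        buckets[dev].append(size)
        load[dev] += size[0] * size[1] * size[2]
    return buckets

# Example: split four GEMM sizes across GPUs 1 and 2.
print(distribute_problems([(4096, 4096, 4096), (2048, 2048, 2048),
                           (1024, 1024, 1024), (512, 512, 512)], [1, 2]))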

Output Structure

output_path/
├── config_gpu_*.yaml        # Device-specific configurations
├── outputs/
│   ├── gpu_0/              # Results from GPU 0
│   ├── gpu_1/              # Results from GPU 1
│   └── ...
└── merged_output/          # Final merged results
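
Result merging is handled automatically by TensileParallel. As a rough sketch of the idea only (the copy-based strategy and the *.csv pattern are assumptions, not the tool's actual merge logic), collecting per-GPU result files might look like this:

import shutil
from pathlib import Path

def collect_results(output_path):
    # Gather result files from each outputs/gpu_*/ directory into merged_output/,
    # prefixing file names with the source GPU directory to avoid collisions.
    out = Path(output_path)
    merged = out / "merged_output"
    merged.mkdir(exist_ok=True)
    for gpu_dir in sorted((out / "outputs").glob("gpu_*")):
        for result in gpu_dir.rglob("*.csv"):
            shutil.copy(result, merged / f"{gpu_dir.name}_{result.name}")

collect_results("./tuning_output")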

@vin-huang added the noCI label (Disable testing on supported CI systems; math libraries CI has this feature enabled) on Dec 10, 2024
@brianchooou changed the title from "Distributed tuning" to "[Draft] Distributed tuning" on Dec 10, 2024