Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

AGUVIS

📑 Paper | 🌐 Project Page | 💾 AGUVIS Data Collection

Introduction

AGUVIS is a unified, pure vision-based framework for autonomous GUI agents that operate across platforms (web, desktop, mobile). Unlike previous approaches that rely on textual representations of the interface, AGUVIS uses purely vision-based observations and a single, consistent action space, with the goal of generalizing better across platforms.

Key Features & Contributions

  • 🔍 Pure Vision Framework: First fully autonomous pure vision GUI agent capable of performing tasks independently without relying on closed-source models
  • 🔄 Cross-Platform Unification: Unified action space and plugin system that works consistently across different GUI environments
  • 📊 Comprehensive Dataset: Large-scale dataset of GUI agent trajectories with multimodal grounding and reasoning
  • 🧠 Two-Stage Training: Novel training pipeline focusing on GUI grounding followed by planning and reasoning
  • 💭 Inner Monologue: Explicit planning and reasoning capabilities integrated into the model training

Our framework demonstrates state-of-the-art performance in both offline and real-world online scenarios, offering a more efficient and generalizable approach to GUI automation.
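
To make the idea of a consistent action space across platforms more concrete, here is a minimal sketch. It assumes that actions carry coordinates normalized to [0, 1], which are then mapped to pixels for each screen; this README does not specify that representation. The `Action` class, its fields, and `to_pixels` are hypothetical names used for illustration only, not the AGUVIS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical platform-agnostic action with normalized coordinates."""
    kind: str          # e.g. "click" or "type"; illustrative values only
    x: float = 0.0     # horizontal position in [0, 1]
    y: float = 0.0     # vertical position in [0, 1]
    text: str = ""     # payload for text-entry actions


def to_pixels(action: Action, width: int, height: int) -> tuple[int, int]:
    """Map normalized coordinates onto a concrete screen size."""
    if not (0.0 <= action.x <= 1.0 and 0.0 <= action.y <= 1.0):
        raise ValueError("coordinates must be normalized to [0, 1]")
    # Clamp so that a coordinate of exactly 1.0 lands on the last valid pixel.
    px = min(int(action.x * width), width - 1)
    py = min(int(action.y * height), height - 1)
    return px, py


click = Action(kind="click", x=0.5, y=0.25)
desktop_xy = to_pixels(click, 1920, 1080)  # same action on a desktop screenshot
mobile_xy = to_pixels(click, 1080, 2400)   # and on a phone screenshot
```

Because the action itself never mentions pixels, the same `click` object can be replayed on screens of different sizes; only the final mapping step depends on the platform.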

overview.mp4

Mobile Tasks (Android World)

androidworld.mp4

Web Browsing Tasks (Mind2Web-Live)

mind2web-live.mp4

Computer-use Tasks (OSWorld)

osworld.mp4

Getting Started

Installation

  1. Clone the repository:
git clone git@github.com:xlang-ai/aguvis.git
cd aguvis
  2. Create and activate a conda environment:
conda create -n aguvis python=3.10
conda activate aguvis
  3. Install PyTorch and dependencies:
conda install pytorch torchvision torchaudio pytorch-cuda -c pytorch -c nvidia
pip install -e .
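
After installation, a quick sanity check can confirm that the PyTorch packages are visible to the active environment. The snippet below is a sketch that is not part of the repository; `check_environment` is a hypothetical helper, and it only tests whether each package can be located, without importing it.

```python
import importlib.util
import sys


def check_environment(packages=("torch", "torchvision", "torchaudio")):
    """Return a dict mapping each package name to whether it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Once `torch` is found, `torch.cuda.is_available()` reports whether a GPU is visible to PyTorch.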

Data Preparation

  1. Stage 1: Grounding

  2. Stage 2: Planning and Reasoning

Training

  1. Configure your training settings:

    • Open scripts/train.sh
    • Set the SFT_TASK variable to specify your training stage
  2. Start training:

bash scripts/train.sh

Inference

  1. Configure your inference settings:

    • Open scripts/inference.sh
    • Set MODEL_PATH to the path of your model
    • Set IMAGE_PATH to the path of the input image
    • Set INSTRUCTION to the task instruction
    • Set PREVIOUS_ACTIONS to your previous actions, or leave it empty
    • Set LOW_LEVEL_INSTRUCTION to your low-level instruction, or leave it empty
  2. Start inference:

bash scripts/inference.sh
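
Before launching the script, it can help to check that the required settings are filled in. The sketch below mirrors the variable names listed above; the `InferenceSettings` class is hypothetical, the example values are placeholders, and treating the first three variables as required is an inference from the "or leave it empty" wording rather than something the repository states.

```python
from dataclasses import dataclass


@dataclass
class InferenceSettings:
    """Mirrors the variables named above; the class itself is hypothetical."""
    model_path: str
    image_path: str
    instruction: str
    previous_actions: str = ""         # optional, may stay empty
    low_level_instruction: str = ""    # optional, may stay empty

    def missing_required(self) -> list[str]:
        """Names of required variables that are empty or whitespace only."""
        required = {
            "MODEL_PATH": self.model_path,
            "IMAGE_PATH": self.image_path,
            "INSTRUCTION": self.instruction,
        }
        return [name for name, value in required.items() if not value.strip()]


settings = InferenceSettings(
    model_path="/path/to/model",            # placeholder, not a real checkpoint
    image_path="/path/to/screenshot.png",   # placeholder image path
    instruction="Open the settings menu",   # illustrative instruction
)
missing = settings.missing_required()
```

An empty `missing` list means every required variable has a value; otherwise it names the variables that still need to be set in scripts/inference.sh.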

Checklist

  • Data
    • ✅ Stage 1: Grounding Dataset
    • ✅ Stage 2: Planning and Reasoning Trajectories
  • Code
    • ✅ Training Pipeline
    • 🚧 Model Weights and Configurations
    • 🚧 Inference Scripts
    • 🚧 Evaluation Toolkit

Citation

If you find this work helpful, please cite it as:

@article{xu2024aguvis,
  title={Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction},
  author={Yiheng Xu and Zekun Wang and Junli Wang and Dunjie Lu and Tianbao Xie and Amrita Saha and Doyen Sahoo and Tao Yu and Caiming Xiong},
  year={2024},
  url={https://arxiv.org/abs/2412.04454}
}