📑 Paper | 🌐 Project Page | 💾 AGUVIS Data Collection
AGUVIS is a unified, pure vision-based framework for autonomous GUI agents that operate across platforms (web, desktop, mobile). Unlike previous approaches that rely on textual representations, AGUVIS uses purely vision-based observations and a consistent action space to generalize better across platforms.
- 🔍 Pure Vision Framework: First fully autonomous pure vision GUI agent capable of performing tasks independently without relying on closed-source models
- 🔄 Cross-Platform Unification: Unified action space and plugin system that works consistently across different GUI environments
- 📊 Comprehensive Dataset: Large-scale dataset of GUI agent trajectories with multimodal grounding and reasoning
- 🧠 Two-Stage Training: Novel training pipeline focusing on GUI grounding followed by planning and reasoning
- 💭 Inner Monologue: Explicit planning and reasoning capabilities integrated into the model training
Our framework demonstrates state-of-the-art performance in both offline and real-world online scenarios, offering a more efficient and generalizable approach to GUI automation.
Demo videos: overview, AndroidWorld, Mind2Web-Live, OSWorld.
- Clone the repository:
git clone git@github.com:xlang-ai/aguvis.git
cd aguvis
- Create and activate a conda environment:
conda create -n aguvis python=3.10
conda activate aguvis
- Install PyTorch and dependencies:
conda install pytorch torchvision torchaudio pytorch-cuda -c pytorch -c nvidia
pip install -e .
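After installation, a quick sanity check can confirm that PyTorch imports and whether it sees a CUDA device. This is a minimal sketch and not part of the repository: the helper name `check_torch` is hypothetical, and it only uses `torch.__version__` and `torch.cuda.is_available()`.

```shell
# Hypothetical helper (not part of the AGUVIS repository).
# Prints the installed PyTorch version and whether CUDA is visible.
# If torch cannot be imported, prints a fallback message instead of failing.
check_torch() {
  python -c 'import torch; print(torch.__version__, torch.cuda.is_available())' 2>/dev/null \
    || echo "torch is not importable in this environment"
}

check_torch
```

Run it inside the activated `aguvis` environment. If the second printed value is `False`, training will fall back to CPU or fail, depending on the script.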
- Stage 1: Grounding
  - Download the dataset from aguvis-stage1
  - Place the data according to the structure defined in data/stage1.yaml
- Stage 2: Planning and Reasoning
  - Download the dataset from aguvis-stage2
  - Place the data according to the structure defined in data/stage2.yaml
- Configure your training settings:
  - Open scripts/train.sh
  - Set the SFT_TASK variable to specify your training stage
- Start training:
bash scripts/train.sh
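The stage selection inside scripts/train.sh might look like the following minimal sketch. The values `stage1` and `stage2` and the `DATA_CONFIG` variable are assumptions inferred from the file names data/stage1.yaml and data/stage2.yaml; check the script itself for the values it actually accepts.

```shell
# Hypothetical sketch: the accepted SFT_TASK values are not stated in this README.
# "stage1" and "stage2" are assumed from data/stage1.yaml and data/stage2.yaml.
SFT_TASK="stage1"        # grounding stage (assumed name)
# SFT_TASK="stage2"      # planning and reasoning stage (assumed name)

# Assumed mapping from the task name to its data config file.
DATA_CONFIG="data/${SFT_TASK}.yaml"
echo "Training task: ${SFT_TASK} (data config: ${DATA_CONFIG})"
```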
- Configure your inference settings:
  - Open scripts/inference.sh
  - Set the MODEL_PATH variable to specify your model path
  - Set the IMAGE_PATH variable to specify your image path
  - Set the INSTRUCTION variable to specify your instruction
  - Set the PREVIOUS_ACTIONS variable to specify your previous actions, or leave it empty
  - Set the LOW_LEVEL_INSTRUCTION variable to specify your low-level instruction, or leave it empty
- Start inference:
bash scripts/inference.sh
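As an illustration, the variables in scripts/inference.sh could be filled in as sketched below. All values are placeholders chosen for this example, not paths or instructions shipped with the repository. Treating the first three variables as required is an assumption based on the steps above, which only mention leaving the last two empty.

```shell
# Placeholder values for illustration only; replace them with your own.
MODEL_PATH="/path/to/aguvis-checkpoint"   # hypothetical checkpoint location
IMAGE_PATH="./screenshot.png"             # hypothetical screenshot file
INSTRUCTION="Open the settings page"      # example task instruction
PREVIOUS_ACTIONS=""                       # empty: no earlier actions
LOW_LEVEL_INSTRUCTION=""                  # empty: no low-level instruction

# Stop early with a clear message if an assumed-required variable is empty.
: "${MODEL_PATH:?MODEL_PATH must be set}"
: "${IMAGE_PATH:?IMAGE_PATH must be set}"
: "${INSTRUCTION:?INSTRUCTION must be set}"
echo "Inference settings look complete."
```

The `${VAR:?message}` expansion is standard POSIX shell, so the check behaves the same under `bash` and `sh`.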
- Data
- ✅ Stage 1: Grounding Dataset
- ✅ Stage 2: Planning and Reasoning Trajectories
- Code
- ✅ Training Pipeline
- 🚧 Model Weights and Configurations
- 🚧 Inference Scripts
- 🚧 Evaluation Toolkit
If you find this work helpful, please cite it as:
@article{xu2024aguvis,
title={Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction},
author={Yiheng Xu and Zekun Wang and Junli Wang and Dunjie Lu and Tianbao Xie and Amrita Saha and Doyen Sahoo and Tao Yu and Caiming Xiong},
year={2024},
url={https://arxiv.org/abs/2412.04454}
}