Authors: Shoutao Guo, Shaolei Zhang, Zhengrui Ma, Yang Feng*
Code for the AAAI 2025 paper "Large Language Models Are Read/Write Policy-Makers for Simultaneous Generation"
💡 Highlights:
- LSG is an LLM-driven simultaneous generation framework that allows off-the-shelf LLMs to decide the generation timing and produce output concurrently.
- Experiments on simultaneous text-to-text translation and speech-to-text translation demonstrate that LSG achieves SOTA performance on standard benchmarks.
- LSG also shows robust performance on the streaming ASR task.
Requirements:
- Python version = 3.11.9
- PyTorch version = 2.2.1
- Transformers version = 4.32.0

Install our library:
```bash
git clone https://github.com/ictnlp/LSG
cd LSG
pip install -e .
```
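A quick sanity check (a minimal sketch; it only verifies that the installed versions match the ones listed above):

```python
# Verify that the environment matches the versions listed above.
import sys
import torch
import transformers

print("Python      :", sys.version.split()[0])    # expected 3.11.9
print("PyTorch     :", torch.__version__)         # expected 2.2.1
print("Transformers:", transformers.__version__)  # expected 4.32.0
```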
We keep the same settings as Agent-SiMT. We use Llama2-7B-Chat as the base model and fine-tune it on 50k samples drawn from the WMT15 German-English (download here) and MuST-C English-German (download here) datasets. The detailed fine-tuning scripts can be found here.
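For illustration only, subsampling 50k parallel sentence pairs from a plain-text bitext could look like the sketch below; the file names (train.de / train.en) and output format are assumptions, not the repository's actual preprocessing, which lives in the fine-tuning scripts linked above.

```python
# Illustrative only: sample 50k parallel sentence pairs from a plain-text bitext.
# File names and output format are assumptions, not the repository's preprocessing.
import random

random.seed(0)

with open("train.de", encoding="utf-8") as f_src, open("train.en", encoding="utf-8") as f_tgt:
    pairs = list(zip(f_src.read().splitlines(), f_tgt.read().splitlines()))

sampled = random.sample(pairs, 50000)

with open("sampled.de", "w", encoding="utf-8") as f_src, open("sampled.en", "w", encoding="utf-8") as f_tgt:
    for src, tgt in sampled:
        f_src.write(src + "\n")
        f_tgt.write(tgt + "\n")
```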
We directly use the off-the-shelf Qwen-Audio model for speech input.
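A minimal sketch for obtaining the checkpoint (the Hugging Face model id below is an assumption; the model directory passed to the scripts should point at whatever local copy you use):

```python
# Illustrative only: download an off-the-shelf Qwen-Audio checkpoint.
# "Qwen/Qwen-Audio-Chat" is an assumed hub id, not necessarily the exact
# checkpoint used in the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen-Audio-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
```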
We prepare the test data following the SimulEval format.
- source_audio.txt: each line records the path of a source speech file.
- target.txt: each line records the reference text, e.g., the target translation or the source transcription (used to calculate the BLEU or WER metrics).
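For example, both files could be generated from a tab-separated manifest; the manifest name and layout below are assumptions for illustration only.

```python
# Illustrative only: build source_audio.txt and target.txt from a hypothetical
# tab-separated manifest "test_manifest.tsv" with lines like
#   /path/to/utt1.wav<TAB>reference text
with open("test_manifest.tsv", encoding="utf-8") as f_in, \
     open("source_audio.txt", "w", encoding="utf-8") as f_audio, \
     open("target.txt", "w", encoding="utf-8") as f_ref:
    for line in f_in:
        audio_path, reference = line.rstrip("\n").split("\t", maxsplit=1)
        f_audio.write(audio_path + "\n")
        f_ref.write(reference + "\n")
```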
Run the following scripts to perform the evaluation. We provide inference scripts for simultaneous speech-to-text translation and streaming ASR.
For simultaneous speech-to-text translation, the inference script is provided in eval_contrastive_policy.sh.
```bash
export CUDA_VISIBLE_DEVICES=0,1

DELTA=delta              # placeholder: decision ratio (passed to --decision_ratio)
ALPHA=alpha              # placeholder: threshold (passed to --threshold)
LOW_BOUND=low_bound      # placeholder: lower bound (passed to --low_bound)
TOP_BOUND=top_bound      # placeholder: upper bound (passed to --top_bound)
SEG_SIZE=640             # source segment size in ms
MODEL=qwen_audio_dir     # path to the Qwen-Audio model directory
SOURCE=translation_file/source_audio.txt
TARGET=translation_file/target.txt

simuleval --agent contrastive_policy.py \
    --source-segment-size $SEG_SIZE \
    --source_size $SEG_SIZE \
    --source $SOURCE \
    --target $TARGET \
    --threshold $ALPHA \
    --low_bound $LOW_BOUND \
    --top_bound $TOP_BOUND \
    --decision_ratio $DELTA \
    --lang_pair fr_en \
    --model_dir $MODEL \
    --output result_log_${SEG_SIZE}_${LOW_BOUND}_${TOP_BOUND}_${DELTA}_${ALPHA}
```
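For orientation, the custom flags (--threshold, --low_bound, --top_bound, --decision_ratio, --lang_pair, --model_dir, --source_size) are defined by the agent passed via --agent. The skeleton below only sketches how such flags reach a SimulEval speech-to-text agent; it is not the LLM-driven policy implemented in contrastive_policy.py.

```python
# Sketch of a SimulEval speech-to-text agent that registers custom flags like
# the ones above. The real read/write policy lives in contrastive_policy.py;
# this placeholder merely reads until the source ends, then writes once.
from simuleval.utils import entrypoint
from simuleval.agents import SpeechToTextAgent
from simuleval.agents.actions import ReadAction, WriteAction


@entrypoint
class SketchAgent(SpeechToTextAgent):
    def __init__(self, args):
        super().__init__(args)
        self.threshold = args.threshold
        self.decision_ratio = args.decision_ratio
        self.model_dir = args.model_dir  # where the base model would be loaded from

    @staticmethod
    def add_args(parser):
        parser.add_argument("--threshold", type=float, default=0.5)
        parser.add_argument("--low_bound", type=int, default=1)
        parser.add_argument("--top_bound", type=int, default=10)
        parser.add_argument("--decision_ratio", type=float, default=0.5)
        parser.add_argument("--lang_pair", type=str, default="de_en")
        parser.add_argument("--source_size", type=int, default=640)
        parser.add_argument("--model_dir", type=str, required=True)

    def policy(self):
        # Placeholder policy: READ until the source is finished, then WRITE.
        # The LLM-driven read/write decisions replace this logic.
        if not self.states.source_finished:
            return ReadAction()
        return WriteAction(content="placeholder output", finished=True)
```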
For streaming ASR, the inference script is provided in eval_contrastive_policy_asr.sh.
```bash
export CUDA_VISIBLE_DEVICES=0,1

DELTA=delta              # placeholder: decision ratio (passed to --decision_ratio)
ALPHA=alpha              # placeholder: threshold (passed to --threshold)
LOW_BOUND=low_bound      # placeholder: lower bound (passed to --low_bound)
TOP_BOUND=top_bound      # placeholder: upper bound (passed to --top_bound)
SEG_SIZE=640             # source segment size in ms
MODEL=qwen_audio_dir     # path to the Qwen-Audio model directory
SOURCE=source_audio.txt
TARGET=transcription.txt

simuleval --agent contrastive_policy_asr.py \
    --source-segment-size $SEG_SIZE \
    --source_size $SEG_SIZE \
    --source $SOURCE \
    --target $TARGET \
    --threshold $ALPHA \
    --low_bound $LOW_BOUND \
    --top_bound $TOP_BOUND \
    --decision_ratio $DELTA \
    --lang_pair fr_fr \
    --quality-metrics WER \
    --model_dir $MODEL \
    --output result_log_${SEG_SIZE}_${LOW_BOUND}_${TOP_BOUND}_${DELTA}_${ALPHA}
```
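The ASR run reports WER via --quality-metrics WER, i.e., the word-level edit distance between hypothesis and reference divided by the reference length. For a standalone sanity check, independent of SimulEval's own scorer, a minimal implementation is sketched below.

```python
# Standalone word error rate, for sanity-checking results reported with
# --quality-metrics WER (SimulEval computes this itself during evaluation).
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words (single rolling row).
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = min(
                d[j] + 1,                           # deletion
                d[j - 1] + 1,                       # insertion
                prev + (ref[i - 1] != hyp[j - 1]),  # substitution / match
            )
            prev, d[j] = d[j], cur
    return d[len(hyp)] / max(len(ref), 1)


print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```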
If you have any questions, please feel free to submit an issue or contact [email protected].
If you find our work useful, please cite it as:
```bibtex
@article{lsg_ictnlp,
  title={Large Language Models Are Read/Write Policy-Makers for Simultaneous Generation},
  author={Shoutao Guo and Shaolei Zhang and Zhengrui Ma and Yang Feng},
  year={2025},
  journal={Proceedings of the AAAI Conference on Artificial Intelligence}
}
```