S2ST Survey & History

2025-05-10T10:47:23.918452

End-to-End Speech Translation Progress

Data

Corpus	Direction	Target	Duration	License
CoVoST 2	{Fr, De, Es, Ca, It, Ru, Zh, Pt, Fa, Et, Mn, Nl, Tr, Ar, Sv, Lv, Sl, Ta, Ja, Id, Cy} -> En and En ->	Text	2880h	CC0
CVSS	{Fr, De, Es, Ca, It, Ru, Zh, Pt, Fa, Et, Mn, Nl, Tr, Ar, Sv, Lv, Sl, Ta, Ja, Id, Cy} -> En	Text & Speech	1900h	CC BY 4.0
mTEDx	{Es, Fr, Pt, It, Ru, El} -> En, {Fr, Pt, It} -> Es, Es -> {Fr, It}, {Es, Fr} -> Pt	Text	765h	CC BY-NC-ND 4.0
CoVoST	{Fr, De, Nl, Ru, Es, It, Tr, Fa, Sv, Mn, Zh} -> En	Text	700h	CC0
MuST-C & MuST-Cinema	En ->	Text	504h	CC BY-NC-ND 4.0
How2	En -> Pt	Text	300h	YouTube & CC BY-SA 4.0
Augmented LibriSpeech	En -> Fr	Text	236h	CC BY 4.0
Europarl-ST	{En, Fr, De, Es, It, Pt, Pl, Ro, Nl} ->	Text	280h	CC BY-NC 4.0
Kosp2e	Ko -> En	Text	198h	Mixed CC
Fisher + Callhome	Es -> En	Text	160h+20h	LDC
MaSS	parallel among En, Es, Eu, Fi, Fr, Hu, Ro and Ru	Text & Speech	172h	Bible.is
LibriVoxDeEn	De -> En	Text	110h	CC BY-NC-SA 4.0
Prabhupadavani	parallel among En, Fr, De, Gu, Hi, Hu, Id, It, Lv, Lt, Ne, Fa, Pl, Pt, Ru, Sl, Sk, Es, Se, Ta, Te, Tr, Bg, Hr, Da and Nl	Text	94h
BSTC	Zh -> En	Text	68h
LibriS2S	De <-> En	Text & Speech	52h/57h	CC BY-NC-SA 4.0

Toolkit

This repository collects the toolkits, common datasets and paper list related to research on Simultaneous Translation. It is continuously updated.

It would be a great honor if this repository is of help or reference to your research. If you have any suggestions, feel free to contact me: Shaolei Zhang, zhangshaolei20z@ict.ac.cn.

Toolkits

Fairseq: a sequence modeling toolkit covering machine translation, speech translation and simultaneous translation (both text-to-text and speech-to-text).
SimulEval: a general evaluation framework for simultaneous translation on text and speech.

Datasets

Conventional text-to-text translation datasets:
IWSLT15 English-Vietnamese: 133K sentence pairs. [Link]
WMT15 German-English: 4.5M sentence pairs. [Link]
WMT14 English-French: 36.3M sentence pairs. [Link]
Conventional speech-to-text translation datasets:
MuST-C: multilingual speech-to-text translation corpus with 8 language pairs. [Link]
Conventional speech-to-speech translation datasets:
CVSS: massively multilingual-to-English speech-to-speech translation corpus. [Link]
Simultaneous interpretation datasets:
BSTC Chinese-English: 68 hours. [Link]
NAIST-SIC English-Japanese: 22 hours. [Link]

Tutorials & Talks

PACLIC 2016: The Challenge of Simultaneous Speech Translation. Anoop Sarkar. [Link]

EMNLP 2020: Simultaneous Translation. Liang Huang, Colin Cherry, Mingbo Ma, Naveen Arivazhagan, and Zhongjun He. [Link]

AMTA 2020: Simultaneous Speech Translation in Google Translate. Jeff Pitman. [Link]

Paper List

This is a paper list on Simultaneous Translation, organized by publication year.

We also collect a paper list organized by different categories. Refer to Here.

2002 | 2006 | 2007 | 2009 | 2010 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | 2024

2002

Translation Unit Concerning Timing of Simultaneous Translation. LREC 2002. [PDF]

2006

Simultaneous English-Japanese Spoken Language Translation Based on Incremental Dependency Parsing and Transfer. ACL 2006. [PDF]

2007

Simultaneous translation of lectures and speeches. Mach Translat 2007. [PDF]

2009

End-to-End Evaluation in Simultaneous Translation. EACL 2009. [PDF]

2010

Stream-based Translation Models for Statistical Machine Translation. NAACL 2010. [PDF]
Construction of Chunk-Aligned Bilingual Lecture Corpus for Simultaneous Machine Translation. LREC 2010. [PDF]

2012

Real-time Incremental Speech-to-Speech Translation of Dialogs. NAACL 2012. [PDF]

2013

Incremental Segmentation and Decoding Strategies for Simultaneous Translation. IJCNLP 2013. [PDF]

2014

Optimizing Segmentation Strategies for Simultaneous Speech Translation. ACL 2014. [PDF]
Collection of a Simultaneous Translation Corpus for Comparative Analysis. LREC 2014. [PDF]
Don’t Until the Final Verb Wait: Reinforcement Learning for Simultaneous Machine Translation. EMNLP 2014. [PDF]
Towards Simultaneous Interpreting: the Timing of Incremental Machine Translation and Speech Synthesis. IWSLT 2014. [PDF]
Segmentation Strategies for Streaming Speech Translation. NAACL 2014. [PDF]

2015

Automated Simultaneous Interpretation: Hints of a Cognitive Framework for Machine Translation. HyTra 2015. [PDF]
Syntax-based Simultaneous Translation through Prediction of Unseen Syntactic Constituents. ACL 2015. [PDF]
Syntax-based Rewriting for Simultaneous Machine Translation. EMNLP 2015. [PDF]
Improved Speech-to-Text Translation with the Fisher and Callhome Spanish–English Speech Translation Corpus. IWSLT 2015. [PDF]

2016

An Efficient and Effective Online Sentence Segmenter for Simultaneous Interpretation. WAT 2016. [PDF]
Interpretese vs. Translationese: The Uniqueness of Human Strategies in Simultaneous Interpretation. NAACL 2016. [PDF] [Code]
Simultaneous Sentence Boundary Detection and Alignment with Pivot-based Machine Translation Generated Lexicons. LREC 2016. [PDF]
A Prototype Automatic Simultaneous Interpretation System. COLING 2016. [PDF]
Simultaneous Machine Translation using Deep Reinforcement Learning. ICML 2016. [PDF]
Can neural Machine Translation do Simultaneous Translation? Arxiv 2016. [PDF]
Listen and translate: A proof of concept for end-to-end speech-to-text translation. NIPS Workshop 2016. [PDF]
An Attentional Model for Speech Translation Without Transcription. NAACL 2016. [PDF]

2017

Online and Linear-Time Attention by Enforcing Monotonic Alignments. ICML 2017. [PDF] [Code]
Learning to Translate in Real-time with Neural Machine Translation. EACL 2017. [PDF] [Code]
Sequence-to-Sequence Models Can Directly Translate Foreign Speech. INTERSPEECH 2017. [PDF]
Structured-based Curriculum Learning for End-to-end English-Japanese Speech Translation. INTERSPEECH 2017. [PDF]
Towards speech-to-text translation without speech recognition. EACL 2017. [PDF]

2018

Simultaneous Translation using Optimized Segmentation. AMTA 2018. [PDF]
Automatic Estimation of Simultaneous Interpreter Performance. ACL 2018. [PDF] [Code]
Incremental Decoding and Training Methods for Simultaneous Translation in Neural Machine Translation. NAACL 2018. [PDF] [Code]
Statistical Analysis of Missing Translation in Simultaneous Interpretation Using A Large-scale Bilingual Speech Corpus. LREC 2018. [PDF]
Prediction Improves Simultaneous Neural Machine Translation. EMNLP 2018. [PDF] [Code]
KIT Lecture Translator: Multilingual Speech Translation with One-Shot Learning. COLING 2018. [PDF]
Monotonic Chunkwise Attention. ICLR 2018. [PDF] [Code]
How2: A Large-scale Dataset for Multimodal Language Understanding. NIPS 2018. [PDF]
End-to-End Speech Translation with the Transformer. IberSPEECH 2018. [PDF]
Low-Resource Speech-to-Text Translation. INTERSPEECH 2018. [PDF]
Augmenting Librispeech with French Translations: A Multimodal Corpus for Direct Speech Translation Evaluation. LREC 2018. [PDF]
Tied multitask learning for neural speech translation. NAACL 2018. [PDF]
End-to-End Automatic Speech Translation of Audiobooks. ICASSP 2018. [PDF]

2019

Monotonic Infinite Lookback Attention for Simultaneous Machine Translation. ACL 2019. [PDF]
STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework. ACL 2019. [PDF]
Simultaneous Translation with Flexible Policy via Restricted Imitation Learning. ACL 2019. [PDF]
Lost in Interpretation: Predicting Untranslated Terminology in Simultaneous Interpretation. NAACL 2019. [PDF] [Code]
Simpler and Faster Learning of Adaptive Policies for Simultaneous Translation. EMNLP 2019. [PDF]
Speculative Beam Search for Simultaneous Translation. EMNLP 2019. [PDF]
Thinking Slow about Latency Evaluation for Simultaneous Machine Translation. Arxiv 2019. [PDF]
DuTongChuan: Context-aware Translation Model for Simultaneous Interpreting. Arxiv 2019. [PDF]
Simultaneous Neural Machine Translation using Connectionist Temporal Classification. Arxiv 2019. [PDF]
One-To-Many Multilingual End-to-end Speech Translation. ASRU 2019. [PDF]
Multilingual End-to-End Speech Translation. ASRU 2019. [PDF]
Speech-to-speech Translation between Untranscribed Unknown Languages. ASRU 2019. [PDF]
A Comparative Study on End-to-end Speech to Text Translation. ASRU 2019. [PDF]
Harnessing Indirect Training Data for End-to-End Automatic Speech Translation: Tricks of the Trade. IWSLT 2019. [PDF]
On Using SpecAugment for End-to-End Speech Translation. IWSLT 2019. [PDF]
End-to-End Speech Translation with Knowledge Distillation. INTERSPEECH 2019. [PDF]
Adapting Transformer to End-to-end Spoken Language Translation. INTERSPEECH 2019. [PDF]
Direct speech-to-speech translation with a sequence-to-sequence model. INTERSPEECH 2019. [PDF]
Exploring Phoneme-Level Speech Representations for End-to-End Speech Translation. ACL 2019. [PDF]
Attention-Passing Models for Robust and Data-Efficient End-to-End Speech Translation. ACL 2019. [PDF]
Pre-training on High-Resource Speech Recognition Improves Low-Resource Speech-to-Text Translation. NAACL 2019. [PDF]
MuST-C: a Multilingual Speech Translation Corpus. NAACL 2019. [PDF]
Fluent Translations from Disfluent Speech in End-to-End Speech Translation. NAACL 2019. [PDF]
Leveraging Weakly Supervised Data to Improve End-to-End Speech-to-Text Translation. ICASSP 2019. [PDF]
Towards unsupervised speech-to-text translation. ICASSP 2019. [PDF]
Towards End-to-end Speech-to-text Translation with Two-pass Decoding. ICASSP 2019. [PDF]

2020

Towards Multimodal Simultaneous Neural Machine Translation. WMT 2020. [PDF] [Code]
Opportunistic Decoding with Timely Correction for Simultaneous Translation. ACL 2020. [PDF]
Simultaneous Translation Policies: From Fixed to Adaptive. ACL 2020. [PDF]
SimulSpeech: End-to-End Simultaneous Speech to Text Translation. ACL 2020. [PDF]
Gender in Danger? Evaluating Speech Translation Technology on the MuST-SHE Corpus. ACL 2020. [PDF]
Speech Translation and the End-to-End Promise: Taking Stock of Where We Are. ACL 2020 Theme. [PDF]
Worse WER, but Better BLEU? Leveraging Word Embedding as Intermediate in Multitask End-to-End Speech Translation. ACL 2020. [PDF]
Phone Features Improve Speech Translation. ACL 2020. [PDF]
Curriculum Pre-training for End-to-End Speech Translation. ACL 2020. [PDF]
ESPnet-ST: All-in-One Speech Translation Toolkit. ACL 2020 Demo. [PDF]
Monotonic Multihead Attention. ICLR 2020. [PDF] [Code]
Learning Adaptive Segmentation Policy for Simultaneous Translation. EMNLP 2020. [PDF]
Simultaneous Machine Translation with Visual Context. EMNLP 2020. [PDF] [Code]
Direct Segmentation Models for Streaming Speech Translation. EMNLP 2020. [PDF] [Code]
Effectively pretraining a speech translation decoder with Machine Translation data. EMNLP 2020. [PDF]
SIMULEVAL: An Evaluation Toolkit for Simultaneous Translation. EMNLP 2020 Demo. [PDF] [Code]
Incremental Text-to-Speech Synthesis with Prefix-to-Prefix Framework. EMNLP 2020 Findings. [PDF]
Fluent and Low-latency Simultaneous Speech-to-Speech Translation with Self-adaptive Training. EMNLP 2020 Findings. [PDF]
Adaptive Feature Selection for End-to-End Speech Translation. EMNLP 2020 Findings. [PDF]
A General Framework for Adaptation of Neural Machine Translation to Simultaneous Translation. AACL 2020. [PDF]
SimulMT to SimulST: Adapting Simultaneous Text Translation to End-to-End Simultaneous Speech Translation. AACL 2020. [PDF] [Code]
fairseq S2T: Fast Speech-to-Text Modeling with fairseq. AACL 2020 Demo. [PDF]
Bridging the Gap between Pre-Training and Fine-Tuning for End-to-End Speech Translation. AAAI 2020. [PDF]
Synchronous Speech Recognition and Speech-to-Text Translation with Interactive Decoding. AAAI 2020. [PDF]
Re-Translation Strategies For Long Form, Simultaneous, Spoken Language Translation. ICASSP 2020. [PDF]
Europarl-ST: A Multilingual Corpus For Speech Translation Of Parliamentary Debates. ICASSP 2020. [PDF]
Instance-Based Model Adaptation For Direct Speech Translation. ICASSP 2020. [PDF]
Data Efficient Direct Speech-to-Text Translation with Modality Agnostic Meta-Learning. ICASSP 2020. [PDF]
Analyzing ASR pretraining for low-resource speech-to-text translation. ICASSP 2020. [PDF]
End-to-End Speech Translation with Self-Contained Vocabulary Manipulation. ICASSP 2020. [PDF]
Efficient Wait-k Models for Simultaneous Machine Translation. InterSpeech 2020. [PDF] [Code]
Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection. InterSpeech 2020. [PDF]
Relative Positional Encoding for Speech Recognition and Direct Translation. InterSpeech 2020. [PDF]
Contextualized Translation of Automatically Segmented Speech. InterSpeech 2020. [PDF]
Self-Training for End-to-End Speech Translation. InterSpeech 2020. [PDF]
Improving Cross-Lingual Transfer Learning for End-to-End Speech Recognition with Speech Translation. InterSpeech 2020. [PDF]
Self-Supervised Representations Improve End-to-End Speech Translation. InterSpeech 2020. [PDF]
Investigating Self-Supervised Pre-Training for End-to-End Speech Translation. InterSpeech 2020. [PDF]
CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus. LREC 2020. [PDF]
MuST-Cinema: a Speech-to-Subtitles corpus. LREC 2020. [PDF]
MaSS: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the Bible. LREC 2020. [PDF]
LibriVoxDeEn: A Corpus for German-to-English Speech Translation and Speech Recognition. LREC 2020. [PDF]
On Target Segmentation for Direct Speech Translation. AMTA 2020. [PDF]
Consistent Transcription and Translation of Speech. TACL 2020. [PDF]
Presenting Simultaneous Translation in Limited Space. Arxiv 2020. [PDF]
Simultaneous Speech-to-Speech Translation System with Neural Incremental ASR, MT, and TTS. Arxiv 2020. [PDF]
Low Latency ASR for Simultaneous Speech Translation. Arxiv 2020. [PDF]
Bridging the Modality Gap for Speech-to-Text Translation. Arxiv 2020. [PDF]
CSTNet: Contrastive Speech Translation Network for Self-Supervised Speech Representation Learning. Arxiv 2020. [PDF]
On Knowledge Distillation for Direct Speech Translation. CLiC-it 2020. [PDF]
Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation. COLING 2020. [PDF]
Breeding Gender-aware Direct Speech Translation Systems. COLING 2020. [PDF]

2021

Monotonic Simultaneous Translation with Chunk-wise Reordering and Refinement. WMT 2021. [PDF]
Simultaneous Neural Machine Translation with Constituent Label Prediction. WMT 2021. [PDF]
Future-Guided Incremental Transformer for Simultaneous Translation. AAAI 2021. [PDF]
Studying The Impact Of Document-level Context On Simultaneous Neural Machine Translation. Machine Translation 2021. [PDF]
Beyond Sentence-Level End-to-End Speech Translation: Context Helps. ACL 2021. [PDF] [Code]
RealTranS: End-to-End Simultaneous Speech Translation with Convolutional Weighted-Shrinking Transformer. ACL 2021 Findings. [PDF]
Direct Simultaneous Speech-to-Text Translation Assisted by Synchronized Streaming ASR. ACL 2021 Findings. [PDF]
Multilingual Simultaneous Neural Machine Translation. ACL 2021 Findings. [PDF]
Universal Simultaneous Machine Translation with Mixture-of-Experts Wait-k Policy. EMNLP 2021. [PDF] [Code]
Cross Attention Augmented Transducer Networks for Simultaneous Translation. EMNLP 2021. [PDF] [Code]
Translation-based Supervision for Policy Generation in Simultaneous Neural Machine Translation. EMNLP 2021. [PDF] [Code]
Improving Simultaneous Translation by Incorporating Pseudo-References with Fewer Reorderings. EMNLP 2021. [PDF]
A Generative Framework for Simultaneous Machine Translation. EMNLP 2021. [PDF]
It Is Not As Good As You Think! Evaluating Simultaneous Machine Translation on Interpretation Data. EMNLP 2021. [PDF] [Code]
Stream-level Latency Evaluation for Simultaneous Machine Translation. EMNLP 2021 Findings. [PDF] [Code]
MiSS: An Assistant for Multi-Style Simultaneous Translation. EMNLP 2021 Demo. [PDF]
Learning Coupled Policies for Simultaneous Machine Translation using Imitation Learning. EACL 2021. [PDF] [Code]
Exploiting Multimodal Reinforcement Learning for Simultaneous Machine Translation. EACL 2021. [PDF] [Code]
An Empirical Study of End-to-End Simultaneous Speech Translation Decoding Strategies. ICASSP 2021. [PDF]
Streaming Simultaneous Speech Translation with Augmented Memory Transformer. ICASSP 2021. [PDF]
Impact of Encoding and Segmentation Strategies on End-to-End Simultaneous Speech Translation. Interspeech 2021. [PDF]
Visualization: the missing factor in Simultaneous Speech Translation. CLiC-it 2021. [PDF]
UniST: Unified End-to-end Model for Streaming and Non-streaming Speech Translation. Arxiv 2021. [PDF]
Faster Re-translation Using Non-Autoregressive Model For Simultaneous Neural Machine Translation. Arxiv 2021. [PDF]
Learning to Use Future Information in Simultaneous Translation. Arxiv 2021. [PDF]
Simultaneous Multi-Pivot Neural Machine Translation. Arxiv 2021. [PDF]
Full-Sentence Models Perform Better in Simultaneous Translation Using the Information Enhanced Decoding Strategy. Arxiv 2021. [PDF]
Decision Attentive Regularization to Improve Simultaneous Speech Translation Systems. Arxiv 2021. [PDF]
Direct Simultaneous Speech-to-Speech Translation with Variational Monotonic Multihead Attention. Arxiv 2021. [PDF]
SimulSLT: End-to-End Simultaneous Sign Language Translation. Arxiv 2021. [PDF]
Efficient Transformer for Direct Speech Translation. Arxiv 2021. [PDF]
Zero-shot Speech Translation. Arxiv 2021. [PDF]
Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates. ASRU 2021. [PDF]
Assessing Evaluation Metrics for Speech-to-Speech Translation. ASRU 2021. [PDF]
Enabling Zero-shot Multilingual Spoken Language Translation with Language-Specific Encoders and Decoders. ASRU 2021. [PDF]
Beyond Voice Activity Detection: Hybrid Audio Segmentation for Direct Speech Translation. ICNLSP 2021. [PDF]
Speechformer: Reducing Information Loss in Direct Speech Translation. EMNLP 2021. [PDF]
Is “moby dick” a Whale or a Bird? Named Entities and Terminology in Speech Translation. EMNLP 2021. [PDF]
Mutual-Learning Improves End-to-End Speech Translation. EMNLP 2021. [PDF]
End-to-end Speech Translation via Cross-modal Progressive Training. Interspeech 2021. [PDF]
CoVoST 2 and Massively Multilingual Speech-to-Text Translation. Interspeech 2021. [PDF]
The Multilingual TEDx Corpus for Speech Recognition and Translation. Interspeech 2021. [PDF]
Large-Scale Self-and Semi-Supervised Learning for Speech Translation. Interspeech 2021. [PDF]
Kosp2e: Korean Speech to English Translation Corpus. Interspeech 2021. [PDF]
AlloST: Low-resource Speech Translation without Source Transcription. Interspeech 2021. [PDF]
SpecRec: An Alternative Solution for Improving End-to-End Speech-to-Text Translation via Spectrogram Reconstruction. Interspeech 2021. [PDF]
Optimally Encoding Inductive Biases into the Transformer Improves End-to-End Speech Translation. Interspeech 2021. [PDF]
ASR Posterior-based Loss for Multi-task End-to-end Speech Translation. Interspeech 2021. [PDF]
Simultaneous Speech Translation for Live Subtitling: from Delay to Display. AMTA 2021. [PDF]
Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders. ACL 2021. [PDF]
Multilingual Speech Translation with Efficient Finetuning of Pretrained Models. ACL 2021. [PDF]
Lightweight Adapter Tuning for Multilingual Speech Translation. ACL 2021. [PDF]
Cascade versus Direct Speech Translation: Do the Differences Still Make a Difference? ACL 2021. [PDF]
Improving Speech Translation by Understanding and Learning from the Auxiliary Text Translation Task. ACL 2021. [PDF]
AdaST: Dynamically Adapting Encoder States in the Decoder for End-to-End Speech-to-Text Translation. ACL 2021 Findings. [PDF]
Learning Shared Semantic Space for Speech-to-Text Translation. ACL 2021 Findings. [PDF]
Investigating the Reordering Capability in CTC-based Non-Autoregressive End-to-End Speech Translation. ACL 2021 Findings. [PDF]
How to Split: the Effect of Word Segmentation on Gender Bias in Speech Translation. ACL 2021 Findings. [PDF]
NeurST: Neural Speech Translation Toolkit. ACL 2021 Demo. [PDF]
Fused Acoustic and Text Encoding for Multimodal Bilingual Pretraining and Speech Translation. ICML 2021. [PDF]
Source and Target Bidirectional Knowledge Distillation for End-to-end Speech Translation. NAACL 2021. [PDF]
Searchable Hidden Intermediates for End-to-End Models of Decomposable Sequence Tasks. NAACL 2021. [PDF]
BSTC: A Large-Scale Chinese-English Speech Translation Dataset. NAACL AutoSimTrans 2021. [PDF]
Highland Puebla Nahuatl–Spanish Speech Translation Corpus for Endangered Language Documentation. AmericasNLP 2021. [PDF]
Task Aware Multi-Task Learning for Speech to Text Tasks. ICASSP 2021. [PDF]
A General Multi-Task Learning Framework to Leverage Text Data for Speech to Text Tasks. ICASSP 2021. [PDF]
Orthros: Non-autoregressive End-to-end Speech Translation with Dual-decoder. ICASSP 2021. [PDF]
Cascaded Models With Cyclic Feedback For Direct Speech Translation. ICASSP 2021. [PDF]
Jointly Trained Transformers models for Spoken Language Translation. ICASSP 2021. [PDF]
Efficient Use of End-to-end Data in Spoken Language Processing. ICASSP 2021. [PDF]
CTC-based Compression for Direct Speech Translation. EACL 2021. [PDF]
Streaming Models for Joint Speech Recognition and Translation. EACL 2021. [PDF]
mintzai-ST: Corpus and Baselines for Basque-Spanish Speech Translation. IberSPEECH 2021. [PDF]
Consecutive Decoding for Speech-to-text Translation. AAAI 2021. [PDF]
UWSpeech: Speech to Speech Translation for Unwritten Languages. AAAI 2021. [PDF]
“Listen, Understand and Translate”: Triple Supervision Decouples End-to-end Speech-to-text Translation. AAAI 2021. [PDF]
Tight Integrated End-to-End Training for Cascaded Speech Translation. SLT 2021. [PDF]
Transformer-based Direct Speech-to-speech Translation with Transcoder. SLT 2021. [PDF]

2022

Modeling Dual Read/Write Paths for Simultaneous Machine Translation. ACL 2022. [PDF] [Code]
Reducing Position Bias in Simultaneous Machine Translation with Length-Aware Framework. ACL 2022. [PDF]
From Simultaneous to Streaming Machine Translation by Leveraging Streaming History. ACL 2022. [PDF]
Learning When to Translate for Streaming Speech. ACL 2022. [PDF] [Code]
Learning Adaptive Segmentation Policy for End-to-End Simultaneous Translation. ACL 2022. [PDF]
Gaussian Multi-head Attention for Simultaneous Machine Translation. ACL 2022 Findings. [PDF] [Code]
Sample, Translate, Recombine: Leveraging Audio Alignments for Data Augmentation in End-to-end Speech Translation. ACL 2022. [PDF]
UniST: Unified End-to-end Model for Streaming and Non-streaming Speech Translation. ACL 2022. [PDF]
Direct speech-to-speech translation with discrete units. ACL 2022. [PDF]
STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation. ACL 2022. [PDF]
End-to-End Speech Translation for Code Switched Speech. ACL 2022 Findings. [PDF]
Language Model Augmented Monotonic Attention for Simultaneous Translation. NAACL 2022. [PDF]
Textless Speech-to-Speech Translation on Real Data. NAACL 2022. [PDF]
Information-Transport-based Policy for Simultaneous Translation. EMNLP 2022. [PDF] [Code]
Wait-info Policy: Balancing Source and Target at Information Level for Simultaneous Machine Translation. EMNLP 2022 Findings. [PDF] [Code]
Turning Fixed to Adaptive: Integrating Post-Evaluation into Simultaneous Machine Translation. EMNLP 2022 Findings. [PDF] [Code]
Does Simultaneous Speech Translation need Simultaneous Models? EMNLP 2022 Findings. [PDF] [Code]
RedApt: An Adaptor for WAV2VEC 2 Encoding Faster and Smaller Speech Translation without Quality Compromise. EMNLP 2022 Findings. [PDF]
Revisiting End-to-End Speech-to-Text Translation From Scratch. ICML 2022. [PDF]
Translatotron 2: Robust direct speech-to-speech translation. ICML 2022. [PDF]
Exploring Continuous Integrate-and-Fire for Adaptive Simultaneous Speech Translation. InterSpeech 2022. [PDF] [Code]
Blockwise Streaming Transformer for Spoken Language Understanding and Simultaneous Speech Translation. InterSpeech 2022. [PDF]
Multilingual Simultaneous Speech Translation. InterSpeech 2022. [PDF] [Code]
From Start to Finish: Latency Reduction Strategies for Incremental Speech Synthesis in Simultaneous Speech-to-Speech Translation. InterSpeech 2022. [PDF] [Code]
Combining Spectral and Self-Supervised Features for Low Resource Speech Recognition and Translation. InterSpeech 2022. [PDF]
Large-Scale Streaming End-to-End Speech Translation with Neural Transducers. InterSpeech 2022. [PDF]
Speech Segmentation Optimization using Segmented Bilingual Speech Corpus for End-to-end Speech Translation. InterSpeech 2022. [PDF]
Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation. InterSpeech 2022. [PDF]
SHAS: Approaching optimal Segmentation for End-to-End Speech Translation. InterSpeech 2022. [PDF]
M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation. InterSpeech 2022. [PDF]
Supervised Visual Attention for Simultaneous Multimodal Machine Translation. JAIR 2022. [PDF]
CVSS Corpus and Massively Multilingual Speech-to-Speech Translation. LREC 2022. [PDF]
LibriS2S: A German-English Speech-to-Speech Translation Corpus. LREC 2022. [PDF]
Tackling data scarcity in speech translation using zero-shot multilingual machine translation techniques. ICASSP 2022. [PDF]
Regularizing End-to-End Speech Translation with Triangular Decomposition Agreement. AAAI 2022. [PDF]
Improving data augmentation for low resource speech-to-text translation with diverse paraphrasing. Neural Networks 2022. [PDF]
Comprehension of Subtitles from Re-Translating Simultaneous Speech Translation. Arxiv 2022. [PDF]
Data-Driven Adaptive Simultaneous Machine Translation. Arxiv 2022. [PDF]
Simultaneous Translation for Unsegmented Input: A Sliding Window Approach. Arxiv 2022. [PDF]
MT Metrics Correlate with Human Ratings of Simultaneous Speech Translation (Technical Report). Arxiv 2022. [PDF]
Attention as a guide for Simultaneous Speech Translation. Arxiv 2022. [PDF]
AdaTranS: Adapting with Boundary-based Shrinking for End-to-End Speech Translation. Arxiv 2022. [PDF]
Direct Speech-to-speech Translation without Textual Annotation using Bottleneck Features. Arxiv 2022. [PDF]
ArzEn-ST: A Three-way Speech Translation Corpus for Code-Switched Egyptian Arabic - English. Arxiv 2022. [PDF]
Prabhupadavani: A Code-mixed Speech Translation Data for 25 Languages. Arxiv 2022. [PDF]

2023

Tuning Large language model for End-to-end Speech Translation. Arxiv 2023. [PDF]
Improving Speech Translation by Cross-Modal Multi-Grained Contrastive Learning. Arxiv 2023. [PDF]
Multilingual Speech-to-Speech Translation into Multiple Target Languages. Arxiv 2023. [PDF]
MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition. ICCV 2023. [PDF]
MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation. InterSpeech 2023. [PDF]
Modular Speech-to-Text Translation for Zero-Shot Cross-Modal Transfer. InterSpeech 2023. [PDF]
Joint Speech Translation and Named Entity Recognition. InterSpeech 2023. [PDF]
StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation. InterSpeech 2023. [PDF]
Knowledge Distillation on Joint Task End-to-End Speech Translation. InterSpeech 2023. [PDF]
GigaST: A 10,000-hour Pseudo Speech Translation Corpus. InterSpeech 2023. [PDF]
Inter-connection: Effective Connection between Pre-trained Encoder and Decoder for Speech Translation. InterSpeech 2023. [PDF]
HK-LegiCoST: Leveraging Non-Verbatim Transcripts for Speech Translation. InterSpeech 2023. [PDF]
Pre-training for Speech Translation: CTC Meets Optimal Transport. ICML 2023. [PDF]
UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units. ACL 2023. [PDF]
Simple and effective unsupervised speech translation. ACL 2023. [PDF]
BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric. ACL 2023. [PDF]
SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations. ACL 2023. [PDF]
Understanding and Bridging the Modality Gap for Speech Translation. ACL 2023. [PDF]
Back Translation for Speech-to-text Translation Without Transcripts. ACL 2023. [PDF]
AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation. ACL 2023. [PDF]
WACO: Word-Aligned Contrastive Learning for Speech Translation. ACL 2023. [PDF]
Speech-to-Speech Translation for a Real-world Unwritten Language. ACL 2023 Findings. [PDF]
CKDST: Comprehensively and Effectively Distill Knowledge from Machine Translation to End-to-End Speech Translation. ACL 2023 Findings. [PDF]
Duplex Diffusion Models Improve Speech-to-Speech Translation. ACL 2023 Findings. [PDF]
DUB: Discrete Unit Back-translation for Speech Translation. ACL 2023 Findings. [PDF]
Joint Speech Transcription and Translation: Pseudo-Labeling with Out-of-Distribution Data. ACL 2023 Findings. [PDF]
Textless Direct Speech-to-Speech Translation with Discrete Speech Representation. ICASSP 2023. [PDF]
M3ST: Mix at Three Levels for Speech Translation. ICASSP 2023. [PDF]
Generating Synthetic Speech from SpokenVocab for Speech Translation. EACL 2023 Findings. [PDF]
Improving End-to-end Speech Translation by Leveraging Auxiliary Speech and Text Data. AAAI 2023. [PDF]
Improving Simultaneous Machine Translation with Monolingual Data. AAAI 2023. [PDF] [Code]
Hidden Markov Transformer for Simultaneous Machine Translation. ICLR 2023. [PDF] [Code]
Rethinking the Reasonability of the Test Set for Simultaneous Machine Translation. ICASSP 2023. [PDF]
LEAPT: Learning Adaptive Prefix-to-prefix Translation For Simultaneous Machine Translation. ICASSP 2023. [PDF]
Hybrid Transducer and Attention based Encoder-Decoder Modeling for Speech-to-Text Tasks. ACL 2023. [PDF] [Code]
Learning Optimal Policy for Simultaneous Machine Translation via Binary Search. ACL 2023. [PDF] [Code]
Better Simultaneous Translation with Monotonic Knowledge Distillation. ACL 2023. [PDF] [Code]
Attention as a Guide for Simultaneous Speech Translation. ACL 2023. [PDF] [Code]
End-to-End Simultaneous Speech Translation with Differentiable Segmentation. ACL 2023 Findings. [PDF] [Code]
Implicit Memory Transformer for Computationally Efficient Simultaneous Speech Translation. ACL 2023 Findings. [PDF] [Code]
Japanese-to-English Simultaneous Dubbing Prototype. ACL 2023 Demo. [PDF]
AlignAtt: Using Attention-based Audio-Translation Alignments as a Guide for Simultaneous Speech Translation. InterSpeech 2023. [PDF]
Learning When to Speak: Latency and Quality Trade-offs for Simultaneous Speech-to-Speech Translation with Offline Models. InterSpeech 2023. [PDF][Code]
LAMASSU: A Streaming Language-Agnostic Multilingual Speech Recognition and Translation Model Using Neural Transducers. InterSpeech 2023. [PDF]
Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff. InterSpeech 2023. [PDF]
Shiftable Context: Addressing Training-Inference Context Mismatch in Simultaneous Speech Translation. ICML 2023. [PDF][Code]
Non-autoregressive Streaming Transformer for Simultaneous Translation. EMNLP 2023. [PDF][Code]
Adaptive Policy with Wait-k Model for Simultaneous Translation. EMNLP 2023. [PDF][Code]
Simultaneous Machine Translation with Tailored Reference. EMNLP 2023 Findings. [PDF][Code]
Enhanced Simultaneous Machine Translation with Word-level Policies. EMNLP 2023 Findings. [PDF][Code]
Long-form Simultaneous Speech Translation: Thesis Proposal. AACL 2023. [PDF]
Improving Stability in Simultaneous Speech Translation: A Revision-Controllable Decoding Approach. ASRU 2023. [PDF]
Average Token Delay: A Latency Metric for Simultaneous Translation. Arxiv 2023. [PDF]
Adapting Offline Speech Translation Models for Streaming with Future-Aware Distillation and Inference. Arxiv 2023. [PDF]
End-to-End Evaluation for Low-Latency Simultaneous Speech Translation. Arxiv 2023. [PDF][Code]
Simultaneous Machine Translation with Large Language Models. Arxiv 2023. [PDF]
CBSiMT: Mitigating Hallucination in Simultaneous Machine Translation with Weighted Prefix-to-Prefix Training. Arxiv 2023. [PDF]
Context Consistency between Training and Testing in Simultaneous Machine Translation. Arxiv 2023. [PDF][Code]
Seamless: Multilingual Expressive and Streaming Speech Translation. Arxiv 2023. [PDF][Code]
Unified Segment-to-Segment Framework for Simultaneous Sequence Generation. NeurIPS 2023. [PDF][Code]
Efficient Monotonic Multihead Attention. Arxiv 2023. [PDF][Code]

2024

StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning. ACL 2024. [PDF][Code][Project]
Decoder-only Streaming Transformer for Simultaneous Translation. ACL 2024. [PDF][Code]
A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Any Translation. ACL 2024. [Code][Project]
Self-Modifying State Modeling for Simultaneous Machine Translation. ACL 2024. [PDF][Code]
Simul-LLM: A Framework for Exploring High-Quality Simultaneous Translation with Large Language Models. Arxiv 2023. [PDF][Code]
Glancing Future for Simultaneous Machine Translation. ICASSP 2024. [PDF][Code]
Language Model is a Branch Predictor for Simultaneous Machine Translation. ICASSP 2024. [PDF][Code]
R-BI: Regularized Batched Inputs enhance Incremental Decoding Framework for Low-Latency Simultaneous Speech Translation. Arxiv 2024. [PDF]
SimulTron: On-Device Simultaneous Speech to Speech Translation. Arxiv 2024. [PDF]
Recent Advances in End-to-End Simultaneous Speech Translation. Arxiv 2024. [PDF]
Simultaneous Masking, Not Prompting Optimization: A Paradigm Shift in Fine-tuning LLMs for Simultaneous Translation. Arxiv 2024. [PDF]
SiLLM: Large Language Models for Simultaneous Machine Translation. Arxiv 2024. [PDF][Code]
Conversational SimulMT: Efficient Simultaneous Translation with Large Language Models. Arxiv 2024. [PDF]
TransLLaMa: LLM-based Simultaneous Translation System. Arxiv 2024. [PDF]

2025

Rethinking Cascaded Speech-to-Text Translation. Arxiv 2025. [PDF]
High-Fidelity Simultaneous Speech-To-Speech Translation. Arxiv 2025. [PDF]
SpeechT: Findings of the First Mentorship in Speech Translation. Arxiv 2025. [PDF]
InfiniSST: Simultaneous Translation of Unbounded Speech with Large Language Model. Arxiv 2025. [PDF]
Direct Speech to Speech Translation: A Review. Arxiv 2025. [PDF]
Joint Training And Decoding for Multilingual End-to-End Simultaneous Speech Translation. Arxiv 2025. [PDF]
AdaST: Dynamically Adapting Encoder States in the Decoder for End-to-End Speech-to-Text Translation. Arxiv 2025. [PDF]
Efficient and Adaptive Simultaneous Speech Translation with Fully Unidirectional Architecture. Arxiv 2025. [PDF]
SimulS2S-LLM: Unlocking Simultaneous Inference of Speech LLMs. Arxiv 2025. [PDF]
Large Model Empowered Streaming Semantic Communications for Speech Translation. Arxiv 2025. [PDF]
Speech Translation Refinement using Large Language Models. Arxiv 2025. [PDF]
A Unit-based System and Dataset for Expressive Direct Speech-to-Speech Translation. Arxiv 2025. [PDF]
Bemba Speech Translation: Exploring a Low-Resource African Language. Arxiv 2025. [PDF]
Language translation, and change of accent for speech-to-speech translation. Arxiv 2025. [PDF]
Improve Speech Translation Through Text Rewrite. COLING Industry 2025. [PDF]

Tutorial

INTERSPEECH 2019 survey talk: Spoken Language Translation
ACL 2020 Theme paper: Speech Translation and the End-to-End Promise: Taking Stock of Where We Are
EACL 2021 tutorial: Speech Translation
Blog: Getting Started with End-to-End Speech Translation

Workshops

IWSLT 2020

ON-TRAC Consortium for End-to-End and Simultaneous Speech Translation Challenge Tasks at IWSLT 2020. [PDF]
Start-Before-End and End-to-End: Neural Speech Translation by AppTek and RWTH Aachen University. [PDF]
KIT’s IWSLT 2020 SLT Translation System. [PDF]
End-to-End Simultaneous Translation System for IWSLT2020 Using Modality Agnostic Meta-Learning. [PDF]
ELITR Non-Native Speech Translation at IWSLT 2020. [PDF]
Re-translation versus Streaming for Simultaneous Translation. [PDF]
Towards Stream Translation: Adaptive Computation Time for Simultaneous Machine Translation. [PDF]
Neural Simultaneous Speech Translation Using Alignment-Based Chunking. [PDF]

AutoSimTrans 2020

Dynamic Sentence Boundary Detection for Simultaneous Translation. [PDF]

ASLTRW 2021

Operating a Complex SLT System with Speakers and Human Interpreters. [PDF]
Simultaneous Speech Translation for Live Subtitling: from Delay to Display. [PDF]

IWSLT 2021

The USTC-NELSLIP Systems for Simultaneous Speech Translation Task at IWSLT 2021. [PDF]
NAIST English-to-Japanese Simultaneous Translation System for IWSLT 2021 Simultaneous Text-to-text Task. [PDF]
The University of Edinburgh’s Submission to the IWSLT21 Simultaneous Translation Task. [PDF]
Without Further Ado: Direct and Simultaneous Speech Translation by AppTek in 2021. [PDF]
The Volctrans Neural Speech Translation System for IWSLT 2021. [PDF]
Large-Scale English-Japanese Simultaneous Interpretation Corpus: Construction and Analyses with Sentence-Aligned Data. [PDF]
Towards the evaluation of automatic simultaneous speech translation from a communicative perspective. [PDF]
Tag Assisted Neural Machine Translation of Film Subtitles. [PDF]

AutoSimTrans 2021

ICT’s System for AutoSimTrans 2021: Robust Char-Level Simultaneous Translation. [PDF]
BIT’s system for AutoSimulTrans2021. [PDF]
XMU’s Simultaneous Translation System at NAACL 2021. [PDF]
System Description on Automatic Simultaneous Translation Workshop. [PDF]
BSTC: A Large-Scale Chinese-English Speech Translation Dataset. [PDF]

IWSLT 2022

Simultaneous Neural Machine Translation with Prefix Alignment. [PDF]
Anticipation-Free Training for Simultaneous Machine Translation. [PDF]
The AISP-SJTU Simultaneous Translation System for IWSLT 2022. [PDF]
The Xiaomi Text-to-Text Simultaneous Speech Translation System for IWSLT 2022. [PDF]
The HW-TSC’s Simultaneous Speech Translation System for IWSLT 2022 Evaluation. [PDF]
MLLP-VRAIN UPV systems for the IWSLT 2022 Simultaneous Speech Translation and Speech-to-Speech Translation tasks. [PDF]
CUNI-KIT System for Simultaneous Speech Translation Task at IWSLT 2022. [PDF]
NAIST Simultaneous Speech-to-Text Translation System for IWSLT 2022. [PDF]

AutoSimTrans 2022

Over-Generation Cannot Be Rewarded: Length-Adaptive Average Lagging for Simultaneous Speech Translation. [PDF]
System Description on Automatic Simultaneous Translation Workshop. [PDF]
System Description on Third Automatic Simultaneous Translation Workshop. [PDF]
End-to-End Simultaneous Speech Translation with Pretraining and Distillation: Huawei Noah’s System for AutoSimTranS 2022. [PDF]
BIT-Xiaomi’s System for AutoSimTrans 2022. [PDF]
USST’s System for AutoSimTrans 2022. [PDF]

IWSLT 2023

Direct Models for Simultaneous Translation and Automatic Subtitling: FBK@IWSLT2023. [PDF]
MT Metrics Correlate with Human Ratings of Simultaneous Speech Translation. [PDF]
CMU’s IWSLT 2023 Simultaneous Speech Translation System. [PDF]
NAIST Simultaneous Speech-to-speech Translation System for IWSLT 2023. [PDF]
Language Model Based Target Token Importance Rescaling for Simultaneous Neural Machine Translation. [PDF]
Tagged End-to-End Simultaneous Speech Translation Training Using Simultaneous Interpretation Data. [PDF]
The HW-TSC’s Simultaneous Speech-to-Text Translation System for IWSLT 2023 Evaluation. [PDF]
The HW-TSC’s Simultaneous Speech-to-Speech Translation System for IWSLT 2023 Evaluation. [PDF]
Towards Efficient Simultaneous Speech Translation: CUNI-KIT System for Simultaneous Track at IWSLT 2023. [PDF]
The Xiaomi AI Lab’s Speech Translation Systems for IWSLT 2023 Offline Task, Simultaneous Task and Speech-to-Speech Task. [PDF]