Fast-slow Transformer For Visually Grounding Speech

Puyuan Peng, David Harwath. ICASSP 2022 – 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) – 16 citations

[Paper]
Tags: Datasets, Evaluation, ICASSP, Image Retrieval

We present Fast-Slow Transformer for Visually Grounding Speech, or FaST-VGS. FaST-VGS is a Transformer-based model for learning the associations between raw speech waveforms and visual images. The model unifies dual-encoder and cross-attention architectures into a single model, reaping the superior retrieval speed of the former along with the accuracy of the latter. FaST-VGS achieves state-of-the-art speech-image retrieval accuracy on benchmark datasets, and its learned representations exhibit strong performance on the ZeroSpeech 2021 phonetic and semantic tasks.
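The two-stage retrieval idea described above — a fast dual-encoder pass to shortlist candidates, followed by a slower joint scorer to rerank them — can be sketched as follows. This is an illustrative outline only, not the paper's implementation: the embedding dimensions, function names, and the plain dot-product `joint_score` (standing in for cross-attention over speech and image token sequences) are all assumptions.

```python
import math
import random

random.seed(0)

def normalize(v):
    """Scale a vector to unit length so dot products act as cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical precomputed embeddings (sizes are illustrative, not from the paper):
# in a dual-encoder setup, image embeddings can be computed once and cached.
dim, n_images = 16, 200
image_embs = [normalize([random.gauss(0, 1) for _ in range(dim)])
              for _ in range(n_images)]
speech_emb = normalize([random.gauss(0, 1) for _ in range(dim)])

def fast_retrieve(query, candidates, k=10):
    """Stage 1 ("fast"): dual-encoder scoring, one dot product per candidate."""
    order = sorted(range(len(candidates)),
                   key=lambda i: dot(query, candidates[i]), reverse=True)
    return order[:k]

def slow_rerank(query, candidates, shortlist):
    """Stage 2 ("slow"): re-score only the shortlist with a costlier joint model."""
    def joint_score(q, c):
        # Placeholder: a cross-attention Transformer would jointly attend
        # over speech and image tokens here instead of a dot product.
        return dot(q, c)
    return sorted(shortlist,
                  key=lambda i: joint_score(query, candidates[i]), reverse=True)

shortlist = fast_retrieve(speech_emb, image_embs, k=10)
ranking = slow_rerank(speech_emb, image_embs, shortlist)
```

The design point is that the expensive joint scorer runs on only `k` candidates rather than the whole collection, which is how the model keeps the dual encoder's retrieval speed while recovering the cross-attention model's accuracy on the final ranking.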
