Abstract: Transformers are widely used in natural language processing and computer vision, and Bidirectional Encoder Representations from Transformers (BERT) is one of the most popular pre-trained ...
Abstract: Using a vision-inspired keyword spotting framework, we propose an architecture with input-dependent dynamic depth capable of processing streaming audio. Specifically, we extend a conformer ...