A Theoretical Framework for Acoustic Neighbor Embeddings
Summary
Proposes a theoretical framework for 'Acoustic Neighbor Embeddings', which represent the speech content of variable-width audio or text in a fixed-dimensional embedding space.
Key Points
- Proposes a probabilistic interpretation of the distances between embeddings, based on a definition of phonetic similarity between speech signals.
- Reduces these distances to simple Euclidean distances through an approximation of uniform cluster-wise isotropy.
- Achieves isolated word classification accuracy on par with Finite State Transducers (FST) across a 500k vocabulary.
- Applicable to English dialect clustering and to predicting the confusion probability of wake-up words.
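
The Euclidean-distance reduction and nearest-neighbor classification mentioned above can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `classify_nearest`, the toy vocabulary, the 2-D embeddings, and the shared variance `sigma` are all hypothetical, and the softmax over negative squared distances is one simple way to turn distances into probabilities under an assumed isotropic Gaussian with equal variance for every word.

```python
import numpy as np

def classify_nearest(audio_emb, text_embs, vocab, sigma=1.0):
    """Pick the vocabulary word whose text embedding is closest to the audio embedding.

    Assumption (not from the source): each word's cluster is an isotropic
    Gaussian with the same variance sigma**2, so the posterior over words is
    a softmax over negative squared Euclidean distances.
    """
    # Squared Euclidean distance from the audio embedding to every text embedding.
    sq_dists = np.sum((text_embs - audio_emb) ** 2, axis=1)

    # Numerically stable softmax over -d^2 / (2 * sigma^2).
    logits = -sq_dists / (2.0 * sigma ** 2)
    logits = logits - logits.max()
    probs = np.exp(logits)
    probs = probs / probs.sum()

    best = int(np.argmin(sq_dists))
    return vocab[best], probs

# Toy data: hypothetical words and hand-picked 2-D embeddings.
vocab = ["hello", "yellow", "banana"]
text_embs = np.array([[0.0, 0.0], [0.5, 0.0], [5.0, 5.0]])
audio_emb = np.array([0.1, 0.0])

word, probs = classify_nearest(audio_emb, text_embs, vocab)
print(word, probs)
```

In this toy setup, the probability assigned to a phonetically close competitor ("yellow") stays well above that of a distant word ("banana"), which is the kind of quantity a wake-up word confusion estimate would build on. For a 500k vocabulary, the brute-force distance computation would normally be replaced by an approximate nearest-neighbor index.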
Notable Quotes & Details
- 500k vocabularies
- 0.5 percentage point difference compared to phone edit distances
Intended Audience
AI researchers and automated speech recognition (ASR) engineers