Paper: Variable-rate discrete representation learning (arXiv:2103.06089)
Authors: Sander Dieleman, Charlie Nash, Jesse Engel, Karen Simonyan
Abstract: Semantically meaningful information content in perceptual signals is usually unevenly distributed. In speech signals for example, there are often many silences, and the speed of pronunciation can vary considerably. In this work, we propose slow autoencoders (SlowAEs) for unsupervised learning of high-level variable-rate discrete representations of sequences, and apply them to speech. We show that the resulting event-based representations automatically grow or shrink depending on the density of salient information in the input signals, while still allowing for faithful signal reconstruction. We develop run-length Transformers (RLTs) for event-based representation modelling and use them to construct language models in the speech domain, which are able to generate grammatical and semantically coherent utterances and continuations.
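To make the event-based framing concrete: once the slow codes are quantised they are piecewise constant, and an "event" is a point where a code value changes. Below is a minimal sketch of how an average event rate (AER, used in the tables that follow) could be measured; whether changes are counted per channel or per timestep is our assumption, as are all names.

```python
# Illustrative sketch: measure the average event rate (AER) of a quantised
# code sequence. An "event" is here taken to be any value change in any
# channel; the paper's exact counting convention may differ.
import numpy as np

def average_event_rate(codes: np.ndarray, code_rate_hz: float) -> float:
    """codes: int array of shape (T, C), quantised codes sampled at code_rate_hz.
    Returns value changes per second, summed over all C channels."""
    n_events = int(np.sum(np.diff(codes, axis=0) != 0))
    duration_s = codes.shape[0] / code_rate_hz
    return n_events / duration_s
```

During silences the codes barely change and the AER drops; dense, fast speech produces more events. This is the sense in which the representation grows and shrinks with the density of salient information.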
Prompts (audio embedded on the original page):

Prompt | Transcription
---|---
Prompt 1 | Typical of tree frogs, it has toe pads which enable it to climb smooth vertical [...]
Prompt 2 | WestJet Encore flight 511 on December 24, departs [...]
Prompt 3 | Alright, so you have it for later, I've added pencils to your shopping list.
Prompt 4 | Some websites have appropriated lists of royalty, and have reprinted the na[mes ...]
Phoneme accuracy of each model; audio reconstructions of Prompts 1–4 for every row are embedded on the original page. The three starred SlowAE rows share the same default configuration (RT = 75 Hz; C = 4, k = 7; group-sparse slowness), listed once in each comparison; the starred VQ-VAE row is the fixed-rate baseline.

Model | Phoneme accuracy
---|---
**SlowAE models with fixed λ** |
λ = 0.1 | 86.87%
λ = 0.3 | 88.65%
λ = 1.0 | 89.82%
λ = 3.0 | 87.87%
λ = 10.0 | 87.37%
λ = 30.0 | 80.89%
λ = 100.0 | 29.28%
λ = 300.0 | 5.55%
λ = 1000.0 | 3.73%
**SlowAE models with target average event rates (AERs) RT** |
RT = 35 Hz | 64.95%
RT = 45 Hz | 71.09%
RT = 55 Hz | 77.34%
RT = 65 Hz | 79.52%
*RT = 75 Hz | 81.21%
RT = 85 Hz | 84.19%
RT = 95 Hz | 84.98%
RT = 105 Hz | 86.30%
RT = 115 Hz | 85.29%
**SlowAE models with varying number of channels (C) and quantisation levels (2k + 1), RT = 75 Hz** |
C = 1, k = 1 | 3.62%
C = 2, k = 1 | 3.60%
C = 4, k = 1 | 3.50%
C = 8, k = 1 | 3.74%
C = 1, k = 3 | 49.12%
C = 2, k = 3 | 3.59%
C = 4, k = 3 | 81.91%
C = 8, k = 3 | 80.66%
C = 1, k = 7 | 50.71%
C = 2, k = 7 | 76.12%
*C = 4, k = 7 | 81.21%
C = 8, k = 7 | 81.60%
C = 1, k = 10 | 49.25%
C = 2, k = 10 | 76.91%
C = 4, k = 10 | 80.72%
C = 8, k = 10 | 83.12%
**SlowAE models with different slowness penalties (RT = 75 Hz)** |
*Group-sparse slowness | 81.21%
L1 slowness | 80.10%
L2 slowness | 80.84%
L4 slowness | 78.83%
**VQ-VAE model (fixed-rate baseline)** |
*VQ-VAE (62.5 Hz) | 81.75%
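The last group of SlowAE rows above compares different slowness penalties on the code sequence. As a rough sketch of what each one measures, assuming the penalties act on per-step differences of the (continuous, pre-quantisation) codes; function and argument names are ours, and the exact forms and normalisation are defined in the paper:

```python
# Illustrative sketch of the slowness penalties compared above, applied to
# per-step differences of the slow codes. Names and normalisation are our
# assumptions, not taken from the paper's code.
import numpy as np

def slowness_penalty(z: np.ndarray, kind: str = "group_sparse") -> float:
    """z: float array of shape (T, C) holding slow code values over time."""
    dz = np.diff(z, axis=0)  # per-step changes, shape (T - 1, C)
    if kind == "l1":
        return float(np.mean(np.abs(dz)))  # favours sparse per-channel changes
    if kind == "l2":
        return float(np.mean(dz ** 2))     # favours small, smooth changes
    if kind == "l4":
        return float(np.mean(dz ** 4))     # penalises large jumps most heavily
    if kind == "group_sparse":
        # L2 across channels, L1 across time: changes become sparse in time
        # but tend to co-occur across channels, yielding event-like codes.
        return float(np.mean(np.sqrt(np.sum(dz ** 2, axis=1))))
    raise ValueError(f"unknown penalty kind: {kind!r}")
```

The penalty enters the training loss weighted by λ, which is what the first group of rows sweeps: larger λ forces slower codes, and from λ = 30.0 onwards the accuracy figures show intelligibility collapsing.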
Samples for each of Prompts 1–4 (audio embedded on the original page): eight from a Transformer trained on VQ-VAE code sequences, and eight from a run-length Transformer (RLT) trained on SlowAE code sequences.
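The RLT exploits the piecewise-constant structure of SlowAE codes by modelling each channel as (value, run-length) pairs rather than one token per timestep. A minimal sketch of that encoding (an illustrative helper, not the paper's implementation):

```python
# Illustrative run-length encoding of one quantised code channel into
# (value, run_length) events, the kind of pairs a run-length Transformer
# can model instead of one token per timestep.
import numpy as np

def run_length_encode(channel: np.ndarray) -> list[tuple[int, int]]:
    """channel: 1-D int array of quantised codes for a single channel."""
    events = []
    start = 0
    for t in range(1, len(channel) + 1):
        # Close the current run at the end of the array or on a value change.
        if t == len(channel) or channel[t] != channel[start]:
            events.append((int(channel[start]), t - start))
            start = t
    return events
```

For example, `run_length_encode(np.array([0, 0, 0, 2, 2, -1]))` gives `[(0, 3), (2, 2), (-1, 1)]`, so long constant stretches such as silences shrink to a single event.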
Sixteen further samples for each of Prompts 1–4 (audio embedded on the original page).
Samples for individual books, four per book (audio embedded on the original page):

- 'The Truth about Jesus', M. M. Mangasarian (LibriVox #76)
- 'Jane Eyre', Charlotte Brontë (LibriVox #133)
- 'Hamlet', William Shakespeare (LibriVox #346)
- 'Jungle Book', Rudyard Kipling (LibriVox #662)
- 'Dracula', Bram Stoker (LibriVox #10747)
- 'Fairy Stories my Children Love Best of All', Edgar Shimer (LibriVox #13372)
Page template borrowed from Wave-Tacotron.