Audio samples from "Variable-rate discrete representation learning"

Paper: arXiv

Authors: Sander Dieleman, Charlie Nash, Jesse Engel, Karen Simonyan

Abstract: Semantically meaningful information content in perceptual signals is usually unevenly distributed. In speech signals for example, there are often many silences, and the speed of pronunciation can vary considerably. In this work, we propose slow autoencoders (SlowAEs) for unsupervised learning of high-level variable-rate discrete representations of sequences, and apply them to speech. We show that the resulting event-based representations automatically grow or shrink depending on the density of salient information in the input signals, while still allowing for faithful signal reconstruction. We develop run-length Transformers (RLTs) for event-based representation modelling and use them to construct language models in the speech domain, which are able to generate grammatical and semantically coherent utterances and continuations.

Contents

1. Prompts
2. Reconstructions
3. Comparison: VQ-VAE & SlowAE
4. 2.4B model: completions
5. 2.4B model: unconditional re-ranked samples
6. 2.4B model: book-conditional samples

Below are four 4-second clips of recorded speech, which we use both to compare reconstruction quality across different discrete autoencoder models and as prompts for our generative models.
Audio | Transcription
Prompt 1 | Typical of tree frogs, it has toe pads which enable it to climb smooth vertical [...]
Prompt 2 | WestJet Encore flight 511 on December 24, departs [...]
Prompt 3 | Alright, so you have it for later, I've added pencils to your shopping list.
Prompt 4 | Some websites have appropriated lists of royalty, and have reprinted the na[mes ...]

2. Reconstructions

We sample reconstructions for each prompt clip using different discrete autoencoder models (see Figure 6 in the paper). We report phoneme accuracy as a proxy for intelligibility. The models we used for further experiments are indicated with an asterisk.
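For illustration, phoneme accuracy can be computed as one minus the length-normalised edit distance between the reference phoneme sequence and the one recognised from the reconstruction. The sketch below is our own illustration of that definition, not the paper's implementation (the actual ASR setup is described in §6.3.2); `ref` and `hyp` are assumed to be lists of phoneme symbols.

```python
import numpy as np

def phoneme_accuracy(ref, hyp):
    """1 - normalised Levenshtein distance between the reference phoneme
    sequence `ref` and the recognised sequence `hyp` (lists of symbols).
    Illustrative only; the paper's exact metric may differ."""
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)  # deletions
    d[0, :] = np.arange(len(hyp) + 1)  # insertions
    for i, r in enumerate(ref, 1):
        for j, h in enumerate(hyp, 1):
            d[i, j] = min(d[i - 1, j] + 1,              # deletion
                          d[i, j - 1] + 1,              # insertion
                          d[i - 1, j - 1] + (r != h))   # substitution
    return 1.0 - d[len(ref), len(hyp)] / max(len(ref), 1)
```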
Model | Phoneme accuracy | Prompt 1 | Prompt 2 | Prompt 3 | Prompt 4
SlowAE models with fixed λ
λ = 0.1 | 86.87%
λ = 0.3 | 88.65%
λ = 1.0 | 89.82%
λ = 3.0 | 87.87%
λ = 10.0 | 87.37%
λ = 30.0 | 80.89%
λ = 100.0 | 29.28%
λ = 300.0 | 5.55%
λ = 1000.0 | 3.73%
SlowAE models with target average event rates (AERs) RT
RT = 35 Hz | 64.95%
RT = 45 Hz | 71.09%
RT = 55 Hz | 77.34%
RT = 65 Hz | 79.52%
*RT = 75 Hz | 81.21%
RT = 85 Hz | 84.19%
RT = 95 Hz | 84.98%
RT = 105 Hz | 86.30%
RT = 115 Hz | 85.29%
SlowAE models with varying number of channels (C) and quantisation levels (2k + 1), RT = 75 Hz
C = 1, k = 1 | 3.62%
C = 2, k = 1 | 3.60%
C = 4, k = 1 | 3.50%
C = 8, k = 1 | 3.74%
C = 1, k = 3 | 49.12%
C = 2, k = 3 | 3.59%
C = 4, k = 3 | 81.91%
C = 8, k = 3 | 80.66%
C = 1, k = 7 | 50.71%
C = 2, k = 7 | 76.12%
*C = 4, k = 7 | 81.21%
C = 8, k = 7 | 81.60%
C = 1, k = 10 | 49.25%
C = 2, k = 10 | 76.91%
C = 4, k = 10 | 80.72%
C = 8, k = 10 | 83.12%
SlowAE models with different slowness penalties (RT = 75 Hz)
*Group-sparse slowness | 81.21%
L1 slowness | 80.10%
L2 slowness | 80.84%
L4 slowness | 78.83%
VQ-VAE model (fixed-rate baseline)
*VQ-VAE (62.5 Hz) | 81.75%
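The λ values above weight the slowness penalty in the SlowAE loss, and the AER-targeted models adapt λ automatically to hit a target average event rate. As a rough sketch of the quantities involved (a simplification, not the paper's implementation), the penalties below act on first-order temporal differences of the encoder output, and `update_lambda` is a hypothetical proportional controller standing in for the paper's mechanism:

```python
import numpy as np

def slowness_penalty(z, kind="group_sparse", p=2.0):
    """Slowness penalty on encodings z of shape [T, C] (time, channels).
    All variants act on the temporal differences dz[t] = z[t+1] - z[t],
    so constant segments cost nothing and the event rate can adapt to
    the signal. Normalisation details are assumptions."""
    dz = np.diff(z, axis=0)  # [T-1, C]
    if kind == "group_sparse":
        # L2 across channels, L1 over time: encourages all channels to
        # change at the same time steps (shared, sparse "events").
        return np.sqrt((dz ** 2).sum(axis=1)).mean()
    if kind == "lp":  # the L1 / L2 / L4 variants from the table above
        return (np.abs(dz) ** p).mean()
    raise ValueError(f"unknown penalty: {kind}")

def quantise(z, k):
    """Round to the 2k + 1 integer levels {-k, ..., k} (forward pass
    only; training would use a straight-through estimator)."""
    return np.clip(np.round(z), -k, k)

def update_lambda(lam, measured_aer, target_aer, step=0.01):
    """Hypothetical multiplicative controller: raise λ when the measured
    average event rate exceeds the target RT, lower it otherwise."""
    return lam * np.exp(step * (measured_aer - target_aer) / target_aer)
```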

3. Comparison: VQ-VAE & SlowAE

We compare variable-rate and fixed-rate discrete representations by training Transformer models with approximately 1 billion parameters, and sampling completions for the given prompts (sequence length 512).
Prompt 1 | Prompt 2 | Prompt 3 | Prompt 4
Samples from a Transformer trained on VQ-VAE code sequences
sample 1
sample 2
sample 3
sample 4
sample 5
sample 6
sample 7
sample 8
Samples from a run-length Transformer (RLT) trained on SlowAE code sequences
sample 1
sample 2
sample 3
sample 4
sample 5
sample 6
sample 7
sample 8
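Because SlowAE codes are piecewise constant, the RLT models them as (value, run-length) events rather than one token per timestep. Below is a minimal sketch of that encoding for a single integer code channel; the actual tokenisation, with multiple channels and bounded run lengths, is more involved and described in the paper.

```python
import numpy as np

def run_length_encode(codes):
    """Turn a non-empty piecewise-constant integer sequence (shape [T])
    into (value, run_length) events,
    e.g. [0, 0, 0, 2, 2] -> [(0, 3), (2, 2)]."""
    codes = np.asarray(codes)
    change = np.flatnonzero(np.diff(codes)) + 1  # indices where the value changes
    starts = np.concatenate([[0], change])       # start index of each run
    lengths = np.diff(np.concatenate([starts, [len(codes)]]))
    return list(zip(codes[starts].tolist(), lengths.tolist()))
```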

4. 2.4B model: completions

We sample completions for the given prompts from a Transformer with approximately 2.4 billion parameters (sequence length 512). This model is conditioned on book identity embeddings, but we include an additional "unconditional" embedding which is used 10% of the time during training, so that the model is also able to generate samples without book identity conditioning. We use this embedding to produce these completions.
Prompt 1 | Prompt 2 | Prompt 3 | Prompt 4
sample 1
sample 2
sample 3
sample 4
sample 5
sample 6
sample 7
sample 8
sample 9
sample 10
sample 11
sample 12
sample 13
sample 14
sample 15
sample 16
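The 10% "unconditional" scheme described above amounts to randomly replacing the book identity with a shared placeholder during training. A minimal sketch, assuming integer book ids and a reserved id 0 for the unconditional embedding (both are assumptions; the embedding lookup itself is assumed to live elsewhere):

```python
import numpy as np

UNCOND_ID = 0  # hypothetical reserved index for the "unconditional" embedding

def maybe_drop_condition(book_ids, p_uncond=0.1, rng=None):
    """With probability p_uncond per example, replace the book identity
    with UNCOND_ID, so the trained model can also sample unconditionally."""
    rng = rng or np.random.default_rng()
    book_ids = np.asarray(book_ids)
    drop = rng.random(book_ids.shape) < p_uncond
    return np.where(drop, UNCOND_ID, book_ids)
```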

5. 2.4B model: unconditional re-ranked samples

We draw samples from the 2.4 billion parameter model without prompts (using the "unconditional" embedding). This is a much harder task, so we draw 512 samples and select the 32 that produce the lowest phoneme prediction entropy under our ASR model (see §6.3.2 in the paper).
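Concretely, the re-ranking criterion is the average per-frame entropy of the ASR model's phoneme posteriors: lower entropy suggests more confident, speech-like output. A sketch under the assumption that the ASR model exposes per-frame phoneme logits of shape [T, P]; the hook `asr_logits_fn` is hypothetical.

```python
import numpy as np

def mean_phoneme_entropy(logits):
    """Average per-frame entropy of phoneme posteriors, computed from
    raw logits of shape [T, P] (frames, phoneme classes)."""
    logp = logits - np.logaddexp.reduce(logits, axis=-1, keepdims=True)
    return float(-(np.exp(logp) * logp).sum(axis=-1).mean())

def rerank(samples, asr_logits_fn, keep=32):
    """Keep the `keep` samples with the lowest mean phoneme entropy."""
    return sorted(samples, key=lambda s: mean_phoneme_entropy(asr_logits_fn(s)))[:keep]
```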

6. 2.4B model: book-conditional samples

We draw samples from the 2.4 billion parameter model conditioned on book embeddings. We selected a few books, drew 16 samples per book, and then cherry-picked four samples for each. Data quality (and hence, sample quality) varies considerably across books.
Book (LibriVox ID) | Sample 1 | Sample 2 | Sample 3 | Sample 4
'The Truth about Jesus', M. M. Mangasarian (#76)
'Jane Eyre', Charlotte Brontë (#133)
'Hamlet', William Shakespeare (#346)
'Jungle Book', Rudyard Kipling (#662)
'Dracula', Bram Stoker (#10747)
'Fairy Stories my Children Love Best of All', Edgar Shimer (#13372)

Page template borrowed from Wave-Tacotron.