Paper: Variable-rate discrete representation learning (arXiv:2103.06089)
Authors: Sander Dieleman, Charlie Nash, Jesse Engel, Karen Simonyan
Abstract: Semantically meaningful information content in perceptual signals is usually unevenly distributed. In speech signals for example, there are often many silences, and the speed of pronunciation can vary considerably. In this work, we propose slow autoencoders (SlowAEs) for unsupervised learning of high-level variable-rate discrete representations of sequences, and apply them to speech. We show that the resulting event-based representations automatically grow or shrink depending on the density of salient information in the input signals, while still allowing for faithful signal reconstruction. We develop run-length Transformers (RLTs) for event-based representation modelling and use them to construct language models in the speech domain, which are able to generate grammatical and semantically coherent utterances and continuations.
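To make the event-based framing concrete: once the slow codes are quantised they are piecewise constant, and an "event" is a point where a code value changes. Below is a minimal sketch of how an average event rate (AER, used in the tables that follow) could be measured; whether changes are counted per channel or per timestep is our assumption, as are all names.

```python
# Illustrative sketch: measure the average event rate (AER) of a quantised
# code sequence. An "event" is here taken to be any value change in any
# channel; the paper's exact counting convention may differ.
import numpy as np

def average_event_rate(codes: np.ndarray, code_rate_hz: float) -> float:
    """codes: int array of shape (T, C), quantised codes sampled at code_rate_hz.
    Returns value changes per second, summed over all C channels."""
    n_events = int(np.sum(np.diff(codes, axis=0) != 0))
    duration_s = codes.shape[0] / code_rate_hz
    return n_events / duration_s
```

During silences the codes barely change and the AER drops; dense, fast speech produces more events. This is the sense in which the representation grows and shrinks with the density of salient information.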
Prompts (audio embedded on the original page):

Prompt | Transcription
---|---
Prompt 1 | Typical of tree frogs, it has toe pads which enable it to climb smooth vertical [...]
Prompt 2 | WestJet Encore flight 511 on December 24, departs [...]
Prompt 3 | Alright, so you have it for later, I've added pencils to your shopping list.
Prompt 4 | Some websites have appropriated lists of royalty, and have reprinted the na[mes ...]
Phoneme accuracy of each model; audio reconstructions of Prompts 1–4 for every row are embedded on the original page. The three starred SlowAE rows share the same default configuration (RT = 75 Hz; C = 4, k = 7; group-sparse slowness), listed once in each comparison; the starred VQ-VAE row is the fixed-rate baseline.

Model | Phoneme accuracy
---|---
**SlowAE models with fixed λ** |
λ = 0.1 | 86.87%
λ = 0.3 | 88.65%
λ = 1.0 | 89.82%
λ = 3.0 | 87.87%
λ = 10.0 | 87.37%
λ = 30.0 | 80.89%
λ = 100.0 | 29.28%
λ = 300.0 | 5.55%
λ = 1000.0 | 3.73%
**SlowAE models with target average event rates (AERs) RT** |
RT = 35 Hz | 64.95%
RT = 45 Hz | 71.09%
RT = 55 Hz | 77.34%
RT = 65 Hz | 79.52%
*RT = 75 Hz | 81.21%
RT = 85 Hz | 84.19%
RT = 95 Hz | 84.98%
RT = 105 Hz | 86.30%
RT = 115 Hz | 85.29%
**SlowAE models with varying number of channels (C) and quantisation levels (2k + 1), RT = 75 Hz** |
C = 1, k = 1 | 3.62%
C = 2, k = 1 | 3.60%
C = 4, k = 1 | 3.50%
C = 8, k = 1 | 3.74%
C = 1, k = 3 | 49.12%
C = 2, k = 3 | 3.59%
C = 4, k = 3 | 81.91%
C = 8, k = 3 | 80.66%
C = 1, k = 7 | 50.71%
C = 2, k = 7 | 76.12%
*C = 4, k = 7 | 81.21%
C = 8, k = 7 | 81.60%
C = 1, k = 10 | 49.25%
C = 2, k = 10 | 76.91%
C = 4, k = 10 | 80.72%
C = 8, k = 10 | 83.12%
**SlowAE models with different slowness penalties (RT = 75 Hz)** |
*Group-sparse slowness | 81.21%
L1 slowness | 80.10%
L2 slowness | 80.84%
L4 slowness | 78.83%
**VQ-VAE model (fixed-rate baseline)** |
*VQ-VAE (62.5 Hz) | 81.75%
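The last group of SlowAE rows above compares different slowness penalties on the code sequence. As a rough sketch of what each one measures, assuming the penalties act on per-step differences of the (continuous, pre-quantisation) codes; function and argument names are ours, and the exact forms and normalisation are defined in the paper:

```python
# Illustrative sketch of the slowness penalties compared above, applied to
# per-step differences of the slow codes. Names and normalisation are our
# assumptions, not taken from the paper's code.
import numpy as np

def slowness_penalty(z: np.ndarray, kind: str = "group_sparse") -> float:
    """z: float array of shape (T, C) holding slow code values over time."""
    dz = np.diff(z, axis=0)  # per-step changes, shape (T - 1, C)
    if kind == "l1":
        return float(np.mean(np.abs(dz)))  # favours sparse per-channel changes
    if kind == "l2":
        return float(np.mean(dz ** 2))     # favours small, smooth changes
    if kind == "l4":
        return float(np.mean(dz ** 4))     # penalises large jumps most heavily
    if kind == "group_sparse":
        # L2 across channels, L1 across time: changes become sparse in time
        # but tend to co-occur across channels, yielding event-like codes.
        return float(np.mean(np.sqrt(np.sum(dz ** 2, axis=1))))
    raise ValueError(f"unknown penalty kind: {kind!r}")
```

The penalty enters the training loss weighted by λ, which is what the first group of rows sweeps: larger λ forces slower codes, and from λ = 30.0 onwards the accuracy figures show intelligibility collapsing.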
Samples for each of Prompts 1–4 (audio embedded on the original page): eight from a Transformer trained on VQ-VAE code sequences, and eight from a run-length Transformer (RLT) trained on SlowAE code sequences.
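The RLT exploits the piecewise-constant structure of SlowAE codes by modelling each channel as (value, run-length) pairs rather than one token per timestep. A minimal sketch of that encoding (an illustrative helper, not the paper's implementation):

```python
# Illustrative run-length encoding of one quantised code channel into
# (value, run_length) events, the kind of pairs a run-length Transformer
# can model instead of one token per timestep.
import numpy as np

def run_length_encode(channel: np.ndarray) -> list[tuple[int, int]]:
    """channel: 1-D int array of quantised codes for a single channel."""
    events = []
    start = 0
    for t in range(1, len(channel) + 1):
        # Close the current run at the end of the array or on a value change.
        if t == len(channel) or channel[t] != channel[start]:
            events.append((int(channel[start]), t - start))
            start = t
    return events
```

For example, `run_length_encode(np.array([0, 0, 0, 2, 2, -1]))` gives `[(0, 3), (2, 2), (-1, 1)]`, so long constant stretches such as silences shrink to a single event.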
Sixteen further samples for each of Prompts 1–4 (audio embedded on the original page).
Samples for individual books, four per book (audio embedded on the original page):

- 'The Truth about Jesus', M. M. Mangasarian (LibriVox #76)
- 'Jane Eyre', Charlotte Brontë (LibriVox #133)
- 'Hamlet', William Shakespeare (LibriVox #346)
- 'Jungle Book', Rudyard Kipling (LibriVox #662)
- 'Dracula', Bram Stoker (LibriVox #10747)
- 'Fairy Stories my Children Love Best of All', Edgar Shimer (LibriVox #13372)
Page template borrowed from Wave-Tacotron.