Common Voice and Accent Choice: Data Contributors Self-Describe Their Spoken Accents in Diverse Ways

EasyChair Preprint 9678
14 pages • Date: February 7, 2023

Abstract

The number of people using speech technologies, such as automatic speech recognition (ASR), powered by machine learning (ML), has increased exponentially in recent years. Datasets used as inputs for training speech models often represent demographic features of the speaker, such as their gender, age, and accent. Often, those demographic axes are used to evaluate the training set and the resultant model for bias and fairness. Here, we first examine voice datasets to identify how accents are currently represented. We then analyse the speaker-described accent entries in Mozilla's Common Voice v11 dataset using a force-directed graph data visualisation. From this we formulate an emergent taxonomy of accent descriptors, of pragmatic use in accent bias detection. We find that accents are currently represented in ways that are geographically and, predominantly, nationally bound. More diverse representations are identified in the Common Voice dataset. This work provides early evidence for re-thinking how accents are represented in voice data, particularly where intended for use in building or evaluating ML-based speech technologies. Our tooling is open-sourced to aid replication and impact.

Keyphrases: accent bias, bias, bias corpora, data visualization, dataset documentation, datasets, voice data
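The abstract describes analysing speaker-described accent entries with a force-directed graph visualisation. The sketch below is a minimal, hypothetical illustration of that kind of analysis, not the authors' released tooling: it assumes a Common Voice metadata TSV with a free-text `accents` column containing comma-separated descriptors (column name and format are assumptions), and uses networkx's spring layout as the force-directed algorithm.

```python
# Hypothetical sketch: visualise co-occurrence of self-described accent
# descriptors from a Common Voice metadata TSV using a force-directed
# (spring) layout. The "accents" column name and comma-separated format
# are assumptions about the dataset export, not confirmed details.
import csv
from collections import Counter
from itertools import combinations

import networkx as nx
import matplotlib.pyplot as plt


def build_accent_graph(tsv_path: str, accent_column: str = "accents") -> nx.Graph:
    """Nodes are accent descriptors; edges link descriptors that appear
    together in one speaker's self-description."""
    node_counts = Counter()
    edge_counts = Counter()
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            raw = (row.get(accent_column) or "").strip()
            if not raw:
                continue
            # Split the free-text entry into individual descriptors.
            descriptors = sorted({d.strip().lower() for d in raw.split(",") if d.strip()})
            node_counts.update(descriptors)
            edge_counts.update(combinations(descriptors, 2))

    graph = nx.Graph()
    for desc, count in node_counts.items():
        graph.add_node(desc, weight=count)
    for (a, b), count in edge_counts.items():
        graph.add_edge(a, b, weight=count)
    return graph


if __name__ == "__main__":
    g = build_accent_graph("validated.tsv")  # path is illustrative
    pos = nx.spring_layout(g, seed=42)       # force-directed node placement
    sizes = [20 + 5 * g.nodes[n]["weight"] for n in g.nodes]
    nx.draw_networkx(g, pos, node_size=sizes, font_size=6, edge_color="lightgrey")
    plt.axis("off")
    plt.show()
```

Sizing nodes by descriptor frequency and letting the spring layout pull co-occurring descriptors together is one plausible way to surface clusters of accent self-descriptions; the paper's actual open-sourced tooling may differ.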