Common Voice and Accent Choice: Data Contributors Self-Describe Their Spoken Accents in Diverse Ways

EasyChair Preprint 9678
14 pages • Date: February 7, 2023

Abstract

The number of people using speech technologies, such as automatic speech recognition (ASR), powered by machine learning (ML), has increased exponentially in recent years. Datasets used as inputs for training speech models often represent demographic features of the speaker, such as their gender, age, and accent. Often, those demographic axes are used to evaluate the training set and the resultant model for bias and fairness. Here, we first examine voice datasets to identify how accents are currently represented. We then analyse the speaker-described accent entries in Mozilla's Common Voice v11 dataset using a force-directed graph data visualisation. From this we formulate an emergent taxonomy of accent descriptors, of pragmatic use in accent bias detection. We find that accents are currently represented in ways that are geographically and, predominantly, nationally bound. More diverse representations are identified in the Common Voice dataset. This work provides early evidence for re-thinking how accents are represented in voice data, particularly where intended for use in building or evaluating ML-based speech technologies. Our tooling is open-sourced to aid replication and impact.

Keyphrases: accent bias, bias, bias corpora, data visualization, dataset documentation, datasets, voice data
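The abstract describes analysing speaker-described accent entries with a force-directed graph visualisation. The sketch below is a minimal, hypothetical illustration of that kind of analysis, not the authors' released tooling: it assumes a Common Voice metadata TSV with a free-text `accents` column containing comma-separated descriptors (column name and format are assumptions), and uses networkx's spring layout as the force-directed algorithm.

```python
# Hypothetical sketch: visualise co-occurrence of self-described accent
# descriptors from a Common Voice metadata TSV using a force-directed
# (spring) layout. The "accents" column name and comma-separated format
# are assumptions about the dataset export, not confirmed details.
import csv
from collections import Counter
from itertools import combinations

import networkx as nx
import matplotlib.pyplot as plt


def build_accent_graph(tsv_path: str, accent_column: str = "accents") -> nx.Graph:
    """Nodes are accent descriptors; edges link descriptors that appear
    together in one speaker's self-description."""
    node_counts = Counter()
    edge_counts = Counter()
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            raw = (row.get(accent_column) or "").strip()
            if not raw:
                continue
            # Split the free-text entry into individual descriptors.
            descriptors = sorted({d.strip().lower() for d in raw.split(",") if d.strip()})
            node_counts.update(descriptors)
            edge_counts.update(combinations(descriptors, 2))

    graph = nx.Graph()
    for desc, count in node_counts.items():
        graph.add_node(desc, weight=count)
    for (a, b), count in edge_counts.items():
        graph.add_edge(a, b, weight=count)
    return graph


if __name__ == "__main__":
    g = build_accent_graph("validated.tsv")  # path is illustrative
    pos = nx.spring_layout(g, seed=42)       # force-directed node placement
    sizes = [20 + 5 * g.nodes[n]["weight"] for n in g.nodes]
    nx.draw_networkx(g, pos, node_size=sizes, font_size=6, edge_color="lightgrey")
    plt.axis("off")
    plt.show()
```

Sizing nodes by descriptor frequency and letting the spring layout pull co-occurring descriptors together is one plausible way to surface clusters of accent self-descriptions; the paper's actual open-sourced tooling may differ.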