The AfricanVoices corpus is a speech corpus containing datasets of aligned sentences and audio for 11 languages. We have uploaded data for 12 different languages to this website so far. We obtain the datasets in three ways:


| Data_id | Lang code | Language | Source | Speaker | No. of sentences | Hrs | MCD* | Quality | Download |
|---|---|---|---|---|---|---|---|---|---|
| luo_opb | luo | Dholuo | Open.Bible | Male | 11263 | 12.43 | 5.20 | Good | Available |
| luo_afv | luo | Dholuo | AfricanVoices | Male | 1516 | 1.79 | 6.35 | Okay | Available |
| lug_cmv | lug | Luganda | CommonVoice | Male | 2942 | 4.52 | 6.59 | Okay | Available |
| en-ke_afv | en-ke | English (Kenyan) | AfricanVoices | Female | 596 | 0.56 | 4.99 | Good | Available |
| sxb_afv | sxb | Suba | AfricanVoices | Male | 1178 | 1.70 | 4.80 | Good | Available |
| sxb_bbi | sxb | Suba | | Male | 8917 | 18.76 | 5.40 | Good | **Unavailable |
| yor_opb | yor | Yoruba | Open.Bible | Male | 9275 | 18.04 | 4.61 | Good | Available |
| kik_opb | kik | Kikuyu | Open.Bible | Male | 10877 | 18.04 | 5.19 | Good | Available |
| wol_alf | wol | Wolof | ALFFA | Male | 1000 | 1.20 | 4.42 | Good | Available |
| swa_lst | swa | Kiswahili | LLSTI | Male | 426 | 0.53 | 5.06 | Good | Available |
| ibb_lst | ibb | Ibibio | LLSTI | Female | 125 | 0.32 | 4.91 | Good | Available |
| hau_cmv_m | hau | Hausa | CommonVoice | Male | 518 | 0.62 | 6.95 | Okay | Available |
| hau_cmv_f | hau | Hausa | CommonVoice | Female | 1938 | 2.30 | 6.29 | Okay | Available |
| fon_alf | fon | Fon | ALFFA | Male | 542 | 0.33 | 6.10 | Okay | Available |
| lin_opb | lin | Lingala | Open.Bible | Male | 12957 | 27.52 | 5.20 | Good | Available |

* "MCD is a distortion measure, comparing synthesized examples with originals. Smaller is better. For TTS, less than 5 is probably good, less than 6 is probably fine, greater than 6 is possibly bad (but still statistically useful), greater than 7 probably indicates something is wrong. For alignment, MCDs seem to be about 1 larger than for TTS (TTS only uses the best examples, and uses a much more complex prediction model)." Alan W Black
For data from Open.Bible and, the MCD reported is for alignment while for the rest the MCD reported is for TTS.
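
The rule of thumb quoted above can be sketched as a small helper. This is only an illustration, not how the Quality column was assigned: the function name `describe_mcd`, the `kind` parameter, the handling of the boundary values (exactly 6 and exactly 7), and the choice to subtract 1 from alignment MCDs before applying the TTS bands are all assumptions made for this sketch.

```python
def describe_mcd(mcd: float, kind: str = "tts") -> str:
    """Map an MCD score to the rule-of-thumb wording quoted in the footnote.

    kind is "tts" or "alignment" (hypothetical parameter for this sketch).
    """
    if kind not in ("tts", "alignment"):
        raise ValueError("kind must be 'tts' or 'alignment'")

    # Assumption: alignment MCDs run about 1 higher than TTS MCDs,
    # so shift them down by 1 before applying the TTS bands.
    adjusted = mcd - 1.0 if kind == "alignment" else mcd

    if adjusted < 5:
        return "probably good"
    if adjusted < 6:
        return "probably fine"
    # Assumption: the quote does not say where exactly 6 or 7 belong;
    # here both fall in the "possibly bad" band.
    if adjusted <= 7:
        return "possibly bad (but still statistically useful)"
    return "probably indicates something is wrong"


# Example: a TTS score and an alignment score taken from the table.
print(describe_mcd(4.42))               # wol_alf, TTS
print(describe_mcd(5.20, "alignment"))  # luo_opb, alignment
```

The bands are deliberately loose, so the output will not always match the Quality column, which may reflect other judgments as well.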

** Suba data is unavailable because the license doesn't allow us to redistribute it. You can download it from the Faith Comes by Hearing website for your personal use.