Abstract: This paper introduces KeSpeech, an open-source speech dataset that covers the most widely spoken Chinese dialect and its 8 subdialects, with 1,542 hours of audio recorded across 34 cities in China.
Specifically, the dataset provides multiple kinds of supervision, such as transcriptions, speaker information and subdialect type, to support a variety of speech tasks, including speech recognition, speaker recognition and subdialect identification, as well as multi-task learning and conditional learning across these tasks.
Moreover, some of the text samples were recorded in parallel in both standard Chinese and a subdialect, which makes it easy to evaluate subdialect style conversion. For most speakers there is also a certain interval between the two recording phases, which allows the study of time variance in speaker recognition.
We also describe the license, design and creation of the dataset. Based on baseline experiments in speech recognition, speaker verification, subdialect identification and voice conversion, we analyze the challenges involved in those tasks and the inspiration the dataset may provide.
Target Speaker
We select a speaker (ID 1024426) who speaks Southwestern Mandarin (Chengdu) as our target speaker. The ground truth consists of speech samples recorded from that speaker. In the subsequent sections, we convert source utterances in different subdialects to the target speaker's voice, so that the target speaker appears to speak in different subdialects.
Ground Truth
关系到整个民族的素质和道德情操的提高
产品相关特性与剧情设定较为吻合
就足以把它的储备消耗得所剩无几
订单量达到九千多台哦
女交警冯爽正在岗台上指挥交通
音乐成为他生命中的另一种灵感来源
Source utterances from speaker A
Speaker A (ID 101763) can speak Zhongyuan Mandarin (Zhengzhou). We convert both his Mandarin and his Zhongyuan Mandarin utterances to the target speaker's voice. All converted audio samples are shown below.
1. Mandarin
Source audio
Converted audio
陈增新说他打电话让对方拿回去
交易中的盈利需要行情的配合
那么从发明家脑袋里蹦出来的东西就只能
泡椒和泡姜再加上能能的大块的兔肉
主要目的就是避免区县的同质化竞争
有关于他与汽车的信息却少之又少
这就和她挑水果是一个道理
远看就如同一座宝塔矗立在山谷之中
结果让我做三百个俯卧撑
武装分子携带爆炸物和枪支袭击了酒店
2. Zhongyuan Mandarin (Zhengzhou)
Source audio
Converted audio
这就和她挑水果是一个道理
自愿嫁给了她压根瞧不上的男人
武装分子携带爆炸物和枪支袭击了酒店
交易中的盈利需要行情的分配
那么从发明家脑袋里蹦出来的东西就只能
有关他与汽车的信息却少之又少
陈增新说他打电话让对方拿回去
泡椒和泡姜再加上能能的大块的兔肉
远看就如同一座宝塔矗立在山谷之中
主要目的就是避免区县的同质化竞争
Source utterances from speaker B
Speaker B (ID 1007609) can speak Jiao-Liao Mandarin (Qingdao). We convert both her Mandarin and her Jiao-Liao Mandarin utterances to the target speaker's voice. All converted audio samples are shown below.