Audio samples from "KeSpeech: An Open Source Speech Dataset of Mandarin and Its Eight Subdialects"

Authors: Zhiyuan Tang, Dong Wang, Yanguang Xu, Jianwei Sun, Xiaoning Lei, Shuaijiang Zhao, Cheng Wen, Xingjun Tan, Chuandong Xie, Shuran Zhou, Rui Yan, Chenjia Lv, Yang Han, Wei Zou, Xiangang Li

Abstract: This paper introduces an open source speech dataset, KeSpeech, which involves the most spoken Chinese dialect and its 8 subdialects across 34 cities in China with 1,542 hours of audio. Specifically, the dataset provides multiple supervisions like transcription, speaker information and subdialect type to support a variety of speech tasks, such as speech recognition, speaker recognition, subdialect identification and their multi-task learning and conditional learning on each other. Moreover, some of the text samples were parallel recorded with both standard Chinese and a subdialect, allowing for easily evaluating subdialect style conversion, and there is certain interval between the two phases of recording for most of the speakers, allowing for the study of time variance in speaker recognition. Also, we describe the license, design and creation of the dataset, and based on some baseline experiments including speech recognition, speaker verification, subdialect identification and voice conversion, we do some analysis to show the challenges involved in those tasks and inspirations the dataset may provide.

Target Speaker

We select a speaker (ID 1024426) of Southwestern Mandarin (Chengdu) as the target speaker. The ground-truth samples below are recordings of that speaker. In the subsequent sections, we convert source utterances in different subdialects to the target speaker's voice, so that the target speaker appears to speak those subdialects.

Ground Truth

关系到整个民族的素质和道德情操的提高

产品相关特性与剧情设定较为吻合

就足以把它的储备消耗得所剩无几

订单量达到九千多台哦

女交警冯爽正在岗台上指挥交通

音乐成为他生命中的另一种灵感来源

Source utterances from speaker A

Speaker A (ID 101763) speaks Zhongyuan Mandarin (Zhengzhou). We convert both his Mandarin and his Zhongyuan Mandarin utterances to the target speaker's voice. All converted audio samples are shown below.

1. Mandarin

Source audio Converted audio

陈增新说他打电话让对方拿回去

交易中的盈利需要行情的配合

那么从发明家脑袋里蹦出来的东西就只能

泡椒和泡姜再加上能能的大块的兔肉

主要目的就是避免区县的同质化竞争

有关于他与汽车的信息却少之又少

这就和她挑水果是一个道理

远看就如同一座宝塔矗立在山谷之中

结果让我做三百个俯卧撑

武装分子携带爆炸物和枪支袭击了酒店

2. Zhongyuan Mandarin (Zhengzhou)

Source audio Converted audio

这就和她挑水果是一个道理

自愿嫁给了她压根瞧不上的男人

武装分子携带爆炸物和枪支袭击了酒店

交易中的盈利需要行情的分配

那么从发明家脑袋里蹦出来的东西就只能

有关他与汽车的信息却少之又少

陈增新说他打电话让对方拿回去

泡椒和泡姜再加上能能的大块的兔肉

远看就如同一座宝塔矗立在山谷之中

主要目的就是避免区县的同质化竞争

Source utterances from speaker B

Speaker B (ID 1007609) speaks Jiao-Liao Mandarin (Qingdao). We convert both her Mandarin and her Jiao-Liao Mandarin utterances to the target speaker's voice. All converted audio samples are shown below.

1. Mandarin

Source audio Converted audio

这是我国目前为数不多的国家级相关研究

中央企业间的横向整合基本完成

万科垦请监管部门对上述情况予以核查

水面上一块黑乎乎的物体映入眼帘

另一方面是一种厂商实力的验证手段

该网站是否安全或保护隐私

那时的我还没有意识到粗犷的真正含义

美国经济可能还没有准备好

白银最近的行情一直在下探

也祝我们青岛的明天更加美好

2. Jiao-Liao Mandarin (Qingdao)

Source audio Converted audio

旋即不少游客询问几款电动车的售价

一定要养成绿色生活的日常行为和习惯

在原唱者面前唱错歌词又说错歌名

几分钟后接单的司机小王到达现场

我们也不愿意再给他一个机会

批判地继承中国古代的传统文化和道德

这些学生大都有很强的独立学习能力

使得卖萌的戏份更少不了这位小美女

如果你父亲再满街追着你打

耶鲁大学管理学院金融学终身教授陈志武
