LLM2D
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
Authors: Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Mingrui Chen, Peng Liu, Ruihang Miao, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Gong, Zixin Zhang, Brian Li, Changyi Wan, Hanpeng Hu, Ranchen Ming, Song Yuan, Xuelin Zhang, Yu Zhou, Bingxin Li, Buyun Ma, Kang An, Wei Ji, Wen Li, Xuan Wen, Yuankai Ma, Yuanwei Liang, Yun Mou, Bahtiyar Ahmidi, Bin Wang, Bo Li, Changxin Miao, Chen Xu, Chengting Feng, Chenrun Wang, Dapeng Shi, Deshan Sun, Dingyuan Hu, Dula Sai, Enle Liu, Guanzhe Huang, Gulin Yan, Heng Wang, Haonan Jia, Haoyang Zhang, Jiahao Gong, Jianchang Wu, Jiahong Liu, Jianjian Sun, Jiangjie Zhen, Jie Feng, Jie Wu, Jiaoren Wu, Jie Yang, Jinguo Wang, Jingyang Zhang, Junzhe Lin, Kaixiang Li, Lei Xia, Li Zhou, Longlong Gu, Mei Chen, Menglin Wu, Ming Li, Mingxiao Li, Mingyao Liang, Na Wang, Nie Hao, Qiling Wu, Qinyuan Tan, Shaoliang Pang, Shiliang Yang, Shuli Gao, Siqi Liu, Sitong Liu, Tiancheng Cao, Tianyu Wang, Wenjin Deng, Wenqing He, Wen Sun, Xin Han, Xiaomin Deng, Xiaojia Liu, Xu Zhao, Yanan Wei, Yanbo Yu, Yang Cao, Yangguang Li, Yangzhen Ma, Yanming Xu, Yaqiang Shi, Yilei Wang, Yinmin Zhong, Yu Luo, Yuanwei Lu, Yuhe Yin, Yuting Yan, Yuxiang Yang, Zhe Xie, Zheng Ge, Zheng Sun, Zhewei Huang, Zhichao Chang, Zidong Yang, Zili Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Jiansheng Chen, Jing Li, Shuchang Zhou, Xiangyu Zhang, Xinhao Zhang, Yibo Zhu
Publication date: February 18, 2025
arXiv ID: oai:arXiv.org:2502.11946v1

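Abstract

Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as the high cost of voice data collection, weak dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multimodal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the lightweight, open-source Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine-grained control system that enables dynamic adjustment of dialects, emotions, singing, and rap; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. On our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in instruction following. On open-source benchmarks such as LLaMA Question, it shows a 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multimodal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.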