LLM2D
Multi-Source Knowledge Pruning for Retrieval-Augmented Generation: A Benchmark and Empirical Study
Authors: Shuo Yu (State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, China), Mingyue Cheng (State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, China), Jiqian Yang (State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, China), Jie Ouyang (State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, China), Yucong Luo (State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, China), Chenyi Lei (Kuaishou Technology, Beijing, China), Qi Liu (State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, China), Enhong Chen (State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, China)
Published: 2/18/2025
arXiv ID: oai:arXiv.org:2409.13694v3

Abstract

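Retrieval-augmented generation (RAG) is increasingly recognized as an effective way to mitigate hallucination in large language models (LLMs) by integrating external knowledge. Despite considerable effort, most studies focus on a single type of external knowledge source. In contrast, most real-world applications draw on diverse knowledge from multiple sources, a scenario that remains underexplored. The main obstacles are the lack of a suitable dataset covering multiple knowledge sources and the lack of preliminary exploration of the associated issues. To address these challenges, we standardize a benchmark dataset that combines structured and unstructured knowledge across diverse and complementary domains. Based on this dataset, we identify the limitations of existing methods under these conditions. We therefore develop PruningRAG, a plug-and-play RAG framework that uses multi-granularity pruning strategies to incorporate relevant context more effectively and to mitigate the negative impact of misleading information. Extensive experimental results demonstrate the strong performance of PruningRAG, and we also report our findings from the study. Our dataset and code are publicly available at https://github.com/USTCAGI/PruningRAG.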
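
The abstract does not say how the multi-granularity pruning is implemented. As a rough illustration only, the sketch below shows one possible reading: a coarse step that drops whole knowledge sources, followed by a fine step that keeps only the best-scoring chunks. The function names, the token-overlap scorer, the threshold, and the `top_k` parameter are all hypothetical stand-ins and are not taken from the paper or its repository.

```python
# Illustrative two-stage pruning sketch. All names and the scoring rule
# are assumptions made for this example, not PruningRAG's actual code.

def tokens(text):
    """Lowercased whitespace tokens as a set."""
    return set(text.lower().split())


def overlap(query, text):
    """Fraction of query tokens that also appear in the text."""
    q, t = tokens(query), tokens(text)
    return len(q & t) / len(q) if q else 0.0


def prune(query, sources, source_threshold=0.5, top_k=2):
    """Coarse step: drop sources whose best chunk scores below the threshold.
    Fine step: keep the top_k highest-scoring chunks among surviving sources."""
    kept = []
    for name, chunks in sources.items():
        scored = [(overlap(query, chunk), chunk) for chunk in chunks]
        if not scored or max(score for score, _ in scored) < source_threshold:
            continue  # whole source judged irrelevant (coarse pruning)
        kept.extend((score, name, chunk) for score, chunk in scored)
    kept.sort(key=lambda item: item[0], reverse=True)
    # Fine pruning: top_k chunks, skipping any with zero overlap.
    return [(name, chunk) for score, name, chunk in kept[:top_k] if score > 0]


sources = {
    "web": ["the capital of france is paris", "paris hosts many museums"],
    "api": ["stock price of acme is 12"],
}
context = prune("capital of france", sources)
print(context)
```

In this toy run the `api` source is dropped at the coarse step and the unrelated `web` chunk is dropped at the fine step, so only the chunk that matches the query would be passed to the generator. A real system would presumably replace the token-overlap scorer with a learned retriever or an LLM-based relevance judgment.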