LLM2D
Multi-Source Knowledge Pruning for Retrieval-Augmented Generation: A Benchmark and Empirical Study
Authors: Shuo Yu (State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, China), Mingyue Cheng (State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, China), Jiqian Yang (State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, China), Jie Ouyang (State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, China), Yucong Luo (State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, China), Chenyi Lei (Kuaishou Technology, Beijing, China), Qi Liu (State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, China), Enhong Chen (State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, China)
Published: 11/28/2024
arXiv ID: oai:arXiv.org:2409.13694v2

Abstract

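Retrieval-augmented generation (RAG) is increasingly recognized as an effective way to mitigate hallucination in large language models (LLMs) by integrating external knowledge. Despite numerous efforts, most studies focus on a single type of external knowledge source. Real-world applications, however, usually involve diverse knowledge drawn from multiple sources, and this setting has received little attention. The main obstacle is the lack of a suitable dataset that contains multiple knowledge sources, together with a lack of preliminary exploration of the associated issues. To address these challenges, we standardize a benchmark dataset that combines structured and unstructured knowledge across diverse, complementary domains. Based on this dataset, we further develop PruningRAG, a plug-and-play RAG framework whose main feature is a multi-granularity pruning strategy that optimizes the integration of relevant information while minimizing misleading context. Using the standardized dataset and PruningRAG, we also report a series of experimental results and insightful findings. Our dataset and code are publicly available, with the aim of advancing future research in the RAG community.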