Generation Optimization Techniques
A hands-on engineering guide to improving LLM generation quality: a complete optimization playbook from prompts to streaming output
Why Generation Quality Is the Last Mile of RAG
Even when retrieval returns a perfect context, the user experience still suffers if the LLM's answer is inaccurate, irrelevant, or poorly written. Generation optimization is the key step that takes a RAG system from "works" to "works well", and it spans prompt engineering, context management, streaming output, hallucination control, and more.
Background and Core Challenges
Unique Challenges of the RAG Generation Stage
Unlike general-purpose LLM chat, RAG generation has to handle:
- Context integration: getting the LLM to make full use of the retrieved documents
- Hallucination control: keeping the LLM from fabricating information that is not in the context
- Citation management: attributing answers to their sources to improve trustworthiness
- Streaming output: optimizing time-to-first-token and the overall experience
- Cost control: striking a balance between quality and cost
Key Optimization Dimensions
| Dimension | Problem | Optimization Direction |
|---|---|---|
| Prompt design | How to guide the LLM to answer from the context | Structured prompts, few-shot examples, persona/role setting |
| Context management | How to handle context that is too long or too short | Context compression, ordering, truncation strategies |
| Hallucination control | How to reduce fabricated information | Explicit instructions, citation requirements, self-verification |
| Streaming output | How to optimize the user experience | Async streaming, chunked transfer, UI feedback |
| Quality assurance | How to evaluate and improve generation quality | Automatic evaluation, human feedback, A/B testing |
Generation Approaches Across the Nine Projects
| Project | Prompt Engineering | Streaming | Citation Management | Hallucination Control | Maturity |
|---|---|---|---|---|---|
| LightRAG | Advanced (global/local prompts) | ✅ | Basic | Medium | ⭐⭐⭐⭐⭐ |
| onyx | Enterprise-grade (multi-level prompts) | ✅ Advanced | ✅ Complete | High | ⭐⭐⭐⭐⭐ |
| Self-Corrective-Agentic-RAG | Self-corrective prompts | ✅ | ✅ Verification loop | Very high | ⭐⭐⭐⭐⭐ |
| kotaemon | LlamaIndex integration | ✅ | ✅ | Medium | ⭐⭐⭐⭐ |
| Verba | Concise prompts | ✅ | ✅ | Medium | ⭐⭐⭐⭐ |
| ragflow | Multi-agent prompts | ✅ | ✅ Advanced | Medium-high | ⭐⭐⭐⭐ |
| SurfSense | Optimized for browser scenarios | ✅ | ✅ | Medium | ⭐⭐⭐ |
| UltraRAG | Basic prompts | ✅ | Basic | Low | ⭐⭐⭐ |
| RAG-Anything | Inherits from LightRAG | ✅ | Basic | Medium | ⭐⭐⭐⭐⭐ |
Key Insights
- Most innovative: Self-Corrective-Agentic-RAG's self-correction loop (detect hallucinations → re-retrieve → regenerate)
- Most complete: onyx's multi-level prompts (system/task/context separated into three layers) plus citation tracking
- Most practical: LightRAG's global/local dual-mode prompts (adapting to different query types)
- Trend: moving from single-pass generation toward self-verification and iterative refinement
In-Depth Comparison of Core Implementations
1. Self-Corrective-Agentic-RAG: Self-Correction Loop
Design idea: a closed loop of generate → detect hallucinations → re-retrieve → regenerate
class SelfCorrectiveGenerator:
"""
Self-corrective generator.
Flow:
1. Generate an initial answer from the context
2. Grade answer relevance (Relevance Grader)
3. Detect hallucinations (Hallucination Detector)
4. If unsatisfactory, re-retrieve better context
5. Generate again, up to N attempts
"""
def __init__(
self,
llm_client,
retriever,
max_iterations: int = 3,
relevance_threshold: float = 0.7,
enable_hallucination_check: bool = True
):
self.llm = llm_client
self.retriever = retriever
self.max_iterations = max_iterations
self.relevance_threshold = relevance_threshold
self.enable_hallucination_check = enable_hallucination_check
# Initialize the graders
self.relevance_grader = self._init_relevance_grader()
self.hallucination_detector = self._init_hallucination_detector()
async def generate_with_correction(
self,
query: str,
initial_context: list[str]
) -> dict:
"""
Self-corrective generation.
Returns:
{
"answer": str,
"iterations": int,
"corrections": list[dict],
"final_context": list[str]
}
"""
context = initial_context
corrections = []
for iteration in range(self.max_iterations):
# 1. Generate an answer
answer = await self._generate_answer(query, context)
# 2. Grade relevance
relevance_score = await self._grade_relevance(query, answer, context)
# 3. Detect hallucinations
has_hallucination = False
if self.enable_hallucination_check:
has_hallucination = await self._detect_hallucination(answer, context)
# Record this iteration
corrections.append({
"iteration": iteration + 1,
"relevance_score": relevance_score,
"has_hallucination": has_hallucination,
"answer_preview": answer[:100] + "..."
})
# 4. Decide whether a correction is needed
if relevance_score >= self.relevance_threshold and not has_hallucination:
# Quality is acceptable; return
return {
"answer": answer,
"iterations": iteration + 1,
"corrections": corrections,
"final_context": context,
"status": "success"
}
# 5. Re-retrieve better context
print(f"Iteration {iteration + 1}: Quality insufficient, re-retrieving...")
# Expand the query with keywords from the answer
expanded_query = self._expand_query(query, answer)
new_context = await self.retriever.retrieve(expanded_query, top_k=5)
# Merge and deduplicate contexts
context = self._merge_contexts(context, new_context)
# Still unsatisfactory after the maximum number of iterations; return the last result
return {
"answer": answer,
"iterations": self.max_iterations,
"corrections": corrections,
"final_context": context,
"status": "max_iterations_reached"
}
async def _generate_answer(self, query: str, context: list[str]) -> str:
"""Generate an answer (strictly grounded in the context)."""
prompt = self._build_generation_prompt(query, context)
response = await self.llm.chat.completions.create(
model="gpt-4",
messages=[
{
"role": "system",
"content": """You are a helpful assistant that answers questions STRICTLY based on the provided context.
Rules:
1. ONLY use information from the context
2. If the context doesn't contain the answer, say "I cannot answer based on the provided context"
3. Cite the context by using [Source N] notation
4. Do NOT make up information or use external knowledge"""
},
{
"role": "user",
"content": prompt
}
],
temperature=0.3 # Low temperature reduces creativity (and hallucinations)
)
return response.choices[0].message.content
def _build_generation_prompt(self, query: str, context: list[str]) -> str:
"""Build the generation prompt."""
context_str = "\n\n".join([
f"[Source {i+1}]\n{ctx}"
for i, ctx in enumerate(context)
])
return f"""Context:
{context_str}
Question: {query}
Instructions:
- Answer the question using ONLY the information from the context above
- Cite sources using [Source N] notation
- If the context is insufficient, explicitly state what's missing
- Be concise and precise
Answer:"""
async def _grade_relevance(
self,
query: str,
answer: str,
context: list[str]
) -> float:
"""Grade how relevant the answer is to the query (0-1)."""
prompt = f"""Grade the relevance of the answer to the question.
Question: {query}
Answer: {answer}
Is the answer relevant and helpful for the question?
Respond with a score from 0.0 (not relevant) to 1.0 (highly relevant).
Only output the score as a number."""
response = await self.llm.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
temperature=0
)
try:
score = float(response.choices[0].message.content.strip())
return max(0.0, min(1.0, score)) # Clamp to [0, 1]
except ValueError:
return 0.5 # Fall back to a neutral score
async def _detect_hallucination(self, answer: str, context: list[str]) -> bool:
"""Detect hallucinations (does the answer contain information absent from the context?)."""
context_str = "\n\n".join(context)
prompt = f"""Detect if the answer contains hallucinations (information not present in the context).
Context:
{context_str}
Answer:
{answer}
Does the answer contain information that is NOT in the context?
Respond with only "YES" or "NO"."""
response = await self.llm.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
temperature=0
)
result = response.choices[0].message.content.strip().upper()
return result == "YES"
def _expand_query(self, original_query: str, answer: str) -> str:
"""Expand the query with keywords extracted from the answer."""
# Simple approach: pull key terms out of the answer
# (the _extract_keywords helper is assumed here; production systems could use NER or a keyphrase-extraction model)
keywords = self._extract_keywords(answer)
return f"{original_query} {' '.join(keywords[:3])}"
def _merge_contexts(
self,
old_context: list[str],
new_context: list[str]
) -> list[str]:
"""Merge contexts and remove duplicates."""
# Deduplicate using a set of content hashes
seen = set()
merged = []
for ctx in old_context + new_context:
ctx_hash = hash(ctx)
if ctx_hash not in seen:
seen.add(ctx_hash)
merged.append(ctx)
return merged[:10] # Cap the number of context chunks
# Usage example
generator = SelfCorrectiveGenerator(
llm_client=openai.AsyncOpenAI(),
retriever=my_retriever,
max_iterations=3
)
result = await generator.generate_with_correction(
query="What is the capital of France?",
initial_context=retrieved_docs
)
print(f"Final answer (after {result['iterations']} iterations):")
print(result['answer'])
print(f"\nCorrection history: {result['corrections']}")
Core advantages:
- ✅ Automatically detects and corrects low-quality answers
- ✅ Substantially lowers the hallucination rate (experiments report a 40-60% reduction)
- ✅ Handles complex queries (multiple refinement iterations)
Cost considerations:
- ❌ Each iteration costs extra LLM calls (grading + detection + generation)
- ❌ Average latency increases 2-3x
- Recommendation: enable it only for high-value queries (e.g. customer support, medical consultation) — see the gating sketch below
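To make that recommendation concrete, below is a minimal gating sketch: only queries flagged as high-value go through the expensive self-correction loop, while everything else gets a single cheap pass. The keyword heuristic and the fallback prompt are illustrative assumptions, not part of the Self-Corrective-Agentic-RAG project.

```python
# Hedged sketch: route only high-value queries through SelfCorrectiveGenerator.
# HIGH_VALUE_KEYWORDS and the fallback prompt are assumptions for illustration.
HIGH_VALUE_KEYWORDS = {"refund", "diagnosis", "contract", "compliance", "dosage"}

def is_high_value(query: str) -> bool:
    """Cheap heuristic: flag queries whose wording suggests high stakes."""
    return any(kw in query.lower() for kw in HIGH_VALUE_KEYWORDS)

async def answer(query: str, context: list[str], corrective_generator, llm) -> str:
    if is_high_value(query):
        # Worth paying for grading, hallucination checks, and possible re-retrieval
        result = await corrective_generator.generate_with_correction(query, context)
        return result["answer"]
    # Single cheap pass for everything else
    response = await llm.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Answer strictly based on the provided context."},
            {"role": "user", "content": "\n\n".join(context) + f"\n\nQuestion: {query}"},
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content
```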
2. onyx: Enterprise-Grade Multi-Level Prompt Architecture
Design idea: system/task/context separated into three layers, plus full citation tracking
class PromptBuilder:
"""
onyx-style multi-level prompt builder.
Three-layer architecture:
1. System Prompt: role definition and global rules
2. Task Prompt: task description and format requirements
3. Context Prompt: retrieved document context
Advantages:
- Separation of concerns (easier to maintain)
- Supports dynamic switching (swap the Task Prompt per task)
- Full citation tracking (every chunk carries metadata)
"""
def __init__(self, persona_config: dict = None):
self.persona_config = persona_config or self._default_persona()
def _default_persona(self) -> dict:
"""Default persona configuration."""
return {
"name": "Knowledge Assistant",
"role": "A helpful assistant that provides accurate answers based on retrieved documents",
"traits": [
"Precise and factual",
"Cites sources",
"Admits when information is unavailable"
]
}
def build_prompt(
self,
query: str,
contexts: list[dict], # [{"content": str, "doc_id": str, "score": float}, ...]
task_type: str = "qa", # qa / summarize / compare / code
chat_history: list[dict] = None,
custom_instructions: str = None
) -> list[dict]:
"""
Build the complete multi-level prompt.
Returns:
messages in OpenAI format: [{"role": "system", "content": ...}, ...]
"""
messages = []
# 1. System prompt (role definition + global rules)
system_prompt = self._build_system_prompt(custom_instructions)
messages.append({
"role": "system",
"content": system_prompt
})
# 2. Chat history (if any)
if chat_history:
messages.extend(self._format_chat_history(chat_history))
# 3. Context prompt (retrieved documents)
context_prompt = self._build_context_prompt(contexts)
# 4. Task prompt (task description + user query)
task_prompt = self._build_task_prompt(query, task_type)
# 5. Combine context + task as the final user message
final_user_message = f"{context_prompt}\n\n---\n\n{task_prompt}"
messages.append({
"role": "user",
"content": final_user_message
})
return messages
def _build_system_prompt(self, custom_instructions: str = None) -> str:
"""Build the system prompt (layer 1)."""
base_prompt = f"""You are {self.persona_config['name']}, {self.persona_config['role']}.
Your traits:
{chr(10).join(f"- {trait}" for trait in self.persona_config['traits'])}
Core Guidelines:
1. ONLY use information from the provided context documents
2. ALWAYS cite sources using [doc_N] notation (e.g., "According to [doc_1], ...")
3. If the context doesn't contain sufficient information, explicitly state:
"Based on the available documents, I cannot fully answer this question. The documents do not contain information about [missing topic]."
4. Distinguish between facts (from documents) and inferences (clearly marked as such)
5. Be concise but complete - aim for clarity over length
6. If asked about something contradictory to the documents, prioritize the document information
Response Format:
- Start with a direct answer
- Provide supporting details with citations
- End with a summary if the answer is complex
Never:
- Make up information not in the documents
- Use external knowledge unless explicitly instructed
- Speculate without clearly marking it as inference"""
if custom_instructions:
base_prompt += f"\n\nAdditional Instructions:\n{custom_instructions}"
return base_prompt
def _build_context_prompt(self, contexts: list[dict]) -> str:
"""Build the context prompt (layer 2)."""
if not contexts:
return "No relevant documents were found for this query."
# Sort by relevance score
sorted_contexts = sorted(contexts, key=lambda x: x.get("score", 0), reverse=True)
# Format as numbered documents
context_lines = ["# Retrieved Documents\n"]
for i, ctx in enumerate(sorted_contexts, 1):
doc_id = ctx.get("doc_id", "unknown")
score = ctx.get("score", 0.0)
content = ctx["content"]
# Add metadata markers (for citation tracking)
context_lines.append(f"## [doc_{i}] (Document ID: {doc_id}, Relevance: {score:.2f})")
context_lines.append(content)
context_lines.append("") # Blank line as separator
return "\n".join(context_lines)
def _build_task_prompt(self, query: str, task_type: str) -> str:
"""Build the task prompt (layer 3)."""
task_instructions = {
"qa": "Answer the following question based on the documents above:",
"summarize": "Summarize the key information from the documents above regarding:",
"compare": "Compare and contrast the information from the documents above about:",
"code": "Provide a code example or explanation based on the documents above for:"
}
instruction = task_instructions.get(task_type, task_instructions["qa"])
return f"""# Task
{instruction}
**User Query:** {query}
**Instructions:**
- Cite each fact using [doc_N] notation
- If multiple documents mention the same fact, cite all: [doc_1, doc_3]
- Structure your answer with clear sections if needed
- Use markdown formatting for readability
**Answer:**"""
def _format_chat_history(self, chat_history: list[dict]) -> list[dict]:
"""Format chat history."""
formatted = []
for msg in chat_history[-10:]: # Keep the most recent 10 turns
formatted.append({
"role": msg["role"], # user / assistant
"content": msg["content"]
})
return formatted
# Usage example
builder = PromptBuilder(persona_config={
"name": "Technical Documentation Assistant",
"role": "An expert in explaining technical documentation",
"traits": ["Precise", "Educational", "Code-focused"]
})
# Build the prompt
messages = builder.build_prompt(
query="How do I implement rate limiting?",
contexts=[
{
"content": "Rate limiting can be implemented using token bucket algorithm...",
"doc_id": "doc_123",
"score": 0.92
},
{
"content": "Common rate limiting strategies include fixed window, sliding window...",
"doc_id": "doc_456",
"score": 0.85
}
],
task_type="code",
custom_instructions="Provide Python examples"
)
# Call the LLM (OpenAI v1-style client, consistent with the async client used above)
response = openai.OpenAI().chat.completions.create(
model="gpt-4",
messages=messages,
temperature=0.3
)
print(response.choices[0].message.content)
Enterprise-grade features:
- ✅ Layered design (easy to maintain and version)
- ✅ Full citation tracking (every fact is traceable to a source)
- ✅ Multi-task support (QA / summarization / comparison / code)
- ✅ Configurable persona (adaptable to different scenarios)
Best practices:
- Store the system prompt in a config file (enables A/B testing) — a sketch follows below
- Log the complete prompt of every generation (for reproducibility and debugging)
- Monitor citation quality (does every fact carry a citation?)
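As an illustration of the first practice, here is a minimal sketch of loading the system prompt from a version-controlled JSON file and deterministically assigning users to an A/B variant. The file name, schema, and hashing scheme are assumptions, not onyx APIs.

```python
# Hypothetical prompt-variant loader: system prompts live in a JSON config so they can be
# versioned and A/B tested without code changes.
import hashlib
import json

def load_prompt_variants(path: str = "prompts.json") -> dict[str, str]:
    """Assumed schema: {"variant_a": "<system prompt>", "variant_b": "<system prompt>"}."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def pick_variant(user_id: str, variants: dict[str, str]) -> tuple[str, str]:
    """Deterministically bucket a user so their experiment assignment stays stable."""
    names = sorted(variants)
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % len(names)
    return names[bucket], variants[names[bucket]]

# Usage: log the variant name alongside every generation so answer quality
# can later be compared per variant, e.g.
# name, system_prompt = pick_variant("user_42", load_prompt_variants())
```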
3. LightRAG: Global/Local Dual-Mode Prompts
Design idea: dynamically choose a global perspective or local detail depending on the query type
class LightRAGPromptManager:
"""
LightRAG-style dual-mode prompt manager.
Modes:
1. Global Mode: questions that need synthesis across documents ("overall trends", "comparisons")
2. Local Mode: questions that need details from a single document ("concrete steps", "code examples")
"""
def __init__(self):
self.global_prompt_template = self._init_global_template()
self.local_prompt_template = self._init_local_template()
def _init_global_template(self) -> str:
"""Global-mode template (emphasizes synthesis across documents)."""
return """# Role
You are a knowledge synthesis expert who excels at combining information from multiple sources to provide comprehensive answers.
# Context Documents
{context}
# Task
Answer the following question by SYNTHESIZING information across all provided documents:
{query}
# Instructions
1. **Cross-reference**: Compare and contrast information from different documents
2. **Identify patterns**: Look for common themes or trends
3. **Resolve conflicts**: If documents disagree, acknowledge and explain
4. **Comprehensive view**: Provide a holistic answer that leverages all sources
5. **Cite sources**: Use [doc_N] notation for each claim
# Output Format
- **Overview**: Start with a high-level summary
- **Key Points**: Break down into organized sections
- **Synthesis**: Connect insights from multiple documents
- **Citations**: Reference sources throughout
Answer:"""
def _init_local_template(self) -> str:
"""Local-mode template (emphasizes concrete details)."""
return """# Role
You are a detail-oriented assistant who provides precise, specific answers based on exact information from documents.
# Context Documents
{context}
# Task
Answer the following question using SPECIFIC details from the documents:
{query}
# Instructions
1. **Precision**: Quote exact phrases or numbers when relevant
2. **Step-by-step**: If explaining a process, break it down clearly
3. **Examples**: Include concrete examples from the documents
4. **Context**: Provide enough context for standalone understanding
5. **Cite sources**: Use [doc_N] notation for each fact
# Output Format
- **Direct Answer**: Start with the specific answer
- **Supporting Details**: Provide concrete evidence from documents
- **Examples**: Include relevant examples or code snippets
- **Citations**: Reference sources throughout
Answer:"""
def select_mode(self, query: str) -> str:
"""
Select a mode based on the query type.
Global-mode trigger words:
- trend, compare, overall, evolution, impact, summary
- Chinese equivalents: 趋势, 对比, 总结, 综合, 整体, 发展, 变化, 影响
Local-mode trigger words:
- how to, step, code, example, specific, detail
- Chinese equivalents: 如何, 步骤, 代码, 示例, 具体, 详细
"""
query_lower = query.lower()
global_keywords = [
"trend", "compare", "overall", "evolution", "impact",
"summary", "趋势", "对比", "整体", "发展", "影响", "总结"
]
local_keywords = [
"how", "step", "code", "example", "specific", "detail",
"如何", "步骤", "代码", "示例", "具体", "详细"
]
global_score = sum(1 for kw in global_keywords if kw in query_lower)
local_score = sum(1 for kw in local_keywords if kw in query_lower)
# Default to local mode (safer; avoids over-generalizing)
return "global" if global_score > local_score else "local"
def build_prompt(
self,
query: str,
contexts: list[str],
mode: str = None
) -> str:
"""
构建 prompt
Args:
mode: "global" / "local" / None (auto-detect)
"""
# Auto-detect the mode
if mode is None:
mode = self.select_mode(query)
# Format the contexts
context_str = self._format_contexts(contexts)
# Select the template
template = (
self.global_prompt_template if mode == "global"
else self.local_prompt_template
)
# Fill in the template
prompt = template.format(context=context_str, query=query)
print(f"[LightRAG] Selected mode: {mode}")
return prompt
def _format_contexts(self, contexts: list[str]) -> str:
"""Format contexts into a uniform layout."""
formatted = []
for i, ctx in enumerate(contexts, 1):
formatted.append(f"[doc_{i}]\n{ctx}\n")
return "\n".join(formatted)
# Usage example
manager = LightRAGPromptManager()
# Example 1: global query (auto-detected as global mode)
query1 = "What are the overall trends in RAG research in 2024?"
prompt1 = manager.build_prompt(query1, contexts=retrieved_docs)
# Output: [LightRAG] Selected mode: global
# Example 2: local query (auto-detected as local mode)
query2 = "How do I implement vector search in Python?"
prompt2 = manager.build_prompt(query2, contexts=retrieved_docs)
# Output: [LightRAG] Selected mode: local
# Example 3: manually specify the mode
prompt3 = manager.build_prompt(
query="Explain RAG architecture",
contexts=retrieved_docs,
mode="local" # Force local mode
)
Advantages:
- ✅ Automatically adapts to the query type (better answers)
- ✅ Simple and efficient (no complex rules required)
- ✅ Easy to extend (adding a new mode is straightforward)
Extensions:
- Combine with a query classifier (an ML model that predicts the query type) — see the sketch below
- Add more modes (e.g. a "creative mode" for brainstorming)
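As a sketch of the first extension, the keyword heuristic in `select_mode` could be swapped for a small embedding-based classifier: label a handful of example queries per mode and pick the mode whose examples are closest to the incoming query. The example queries and the `embed` helper below are assumptions; any sentence-embedding model would work.

```python
# Hedged sketch: nearest-neighbor mode selection over a few labeled example queries.
# `embed` is an assumed helper that returns a unit-normalized vector for a text.
import numpy as np

MODE_EXAMPLES = {
    "global": [
        "What are the overall trends in RAG research?",
        "Compare the retrieval strategies used by these projects.",
    ],
    "local": [
        "How do I implement vector search in Python?",
        "Show the exact steps to configure the index.",
    ],
}

def select_mode_ml(query: str, embed) -> str:
    """Return the mode whose example queries are, on average, most similar to the query."""
    q = embed(query)
    scores = {
        mode: float(np.mean([np.dot(q, embed(ex)) for ex in examples]))
        for mode, examples in MODE_EXAMPLES.items()
    }
    return max(scores, key=scores.get)
```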
Streaming Output Optimization
Why stream the output?
Non-streaming generation makes the user wait for the complete LLM response (often 5-10 seconds), which is a poor experience. Streaming can:
- ✅ Reduce time-to-first-token (from ~5 s down to ~500 ms)
- ✅ Improve perceived speed (token-by-token display, ChatGPT-style)
- ✅ Let users interrupt early and save cost — a server-side cancellation sketch follows below
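The third point can also be enforced on the server: stop pulling tokens from the LLM once the client disconnects. A minimal FastAPI sketch is shown below; it assumes the `StreamingGenerator` instance defined in the next code block, and `Request.is_disconnected()` is standard Starlette.

```python
# Hedged sketch: cancel generation when the client goes away, so abandoned requests
# stop consuming tokens. Assumes a `generator` like the StreamingGenerator shown below.
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/chat/stream-cancellable")
async def chat_stream_cancellable(request: Request):
    body = await request.json()

    async def event_stream():
        async for chunk in generator.generate_stream(body["messages"]):
            # Stop paying for tokens once the client has disconnected
            if await request.is_disconnected():
                break
            yield f"data: {chunk}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```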
A Complete Streaming Implementation
from typing import AsyncIterator
import asyncio
class StreamingGenerator:
"""Streaming generator (works with SSE / WebSocket)."""
def __init__(self, llm_client):
self.llm = llm_client
async def generate_stream(
self,
messages: list[dict],
model: str = "gpt-4"
) -> AsyncIterator[str]:
"""
Generate a streaming response.
Yields:
Individual tokens (or small token groups) as they arrive.
"""
response = await self.llm.chat.completions.create(
model=model,
messages=messages,
temperature=0.3,
stream=True # Enable streaming
)
async for chunk in response:
if chunk.choices[0].delta.content:
yield chunk.choices[0].delta.content
async def generate_with_metadata(
self,
messages: list[dict],
model: str = "gpt-4"
) -> AsyncIterator[dict]:
"""
Streaming generation plus metadata (token counts, latency, etc.).
Yields:
{
"type": "token" / "metadata" / "done",
"content": str,
"metadata": {...}
}
"""
import time
start_time = time.time()
first_token_time = None
total_tokens = 0
response = await self.llm.chat.completions.create(
model=model,
messages=messages,
temperature=0.3,
stream=True,
stream_options={"include_usage": True} # Include token usage statistics
)
async for chunk in response:
# Guard: with include_usage enabled, the final usage-only chunk has no choices
if not chunk.choices:
continue
# Record time to first token
if first_token_time is None and chunk.choices[0].delta.content:
first_token_time = time.time()
yield {
"type": "metadata",
"metadata": {
"time_to_first_token": first_token_time - start_time
}
}
# Emit the token
if chunk.choices[0].delta.content:
content = chunk.choices[0].delta.content
total_tokens += len(content.split()) # Rough estimate (word count, not true tokens)
yield {
"type": "token",
"content": content
}
# Emit statistics when generation finishes
if chunk.choices[0].finish_reason:
end_time = time.time()
yield {
"type": "done",
"metadata": {
"total_time": end_time - start_time,
"time_to_first_token": first_token_time - start_time,
"total_tokens": total_tokens,
"tokens_per_second": total_tokens / (end_time - start_time)
}
}
# FastAPI integration example (SSE)
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
app = FastAPI()
generator = StreamingGenerator(openai.AsyncOpenAI())
@app.post("/chat/stream")
async def chat_stream(request: dict):
"""Streaming chat endpoint."""
messages = request["messages"]
async def event_stream():
"""SSE event stream."""
async for chunk in generator.generate_stream(messages):
# SSE format: data: {content}\n\n
yield f"data: {chunk}\n\n"
# End-of-stream marker
yield "data: [DONE]\n\n"
return StreamingResponse(
event_stream(),
media_type="text/event-stream"
)
# Client-side example (JavaScript). EventSource only supports GET requests, so a POST
# streaming endpoint like the one above is consumed with fetch + a streaming reader.
"""
const response = await fetch('/chat/stream', {
method: 'POST',
headers: {'Content-Type': 'application/json'},
body: JSON.stringify({messages: [...]})
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
let finished = false;
while (!finished) {
const {done, value} = await reader.read();
if (done) break;
for (const line of decoder.decode(value).split('\n\n')) {
if (!line.startsWith('data: ')) continue;
const token = line.slice(6);
if (token === '[DONE]') { finished = true; break; }
// Append tokens to the answer as they arrive
document.getElementById('answer').textContent += token;
}
}
"""
Advanced Techniques
1. Context Compression (handling very long context)
class ContextCompressor:
"""Context compressor (reduces token usage)."""
def __init__(self, llm_client):
self.llm = llm_client
async def compress_contexts(
self,
query: str,
contexts: list[str],
max_tokens: int = 4000
) -> list[str]:
"""
Compress the contexts down to a given token budget.
Strategy:
1. Extract the sentences in each document that are relevant to the query
2. Re-rank the sentences (most relevant first)
3. Truncate to the token limit
"""
compressed = []
for ctx in contexts:
# Extract relevant sentences
relevant_sentences = await self._extract_relevant_sentences(
query, ctx
)
compressed.extend(relevant_sentences)
# Sort by relevance
sorted_sentences = await self._rank_sentences(query, compressed)
# Truncate to the token limit
final_context = self._truncate_to_tokens(sorted_sentences, max_tokens)
return final_context
async def _extract_relevant_sentences(
self,
query: str,
document: str
) -> list[str]:
"""Extract sentences relevant to the query (helpers such as _split_sentences, _compute_similarity, _rank_sentences, and _truncate_to_tokens are assumed)."""
sentences = self._split_sentences(document)
# Score each sentence's similarity to the query
scores = []
for sent in sentences:
score = await self._compute_similarity(query, sent)
scores.append((sent, score))
# Return sentences above the relevance threshold
relevant = [sent for sent, score in scores if score > 0.5]
return relevant
2. Few-Shot Examples (improving quality on specific tasks)
FEW_SHOT_EXAMPLES = {
"qa_with_citation": [
{
"query": "What is the capital of France?",
"context": "[doc_1] France is a country in Europe. Paris is its capital and largest city.",
"answer": "The capital of France is Paris [doc_1]."
},
{
"query": "When was Python created?",
"context": "[doc_1] Python was created by Guido van Rossum in 1991.",
"answer": "Python was created in 1991 by Guido van Rossum [doc_1]."
}
]
}
def build_few_shot_prompt(query: str, context: str, task: str = "qa_with_citation") -> str:
"""Build a few-shot prompt."""
examples = FEW_SHOT_EXAMPLES[task]
prompt = "# Examples\n\n"
for ex in examples:
prompt += f"Context: {ex['context']}\n"
prompt += f"Question: {ex['query']}\n"
prompt += f"Answer: {ex['answer']}\n\n"
prompt += f"# Your Turn\n\nContext: {context}\nQuestion: {query}\nAnswer:"
return prompt
Frequently Asked Questions
How to control answer length
Method 1: ask for it explicitly in the prompt ("in 2-3 sentences")
Method 2: cap the output with the max_tokens parameter
Method 3: truncate in post-processing (keeping complete sentences) — a sketch follows below
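A minimal sketch of method 3, using a naive regex sentence splitter (an assumption; a proper sentence tokenizer is more robust):

```python
import re

def truncate_to_sentences(text: str, max_chars: int = 600) -> str:
    """Trim the answer to at most max_chars, cutting only at sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept: list[str] = []
    total = 0
    for sentence in sentences:
        if kept and total + len(sentence) > max_chars:
            break
        kept.append(sentence)
        total += len(sentence) + 1
    return " ".join(kept)
```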
How to reduce hallucinations
Low temperature: temperature=0.1-0.3 (less creativity)
Explicit instructions: stress "use only the context" and "say you don't know when you don't"
Self-verification: see the Self-Corrective-Agentic-RAG approach above
How to improve citation quality
Structured context: number every document as [doc_N]
Few-shot: provide citation examples
Post-processing verification: check that the citations in the answer actually exist — see the sketch below
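A minimal sketch of that post-processing check, assuming the [doc_N] citation convention used throughout this guide:

```python
import re

def verify_citations(answer: str, num_docs: int) -> dict:
    """Extract [doc_N] (and [doc_1, doc_3]) citations and flag out-of-range references."""
    cited: set[int] = set()
    for group in re.findall(r"\[([^\]]*doc_\d+[^\]]*)\]", answer):
        cited.update(int(n) for n in re.findall(r"doc_(\d+)", group))
    invalid = sorted(n for n in cited if n < 1 or n > num_docs)
    return {
        "cited_docs": sorted(cited),
        "invalid_citations": invalid,  # citations pointing at documents that were never provided
        "has_citation": bool(cited),
    }
```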
How to handle multilingual generation
Model choice: use multilingual models (GPT-4, Claude)
Language detection: detect the query language automatically and answer in the same language — see the sketch below
Translation strategy: translate the query → retrieve in English → translate the answer back
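A minimal sketch of the language-detection approach, using the langdetect package (an assumption; any language-identification model would do) to tell the LLM which language to answer in:

```python
from langdetect import detect  # pip install langdetect

def language_instruction(query: str) -> str:
    """Detect the query language and return an extra system-prompt instruction."""
    try:
        lang = detect(query)  # e.g. "en", "zh-cn", "fr"
    except Exception:
        lang = "en"  # fall back to English if detection fails
    return f"Answer in the same language as the question (detected language code: {lang})."

# Usage: append the instruction to the system prompt built earlier, e.g.
# system_prompt = base_system_prompt + "\n" + language_instruction(user_query)
```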
Further Reading
- Prompt Engineering - in-depth prompt design techniques
- How to Improve RAG Performance - includes a chapter on generation optimization
- Building a High-Quality RAG System - quality-assurance methods
References
- Self-Corrective RAG (LangGraph examples) - Hallucination detection and corrective loops
- OpenAI Prompting Guide - Structured prompts and citations
- Anthropic Prompt Engineering - Safety and groundedness
Next up: move on to Performance Optimization in Practice for end-to-end optimization of latency, cost, and throughput.