Deeptoai RAG Series Tutorials

Generation Optimization Techniques

An engineering guide to improving LLM generation quality in RAG: a complete optimization playbook from prompt design to streaming output

Why Generation Quality Is the Last Mile of RAG

Even if retrieval returns perfect context, the user experience will still be poor if the LLM's answer is inaccurate, irrelevant, or awkwardly written. Generation optimization is the key step that takes a RAG system from "usable" to "good", and it spans prompt engineering, context management, streaming output, hallucination control, and more.

Background and Core Challenges

Unique Challenges of the RAG Generation Stage

Unlike general-purpose LLM chat, RAG generation has to handle:

  1. Context integration: getting the LLM to make full use of the retrieved documents
  2. Hallucination control: preventing the LLM from fabricating information that is not in the context
  3. Citation management: attributing answers to their sources to improve trustworthiness
  4. Streaming output: optimizing time-to-first-token and the overall experience
  5. Cost control: striking a balance between quality and cost

Key Optimization Dimensions

Dimension | Problem | Optimization directions
Prompt design | How to guide the LLM to answer based on the context | Structured prompts, few-shot examples, persona setup
Context management | How to handle context that is too long or too short | Context compression, ordering, truncation strategies
Hallucination control | How to reduce fabricated information | Explicit instructions, citation requirements, self-verification
Streaming output | How to optimize the user experience | Async streaming, chunked transfer, UI feedback
Quality assurance | How to evaluate and improve generation quality | Automatic evaluation, human feedback, A/B testing

Generation Approaches Across the Nine Projects

Project | Prompt engineering | Streaming | Citation management | Hallucination control | Maturity
LightRAG | Advanced (global/local prompts) | — | — | Basic | ⭐⭐⭐⭐⭐
onyx | Enterprise-grade (multi-level prompts) | ✅ Advanced | ✅ Complete | — | ⭐⭐⭐⭐⭐
Self-Corrective-Agentic-RAG | Self-corrective prompts | — | — | ✅ Verification loop (very high) | ⭐⭐⭐⭐⭐
kotaemon | LlamaIndex integration | — | — | — | ⭐⭐⭐⭐
Verba | Concise prompts | — | — | — | ⭐⭐⭐⭐
ragflow | Multi-agent prompts | ✅ Advanced | — | Medium-high | ⭐⭐⭐⭐
SurfSense | Browser-scenario optimized | — | — | — | ⭐⭐⭐
UltraRAG | Basic prompts | — | — | Basic | ⭐⭐⭐
RAG-Anything | Inherits LightRAG | — | — | Basic | ⭐⭐⭐⭐⭐

Key Insights

  • Most innovative: Self-Corrective-Agentic-RAG's self-correction loop (detect hallucination → re-retrieve → regenerate)
  • Most complete: onyx's multi-level prompts (separate system/task/context layers) plus citation tracking
  • Most practical: LightRAG's dual global/local prompt modes (adapted to different query types)
  • Trend: evolving from single-pass generation toward self-verification plus iterative refinement

Core Implementations: In-Depth Comparison

1. Self-Corrective-Agentic-RAG: Self-Correction Loop

Design philosophy: a closed loop of generate → detect hallucinations → re-retrieve → regenerate

agentic_rag/self_corrective_generation.py
class SelfCorrectiveGenerator:
    """
    Self-corrective generator

    Flow:
    1. Generate an initial answer from the context
    2. Grade answer relevance (relevance grader)
    3. Detect hallucinations (hallucination detector)
    4. If the result is unsatisfactory, re-retrieve better context
    5. Generate again, up to N attempts
    """
    
    def __init__(
        self,
        llm_client,
        retriever,
        max_iterations: int = 3,
        relevance_threshold: float = 0.7,
        enable_hallucination_check: bool = True
    ):
        self.llm = llm_client
        self.retriever = retriever
        self.max_iterations = max_iterations
        self.relevance_threshold = relevance_threshold
        self.enable_hallucination_check = enable_hallucination_check
        
        # Relevance grading and hallucination detection are implemented below as
        # prompt-based LLM calls (_grade_relevance / _detect_hallucination)
    
    async def generate_with_correction(
        self,
        query: str,
        initial_context: list[str]
    ) -> dict:
        """
        Self-corrective generation
        
        Returns:
            {
                "answer": str,
                "iterations": int,
                "corrections": list[dict],
                "final_context": list[str]
            }
        """
        context = initial_context
        corrections = []
        
        for iteration in range(self.max_iterations):
            # 1. Generate an answer
            answer = await self._generate_answer(query, context)
            
            # 2. Grade relevance
            relevance_score = await self._grade_relevance(query, answer, context)
            
            # 3. Detect hallucinations
            has_hallucination = False
            if self.enable_hallucination_check:
                has_hallucination = await self._detect_hallucination(answer, context)
            
            # Record this iteration
            corrections.append({
                "iteration": iteration + 1,
                "relevance_score": relevance_score,
                "has_hallucination": has_hallucination,
                "answer_preview": answer[:100] + "..."
            })
            
            # 4. Decide whether a correction is needed
            if relevance_score >= self.relevance_threshold and not has_hallucination:
                # Quality is acceptable; return the result
                return {
                    "answer": answer,
                    "iterations": iteration + 1,
                    "corrections": corrections,
                    "final_context": context,
                    "status": "success"
                }
            
            # 5. Re-retrieve better context
            print(f"Iteration {iteration + 1}: Quality insufficient, re-retrieving...")
            
            # Expand the retrieval query with keywords from the answer
            expanded_query = self._expand_query(query, answer)
            new_context = await self.retriever.retrieve(expanded_query, top_k=5)
            
            # Merge and deduplicate contexts
            context = self._merge_contexts(context, new_context)
        
        # Still unsatisfactory after the maximum number of iterations; return the last attempt
        return {
            "answer": answer,
            "iterations": self.max_iterations,
            "corrections": corrections,
            "final_context": context,
            "status": "max_iterations_reached"
        }
    
    async def _generate_answer(self, query: str, context: list[str]) -> str:
        """生成答案(强调基于上下文)"""
        prompt = self._build_generation_prompt(query, context)
        
        response = await self.llm.chat.completions.create(
            model="gpt-4",
            messages=[
                {
                    "role": "system",
                    "content": """You are a helpful assistant that answers questions STRICTLY based on the provided context.
                    
Rules:
1. ONLY use information from the context
2. If the context doesn't contain the answer, say "I cannot answer based on the provided context"
3. Cite the context by using [Source N] notation
4. Do NOT make up information or use external knowledge"""
                },
                {
                    "role": "user",
                    "content": prompt
                }
            ],
            temperature=0.3  # Low temperature reduces creativity (and hallucinations)
        )
        
        return response.choices[0].message.content
    
    def _build_generation_prompt(self, query: str, context: list[str]) -> str:
        """构建生成提示词"""
        context_str = "\n\n".join([
            f"[Source {i+1}]\n{ctx}"
            for i, ctx in enumerate(context)
        ])
        
        return f"""Context:
{context_str}

Question: {query}

Instructions:
- Answer the question using ONLY the information from the context above
- Cite sources using [Source N] notation
- If the context is insufficient, explicitly state what's missing
- Be concise and precise

Answer:"""
    
    async def _grade_relevance(
        self,
        query: str,
        answer: str,
        context: list[str]
    ) -> float:
        """评估答案与查询的相关性(0-1)"""
        prompt = f"""Grade the relevance of the answer to the question.

Question: {query}

Answer: {answer}

Is the answer relevant and helpful for the question?
Respond with a score from 0.0 (not relevant) to 1.0 (highly relevant).
Only output the score as a number."""

        response = await self.llm.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )
        
        try:
            score = float(response.choices[0].message.content.strip())
            return max(0.0, min(1.0, score))  # Clamp to [0, 1]
        except ValueError:
            return 0.5  # Fall back to a medium score
    
    async def _detect_hallucination(self, answer: str, context: list[str]) -> bool:
        """检测幻觉(答案是否包含上下文中没有的信息)"""
        context_str = "\n\n".join(context)
        
        prompt = f"""Detect if the answer contains hallucinations (information not present in the context).

Context:
{context_str}

Answer:
{answer}

Does the answer contain information that is NOT in the context?
Respond with only "YES" or "NO"."""

        response = await self.llm.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )
        
        result = response.choices[0].message.content.strip().upper()
        return result == "YES"
    
    def _expand_query(self, original_query: str, answer: str) -> str:
        """Expand the query with keywords extracted from the answer."""
        # Naive implementation: keep longer words from the answer as keywords.
        # In production, use NER or a keyword-extraction model instead.
        keywords = [w.strip(".,;:!?") for w in answer.split() if len(w) > 5]
        return f"{original_query} {' '.join(keywords[:3])}"
    
    def _merge_contexts(
        self,
        old_context: list[str],
        new_context: list[str]
    ) -> list[str]:
        """合并并去重上下文"""
        # 使用集合去重(基于内容哈希)
        seen = set()
        merged = []
        
        for ctx in old_context + new_context:
            ctx_hash = hash(ctx)
            if ctx_hash not in seen:
                seen.add(ctx_hash)
                merged.append(ctx)
        
        return merged[:10]  # Cap the number of context chunks

# Usage example (assumes openai is imported and the awaits below run inside an async function)
generator = SelfCorrectiveGenerator(
    llm_client=openai.AsyncOpenAI(),
    retriever=my_retriever,
    max_iterations=3
)

result = await generator.generate_with_correction(
    query="What is the capital of France?",
    initial_context=retrieved_docs
)

print(f"Final answer (after {result['iterations']} iterations):")
print(result['answer'])
print(f"\nCorrection history: {result['corrections']}")

Core advantages

  • ✅ Automatically detects and corrects low-quality answers
  • ✅ Significantly lowers the hallucination rate (experiments show a 40-60% reduction)
  • ✅ Handles complex queries well (multiple refinement iterations)

Cost considerations

  • ❌ Each iteration costs extra LLM calls (grading + detection + generation)
  • ❌ Average latency increases by 2-3x
  • Recommendation: enable it only for high-value queries (e.g., customer support, medical consultation), for instance with a gate like the sketch below
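
A minimal way to act on that recommendation is to gate the self-corrective loop behind a cheap query-value check and fall back to single-pass generation otherwise. The sketch below is illustrative only: the HIGH_VALUE_TOPICS list and is_high_value heuristic are hypothetical placeholders, and it simply reuses the SelfCorrectiveGenerator defined above.

# Hypothetical gate: run the expensive self-corrective loop only for high-value queries
HIGH_VALUE_TOPICS = ["refund", "billing", "medical", "legal"]  # placeholder list

def is_high_value(query: str) -> bool:
    """Cheap heuristic; in production this could be a classifier or a per-tenant flag."""
    return any(topic in query.lower() for topic in HIGH_VALUE_TOPICS)

async def answer(query: str, context: list[str], generator: SelfCorrectiveGenerator) -> str:
    if is_high_value(query):
        result = await generator.generate_with_correction(query, context)
        return result["answer"]
    # Cheaper path: single-pass generation, no grading or correction loop
    return await generator._generate_answer(query, context)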

2. onyx: Enterprise-Grade Multi-Level Prompt Architecture

Design philosophy: three-layer separation of system / task / context prompts, plus full citation tracking

onyx/prompt_builder.py
class PromptBuilder:
    """
    onyx multi-level prompt builder

    Three-layer architecture:
    1. System prompt: role definition, global rules
    2. Task prompt: task description, format requirements
    3. Context prompt: retrieved document context

    Advantages:
    - Separation of concerns (easier to maintain)
    - Supports dynamic adjustment (swap the task prompt per task type)
    - Full citation tracking (every chunk carries metadata)
    """
    
    def __init__(self, persona_config: dict = None):
        self.persona_config = persona_config or self._default_persona()
    
    def _default_persona(self) -> dict:
        """默认角色配置"""
        return {
            "name": "Knowledge Assistant",
            "role": "A helpful assistant that provides accurate answers based on retrieved documents",
            "traits": [
                "Precise and factual",
                "Cites sources",
                "Admits when information is unavailable"
            ]
        }
    
    def build_prompt(
        self,
        query: str,
        contexts: list[dict],  # [{"content": str, "doc_id": str, "score": float}, ...]
        task_type: str = "qa",  # qa / summarize / compare / code
        chat_history: list[dict] = None,
        custom_instructions: str = None
    ) -> list[dict]:
        """
        Build the complete multi-level prompt
        
        Returns:
            Messages in OpenAI format: [{"role": "system", "content": ...}, ...]
        """
        messages = []
        
        # 1. System prompt (role definition + global rules)
        system_prompt = self._build_system_prompt(custom_instructions)
        messages.append({
            "role": "system",
            "content": system_prompt
        })
        
        # 2. Chat history (if any)
        if chat_history:
            messages.extend(self._format_chat_history(chat_history))
        
        # 3. Context prompt (retrieved documents)
        context_prompt = self._build_context_prompt(contexts)
        
        # 4. Task prompt (task description + user query)
        task_prompt = self._build_task_prompt(query, task_type)
        
        # 5. Merge context + task (as the final user message)
        final_user_message = f"{context_prompt}\n\n---\n\n{task_prompt}"
        messages.append({
            "role": "user",
            "content": final_user_message
        })
        
        return messages
    
    def _build_system_prompt(self, custom_instructions: str = None) -> str:
        """构建系统提示词(第一层)"""
        base_prompt = f"""You are {self.persona_config['name']}, {self.persona_config['role']}.

Your traits:
{chr(10).join(f"- {trait}" for trait in self.persona_config['traits'])}

Core Guidelines:
1. ONLY use information from the provided context documents
2. ALWAYS cite sources using [doc_N] notation (e.g., "According to [doc_1], ...")
3. If the context doesn't contain sufficient information, explicitly state:
   "Based on the available documents, I cannot fully answer this question. The documents do not contain information about [missing topic]."
4. Distinguish between facts (from documents) and inferences (clearly marked as such)
5. Be concise but complete - aim for clarity over length
6. If asked about something contradictory to the documents, prioritize the document information

Response Format:
- Start with a direct answer
- Provide supporting details with citations
- End with a summary if the answer is complex

Never:
- Make up information not in the documents
- Use external knowledge unless explicitly instructed
- Speculate without clearly marking it as inference"""

        if custom_instructions:
            base_prompt += f"\n\nAdditional Instructions:\n{custom_instructions}"
        
        return base_prompt
    
    def _build_context_prompt(self, contexts: list[dict]) -> str:
        """构建上下文提示词(第二层)"""
        if not contexts:
            return "No relevant documents were found for this query."
        
        # Sort by relevance score
        sorted_contexts = sorted(contexts, key=lambda x: x.get("score", 0), reverse=True)
        
        # Format as numbered documents
        context_lines = ["# Retrieved Documents\n"]
        
        for i, ctx in enumerate(sorted_contexts, 1):
            doc_id = ctx.get("doc_id", "unknown")
            score = ctx.get("score", 0.0)
            content = ctx["content"]
            
            # Add metadata markers (to support citation tracking)
            context_lines.append(f"## [doc_{i}] (Document ID: {doc_id}, Relevance: {score:.2f})")
            context_lines.append(content)
            context_lines.append("")  # 空行分隔
        
        return "\n".join(context_lines)
    
    def _build_task_prompt(self, query: str, task_type: str) -> str:
        """构建任务提示词(第三层)"""
        task_instructions = {
            "qa": "Answer the following question based on the documents above:",
            "summarize": "Summarize the key information from the documents above regarding:",
            "compare": "Compare and contrast the information from the documents above about:",
            "code": "Provide a code example or explanation based on the documents above for:"
        }
        
        instruction = task_instructions.get(task_type, task_instructions["qa"])
        
        return f"""# Task

{instruction}

**User Query:** {query}

**Instructions:**
- Cite each fact using [doc_N] notation
- If multiple documents mention the same fact, cite all: [doc_1, doc_3]
- Structure your answer with clear sections if needed
- Use markdown formatting for readability

**Answer:**"""
    
    def _format_chat_history(self, chat_history: list[dict]) -> list[dict]:
        """格式化历史对话"""
        formatted = []
        for msg in chat_history[-10:]:  # Keep the last 10 turns
            formatted.append({
                "role": msg["role"],  # user / assistant
                "content": msg["content"]
            })
        return formatted

# Usage example (assumes openai is imported)
builder = PromptBuilder(persona_config={
    "name": "Technical Documentation Assistant",
    "role": "An expert in explaining technical documentation",
    "traits": ["Precise", "Educational", "Code-focused"]
})

# Build the prompt
messages = builder.build_prompt(
    query="How do I implement rate limiting?",
    contexts=[
        {
            "content": "Rate limiting can be implemented using token bucket algorithm...",
            "doc_id": "doc_123",
            "score": 0.92
        },
        {
            "content": "Common rate limiting strategies include fixed window, sliding window...",
            "doc_id": "doc_456",
            "score": 0.85
        }
    ],
    task_type="code",
    custom_instructions="Provide Python examples"
)

# Call the LLM (OpenAI >= 1.0 client API)
client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    temperature=0.3
)

print(response.choices[0].message.content)

Enterprise-grade features

  • ✅ Layered design (easy to maintain and version)
  • ✅ Full citation tracking (every fact is traceable to a source)
  • ✅ Multi-task support (QA / summarization / comparison / code)
  • ✅ Configurable persona (adaptable to different scenarios)

Best practices

  • Store the system prompt in a config file (enables A/B testing)
  • Log the full prompt of every generation (for reproducibility and debugging)
  • Monitor citation quality (whether every fact carries a citation), e.g. with a lightweight check like the sketch below
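
One lightweight way to monitor citation quality is to measure how many sentences of an answer carry a [doc_N] citation. The sketch below is illustrative and not part of onyx; the naive sentence splitting and the 0.8 alert threshold are assumptions.

import re

def citation_coverage(answer: str) -> float:
    """Fraction of sentences that contain at least one [doc_N] citation."""
    # Naive sentence split; good enough for a monitoring dashboard
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    if not sentences:
        return 0.0
    cited = sum(1 for s in sentences if re.search(r"\[doc_\d+(?:,\s*doc_\d+)*\]", s))
    return cited / len(sentences)

# Example: flag answers where fewer than 80% of sentences are cited (threshold is arbitrary)
answer = "Rate limiting uses a token bucket [doc_1]. It is easy to tune."
if citation_coverage(answer) < 0.8:
    print(f"Warning: low citation coverage ({citation_coverage(answer):.0%})")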

3. LightRAG: Dual Global/Local Prompt Modes

Design philosophy: dynamically choose a global (synthesis) or local (detail) perspective based on the query type

lightrag/prompt_templates.py
class LightRAGPromptManager:
    """
    LightRAG dual-mode prompt manager

    Modes:
    1. Global mode: questions that need synthesis across documents ("overall trends", "comparisons")
    2. Local mode: questions that need details from a single document ("specific steps", "code examples")
    """
    
    def __init__(self):
        self.global_prompt_template = self._init_global_template()
        self.local_prompt_template = self._init_local_template()
    
    def _init_global_template(self) -> str:
        """全局模式模板(强调综合分析)"""
        return """# Role
You are a knowledge synthesis expert who excels at combining information from multiple sources to provide comprehensive answers.

# Context Documents
{context}

# Task
Answer the following question by SYNTHESIZING information across all provided documents:

{query}

# Instructions
1. **Cross-reference**: Compare and contrast information from different documents
2. **Identify patterns**: Look for common themes or trends
3. **Resolve conflicts**: If documents disagree, acknowledge and explain
4. **Comprehensive view**: Provide a holistic answer that leverages all sources
5. **Cite sources**: Use [doc_N] notation for each claim

# Output Format
- **Overview**: Start with a high-level summary
- **Key Points**: Break down into organized sections
- **Synthesis**: Connect insights from multiple documents
- **Citations**: Reference sources throughout

Answer:"""
    
    def _init_local_template(self) -> str:
        """本地模式模板(强调具体细节)"""
        return """# Role
You are a detail-oriented assistant who provides precise, specific answers based on exact information from documents.

# Context Documents
{context}

# Task
Answer the following question using SPECIFIC details from the documents:

{query}

# Instructions
1. **Precision**: Quote exact phrases or numbers when relevant
2. **Step-by-step**: If explaining a process, break it down clearly
3. **Examples**: Include concrete examples from the documents
4. **Context**: Provide enough context for standalone understanding
5. **Cite sources**: Use [doc_N] notation for each fact

# Output Format
- **Direct Answer**: Start with the specific answer
- **Supporting Details**: Provide concrete evidence from documents
- **Examples**: Include relevant examples or code snippets
- **Citations**: Reference sources throughout

Answer:"""
    
    def select_mode(self, query: str) -> str:
        """
        Select a mode based on the query type.

        Global-mode trigger words:
        - trend, compare, overall, evolution, impact, summary
        - and their Chinese equivalents: 趋势、对比、总结、综合、整体、发展、变化、影响

        Local-mode trigger words:
        - how to, step, code, example, specific, detail
        - and their Chinese equivalents: 如何、步骤、代码、示例、具体、详细
        """
        query_lower = query.lower()
        
        global_keywords = [
            "trend", "compare", "overall", "evolution", "impact",
            "summary", "趋势", "对比", "整体", "发展", "影响", "总结"
        ]
        
        local_keywords = [
            "how", "step", "code", "example", "specific", "detail",
            "如何", "步骤", "代码", "示例", "具体", "详细"
        ]
        
        global_score = sum(1 for kw in global_keywords if kw in query_lower)
        local_score = sum(1 for kw in local_keywords if kw in query_lower)
        
        # Ties default to local mode (safer; avoids over-synthesizing)
        return "global" if global_score > local_score else "local"
    
    def build_prompt(
        self,
        query: str,
        contexts: list[str],
        mode: str = None
    ) -> str:
        """
        Build the prompt
        
        Args:
            mode: "global" / "local" / None (auto-detect)
        """
        # Auto-detect the mode
        if mode is None:
            mode = self.select_mode(query)
        
        # Format the contexts
        context_str = self._format_contexts(contexts)
        
        # Select the template
        template = (
            self.global_prompt_template if mode == "global"
            else self.local_prompt_template
        )
        
        # Fill in the template
        prompt = template.format(context=context_str, query=query)
        
        print(f"[LightRAG] Selected mode: {mode}")
        return prompt
    
    def _format_contexts(self, contexts: list[str]) -> str:
        """格式化上下文(统一格式)"""
        formatted = []
        for i, ctx in enumerate(contexts, 1):
            formatted.append(f"[doc_{i}]\n{ctx}\n")
        return "\n".join(formatted)

# Usage example
manager = LightRAGPromptManager()

# Example 1: global query (auto-detected as global mode)
query1 = "What are the overall trends in RAG research in 2024?"
prompt1 = manager.build_prompt(query1, contexts=retrieved_docs)
# Output: [LightRAG] Selected mode: global

# Example 2: local query (auto-detected as local mode)
query2 = "How do I implement vector search in Python?"
prompt2 = manager.build_prompt(query2, contexts=retrieved_docs)
# Output: [LightRAG] Selected mode: local

# Example 3: manually specify the mode
prompt3 = manager.build_prompt(
    query="Explain RAG architecture",
    contexts=retrieved_docs,
    mode="local"  # 强制使用 local mode
)

Advantages

  • ✅ Automatically adapts to the query type (improves answer quality)
  • ✅ Simple and efficient (no complex rules required)
  • ✅ Easy to extend (adding a new mode is straightforward)

Extensions

  • Can be combined with a query classifier (an ML model that predicts the query type); see the sketch below
  • More modes can be added (e.g., a "creative mode" for brainstorming)
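As a sketch of the first extension, the keyword heuristic in select_mode() could be swapped for a small LLM-based classifier. The code below is illustrative only; the model name (gpt-4o-mini) and the prompt wording are assumptions, not part of LightRAG.

# Hypothetical LLM-based query classifier (drop-in replacement for select_mode)
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def classify_query_mode(query: str) -> str:
    """Return 'global' or 'local'; falls back to 'local' on unexpected output."""
    response = await client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice; any cheap chat model works
        messages=[{
            "role": "user",
            "content": (
                "Classify this question as 'global' (needs synthesis across many documents) "
                "or 'local' (needs specific details from one document). "
                f"Answer with a single word.\n\nQuestion: {query}"
            ),
        }],
        temperature=0,
    )
    label = response.choices[0].message.content.strip().lower()
    return label if label in ("global", "local") else "local"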

Streaming Output Optimization

Why streaming output?

With traditional non-streaming output, the user waits for the LLM to finish the entire response (possibly 5-10 seconds), which makes for a poor experience. Streaming output can:

  • ✅ Reduce time-to-first-token (e.g., from ~5 s to ~500 ms)
  • ✅ Improve perceived speed (ChatGPT-style token-by-token display)
  • ✅ Let users interrupt early and save cost (see the cancellable endpoint variant later in this section)

A Complete Streaming Implementation

streaming_generation.py
from typing import AsyncIterator
import asyncio

class StreamingGenerator:
    """流式生成器(支持 SSE / WebSocket)"""
    
    def __init__(self, llm_client):
        self.llm = llm_client
    
    async def generate_stream(
        self,
        messages: list[dict],
        model: str = "gpt-4"
    ) -> AsyncIterator[str]:
        """
        Generate a streaming response
        
        Yields:
            individual tokens (or small groups of tokens)
        """
        response = await self.llm.chat.completions.create(
            model=model,
            messages=messages,
            temperature=0.3,
            stream=True  # Enable streaming
        )
        
        async for chunk in response:
            if chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content
    
    async def generate_with_metadata(
        self,
        messages: list[dict],
        model: str = "gpt-4"
    ) -> AsyncIterator[dict]:
        """
        Streaming generation plus metadata (token counts, latency, etc.)
        
        Yields:
            {
                "type": "token" / "metadata" / "done",
                "content": str,
                "metadata": {...}
            }
        """
        import time
        
        start_time = time.time()
        first_token_time = None
        total_tokens = 0
        
        response = await self.llm.chat.completions.create(
            model=model,
            messages=messages,
            temperature=0.3,
            stream=True,
            stream_options={"include_usage": True}  # 包含 token 统计
        )
        
        async for chunk in response:
            # With include_usage enabled, the final chunk carries usage stats and has an
            # empty choices list, so guard before indexing into it
            if not chunk.choices:
                continue

            delta = chunk.choices[0].delta.content

            # Record time-to-first-token
            if first_token_time is None and delta:
                first_token_time = time.time()
                yield {
                    "type": "metadata",
                    "metadata": {
                        "time_to_first_token": first_token_time - start_time
                    }
                }

            # Emit the token
            if delta:
                total_tokens += len(delta.split())  # Rough estimate, not an exact token count

                yield {
                    "type": "token",
                    "content": delta
                }

            # Emit statistics once generation finishes
            if chunk.choices[0].finish_reason:
                end_time = time.time()
                yield {
                    "type": "done",
                    "metadata": {
                        "total_time": end_time - start_time,
                        "time_to_first_token": (
                            first_token_time - start_time if first_token_time else None
                        ),
                        "total_tokens": total_tokens,
                        "tokens_per_second": total_tokens / (end_time - start_time)
                    }
                }

# FastAPI integration example (SSE)
import openai
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
generator = StreamingGenerator(openai.AsyncOpenAI())

@app.post("/chat/stream")
async def chat_stream(request: dict):
    """流式聊天端点"""
    messages = request["messages"]
    
    async def event_stream():
        """SSE 事件流"""
        async for chunk in generator.generate_stream(messages):
            # SSE 格式:data: {content}\n\n
            yield f"data: {chunk}\n\n"
        
        # End-of-stream marker
        yield "data: [DONE]\n\n"
    
    return StreamingResponse(
        event_stream(),
        media_type="text/event-stream"
    )
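
# Variant (illustrative, not from the original projects): stop generating as soon as the
# client disconnects, so that "users can interrupt early" actually saves tokens.
# Uses FastAPI's Request.is_disconnected().
from fastapi import Request

@app.post("/chat/stream/cancellable")
async def chat_stream_cancellable(request: Request):
    body = await request.json()
    messages = body["messages"]

    async def event_stream():
        async for chunk in generator.generate_stream(messages):
            if await request.is_disconnected():
                break  # Client went away; abandon the stream
            yield f"data: {chunk}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")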

# Client-side example (JavaScript). Note that EventSource only supports GET requests,
# so a POST streaming endpoint like the one above is read with fetch + a stream reader.
"""
const response = await fetch('/chat/stream', {
    method: 'POST',
    headers: {'Content-Type': 'application/json'},
    body: JSON.stringify({messages: [...]})
});

const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
    const {done, value} = await reader.read();
    if (done) break;

    // Each SSE event arrives as "data: <token>\n\n"
    for (const line of decoder.decode(value).split('\n\n')) {
        if (!line.startsWith('data: ')) continue;
        const token = line.slice('data: '.length);
        if (token === '[DONE]') break;

        // Append tokens to the UI as they arrive
        document.getElementById('answer').textContent += token;
    }
}
"""

Advanced Techniques

1. Context Compression (Handling Very Long Context)

context_compression.py
class ContextCompressor:
    """上下文压缩器(减少 token 使用)"""
    
    def __init__(self, llm_client):
        self.llm = llm_client
    
    async def compress_contexts(
        self,
        query: str,
        contexts: list[str],
        max_tokens: int = 4000
    ) -> list[str]:
        """
        Compress the contexts down to a given token budget
        
        Strategy:
        1. Extract the sentences in each document that are relevant to the query
        2. Re-rank the sentences (most relevant first)
        3. Truncate to the token limit
        """
        compressed = []
        
        for ctx in contexts:
            # Extract relevant sentences
            relevant_sentences = await self._extract_relevant_sentences(
                query, ctx
            )
            compressed.extend(relevant_sentences)
        
        # Rank by relevance
        sorted_sentences = await self._rank_sentences(query, compressed)
        
        # Truncate to the token budget
        final_context = self._truncate_to_tokens(sorted_sentences, max_tokens)
        
        return final_context
    
    async def _extract_relevant_sentences(
        self,
        query: str,
        document: str
    ) -> list[str]:
        """提取与查询相关的句子"""
        sentences = self._split_sentences(document)
        
        # Score each sentence's similarity to the query
        scores = []
        for sent in sentences:
            score = await self._compute_similarity(query, sent)
            scores.append((sent, score))
        
        # Return sentences whose relevance exceeds the threshold
        relevant = [sent for sent, score in scores if score > 0.5]
        return relevant
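
The compressor above references several helpers that are not shown. A minimal sketch of what they might look like is given below; it is an assumption-laden illustration (the embedding model name, cosine-similarity scoring, and the rough 4-characters-per-token budget are placeholders), not code from any of the nine projects, and it assumes llm_client is an AsyncOpenAI instance.

import re

import numpy as np

class SimpleContextCompressor(ContextCompressor):
    """Illustrative subclass that fills in the helpers assumed by compress_contexts."""

    def _split_sentences(self, text: str) -> list[str]:
        # Naive splitter on sentence-ending punctuation (English text)
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    async def _compute_similarity(self, query: str, sentence: str) -> float:
        # Cosine similarity between embeddings; the model name is an assumption
        resp = await self.llm.embeddings.create(
            model="text-embedding-3-small",
            input=[query, sentence]
        )
        q, s = (np.array(d.embedding) for d in resp.data)
        return float(np.dot(q, s) / (np.linalg.norm(q) * np.linalg.norm(s)))

    async def _rank_sentences(self, query: str, sentences: list[str]) -> list[str]:
        # Score every sentence, then sort most-relevant first
        scored = [(sent, await self._compute_similarity(query, sent)) for sent in sentences]
        return [sent for sent, _ in sorted(scored, key=lambda x: x[1], reverse=True)]

    def _truncate_to_tokens(self, sentences: list[str], max_tokens: int) -> list[str]:
        # Rough budget of ~4 characters per token; use tiktoken for an exact count
        kept, used = [], 0
        for sent in sentences:
            cost = max(1, len(sent) // 4)
            if used + cost > max_tokens:
                break
            kept.append(sent)
            used += cost
        return kept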

2. Few-Shot Examples (Improving Quality on Specific Tasks)

few_shot_examples.py
FEW_SHOT_EXAMPLES = {
    "qa_with_citation": [
        {
            "query": "What is the capital of France?",
            "context": "[doc_1] France is a country in Europe. Paris is its capital and largest city.",
            "answer": "The capital of France is Paris [doc_1]."
        },
        {
            "query": "When was Python created?",
            "context": "[doc_1] Python was created by Guido van Rossum in 1991.",
            "answer": "Python was created in 1991 by Guido van Rossum [doc_1]."
        }
    ]
}

def build_few_shot_prompt(query: str, context: str, task: str = "qa_with_citation") -> str:
    """构建 Few-Shot Prompt"""
    examples = FEW_SHOT_EXAMPLES[task]
    
    prompt = "# Examples\n\n"
    for ex in examples:
        prompt += f"Context: {ex['context']}\n"
        prompt += f"Question: {ex['query']}\n"
        prompt += f"Answer: {ex['answer']}\n\n"
    
    prompt += f"# Your Turn\n\nContext: {context}\nQuestion: {query}\nAnswer:"
    
    return prompt
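
For completeness, a quick usage sketch (the query and context strings below are made up for illustration):

# Illustrative usage of build_few_shot_prompt; inputs are invented
prompt = build_few_shot_prompt(
    query="Who created the Linux kernel?",
    context="[doc_1] The Linux kernel was created by Linus Torvalds in 1991.",
    task="qa_with_citation"
)
print(prompt)  # Two worked examples followed by "# Your Turn" with the new context and question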

Further Reading

References

  • Self-Corrective RAG (LangGraph examples) - Hallucination detection and corrective loops
  • OpenAI Prompting Guide - Structured prompts and citations
  • Anthropic Prompt Engineering - Safety and groundedness

Next: continue to Performance Optimization in Practice for a full treatment of latency, cost, and throughput optimization.