Generation Optimization Techniques
A hands-on engineering guide to improving LLM generation quality: a complete optimization playbook from prompts to streaming output
Why Generation Quality Is the Last Mile of RAG
Even when retrieval returns a perfect context, the user experience still suffers if the LLM's answer is inaccurate, irrelevant, or poorly written. Generation optimization is the key step that takes a RAG system from "works" to "works well", and it spans prompt engineering, context management, streaming output, hallucination control, and more.
Background and Core Challenges
Unique Challenges of the RAG Generation Stage
Unlike general-purpose LLM chat, RAG generation has to handle:
- Context integration: getting the LLM to make full use of the retrieved documents
- Hallucination control: keeping the LLM from fabricating information that is not in the context
- Citation management: attributing answers to their sources to improve trustworthiness
- Streaming output: optimizing time-to-first-token and the overall experience
- Cost control: striking a balance between quality and cost
Key Optimization Dimensions
| Dimension | Problem | Optimization Direction |
|---|---|---|
| Prompt design | How to guide the LLM to answer from the context | Structured prompts, few-shot examples, persona/role setting |
| Context management | How to handle context that is too long or too short | Context compression, ordering, truncation strategies |
| Hallucination control | How to reduce fabricated information | Explicit instructions, citation requirements, self-verification |
| Streaming output | How to optimize the user experience | Async streaming, chunked transfer, UI feedback |
| Quality assurance | How to evaluate and improve generation quality | Automatic evaluation, human feedback, A/B testing |
Generation Approaches Across the Nine Projects
| Project | Prompt Engineering | Streaming | Citation Management | Hallucination Control | Maturity |
|---|---|---|---|---|---|
| LightRAG | Advanced (global/local prompts) | ✅ | Basic | Medium | ⭐⭐⭐⭐⭐ |
| onyx | Enterprise-grade (multi-level prompts) | ✅ Advanced | ✅ Complete | High | ⭐⭐⭐⭐⭐ |
| Self-Corrective-Agentic-RAG | Self-corrective prompts | ✅ | ✅ Verification loop | Very high | ⭐⭐⭐⭐⭐ |
| kotaemon | LlamaIndex integration | ✅ | ✅ | Medium | ⭐⭐⭐⭐ |
| Verba | Concise prompts | ✅ | ✅ | Medium | ⭐⭐⭐⭐ |
| ragflow | Multi-agent prompts | ✅ | ✅ Advanced | Medium-high | ⭐⭐⭐⭐ |
| SurfSense | Optimized for browser scenarios | ✅ | ✅ | Medium | ⭐⭐⭐ |
| UltraRAG | Basic prompts | ✅ | Basic | Low | ⭐⭐⭐ |
| RAG-Anything | Inherits from LightRAG | ✅ | Basic | Medium | ⭐⭐⭐⭐⭐ |
Key Insights
- Most innovative: Self-Corrective-Agentic-RAG's self-correction loop (detect hallucinations → re-retrieve → regenerate)
- Most complete: onyx's multi-level prompts (system/task/context separated into three layers) plus citation tracking
- Most practical: LightRAG's global/local dual-mode prompts (adapting to different query types)
- Trend: moving from single-pass generation toward self-verification and iterative refinement
In-Depth Comparison of Core Implementations
1. Self-Corrective-Agentic-RAG: Self-Correction Loop
Design idea: a closed loop of generate → detect hallucinations → re-retrieve → regenerate
class SelfCorrectiveGenerator:
"""
Self-corrective generator.
Flow:
1. Generate an initial answer from the context
2. Grade answer relevance (Relevance Grader)
3. Detect hallucinations (Hallucination Detector)
4. If unsatisfactory, re-retrieve better context
5. Generate again, up to N attempts
"""
def __init__(
self,
llm_client,
retriever,
max_iterations: int = 3,
relevance_threshold: float = 0.7,
enable_hallucination_check: bool = True
):
self.llm = llm_client
self.retriever = retriever
self.max_iterations = max_iterations
self.relevance_threshold = relevance_threshold
self.enable_hallucination_check = enable_hallucination_check
# Initialize the graders
self.relevance_grader = self._init_relevance_grader()
self.hallucination_detector = self._init_hallucination_detector()
async def generate_with_correction(
self,
query: str,
initial_context: list[str]
) -> dict:
"""
Self-corrective generation.
Returns:
{
"answer": str,
"iterations": int,
"corrections": list[dict],
"final_context": list[str]
}
"""
context = initial_context
corrections = []
for iteration in range(self.max_iterations):
# 1. Generate an answer
answer = await self._generate_answer(query, context)
# 2. Grade relevance
relevance_score = await self._grade_relevance(query, answer, context)
# 3. Detect hallucinations
has_hallucination = False
if self.enable_hallucination_check:
has_hallucination = await self._detect_hallucination(answer, context)
# Record this iteration
corrections.append({
"iteration": iteration + 1,
"relevance_score": relevance_score,
"has_hallucination": has_hallucination,
"answer_preview": answer[:100] + "..."
})
# 4. Decide whether a correction is needed
if relevance_score >= self.relevance_threshold and not has_hallucination:
# Quality is acceptable; return
return {
"answer": answer,
"iterations": iteration + 1,
"corrections": corrections,
"final_context": context,
"status": "success"
}
# 5. Re-retrieve better context
print(f"Iteration {iteration + 1}: Quality insufficient, re-retrieving...")
# Expand the query with keywords from the answer
expanded_query = self._expand_query(query, answer)
new_context = await self.retriever.retrieve(expanded_query, top_k=5)
# Merge and deduplicate contexts
context = self._merge_contexts(context, new_context)
# Still unsatisfactory after the maximum number of iterations; return the last result
return {
"answer": answer,
"iterations": self.max_iterations,
"corrections": corrections,
"final_context": context,
"status": "max_iterations_reached"
}
async def _generate_answer(self, query: str, context: list[str]) -> str:
"""Generate an answer (strictly grounded in the context)."""
prompt = self._build_generation_prompt(query, context)
response = await self.llm.chat.completions.create(
model="gpt-4",
messages=[
{
"role": "system",
"content": """You are a helpful assistant that answers questions STRICTLY based on the provided context.
Rules:
1. ONLY use information from the context
2. If the context doesn't contain the answer, say "I cannot answer based on the provided context"
3. Cite the context by using [Source N] notation
4. Do NOT make up information or use external knowledge"""
},
{
"role": "user",
"content": prompt
}
],
temperature=0.3 # Low temperature reduces creativity (and hallucinations)
)
return response.choices[0].message.content
def _build_generation_prompt(self, query: str, context: list[str]) -> str:
"""Build the generation prompt."""
context_str = "\n\n".join([
f"[Source {i+1}]\n{ctx}"
for i, ctx in enumerate(context)
])
return f"""Context:
{context_str}
Question: {query}
Instructions:
- Answer the question using ONLY the information from the context above
- Cite sources using [Source N] notation
- If the context is insufficient, explicitly state what's missing
- Be concise and precise
Answer:"""
async def _grade_relevance(
self,
query: str,
answer: str,
context: list[str]
) -> float:
"""Grade how relevant the answer is to the query (0-1)."""
prompt = f"""Grade the relevance of the answer to the question.
Question: {query}
Answer: {answer}
Is the answer relevant and helpful for the question?
Respond with a score from 0.0 (not relevant) to 1.0 (highly relevant).
Only output the score as a number."""
response = await self.llm.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
temperature=0
)
try:
score = float(response.choices[0].message.content.strip())
return max(0.0, min(1.0, score)) # Clamp to [0, 1]
except ValueError:
return 0.5 # Fall back to a neutral score
async def _detect_hallucination(self, answer: str, context: list[str]) -> bool:
"""Detect hallucinations (does the answer contain information absent from the context?)."""
context_str = "\n\n".join(context)
prompt = f"""Detect if the answer contains hallucinations (information not present in the context).
Context:
{context_str}
Answer:
{answer}
Does the answer contain information that is NOT in the context?
Respond with only "YES" or "NO"."""
response = await self.llm.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
temperature=0
)
result = response.choices[0].message.content.strip().upper()
return result == "YES"
def _expand_query(self, original_query: str, answer: str) -> str:
"""Expand the query with keywords extracted from the answer."""
# Simple approach: pull key terms out of the answer
# (the _extract_keywords helper is assumed here; production systems could use NER or a keyphrase-extraction model)
keywords = self._extract_keywords(answer)
return f"{original_query} {' '.join(keywords[:3])}"
def _merge_contexts(
self,
old_context: list[str],
new_context: list[str]
) -> list[str]:
"""Merge contexts and remove duplicates."""
# Deduplicate using a set of content hashes
seen = set()
merged = []
for ctx in old_context + new_context:
ctx_hash = hash(ctx)
if ctx_hash not in seen:
seen.add(ctx_hash)
merged.append(ctx)
return merged[:10] # Cap the number of context chunks
# Usage example
generator = SelfCorrectiveGenerator(
llm_client=openai.AsyncOpenAI(),
retriever=my_retriever,
max_iterations=3
)
result = await generator.generate_with_correction(
query="What is the capital of France?",
initial_context=retrieved_docs
)
print(f"Final answer (after {result['iterations']} iterations):")
print(result['answer'])
print(f"\nCorrection history: {result['corrections']}")
Core advantages:
- ✅ Automatically detects and corrects low-quality answers
- ✅ Substantially lowers the hallucination rate (experiments report a 40-60% reduction)
- ✅ Handles complex queries (multiple refinement iterations)
Cost considerations:
- ❌ Each iteration costs extra LLM calls (grading + detection + generation)
- ❌ Average latency increases 2-3x
- Recommendation: enable it only for high-value queries (e.g. customer support, medical consultation) — see the gating sketch below
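To make that recommendation concrete, below is a minimal gating sketch: only queries flagged as high-value go through the expensive self-correction loop, while everything else gets a single cheap pass. The keyword heuristic and the fallback prompt are illustrative assumptions, not part of the Self-Corrective-Agentic-RAG project.

```python
# Hedged sketch: route only high-value queries through SelfCorrectiveGenerator.
# HIGH_VALUE_KEYWORDS and the fallback prompt are assumptions for illustration.
HIGH_VALUE_KEYWORDS = {"refund", "diagnosis", "contract", "compliance", "dosage"}

def is_high_value(query: str) -> bool:
    """Cheap heuristic: flag queries whose wording suggests high stakes."""
    return any(kw in query.lower() for kw in HIGH_VALUE_KEYWORDS)

async def answer(query: str, context: list[str], corrective_generator, llm) -> str:
    if is_high_value(query):
        # Worth paying for grading, hallucination checks, and possible re-retrieval
        result = await corrective_generator.generate_with_correction(query, context)
        return result["answer"]
    # Single cheap pass for everything else
    response = await llm.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Answer strictly based on the provided context."},
            {"role": "user", "content": "\n\n".join(context) + f"\n\nQuestion: {query}"},
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content
```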
2. onyx: Enterprise-Grade Multi-Level Prompt Architecture
Design idea: system/task/context separated into three layers, plus full citation tracking
class PromptBuilder:
"""
onyx-style multi-level prompt builder.
Three-layer architecture:
1. System Prompt: role definition and global rules
2. Task Prompt: task description and format requirements
3. Context Prompt: retrieved document context
Advantages:
- Separation of concerns (easier to maintain)
- Supports dynamic switching (swap the Task Prompt per task)
- Full citation tracking (every chunk carries metadata)
"""
def __init__(self, persona_config: dict = None):
self.persona_config = persona_config or self._default_persona()
def _default_persona(self) -> dict:
"""Default persona configuration."""
return {
"name": "Knowledge Assistant",
"role": "A helpful assistant that provides accurate answers based on retrieved documents",
"traits": [
"Precise and factual",
"Cites sources",
"Admits when information is unavailable"
]
}
def build_prompt(
self,
query: str,
contexts: list[dict], # [{"content": str, "doc_id": str, "score": float}, ...]
task_type: str = "qa", # qa / summarize / compare / code
chat_history: list[dict] = None,
custom_instructions: str = None
) -> list[dict]:
"""
Build the complete multi-level prompt.
Returns:
messages in OpenAI format: [{"role": "system", "content": ...}, ...]
"""
messages = []
# 1. System prompt (role definition + global rules)
system_prompt = self._build_system_prompt(custom_instructions)
messages.append({
"role": "system",
"content": system_prompt
})
# 2. Chat history (if any)
if chat_history:
messages.extend(self._format_chat_history(chat_history))
# 3. Context prompt (retrieved documents)
context_prompt = self._build_context_prompt(contexts)
# 4. Task prompt (task description + user query)
task_prompt = self._build_task_prompt(query, task_type)
# 5. Combine context + task as the final user message
final_user_message = f"{context_prompt}\n\n---\n\n{task_prompt}"
messages.append({
"role": "user",
"content": final_user_message
})
return messages
def _build_system_prompt(self, custom_instructions: str = None) -> str:
"""Build the system prompt (layer 1)."""
base_prompt = f"""You are {self.persona_config['name']}, {self.persona_config['role']}.
Your traits:
{chr(10).join(f"- {trait}" for trait in self.persona_config['traits'])}
Core Guidelines:
1. ONLY use information from the provided context documents
2. ALWAYS cite sources using [doc_N] notation (e.g., "According to [doc_1], ...")
3. If the context doesn't contain sufficient information, explicitly state:
"Based on the available documents, I cannot fully answer this question. The documents do not contain information about [missing topic]."
4. Distinguish between facts (from documents) and inferences (clearly marked as such)
5. Be concise but complete - aim for clarity over length
6. If asked about something contradictory to the documents, prioritize the document information
Response Format:
- Start with a direct answer
- Provide supporting details with citations
- End with a summary if the answer is complex
Never:
- Make up information not in the documents
- Use external knowledge unless explicitly instructed
- Speculate without clearly marking it as inference"""
if custom_instructions:
base_prompt += f"\n\nAdditional Instructions:\n{custom_instructions}"
return base_prompt
def _build_context_prompt(self, contexts: list[dict]) -> str:
"""Build the context prompt (layer 2)."""
if not contexts:
return "No relevant documents were found for this query."
# Sort by relevance score
sorted_contexts = sorted(contexts, key=lambda x: x.get("score", 0), reverse=True)
# Format as numbered documents
context_lines = ["# Retrieved Documents\n"]
for i, ctx in enumerate(sorted_contexts, 1):
doc_id = ctx.get("doc_id", "unknown")
score = ctx.get("score", 0.0)
content = ctx["content"]
# Add metadata markers (for citation tracking)
context_lines.append(f"## [doc_{i}] (Document ID: {doc_id}, Relevance: {score:.2f})")
context_lines.append(content)
context_lines.append("") # Blank line as separator
return "\n".join(context_lines)
def _build_task_prompt(self, query: str, task_type: str) -> str:
"""Build the task prompt (layer 3)."""
task_instructions = {
"qa": "Answer the following question based on the documents above:",
"summarize": "Summarize the key information from the documents above regarding:",
"compare": "Compare and contrast the information from the documents above about:",
"code": "Provide a code example or explanation based on the documents above for:"
}
instruction = task_instructions.get(task_type, task_instructions["qa"])
return f"""# Task
{instruction}
**User Query:** {query}
**Instructions:**
- Cite each fact using [doc_N] notation
- If multiple documents mention the same fact, cite all: [doc_1, doc_3]
- Structure your answer with clear sections if needed
- Use markdown formatting for readability
**Answer:**"""
def _format_chat_history(self, chat_history: list[dict]) -> list[dict]:
"""Format chat history."""
formatted = []
for msg in chat_history[-10:]: # Keep the most recent 10 turns
formatted.append({
"role": msg["role"], # user / assistant
"content": msg["content"]
})
return formatted
# Usage example
builder = PromptBuilder(persona_config={
"name": "Technical Documentation Assistant",
"role": "An expert in explaining technical documentation",
"traits": ["Precise", "Educational", "Code-focused"]
})
# Build the prompt
messages = builder.build_prompt(
query="How do I implement rate limiting?",
contexts=[
{
"content": "Rate limiting can be implemented using token bucket algorithm...",
"doc_id": "doc_123",
"score": 0.92
},
{
"content": "Common rate limiting strategies include fixed window, sliding window...",
"doc_id": "doc_456",
"score": 0.85
}
],
task_type="code",
custom_instructions="Provide Python examples"
)
# Call the LLM (OpenAI v1-style client, consistent with the async client used above)
response = openai.OpenAI().chat.completions.create(
model="gpt-4",
messages=messages,
temperature=0.3
)
print(response.choices[0].message.content)
Enterprise-grade features:
- ✅ Layered design (easy to maintain and version)
- ✅ Full citation tracking (every fact is traceable to a source)
- ✅ Multi-task support (QA / summarization / comparison / code)
- ✅ Configurable persona (adaptable to different scenarios)
Best practices:
- Store the system prompt in a config file (enables A/B testing) — a sketch follows below
- Log the complete prompt of every generation (for reproducibility and debugging)
- Monitor citation quality (does every fact carry a citation?)
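As an illustration of the first practice, here is a minimal sketch of loading the system prompt from a version-controlled JSON file and deterministically assigning users to an A/B variant. The file name, schema, and hashing scheme are assumptions, not onyx APIs.

```python
# Hypothetical prompt-variant loader: system prompts live in a JSON config so they can be
# versioned and A/B tested without code changes.
import hashlib
import json

def load_prompt_variants(path: str = "prompts.json") -> dict[str, str]:
    """Assumed schema: {"variant_a": "<system prompt>", "variant_b": "<system prompt>"}."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def pick_variant(user_id: str, variants: dict[str, str]) -> tuple[str, str]:
    """Deterministically bucket a user so their experiment assignment stays stable."""
    names = sorted(variants)
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % len(names)
    return names[bucket], variants[names[bucket]]

# Usage: log the variant name alongside every generation so answer quality
# can later be compared per variant, e.g.
# name, system_prompt = pick_variant("user_42", load_prompt_variants())
```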
3. LightRAG: Global/Local Dual-Mode Prompts
Design idea: dynamically choose a global perspective or local detail depending on the query type
class LightRAGPromptManager:
"""
LightRAG-style dual-mode prompt manager.
Modes:
1. Global Mode: questions that need synthesis across documents ("overall trends", "comparisons")
2. Local Mode: questions that need details from a single document ("concrete steps", "code examples")
"""
def __init__(self):
self.global_prompt_template = self._init_global_template()
self.local_prompt_template = self._init_local_template()
def _init_global_template(self) -> str:
"""Global-mode template (emphasizes synthesis across documents)."""
return """# Role
You are a knowledge synthesis expert who excels at combining information from multiple sources to provide comprehensive answers.
# Context Documents
{context}
# Task
Answer the following question by SYNTHESIZING information across all provided documents:
{query}
# Instructions
1. **Cross-reference**: Compare and contrast information from different documents
2. **Identify patterns**: Look for common themes or trends
3. **Resolve conflicts**: If documents disagree, acknowledge and explain
4. **Comprehensive view**: Provide a holistic answer that leverages all sources
5. **Cite sources**: Use [doc_N] notation for each claim
# Output Format
- **Overview**: Start with a high-level summary
- **Key Points**: Break down into organized sections
- **Synthesis**: Connect insights from multiple documents
- **Citations**: Reference sources throughout
Answer:"""
def _init_local_template(self) -> str:
"""Local-mode template (emphasizes concrete details)."""
return """# Role
You are a detail-oriented assistant who provides precise, specific answers based on exact information from documents.
# Context Documents
{context}
# Task
Answer the following question using SPECIFIC details from the documents:
{query}
# Instructions
1. **Precision**: Quote exact phrases or numbers when relevant
2. **Step-by-step**: If explaining a process, break it down clearly
3. **Examples**: Include concrete examples from the documents
4. **Context**: Provide enough context for standalone understanding
5. **Cite sources**: Use [doc_N] notation for each fact
# Output Format
- **Direct Answer**: Start with the specific answer
- **Supporting Details**: Provide concrete evidence from documents
- **Examples**: Include relevant examples or code snippets
- **Citations**: Reference sources throughout
Answer:"""
def select_mode(self, query: str) -> str:
"""
Select a mode based on the query type.
Global-mode trigger words:
- trend, compare, overall, evolution, impact, summary
- Chinese equivalents: 趋势, 对比, 总结, 综合, 整体, 发展, 变化, 影响
Local-mode trigger words:
- how to, step, code, example, specific, detail
- Chinese equivalents: 如何, 步骤, 代码, 示例, 具体, 详细
"""
query_lower = query.lower()
global_keywords = [
"trend", "compare", "overall", "evolution", "impact",
"summary", "趋势", "对比", "整体", "发展", "影响", "总结"
]
local_keywords = [
"how", "step", "code", "example", "specific", "detail",
"如何", "步骤", "代码", "示例", "具体", "详细"
]
global_score = sum(1 for kw in global_keywords if kw in query_lower)
local_score = sum(1 for kw in local_keywords if kw in query_lower)
# Default to local mode (safer; avoids over-generalizing)
return "global" if global_score > local_score else "local"
def build_prompt(
self,
query: str,
contexts: list[str],
mode: str = None
) -> str:
"""
构建 prompt
Args:
mode: "global" / "local" / None (auto-detect)
"""
# Auto-detect the mode
if mode is None:
mode = self.select_mode(query)
# Format the contexts
context_str = self._format_contexts(contexts)
# Select the template
template = (
self.global_prompt_template if mode == "global"
else self.local_prompt_template
)
# Fill in the template
prompt = template.format(context=context_str, query=query)
print(f"[LightRAG] Selected mode: {mode}")
return prompt
def _format_contexts(self, contexts: list[str]) -> str:
"""Format contexts into a uniform layout."""
formatted = []
for i, ctx in enumerate(contexts, 1):
formatted.append(f"[doc_{i}]\n{ctx}\n")
return "\n".join(formatted)
# Usage example
manager = LightRAGPromptManager()
# Example 1: global query (auto-detected as global mode)
query1 = "What are the overall trends in RAG research in 2024?"
prompt1 = manager.build_prompt(query1, contexts=retrieved_docs)
# Output: [LightRAG] Selected mode: global
# Example 2: local query (auto-detected as local mode)
query2 = "How do I implement vector search in Python?"
prompt2 = manager.build_prompt(query2, contexts=retrieved_docs)
# Output: [LightRAG] Selected mode: local
# Example 3: manually specify the mode
prompt3 = manager.build_prompt(
query="Explain RAG architecture",
contexts=retrieved_docs,
mode="local" # Force local mode
)
Advantages:
- ✅ Automatically adapts to the query type (better answers)
- ✅ Simple and efficient (no complex rules required)
- ✅ Easy to extend (adding a new mode is straightforward)
Extensions:
- Combine with a query classifier (an ML model that predicts the query type) — see the sketch below
- Add more modes (e.g. a "creative mode" for brainstorming)
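As a sketch of the first extension, the keyword heuristic in `select_mode` could be swapped for a small embedding-based classifier: label a handful of example queries per mode and pick the mode whose examples are closest to the incoming query. The example queries and the `embed` helper below are assumptions; any sentence-embedding model would work.

```python
# Hedged sketch: nearest-neighbor mode selection over a few labeled example queries.
# `embed` is an assumed helper that returns a unit-normalized vector for a text.
import numpy as np

MODE_EXAMPLES = {
    "global": [
        "What are the overall trends in RAG research?",
        "Compare the retrieval strategies used by these projects.",
    ],
    "local": [
        "How do I implement vector search in Python?",
        "Show the exact steps to configure the index.",
    ],
}

def select_mode_ml(query: str, embed) -> str:
    """Return the mode whose example queries are, on average, most similar to the query."""
    q = embed(query)
    scores = {
        mode: float(np.mean([np.dot(q, embed(ex)) for ex in examples]))
        for mode, examples in MODE_EXAMPLES.items()
    }
    return max(scores, key=scores.get)
```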
Streaming Output Optimization
Why stream the output?
Non-streaming generation makes the user wait for the complete LLM response (often 5-10 seconds), which is a poor experience. Streaming can:
- ✅ Reduce time-to-first-token (from ~5 s down to ~500 ms)
- ✅ Improve perceived speed (token-by-token display, ChatGPT-style)
- ✅ Let users interrupt early and save cost — a server-side cancellation sketch follows below
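The third point can also be enforced on the server: stop pulling tokens from the LLM once the client disconnects. A minimal FastAPI sketch is shown below; it assumes the `StreamingGenerator` instance defined in the next code block, and `Request.is_disconnected()` is standard Starlette.

```python
# Hedged sketch: cancel generation when the client goes away, so abandoned requests
# stop consuming tokens. Assumes a `generator` like the StreamingGenerator shown below.
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/chat/stream-cancellable")
async def chat_stream_cancellable(request: Request):
    body = await request.json()

    async def event_stream():
        async for chunk in generator.generate_stream(body["messages"]):
            # Stop paying for tokens once the client has disconnected
            if await request.is_disconnected():
                break
            yield f"data: {chunk}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```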
A Complete Streaming Implementation
from typing import AsyncIterator
import asyncio
class StreamingGenerator:
"""Streaming generator (works with SSE / WebSocket)."""
def __init__(self, llm_client):
self.llm = llm_client
async def generate_stream(
self,
messages: list[dict],
model: str = "gpt-4"
) -> AsyncIterator[str]:
"""
Generate a streaming response.
Yields:
Individual tokens (or small token groups) as they arrive.
"""
response = await self.llm.chat.completions.create(
model=model,
messages=messages,
temperature=0.3,
stream=True # Enable streaming
)
async for chunk in response:
if chunk.choices[0].delta.content:
yield chunk.choices[0].delta.content
async def generate_with_metadata(
self,
messages: list[dict],
model: str = "gpt-4"
) -> AsyncIterator[dict]:
"""
Streaming generation plus metadata (token counts, latency, etc.).
Yields:
{
"type": "token" / "metadata" / "done",
"content": str,
"metadata": {...}
}
"""
import time
start_time = time.time()
first_token_time = None
total_tokens = 0
response = await self.llm.chat.completions.create(
model=model,
messages=messages,
temperature=0.3,
stream=True,
stream_options={"include_usage": True} # Include token usage statistics
)
async for chunk in response:
# Guard: with include_usage enabled, the final usage-only chunk has no choices
if not chunk.choices:
continue
# Record time to first token
if first_token_time is None and chunk.choices[0].delta.content:
first_token_time = time.time()
yield {
"type": "metadata",
"metadata": {
"time_to_first_token": first_token_time - start_time
}
}
# Emit the token
if chunk.choices[0].delta.content:
content = chunk.choices[0].delta.content
total_tokens += len(content.split()) # Rough estimate (word count, not true tokens)
yield {
"type": "token",
"content": content
}
# Emit statistics when generation finishes
if chunk.choices[0].finish_reason:
end_time = time.time()
yield {
"type": "done",
"metadata": {
"total_time": end_time - start_time,
"time_to_first_token": first_token_time - start_time,
"total_tokens": total_tokens,
"tokens_per_second": total_tokens / (end_time - start_time)
}
}
# FastAPI integration example (SSE)
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
app = FastAPI()
generator = StreamingGenerator(openai.AsyncOpenAI())
@app.post("/chat/stream")
async def chat_stream(request: dict):
"""Streaming chat endpoint."""
messages = request["messages"]
async def event_stream():
"""SSE event stream."""
async for chunk in generator.generate_stream(messages):
# SSE format: data: {content}\n\n
yield f"data: {chunk}\n\n"
# End-of-stream marker
yield "data: [DONE]\n\n"
return StreamingResponse(
event_stream(),
media_type="text/event-stream"
)
# Client-side example (JavaScript). EventSource only supports GET requests, so a POST
# streaming endpoint like the one above is consumed with fetch + a streaming reader.
"""
const response = await fetch('/chat/stream', {
method: 'POST',
headers: {'Content-Type': 'application/json'},
body: JSON.stringify({messages: [...]})
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
let finished = false;
while (!finished) {
const {done, value} = await reader.read();
if (done) break;
for (const line of decoder.decode(value).split('\n\n')) {
if (!line.startsWith('data: ')) continue;
const token = line.slice(6);
if (token === '[DONE]') { finished = true; break; }
// Append tokens to the answer as they arrive
document.getElementById('answer').textContent += token;
}
}
"""
Advanced Techniques
1. Context Compression (handling very long context)
class ContextCompressor:
"""Context compressor (reduces token usage)."""
def __init__(self, llm_client):
self.llm = llm_client
async def compress_contexts(
self,
query: str,
contexts: list[str],
max_tokens: int = 4000
) -> list[str]:
"""
Compress the contexts down to a given token budget.
Strategy:
1. Extract the sentences in each document that are relevant to the query
2. Re-rank the sentences (most relevant first)
3. Truncate to the token limit
"""
compressed = []
for ctx in contexts:
# Extract relevant sentences
relevant_sentences = await self._extract_relevant_sentences(
query, ctx
)
compressed.extend(relevant_sentences)
# Sort by relevance
sorted_sentences = await self._rank_sentences(query, compressed)
# Truncate to the token limit
final_context = self._truncate_to_tokens(sorted_sentences, max_tokens)
return final_context
async def _extract_relevant_sentences(
self,
query: str,
document: str
) -> list[str]:
"""Extract sentences relevant to the query (helpers such as _split_sentences, _compute_similarity, _rank_sentences, and _truncate_to_tokens are assumed)."""
sentences = self._split_sentences(document)
# Score each sentence's similarity to the query
scores = []
for sent in sentences:
score = await self._compute_similarity(query, sent)
scores.append((sent, score))
# Return sentences above the relevance threshold
relevant = [sent for sent, score in scores if score > 0.5]
return relevant
2. Few-Shot Examples (improving quality on specific tasks)
FEW_SHOT_EXAMPLES = {
"qa_with_citation": [
{
"query": "What is the capital of France?",
"context": "[doc_1] France is a country in Europe. Paris is its capital and largest city.",
"answer": "The capital of France is Paris [doc_1]."
},
{
"query": "When was Python created?",
"context": "[doc_1] Python was created by Guido van Rossum in 1991.",
"answer": "Python was created in 1991 by Guido van Rossum [doc_1]."
}
]
}
def build_few_shot_prompt(query: str, context: str, task: str = "qa_with_citation") -> str:
"""Build a few-shot prompt."""
examples = FEW_SHOT_EXAMPLES[task]
prompt = "# Examples\n\n"
for ex in examples:
prompt += f"Context: {ex['context']}\n"
prompt += f"Question: {ex['query']}\n"
prompt += f"Answer: {ex['answer']}\n\n"
prompt += f"# Your Turn\n\nContext: {context}\nQuestion: {query}\nAnswer:"
return prompt
Frequently Asked Questions
How to control answer length
Method 1: ask for it explicitly in the prompt ("in 2-3 sentences")
Method 2: cap the output with the max_tokens parameter
Method 3: truncate in post-processing (keeping complete sentences) — a sketch follows below
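A minimal sketch of method 3, using a naive regex sentence splitter (an assumption; a proper sentence tokenizer is more robust):

```python
import re

def truncate_to_sentences(text: str, max_chars: int = 600) -> str:
    """Trim the answer to at most max_chars, cutting only at sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept: list[str] = []
    total = 0
    for sentence in sentences:
        if kept and total + len(sentence) > max_chars:
            break
        kept.append(sentence)
        total += len(sentence) + 1
    return " ".join(kept)
```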
How to reduce hallucinations
Low temperature: temperature=0.1-0.3 (less creativity)
Explicit instructions: stress "use only the context" and "say you don't know when you don't"
Self-verification: see the Self-Corrective-Agentic-RAG approach above
How to improve citation quality
Structured context: number every document as [doc_N]
Few-shot: provide citation examples
Post-processing verification: check that the citations in the answer actually exist — see the sketch below
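A minimal sketch of that post-processing check, assuming the [doc_N] citation convention used throughout this guide:

```python
import re

def verify_citations(answer: str, num_docs: int) -> dict:
    """Extract [doc_N] (and [doc_1, doc_3]) citations and flag out-of-range references."""
    cited: set[int] = set()
    for group in re.findall(r"\[([^\]]*doc_\d+[^\]]*)\]", answer):
        cited.update(int(n) for n in re.findall(r"doc_(\d+)", group))
    invalid = sorted(n for n in cited if n < 1 or n > num_docs)
    return {
        "cited_docs": sorted(cited),
        "invalid_citations": invalid,  # citations pointing at documents that were never provided
        "has_citation": bool(cited),
    }
```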
How to handle multilingual generation
Model choice: use multilingual models (GPT-4, Claude)
Language detection: detect the query language automatically and answer in the same language — see the sketch below
Translation strategy: translate the query → retrieve in English → translate the answer back
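A minimal sketch of the language-detection approach, using the langdetect package (an assumption; any language-identification model would do) to tell the LLM which language to answer in:

```python
from langdetect import detect  # pip install langdetect

def language_instruction(query: str) -> str:
    """Detect the query language and return an extra system-prompt instruction."""
    try:
        lang = detect(query)  # e.g. "en", "zh-cn", "fr"
    except Exception:
        lang = "en"  # fall back to English if detection fails
    return f"Answer in the same language as the question (detected language code: {lang})."

# Usage: append the instruction to the system prompt built earlier, e.g.
# system_prompt = base_system_prompt + "\n" + language_instruction(user_query)
```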
Further Reading
- Prompt Engineering - in-depth prompt design techniques
- How to Improve RAG Performance - includes a chapter on generation optimization
- Building a High-Quality RAG System - quality-assurance methods
References
- Self-Corrective RAG (LangGraph examples) - Hallucination detection and corrective loops
- OpenAI Prompting Guide - Structured prompts and citations
- Anthropic Prompt Engineering - Safety and groundedness
Next up: move on to Performance Optimization in Practice for end-to-end optimization of latency, cost, and throughput.