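Troubleshooting Guide

This page collects common problems in the Services module and how to resolve them.

Embedding Service Issues

Problem 1: Model loading times out

Symptom: TimeoutError: Embedding model loading timed out after 300s

Cause: The network is slow, the model is large, or Hugging Face Hub is unreachable.

Solution:

Raise the timeout: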

export TIMEOUT_EMBEDDING_MODEL_LOAD_SECONDS=600

Pre-download the model:

python -c "
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
model.save('models/embedding/paraphrase-multilingual-MiniLM-L12-v2')
"

Use a mirror:

export HF_ENDPOINT=https://hf-mirror.com

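Problem 2: Model not found in offline mode

Symptom: OSError: Can't load model from local files

Cause: The model is not in the local cache.

Solution: Temporarily disable offline mode, run the application once so it downloads the model, then re-enable offline mode:

export EMBEDDING_OFFLINE_MODE=false
# Run the application once to download the model
export EMBEDDING_OFFLINE_MODE=true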

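Problem 3: Dimension mismatch

Symptom: A warning that embedding dimensions are inconsistent.

Cause: EMBEDDING_DIM does not match the model's output dimension.

Solution: Set EMBEDDING_DIM to the output dimension of the model in use: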

# all-MiniLM-L6-v2
export EMBEDDING_DIM=384

# all-mpnet-base-v2
export EMBEDDING_DIM=768
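
To catch a mismatch early, the length of one embedding can be compared with the configured value. The following is a minimal sketch, not the module's own check: check_embedding_dim is a hypothetical helper, and the sample vector stands in for real model output. With sentence-transformers, the model's dimension can also be read from SentenceTransformer.get_sentence_embedding_dimension().

```python
def check_embedding_dim(vector, expected_dim):
    """Raise ValueError if the vector length differs from the configured dimension."""
    actual = len(vector)
    if actual != expected_dim:
        raise ValueError(
            f"Embedding dimension mismatch: model produced {actual}, "
            f"EMBEDDING_DIM is {expected_dim}"
        )
    return actual


# Stand-in vector; in practice this would be one embedding produced by the model.
sample_vector = [0.0] * 384
print(check_embedding_dim(sample_vector, expected_dim=384))
```

Running a check like this once at startup turns a silent warning into an explicit failure.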

Problem 4: Empty text error

Symptom: ValueError: Cannot embed empty text

Cause: The input text is empty or contains only whitespace.

Solution:

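import logging

logger = logging.getLogger(__name__)

def embed_if_not_empty(text):  # hypothetical wrapper name, for illustration
    # Strip whitespace first so blank input is caught before embedding.
    text = text.strip()
    if not text:
        logger.warning("Skipping empty text")
        return None
    return embedding_service.embed_text(text)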

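Ingestion Service Issues

Problem 5: Document processing times out

Symptom: TimeoutError once processing exceeds the default 600 seconds.

Cause: The document is too large or a dependent service is slow.

Solution:

Raise the timeout: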

export TIMEOUT_INGESTION_DOCUMENT_SECONDS=1200

Split the document before uploading.
Check network connectivity to MinIO and Qdrant.
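
A quick way to check connectivity is to open a TCP connection to each service. This is a sketch using only the standard library; the hosts and ports below (localhost, 9000 for MinIO, 6333 for Qdrant) are the services' usual defaults and are assumptions here, so substitute the values from your deployment.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


# Assumed endpoints: replace with the hosts and ports from your deployment.
ENDPOINTS = {"MinIO": ("localhost", 9000), "Qdrant": ("localhost", 6333)}

for name, (host, port) in ENDPOINTS.items():
    status = "reachable" if can_connect(host, port, timeout=1.0) else "unreachable"
    print(f"{name} at {host}:{port} is {status}")
```

A successful connection only shows that the port is open; it does not prove the service is healthy or that credentials are valid.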

Problem 6: Job stuck in RUNNING

Cause: The service crashed partway through the job.

Solution: Clean up the stale job records:

from ai_service.services.ingestion import cleanup_stale_ingestion_jobs
cleaned = cleanup_stale_ingestion_jobs()
print(f"Cleaned up {cleaned} stale jobs")

Problem 7: Parsing fails

Symptom: Unsupported content type

Cause: The file format is not supported.

Solution:

from ai_service.services.parsers import is_supported_content_type

if not is_supported_content_type(content_type):
    print(f"Unsupported format: {content_type}")
    print("Supported: .txt, .md, .pdf, .docx")

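Problem 8: No chunks produced

Symptom: The document finishes indexing but the chunk count is 0.

Cause: The parser returned empty text.

Solution: Run the parser directly and check how much text it extracts: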

from ai_service.services.parsers import get_parser

parser = get_parser(content_type)
result = parser.parse(file_data, filename)
print(f"Extracted text length: {len(result.text)}")

Chunking Service Issues

Problem 9: NLTK data missing

Symptom: LookupError: Resource punkt not found, raised when using the sentence or semantic strategy

Cause: The NLTK punkt data has not been downloaded.

Solution:

import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # required by some NLTK versions

Alternatively, download it automatically at application startup:

import nltk

try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt', quiet=True)

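Problem 10: Chunking strategy not found

Symptom: ValueError: Unknown chunking strategy: xxx

Cause: The strategy name is misspelled or refers to a strategy that is not implemented.

Solution:

Check the strategy name and use one of the following: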

valid_strategies = ["fixed_size", "sentence", "paragraph", "semantic", "recursive"]

from ai_service.services.chunking import get_chunking_strategy

strategy = get_chunking_strategy("sentence")  # correct
# strategy = get_chunking_strategy("sentences")  # wrong: not a valid strategy name

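Problem 11: Invalid chunking parameters

Symptom: Strategy initialization fails or parameters are ignored.

Cause: The parameter names do not match the strategy.

Solution:

Strategy	Valid parameters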
fixed_size	`chunk_size`, `chunk_overlap`
sentence	`target_size`, `max_sentences_per_chunk`
paragraph	`target_size`, `max_paragraphs_per_chunk`
recursive	`chunk_size`, `chunk_overlap`, `separators`
semantic	`similarity_threshold`, `min_chunk_size`, `max_chunk_size`

# Correct example
from ai_service.services.chunking import get_chunking_strategy

strategy = get_chunking_strategy(
    "sentence",
    params={"target_size": 512, "max_sentences_per_chunk": 10}
)

Problem 12: Chunks are too small

Cause: The text contains many short lines or line breaks.

Solution:

# Fixed-size strategy
export CHUNK_SIZE=1500
export CHUNK_OVERLAP=300

# Or use the paragraph strategy
export CHUNK_DEFAULT_STRATEGY=paragraph
export CHUNK_PARAGRAPH_TARGET_SIZE=2048

Problem 13: Sentences are cut off

Cause: A sentence is longer than chunk_size.

Solution: Use the sentence-level chunking strategy:

from ai_service.services.chunking import get_chunking_strategy

strategy = get_chunking_strategy("sentence")
chunks = strategy.chunk_text(text, document_id, document_name, source_id)

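Problem 14: High memory usage with the recursive strategy

Symptom: Memory usage is high when processing large documents.

Cause: The recursive strategy's multi-level processing needs additional memory.

Solution:

Reduce the document size.
Use the fixed_size or sentence strategy.
Add system memory.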
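
For plain-text documents, reducing the document size can be done by splitting the text before upload. The following is a minimal sketch under stated assumptions: split_text is a hypothetical helper, the 200,000-character default is an arbitrary illustration, and binary formats such as PDF or DOCX would need a format-aware tool instead.

```python
def split_text(text, max_chars=200_000):
    """Split text into parts of at most max_chars, breaking on blank lines where possible."""
    parts, current, size = [], [], 0
    for para in text.split("\n\n"):
        # Hard-split any single paragraph that exceeds the limit on its own.
        pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)] or [""]
        for piece in pieces:
            # Account for the "\n\n" separator when joining onto existing content.
            added = len(piece) + (2 if current else 0)
            if current and size + added > max_chars:
                parts.append("\n\n".join(current))
                current, size = [], 0
                added = len(piece)
            current.append(piece)
            size += added
    if current:
        parts.append("\n\n".join(current))
    return parts


parts = split_text("a\n\nb\n\nc", max_chars=4)
print(parts)
```

Each part can then be uploaded as its own document, which keeps the per-document working set small.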

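Retrieval Service Issues

Problem 15: No retrieval results

Causes:

The agent has no knowledge sources mounted.
The similarity threshold is too high.
The query is semantically far from the indexed content.

Solution: Confirm which sources are mounted for the agent, then retry with a lower score threshold: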

from ai_service.storage.models import get_active_source_ids_for_agent
from ai_service.utils.database import SessionLocal

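db = SessionLocal()
try:
    sources = get_active_source_ids_for_agent(db, agent_id="agent-123")
    print(f"Mounted sources: {sources}")
finally:
    db.close()  # release the session even if the query fails

# Retry with a low threshold to see whether anything matches at all.
chunks = rag_service.retrieve_context(
    query=query,
    agent_id=agent_id,
    score_threshold=0.1
)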

Problem 16: Results are irrelevant

Solution: Raise score_threshold or lower top_k.
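
To see how the two settings interact, the sketch below applies a threshold and a top_k cut to a list of scored hits. It is illustrative only: select_hits is a hypothetical helper, the scores are made up, and it assumes a higher score means a closer match (as with cosine similarity).

```python
def select_hits(hits, score_threshold, top_k):
    """Keep hits scoring at least score_threshold, best first, at most top_k of them."""
    kept = [h for h in hits if h["score"] >= score_threshold]
    kept.sort(key=lambda h: h["score"], reverse=True)
    return kept[:top_k]


# Made-up scores for illustration only.
hits = [
    {"id": "a", "score": 0.82},
    {"id": "b", "score": 0.41},
    {"id": "c", "score": 0.67},
    {"id": "d", "score": 0.15},
]

print([h["id"] for h in select_hits(hits, score_threshold=0.1, top_k=4)])  # ['a', 'c', 'b', 'd']
print([h["id"] for h in select_hits(hits, score_threshold=0.5, top_k=4)])  # ['a', 'c']
print([h["id"] for h in select_hits(hits, score_threshold=0.1, top_k=2)])  # ['a', 'c']
```

Raising the threshold removes weak matches regardless of how many there are, while lowering top_k caps the count even when many hits score well.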

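Problem 17: Retrieval is slow

Cause: The collection is too large, the connection pool is too small, or the model has not been warmed up.

Solution: Warm up the model, and check Qdrant performance and the database connection pool.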
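
Warming up usually means running one throwaway embedding at startup so the first real query does not pay the model-loading cost. A minimal sketch follows; warm_up and fake_embed are hypothetical names, and in the application the callable would be something like embedding_service.embed_text.

```python
import time


def warm_up(embed_fn, sample_text="warm-up"):
    """Call embed_fn once so first-use costs are paid before real queries; return seconds taken."""
    start = time.perf_counter()
    embed_fn(sample_text)
    return time.perf_counter() - start


# Stand-in embedder so the sketch runs without a model installed.
def fake_embed(text):
    return [0.0] * 384


elapsed = warm_up(fake_embed)
print(f"Warm-up took {elapsed:.3f}s")
```

Logging the elapsed time at startup also gives a baseline for judging whether later slowness comes from the model or from Qdrant and the database.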

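Configuration Issues

Problem 18: Configuration changes do not take effect

Symptom: The configuration does not change after an environment variable is modified.

Cause: Configuration is cached at import time.

Solution:

# Make sure the environment variables are set before the application starts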
export CHUNK_DEFAULT_STRATEGY=recursive
export CHUNK_SIZE=1024

# Then start the application
uv run python -m ai_service
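
The effect of caching can be reproduced in a few lines. This sketch uses functools.lru_cache as a stand-in for whatever caching the settings loader actually uses, which is not documented here; get_chunk_size and the default of 512 are hypothetical.

```python
import os
from functools import lru_cache


@lru_cache(maxsize=1)
def get_chunk_size():
    """Read CHUNK_SIZE once and cache it (stand-in for a cached settings loader)."""
    return int(os.environ.get("CHUNK_SIZE", "512"))


os.environ["CHUNK_SIZE"] = "1024"
first = get_chunk_size()        # reads the environment and caches 1024

os.environ["CHUNK_SIZE"] = "2048"
second = get_chunk_size()       # still the cached value

get_chunk_size.cache_clear()    # simulates restarting the process
third = get_chunk_size()        # re-reads the environment

print(first, second, third)     # 1024 1024 2048
```

The second read ignores the new value, which is why the variables must be exported before the process starts.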

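Problem 19: Wrong configuration value type

Symptom: ValidationError: Input should be a valid integer

Cause: Environment variable values are always strings, so each one must be parseable as the type the setting expects.

Solution:

# Correct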
export CHUNK_SIZE=512
export CHUNK_OVERLAP=50

# Wrong (contains non-numeric characters)
export CHUNK_SIZE="512 chars"
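
The same failure can be reproduced outside the application to confirm which variable is at fault. This is a sketch with a hypothetical helper, read_int_env; it is not how the module validates its settings.

```python
import os


def read_int_env(name, default):
    """Return the environment variable as an int, or raise a descriptive ValueError."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        return int(raw.strip())
    except ValueError:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from None


os.environ["CHUNK_SIZE"] = "512"
print(read_int_env("CHUNK_SIZE", default=1000))  # 512

os.environ["CHUNK_SIZE"] = "512 chars"
try:
    read_int_env("CHUNK_SIZE", default=1000)
except ValueError as exc:
    print(exc)  # CHUNK_SIZE must be an integer, got '512 chars'
```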

Debugging Tips

Enable debug logging

import logging
logging.basicConfig(level=logging.DEBUG)

Check service status

from ai_service.services.embedding import get_embedding_service

result = get_embedding_service().check_readiness()
print(f"Ready: {result.is_ready}")
print(f"Load time: {result.load_time_seconds}s")

Monitor job progress

from ai_service.storage.models import get_ingestion_job_record
from ai_service.utils.database import SessionLocal

db = SessionLocal()
job = get_ingestion_job_record(db, job_id="job-123")
print(f"Status: {job.status}")
print(f"Progress: {job.documents_done}/{job.documents_total}")
print(f"Message: {job.status_message}")
print(f"Chunking Strategy: {job.chunking_strategy}")
db.close()

Verify a chunking strategy

from ai_service.services.chunking import get_chunking_strategy

# Test the strategy
text = "This is a test. This is another sentence. " * 10
strategy = get_chunking_strategy("sentence")
chunks = strategy.chunk_text(text, "doc-1", "test.txt", "source-1")

print(f"Strategy: {strategy.strategy_name}")
print(f"Chunks: {len(chunks)}")
for chunk in chunks:
    print(f"  - {len(chunk.content)} chars: {chunk.content[:50]}...")

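Getting Help

Check the logs for detailed error messages.
Review the architecture design.
See the documentation for the relevant service.
Check the configuration reference.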