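Document Extraction Agent
Overview
document_extraction is the third agent type supported by the platform. It extracts structured data from PDF shipping documents: a multimodal vision-language model analyzes images of the PDF pages and automatically extracts 100+ fields covering booking information, party contact details, ports, vessels, containers, and cargo.
Plugin Architecture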
class DocumentExtractionPlugin:
    agent_type = "document_extraction"
    display_name = "Document Extraction Agent"
    description = "Extract structured shipping booking data from PDF documents"
The plugin implements the AgentTypePlugin protocol and is registered through AgentTypeRegistry:
# ai_service/agents/document_extraction/__init__.py
AgentTypeRegistry.register("document_extraction", DocumentExtractionPlugin())
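The registration pattern above can be illustrated with a minimal, self-contained sketch. The protocol members and the registry internals shown here are assumptions inferred from the attribute and method names on this page, not the platform's actual implementation; DemoPlugin is a hypothetical stand-in for DocumentExtractionPlugin.

```python
from typing import Any, Dict, Protocol, runtime_checkable


@runtime_checkable
class AgentTypePlugin(Protocol):
    """Assumed shape of the plugin protocol (inferred, not taken from the source)."""

    agent_type: str
    display_name: str
    description: str

    def build_graph(self, **kwargs: Any) -> Any: ...

    def get_state_schema(self) -> type: ...


class AgentTypeRegistry:
    """Hypothetical in-memory registry keyed by agent type name."""

    _plugins: Dict[str, AgentTypePlugin] = {}

    @classmethod
    def register(cls, agent_type: str, plugin: AgentTypePlugin) -> None:
        # Reject objects that lack the protocol's attributes or methods.
        if not isinstance(plugin, AgentTypePlugin):
            raise TypeError(f"{plugin!r} does not satisfy AgentTypePlugin")
        # Refuse to silently overwrite an existing registration.
        if agent_type in cls._plugins:
            raise ValueError(f"agent type {agent_type!r} is already registered")
        cls._plugins[agent_type] = plugin

    @classmethod
    def get(cls, agent_type: str) -> AgentTypePlugin:
        return cls._plugins[agent_type]


class DemoPlugin:
    """Hypothetical stand-in for DocumentExtractionPlugin."""

    agent_type = "document_extraction"
    display_name = "Document Extraction Agent"
    description = "Extract structured shipping booking data from PDF documents"

    def build_graph(self, **kwargs: Any) -> Any:
        return None  # the real plugin would return a compiled graph

    def get_state_schema(self) -> type:
        return dict  # the real plugin would return DocumentExtractionState


AgentTypeRegistry.register("document_extraction", DemoPlugin())
plugin = AgentTypeRegistry.get("document_extraction")
```

Because the protocol is structural, a plugin class does not need to inherit from AgentTypePlugin. It only has to expose the expected attributes and methods.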
Module Structure
ai_service/agents/document_extraction/
├── __init__.py # Plugin auto-registration
├── graph.py # LangGraph workflow construction
├── nodes.py # The 5 processing nodes
├── state.py # DocumentExtractionState TypedDict
├── schemas.py # ShippingBookingData Pydantic model
└── pdf_utils.py # PDF-to-image conversion and prompt construction
LangGraph Workflow
stateDiagram-v2
[*] --> pdf_ingest
pdf_ingest --> pdf_processor
pdf_processor --> data_extraction : success
pdf_processor --> pdf_reporter : failed
data_extraction --> validation : success
data_extraction --> pdf_processor : retry
data_extraction --> pdf_reporter : max retries
validation --> pdf_reporter : success / max retries
validation --> pdf_processor : retry
pdf_reporter --> [*]
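The conditional transitions in the diagram can be sketched as plain routing functions. This is an illustration only: the state fields ok and retry_count and the function names are hypothetical, and the real nodes in nodes.py may track success and retries differently.

```python
from typing import TypedDict

MAX_RETRIES = 3  # mirrors max_retries in the configuration section


class SketchState(TypedDict):
    ok: bool          # hypothetical: did the node that just ran succeed?
    retry_count: int  # hypothetical: retries consumed so far


def route_after_pdf_processor(state: SketchState) -> str:
    # success -> data_extraction, failed -> pdf_reporter
    return "data_extraction" if state["ok"] else "pdf_reporter"


def route_after_data_extraction(state: SketchState) -> str:
    # success -> validation, retry -> pdf_processor, max retries -> pdf_reporter
    if state["ok"]:
        return "validation"
    if state["retry_count"] < MAX_RETRIES:
        return "pdf_processor"
    return "pdf_reporter"


def route_after_validation(state: SketchState) -> str:
    # success or max retries -> pdf_reporter, otherwise retry from pdf_processor
    if state["ok"] or state["retry_count"] >= MAX_RETRIES:
        return "pdf_reporter"
    return "pdf_processor"
```

In LangGraph, functions of this kind are typically passed to StateGraph.add_conditional_edges, which uses the returned node name to choose the next step.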
Database Model
DocumentExtractionJob
| Column | Type | Description |
|---|---|---|
| id | VARCHAR(64) PK | Job UUID |
| agent_id | VARCHAR(64) FK | Associated agent |
| source_object_key | VARCHAR(1024) | MinIO path of the source file |
| output_object_key | VARCHAR(1024) | MinIO path of the result |
| status | VARCHAR(32) | pending/running/completed/failed |
| error_message | TEXT | Error message |
| created_at | TIMESTAMP | Creation time |
| completed_at | TIMESTAMP | Completion time |
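To make the job lifecycle concrete, the following sketch recreates the table in an in-memory SQLite database and moves one job from pending to completed. The table name, the NOT NULL and CHECK constraints, the default status, and the helpers create_job and finish_job are all assumptions for illustration. The foreign key on agent_id is omitted because the referenced table is not described here.

```python
import sqlite3
import uuid
from datetime import datetime, timezone

# Column names and types follow the table above; constraints are assumed.
DDL = """
CREATE TABLE document_extraction_job (
    id                VARCHAR(64) PRIMARY KEY,
    agent_id          VARCHAR(64) NOT NULL,
    source_object_key VARCHAR(1024),
    output_object_key VARCHAR(1024),
    status            VARCHAR(32) NOT NULL DEFAULT 'pending'
                      CHECK (status IN ('pending', 'running', 'completed', 'failed')),
    error_message     TEXT,
    created_at        TIMESTAMP NOT NULL,
    completed_at      TIMESTAMP
)
"""


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


def create_job(conn: sqlite3.Connection, agent_id: str, source_key: str) -> str:
    """Insert a new job in the default 'pending' state and return its UUID."""
    job_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO document_extraction_job "
        "(id, agent_id, source_object_key, created_at) VALUES (?, ?, ?, ?)",
        (job_id, agent_id, source_key, _now()),
    )
    return job_id


def finish_job(conn, job_id, output_key=None, error=None) -> None:
    """Mark a job completed, or failed when an error message is given."""
    status = "failed" if error is not None else "completed"
    conn.execute(
        "UPDATE document_extraction_job "
        "SET status = ?, output_object_key = ?, error_message = ?, completed_at = ? "
        "WHERE id = ?",
        (status, output_key, error, _now(), job_id),
    )


conn = sqlite3.connect(":memory:")
conn.execute(DDL)
job_id = create_job(conn, "agent-123", "uploads/booking.pdf")
finish_job(conn, job_id, output_key="results/booking.json")
```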
Configuration Reference
[agent_types.document_extraction]
default_model = "qwen3-vl-plus-2025-12-19" # vision-language model
default_provider = "dashscope"
max_retries = 3 # global retry limit
max_pages = 20 # maximum number of pages to process
dpi = 300 # PDF rendering resolution
temp_dir = "/tmp/doc_extraction_workdir"
API Reference
Document extraction agent type plugin.
Builds a multi-node LangGraph that processes PDF documents through ingestion, image conversion, LLM extraction, validation, and reporting.
Attributes:
| Name | Type | Description |
|---|---|---|
| agent_type | str | |
| display_name | str | Human-readable name. |
| description | str | Brief description. |
build_graph
build_graph(**kwargs)
Build and compile the document extraction LangGraph.
The graph implements a multi-node pipeline with retry logic: pdf_ingest -> pdf_processor -> data_extraction -> validation -> pdf_reporter
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| **kwargs | Any | Reserved for future configuration. | {} |
Returns:
| Name | Type | Description |
|---|---|---|
| CompiledStateGraph | CompiledStateGraph | Compiled document extraction processing graph. |
get_state_schema
get_state_schema()
Return the DocumentExtractionState schema.
Returns:
| Name | Type | Description |
|---|---|---|
| type | type | :class: |
Bases: BaseModel
Complete shipping booking data model.
Comprehensive model for structured shipping document extraction, containing 100+ fields covering booking metadata, parties, ports, vessel info, containers, cargo, compliance, and logistics details.
to_dict
to_dict()
Return a JSON-serializable dict representation.
Returns:
| Type | Description |
|---|---|
| Dict[str, Any] | Model data as a dictionary with aliases. |
prompt_schema
prompt_schema()
Return a JSON schema block for prompt injection with extraction instructions.
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | Formatted schema string with field type instructions. |
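The idea behind to_dict and prompt_schema can be sketched with a tiny stand-in model. To stay dependency-free, this example uses standard-library dataclasses rather than Pydantic. The three fields, their camelCase aliases, the instruction wording, and the choice of a classmethod are all hypothetical and do not reflect the real ShippingBookingData definition.

```python
import json
from dataclasses import dataclass, field, fields
from typing import Any, Dict, Optional


def _meta(alias: str, hint: str) -> Dict[str, str]:
    """Attach an output alias and a type hint for the prompt to a field."""
    return {"alias": alias, "hint": hint}


@dataclass
class BookingSketch:
    """Tiny stand-in for ShippingBookingData; every field here is hypothetical."""

    booking_number: Optional[str] = field(
        default=None, metadata=_meta("bookingNumber", "string"))
    port_of_loading: Optional[str] = field(
        default=None, metadata=_meta("portOfLoading", "string"))
    container_count: Optional[int] = field(
        default=None, metadata=_meta("containerCount", "integer"))

    def to_dict(self) -> Dict[str, Any]:
        # JSON-serializable dict keyed by alias rather than attribute name.
        return {f.metadata["alias"]: getattr(self, f.name) for f in fields(self)}

    @classmethod
    def prompt_schema(cls) -> str:
        # One placeholder per field tells the model the expected value type.
        skeleton = {
            f.metadata["alias"]: f"<{f.metadata['hint']} or null>"
            for f in fields(cls)
        }
        return (
            "Return one JSON object with exactly these keys. "
            "Use null for any value not present in the document.\n"
            + json.dumps(skeleton, indent=2)
        )


record = BookingSketch(booking_number="BK-001", container_count=2)
as_dict = record.to_dict()
schema_block = BookingSketch.prompt_schema()
```

Generating the prompt block from the same field definitions that validate the model output keeps the instructions and the parser in sync as fields are added.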