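Document Extraction Agent

Overview

document_extraction is the third agent type supported by the platform. It extracts structured data from PDF shipping documents: a multimodal vision-language model analyzes the rendered page images and extracts 100+ fields covering booking information, party contact details, ports, vessels, containers, and cargo.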

Plugin Architecture

class DocumentExtractionPlugin:
    agent_type = "document_extraction"
    display_name = "Document Extraction Agent"
    description = "Extract structured shipping booking data from PDF documents"

DocumentExtractionPlugin implements the AgentTypePlugin protocol and is registered through AgentTypeRegistry:

# ai_service/agents/document_extraction/__init__.py
AgentTypeRegistry.register("document_extraction", DocumentExtractionPlugin())
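The registration pattern above can be illustrated with a minimal, self-contained sketch. The bodies of `AgentTypePlugin` and `AgentTypeRegistry` here are assumptions made for illustration (the platform's real protocol and registry may declare more members or behave differently on duplicate registration), and `DemoPlugin` is a hypothetical stand-in for the real plugin class.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class AgentTypePlugin(Protocol):
    """Assumed shape of the protocol, based on the members documented on this page."""

    agent_type: str
    display_name: str
    description: str

    def build_graph(self, **kwargs: Any) -> Any: ...

    def get_state_schema(self) -> type: ...


class AgentTypeRegistry:
    """Hypothetical in-memory registry keyed by the agent type string."""

    _plugins: dict[str, AgentTypePlugin] = {}

    @classmethod
    def register(cls, agent_type: str, plugin: AgentTypePlugin) -> None:
        # Rejecting duplicates is an assumption; the real registry may overwrite.
        if agent_type in cls._plugins:
            raise ValueError(f"agent type already registered: {agent_type}")
        cls._plugins[agent_type] = plugin

    @classmethod
    def get(cls, agent_type: str) -> AgentTypePlugin:
        return cls._plugins[agent_type]


class DemoPlugin:
    """Stand-in plugin; satisfies the protocol structurally, without inheritance."""

    agent_type = "document_extraction"
    display_name = "Document Extraction Agent"
    description = "Extract structured shipping booking data from PDF documents"

    def build_graph(self, **kwargs: Any) -> Any:
        return None  # placeholder: the real plugin returns a compiled graph

    def get_state_schema(self) -> type:
        return dict  # placeholder: the real plugin returns DocumentExtractionState


AgentTypeRegistry.register("document_extraction", DemoPlugin())
plugin = AgentTypeRegistry.get("document_extraction")
```

Because the protocol is structural, a plugin only needs the right attributes and methods; no base class is required.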

Module Structure

ai_service/agents/document_extraction/
├── __init__.py      # Automatic plugin registration
├── graph.py         # LangGraph workflow construction
├── nodes.py         # The 5 processing nodes
├── state.py         # DocumentExtractionState TypedDict
├── schemas.py       # ShippingBookingData Pydantic model
└── pdf_utils.py     # PDF-to-image conversion and prompt construction

LangGraph Workflow

stateDiagram-v2
    [*] --> pdf_ingest
    pdf_ingest --> pdf_processor
    pdf_processor --> data_extraction : success
    pdf_processor --> pdf_reporter : failed
    data_extraction --> validation : success
    data_extraction --> pdf_processor : retry
    data_extraction --> pdf_reporter : max retries
    validation --> pdf_reporter : success / max retries
    validation --> pdf_processor : retry
    pdf_reporter --> [*]
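The conditional transitions in the diagram can be expressed as small routing functions. The following is a sketch only: the state fields `last_step_ok` and `retry_count` and the function names are hypothetical, and the actual logic in nodes.py and graph.py may track success and retries differently. In LangGraph, functions of this kind are typically passed to `StateGraph.add_conditional_edges`.

```python
from typing import TypedDict

MAX_RETRIES = 3  # mirrors max_retries from the configuration


class SketchState(TypedDict):
    """Hypothetical subset of the workflow state; field names are assumptions."""

    last_step_ok: bool
    retry_count: int


def route_after_pdf_processor(state: SketchState) -> str:
    # success -> data_extraction, failed -> pdf_reporter
    return "data_extraction" if state["last_step_ok"] else "pdf_reporter"


def route_after_data_extraction(state: SketchState) -> str:
    # success -> validation, max retries -> pdf_reporter, otherwise retry
    if state["last_step_ok"]:
        return "validation"
    if state["retry_count"] >= MAX_RETRIES:
        return "pdf_reporter"
    return "pdf_processor"


def route_after_validation(state: SketchState) -> str:
    # success or max retries -> pdf_reporter, otherwise retry from pdf_processor
    if state["last_step_ok"] or state["retry_count"] >= MAX_RETRIES:
        return "pdf_reporter"
    return "pdf_processor"
```

Each function returns the name of the next node, so every path ends at pdf_reporter once extraction succeeds or the retry budget is exhausted.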

Database Model

DocumentExtractionJob

Column	Type	Description
id	VARCHAR(64) PK	Job UUID
agent_id	VARCHAR(64) FK	Associated agent
source_object_key	VARCHAR(1024)	Source file path in MinIO
output_object_key	VARCHAR(1024)	Result file path in MinIO
status	VARCHAR(32)	pending/running/completed/failed
error_message	TEXT	Error message
created_at	TIMESTAMP	Creation time
completed_at	TIMESTAMP	Completion time
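To make the job lifecycle concrete, here is a sketch that recreates the table in an in-memory SQLite database and moves one job from pending to completed. The table name, the NOT NULL and CHECK constraints, the `finish_job` helper, and the sample values (`agent-1`, `uploads/booking.pdf`, `results/booking.json`) are all assumptions for illustration; the foreign key is omitted because the agent table is not described here.

```python
import sqlite3
import uuid
from datetime import datetime, timezone

# Columns follow the table above; table name and constraints are assumptions.
DDL = """
CREATE TABLE document_extraction_job (
    id                VARCHAR(64) PRIMARY KEY,
    agent_id          VARCHAR(64) NOT NULL,
    source_object_key VARCHAR(1024),
    output_object_key VARCHAR(1024),
    status            VARCHAR(32) NOT NULL DEFAULT 'pending'
        CHECK (status IN ('pending', 'running', 'completed', 'failed')),
    error_message     TEXT,
    created_at        TIMESTAMP NOT NULL,
    completed_at      TIMESTAMP
)
"""


def utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


def finish_job(conn, job_id, *, output_key=None, error=None):
    """Mark a job completed, or failed when an error message is given."""
    status = "failed" if error is not None else "completed"
    conn.execute(
        "UPDATE document_extraction_job "
        "SET status = ?, output_object_key = ?, error_message = ?, completed_at = ? "
        "WHERE id = ?",
        (status, output_key, error, utc_now(), job_id),
    )


conn = sqlite3.connect(":memory:")
conn.execute(DDL)

# A new job starts in the default 'pending' status.
job_id = str(uuid.uuid4())
conn.execute(
    "INSERT INTO document_extraction_job (id, agent_id, source_object_key, created_at) "
    "VALUES (?, ?, ?, ?)",
    (job_id, "agent-1", "uploads/booking.pdf", utc_now()),
)

finish_job(conn, job_id, output_key="results/booking.json")
row = conn.execute(
    "SELECT status, output_object_key FROM document_extraction_job WHERE id = ?",
    (job_id,),
).fetchone()
```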

Configuration Reference

[agent_types.document_extraction]
default_model = "qwen3-vl-plus-2025-12-19"  # Vision-language model
default_provider = "dashscope"
max_retries = 3        # Global retry count
max_pages = 20         # Maximum number of pages to process
dpi = 300              # PDF rendering resolution
temp_dir = "/tmp/doc_extraction_workdir"

API Reference

DocumentExtractionPlugin

Document extraction agent type plugin.

Builds a multi-node LangGraph that processes PDF documents through ingestion, image conversion, LLM extraction, validation, and reporting.

Attributes:

Name	Type	Description
`agent_type`	`str`	`"document_extraction"`.
`display_name`	`str`	Human-readable name.
`description`	`str`	Brief description.

build_graph

build_graph(**kwargs)

Build and compile the document extraction LangGraph.

The graph implements a multi-node pipeline with retry logic: pdf_ingest -> pdf_processor -> data_extraction -> validation -> pdf_reporter

Parameters:

Name	Type	Description	Default
`**kwargs`	`Any`	Reserved for future configuration.	`{}`

Returns:

Name	Type	Description
`CompiledStateGraph`	`CompiledStateGraph`	Compiled document extraction processing graph.

get_state_schema

get_state_schema()

Return the DocumentExtractionState schema.

Returns:

Name	Type	Description
`type`	`type`	The `DocumentExtractionState` class.

ShippingBookingData

Bases: BaseModel

Complete shipping booking data model.

Comprehensive model for structured shipping document extraction, containing 100+ fields covering booking metadata, parties, ports, vessel info, containers, cargo, compliance, and logistics details.

to_dict

to_dict()

Return a JSON-serializable dict representation.

Returns:

Type	Description
`Dict[str, Any]`	Model data as a dictionary keyed by field aliases.

prompt_schema

prompt_schema()

Return a JSON schema block for prompt injection with extraction instructions.

Returns:

Name	Type	Description
`str`	`str`	Formatted schema string with field type instructions.
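The two methods above can be illustrated with a small standard-library stand-in. This sketch deliberately uses a dataclass instead of Pydantic so that it runs without third-party packages; the three fields, their aliases, the type hints, and the instruction wording are all hypothetical and are not taken from ShippingBookingData. Whether the real `prompt_schema` is a class method is also an assumption.

```python
import json
from dataclasses import dataclass, field, fields
from typing import Any, Dict, Optional


def _f(alias: str, hint: str):
    """Declare an optional field carrying an output alias and a type hint for the prompt."""
    return field(default=None, metadata={"alias": alias, "hint": hint})


@dataclass
class BookingSketch:
    """Three hypothetical fields; the real model documents 100+."""

    booking_number: Optional[str] = _f("bookingNumber", "string")
    port_of_loading: Optional[str] = _f("portOfLoading", "string")
    container_count: Optional[int] = _f("containerCount", "integer")

    def to_dict(self) -> Dict[str, Any]:
        # Keys are the aliases, not the Python attribute names.
        return {f.metadata["alias"]: getattr(self, f.name) for f in fields(self)}

    @classmethod
    def prompt_schema(cls) -> str:
        # Alias -> expected type, followed into the prompt after the instructions.
        schema = {f.metadata["alias"]: f.metadata["hint"] for f in fields(cls)}
        return (
            "Return one JSON object with exactly these keys. "
            "Use null for any value not present in the document.\n"
            + json.dumps(schema, indent=2)
        )


record = BookingSketch(booking_number="BK123")
as_dict = record.to_dict()
schema_block = BookingSketch.prompt_schema()
```

Keeping the alias and the type hint next to each field means the serialized output and the prompt text are derived from one definition and cannot drift apart.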