
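Document Extraction Agent

Overview

document_extraction is the third agent type supported by the platform. It focuses on extracting structured data from PDF shipping documents: a multimodal vision-language model analyzes images of the PDF pages and automatically extracts 100+ fields covering booking information, party contact details, ports, vessels, containers, and cargo.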

Plugin Architecture

class DocumentExtractionPlugin:
    agent_type = "document_extraction"
    display_name = "Document Extraction Agent"
    description = "Extract structured shipping booking data from PDF documents"

The class implements the AgentTypePlugin protocol and is registered through AgentTypeRegistry:

# ai_service/agents/document_extraction/__init__.py
AgentTypeRegistry.register("document_extraction", DocumentExtractionPlugin())
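The registration pattern above can be sketched as follows. This is a minimal illustration, not the platform's implementation: the bodies of `AgentTypePlugin` and `AgentTypeRegistry` are assumptions inferred from the names and methods that appear on this page, and the `get` lookup and the placeholder return values are hypothetical.

```python
from typing import Any, Dict, Protocol, runtime_checkable


@runtime_checkable
class AgentTypePlugin(Protocol):
    """Assumed shape of the plugin protocol (inferred, not the real source)."""

    agent_type: str
    display_name: str
    description: str

    def build_graph(self, **kwargs: Any) -> Any: ...

    def get_state_schema(self) -> type: ...


class AgentTypeRegistry:
    """Hypothetical in-memory registry keyed by agent type."""

    _plugins: Dict[str, AgentTypePlugin] = {}

    @classmethod
    def register(cls, agent_type: str, plugin: AgentTypePlugin) -> None:
        # Reject a key that disagrees with the plugin's own declared type.
        if agent_type != plugin.agent_type:
            raise ValueError(f"key {agent_type!r} != plugin type {plugin.agent_type!r}")
        cls._plugins[agent_type] = plugin

    @classmethod
    def get(cls, agent_type: str) -> AgentTypePlugin:
        # Hypothetical lookup helper; raises KeyError for unknown types.
        return cls._plugins[agent_type]


class DocumentExtractionPlugin:
    agent_type = "document_extraction"
    display_name = "Document Extraction Agent"
    description = "Extract structured shipping booking data from PDF documents"

    def build_graph(self, **kwargs: Any) -> Any:
        return None  # placeholder: the real method returns a compiled LangGraph

    def get_state_schema(self) -> type:
        return dict  # placeholder: the real method returns DocumentExtractionState


AgentTypeRegistry.register("document_extraction", DocumentExtractionPlugin())
```

Because the protocol is structural, `DocumentExtractionPlugin` does not need to inherit from `AgentTypePlugin`; having the three attributes and two methods is enough for it to satisfy the check.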

Module Structure

ai_service/agents/document_extraction/
├── __init__.py      # Plugin auto-registration
├── graph.py         # LangGraph workflow construction
├── nodes.py         # 5 processing nodes
├── state.py         # DocumentExtractionState TypedDict
├── schemas.py       # ShippingBookingData Pydantic model
└── pdf_utils.py     # PDF-to-image conversion and prompt construction

LangGraph Workflow

stateDiagram-v2
    [*] --> pdf_ingest
    pdf_ingest --> pdf_processor
    pdf_processor --> data_extraction : success
    pdf_processor --> pdf_reporter : failed
    data_extraction --> validation : success
    data_extraction --> pdf_processor : retry
    data_extraction --> pdf_reporter : max retries
    validation --> pdf_reporter : success / max retries
    validation --> pdf_processor : retry
    pdf_reporter --> [*]
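The conditional edges in the diagram can be expressed as plain routing functions. The sketch below is an assumption-laden illustration: the state keys `step_ok` and `retry_count`, the `trace` helper, and the rule that a failure in data_extraction or validation increments the retry counter are all hypothetical. In LangGraph, functions of this kind are typically passed to `StateGraph.add_conditional_edges`.

```python
from typing import Callable, Dict, List, TypedDict

MAX_RETRIES = 3  # mirrors max_retries in the configuration


class RoutingState(TypedDict):
    """Hypothetical subset of DocumentExtractionState used for routing."""

    step_ok: bool
    retry_count: int


def route_after_pdf_processor(state: RoutingState) -> str:
    return "data_extraction" if state["step_ok"] else "pdf_reporter"


def route_after_data_extraction(state: RoutingState) -> str:
    if state["step_ok"]:
        return "validation"
    if state["retry_count"] >= MAX_RETRIES:
        return "pdf_reporter"
    return "pdf_processor"


def route_after_validation(state: RoutingState) -> str:
    # Success and exhausted retries both end at the reporter.
    if state["step_ok"] or state["retry_count"] >= MAX_RETRIES:
        return "pdf_reporter"
    return "pdf_processor"


ROUTERS: Dict[str, Callable[[RoutingState], str]] = {
    "pdf_processor": route_after_pdf_processor,
    "data_extraction": route_after_data_extraction,
    "validation": route_after_validation,
}


def trace(outcomes: List[bool]) -> List[str]:
    """Walk the graph, consuming one success/failure flag per routed node."""
    flags = iter(outcomes)
    path = ["pdf_ingest", "pdf_processor"]
    retry_count = 0
    while path[-1] != "pdf_reporter":
        ok = next(flags)
        # Assumption: only extraction and validation failures count as retries.
        if not ok and path[-1] != "pdf_processor":
            retry_count += 1
        state: RoutingState = {"step_ok": ok, "retry_count": retry_count}
        path.append(ROUTERS[path[-1]](state))
    return path
```

For example, `trace([True, False, True, True, True])` models one failed extraction followed by a successful retry, so the path visits pdf_processor and data_extraction twice before reaching validation and pdf_reporter.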

Database Model

DocumentExtractionJob

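| Column | Type | Description |
| --- | --- | --- |
| id | VARCHAR(64) PK | Job UUID |
| agent_id | VARCHAR(64) FK | Associated agent |
| source_object_key | VARCHAR(1024) | MinIO source file path |
| output_object_key | VARCHAR(1024) | MinIO result path |
| status | VARCHAR(32) | pending/running/completed/failed |
| error_message | TEXT | Error message |
| created_at | TIMESTAMP | Creation time |
| completed_at | TIMESTAMP | Completion time |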
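To make the job lifecycle concrete, the sketch below mirrors the columns above in an in-memory SQLite table. It is illustrative only: the table name `document_extraction_job`, the NOT NULL and CHECK constraints, and the `create_job` and `finish_job` helpers are assumptions, and the foreign key on agent_id is omitted because the agents table is not described here.

```python
import sqlite3
import uuid
from datetime import datetime, timezone
from typing import Optional

DDL = """
CREATE TABLE document_extraction_job (
    id                VARCHAR(64)   PRIMARY KEY,
    agent_id          VARCHAR(64)   NOT NULL,
    source_object_key VARCHAR(1024) NOT NULL,
    output_object_key VARCHAR(1024),
    status            VARCHAR(32)   NOT NULL DEFAULT 'pending'
        CHECK (status IN ('pending', 'running', 'completed', 'failed')),
    error_message     TEXT,
    created_at        TIMESTAMP     NOT NULL,
    completed_at      TIMESTAMP
)
"""


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


def create_job(conn: sqlite3.Connection, agent_id: str, source_object_key: str) -> str:
    """Insert a new job in the default 'pending' state and return its UUID."""
    job_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO document_extraction_job "
        "(id, agent_id, source_object_key, created_at) VALUES (?, ?, ?, ?)",
        (job_id, agent_id, source_object_key, _now()),
    )
    return job_id


def finish_job(
    conn: sqlite3.Connection,
    job_id: str,
    output_object_key: Optional[str] = None,
    error_message: Optional[str] = None,
) -> None:
    """Mark a job 'failed' if an error is given, otherwise 'completed'."""
    status = "failed" if error_message else "completed"
    conn.execute(
        "UPDATE document_extraction_job "
        "SET status = ?, output_object_key = ?, error_message = ?, completed_at = ? "
        "WHERE id = ?",
        (status, output_object_key, error_message, _now(), job_id),
    )
```

A job created with `create_job` starts as pending with no completed_at value; `finish_job` sets the terminal status and the completion timestamp in one update.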

Configuration Reference

[agent_types.document_extraction]
default_model = "qwen3-vl-plus-2025-12-19"  # Vision-language model
default_provider = "dashscope"
max_retries = 3        # Global retry count
max_pages = 20         # Maximum number of pages to process
dpi = 300              # PDF rendering resolution
temp_dir = "/tmp/doc_extraction_workdir"

API Reference

DocumentExtractionPlugin

Document extraction agent type plugin.

Builds a multi-node LangGraph that processes PDF documents through ingestion, image conversion, LLM extraction, validation, and reporting.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| agent_type | str | "document_extraction". |
| display_name | str | Human-readable name. |
| description | str | Brief description. |

build_graph

build_graph(**kwargs)

Build and compile the document extraction LangGraph.

The graph implements a multi-node pipeline with retry logic: pdf_ingest -> pdf_processor -> data_extraction -> validation -> pdf_reporter

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| **kwargs | Any | Reserved for future configuration. | {} |

Returns:

| Type | Description |
| --- | --- |
| CompiledStateGraph | Compiled document extraction processing graph. |

get_state_schema

get_state_schema()

Return the DocumentExtractionState schema.

Returns:

| Type | Description |
| --- | --- |
| type | DocumentExtractionState. |

ShippingBookingData

Bases: BaseModel

Complete shipping booking data model.

Comprehensive model for structured shipping document extraction, containing 100+ fields covering booking metadata, parties, ports, vessel info, containers, cargo, compliance, and logistics details.

to_dict

to_dict()

Return a JSON-serializable dict representation.

Returns:

| Type | Description |
| --- | --- |
| Dict[str, Any] | Model data as a dictionary keyed by field aliases. |

prompt_schema

prompt_schema()

Return a JSON schema block for prompt injection with extraction instructions.

Returns:

| Type | Description |
| --- | --- |
| str | Formatted schema string with field type instructions. |
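The idea behind prompt_schema can be illustrated without the real model. The sketch below is a stand-in that uses a plain dictionary instead of the Pydantic class: the three field names, the `TYPE_HINTS` wording, and the output layout are all hypothetical, and the actual method may format its schema block differently.

```python
import json
from typing import Dict

# Hypothetical three-field subset; the real model has 100+ fields.
FIELD_SCHEMA: Dict[str, Dict[str, str]] = {
    "booking_number": {"type": "string", "description": "Carrier booking reference"},
    "port_of_loading": {"type": "string", "description": "Port where cargo is loaded"},
    "container_count": {"type": "integer", "description": "Number of containers"},
}

# Assumed per-type extraction instructions appended to each field.
TYPE_HINTS: Dict[str, str] = {
    "string": "use null if the value is not present in the document",
    "integer": "digits only, no units or thousands separators",
}


def prompt_schema(fields: Dict[str, Dict[str, str]]) -> str:
    """Render the fields as a JSON block preceded by one instruction line."""
    annotated = {
        name: f'{spec["type"]} ({TYPE_HINTS[spec["type"]]}): {spec["description"]}'
        for name, spec in fields.items()
    }
    return (
        "Return a single JSON object with exactly these keys:\n"
        + json.dumps(annotated, indent=2, ensure_ascii=False)
    )
```

Calling `prompt_schema(FIELD_SCHEMA)` yields a string that can be appended to the extraction prompt, so the model sees every expected key together with its type instruction.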