Jean's Blog

A personal blog focused on software testing and development

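RAG Components, Data Ingestion: Parsing Table Data in PDFs

A PDF that contains tables carries structured data, so loading it directly with SimpleDirectoryReader and chunking it with SentenceSplitter can break the table content apart and lead to poor Q&A results.

  • Importance: parsing PDF table data is one of the most challenging parts of real-world requirements; the table structure has to stay intact for the Q&A system to give precise answers.
  • Difficulty: a PDF with tables mixes structured and unstructured data, and the table content has to be correctly linked to its context (such as a year heading) before questions can be answered accurately.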
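The second point can be made concrete with a small sketch. It is an illustration rather than code from this post: the helper name `table_to_chunk` and the sample rows are hypothetical, and it only shows the idea of keeping a heading (such as a year) attached to the table text that gets indexed.

```python
from typing import Dict, List


def table_to_chunk(title: str, header: List[str], rows: List[List[str]]) -> Dict[str, object]:
    """Serialize one table together with its heading so the context travels with it."""
    lines = [f"Table heading: {title}", " | ".join(header)]
    lines.extend(" | ".join(row) for row in rows)
    return {"text": "\n".join(lines), "metadata": {"title": title}}


# Hypothetical sample data, used only to exercise the helper.
chunk = table_to_chunk(
    "2023",
    ["No.", "Name", "Net worth (USD)"],
    [["1", "Bernard Arnault", "$211 billion"]],
)
print(chunk["text"])
```

With the heading stored both in the text and in the metadata, a retriever can match on the year even when the table itself never mentions it.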

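Extracting table data with Camelot

  • Features:
    • A long-established PDF table extraction tool that can save tables directly as CSV
    • Extracts the data of multiple tables completely (task 1)
    • Cannot automatically link a table to its surrounding context (task 2)
  • Installation:
    • Install Ghostscript first: sudo apt-get install ghostscript / brew install ghostscript
    • Then install the Python packages: pip install ghostscript and pip install camelot-py
    • Compatibility: it can be incompatible with some PDF reader packages, so a separate environment is recommended

Example code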

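import os
import time

import camelot

# from ctypes.util import find_library
# find_library("gs")  # uncomment to check that Ghostscript can be located

pdf_path = "../data/复杂PDF/billionaires_page-1-5.pdf"

start_time = time.time()
tables = camelot.read_pdf(pdf_path, pages="all")
end_time = time.time()
print(f"PDF table parsing time: {end_time - start_time:.2f} seconds")

# Make sure the output directory exists before writing CSV files
os.makedirs("output", exist_ok=True)

# Convert every table to a DataFrame
if tables:
    # Iterate over all tables
    for i, table in enumerate(tables, 1):
        # Convert the table to a DataFrame
        df = table.df

        # Print the current table
        print(f"\nTable {i} data:")
        print(df)

        # Show basic information; df.info() prints the summary itself and
        # returns None, which is why "None" appears in the output
        print(f"\nTable {i} basic info:")
        print(df.info())

        # Save to a CSV file
        csv_filename = f"output/billionaires_table_{i}.csv"
        df.to_csv(csv_filename, index=False)
        print(f"\nTable {i} data saved to {csv_filename}")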

Output

PDF table parsing time: 2.89 seconds

Table 1 data:
0 1
0 Icon Description
1 Has not changed from the previous ranking.
2 Has increased from the previous ranking.
3 Has decreased from the previous ranking.

Table 1 basic info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 0 4 non-null object
1 1 4 non-null object
dtypes: object(2)
memory usage: 196.0+ bytes
None

Table 1 data saved to output/billionaires_table_1.csv

Table 2 data:
0 1 2 3 4 \
...
memory usage: 660.0+ bytes
None

Table 6 data saved to output/billionaires_table_6.csv
...
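  • Basic usage:
    • Read the tables in a PDF with camelot.read_pdf()
    • Pass pages="all" to extract tables from every page
    • Each result is available as a pandas DataFrame through table.df
  • Performance:
    • Parsing the 5-page PDF took 2.89 seconds
    • Tables with a complete structure are extracted accurately
  • Output:
    • Table data can be saved as CSV files (the example writes them to the output directory)
    • Basic information about each table can be printed
  • Limitations:
    • Non-standard tables (for example, tables at the page edge) are not extracted
    • Environment conflicts with other Python packages are possible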
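One way to check how well each table was extracted is the parsing_report dictionary that Camelot attaches to every table (keys: accuracy, whitespace, order, page). The sketch below filters on it. The helper name `keep_confident_tables`, the threshold of 90, and the stand-in objects with made-up numbers are assumptions for illustration; they are not results from the PDF used in this post. Separately, Camelot's default lattice flavor relies on ruling lines, so flavor="stream" may be worth trying for tables that are not picked up.

```python
from types import SimpleNamespace
from typing import Iterable, List


def keep_confident_tables(tables: Iterable, min_accuracy: float = 90.0) -> List:
    """Return the tables whose parsing_report accuracy is at least min_accuracy."""
    return [t for t in tables if t.parsing_report.get("accuracy", 0.0) >= min_accuracy]


# Stand-in objects with made-up numbers so the sketch runs without a PDF.
# With Camelot installed, pass camelot.read_pdf(pdf_path, pages="all") instead.
fake_tables = [
    SimpleNamespace(parsing_report={"accuracy": 99.0, "whitespace": 12.0, "order": 1, "page": 2}),
    SimpleNamespace(parsing_report={"accuracy": 41.5, "whitespace": 60.0, "order": 1, "page": 3}),
]
kept = keep_confident_tables(fake_tables)
print([t.parsing_report["page"] for t in kept])
```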

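Extracting table data with PDFPlumber

  • Strengths:
    • Extraction efficiency and accuracy are slightly better than Camelot's
    • Processing is fast
  • Limitations: like Camelot, it focuses only on table content and does not handle context association
  • Best suited to: simple projects that only need the table content and do not care about the surrounding context

Extracting PDF tables

Example code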

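import time

import pandas as pd
import pdfplumber

# Record the start time
start_time = time.time()

# Open the PDF; the with-block closes the file even if extraction fails
with pdfplumber.open("../data/复杂PDF/billionaires_page-1-5.pdf") as pdf:
    # Iterate over every page
    for page in pdf.pages:
        # Extract the tables on this page
        tables = page.extract_tables()

        # Check whether any tables were found
        if tables:
            print(f"Found {len(tables)} table(s) on page {page.page_number}")

            # Iterate over all tables on the page
            for i, table in enumerate(tables):
                print(f"\nProcessing table {i + 1}:")

                # Convert the table to a DataFrame
                df = pd.DataFrame(table)

                # Treat the first row as the header (assumes the table has one)
                if len(df) > 0:
                    df.columns = df.iloc[0]
                    df = df.iloc[1:]  # drop the row that is now the header

                print(df)
                print("-" * 50)

# Record the end time and compute the total elapsed time
end_time = time.time()
print(f"\nTotal PDF table extraction time: {end_time - start_time:.2f} seconds")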

Output

Found 1 table(s) on page 1

Processing table 1:
0 List of the world's billionaires, ranked in order of net worth
1 The net worth of the world's billionaires incr...
2 Publication details
3 Publisher Whale Media Investments\nForbes family
4 Publication Forbes
5 First published March 1987[1]
6 Latest publication April 4, 2023
7 Current list details (2023)[2]
8 Wealthiest Bernard Arnault
9 Net worth (1st) US$211 billion
10 Number of 2,640 (from 2668)\nbillionaires
11 Total list net worth US$12.2 trillion (from US...
12 Number of women 337
13 New members to the 150\nlist
14 Forbes: The World's Billionaires website (http...
--------------------------------------------------
Found 1 table(s) on page 2

Processing table 1:
0 Icon Description
1 Has not changed from the previous ranking.
2 Has increased from the previous ranking.
...
10 Alphabet Inc.
--------------------------------------------------

Total PDF table extraction time: 0.38 seconds
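  • Performance:
    • Parsing the same 5-page PDF took only 0.38 seconds
    • That is roughly 7 to 8 times faster than Camelot on this file (2.89 s vs 0.38 s)
  • Extraction capability:
    • In this test it also extracted the tables Camelot missed, including the table at the page edge
    • Results can likewise be converted to a DataFrame
  • Environment compatibility:
    • Simple to install, with few conflicts with other packages
    • Can be combined directly with tools such as LlamaIndex
  • Q&A use:
    • The extracted table data can be used to build a Q&A system
    • Without context association, however, answers may be inaccurate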
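Before extracted rows are handed to an index, it can help to normalize them. The sketch below assumes the input has the shape pdfplumber returns from extract_table(), a list of rows whose cells are strings or None, and renders it as a Markdown table. The helper name `rows_to_markdown` is hypothetical, and the sample rows are only loosely modeled on the page-1 output above (a wrapped cell and an empty cell).

```python
from typing import List, Optional


def rows_to_markdown(rows: List[List[Optional[str]]]) -> str:
    """Render extracted rows as a Markdown table, treating the first row as the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(cell: Optional[str]) -> str:
        # Empty cells arrive as None; wrapped cells contain "\n".
        return " ".join((cell or "").split()).replace("|", "\\|")

    padded = [[clean(c) for c in row] + [""] * (width - len(row)) for row in rows]
    lines = [
        "| " + " | ".join(padded[0]) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in padded[1:]]
    return "\n".join(lines)


# Sample rows for illustration only.
sample = [
    ["Publication details", None],
    ["Publisher", "Whale Media Investments\nForbes family"],
]
markdown = rows_to_markdown(sample)
print(markdown)
```

Markdown keeps the row and column boundaries visible in plain text, which df.to_string() output tends to lose once it is chunked.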

Extracting PDF tables and running Q&A

from typing import List

import pandas as pd
import pdfplumber
from llama_index.core import Document, VectorStoreIndex

pdf_path = "../data/复杂PDF/billionaires_page-1-5.pdf"

# Open the PDF and parse the tables
# Note: extract_table() returns only the largest table on each page
with pdfplumber.open(pdf_path) as pdf:
    tables = []
    for page in pdf.pages:
        table = page.extract_table()
        if table:
            tables.append(table)

# Convert every table to a DataFrame and build documents
documents: List[Document] = []
if tables:
    # Iterate over all tables
    for i, table in enumerate(tables, 1):
        # Convert the table to a DataFrame
        df = pd.DataFrame(table)

        # Optionally save to a CSV file
        # csv_filename = f"billionaires_table_{i}.csv"
        # df.to_csv(csv_filename, index=False)
        # print(f"\nTable {i} data saved to {csv_filename}")

        # Convert the DataFrame to text
        text = df.to_string()

        # Create a Document object
        doc = Document(text=text, metadata={"source": f"table {i}"})
        documents.append(doc)

# Build the index (uses the LLM and embedding model configured in LlamaIndex Settings)
index = VectorStoreIndex.from_documents(documents)

# Create the query engine
query_engine = index.as_query_engine()

# Sample questions
questions = [
    "Who was the richest person in 2023?",
    "Who is the youngest billionaire?",
]

print("\n===== Q&A demo =====")
for question in questions:
    response = query_engine.query(question)
    print(f"\nQuestion: {question}")
    print(f"Answer: {response}")

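Extracting table titles with Unstructured

  • Core strengths:
    • Extracts not only table content but also table metadata and parent-node information
    • Can convert tables to HTML or text while preserving some structure
  • Optimization strategies:
    • Use cleaners to filter out headers and footers
    • Use coordinates to remove interfering regions
    • Use metadata to identify and exclude irrelevant information
  • Limitation: when the title and the table are not on the same page, association accuracy drops
  • Points to note:
    • The document may contain data for several years, with one table per year that has a similar structure
    • Tables may span pages, which makes title association harder
  • Key challenges:
    • The year a table belongs to must be identified correctly to answer questions accurately
    • Header and footer text can interfere with title detection
  • Solutions:
    • Use Unstructured's parent_id field to establish parent-child relationships between elements
    • Use coordinates to exclude interfering regions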
Table extraction

Example code:

import os
import sys
from pathlib import Path

from dotenv import load_dotenv
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from unstructured.partition.pdf import partition_pdf

# Load environment variables
load_dotenv()

# Make sure the working directory is the project root
# (the path is hard-coded for the author's machine; adjust it for yours)
print(os.getcwd())
script_dir = Path("/Users/jinglv/PycharmProjects/llm-rag-system")
if script_dir.exists():
    os.chdir(script_dir)
    print(f"Working directory set to: {os.getcwd()}")

# Global settings
Settings.llm = OpenAI(
    model=os.getenv("DEEPSEEK_MODEL_NAME"),  # model name supported by the DeepSeek API
    api_key=os.getenv("DEEPSEEK_API_KEY"),  # API key loaded from the environment
    base_url=os.getenv("DEEPSEEK_BASE_URL"),
)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Parse the PDF structure and extract text and tables
# The path is relative to the project root
file_path = "data/复杂PDF/billionaires_page-1-5.pdf"

# Check that the file exists
if not os.path.exists(file_path):
    print(f"Error: file does not exist - {file_path}")
    print(f"Current working directory: {os.getcwd()}")
    print("Please make sure that:")
    print("1. The script is run from the project root")
    print("2. The PDF file path is correct")
    sys.exit(1)

print(f"Processing file: {file_path}")

elements = partition_pdf(
    file_path,
    strategy="hi_res",  # use the high-resolution strategy
)  # parse the PDF document

# Map element IDs to elements
element_map = {element.id: element for element in elements if hasattr(element, "id")}

for element in elements:
    if element.category == "Table":  # only print table data
        print("\nTable data:")
        print("Table metadata:", vars(element.metadata))  # vars() shows every metadata attribute
        print("Table content:")
        print(element.text)  # print the table's text content

        # Look up and print the parent node
        parent_id = getattr(element.metadata, "parent_id", None)
        if parent_id and parent_id in element_map:
            parent_element = element_map[parent_id]
            print("\nParent node info:")
            print(f"Type: {parent_element.category}")
            print(f"Content: {parent_element.text}")
            if hasattr(parent_element, "metadata"):
                print(f"Parent metadata: {vars(parent_element.metadata)}")
        else:
            print(f"Parent node not found (ID: {parent_id})")
        print("-" * 50)

# Note: text elements are usually reported as NarrativeText, Title,
# UncategorizedText and so on, so a filter on "Text" may match nothing
text_elements = [el for el in elements if el.category == "Text"]
table_elements = [el for el in elements if el.category == "Table"]

Output

/Users/jinglv/PycharmProjects/llm-rag-system
Working directory set to: /Users/jinglv/PycharmProjects/llm-rag-system
Processing file: data/复杂PDF/billionaires_page-1-5.pdf
Warning: No languages specified, defaulting to English.

Table data:
Table metadata: {'detection_class_prob': 0.5629849433898926, 'coordinates': CoordinatesMetadata(points=((np.float64(839.0340576171875), np.float64(1001.4764404296875)), (np.float64(839.0340576171875), np.float64(1920.586669921875)), (np.float64(1587.2479248046875), np.float64(1920.586669921875)), (np.float64(1587.2479248046875), np.float64(1001.4764404296875))), system=<unstructured.documents.coordinates.PixelSpace object at 0x15b8e27b0>), 'links': [{'text': "Forbes : The World ' s Billionaires website ( https :// www . forb", 'url': 'https://www.forbes.com/billionaires/', 'start_index': 0}, {'text': "Forbes : The World ' s Billionaires website ( https :// www . forb", 'url': 'https://www.forbes.com/billionaires/', 'start_index': 0}], 'last_modified': '2025-07-16T19:30:42', '_known_field_names': frozenset({'filename', 'file_directory', 'attached_to_filename', 'link_texts', 'image_url', 'emphasized_text_tags', 'link_start_indexes', 'sent_to', 'signature', 'cc_recipient', 'key_value_pairs', 'image_mime_type', 'page_name', 'coordinates', 'detection_class_prob', 'subject', 'parent_id', 'url', 'detection_origin', 'header_footer_type', 'link_urls', 'bcc_recipient', 'table_as_cells', 'emphasized_text_contents', 'image_base64', 'is_continuation', 'category_depth', 'orig_elements', 'image_path', 'data_source', 'email_message_id', 'filetype', 'page_number', 'text_as_html', 'last_modified', 'links', 'sent_from', 'languages'}), 'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 1, 'file_directory': 'data/复杂PDF', 'filename': 'billionaires_page-1-5.pdf', 'parent_id': 'b379afb3fd875896f879cc4cfc441305'}
Table content:
Publisher Whale Media Investments Forbes family Publication First published Forbes March 1987[1] Latest publication April 4, 2023 Current list details (2023)[2] Wealthiest Bernard Arnault Net worth (1st) US$211 billion Number of billionaires 2,640 (from 2668) Total list net worth value US$12.2 trillion (from US$ 12.7 trillion) Number of women 337 New members to the 150 list Forbes: The World's Billionaires website (https://www.forb es.com/billionaires/)

Parent node info:
Type: Title
Content: Methodology
Parent metadata: {'detection_class_prob': 0.7953538298606873, 'coordinates': CoordinatesMetadata(points=((np.float64(98.62982255709053), np.float64(1510.925048828125)), (np.float64(98.62982255709053), np.float64(1569.5582275390625)), (np.float64(435.21123011467625), np.float64(1569.5582275390625)), (np.float64(435.21123011467625), np.float64(1510.925048828125))), system=<unstructured.documents.coordinates.PixelSpace object at 0x15b8e0620>), 'links': [], 'last_modified': '2025-07-16T19:30:42', '_known_field_names': frozenset({'filename', 'file_directory', 'attached_to_filename', 'link_texts', 'image_url', 'emphasized_text_tags', 'link_start_indexes', 'sent_to', 'signature', 'cc_recipient', 'key_value_pairs', 'image_mime_type', 'page_name', 'coordinates', 'detection_class_prob', 'subject', 'parent_id', 'url', 'detection_origin', 'header_footer_type', 'link_urls', 'bcc_recipient', 'table_as_cells', 'emphasized_text_contents', 'image_base64', 'is_continuation', 'category_depth', 'orig_elements', 'image_path', 'data_source', 'email_message_id', 'filetype', 'page_number', 'text_as_html', 'last_modified', 'links', 'sent_from', 'languages'}), 'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 1, 'file_directory': 'data/复杂PDF', 'filename': 'billionaires_page-1-5.pdf'}
--------------------------------------------------

Table data:
Table metadata: {'detection_class_prob': 0.8833286762237549, 'coordinates': CoordinatesMetadata(points=((np.float64(102.6220932006836), np.float64(929.1338500976562)), (np.float64(102.6220932006836), np.float64(1154.2158203125)), (np.float64(765.6319580078125), np.float64(1154.2158203125)), (np.float64(765.6319580078125), np.float64(929.1338500976562))), system=<unstructured.documents.coordinates.PixelSpace object at 0x15b8e1340>), 'links': [], 'last_modified': '2025-07-16T19:30:42', '_known_field_names': frozenset({'filename', 'file_directory', 'attached_to_filename', 'link_texts', 'image_url', 'emphasized_text_tags', 'link_start_indexes', 'sent_to', 'signature', 'cc_recipient', 'key_value_pairs', 'image_mime_type', 'page_name', 'coordinates', 'detection_class_prob', 'subject', 'parent_id', 'url', 'detection_origin', 'header_footer_type', 'link_urls', 'bcc_recipient', 'table_as_cells', 'emphasized_text_contents', 'image_base64', 'is_continuation', 'category_depth', 'orig_elements', 'image_path', 'data_source', 'email_message_id', 'filetype', 'page_number', 'text_as_html', 'last_modified', 'links', 'sent_from', 'languages'}), 'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 2, 'file_directory': 'data/复杂PDF', 'filename': 'billionaires_page-1-5.pdf', 'parent_id': 'feb0e76147077250a13cdb0f842659a3'}
Table content:
Icon Description Has not changed from the previous ranking. Has increased from the previous ranking. Has decreased from the previous ranking.

Parent node info:
Type: Title
Content: Legend
Parent metadata: {'detection_class_prob': 0.7951710224151611, 'coordinates': CoordinatesMetadata(points=((np.float64(96.45149230957031), np.float64(847.3286743164062)), (np.float64(96.45149230957031), np.float64(897.3561401367188)), (np.float64(236.98038987264118), np.float64(897.3561401367188)), (np.float64(236.98038987264118), np.float64(847.3286743164062))), system=<unstructured.documents.coordinates.PixelSpace object at 0x15b8e3590>), 'links': [], 'last_modified': '2025-07-16T19:30:42', '_known_field_names': frozenset({'filename', 'file_directory', 'attached_to_filename', 'link_texts', 'image_url', 'emphasized_text_tags', 'link_start_indexes', 'sent_to', 'signature', 'cc_recipient', 'key_value_pairs', 'image_mime_type', 'page_name', 'coordinates', 'detection_class_prob', 'subject', 'parent_id', 'url', 'detection_origin', 'header_footer_type', 'link_urls', 'bcc_recipient', 'table_as_cells', 'emphasized_text_contents', 'image_base64', 'is_continuation', 'category_depth', 'orig_elements', 'image_path', 'data_source', 'email_message_id', 'filetype', 'page_number', 'text_as_html', 'last_modified', 'links', 'sent_from', 'languages'}), 'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 2, 'file_directory': 'data/复杂PDF', 'filename': 'billionaires_page-1-5.pdf', 'parent_id': '759e1c4a2a3cf37f98db95d0877208f1'}
...
Type: Title
Content: 2019
Parent metadata: {'detection_class_prob': 0.7982921600341797, 'coordinates': CoordinatesMetadata(points=((np.float64(95.56612396240234), np.float64(971.339599609375)), (np.float64(95.56612396240234), np.float64(1019.1573486328125)), (np.float64(188.0115203857422), np.float64(1019.1573486328125)), (np.float64(188.0115203857422), np.float64(971.339599609375))), system=<unstructured.documents.coordinates.PixelSpace object at 0x15b8012b0>), 'links': [], 'last_modified': '2025-07-16T19:30:42', '_known_field_names': frozenset({'filename', 'file_directory', 'attached_to_filename', 'link_texts', 'image_url', 'emphasized_text_tags', 'link_start_indexes', 'sent_to', 'signature', 'cc_recipient', 'key_value_pairs', 'image_mime_type', 'page_name', 'coordinates', 'detection_class_prob', 'subject', 'parent_id', 'url', 'detection_origin', 'header_footer_type', 'link_urls', 'bcc_recipient', 'table_as_cells', 'emphasized_text_contents', 'image_base64', 'is_continuation', 'category_depth', 'orig_elements', 'image_path', 'data_source', 'email_message_id', 'filetype', 'page_number', 'text_as_html', 'last_modified', 'links', 'sent_from', 'languages'}), 'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 5, 'file_directory': 'data/复杂PDF', 'filename': 'billionaires_page-1-5.pdf', 'parent_id': 'a69407466af63af128c0512ca7f72190'}
--------------------------------------------------
...
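  • Table extraction: filter table elements with element.category == "Table", show the metadata with vars(element.metadata), and read the table content from element.text
  • Parent lookup: use parent_id to fetch a table's parent node, which indicates the context the table belongs to
  • Problem in practice: the 2019 table correctly finds its parent title "2019", but the 2018 table is wrongly linked to the page header "The World's Billionaires - Wikipedia" (headers and footers need to be removed)
  • Sliding-window strategy: after extracting a table element, look back at the 3 preceding nodes and try to step over headers and footers to reach the correct year title
  • Suggested improvements: use coordinates to remove headers and footers, merge consecutive pages, and link each title to its cross-page table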
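The sliding-window idea can be sketched without Unstructured itself. In the sketch below, the `Node` class stands in for parsed elements (only category and text are used), the helper name `find_year_title` is hypothetical, and the rule that a valid title is a bare four-digit year is an assumption that happens to fit this document's "2018" and "2019" headings.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

YEAR = re.compile(r"^(19|20)\d{2}$")


@dataclass
class Node:
    category: str
    text: str


def find_year_title(nodes: List[Node], table_index: int, window: int = 3) -> Optional[str]:
    """Scan up to `window` nodes before the table, nearest first, for a year-only title."""
    for j in range(table_index - 1, max(-1, table_index - window - 1), -1):
        node = nodes[j]
        if node.category == "Title" and YEAR.match(node.text.strip()):
            return node.text.strip()
    return None


# Stand-in nodes: a page header sits between the year title and the table.
nodes = [
    Node("Title", "2018"),
    Node("NarrativeText", "Introductory paragraph for the list (placeholder text)."),
    Node("Title", "The World's Billionaires - Wikipedia"),
    Node("Table", "No. Name Net worth ..."),
]
print(find_year_title(nodes, table_index=3))
```

A window of 3 is enough here to step over the header and the paragraph; a table that starts further from its heading would need a larger window or header removal first.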

Table extraction + context

Example code

import os

from dotenv import load_dotenv
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from unstructured.partition.pdf import partition_pdf

# Load environment variables
load_dotenv()

# Global settings
Settings.llm = OpenAI(
    model=os.getenv("DEEPSEEK_MODEL_NAME"),  # model name supported by the DeepSeek API
    api_key=os.getenv("DEEPSEEK_API_KEY"),  # API key loaded from the environment
    base_url=os.getenv("DEEPSEEK_BASE_URL"),
)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Parse the PDF structure and extract text and tables
file_path = "data/复杂PDF/billionaires_page-1-5.pdf"  # change to your file path

elements = partition_pdf(
    file_path,
    strategy="hi_res",  # use the high-resolution strategy
)  # parse the PDF document

# Map element IDs to elements
element_map = {element.id: element for element in elements if hasattr(element, "id")}

for i, element in enumerate(elements):
    if element.category == "Table":
        print("\nTable data:")
        print("Table metadata:", vars(element.metadata))  # vars() shows every metadata attribute
        print("Table content:")
        print(element.text)  # print the table's text content

        # Look up and print the parent node
        parent_id = getattr(element.metadata, "parent_id", None)
        if parent_id and parent_id in element_map:
            parent_element = element_map[parent_id]
            print("\nParent node info:")
            print(f"Type: {parent_element.category}")
            print(f"Content: {parent_element.text}")
            if hasattr(parent_element, "metadata"):
                print(f"Parent metadata: {vars(parent_element.metadata)}")
        else:
            print(f"Parent node not found (ID: {parent_id})")

        # Print the content of the 3 nodes before the table
        print("\nContent of the 3 nodes before the table:")
        for j in range(max(0, i - 3), i):
            prev_element = elements[j]
            print(f"Node {j} ({prev_element.category}):")
            print(prev_element.text)

        print("-" * 50)

# Note: text elements are usually reported as NarrativeText, Title,
# UncategorizedText and so on, so a filter on "Text" may match nothing
text_elements = [el for el in elements if el.category == "Text"]
table_elements = [el for el in elements if el.category == "Table"]

Output

Table data:
Table metadata: {'detection_class_prob': 0.5629849433898926, 'coordinates': CoordinatesMetadata(points=((np.float64(839.0340576171875), np.float64(1001.4764404296875)), (np.float64(839.0340576171875), np.float64(1920.586669921875)), (np.float64(1587.2479248046875), np.float64(1920.586669921875)), (np.float64(1587.2479248046875), np.float64(1001.4764404296875))), system=<unstructured.documents.coordinates.PixelSpace object at 0x15b5252b0>), 'links': [{'text': "Forbes : The World ' s Billionaires website ( https :// www . forb", 'url': 'https://www.forbes.com/billionaires/', 'start_index': 0}, {'text': "Forbes : The World ' s Billionaires website ( https :// www . forb", 'url': 'https://www.forbes.com/billionaires/', 'start_index': 0}], 'last_modified': '2025-07-16T19:30:42', '_known_field_names': frozenset({'filename', 'file_directory', 'attached_to_filename', 'link_texts', 'image_url', 'emphasized_text_tags', 'link_start_indexes', 'sent_to', 'signature', 'cc_recipient', 'key_value_pairs', 'image_mime_type', 'page_name', 'coordinates', 'detection_class_prob', 'subject', 'parent_id', 'url', 'detection_origin', 'header_footer_type', 'link_urls', 'bcc_recipient', 'table_as_cells', 'emphasized_text_contents', 'image_base64', 'is_continuation', 'category_depth', 'orig_elements', 'image_path', 'data_source', 'email_message_id', 'filetype', 'page_number', 'text_as_html', 'last_modified', 'links', 'sent_from', 'languages'}), 'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 1, 'file_directory': 'data/复杂PDF', 'filename': 'billionaires_page-1-5.pdf', 'parent_id': 'b379afb3fd875896f879cc4cfc441305'}
Table content:
Publisher Whale Media Investments Forbes family Publication First published Forbes March 1987[1] Latest publication April 4, 2023 Current list details (2023)[2] Wealthiest Bernard Arnault Net worth (1st) US$211 billion Number of billionaires 2,640 (from 2668) Total list net worth value US$12.2 trillion (from US$ 12.7 trillion) Number of women 337 New members to the 150 list Forbes: The World's Billionaires website (https://www.forb es.com/billionaires/)

Parent node info:
Type: Title
Content: Methodology
Parent metadata: {'detection_class_prob': 0.7953538298606873, 'coordinates': CoordinatesMetadata(points=((np.float64(98.62982255709053), np.float64(1510.925048828125)), (np.float64(98.62982255709053), np.float64(1569.5582275390625)), (np.float64(435.21123011467625), np.float64(1569.5582275390625)), (np.float64(435.21123011467625), np.float64(1510.925048828125))), system=<unstructured.documents.coordinates.PixelSpace object at 0x15b5266f0>), 'links': [], 'last_modified': '2025-07-16T19:30:42', '_known_field_names': frozenset({'filename', 'file_directory', 'attached_to_filename', 'link_texts', 'image_url', 'emphasized_text_tags', 'link_start_indexes', 'sent_to', 'signature', 'cc_recipient', 'key_value_pairs', 'image_mime_type', 'page_name', 'coordinates', 'detection_class_prob', 'subject', 'parent_id', 'url', 'detection_origin', 'header_footer_type', 'link_urls', 'bcc_recipient', 'table_as_cells', 'emphasized_text_contents', 'image_base64', 'is_continuation', 'category_depth', 'orig_elements', 'image_path', 'data_source', 'email_message_id', 'filetype', 'page_number', 'text_as_html', 'last_modified', 'links', 'sent_from', 'languages'}), 'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 1, 'file_directory': 'data/复杂PDF', 'filename': 'billionaires_page-1-5.pdf'}

Content of the 3 nodes before the table:
Node 12 (Title):
Methodology
Node 13 (NarrativeText):
Each year, Forbes employs a team of over 50 reporters from a variety of countries to track the activity of the world's wealthiest individuals[7] and sometimes groups or families – who share wealth. Preliminary surveys are sent to those who may qualify for the list. According to Forbes, they received three types of responses – some people try to inflate their wealth, others cooperate but leave out details,
Node 14 (UncategorizedText):
Publication details
--------------------------------------------------

Table data:
Table metadata: {'detection_class_prob': 0.8833286762237549, 'coordinates': CoordinatesMetadata(points=((np.float64(102.6220932006836), np.float64(929.1338500976562)), (np.float64(102.6220932006836), np.float64(1154.2158203125)), (np.float64(765.6319580078125), np.float64(1154.2158203125)), (np.float64(765.6319580078125), np.float64(929.1338500976562))), system=<unstructured.documents.coordinates.PixelSpace object at 0x15b61dd60>), 'links': [], 'last_modified': '2025-07-16T19:30:42', '_known_field_names': frozenset({'filename', 'file_directory', 'attached_to_filename', 'link_texts', 'image_url', 'emphasized_text_tags', 'link_start_indexes', 'sent_to', 'signature', 'cc_recipient', 'key_value_pairs', 'image_mime_type', 'page_name', 'coordinates', 'detection_class_prob', 'subject', 'parent_id', 'url', 'detection_origin', 'header_footer_type', 'link_urls', 'bcc_recipient', 'table_as_cells', 'emphasized_text_contents', 'image_base64', 'is_continuation', 'category_depth', 'orig_elements', 'image_path', 'data_source', 'email_message_id', 'filetype', 'page_number', 'text_as_html', 'last_modified', 'links', 'sent_from', 'languages'}), 'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 2, 'file_directory': 'data/复杂PDF', 'filename': 'billionaires_page-1-5.pdf', 'parent_id': 'feb0e76147077250a13cdb0f842659a3'}
Table content:
Icon Description Has not changed from the previous ranking. Has increased from the previous ranking. Has decreased from the previous ranking.
...
2019
Node 54 (NarrativeText):
In the 33rd annual Forbes list of the world's billionaires, the list included 2,153 billionaires with a total net wealth of $8.7 trillion, down 55 members and $400 billion from 2018.[14] The U.S. continued to have the most billionaires in the world, with a record of 609, while China dropped to 324 (when not including Hong Kong, Macau and Taiwan).[14]
--------------------------------------------------
...

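Parsing tables with LlamaParse

  • Core features:
    • Automatically converts tables to Markdown
    • Tries to detect and add table titles
    • Keeps the table output structured
  • API parameters:
    • preserve_layout_alignment_across_pages=True for handling tables that span pages
    • take_screenshot=True to capture page screenshots for additional analysis
  • Limitations:
    • Requires API calls, which may incur costs
    • Title detection still makes mistakes (for example, taking a page header for the year)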
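Because the result comes back as Markdown, the tables and the headings around them can be separated with plain string handling before indexing. The sketch below is a simplified illustration rather than part of LlamaParse: the helper name `split_markdown_tables` is hypothetical, the sample text uses placeholder names instead of real parser output, and it assumes headings start with "#" and table rows start with "|".

```python
from typing import Dict, List


def split_markdown_tables(markdown: str) -> List[Dict[str, object]]:
    """Group consecutive '|' lines into tables and remember the nearest heading above each."""
    tables: List[Dict[str, object]] = []
    heading = ""
    current: List[str] = []
    for line in markdown.splitlines() + [""]:  # the trailing "" flushes the last table
        stripped = line.strip()
        if stripped.startswith("|"):
            current.append(stripped)
            continue
        if current:
            tables.append({"heading": heading, "rows": current})
            current = []
        if stripped.startswith("#"):
            heading = stripped.lstrip("#").strip()
    return tables


# Placeholder Markdown shaped like a parsed document with one table per year.
sample = (
    "# 2019\n\n| No. | Name |\n| --- | --- |\n| 1 | Person A |\n\n"
    "# 2018\n\n| No. | Name |\n| --- | --- |\n| 1 | Person B |\n"
)
parsed = split_markdown_tables(sample)
print([t["heading"] for t in parsed])
```

If a page header was misread as a heading, it shows up in the heading field here, which makes such errors easy to spot before the tables are indexed.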

Example code

import os
import time

from dotenv import load_dotenv
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_parse import LlamaParse

# Load environment variables (an OpenAI API key is needed for the embedding model;
# LlamaParse reads its own key from LLAMA_CLOUD_API_KEY)
load_dotenv()

# Set up the base models
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
llm = OpenAI(
    model=os.getenv("DEEPSEEK_MODEL_NAME"),  # model name supported by the DeepSeek API
    api_key=os.getenv("DEEPSEEK_API_KEY"),  # API key loaded from the environment
    base_url=os.getenv("DEEPSEEK_BASE_URL"),
)

Settings.llm = llm
Settings.embed_model = embed_model

# Path to the PDF
pdf_path = "data/复杂PDF/billionaires_page-1-5.pdf"

# Record the start time
start_time = time.time()

# Parse the PDF with LlamaParse; this is a paid cloud service, so mind data security
documents = LlamaParse(result_type="markdown").load_data(pdf_path)

# Record the end time
end_time = time.time()
print(f"PDF parsing time: {end_time - start_time:.2f} seconds")

# Print the parsed result
print("\nParsed document content:")
for i, doc in enumerate(documents, 1):
    print(f"\nDocument {i} content:")
    print(doc.text)
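  • Core capability: a document parsing platform optimized for LLMs that handles complex documents containing tables, charts, and images
  • Header and footer handling: offers configuration options for removing headers and footers, which can improve recognition accuracy for tables that span pages
  • Production use: reported to process more than 10 million documents per week and to support enterprise scenarios

Tool comparison and recommendations

  • Simple table extraction: Camelot or PDFPlumber
  • Context association needed: Unstructured or LlamaParse
  • Complex tables that span pages: prefer LlamaParse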