| /Users/jinglv/PycharmProjects/llm-rag-system
Working directory set to: /Users/jinglv/PycharmProjects/llm-rag-system
Processing file: data/复杂PDF/billionaires_page-1-5.pdf
Warning: No languages specified, defaulting to English.
Table data:
Table metadata: {'detection_class_prob': 0.5629849433898926, 'coordinates': CoordinatesMetadata(points=((np.float64(839.0340576171875), np.float64(1001.4764404296875)), (np.float64(839.0340576171875), np.float64(1920.586669921875)), (np.float64(1587.2479248046875), np.float64(1920.586669921875)), (np.float64(1587.2479248046875), np.float64(1001.4764404296875))), system=<unstructured.documents.coordinates.PixelSpace object at 0x15b8e27b0>), 'links': [{'text': "Forbes : The World ' s Billionaires website ( https :// www . forb", 'url': 'https://www.forbes.com/billionaires/', 'start_index': 0}, {'text': "Forbes : The World ' s Billionaires website ( https :// www . forb", 'url': 'https://www.forbes.com/billionaires/', 'start_index': 0}], 'last_modified': '2025-07-16T19:30:42', '_known_field_names': frozenset({'filename', 'file_directory', 'attached_to_filename', 'link_texts', 'image_url', 'emphasized_text_tags', 'link_start_indexes', 'sent_to', 'signature', 'cc_recipient', 'key_value_pairs', 'image_mime_type', 'page_name', 'coordinates', 'detection_class_prob', 'subject', 'parent_id', 'url', 'detection_origin', 'header_footer_type', 'link_urls', 'bcc_recipient', 'table_as_cells', 'emphasized_text_contents', 'image_base64', 'is_continuation', 'category_depth', 'orig_elements', 'image_path', 'data_source', 'email_message_id', 'filetype', 'page_number', 'text_as_html', 'last_modified', 'links', 'sent_from', 'languages'}), 'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 1, 'file_directory': 'data/复杂PDF', 'filename': 'billionaires_page-1-5.pdf', 'parent_id': 'b379afb3fd875896f879cc4cfc441305'}
Table content: Publisher Whale Media Investments Forbes family Publication First published Forbes March 1987[1] Latest publication April 4, 2023 Current list details (2023)[2] Wealthiest Bernard Arnault Net worth (1st) US$211 billion Number of billionaires 2,640 (from 2668) Total list net worth value US$12.2 trillion (from US$ 12.7 trillion) Number of women 337 New members to the 150 list Forbes: The World's Billionaires website (https://www.forb es.com/billionaires/)
Parent node info:
Type: Title
Content: Methodology
Parent node metadata: {'detection_class_prob': 0.7953538298606873, 'coordinates': CoordinatesMetadata(points=((np.float64(98.62982255709053), np.float64(1510.925048828125)), (np.float64(98.62982255709053), np.float64(1569.5582275390625)), (np.float64(435.21123011467625), np.float64(1569.5582275390625)), (np.float64(435.21123011467625), np.float64(1510.925048828125))), system=<unstructured.documents.coordinates.PixelSpace object at 0x15b8e0620>), 'links': [], 'last_modified': '2025-07-16T19:30:42', '_known_field_names': frozenset({'filename', 'file_directory', 'attached_to_filename', 'link_texts', 'image_url', 'emphasized_text_tags', 'link_start_indexes', 'sent_to', 'signature', 'cc_recipient', 'key_value_pairs', 'image_mime_type', 'page_name', 'coordinates', 'detection_class_prob', 'subject', 'parent_id', 'url', 'detection_origin', 'header_footer_type', 'link_urls', 'bcc_recipient', 'table_as_cells', 'emphasized_text_contents', 'image_base64', 'is_continuation', 'category_depth', 'orig_elements', 'image_path', 'data_source', 'email_message_id', 'filetype', 'page_number', 'text_as_html', 'last_modified', 'links', 'sent_from', 'languages'}), 'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 1, 'file_directory': 'data/复杂PDF', 'filename': 'billionaires_page-1-5.pdf'}
--------------------------------------------------
Table data:
Table metadata: {'detection_class_prob': 0.8833286762237549, 'coordinates': CoordinatesMetadata(points=((np.float64(102.6220932006836), np.float64(929.1338500976562)), (np.float64(102.6220932006836), np.float64(1154.2158203125)), (np.float64(765.6319580078125), np.float64(1154.2158203125)), (np.float64(765.6319580078125), np.float64(929.1338500976562))), system=<unstructured.documents.coordinates.PixelSpace object at 0x15b8e1340>), 'links': [], 'last_modified': '2025-07-16T19:30:42', '_known_field_names': frozenset({'filename', 'file_directory', 'attached_to_filename', 'link_texts', 'image_url', 'emphasized_text_tags', 'link_start_indexes', 'sent_to', 'signature', 'cc_recipient', 'key_value_pairs', 'image_mime_type', 'page_name', 'coordinates', 'detection_class_prob', 'subject', 'parent_id', 'url', 'detection_origin', 'header_footer_type', 'link_urls', 'bcc_recipient', 'table_as_cells', 'emphasized_text_contents', 'image_base64', 'is_continuation', 'category_depth', 'orig_elements', 'image_path', 'data_source', 'email_message_id', 'filetype', 'page_number', 'text_as_html', 'last_modified', 'links', 'sent_from', 'languages'}), 'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 2, 'file_directory': 'data/复杂PDF', 'filename': 'billionaires_page-1-5.pdf', 'parent_id': 'feb0e76147077250a13cdb0f842659a3'}
Table content: Icon Description Has not changed from the previous ranking. Has increased from the previous ranking. Has decreased from the previous ranking.
Parent node info:
Type: Title
Content: Legend
Parent node metadata: {'detection_class_prob': 0.7951710224151611, 'coordinates': CoordinatesMetadata(points=((np.float64(96.45149230957031), np.float64(847.3286743164062)), (np.float64(96.45149230957031), np.float64(897.3561401367188)), (np.float64(236.98038987264118), np.float64(897.3561401367188)), (np.float64(236.98038987264118), np.float64(847.3286743164062))), system=<unstructured.documents.coordinates.PixelSpace object at 0x15b8e3590>), 'links': [], 'last_modified': '2025-07-16T19:30:42', '_known_field_names': frozenset({'filename', 'file_directory', 'attached_to_filename', 'link_texts', 'image_url', 'emphasized_text_tags', 'link_start_indexes', 'sent_to', 'signature', 'cc_recipient', 'key_value_pairs', 'image_mime_type', 'page_name', 'coordinates', 'detection_class_prob', 'subject', 'parent_id', 'url', 'detection_origin', 'header_footer_type', 'link_urls', 'bcc_recipient', 'table_as_cells', 'emphasized_text_contents', 'image_base64', 'is_continuation', 'category_depth', 'orig_elements', 'image_path', 'data_source', 'email_message_id', 'filetype', 'page_number', 'text_as_html', 'last_modified', 'links', 'sent_from', 'languages'}), 'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 2, 'file_directory': 'data/复杂PDF', 'filename': 'billionaires_page-1-5.pdf', 'parent_id': '759e1c4a2a3cf37f98db95d0877208f1'}
...
Type: Title
Content: 2019
Parent node metadata: {'detection_class_prob': 0.7982921600341797, 'coordinates': CoordinatesMetadata(points=((np.float64(95.56612396240234), np.float64(971.339599609375)), (np.float64(95.56612396240234), np.float64(1019.1573486328125)), (np.float64(188.0115203857422), np.float64(1019.1573486328125)), (np.float64(188.0115203857422), np.float64(971.339599609375))), system=<unstructured.documents.coordinates.PixelSpace object at 0x15b8012b0>), 'links': [], 'last_modified': '2025-07-16T19:30:42', '_known_field_names': frozenset({'filename', 'file_directory', 'attached_to_filename', 'link_texts', 'image_url', 'emphasized_text_tags', 'link_start_indexes', 'sent_to', 'signature', 'cc_recipient', 'key_value_pairs', 'image_mime_type', 'page_name', 'coordinates', 'detection_class_prob', 'subject', 'parent_id', 'url', 'detection_origin', 'header_footer_type', 'link_urls', 'bcc_recipient', 'table_as_cells', 'emphasized_text_contents', 'image_base64', 'is_continuation', 'category_depth', 'orig_elements', 'image_path', 'data_source', 'email_message_id', 'filetype', 'page_number', 'text_as_html', 'last_modified', 'links', 'sent_from', 'languages'}), 'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 5, 'file_directory': 'data/复杂PDF', 'filename': 'billionaires_page-1-5.pdf', 'parent_id': 'a69407466af63af128c0512ca7f72190'}
--------------------------------------------------
...
|
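The log above pairs each table with a parent element, and each table's metadata carries a `parent_id`. A plausible way to produce that pairing is to index all parsed elements by id and resolve each table's `parent_id` against that index. The sketch below illustrates only that lookup. It is not the script that generated the output: `Element`, `Meta`, and `tables_with_parents` are hypothetical stand-ins, and the `id` and `category` attribute names are assumptions (only `parent_id` and `page_number` appear in the log).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Meta:
    """Hypothetical stand-in for an element's metadata (fields seen in the log)."""
    page_number: int
    parent_id: Optional[str] = None


@dataclass
class Element:
    """Hypothetical stand-in for a parsed document element."""
    id: str          # assumed attribute name; not shown in the log
    category: str    # e.g. "Title" or "Table" (assumed attribute name)
    text: str
    metadata: Meta


def tables_with_parents(
    elements: List[Element],
) -> List[Tuple[Element, Optional[Element]]]:
    """Pair every Table element with the element its parent_id points to."""
    by_id = {el.id: el for el in elements}  # one pass to build the id index
    pairs = []
    for el in elements:
        if el.category != "Table":
            continue
        # A missing or unresolvable parent_id yields None instead of raising.
        parent = by_id.get(el.metadata.parent_id) if el.metadata.parent_id else None
        pairs.append((el, parent))
    return pairs


# Toy data loosely mirroring the "Legend" table in the log; the ids are made up.
elements = [
    Element("title-legend", "Title", "Legend", Meta(page_number=2)),
    Element("table-legend", "Table",
            "Icon Description Has not changed from the previous ranking.",
            Meta(page_number=2, parent_id="title-legend")),
    Element("table-orphan", "Table", "Table whose parent was not parsed",
            Meta(page_number=3, parent_id="unknown-id")),
]

for table, parent in tables_with_parents(elements):
    print("Table content:", table.text)
    if parent is not None:
        print("Parent node info: Type:", parent.category, "Content:", parent.text)
    else:
        print("Parent node info: not found")
    print("-" * 50)
```

Building the index once keeps the lookup linear in the number of elements. Returning `None` for an unresolved parent lets the caller decide how to report tables whose heading was not detected.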