bug/word document converted from Pages app with a single table in it cannot be extracted by unstructured #3875

samuelan · 2025-01-17T20:00:22Z

Describe the bug
This is all on my MacBook. I created a Numbers file and put a few rows in the spreadsheet, each with 2 columns. I then copied and pasted the rows into the Pages app, where they became a table, and exported the Pages document to a Word document. I can see the table in the Word document. However, when I use unstructured to extract its contents, the table is not extracted.

To Reproduce
Here is the code I use to extract the contents; it does not extract the table.

from unstructured.partition.docx import partition_docx
import os

def extract_elements_from_docx(file_path: str):
    if not os.path.exists(file_path):
        print(f"File not found: {file_path}")
        return

    try:
        elements = partition_docx(
            file=file_path,
            infer_table_structure=True,
            include_page_breaks=True,
            content_extraction=True,
        )
        
        if not elements:
            print("No elements were extracted from the document.")
            return
        
        print("Extracted Elements:")
        for element in elements:
            if element.type == "Table":
                print("Table Detected:")
                print(element.to_dict())
            else:
                print(f"{element.type}: {element.text}")

    except Exception as e:
        print(f"An error occurred while extracting elements: {e}")

if __name__ == "__main__":
    docx_file_path = "copied_from_numbers.docx" 
    extract_elements_from_docx(docx_file_path)

Expected behavior
The table should be extracted.

Screenshots
If applicable, add screenshots to help explain your problem.

Environment Info
Please run python scripts/collect_env.py and paste the output here.
This will help us understand more about the environment in which the bug occurred.

Additional context

copied_from_numbers.docx
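
One way to narrow this down is to check whether the exported file contains table markup at all, independently of unstructured. The sketch below uses only the Python standard library and relies on two facts about the format: a .docx file is a zip archive, and tables in word/document.xml are `w:tbl` elements. The helper name `count_docx_tables` is hypothetical and is not part of unstructured or python-docx.

```python
import os
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_docx_tables(file_path: str) -> dict:
    """Count <w:tbl> elements in word/document.xml (hypothetical helper).

    "total" counts tables at any depth; "top_level" counts only tables that
    are direct children of <w:body>.
    """
    with zipfile.ZipFile(file_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    body = root.find(f"{{{W_NS}}}body")
    total = sum(1 for _ in root.iter(f"{{{W_NS}}}tbl"))
    top_level = len(body.findall(f"{{{W_NS}}}tbl")) if body is not None else 0
    return {"total": total, "top_level": top_level}


if __name__ == "__main__":
    path = "copied_from_numbers.docx"
    if os.path.exists(path):
        print(count_docx_tables(path))
```

If "total" is greater than "top_level", at least one table sits inside another element (for example a wrapper around the table, or a cell of another table) rather than directly in the document body. That might be relevant to why the table is missed, but it is only a guess and is not confirmed here.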

samuelan added the bug (Something isn't working) label on Jan 17, 2025
