bug/content within custom HTML tags is skipped #3708

Closed
lwollenbergfuzzy opened this issue Oct 9, 2024 · 7 comments
Labels
bug Something isn't working html

Comments

@lwollenbergfuzzy

When partitioning HTML files with partition_html, we end up with an empty list of elements.

Here is a small example:

from unstructured.partition.html import partition_html
html_content="""
<!DOCTYPE html>
<html class="client-nojs" lang="de" dir="ltr">
<head>
</head>
<body>
  <div id="content" class="mw-body" role="main">
    <Seitenname>Bestellvorschläge weiterbearbeiten</Seitenname>
    <hr>
    <div class="content">
      <hauptteil_AB>
        <div class="rumpftabelle"></div>
        <table>
          <tbody>
            <tr>
              <th>Intern</th>
              <th>Feldwerte
              </th>
            </tr>
            <tr>
              <td>J</td>
              <td>Ja
              </td>
            </tr>
            <tr>
              <td>N</td>
              <td>Nein
              </td>
            </tr>
          </tbody>
        </table>
    </div>
    </hauptteil_AB>
    <fussteil_B></fussteil_B>
  </div>
  </div>
</body>
</html>
"""
elements = partition_html(text=html_content)
print(elements)
out[0]: []

We would expect something like

out[0]: [<unstructured.documents.elements.Title object>, <unstructured.documents.elements.Table object>]

We use version 0.15.13 of unstructured.

Through trial and error, we found that the problem seems to come from the custom tags <hauptteil_AB> and <fussteil_B>.
We would appreciate any help with this issue.

@lwollenbergfuzzy lwollenbergfuzzy added the bug Something isn't working label Oct 9, 2024
@deku0818

The same problem.

@MiloMoerkerke

Same here

@PhorstenkampFuzzy

+1

@PhorstenkampFuzzy

Anything new here? Even a comment on how to preprocess something like this would help.

@ajainfuzzy

+1

@jjleng

jjleng commented Nov 18, 2024

Same problem.

I guess the cause is here:

This seems to be intended behavior, but it prevents the HTML parser from handling content inside custom HTML tags.
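
In the meantime, a possible preprocessing workaround is to rename unknown tags to a standard one before partitioning. The following is only a minimal sketch, not something verified against unstructured 0.15.13: `replace_custom_tags` and `STANDARD_TAGS` are hypothetical names made up for this example, the tag list is a hand-picked and incomplete subset of HTML, and the regex is not a real HTML parser (it will mishandle `>` inside attribute values, comments, and script bodies).

```python
import re

# Assumption: an incomplete, hand-picked subset of standard HTML tag names.
# Extend it to cover the tags that actually occur in your documents.
STANDARD_TAGS = frozenset({
    "html", "head", "body", "title", "meta", "link", "style", "script",
    "div", "span", "p", "br", "hr", "a", "img", "pre", "code", "blockquote",
    "ul", "ol", "li", "dl", "dt", "dd",
    "table", "thead", "tbody", "tfoot", "tr", "th", "td", "caption",
    "h1", "h2", "h3", "h4", "h5", "h6",
    "section", "article", "header", "footer", "nav", "main", "aside",
    "b", "i", "em", "strong", "u", "small", "sub", "sup",
    "form", "input", "button", "label", "select", "option", "textarea",
    "figure", "figcaption",
})

# Matches opening, closing, and self-closing tags; skips <!DOCTYPE ...> and comments.
_TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9_.:\-]*)((?:\s[^<>]*)?/?)>")


def replace_custom_tags(html: str, replacement: str = "div") -> str:
    """Rename every tag not listed in STANDARD_TAGS to `replacement`.

    Attributes and the leading slash of closing tags are preserved.
    """
    def _rename(match: re.Match) -> str:
        slash, name, rest = match.groups()
        if name.lower() in STANDARD_TAGS:
            return match.group(0)  # standard tag: leave untouched
        return f"<{slash}{replacement}{rest}>"

    return _TAG_RE.sub(_rename, html)


sample = "<body><Seitenname>Titel</Seitenname><hauptteil_AB><p>x</p></hauptteil_AB></body>"
print(replace_custom_tags(sample))
```

The cleaned string could then be passed along as `partition_html(text=replace_custom_tags(html_content))`. Whether that yields the expected Title and Table elements for the example above is untested, and mapping everything to `div` discards whatever meaning the custom tag names carried.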

@scanny scanny added the html label Dec 16, 2024
@scanny
Collaborator

scanny commented Dec 18, 2024

Closing in favor of feature request #3842. Feel free to add remarks or +1s there :)

@scanny scanny closed this as completed Dec 18, 2024
@scanny scanny changed the title bug/reading html file returns empty list bug/content within custom HTML tags is skipped Dec 18, 2024