Number getting converted into scientific notation in metadata.text_as_html #3871

Open
sahil0094 opened this issue Jan 17, 2025 · 0 comments
Labels
bug Something isn't working

sahil0094 commented Jan 17, 2025

Problem Description

When an HTML document is partitioned with partition_html() and table markup is read from chunk.metadata.text_as_html, numeric values in the table cells appear in scientific (exponential) notation instead of their original form.

Example

  • Input Number: 478923
  • Converted Output: 4.7e+05

Steps to Reproduce

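  1. Run partition_html() on an HTML file that contains a table with numeric cells
  2. Chunk the resulting elements with chunk_by_title() and pick out the table chunks
  3. Read chunk.metadata.text_as_html for each table chunk
  4. Compare the numeric cells with the source HTML and observe that they are now in scientific notation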
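
The steps above can be sketched as a small script. This is a minimal sketch, not a confirmed reproduction: `find_scientific_notation` and `reproduce` are hypothetical helper names, and the regular expression is only a heuristic for spotting values such as 4.7e+05. The helper uses the standard library alone; `reproduce` assumes `unstructured` is installed and imports it lazily.

```python
import re
from html.parser import HTMLParser

# Matches tokens such as 4.7e+05 or 1E-3 (a mantissa followed by an exponent).
SCI_NOTATION = re.compile(r"^[+-]?\d+(?:\.\d+)?[eE][+-]?\d+$")


class _CellCollector(HTMLParser):
    """Collects the text of every <td>/<th> cell in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def find_scientific_notation(table_html):
    """Return the cell values in table_html that look like scientific notation."""
    parser = _CellCollector()
    parser.feed(table_html)
    parser.close()
    return [c.strip() for c in parser.cells if SCI_NOTATION.match(c.strip())]


def reproduce(html_path):
    """Run the steps from this issue and print suspicious cells per table chunk."""
    # Imported lazily so the helper above works without unstructured installed.
    from unstructured.chunking.title import chunk_by_title
    from unstructured.partition.html import partition_html

    elements = partition_html(filename=html_path)
    for chunk in chunk_by_title(elements):
        table_html = getattr(chunk.metadata, "text_as_html", None)
        if table_html:
            print(find_scientific_notation(table_html))
```

Calling `reproduce("report.html")` (the file name is a placeholder) on an affected file should print a non-empty list for any table chunk whose cells were converted.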

Expected Behavior

  • Numeric values should be preserved in their original format
  • No automatic scientific notation conversion

Environment Details

  • Unstructured Library Version: 0.10.28
  • Python Version: 3.11.0rc1
  • Environment: Databricks Runtime 15.4 LTS ML

Potential Impact

The conversion is lossy: 4.7e+05 keeps only two significant digits, so the original value 478923 cannot be recovered from the output. This is a data integrity problem, especially for financial or scientific data.

Suggested Investigation

  • Review number parsing/serialization logic
  • Check type conversion mechanisms in metadata handling
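
As a starting point for that review, the snippet below illustrates one generic way an integer can end up in exponent form: a round trip through `float` with a short general format. This is an assumption about the kind of step to look for, not a confirmed cause inside unstructured. It also rounds to 4.8e+05 rather than the 4.7e+05 reported above, so it shows the mechanism only.

```python
from decimal import Decimal

raw = "478923"  # cell text as it appears in the source HTML

# Passing the cell through float and a two-significant-digit general format
# switches to exponent form and drops the remaining digits.
lossy = format(float(raw), ".2g")

# Keeping the cell as text, or parsing it with Decimal, keeps the digits as written.
as_text = raw
as_decimal = str(Decimal(raw))

print(lossy, as_text, as_decimal)
```

If a step like this exists in the table serialization path, keeping cell values as strings end to end would avoid the conversion.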
sahil0094 added the bug label on Jan 17, 2025