SparkValue converter Timestamp Issue #11840

Open
1 of 3 tasks
kakavenkat opened this issue Dec 20, 2024 · 2 comments
Labels
bug Something isn't working

Comments


kakavenkat commented Dec 20, 2024

Apache Iceberg version

1.7.1 (latest release)

Query engine

PrestoDB

Please describe the bug 🐞

When Spark data is converted into Iceberg records, timestamps and dates are converted into long values, which is confusing for users because the actual timestamp or date is hard to read from the record. In addition, some engines, such as Presto, cannot interpret the timestamp value in long format, so it is unreadable there. Users expect these fields to keep their original format or to be represented in a more user-friendly way.

Steps to Reproduce:

  1. Read CSV data containing a timestamp column into a DataFrame, pass the Row data to the SparkValueConverter convert method, then insert the result into an Iceberg table.
  2. Inspect the resulting Iceberg records.
  3. Observe that timestamp and date fields are stored as long values.
  4. Query the table from the Presto engine. The query runs into the timestamp issue.

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
kakavenkat added the bug label Dec 20, 2024

kakavenkat commented Dec 20, 2024

PR for this issue #11841

@RussellSpitzer
Member

I am not sure I understand the issue here.

Iceberg defines Timestamp as microseconds from epoch. When we actually store it in files, it is up to the file format to define how that type is serialized. In the case of Parquet, it is also defined as an Int64.

If Presto has a bug decoding this, that is indicative of a far deeper issue with the Presto read code for Parquet files.

I'm also unsure how changing the underlying representation would help. It is up to the reader to decide how to display the value to users, since native Parquet files are not readable by humans (at least not by me).
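
To make the "microseconds from epoch" point concrete, here is a minimal Python sketch of how a reader can turn the stored long back into a readable timestamp. It is only an illustration, not code from Iceberg, Spark, or Presto: the helper names are hypothetical, and the date handling assumes the Iceberg convention of storing dates as days from 1970-01-01.

```python
from datetime import date, datetime, timedelta, timezone

# Reference points for the epoch-based encodings.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
EPOCH_DATE = date(1970, 1, 1)


def timestamp_to_micros(ts: datetime) -> int:
    """Encode a timezone-aware timestamp as microseconds from epoch (hypothetical helper)."""
    delta = ts - EPOCH
    # Integer arithmetic avoids the float rounding of delta.total_seconds().
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


def micros_to_timestamp(micros: int) -> datetime:
    """Decode microseconds from epoch back into a UTC timestamp (hypothetical helper)."""
    return EPOCH + timedelta(microseconds=micros)


def date_to_days(d: date) -> int:
    """Encode a calendar date as days from epoch (assumed date convention)."""
    return (d - EPOCH_DATE).days


def days_to_date(days: int) -> date:
    """Decode days from epoch back into a calendar date."""
    return EPOCH_DATE + timedelta(days=days)


if __name__ == "__main__":
    ts = datetime(2024, 12, 20, 12, 30, tzinfo=timezone.utc)
    stored = timestamp_to_micros(ts)
    print(stored)                       # the long value a record would hold
    print(micros_to_timestamp(stored))  # the same instant, decoded for display
    print(date_to_days(date(2024, 12, 20)))
```

The stored long and the displayed timestamp carry the same information, so the choice of how to show the value belongs to whichever engine reads the file.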
