More clarification needed for Avro to iceberg data type conversion for timestamp variants #11890

Open · Shekharrajak opened this issue Dec 30, 2024 · 0 comments
Labels: question (Further information is requested)
Query engine

No response

Question

We have the Iceberg type to Spark type mapping documented at https://iceberg.apache.org/docs/1.4.2/spark-writes/#iceberg-type-to-spark-type, which mentions:

Iceberg                       Spark
timestamp with timezone       timestamp
timestamp without timezone    timestamp_ntz
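
For example, in Spark SQL the two Spark timestamp types produce the two Iceberg variants. A minimal sketch (the catalog/table name demo.db.events and an existing SparkSession `spark` are assumptions; requires Spark 3.4+ for TIMESTAMP_NTZ):

// TIMESTAMP maps to Iceberg "timestamp with timezone" (timestamptz);
// TIMESTAMP_NTZ maps to Iceberg "timestamp without timezone".
spark.sql("""
  CREATE TABLE demo.db.events (
    ts     TIMESTAMP,
    ts_ntz TIMESTAMP_NTZ
  ) USING iceberg
""")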

If I have an Avro schema like this:

{
  "type": "record",
  "name": "MyRecord",
  "fields": [
    {
      "name": "id",
      "type": "int"
    },
    {
      "name": "timestampWithZone",
      "type": {
        "type": "long",
        "logicalType": "timestamp-micros"
      }
    },
    {
      "name": "timestampWithoutZone",
      "type": {
        "type": "long",
        "logicalType": "timestamp-micros"
      }
    }
  ]
}

How can I tell Iceberg which epoch-time field to treat as timestamp with timezone (timestamptz) and which as timestamp without timezone (timestamp_ntz)?

I see the Iceberg Avro spec (https://iceberg.apache.org/spec/#avro) distinguishes the two variants with an "adjust-to-utc" field attribute (e.g. "adjust-to-utc": false for timestamp without timezone) rather than with separate logical types.
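
A minimal sketch of what I understand the spec-compliant field schemas to look like, parsed with Avro's Java Schema.Parser (the adjust-to-utc values come from the spec table; I have not verified them against the Iceberg writer):

import org.apache.avro.Schema

// Per the Iceberg spec: timestamp-micros with adjust-to-utc=true is read as
// timestamptz, and with adjust-to-utc=false as timestamp without timezone.
// Avro preserves the extra "adjust-to-utc" attribute as a schema property.
val withZone = new Schema.Parser().parse(
  """{"type": "long", "logicalType": "timestamp-micros", "adjust-to-utc": true}""")
val withoutZone = new Schema.Parser().parse(
  """{"type": "long", "logicalType": "timestamp-micros", "adjust-to-utc": false}""")

But I find that Spark's to_avro handles this differently in AvroSerializer: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala#L169C1-L190C8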

case (TimestampType, LONG) => avroType.getLogicalType match {
    // For backward compatibility, if the Avro type is Long and it is not logical type
    // (the `null` case), output the timestamp value as with millisecond precision.
    case null | _: TimestampMillis => (getter, ordinal) =>
      DateTimeUtils.microsToMillis(timestampRebaseFunc(getter.getLong(ordinal)))
    case _: TimestampMicros => (getter, ordinal) =>
      timestampRebaseFunc(getter.getLong(ordinal))
    case other => throw new IncompatibleSchemaException(errorPrefix +
      s"SQL type ${TimestampType.sql} cannot be converted to Avro logical type $other")
  }

case (TimestampNTZType, LONG) => avroType.getLogicalType match {
    // To keep consistent with TimestampType, if the Avro type is Long and it is not
    // logical type (the `null` case), output the TimestampNTZ as long value
    // in millisecond precision.
    case null | _: LocalTimestampMillis => (getter, ordinal) =>
      DateTimeUtils.microsToMillis(getter.getLong(ordinal))
    case _: LocalTimestampMicros => (getter, ordinal) =>
      getter.getLong(ordinal)
    case other => throw new IncompatibleSchemaException(errorPrefix +
      s"SQL type ${TimestampNTZType.sql} cannot be converted to Avro logical type $other")
  }

This means:

  1. It treats the Avro logical types LocalTimestampMillis and LocalTimestampMicros as TimestampNTZType.
  2. Only if the logical type is TimestampMillis or TimestampMicros is the value treated as TimestampType.
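
The schema-conversion side appears to follow the same pairing. A minimal sketch (assuming Spark 3.4+, where TimestampNTZType exists, and the spark-avro module on the classpath):

import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types.{TimestampNTZType, TimestampType}

// TimestampType converts to long/timestamp-micros (instant semantics), while
// TimestampNTZType converts to long/local-timestamp-micros (wall-clock semantics).
println(SchemaConverters.toAvroType(TimestampType).toString(true))
println(SchemaConverters.toAvroType(TimestampNTZType).toString(true))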

Is there a specific reason it is implemented this way?
