AVRO-4060: Use JDK to Hash Byte Array in UTF8 #3175

belugabehr · 2024-09-25T01:29:18Z

What is the purpose of the change

This pull request improves file read performance by using the JDK hashcode implementation, fixing AVRO-4060.

Verifying this change

This change is a trivial rework / code cleanup without any test coverage.

Documentation

Does this pull request introduce a new feature? no

KalleOlaviNiemitalo · 2024-09-25T01:56:32Z

I feel there should be a test to verify that the result of Utf8.hashCode does not depend on whether the array has unused bytes in it. TestUtf8.hashCodeReused seems to largely cover that by hardcoding specific hash codes but it does not compute hash codes for equal sequences of bytes in two ways. I'm thinking about test code like

Utf8 u1 = new Utf8("abcdefghi"); // length=9, bytes.length=9
u1.setByteLength(8); // length=8, bytes.length=9
Utf8 u2 = new Utf8("abcdefgh"); // length=8, bytes.length=8
assertEquals(u1.hashCode(), u2.hashCode());

or with a hardcoded hash code in the test.

belugabehr · 2024-09-25T02:42:41Z

Hello,

Thank you for the feedback. I agree a unit test is appropriate here.

Ideally, the hashcode implementation should return the same value for the same contents regardless of the size of the array. However, that is not currently the case because the two implementations are different. I believe AVRO-4061 will fix the discrepancy between the two implementations, and then this unit test becomes possible.

I also need to look at the setLength method. It's very confusing to me. If the buffer is made smaller, the underlying string is truncated; if the buffer is made larger, the underlying string is padded with zeros (it's not just that the buffer is expanded). I don't really understand the use case of this method just yet. I feel like it shouldn't bother copying anything into the newly sized array, since the end result is somewhat confusing.
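For illustration, the truncate-or-pad effect described above is what `java.util.Arrays.copyOf` produces on its own. The sketch below assumes the resize is a plain copy into a newly sized array; the `ResizeSketch` class and its `resize` helper are hypothetical stand-ins, not Avro's actual code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ResizeSketch {

  // Hypothetical stand-in for a resize that copies into a new array.
  // Arrays.copyOf truncates when newLength is smaller and zero-pads when larger.
  static byte[] resize(byte[] bytes, int newLength) {
    return Arrays.copyOf(bytes, newLength);
  }

  public static void main(String[] args) {
    byte[] original = "abc".getBytes(StandardCharsets.UTF_8);
    // Shrinking drops the trailing byte.
    System.out.println(Arrays.toString(resize(original, 2))); // [97, 98]
    // Growing appends zero bytes, which would read as NUL characters.
    System.out.println(Arrays.toString(resize(original, 5))); // [97, 98, 99, 0, 0]
  }
}
```

If the padded bytes count as part of the string, the zeros become content rather than spare capacity, which may be the source of the confusion mentioned above.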

belugabehr · 2024-09-25T02:43:52Z

I have a unit test that confirms the behavior, but it does require AVRO-4061 to be merged first.

belugabehr · 2024-09-25T02:51:44Z

The unit test will fail on this PR until AVRO-4061 is addressed.

belugabehr · 2024-09-30T01:18:30Z

AVRO-4061 is now fixed.

Added unit test for coverage.

belugabehr · 2024-12-29T05:12:50Z

@KalleOlaviNiemitalo and @martin-g - Another pass on this one please?

KalleOlaviNiemitalo · 2024-12-29

Some nits on the javadoc but otherwise it looks OK. I'm not a professional Java programmer, though.

KalleOlaviNiemitalo · 2024-12-29T08:01:34Z

lang/java/avro/src/test/java/org/apache/avro/util/TestUtf8.java

+   * There are two different code paths that hashcode() can call depending on the
+   * state of the internal buffer. If the buffer is full (string length eq. buffer
+   * length) then the JDK hashcode function can be used. This function can is
+   * vectorized JDK 21+ and therefore should be preferable. However, if the buffer
+   * is not full (string length le. buffer length), then the JDK does not support


"~~hashcode~~hashCode" (twice)

"function can ~~is~~be vectorized in JDK 21+"

The "eq." and "le." abbreviations look confusing. Does "le." mean "less than or equal to"? But the equality case was already described.

martin-g · 2024-12-30T09:37:50Z

lang/java/avro/src/test/java/org/apache/avro/util/TestUtf8.java

+  /**
+   * There are two different code paths that hashcode() can call depending on the
+   * state of the internal buffer. If the buffer is full (string length eq. buffer
+   * length) then the JDK hashcode function can be used. This function can is


"This function can is vectorized ..." sounds incorrect.

martin-g · 2024-12-30T09:55:53Z

lang/java/avro/src/main/java/org/apache/avro/util/Utf8.java

+      // If the array is filled, use the underlying JDK hash functionality.
+      // Starting with JDK 21, the underlying implementation is vectorized.
+      if (length > 7 && bytes.length == length) {
+        h = Arrays.hashCode(bytes);


How does Arrays.hashCode(bytes) behave when the length is smaller ?
Doesn't it fall back to serial execution internally for anything that is not vectorizable ?

belugabehr · 2025-01-04

It does fall back, but in my micro benchmarks I found it better to apply this skip logic before jumping into the method call at all, especially since the length has to be checked anyway.
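As a rough illustration of the two code paths under discussion, here is a minimal sketch. It assumes the fallback uses the same polynomial that `Arrays.hashCode(byte[])` is documented to compute (start at 1, multiply by 31, add each byte); the `Utf8HashSketch` class and its `hash` helper are hypothetical and are not the code in `Utf8.java`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8HashSketch {

  // Hypothetical helper: hashes the first `length` bytes of `bytes`,
  // ignoring any unused trailing capacity in the array.
  static int hash(byte[] bytes, int length) {
    // Fast path mirroring the condition quoted in the review: the buffer is
    // exactly full and long enough for the JDK call to be worthwhile.
    if (length > 7 && bytes.length == length) {
      return Arrays.hashCode(bytes);
    }
    // Fallback: same polynomial as Arrays.hashCode(byte[]), but stopping at
    // `length` so that spare capacity does not affect the result.
    int h = 1;
    for (int i = 0; i < length; i++) {
      h = 31 * h + bytes[i];
    }
    return h;
  }

  public static void main(String[] args) {
    byte[] full = "abcdefgh".getBytes(StandardCharsets.UTF_8);   // 8 bytes, all used
    byte[] spare = "abcdefghi".getBytes(StandardCharsets.UTF_8); // 9 bytes, only 8 used
    // Both paths should agree for equal contents.
    System.out.println(hash(full, full.length) == hash(spare, 8)); // prints true
  }
}
```

Under that assumption the two paths agree for equal byte sequences, which is the property the test proposed earlier in the thread is meant to check.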

github-actions bot added the "Java" label (Pull Requests for Java binding) on Sep 25, 2024

belugabehr force-pushed the belugabehr/AVRO-4060 branch from eba2337 to ba04247 Compare September 25, 2024 02:51

belugabehr force-pushed the belugabehr/AVRO-4060 branch 3 times, most recently from 408af58 to 64765e9 Compare September 28, 2024 02:21

belugabehr requested a review from martin-g October 2, 2024 03:05

belugabehr force-pushed the belugabehr/AVRO-4060 branch 3 times, most recently from 19f72ff to 8d95711 Compare December 29, 2024 05:12

KalleOlaviNiemitalo reviewed Dec 29, 2024

martin-g approved these changes Dec 30, 2024

AVRO-4060: Use JDK to Hash Byte Array in UTF8

a67e68a

belugabehr force-pushed the belugabehr/AVRO-4060 branch from 8d95711 to a67e68a Compare January 4, 2025 02:28

belugabehr Jan 4, 2025
