AVRO-4060: Use JDK to Hash Byte Array in UTF8 #3175
base: main
I feel there should be a test to verify that the result of Utf8.hashCode does not depend on whether the array has unused bytes in it. TestUtf8.hashCodeReused seems to largely cover that by hardcoding specific hash codes, but it does not compute hash codes for equal sequences of bytes in two ways. I'm thinking about test code like:
    Utf8 u1 = new Utf8("abcdefghi"); // length=9, bytes.length=9
    u1.setByteLength(8); // length=8, bytes.length=9
    Utf8 u2 = new Utf8("abcdefgh"); // length=8, bytes.length=8
    assertEquals(u1.hashCode(), u2.hashCode());
or with a hardcoded hash code in the test.
Hello, thank you for the feedback. I agree a unit test is appropriate here. Ideally, the hashCode implementation should return the same value for the same contents regardless of the size of the array. However, that is not currently the case because the two implementations are different. I believe AVRO-4061 will fix the discrepancy between the two implementations, and then this unit test becomes possible. I also need to look at the
I have a unit test that confirms the behavior, but it does require AVRO-4061 to be merged first.
eba2337 to ba04247
Unit test will fail on this PR until AVRO-4061 is addressed.
408af58 to 64765e9
AVRO-4061 is now fixed. Added unit test for coverage.
19f72ff to 8d95711
@KalleOlaviNiemitalo and @martin-g - Another pass on this one please?
Some nits on the javadoc but otherwise it looks OK. I'm not a professional Java programmer, though.
* There are two different code paths that hashcode() can call depending on the
* state of the internal buffer. If the buffer is full (string length eq. buffer
* length) then the JDK hashcode function can be used. This function can is
* vectorized JDK 21+ and therefore should be preferable. However, if the buffer
* is not full (string length le. buffer length), then the JDK does not support
"hashcode" should be "hashCode" (twice).
"function can is vectorized JDK 21+" should be "function can be vectorized in JDK 21+".
The "eq." and "le." abbreviations look confusing. Does "le." mean "less than or equal to"? But the equality case was already described.
/**
* There are two different code paths that hashcode() can call depending on the
* state of the internal buffer. If the buffer is full (string length eq. buffer
* length) then the JDK hashcode function can be used. This function can is
This function can is vectorized ...
sounds incorrect
// If the array is filled, use the underlying JDK hash functionality.
// Starting with JDK 21, the underlying implementation is vectorized.
if (length > 7 && bytes.length == length) {
h = Arrays.hashCode(bytes);
How does Arrays.hashCode(bytes) behave when the length is smaller? Doesn't it fall back to serial execution internally for anything that is not vectorizable?
It does fall back, but in my microbenchmarks I found it better to apply this skip logic before jumping into the method call at all, especially since the length needs to be checked anyway.
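For readers following the thread, here is a minimal sketch of the dispatch being discussed, written against plain byte arrays rather than Avro's Utf8. The class name HashDispatch and the method hash are hypothetical. The fallback loop is an assumption: it mirrors the formula documented for Arrays.hashCode(byte[]) (seed 1, multiplier 31) so that both paths agree, and it is not copied from the PR. The threshold of 7 is taken from the diff hunk quoted above.

```java
import java.util.Arrays;

// Hypothetical stand-in for the hashing logic under review; not Avro's code.
final class HashDispatch {

    // Hashes the first `length` bytes of `bytes`.
    static int hash(byte[] bytes, int length) {
        // Full buffer and long enough to be worth the call: delegate to the JDK.
        if (length > 7 && bytes.length == length) {
            return Arrays.hashCode(bytes);
        }
        // Otherwise hash only the used prefix with the same polynomial
        // (assumed here so that both paths return identical values).
        int h = 1;
        for (int i = 0; i < length; i++) {
            h = 31 * h + bytes[i];
        }
        return h;
    }
}
```

With this shape, a 9-byte buffer holding "abcdefghi" hashed with length 8 should give the same value as an exact 8-byte array holding "abcdefgh", which is the property the requested unit test checks.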
8d95711 to a67e68a
What is the purpose of the change
Verifying this change
This change is a trivial rework / code cleanup without any test coverage.
Documentation