-
Notifications
You must be signed in to change notification settings - Fork 125
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
C++ and Java implementations of fastpfor + varbyte unable to decode each others output #42
Comments
It is certainly the case that the C++ code and the Java code is not interoperable. If you find a claim that it is the case somewhere, then this claim is false and will need to be removed. If these distinct libraries were interoperable, we would have actual tests. It is certainly possible to make it interoperable. For example, the Parquet folks started from some code in the Java library you mentioned, and then they transformed so it could be interoperable with fast and efficient C++ code on little-endian processors. However, it is not trivial to make Java and C++ interoperable in this setting. You have to account for various things such as the fact that Java is big-endian whereas C++ endianness is typically determined by the underlying processor (and not defined by the language), and the fact that Java does not do pointer casting, SIMD instructions and pointer arithmetic, so some optimizations that are trivial in C++ are just not possible in Java. So if cross-language interoperability is a feature you need, we do not currently provide it and nobody is working on it. There has never been any such request before. Pull requests are invited, however. Now, there are interoperable (Java/C++) code out there. For example, Masked VByte is a C (C++) decoder for the varint format used by Lucene (Java). This is used in production at Indeed. https://github.com/lemire/MaskedVByte And, if this need saying, the Roaring format that Druid already uses is interoperable (C,C++,Python, Java and so forth) across languages. |
If you have picked a particular format you care about, and you interested in helping write a interoperable Java codec, I am certainly available to help with that. I can help make sure that if you are a capable Java/C++ coder, it won't take more than a couple of days of work. It is not going to be just turning on a switch, however. And the lack of interoperability right now is not a bug. It is a missing feature. |
Oops, I meant to put in my writeup that I had no explicit reason to expect that this should work, but I got distracted collecting the details and forgot to put that in along the way. My decision to feed the outputs of one into the other was more of an ad hoc 'hmm, I wonder what happens when I do this', coupled with a perhaps naive hope that maybe the composition of fastpfor256 and variablebyte was more of a specification than it is, and did the writeup in the event it was supposed to be the case. In other words, don't worry, the docs didn't mislead me into doing this 😅 I don't know if having a fallback to java is strictly necessary for us, I'll need to discuss this with other Druid commiters. Most of my experimentation has been with fastpfor+variablebyte, so a java implementation that could decode Thanks for the quick response! (and for all the research papers and code, it's quite fascinating!) |
Ok. As I wrote, it is quite doable to have cross-language interoperability. Just a few days of work. |
hi @lemire , do you have any pointer to the Parquet team's work on making Java implementation interoperable with c++? I am interested in looking into this problem. |
Can you elaborate? Parquet’s C++ and Java code is on GitHub. |
Ok, after some research, it seems to me that the endian is handled by IntBasedBitPackingGenerator.java (https://github.com/apache/parquet-mr/blob/dc61e510126aaa1a95a46fe39bf1529f394147e9/parquet-generator/src/main/java/org/apache/parquet/encoding/bitpacking/IntBasedBitPackingGenerator.java). The class generates codes for both little endian and big endian packing/unpacking methods. Besides the endian issue, is there anything affects the interoperability? I am asking this is because we are adapting FastPFor C++. But potentially we will have a Java reader in the future. |
It is difficult in the abstract to list everything one needs to take into account but you have: Java does not have native unsigned integers, packed data structures having various field widths are difficult to do efficiently in Java, Java does not have SIMD access in the same way you do in C++... |
I've been investigating using the FastPFor algorithm for integer columns in Druid, following up on the discussion and experimentation mentioned in this issue. Initially, I was solely using your Java port of this library, and it's the basis of a PR I currently have open. However, curiosity got the better of me and I wanted to compare performance with the c++ implementation, sooo, I wrote a JNI wrapper to allow me to call this library from java, resulting in this branch which is a sort of alternate implementation of the PR. That's the short of how I encountered the issue mentioned in the title - seeing what happened when I fed one into the other. I have not had a chance to dig in to determine if the issue is in fact in this library or the other one, but it can be replicated by modifying the example program to write encoded data to a file, then loading that into a bytebuffer in java and attempting to decode.
and then adjust the
encodedSize
variable to match the output of the c++ program in the following javaThe exception in this case was
but during experimentation I recall seeing exceptions coming from the
VariableByte
implementation as well, so the exact error may be dependent on length and composition of the input. I don't have one of these exceptions handy unfortunately, I will attempt to trigger it again and update the ticket. The JNI wrapper version can correctly read and decode the file data generated by the example program.Of potential interest: the size of output varies slightly between the implementations, with c++ being a handful of int32 sized values larger. The full set of compatibility tests I threw together can be found here with the c++ snippet above, here. These tests that pass for me are java -> java, jni -> jni, external native encoded file -> jni, and the failing tests are jni -> java, java -> jni, external native encoded file -> java.
Another thing, which might be potentially related and indicative of a bug here (or a bug somewhere in my code), I experience an
assert
failure at the end of decodeArray function when using simd versions of fastpfor that only popped up whenever I would attempt to decode arbitrary offsets of a bytebuffer (from a memory mapped file). I did not experience this issue during initial testing of my JNI wrapper which was using newly allocated buffers populated with data. I explored briefly; commenting out theassert
statement in the header and rebuilding the library resulted in the output decoded without exploding, but the final value would be incorrect when I went to validate the output. Since I experienced this issue only when decoding the mapped buffer where an encoded chunk could be located at arbitrary offsets, I speculated it was an issue with alignment, and indeed copying the values to a 16 byte aligned chunk has caused the assert failure to not appear again. That said, I haven't had the opportunity to try to replicate this behavior purely in c++ yet, so it is possible this issue is one of my own rather than this library, but wanted to throw it out there in the event it helps isolate why the java and c++ outputs are incompatible.In general I've tested my branches with both my mac os laptop with
and ubuntu linux with
but will double check that both issues happen on linux as soon as possible, and will attempt to dig into this deeper in general whenever I get the chance.
I'm quite excited to get this stuff worked into Druid, and using this version of the library is currently my preference since it's benchmarking faster and makes the implementation of the java part of my PR a bit more straightforward, but I think it's probably important to have the ability to fallback to a java only implementation to make the decision a little less permanent.
The text was updated successfully, but these errors were encountered: