Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

C++ and Java implementations of fastpfor + varbyte unable to decode each others output #42

Open
clintropolis opened this issue Aug 30, 2018 · 8 comments

Comments

@clintropolis
Copy link

I've been investigating using the FastPFor algorithm for integer columns in Druid, following up on the discussion and experimentation mentioned in this issue. Initially, I was solely using your Java port of this library, and it's the basis of a PR I currently have open. However, curiosity got the better of me and I wanted to compare performance with the c++ implementation, sooo, I wrote a JNI wrapper to allow me to call this library from java, resulting in this branch which is a sort of alternate implementation of the PR. That's the short of how I encountered the issue mentioned in the title - seeing what happened when I fed one into the other. I have not had a chance to dig in to determine if the issue is in fact in this library or the other one, but it can be replicated by modifying the example program to write encoded data to a file, then loading that into a bytebuffer in java and attempting to decode.

#include <iostream>
#include <fstream>
#include <cstdlib>
#include <ctime>

#include "codecfactory.h"
using namespace std;
using namespace FastPForLib;

int main()
{
  IntegerCODEC &codec = *CODECFactory::getFromName("simdfastpfor256");
  size_t N = 10000;
  std::vector<uint32_t> mydata(N);

  srand (time(NULL));

  for (uint32_t i = 0; i < N; i++)
    mydata[i] = rand() % 100;

  std::vector<uint32_t> compressed_output(N + 1024);
  size_t compressedsize = compressed_output.size();
  codec.encodeArray(mydata.data(), mydata.size(), compressed_output.data(),
                    compressedsize);
  ofstream out("numbers.bin", ios::out | ios::binary);
  if(!out) {
    cout << "Cannot open file.";
    return 1;
   }

  out.write((char *)compressed_output.data(), N * sizeof(uint32_t));

  out.close();

  cout << "length " << compressedsize << "\n";
  return 0;
}

and then adjust the encodedSize variable to match the output of the c++ program in the following java

import com.google.common.io.Files;
import me.lemire.integercompression.FastPFOR;
import me.lemire.integercompression.IntWrapper;
import me.lemire.integercompression.SkippableComposition;
import me.lemire.integercompression.SkippableIntegerCODEC;
import me.lemire.integercompression.VariableByte;

...
    int buffSize = 1 << 16;

    int numValues = 10000;
    int maxNumValues = buffSize >> 2;

    SkippableIntegerCODEC codec = new SkippableComposition(new FastPFOR(), new VariableByte());
    int encodedSize = 2214;

    ByteBuffer encodedValues = Files.map(new File("numbers.bin"));

    int[] valueArray = new int[numValues];
    int[] encodedValueArray = new int[encodedSize];

    // copy encoded buffer to int array
    for (int i = 0; i < encodedSize; i++) {
      encodedValueArray[i] = encodedValues.getInt();
    }

    // decode with java

    IntWrapper outPos = new IntWrapper(0);
    codec.headlessUncompress(encodedValueArray, new IntWrapper(0), encodedSize, valueArray, outPos, numValues);
    // explodes before we get here
    assert (numValues == outPos.get());

The exception in this case was

java.lang.ArrayIndexOutOfBoundsException: 2555904
	at me.lemire.integercompression.FastPFOR.decodePage(FastPFOR.java:239)
	at me.lemire.integercompression.FastPFOR.headlessUncompress(FastPFOR.java:229)
	at me.lemire.integercompression.SkippableComposition.headlessUncompress(SkippableComposition.java:55)

but during experimentation I recall seeing exceptions coming from the VariableByte implementation as well, so the exact error may be dependent on length and composition of the input. I don't have one of these exceptions handy unfortunately, I will attempt to trigger it again and update the ticket. The JNI wrapper version can correctly read and decode the file data generated by the example program.

Of potential interest: the size of output varies slightly between the implementations, with c++ being a handful of int32 sized values larger. The full set of compatibility tests I threw together can be found here with the c++ snippet above, here. These tests that pass for me are java -> java, jni -> jni, external native encoded file -> jni, and the failing tests are jni -> java, java -> jni, external native encoded file -> java.

Another thing, which might be potentially related and indicative of a bug here (or a bug somewhere in my code), I experience an assert failure at the end of decodeArray function when using simd versions of fastpfor that only popped up whenever I would attempt to decode arbitrary offsets of a bytebuffer (from a memory mapped file). I did not experience this issue during initial testing of my JNI wrapper which was using newly allocated buffers populated with data. I explored briefly; commenting out the assert statement in the header and rebuilding the library resulted in the output decoded without exploding, but the final value would be incorrect when I went to validate the output. Since I experienced this issue only when decoding the mapped buffer where an encoded chunk could be located at arbitrary offsets, I speculated it was an issue with alignment, and indeed copying the values to a 16 byte aligned chunk has caused the assert failure to not appear again. That said, I haven't had the opportunity to try to replicate this behavior purely in c++ yet, so it is possible this issue is one of my own rather than this library, but wanted to throw it out there in the event it helps isolate why the java and c++ outputs are incompatible.

In general I've tested my branches with both my mac os laptop with

Apple LLVM version 10.0.0 (clang-1000.10.25.5)
g++-8 (Homebrew GCC 8.2.0) 8.2.0

and ubuntu linux with

g++ (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609

but will double check that both issues happen on linux as soon as possible, and will attempt to dig into this deeper in general whenever I get the chance.

I'm quite excited to get this stuff worked into Druid, and using this version of the library is currently my preference since it's benchmarking faster and makes the implementation of the java part of my PR a bit more straightforward, but I think it's probably important to have the ability to fallback to a java only implementation to make the decision a little less permanent.

@lemire
Copy link
Member

lemire commented Aug 31, 2018

It is certainly the case that the C++ code and the Java code is not interoperable. If you find a claim that it is the case somewhere, then this claim is false and will need to be removed.

If these distinct libraries were interoperable, we would have actual tests.

It is certainly possible to make it interoperable. For example, the Parquet folks started from some code in the Java library you mentioned, and then they transformed so it could be interoperable with fast and efficient C++ code on little-endian processors.

However, it is not trivial to make Java and C++ interoperable in this setting. You have to account for various things such as the fact that Java is big-endian whereas C++ endianness is typically determined by the underlying processor (and not defined by the language), and the fact that Java does not do pointer casting, SIMD instructions and pointer arithmetic, so some optimizations that are trivial in C++ are just not possible in Java.

So if cross-language interoperability is a feature you need, we do not currently provide it and nobody is working on it. There has never been any such request before. Pull requests are invited, however.

Now, there are interoperable (Java/C++) code out there. For example, Masked VByte is a C (C++) decoder for the varint format used by Lucene (Java). This is used in production at Indeed.

http://maskedvbyte.org

https://github.com/lemire/MaskedVByte

And, if this need saying, the Roaring format that Druid already uses is interoperable (C,C++,Python, Java and so forth) across languages.

@lemire
Copy link
Member

lemire commented Aug 31, 2018

If you have picked a particular format you care about, and you interested in helping write a interoperable Java codec, I am certainly available to help with that. I can help make sure that if you are a capable Java/C++ coder, it won't take more than a couple of days of work.

It is not going to be just turning on a switch, however. And the lack of interoperability right now is not a bug. It is a missing feature.

@clintropolis
Copy link
Author

clintropolis commented Sep 3, 2018

Oops, I meant to put in my writeup that I had no explicit reason to expect that this should work, but I got distracted collecting the details and forgot to put that in along the way. My decision to feed the outputs of one into the other was more of an ad hoc 'hmm, I wonder what happens when I do this', coupled with a perhaps naive hope that maybe the composition of fastpfor256 and variablebyte was more of a specification than it is, and did the writeup in the event it was supposed to be the case. In other words, don't worry, the docs didn't mislead me into doing this 😅

I don't know if having a fallback to java is strictly necessary for us, I'll need to discuss this with other Druid commiters. Most of my experimentation has been with fastpfor+variablebyte, so a java implementation that could decode simdfastpfor256 would probably be what I am after if a java fallback is necessary. I will follow up once I get some feedback one way or another.

Thanks for the quick response! (and for all the research papers and code, it's quite fascinating!)

@lemire
Copy link
Member

lemire commented Sep 4, 2018

Ok. As I wrote, it is quite doable to have cross-language interoperability. Just a few days of work.

@xndai
Copy link
Contributor

xndai commented Feb 15, 2019

hi @lemire , do you have any pointer to the Parquet team's work on making Java implementation interoperable with c++? I am interested in looking into this problem.

@lemire
Copy link
Member

lemire commented Feb 15, 2019

@xndai

Can you elaborate? Parquet’s C++ and Java code is on GitHub.

@xndai
Copy link
Contributor

xndai commented Feb 15, 2019

Ok, after some research, it seems to me that the endian is handled by IntBasedBitPackingGenerator.java (https://github.com/apache/parquet-mr/blob/dc61e510126aaa1a95a46fe39bf1529f394147e9/parquet-generator/src/main/java/org/apache/parquet/encoding/bitpacking/IntBasedBitPackingGenerator.java). The class generates codes for both little endian and big endian packing/unpacking methods.

Besides the endian issue, is there anything affects the interoperability? I am asking this is because we are adapting FastPFor C++. But potentially we will have a Java reader in the future.

@lemire
Copy link
Member

lemire commented Feb 16, 2019

It is difficult in the abstract to list everything one needs to take into account but you have: Java does not have native unsigned integers, packed data structures having various field widths are difficult to do efficiently in Java, Java does not have SIMD access in the same way you do in C++...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants