
compressed array become bigger #11

Open
koron opened this issue Sep 4, 2013 · 7 comments
@koron (Collaborator) commented Sep 4, 2013

There are some cases where the compressed array becomes bigger than the original.
You can see a sample of this here: https://github.com/koron/JavaFastPFOR/compare/adhoc-tests
Of course, I know this is very rare and not a target of this library.
But it may cause a kind of inconvenience for users.

There are some workarounds:

  • Document it.
  • Add utility methods to calculate the preferred length of the output array.
  • Introduce an auto-extendable buffer; it could be called IntArrayOutputStream or similar.

Which workaround should we choose?
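The third workaround might look like the following sketch: an int analogue of java.io.ByteArrayOutputStream. The class is hypothetical (the name comes from the list above) and is not part of JavaFastPFOR:

```java
// Hypothetical sketch of the "auto-extendable buffer" workaround: an int
// analogue of java.io.ByteArrayOutputStream. Not part of JavaFastPFOR.
import java.util.Arrays;

class IntArrayOutputStream {
    private int[] buf = new int[64];
    private int count = 0;

    /** Appends a single int, growing the backing array as needed. */
    void write(int v) {
        ensureCapacity(count + 1);
        buf[count++] = v;
    }

    /** Appends a slice of an int array. */
    void write(int[] src, int off, int len) {
        ensureCapacity(count + len);
        System.arraycopy(src, off, buf, count, len);
        count += len;
    }

    private void ensureCapacity(int needed) {
        if (needed > buf.length) {
            buf = Arrays.copyOf(buf, Math.max(needed, buf.length * 2));
        }
    }

    /** Number of ints written so far. */
    int size() { return count; }

    /** Trimmed copy of everything written. */
    int[] toIntArray() { return Arrays.copyOf(buf, count); }
}
```

A codec could then write into such a buffer without the caller sizing anything up front, at the cost of an extra copy and the growth reallocations.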

@koron (Collaborator, Author) commented Sep 4, 2013

I forgot a 4th workaround: DO NOTHING 👅

@lemire (Member) commented Sep 4, 2013

In case this is useful to you, I have now made you a collaborator on my version of the project. This means that you can work directly on the main code without going through a pull request. (I trust you.)

Let us review this problem.

First of all, you can use a fairly large output buffer... one that is larger than the input buffer. If the input buffer contains N integers, just make sure that the output buffer can hold at least 1.1 * N + 1024 integers and you are good with high probability.

When the output buffer is too small, an exception is thrown. One can catch this exception and realize that the output buffer was too small... one can then extend the buffer and try again. Yes, it is ugly... but in a given application, this can be made very unlikely.
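The sizing rule and the catch-and-retry pattern above can be sketched as follows. The `Codec` interface is a stand-in for an IntegerCODEC-style call, and the assumption that the overflow surfaces as an ArrayIndexOutOfBoundsException is ours; none of this is the library's actual API:

```java
// Sketch of "generous buffer, catch overflow, double, retry".
// `Codec` is a stand-in for an IntegerCODEC-style call (hypothetical),
// assumed to throw ArrayIndexOutOfBoundsException when `out` is too small.
import java.util.Arrays;

class SafeCompress {

    /** Conservative output length following the "1.1 * N + 1024" rule of thumb. */
    static int safeOutputLength(int n) {
        return (int) (1.1 * n) + 1024;
    }

    /** Stand-in: writes into `out`, returns the number of ints written. */
    interface Codec {
        int compress(int[] in, int[] out);
    }

    /** Compress with a generous buffer; if it still overflows, double and retry. */
    static int[] compressWithRetry(Codec codec, int[] in) {
        int outLen = safeOutputLength(in.length);
        while (true) {
            try {
                int[] out = new int[outLen];
                int written = codec.compress(in, out);
                return Arrays.copyOf(out, written); // trim to actual size
            } catch (ArrayIndexOutOfBoundsException tooSmall) {
                outLen *= 2; // output buffer was too small: extend and try again
            }
        }
    }
}
```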

How can we make this less ugly?

  1. I think keeping things as they are is a good option. That is, you can add a new approach, but I don't think the current approach is broken. For many people, it could be the best approach.
  2. Yes, you could have a helper class that tells you how bad things could get given an IntegerCODEC. This would require us to look at the IntegerCODEC and compute a bound on how badly the compression can go. This is not hard and could be very useful.
  3. A good approach from a software engineering perspective would be to write to an extensible buffer, but this is likely to generate quite a bit of overhead. It is also not the most flexible.
  4. I think that the very best approach, from a software engineering perspective, is to use IntInputStream and IntOutputStream. This could be done as part of a distinct package for people who want nice streams.
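Option 2 could be as simple as one worst-case bound per codec. For example, a variable-byte scheme stores 7 payload bits per byte, so a 32-bit integer costs at most 5 bytes; a hypothetical helper (the constants are illustrative reasoning, not values taken from the library) might compute:

```java
// Hypothetical per-codec worst-case bound helper (option 2 above).
// Illustrative arithmetic only; not taken from JavaFastPFOR.
class WorstCase {

    /** VariableByte stores 7 payload bits per byte, so a 32-bit int takes
     *  at most ceil(32 / 7) = 5 bytes. Packed back into 4-byte ints,
     *  N inputs need at most ceil(5 * N / 4) ints of output
     *  (ignoring any small fixed header). */
    static int variableByteBound(int n) {
        return (5 * n + 3) / 4;
    }
}
```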

To minimize overhead, you might want to have an API that allows you to read and write data in blocks. For example, you might want to have something like:

ReadableIntView readBlock(int number);

and

WritableIntView writeBlock(int number);

(This is just a vague idea.)

The goal would be that the overhead of using a stream interface would be tiny compared to the array-based interface.

Getting this right might require non-trivial engineering, but I think it is doable.
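A minimal sketch of the block-oriented stream idea, with a trivial array-backed implementation for illustration. All names here are hypothetical, following the vague idea above; nothing like this exists in the library:

```java
// Hypothetical block-oriented int stream interfaces (names follow the
// comment above). The array-backed implementation is for illustration only.
class IntStreams {

    /** Read-only window over `length()` ints, served without copying. */
    interface ReadableIntView {
        int get(int index);
        int length();
    }

    interface IntInputStream {
        /** Exposes the next `number` ints, or fewer at end of stream. */
        ReadableIntView readBlock(int number);
    }

    /** Trivial array-backed stream: each readBlock advances a cursor. */
    static IntInputStream over(int[] data) {
        return new IntInputStream() {
            private int pos = 0;

            public ReadableIntView readBlock(int number) {
                final int start = pos;
                final int len = Math.min(number, data.length - start);
                pos += len;
                return new ReadableIntView() {
                    public int get(int i) { return data[start + i]; }
                    public int length() { return len; }
                };
            }
        };
    }
}
```

Because each block is a view into the backing storage rather than a copy, the per-element overhead compared to a raw array loop could stay small, which is the stated goal.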

@koron (Collaborator, Author) commented Sep 5, 2013

I'll make some trials for this with these priorities:

  1. Don't break or delete anything.
  2. Keep memory and speed efficiency.
  3. Improve usability and maintainability.

ghost assigned koron Sep 5, 2013
@lemire (Member) commented Sep 5, 2013

We are in agreement.

@kk00ss commented Dec 9, 2015

I repeatedly get a compressed array larger than the original while compressing random integers. Tested with versions 0.1.5 and 0.1.6 for a number of codecs, and for all of them the output size is identical.
I suspect I'm missing something about how to use your library?
Yes, I've checked: there is nothing that looks like padding, and no leading or trailing zeros.
Here is my code:

  val codecs = Array(
    new IntegratedComposition(new IntegratedBinaryPacking(),
      new IntegratedVariableByte()),
    new JustCopy(),
    new VariableByte(),
    new IntegratedVariableByte(),
    new Composition(new BinaryPacking(), new VariableByte()),
    new Composition(new NewPFD(), new VariableByte()),
    new Composition(new NewPFDS16(), new VariableByte()),
    new Composition(new OptPFDS9(), new VariableByte()),
    new Composition(new OptPFDS16(), new VariableByte()),
    new Composition(new FastPFOR128(), new VariableByte()),
    new Composition(new FastPFOR(), new VariableByte()),
    new Simple9(),
    new Simple16(),
    new Composition(new XorBinaryPacking(), new VariableByte()),
    new Composition(new DeltaZigzagBinaryPacking(),
      new DeltaZigzagVariableByte()))

  def test2() = {
    val N = 100
    val r = new Random(System.nanoTime())
    val keys = (1 to N).map(i => {
      r.nextInt(Int.MaxValue)})
      .distinct.toArray
    println(keys.toList)
    println(s"size before compression ${keys.size}")
    codecs.map(codec => {
      print(s"${codec.toString} , ")
      // note: `codec` is never passed to the compressor below, so every
      // iteration measures the same default IntegratedIntCompressor codec,
      // which is why the output size is identical across codecs
      val iic = new IntegratedIntCompressor()
      var start = System.nanoTime()
      val compressed = iic.compress(keys) // compressed array
      val encodeTime = System.nanoTime() - start
      print(s" $encodeTime , ")
      print(s" ${compressed.size} , ")
      start = System.nanoTime()
      val recov = iic.uncompress(compressed)
      val decodeTime = System.nanoTime() - start
      println(s" $decodeTime , ")
      //println(compressed.toList)
    })
  }

@lemire (Member) commented Dec 9, 2015

@kk00ss

Please consult our README on what the library does:

https://github.com/lemire/JavaFastPFOR#what-does-this-do
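As background for the README pointer above: a value drawn uniformly from [0, 2^31) carries about 31 bits of entropy, so no lossless codec can represent it in fewer than ~32 bits on average, and a byte-aligned scheme such as variable-byte actually spends 5 bytes (40 bits) on it. The library targets sorted lists of integers with small gaps, where deltas fit in few bits. A small stand-alone check of the byte-length arithmetic (plain Java, no library calls; `vbyteLength` is our own helper, not library API):

```java
// Why uniformly random ints expand under a byte-aligned variable-byte
// scheme: 7 payload bits per byte means a near-2^31 value needs 5 bytes.
class WhyRandomDoesNotCompress {

    /** Number of 7-bit variable-byte bytes needed for a non-negative int. */
    static int vbyteLength(int v) {
        int bytes = 1;
        while ((v >>>= 7) != 0) bytes++;
        return bytes;
    }
}
```

For values near Int.MaxValue this gives 5 bytes (40 bits) per integer against 32 bits of raw storage, i.e. guaranteed expansion on that workload.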

@kk00ss commented Dec 9, 2015

Sorry, my bad.
