
AWS test environment + automated test/benchmark in Jenkins #73

Open · 11 of 18 tasks
svanoort opened this issue Apr 1, 2016 · 8 comments

@svanoort (Contributor) commented Apr 1, 2016

I am looking at setting up an AWS environment (spun up on demand only) that will run tests in a fast and automated fashion, using my personal Jenkins host to trigger it when commits are pushed.

Work progress:

  • Create an r3.large instance with ephemeral storage and an assigned benchmarking-specific IAM role
  • Create setup script that installs docker, xz, git, starts docker, and pulls the ubuntu-16 allthelanguages docker image
  • Create and attach policy to the benchmarking IAM role that allows reads from the dataset S3 bucket and writes to the results bucket
  • Compress huwiki, huwikisource, and cleaned huwiki with xz -9 (smallest size) and upload to new S3 buckets (the data bucket is private; the benchmark-results bucket is initially private, later public)
  • Add script commands to setup script that will download data from S3 and decompress it
  • Set the AWS host to use ephemeral (instance) storage for /tmp folder
  • Run benchmark using docker
  • Upload first result to S3 - available here
  • Create scripting to grab instance + package info to metadata file
    • Git hash used in build
    • Timestamp
    • Host type, from aws cli
    • Hash of input file
  • Timeouts and resource limits on individual runs (Node.js, for example, hung on the instance and needed to be manually killed; another run ran out of RAM and broke the Docker session); see the runner sketch after this list
  • Create scripting to name results by run/host info individually
  • Jenkins: job to run tests (inside a resource-limited container) against main wordcount branch + PRs
  • Jenkins - role or similar to allow control of benchmarking host?
    • Public view-only access to builds now enabled on dynamic.codeablereason.com/jenkins
    • HTTPS access added to dynamic.codeablereason.com (with LetsEncrypt)
    • Enforce HTTPS for all but badges/static resources on Jenkins (for performance/access reasons)
    • Enable limited-access users for wordcount use
  • Jenkins - job to fire benchmarks (github triggering)
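
For the timeouts/resource-limits item, here is a minimal sketch of what the per-run guard could look like, assuming Python on the host; the image name, mount path, and limits are placeholders, not the project's actual configuration:

    import subprocess

    def run_benchmark(cmd, timeout_sec=3600, mem_limit="6g"):
        """Run one benchmark inside the Docker image; hard-kill it on timeout."""
        docker_cmd = [
            "docker", "run", "--rm", "--name", "bench",
            "--memory", mem_limit,        # hard RAM cap, so one run can't break the session
            "-v", "/tmp/data:/data:ro",   # dataset on ephemeral storage, mounted read-only
            "wordcount-allthelanguages",  # hypothetical image name
        ] + cmd
        try:
            return subprocess.run(docker_cmd, timeout=timeout_sec)
        except subprocess.TimeoutExpired:
            # subprocess only kills the docker client; reap the container too
            subprocess.run(["docker", "kill", "bench"])
            return None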

Hardware/specs:

  • Storage: use SSD instance storage for benchmarking (which limits instance types). General-purpose EBS SSD storage is generally slower and would run out of I/O credits after half an hour (benchmarks need several hours).
  • Memory: either 7 GB (small datasets or where memory is not needed) or 15 GB (large or high-memory datasets).
  • CPUs: 2 or 4 core.
  • Instance types: m3.large (2-core, 7.5 GB RAM) for the small datasets, and r3.large (2-core, 15.25 GB RAM) for the big ones. If we do lots of parallelized implementations, add m3.xlarge (4-core, 15 GB RAM).
  • Cost: I am not spending more than $10-15/month on this, beyond my existing Jenkins host (a reserved t2.micro) and domain/S3 hosting. Instances will be created to run a set of benchmarks and then terminated, with run frequency chosen to keep costs within limits.

Architecture:

  • Instances are spun up by my Jenkins host, with an appropriate IAM role or credentials to do this in a limited way.
  • Benchmark datasets will be self-hosted to not hit their sources hard. They won't be fully public unless small.
  • Instance gets an IAM role that allows uploading to a public (?) S3 results bucket.
  • Instance runs benchmarks on instance storage
  • Instance will upload each result to the S3 bucket as it completes, stamped with the git commit hash, run timestamp, language, etc. (see the metadata sketch after this list)
  • All testing will use a reasonable timeout for both individual tests and the whole test set; if a test hangs, it is killed or skipped.
  • All testing uses the docker image, for reproducibility across hardware.
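
As a sketch of the result stamping described above (field names are my guesses, and I read the instance type from the EC2 instance metadata endpoint rather than the AWS CLI):

    import datetime
    import hashlib
    import json
    import subprocess
    import urllib.request

    def build_metadata(input_path):
        """Collect the stamp attached to one benchmark result."""
        instance_type = urllib.request.urlopen(
            "http://169.254.169.254/latest/meta-data/instance-type", timeout=2
        ).read().decode()
        git_hash = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
        sha = hashlib.sha256()
        with open(input_path, "rb") as f:   # hash the corpus in 1 MiB chunks
            for chunk in iter(lambda: f.read(1 << 20), b""):
                sha.update(chunk)
        return json.dumps({
            "git_hash": git_hash,
            "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
            "host_type": instance_type,
            "input_sha256": sha.hexdigest(),
        })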

Two options for how to set it up:

  • EBS-based & on-demand instances:
    • Use an EBS volume containing benchmark data and preconfigured system, and just start/stop the instance.
    • When run, the git repo is cloned, the dataset is copied to the data folder, and tests are run & uploaded.
    • Easier to set up and run, but more expensive.
  • S3 based/spot instances:
    • Cheaper (about 1/4 the on-demand instance price) but more maintenance.
    • Submit spot bids; instances are configured via the "user data" field with a startup script that sets up and runs the benchmarks (see the boto3 sketch after this list).
    • Private S3 buckets host compressed corpus data, these are fetched and decompressed.
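
For the spot path, roughly what the request could look like with boto3; the AMI ID, bid price, role name, bucket names, and user-data script are all placeholders:

    import base64
    import boto3

    # Hypothetical bootstrap passed via the "user data" field; the real setup
    # script would install docker/xz/git and pull the image as described above.
    user_data = """#!/bin/bash
    aws s3 cp s3://wordcount-data/huwiki.xz /tmp/ && xz -d /tmp/huwiki.xz
    """

    ec2 = boto3.client("ec2", region_name="us-east-1")
    resp = ec2.request_spot_instances(
        SpotPrice="0.08",   # roughly 2x the observed spot price (see note below)
        InstanceCount=1,
        LaunchSpecification={
            "ImageId": "ami-xxxxxxxx",                       # placeholder AMI
            "InstanceType": "r3.large",
            "IamInstanceProfile": {"Name": "benchmarking"},  # hypothetical role name
            # this particular API expects user data pre-encoded as base64:
            "UserData": base64.b64encode(user_data.encode()).decode(),
        },
    )
    print(resp["SpotInstanceRequests"][0]["SpotInstanceRequestId"])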

Open questions:

  • What to use for controlling instances?
    • AWS CLI is easy
    • Jenkins AWS EC2 plugin will spin up Jenkins agents in EC2 (far easier to generate and report results from this), but comes with performance overheads
    • Ansible is kind of amazing and easy to work with

Yesterday I had good results tinkering with a spot-purchased c3.large instance for benchmarking, doing all I/O to the /media/ephemeral0 instance store. Pricing was only about $0.04/hour for the spot buy (bid at 2x the current spot price to keep the instance from being terminated if the market price rose).

@juditacs (Owner) commented Apr 4, 2016

This is great, you really put a lot of effort into this.

I'm a little bit afraid that we won't get many more submissions. The only reason this became popular is that the PHP 5 vs 7 improvement made it onto reddit. It would perhaps be more interesting to create a second challenge (and I have ideas), but I'm afraid it takes way too much time to manage.

If we're building this environment for our own education (e.g., I've never used Jenkins), then by all means, let's do it. We just shouldn't expect heavy usage.

Notes

There should be two kinds of test data: one that fits into memory and one that doesn't. This can be done with either different instance types or different datasets. Very large datasets might not be convenient: downloading them every time is slow, and storing them on S3 may be expensive.

What to use for controlling instances?

AWS CLI is easy
Jenkins AWS EC2 plugin will spin up Jenkins agents in EC2 (far easier to generate and report results from this), but comes with performance overheads
Ansible is kind of amazing and easy to work with

AWS CLI sounds more than enough for our purposes.

@svanoort (Contributor, Author) commented Apr 4, 2016

This is great, you really put a lot of effort into this.

Thanks. Let's say this isn't solely for wordcount use, though that would be the initial use case; I've been doing an increasing amount of benchmarking/testing, and it's quite painful to do locally due to interference from running applications. Chrome, backup daemons, and IDEs are the worst offenders.

Basically, I've needed something like this for a while; this is just the excuse to finally set it up.

Very large datasets might not be convenient: downloading them every time is slow, and storing them on S3 may be expensive.

I was afraid of that too, but it turns out to be pretty speedy, especially after using xz -9 to compress the data (which gets it 25-30% smaller than standard bzip2). The issue is that, due to bandwidth costs, it's not very practical to make the larger compressed datasets publicly downloadable; we might be able to sidestep this by only allowing torrent downloads from S3 (which reduces costs).
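
For reference, the recompression step is reproducible with Python's standard library, where preset=9 corresponds to xz -9 (file names here are placeholders):

    import bz2
    import lzma
    import shutil

    # Stream-recompress a bzip2 corpus to xz at the highest non-extreme preset,
    # equivalent to `bunzip2 | xz -9`, without holding the corpus in memory.
    with bz2.open("huwiki.bz2", "rb") as src, \
            lzma.open("huwiki.xz", "wb", preset=9) as dst:
        shutil.copyfileobj(src, dst, length=1 << 20)  # 1 MiB chunks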

@svanoort (Contributor, Author) commented Apr 5, 2016

(I would also be curious what your other challenge is. Especially if it can scratch the "must go fast!" itch...)

@juditacs (Owner) commented Apr 5, 2016

The image should also build fast on a normal PC and the data should be downloaded within reasonable time. Huwiki is borderline acceptable IMHO.

We talked about a new challenge with @gaebor. It should involve handling a variable-length encoding such as UTF-8. For example: split on Unicode whitespace (downside: we would have to create artificial data for this) and count the length of each word, outputting a histogram of word lengths. By length we mean Unicode length, so len("álom") == 4, not 5. Later I realized this is too easy: simple raw UTF-8 handling is enough, and you can get away without a hash table. Anyway, something similar would be nice.
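
A toy Python 3 reference for this might look like the sketch below (UTF-8 input assumed); str.split() with no arguments already splits on Unicode whitespace, and len() counts codepoints:

    import sys
    from collections import Counter

    hist = Counter()
    for line in sys.stdin:         # assumes stdin is decoded as UTF-8
        for word in line.split():  # splits on Unicode whitespace
            hist[len(word)] += 1   # len() counts codepoints: len("álom") == 4
    for length in sorted(hist):
        print(length, hist[length])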

@svanoort (Contributor, Author) commented Apr 5, 2016

The image should also build fast on a normal PC and the data should be downloaded within reasonable time. Huwiki is borderline acceptable IMHO.

I agree that the full Huwiki is a bit too large in bzip2 format. When cleaned, it compresses to about 500 MB with xz using the -9 (highest non-extreme compression) setting, and I imagine the original is about the same. I think that's a reasonable cap on corpus size, and recompressing makes sense.

Looking at hosting again, it's cost-prohibitive for me to host corpora publicly on AWS (~$0.10/GB outbound bandwidth pricing), but the smallest DigitalOcean droplet is $5/month and offers 1 TB of transfer, so that might be an option.

split on Unicode whitespace (downside: we would have to create artificial data for this) and count the length of each word, outputting a histogram of word lengths. By length we mean Unicode length, so len("álom") == 4, not 5.

I like this as a basis: maybe not splitting on all Unicode whitespace, but having some whitespace characters that must be handled. We could require counting codepoints (which includes 4-byte UTF-8 characters outside the BMP). Generating some synthetic test cases would not be too hard, though we probably don't want to force handling of every Unicode situation, since I doubt most languages get everything 100% right.

Perhaps if we added a stipulation that multibyte encodings of normally single-byte characters must be converted to a single byte for counting, that would remove the ability to just use raw bytes and check whether they're above a certain value?
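
If "multibyte encodings of normally single-byte characters" means overlong UTF-8 sequences (my reading of the proposal), strict decoders already reject them, for example:

    # 'a' is canonically the single byte 0x61; 0xC1 0xA1 would be an overlong
    # two-byte encoding of the same codepoint, which strict UTF-8 forbids.
    print(b"\x61".decode("utf-8"))    # 'a', the canonical form
    try:
        b"\xc1\xa1".decode("utf-8")   # overlong two-byte form of 'a'
    except UnicodeDecodeError as err:
        print("rejected:", err)       # "invalid start byte"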

@juditacs (Owner) commented Apr 5, 2016

Perhaps if we added a stipulation that multibyte encodings of normally single-byte characters must be converted to a single byte for counting, that would remove the ability to just use raw bytes and check whether they're above a certain value?

What do you mean by 'normally single-byte characters'? Codepoints under 256?
Character counting in UTF-8 is pretty easy: the first byte of a character tells you how many more bytes follow.
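
For illustration (a sketch, not code from the repo), counting characters this way reduces to skipping continuation bytes, since every non-continuation byte starts a character:

    def utf8_char_count(data: bytes) -> int:
        # continuation bytes match 0b10xxxxxx; everything else starts a character
        return sum((b & 0xC0) != 0x80 for b in data)

    assert utf8_char_count("álom".encode("utf-8")) == 4  # 5 bytes, 4 characters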

@svanoort svanoort changed the title AWS test environment, thoughts? AWS test environment + automated test/benchmark in Jenkins Apr 8, 2016
@svanoort (Contributor, Author) commented Apr 8, 2016

Specifically, I mean canonicalization of character representations (preventing the use of invalid representations). This is sort of a random thought - I am far from an expert in the details of Unicode (usually I delegate this to built-in libraries and only have to care when specific cases pose application or data issues, such as unprintable characters in usernames).

Perhaps there's a better way to force "true" Unicode handling?

@svanoort (Contributor, Author) commented:
On hold for now due to work commitments; I will revisit once things have settled down a little.
