
AWS test environment + automated test/benchmark in Jenkins #73

Open · 11 of 18 tasks
svanoort opened this issue Apr 1, 2016 · 8 comments

@svanoort (Contributor) commented Apr 1, 2016

I am looking at setting up an AWS environment (spun up on demand only) that will run tests in a fast and automated fashion, using my personal Jenkins host to trigger it when commits are pushed.

Work progress:

  • Create an r3.large instance with ephemeral storage and an assigned benchmarking-specific IAM role
  • Create setup script that installs docker, xz, git, starts docker, and pulls the ubuntu-16 allthelanguages docker image
  • Create and attach policy to the benchmarking IAM role that allows reads from the dataset S3 bucket and writes to the results bucket
  • Compress huwiki, huwikisource, and cleaned huwiki with xz -9 (smallest size) and upload to new S3 buckets (the data bucket is private; the benchmark-results bucket is initially private, later public)
  • Add script commands to setup script that will download data from S3 and decompress it
  • Set the AWS host to use ephemeral (instance) storage for /tmp folder
  • Run benchmark using docker
  • Upload first result to S3 - available here
  • Create scripting to grab instance + package info to metadata file
    • Git hash used in build
    • Timestamp
    • Host type, from aws cli
    • Hash of input file
  • Timeouts and resource limits on individual runs (Node.js, for example, hung on the instance and needed to be manually killed; another run ran out of RAM and broke the Docker session); see the runner sketch after this list
  • Create scripting to name results by run/host info individually
  • Jenkins: job to run tests (inside a resource-limited container) against main wordcount branch + PRs
  • Jenkins - role or similar to allow control of benchmarking host?
    • Public view-only access to builds now enabled on dynamic.codeablereason.com/jenkins
    • HTTPS access added to dynamic.codeablereason.com (with LetsEncrypt)
    • Enforce HTTPS for all but badges/static resources on Jenkins (for performance/access reasons)
    • Enable limited-access users for wordcount use
  • Jenkins - job to fire benchmarks (github triggering)
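
For the timeouts/resource-limits item, here is a minimal sketch of what the per-run guard could look like, assuming Python on the host; the image name, mount path, and limits are placeholders, not the project's actual configuration:

    import subprocess

    def run_benchmark(cmd, timeout_sec=3600, mem_limit="6g"):
        """Run one benchmark inside the Docker image; hard-kill it on timeout."""
        docker_cmd = [
            "docker", "run", "--rm", "--name", "bench",
            "--memory", mem_limit,        # hard RAM cap, so one run can't break the session
            "-v", "/tmp/data:/data:ro",   # dataset on ephemeral storage, mounted read-only
            "wordcount-allthelanguages",  # hypothetical image name
        ] + cmd
        try:
            return subprocess.run(docker_cmd, timeout=timeout_sec)
        except subprocess.TimeoutExpired:
            # subprocess only kills the docker client; reap the container too
            subprocess.run(["docker", "kill", "bench"])
            return None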

Hardware/specs:

  • Storage: use SSD instance storage for benchmarking (which limits instance types). General-purpose EBS SSD storage is generally slower and would run out of I/O credits after half an hour (benchmarks need several hours).
  • Memory: either 7 GB (small datasets or where memory is not needed) or 15 GB (large or high-memory datasets).
  • CPUs: 2 or 4 core.
  • Instance types: m3.large (2-core, 7.5 GB RAM) for the small datasets, and r3.large (2-core, 15.25 GB RAM) for the big ones. If we do lots of parallelized implementations, add m3.xlarge (4-core, 15 GB RAM).
  • Cost: I am not spending more than $10-15/month on this, beyond my existing Jenkins host (a reserved t2.micro) and domain/S3 hosting. Instances will be created to run a set of benchmarks and then terminated, with run frequency chosen to keep costs within limits.

Architecture:

  • Instances are spun up by my Jenkins host, with an appropriate IAM role or credentials to do this in a limited way.
  • Benchmark datasets will be self-hosted to not hit their sources hard. They won't be fully public unless small.
  • Instance gets an IAM role that allows uploading to a public (?) S3 results bucket.
  • Instance runs benchmarks on instance storage
  • Instance will upload each result to the S3 bucket as it completes, stamped with the git commit hash, run timestamp, language, etc. (see the metadata sketch after this list)
  • All testing will use a reasonable timeout for both individual tests and the whole test set; if a test hangs, it is killed or skipped.
  • All testing uses the docker image, for reproducibility across hardware.
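
As a sketch of the result stamping described above (field names are my guesses, and I read the instance type from the EC2 instance metadata endpoint rather than the AWS CLI):

    import datetime
    import hashlib
    import json
    import subprocess
    import urllib.request

    def build_metadata(input_path):
        """Collect the stamp attached to one benchmark result."""
        instance_type = urllib.request.urlopen(
            "http://169.254.169.254/latest/meta-data/instance-type", timeout=2
        ).read().decode()
        git_hash = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
        sha = hashlib.sha256()
        with open(input_path, "rb") as f:   # hash the corpus in 1 MiB chunks
            for chunk in iter(lambda: f.read(1 << 20), b""):
                sha.update(chunk)
        return json.dumps({
            "git_hash": git_hash,
            "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
            "host_type": instance_type,
            "input_sha256": sha.hexdigest(),
        })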

Two options for how to set it up:

  • EBS-based & on-demand instances:
    • Use an EBS volume containing benchmark data and preconfigured system, and just start/stop the instance.
    • When run, the git repo is cloned, the dataset is copied to the data folder, and tests are run & uploaded.
    • Easier to set up and run, but more expensive.
  • S3 based/spot instances:
    • Cheaper (about 1/4 the on-demand instance price) but more maintenance.
    • Submit spot bids; instances are configured via the "user data" field with a startup script that sets up and runs the benchmarks (see the boto3 sketch after this list).
    • Private S3 buckets host compressed corpus data, these are fetched and decompressed.
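
For the spot path, roughly what the request could look like with boto3; the AMI ID, bid price, role name, bucket names, and user-data script are all placeholders:

    import base64
    import boto3

    # Hypothetical bootstrap passed via the "user data" field; the real setup
    # script would install docker/xz/git and pull the image as described above.
    user_data = """#!/bin/bash
    aws s3 cp s3://wordcount-data/huwiki.xz /tmp/ && xz -d /tmp/huwiki.xz
    """

    ec2 = boto3.client("ec2", region_name="us-east-1")
    resp = ec2.request_spot_instances(
        SpotPrice="0.08",   # roughly 2x the observed spot price (see note below)
        InstanceCount=1,
        LaunchSpecification={
            "ImageId": "ami-xxxxxxxx",                       # placeholder AMI
            "InstanceType": "r3.large",
            "IamInstanceProfile": {"Name": "benchmarking"},  # hypothetical role name
            # this particular API expects user data pre-encoded as base64:
            "UserData": base64.b64encode(user_data.encode()).decode(),
        },
    )
    print(resp["SpotInstanceRequests"][0]["SpotInstanceRequestId"])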

Open questions:

  • What to use for controlling instances?
    • AWS CLI is easy
    • Jenkins AWS EC2 plugin will spin up Jenkins agents in EC2 (far easier to generate and report results from this), but comes with performance overheads
    • Ansible is kind of amazing and easy to work with

Yesterday I had good results tinkering with a spot-purchased c3.large instance for benchmarking, doing all I/O to the /media/ephemeral0 instance store. Pricing was only about $0.04/hour for the spot buy (bid at 2x the current spot price to keep the instance from being terminated if the market price rose).

@juditacs (Owner) commented Apr 4, 2016

This is great, you really put a lot of effort into this.

I'm a little bit afraid that we won't get many more submissions. The only reason this became popular is that the PHP 5 vs 7 improvement made it onto reddit. It would perhaps be more interesting to create a second challenge (and I have ideas), but I'm afraid it takes way too much time to manage.

If we're building this environment for our own education (e.g., I've never used Jenkins), then by all means, let's do it. We just shouldn't expect heavy usage.

Notes

There should be two kinds of test data: one that fits into memory and one that doesn't. This can be done with either different instance types or different datasets. Very large datasets might not be convenient: downloading them every time is slow, and storing them on S3 may be expensive.

What to use for controlling instances?

AWS CLI is easy
Jenkins AWS EC2 plugin will spin up Jenkins agents in EC2 (far easier to generate and report results from this), but comes with performance overheads
Ansible is kind of amazing and easy to work with

AWS CLI sounds more than enough for our purposes.

@svanoort (Contributor, Author) commented Apr 4, 2016

This is great, you really put a lot of effort into this.

Thanks. Let's say this isn't solely for wordcount use, though that would be the initial use case; I've been doing an increasing amount of benchmarking/testing, and it's quite painful to do locally due to interference from running applications. Chrome, backup daemons, and IDEs are the worst offenders.

Basically, I've needed something like this for a while; this is just the excuse to finally set it up.

Very large datasets might not be convenient: downloading them every time is slow, and storing them on S3 may be expensive.

I was afraid of that too, but it turns out to be pretty speedy, especially after using xz -9 to compress the data (which gets it 25-30% smaller than standard bzip2). The issue is that, due to bandwidth costs, it's not very practical to make the larger compressed datasets publicly downloadable; we might be able to sidestep this by only allowing torrent downloads from S3 (which reduces costs).
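
For reference, the recompression step is reproducible with Python's standard library, where preset=9 corresponds to xz -9 (file names here are placeholders):

    import bz2
    import lzma
    import shutil

    # Stream-recompress a bzip2 corpus to xz at the highest non-extreme preset,
    # equivalent to `bunzip2 | xz -9`, without holding the corpus in memory.
    with bz2.open("huwiki.bz2", "rb") as src, \
            lzma.open("huwiki.xz", "wb", preset=9) as dst:
        shutil.copyfileobj(src, dst, length=1 << 20)  # 1 MiB chunks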

@svanoort (Contributor, Author) commented Apr 5, 2016

(I would also be curious what your other challenge is. Especially if it can scratch the "must go fast!" itch...)

@juditacs (Owner) commented Apr 5, 2016

The image should also build fast on a normal PC and the data should be downloaded within reasonable time. Huwiki is borderline acceptable IMHO.

We talked about a new challenge with @gaebor. It should involve handling a variable-length encoding such as UTF-8. For example: split on Unicode whitespace (downside: we would have to create artificial data for this) and count the length of each word, outputting a histogram of word lengths. By length we mean Unicode length, so len("álom") == 4, not 5. Later I realized this is too easy: simple raw UTF-8 handling is enough, and you can get away without a hash table. Anyway, something similar would be nice.
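
A toy Python 3 reference for this might look like the sketch below (UTF-8 input assumed); str.split() with no arguments already splits on Unicode whitespace, and len() counts codepoints:

    import sys
    from collections import Counter

    hist = Counter()
    for line in sys.stdin:         # assumes stdin is decoded as UTF-8
        for word in line.split():  # splits on Unicode whitespace
            hist[len(word)] += 1   # len() counts codepoints: len("álom") == 4
    for length in sorted(hist):
        print(length, hist[length])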

@svanoort (Contributor, Author) commented Apr 5, 2016

The image should also build fast on a normal PC and the data should be downloaded within reasonable time. Huwiki is borderline acceptable IMHO.

I agree that the full Huwiki is a bit too large in bzip2 format. When cleaned, it compresses to about 500 MB with xz using the -9 (highest non-extreme compression) setting, and I imagine the original is about the same. I think that's a reasonable cap on corpus size, and recompressing makes sense.

Looking at hosting again, it's cost-prohibitive for me to host corpora publicly on AWS (~$0.10/GB outbound bandwidth pricing), but the smallest DigitalOcean droplet is $5/month and offers 1 TB of transfer, so that might be an option.

split on Unicode whitespace (downside: we would have to create artificial data for this) and count the length of each word, outputting a histogram of word lengths. By length we mean Unicode length, so len("álom") == 4, not 5.

I like this as a basis: maybe not splitting on all Unicode whitespace, but having some whitespace characters that must be handled. We could require counting codepoints (which includes 4-byte UTF-8 characters outside the BMP). Generating some synthetic test cases would not be too hard, though we probably don't want to force handling of every Unicode situation, since I doubt most languages get everything 100% right.

Perhaps if we added a stipulation that multibyte encodings of normally single-byte characters must be converted to a single byte for counting, that would remove the ability to just use raw bytes and check whether they're above a certain value?
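
If "multibyte encodings of normally single-byte characters" means overlong UTF-8 sequences (my reading of the proposal), strict decoders already reject them, for example:

    # 'a' is canonically the single byte 0x61; 0xC1 0xA1 would be an overlong
    # two-byte encoding of the same codepoint, which strict UTF-8 forbids.
    print(b"\x61".decode("utf-8"))    # 'a', the canonical form
    try:
        b"\xc1\xa1".decode("utf-8")   # overlong two-byte form of 'a'
    except UnicodeDecodeError as err:
        print("rejected:", err)       # "invalid start byte"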

@juditacs (Owner) commented Apr 5, 2016

Perhaps if we added a stipulation that multibyte encodings of normally single-byte characters must be converted to a single byte for counting, that would remove the ability to just use raw bytes and check whether they're above a certain value?

What do you mean by 'normally single-byte characters'? Codepoints under 256?
Character counting in UTF-8 is pretty easy: the first byte of a character tells you how many more bytes follow.
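
For illustration (a sketch, not code from the repo), counting characters this way reduces to skipping continuation bytes, since every non-continuation byte starts a character:

    def utf8_char_count(data: bytes) -> int:
        # continuation bytes match 0b10xxxxxx; everything else starts a character
        return sum((b & 0xC0) != 0x80 for b in data)

    assert utf8_char_count("álom".encode("utf-8")) == 4  # 5 bytes, 4 characters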

@svanoort svanoort changed the title AWS test environment, thoughts? AWS test environment + automated test/benchmark in Jenkins Apr 8, 2016
@svanoort (Contributor, Author) commented Apr 8, 2016

Specifically, I mean canonicalization of character representations (preventing the use of invalid representations). This is sort of a random thought - I am far from an expert in the details of Unicode (usually I delegate this to built-in libraries and only have to care when specific cases pose application or data issues, such as unprintable characters in usernames).

Perhaps there's a better way to force "true" Unicode handling?

@svanoort (Contributor, Author) commented:
On hold for now due to work commitments; I will revisit once things have settled down a little.
