AWS test environment + automated test/benchmark in Jenkins #73
Comments
This is great, you really put a lot of effort into this. I'm a little bit afraid that we won't get many more submissions. The only reason this became popular is that the PHP 5 vs 7 improvement made it onto reddit. It would perhaps be more interesting to create a second challenge (and I have ideas), but I'm afraid it takes way too much time to manage this. If we're building this environment for our own education (i.e. I've never used Jenkins), then by all means, let's do it. We just shouldn't expect heavy usage.

Notes: there should be two kinds of test data, one that fits into memory and one that doesn't. This can be done with either different instance types or different datasets. Too large a dataset might not be convenient: downloading it every time (slow) vs. storing it on S3 (expensive?).
AWS CLI sounds more than enough for our purposes.
Thanks. I'll say that this isn't solely for the wordcount use, though that would be the initial use case; I've been doing an increasing amount of benchmarking/testing and it's quite painful to do locally due to interference from running applications. Chrome, backup daemons, and IDEs are the worst offenders. Basically, I've needed something like this for a while; this is just the excuse to finally get around to setting it up.
I was afraid of that too, but it turns out to be pretty speedy, especially after using xz -9 to compress data (which gets it 25-30% smaller than standard bzip2). The issue is that due to costs it's not super practical to make the larger compressed datasets publicly downloadable; might be able to sidestep this by only allowing torrent download from S3 (reducing costs).
(I would also be curious what your other challenge is. Especially if it can scratch the "must go fast!" itch...)
The image should also build fast on a normal PC, and the data should be downloadable within a reasonable time. Huwiki is borderline acceptable IMHO. We talked about a new challenge with @gaebor. It should involve handling a variable-length encoding such as UTF-8. For example, split on Unicode whitespace (downside: we would have to create artificial data for this) and count the lengths of the words. The output is the histogram of word lengths. By length, we mean Unicode length, so len("álom") == 4, not 5. Later I realized this is too easy: simple raw UTF-8 handling is enough and you can get away without a hash table. Anyway, something similar would be nice.
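To make the idea concrete, here is a minimal sketch of the proposed task (my own illustration, not an agreed spec; the I/O format and function names are placeholders): split the input on Unicode whitespace and build a histogram of word lengths measured in codepoints, so len("álom") == 4.

```python
# Minimal sketch of the proposed word-length-histogram challenge.
# Hypothetical reference implementation; output format is illustrative only.
import sys
from collections import Counter

def word_length_histogram(text: str) -> Counter:
    # str.split() with no argument splits on Unicode whitespace, and
    # len() on a Python str counts codepoints rather than bytes.
    return Counter(len(word) for word in text.split())

if __name__ == "__main__":
    data = sys.stdin.buffer.read().decode("utf-8")
    for length, count in sorted(word_length_histogram(data).items()):
        print(f"{length}\t{count}")
```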
I agree that the full Huwiki is a bit too large in bzip2 format. When cleaned, it compresses to about 500 MB with xz using the -9 (highest non-extreme compression) setting, and I imagine the original is about the same. I think that's a reasonable cap on corpus size, and recompressing makes sense. Looking at hosting again, it's cost-prohibitive for me to host the corpora publicly on AWS (~$0.10/GB outbound bandwidth pricing), but the smallest Digital Ocean droplet is $5/month and includes 1 TB of transfer, so that might be an option.
I like this as a basis; maybe not splitting on all Unicode whitespace characters, but requiring that at least some be handled. We could also require counting codepoints (which includes 4-byte UTF-8 sequences for characters outside the BMP). Generating some synthetic test cases would not be too hard, though we maybe don't want to force handling of every Unicode situation, since I doubt most languages get everything 100% right. Perhaps if we added a stipulation that multibyte encodings of normally single-byte characters must be converted to a single byte for counting, that would remove the ability to just use raw bytes and check whether they're above a certain value?
What do you mean by 'normally single-byte characters'? Codepoints under 256?
Specifically, I mean canonicalization of character representations (preventing the use of invalid representations). This is sort of a random thought; I am far from an expert in the details of Unicode (usually I delegate this to built-in libraries and only have to care when specific cases pose application or data issues, such as unprintable characters in usernames). Perhaps there's a better way to force "true" Unicode handling?
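To illustrate what I mean (a rough sketch of the distinction, not a proposed rule): a strict UTF-8 decoder already rejects overlong (non-canonical) encodings, whereas the raw-byte shortcut happily counts invalid input. The function names below are just for illustration.

```python
# Strict decoding vs. the naive byte-counting shortcut being discussed.
def codepoint_count_strict(data: bytes) -> int:
    # Python's UTF-8 decoder raises UnicodeDecodeError on overlong or
    # otherwise invalid sequences, e.g. b"\xc0\xaf" (an overlong "/").
    return len(data.decode("utf-8"))

def naive_byte_count(data: bytes) -> int:
    # The shortcut: count bytes that are not UTF-8 continuation bytes
    # (0b10xxxxxx). Correct for valid input, but it "counts" invalid
    # data too, with no canonicalization check at all.
    return sum(1 for b in data if b & 0xC0 != 0x80)

if __name__ == "__main__":
    valid = "álom".encode("utf-8")        # 5 bytes, 4 codepoints
    overlong = b"\xc0\xaf"                 # overlong encoding of "/"
    print(codepoint_count_strict(valid))   # 4
    print(naive_byte_count(overlong))      # 1, but strict decoding rejects it
```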
On hold for now due to work commitments; I will revisit once things have settled down a little bit.
I am looking at setting up an AWS environment (spun up on demand only) that will run tests in a fast and automated fashion, using my personal Jenkins host to trigger it when commits are pushed.
Work progress:
Hardware/specs:
Architecture:
Two options for how to set it up:
Open questions:
Yesterday I had good results tinkering with a spot-purchased c3.large instance for benchmarking, doing all I/O against the /media/ephemeral0 instance store. Pricing was only about $0.04/hour for the spot purchase (bid at 2x the current spot price to prevent the instance from being terminated if the price spikes).
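For reference, a rough boto3 sketch of how that spot-purchase flow could be automated from Jenkins (region, AMI ID, and key name below are placeholders, and this is an assumed workflow rather than the exact setup used for the c3.large runs):

```python
# Sketch: request a c3.large spot instance at 2x the current spot price.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Look up the most recent spot price for the instance type...
history = ec2.describe_spot_price_history(
    InstanceTypes=["c3.large"],
    ProductDescriptions=["Linux/UNIX"],
    MaxResults=1,
)
current_price = float(history["SpotPriceHistory"][0]["SpotPrice"])

# ...and bid at 2x the current price so a small spike doesn't kill the run.
ec2.request_spot_instances(
    SpotPrice=str(current_price * 2),
    InstanceCount=1,
    Type="one-time",
    LaunchSpecification={
        "ImageId": "ami-xxxxxxxx",      # placeholder benchmark AMI
        "InstanceType": "c3.large",
        "KeyName": "benchmark-key",     # placeholder key pair
    },
)
```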