tasks.txt

My PLAN:

---

versionado:
    1.1 - con fpredict dentro de isolate.
    1.2 - con placeholder en textarea.
    1.3 - minor.
    1.4 - presentation.
    1.5 - just to see what's going with 599: Timeout.

faltan los scripts que generan
    coverage.txt and coverage-2gram.txt

Shiny App.
    + test. chinese. browsers.
    + check error console in shinyapps.io
    X stats.

Presentation.
    review text.
    
    + same ideas in my forum description.
        I'm taking *all* the candidates from 2-3-4gram tables, merge them by their common predicted value (all=T), merge the unigram (all=F), replace NAs by 0s, multiply by the lambdas, sort, take the top 3.
    + add formula.
    + explain performance.
    expletives.
        https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en
    +  ==>>> (1) how your model predicts, (2) what results it produces, and (3) how your app works.


optional in app:
        + if expletives -> symbols
        if fullstop -> capital
        if Names -> capital
        via dictionary.

    count symbols
        3106 hashtags    
    count numbers
        grep "[[:digit:]*] .*[^[:alpha:]]" freq.1.all.txt | grep -v "'" |awk 'BEGIN { ac=0; } { ac+=$1; } END { print ac; }'
        16947
        grep "[[:digit:]*] .*[^A-Za-z]" freq.1.all.txt | grep -v "'" |awk 'BEGIN { ac=0; } { ac+=$1; } END { print ac; }'
        30878
        filtered (café)


line-number, accum / tot <-- coverage.txt
awk 'BEGIN { acum=0;t=50615233; } { acum+=$1; print NR, acum/t; }' freq.1.all.txt | sed 's/,/./' >  coverage.txt


2. FREQ OF 2-NGRAM
50615230 total.
awk 'BEGIN { acum=0;t=50615230; } { acum+=$1; print NR, acum/t; }' freq.2.all.txt | sed 's/,/./' > coverage-2gram.txt
idem coverage 2-gram


FOOTPRINT:
> footprint()     # N=100
DI   6.6 Kb 
N1 	 1.9 Kb 
N2 	 2.4 Kb 
N3 	 2.9 Kb 
N4 	 3.4 Kb 
> footprint()
DI   611.5 Kb    # N=10,000
N1 	 117.9 Kb 
N2 	 157.1 Kb 
N3 	 196.3 Kb 
N4 	 235.5 Kb 
> footprint()    # N=10,000  M=100,000
DI   611.5 Kb 
N1 	 117.9 Kb 
N2 	 1563.4 Kb 
N3 	 1954.1 Kb 
N4 	 2344.8 Kb 
> footprint()    # N=10,000  M=1,000,000
DI   611.5 Kb 
N1 	 117.9 Kb 
N2 	 15625.9 Kb 
N3 	 19532.2 Kb 
N4 	 23438.6 Kb 
> footprint()    # N=10,000  M=1,000,000 using datatables.
DI      611.5 Kb 
N1 	 118.6 Kb 
N2 	 15626.6 Kb 
N3 	 19533 Kb 
N4 	 23439.5 Kb 


    shiny limits:
        small   256 MB
        medium  512 MB
        large	1024 MB
        xlarge	2048 MB
        xxlarge	4096 MB

-----

Benchmark 1:
weights: c(0.001, 0.05, 0.14, 0.809);

Overall score:      18.66 %
Overall precision:  25.68 %
Average runtime:    59.28 msec
Total memory used:  47.84 MB

Dataset details
Dataset "blogs"     Score: 15.57 %, Precision: 21.36 %
Dataset "quizzes"   Score: 23.87 %, Precision: 33.66 %
Dataset "tweets"    Score: 16.52 %, Precision: 22.02 %


Benchmark 2:

weights: c(1.387779e-17, 0.1239362, 0.7521276, 0.1239362);

Overall score:     17.93 %
Overall precision: 25.01 %
Average runtime:   63.77 msec
Total memory used: 47.94 MB

Dataset details
Dataset "blogs"    Score: 15.50 %, Precision: 21.32 %
Dataset "quizzes"  Score: 22.11 %, Precision: 32.01 %
Dataset "tweets"   Score: 16.19 %, Precision: 21.69 %


Benchmark 3:
same weights. 30000 words.

Overall score:     19.00 %
Overall precision: 26.30 %
Average runtime:   70.29 msec
Total memory used: 49.08 MB

Dataset details
 Dataset "blogs" (599 lines, 14587 words, hash 14b3c593e543eb8b2932cf00b646ed653e336897a03c82098b725e6e1f9b7aa2)
  Score: 16.13 %, Precision: 22.17 %
 Dataset "quizzes" (20 lines, 323 words, hash 07697c9cf45891a1f6da633299f35522711a17e65136ba261702e78e0abd09e1)
  Score: 23.87 %, Precision: 33.99 %
 Dataset "tweets" (793 lines, 14011 words, hash 7fa3bf921c393fe7009bc60971b2bb8396414e7602bb4f409bed78c7192c30f4)
  Score: 17.01 %, Precision: 22.73 %


Benchmark 4:
30,000 words. fixed bug.

Overall score:     21.55 %
Overall precision: 26.23 %
Average runtime:   75.38 msec
Total memory used: 49.43 MB

Dataset details
 Dataset "blogs" (599 lines, 14587 words, hash 14b3c593e543eb8b2932cf00b646ed653e336897a03c82098b725e6e1f9b7aa2)
  Score: 17.96 %, Precision: 22.04 %
 Dataset "quizzes" (20 lines, 323 words, hash 07697c9cf45891a1f6da633299f35522711a17e65136ba261702e78e0abd09e1)
  Score: 27.94 %, Precision: 33.99 %
 Dataset "tweets" (793 lines, 14011 words, hash 7fa3bf921c393fe7009bc60971b2bb8396414e7602bb4f409bed78c7192c30f4)
  Score: 18.75 %, Precision: 22.66 %


Benchmark 5:
30,000 words. bug in order fixed. pick first 10.

Overall score:     21.67 %
Overall precision: 26.36 %
Average runtime:   43.84 msec
Total memory used: 49.05 MB

Dataset details
 Dataset "blogs" (599 lines, 14587 words, hash 14b3c593e543eb8b2932cf00b646ed653e336897a03c82098b725e6e1f9b7aa2)
  Score: 18.19 %, Precision: 22.30 %
 Dataset "quizzes" (20 lines, 323 words, hash 07697c9cf45891a1f6da633299f35522711a17e65136ba261702e78e0abd09e1)
  Score: 27.94 %, Precision: 33.99 %
 Dataset "tweets" (793 lines, 14011 words, hash 7fa3bf921c393fe7009bc60971b2bb8396414e7602bb4f409bed78c7192c30f4)
  Score: 18.87 %, Precision: 22.80 %


Benchmark 6:
same as 5 but with optimizations and using the last benchmark.R

Overall top-3 score:     21.67 %
Overall top-1 precision: 16.08 %
Overall top-3 precision: 26.36 %
Average runtime:         37.91 msec
Total memory used:       49.09 MB

Dataset details
 Dataset "blogs" (599 lines, 14587 words, hash 14b3c593e543eb8b2932cf00b646ed653e336897a03c82098b725e6e1f9b7aa2)
  Score: 18.19 %, Top-1 precision: 13.35 %, Top-3 precision: 22.30 %
 Dataset "quizzes" (20 lines, 323 words, hash 07697c9cf45891a1f6da633299f35522711a17e65136ba261702e78e0abd09e1)
  Score: 27.94 %, Top-1 precision: 20.79 %, Top-3 precision: 33.99 %
 Dataset "tweets" (793 lines, 14011 words, hash 7fa3bf921c393fe7009bc60971b2bb8396414e7602bb4f409bed78c7192c30f4)
  Score: 18.87 %, Top-1 precision: 14.10 %, Top-3 precision: 22.80 %