-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathtasks.txt
200 lines (153 loc) · 5.78 KB
/
tasks.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
My PLAN:
---
versionado:
1.1 - con fpredict dentro de isolate.
1.2 - con placeholder en textarea.
1.3 - minor.
1.4 - presentation.
1.5 - just to see what's going with 599: Timeout.
faltan los scripts que generan
coverage.txt and coverage-2gram.txt
Shiny App.
+ test. chinese. browsers.
+ check error console in shinyapps.io
X stats.
Presentation.
review text.
+ same ideas in my forum description.
I'm taking *all* the candidates from 2-3-4gram tables, merge them by their common predicted value (all=T), merge the unigram (all=F), replace NAs by 0s, multiply by the lambdas, sort, take the top 3.
+ add formula.
+ explain performance.
expletives.
https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en
+ ==>>> (1) how your model predicts, (2) what results it produces, and (3) how your app works.
optional in app:
+ if expletives -> symbols
if fullstop -> capital
if Names -> capital
via dictionary.
count symbols
3106 hashtags
count numbers
grep "[[:digit:]*] .*[^[:alpha:]]" freq.1.all.txt | grep -v "'" |awk 'BEGIN { ac=0; } { ac+=$1; } END { print ac; }'
16947
grep "[[:digit:]*] .*[^A-Za-z]" freq.1.all.txt | grep -v "'" |awk 'BEGIN { ac=0; } { ac+=$1; } END { print ac; }'
30878
filtered (café)
line-number, accum / tot <-- coverage.txt
awk 'BEGIN { acum=0;t=50615233; } { acum+=$1; print NR, acum/t; }' freq.1.all.txt | sed 's/,/./' > coverage.txt
2. FREQ OF 2-NGRAM
50615230 total.
awk 'BEGIN { acum=0;t=50615230; } { acum+=$1; print NR, acum/t; }' freq.2.all.txt | sed 's/,/./' > coverage-2gram.txt
idem coverage 2-gram
FOOTPRINT:
> footprint() # N=100
DI 6.6 Kb
N1 1.9 Kb
N2 2.4 Kb
N3 2.9 Kb
N4 3.4 Kb
> footprint()
DI 611.5 Kb # N=10,000
N1 117.9 Kb
N2 157.1 Kb
N3 196.3 Kb
N4 235.5 Kb
> footprint() # N=10,000 M=100,000
DI 611.5 Kb
N1 117.9 Kb
N2 1563.4 Kb
N3 1954.1 Kb
N4 2344.8 Kb
> footprint() # N=10,000 M=1,000,000
DI 611.5 Kb
N1 117.9 Kb
N2 15625.9 Kb
N3 19532.2 Kb
N4 23438.6 Kb
> footprint() # N=10,000 M=1,000,000 using datatables.
DI 611.5 Kb
N1 118.6 Kb
N2 15626.6 Kb
N3 19533 Kb
N4 23439.5 Kb
shiny limits:
small 256 MB
medium 512 MB
large 1024 MB
xlarge 2048 MB
xxlarge 4096 MB
-----
Benchmark 1:
weights: c(0.001, 0.05, 0.14, 0.809);
Overall score: 18.66 %
Overall precision: 25.68 %
Average runtime: 59.28 msec
Total memory used: 47.84 MB
Dataset details
Dataset "blogs" Score: 15.57 %, Precision: 21.36 %
Dataset "quizzes" Score: 23.87 %, Precision: 33.66 %
Dataset "tweets" Score: 16.52 %, Precision: 22.02 %
Benchmark 2:
weights: c(1.387779e-17, 0.1239362, 0.7521276, 0.1239362);
Overall score: 17.93 %
Overall precision: 25.01 %
Average runtime: 63.77 msec
Total memory used: 47.94 MB
Dataset details
Dataset "blogs" Score: 15.50 %, Precision: 21.32 %
Dataset "quizzes" Score: 22.11 %, Precision: 32.01 %
Dataset "tweets" Score: 16.19 %, Precision: 21.69 %
Benchmark 3:
same weights. 30000 words.
Overall score: 19.00 %
Overall precision: 26.30 %
Average runtime: 70.29 msec
Total memory used: 49.08 MB
Dataset details
Dataset "blogs" (599 lines, 14587 words, hash 14b3c593e543eb8b2932cf00b646ed653e336897a03c82098b725e6e1f9b7aa2)
Score: 16.13 %, Precision: 22.17 %
Dataset "quizzes" (20 lines, 323 words, hash 07697c9cf45891a1f6da633299f35522711a17e65136ba261702e78e0abd09e1)
Score: 23.87 %, Precision: 33.99 %
Dataset "tweets" (793 lines, 14011 words, hash 7fa3bf921c393fe7009bc60971b2bb8396414e7602bb4f409bed78c7192c30f4)
Score: 17.01 %, Precision: 22.73 %
Benchmark 4:
30,000 words. fixed bug.
Overall score: 21.55 %
Overall precision: 26.23 %
Average runtime: 75.38 msec
Total memory used: 49.43 MB
Dataset details
Dataset "blogs" (599 lines, 14587 words, hash 14b3c593e543eb8b2932cf00b646ed653e336897a03c82098b725e6e1f9b7aa2)
Score: 17.96 %, Precision: 22.04 %
Dataset "quizzes" (20 lines, 323 words, hash 07697c9cf45891a1f6da633299f35522711a17e65136ba261702e78e0abd09e1)
Score: 27.94 %, Precision: 33.99 %
Dataset "tweets" (793 lines, 14011 words, hash 7fa3bf921c393fe7009bc60971b2bb8396414e7602bb4f409bed78c7192c30f4)
Score: 18.75 %, Precision: 22.66 %
Benchmark 5:
30,000 words. bug in order fixed. pick first 10.
Overall score: 21.67 %
Overall precision: 26.36 %
Average runtime: 43.84 msec
Total memory used: 49.05 MB
Dataset details
Dataset "blogs" (599 lines, 14587 words, hash 14b3c593e543eb8b2932cf00b646ed653e336897a03c82098b725e6e1f9b7aa2)
Score: 18.19 %, Precision: 22.30 %
Dataset "quizzes" (20 lines, 323 words, hash 07697c9cf45891a1f6da633299f35522711a17e65136ba261702e78e0abd09e1)
Score: 27.94 %, Precision: 33.99 %
Dataset "tweets" (793 lines, 14011 words, hash 7fa3bf921c393fe7009bc60971b2bb8396414e7602bb4f409bed78c7192c30f4)
Score: 18.87 %, Precision: 22.80 %
Benchmark 6:
same as 5 but with optimizations and using the last benchmark.R
Overall top-3 score: 21.67 %
Overall top-1 precision: 16.08 %
Overall top-3 precision: 26.36 %
Average runtime: 37.91 msec
Total memory used: 49.09 MB
Dataset details
Dataset "blogs" (599 lines, 14587 words, hash 14b3c593e543eb8b2932cf00b646ed653e336897a03c82098b725e6e1f9b7aa2)
Score: 18.19 %, Top-1 precision: 13.35 %, Top-3 precision: 22.30 %
Dataset "quizzes" (20 lines, 323 words, hash 07697c9cf45891a1f6da633299f35522711a17e65136ba261702e78e0abd09e1)
Score: 27.94 %, Top-1 precision: 20.79 %, Top-3 precision: 33.99 %
Dataset "tweets" (793 lines, 14011 words, hash 7fa3bf921c393fe7009bc60971b2bb8396414e7602bb4f409bed78c7192c30f4)
Score: 18.87 %, Top-1 precision: 14.10 %, Top-3 precision: 22.80 %