---
title: "Text Prediction"
author: "herchu"
date: "Sunday, March 29th, 2015"
output: html_document
---
```{r, readcsv, cache=T, echo=F}
set.seed(1234)
# Pre-computed frequency list: the 1,000 most common words in the training set.
commonwords <- read.csv('data/freq.1.all.txt', sep=" ", stringsAsFactors=F, nrows=1000, header=F, col.names=c('Freq','Word'))
# Cumulative coverage by word popularity rank; keep a random sample of 10,000 rows for plotting.
covwords <- read.csv('data/coverage-words.txt', sep=" ", header=F, col.names=c('Position','Freq'))
covwords <- covwords[sample(1:nrow(covwords), 10000, replace=FALSE),]
# Same cumulative coverage, but for 2-grams.
cov2grams <- read.csv('data/coverage-2gram.txt', sep=" ", header=F, col.names=c('Position','Freq'))
cov2grams <- cov2grams[sample(1:nrow(cov2grams), 10000, replace=FALSE),]
```
### Introduction

In spite of the miniaturization, computerization and technological achievements of the last decades, typing text on a keyboard remains pretty similar to what it was a century ago. On portable digital devices like smartphones, where the keyboard is extremely small, the task has become uncomfortable and slow. The popularization of these gadgets pressures the industry to find easier ways of entering text. Predicting the next word a user wants to type, with a reasonable success rate, can reduce typing significantly.

In this report I describe a proof of concept of such a technique, to be implemented as a web application: the algorithm that will drive the application, its constraints and how to measure its effectiveness.
### The Corpus

Succinctly, the idea is to take a giant database of text, split it into small short phrases, sort them by popularity (frequency of occurrence), and store them in a local database. When a user types a sentence the application takes this partial phrase and looks for the best match in the database; if it finds one it suggests the rest of the phrase to the user. This giant database of text is called a *corpus*.

For this exercise the corpus consists of three files of English text. The first includes the content of web blogs, the second contains news and the third contains tweets. These files were provided by Coursera as part of the Data Science Capstone project.

| File | Number of lines |
|:--------------------:|:-----------------:|
| en_US.blogs.txt | 899,288 |
| en_US.news.txt | 1,010,242 |
| en_US.twitter.txt | 2,360,148 |

The first thing I did was to split these files into training and test sets. The training set is 50% of the corpus; the rest will be used during the validation and test phase.
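A minimal sketch of how this split could look (the file paths, output names and the random 50/50 split by line are assumptions):

```{r, eval=FALSE}
# Sketch only: split one corpus file into a 50% training and 50% test set.
set.seed(1234)
split_corpus <- function(infile, train_file, test_file, train_frac = 0.5) {
  lines <- readLines(infile, encoding = "UTF-8", skipNul = TRUE)
  in_train <- runif(length(lines)) < train_frac   # random 50/50 assignment per line
  writeLines(lines[in_train], train_file)
  writeLines(lines[!in_train], test_file)
}
split_corpus("data/en_US.blogs.txt", "data/train.blogs.txt", "data/test.blogs.txt")
```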
### Corpus Description

The rest of the study was done over the training set, and in order to compare these results fairly with other studies the statistics are reported as relative rather than absolute values.

During the pre-processing phase I translated all characters to lowercase and split the text into words using regular expressions.
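A minimal sketch of this pre-processing step (the exact regular expression used to produce the counts below may differ; the one shown here is an assumption):

```{r, eval=FALSE}
# Sketch only: lowercase a line and split it into word tokens with a regular expression.
tokenize <- function(line) {
  line <- tolower(line)
  tokens <- unlist(strsplit(line, "[^a-z']+"))   # keep letters and apostrophes only (assumption)
  tokens[tokens != ""]
}
tokenize("Typing text on keyboards remains pretty similar to what it was!")
```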
Due to the way these files are organized, the number of words per line differs significantly among the three. Nonetheless, they all share roughly the same ratio of words to file size, which can be interpreted as a density: lexical words per byte.
```{r pander0, echo=F}
library(pander)
panderOptions("table.split.table", 1000)
Words.per.bytes <- c(
18735811/100274519, # blogs
17035050/97645292, # news
14844372/77755789 # tweets
)
File <- c('en_US.blogs.txt','en_US.news.txt','en_US.twitter.txt')
df <- data.frame(File, Words.per.bytes)
pander(df)
```
The inverse of these values *is not* the average English word length in characters, as the total byte count includes punctuation, numbers, symbols, etc.
There are some interesting facts within this corpus.

1. Hash-tags make up `r 3106/423633*100`% of the distinct lemmas.
2. `r 30878/423633*100`% of the distinct words contain non-English characters,
3. and `r 16947/423633*100`% are non-alphabetic distinct lemmas.
Counting the occurrence of words and taking the most common ones shows the expected results.
##### Table with the first 150 most common words
```{r pander, echo=F}
library(pander)
panderOptions("table.split.table", 1000)
out <- matrix(head(commonwords$Word, n=150), ncol=15)
pander(out)
```
----------
However, the lemma *t* surprisingly appears among the most popular words. This is because its occurrences add up across many different meanings: *t* as an abbreviation for tablespoon in recipes, *t* in contractions with a missing apostrophe, and *t* as an abbreviation of "the" and "to". In this set there is also the lemma *u*, which of course is the popular abbreviation of the word "you" in informal written language.
Another finding is that the word "obama" occurs about as often as other common words like "welcome". This suggests that the corpus for this kind of application must be a living entity.
Misspellings and foreign lemmas also show some commonality: "adios" and "tomorow" have the same frequency as the beautiful but uncommon word "triangular".
As you can see, foreign words were not filtered out: if they appear in an English corpus, it is for a reason.
While frequency distributions are commonly shown using histograms, in this case I prefer to graph them as cumulative frequencies relative to the total number of words.
```{r, echo=F}
# plot(covwords, yaxt="n")
# axis(2, at=pretty(covwords$Freq), lab=paste0(pretty(covwords$Freq)*100, "%"), las=T)
library(ggplot2)
library(scales)
qplot(Position, Freq, data=covwords, colour=I("blue"))+
scale_y_continuous(labels=percent, name="Coverage (in percentage over the corpus)")+
scale_x_continuous(labels=comma, name="Popularity Order (smaller numbers means more common)")
```
The relationship becomes more readable when the x axis is plotted on a logarithmic scale.
```{r, echo=F}
#plot(covwords, log="x", yaxt="n")
#axis(2, at=pretty(covwords$Freq), lab=paste0(pretty(covwords$Freq)*100, "%"), las=T)
qplot(Position, Freq, data=covwords, colour=I("magenta"))+
scale_x_log10(labels=comma, name="Popularity Order (smaller numbers means more common. log scale)")+
scale_y_continuous(labels=percent, name="Coverage (in percentage over the corpus)")+
annotation_logticks(sides="tb")
```
A layman might be surprised by the linearity of the relationship in this chart. Since the x axis is on a logarithmic scale, the apparent straight line means that coverage grows roughly logarithmically with the popularity rank. As a matter of fact this is a well-known behavior described by Zipf's law, see <http://en.wikipedia.org/wiki/Zipf%27s_law>.
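In its simplest form Zipf's law states that the frequency of the word with popularity rank $r$ is roughly proportional to $1/r$, so the coverage obtained from the first $R$ words grows approximately like $\sum_{r=1}^{R} 1/r \approx \ln R$: a straight line when $R$ is drawn on a logarithmic axis.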
Those 150 lemmas cover 50% of word usage. While a dictionary of 1,000 words would cover 70% of word usage, increasing the coverage to 80% would require roughly tripling its size, to approximately 3,000 words.
This property is key to deciding the dictionary size that best fits the application's restricted environment.
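A minimal sketch of how that decision could be read directly from the coverage data loaded above (assuming `covwords$Freq` is the cumulative coverage fraction, as the plots suggest):

```{r, eval=FALSE}
# Sketch only: smallest dictionary size that reaches a target coverage.
words_for_coverage <- function(target, coverage = covwords) {
  min(coverage$Position[coverage$Freq >= target])
}
words_for_coverage(0.70)   # about 1,000 words
words_for_coverage(0.80)   # about 3,000 words
```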
If instead of single words we take unique pairs of words and measure their frequencies, we build what are called bi-grams or 2-grams. Plotting them shows a similar pattern, although it takes many more of them to obtain the same increase in coverage; a sketch of how such 2-grams can be counted follows the plot.
```{r, echo=F}
#plot(cov2grams, log="x", yaxt="n")
#axis(2, at=pretty(covwords$Freq), lab=paste0(pretty(covwords$Freq)*100, "%"), las=T)
qplot(Position, Freq, data=cov2grams, colour=I("red")) +
scale_x_log10(labels=comma, name="Popularity Order (smaller numbers means more common. log scale)")+
scale_y_continuous(labels=percent, name="Coverage (in percentage over the corpus)")+
annotation_logticks(sides="tb")
```
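A minimal sketch of how such 2-grams could be counted, reusing the hypothetical `tokenize` helper from the pre-processing sketch above:

```{r, eval=FALSE}
# Sketch only: count pairs of adjacent words (2-grams) and sort them by frequency.
count_2grams <- function(tokens) {
  pairs <- paste(tokens[-length(tokens)], tokens[-1])
  sort(table(pairs), decreasing = TRUE)
}
count_2grams(tokenize("the cat sat on the mat and the cat slept"))
```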
### The application
The proof of concept will be implemented as a very simple web application consisting of a text-entry box and two small information panels, one above and one below the text box. The user will type whatever she wants in the text box and the predicted word will appear in the top panel. If she presses Tab the word will be copied into the box, avoiding the need to type it.

The bottom panel will display some statistics about the performance of the application, such as the number of keystrokes saved for the user and the prediction success and failure rates.
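One way the interface could be sketched, assuming the application is built with Shiny (the widget names and the `predict_next` helper are illustrative assumptions):

```{r, eval=FALSE}
# Sketch only: a top prediction panel, the text box, and a bottom statistics panel.
library(shiny)

ui <- fluidPage(
  textOutput("prediction"),                         # top panel: suggested next word
  textInput("typed", label = NULL, width = "100%"),
  textOutput("stats")                               # bottom panel: keystrokes saved, hit/miss rates
)

server <- function(input, output) {
  output$prediction <- renderText(predict_next(input$typed))  # predict_next(): hypothetical engine call
  output$stats      <- renderText("keystrokes saved: 0 | hits: 0 | misses: 0")
}

shinyApp(ui, server)
```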
### The engine inside the application
1. The application will ignore all punctuation symbols except for the apostrophe.
2. Anything other than words, such as numbers, will be ignored.
3. There will be no special treatment for foreign languages: if a foreign term proves to be popular it will be included.
4. The same goes for typos.
5. All words will be stored in lower case (see 7).
6. Profanities won't be filtered out, but
7. a dictionary will translate proper names to uppercase and mask profanities with special characters.
8. After a full stop the next suggested word will be capitalized.
Internally the application will hold an n-gram dictionary (the first n-1 words being the predictors, the nth the result) for n from 2 to 4.
If there is no match in the 4-gram table the algorithm will fall back to the 3-gram table and try to match there, and so on, as sketched below.
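A minimal sketch of this back-off, assuming each n-gram table is a named character vector mapping the n-1 predictor words (joined by spaces) to the most frequent next word (the data structure is an assumption):

```{r, eval=FALSE}
# Sketch only: try the 4-gram table first, then fall back to 3-grams and 2-grams.
predict_next <- function(words, tables) {
  for (n in 4:2) {
    if (length(words) >= n - 1) {
      key <- paste(tail(words, n - 1), collapse = " ")
      hit <- tables[[as.character(n)]][key]
      if (!is.na(hit)) return(unname(hit))
    }
  }
  NA_character_   # no prediction at any order
}

tables <- list(`4` = c("one of the" = "best"),
               `3` = c("of the"     = "best"),
               `2` = c("the"        = "first"))
predict_next(c("one", "of", "the"), tables)   # "best"
predict_next(c("the"), tables)                # "first"
```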
Increasing the coverage while maintaining a limited memory footprint can be achieved by storing the internal structures as hashes instead of literal strings.
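A minimal sketch of one way to do that, replacing literal word strings by small integer indices packed into a single numeric key (the 16-bit-per-word encoding is an assumption):

```{r, eval=FALSE}
# Sketch only: encode an n-gram key as one number instead of storing the strings.
vocab <- c("one", "of", "the", "best", "first")   # dictionary of known words
word_id <- function(w) match(w, vocab)            # word -> integer index in the dictionary

# Pack up to three 16-bit word indices into a single numeric key (assumes < 65,536 words).
encode_key <- function(words) sum(word_id(words) * 2^(16 * (seq_along(words) - 1)))

encode_key(c("of", "the"))   # compact key replacing the literal string "of the"
```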