The greatest challenge to any thinker is stating the problem in a way that will allow a solution.
Bertrand Russell
Using natural language processing (NLP), texts by different authors are used to train a classifier. With the help of these texts, any sentence can be assigned to one of the authors. To understand how written language works and how authors differ, it helps to analyse the context of the sentences. Through visualisation it is easier to see structural differences such as average sentence length, word class ratio and the use of stop words. This notebook is the heart of the project. More details about the notebook and how to use it can be found in its introduction, right at the top. You can open it in Google Colab to use a GPU and have a nice platform for editing. There you can run it out of the box. No setup needed!
- Preparations
- Loading text data
- Collect and clean data
- Creating DataFrame and extract new data
- Store or load DataFrame
- Visualization of data
- Preparation, splitting and normalization
- Hyperparameter tuning
- Model preparation and training
- Save or load model
- Evaluation
- TensorBoard
- author - a more readable form of the label
- word_count
- mean_word_length
- stop_words_ratio - The ratio of stop words to all words
- stop_words_count
- If POS tagging is activated another 16 columns are added:
  - ADJ_count - adjective count
  - ADV_count - adverb count
  - ADP_count - adposition count
  - AUX_count - auxiliary count
  - DET_count - determiner count
  - NUM_count - numeral count
  - X_count - other count
  - INTJ_count - interjection count
  - CONJ_count - conjunction count
  - CCONJ_count - coordinating conjunction count
  - SCONJ_count - subordinating conjunction count
  - PROPN_count - proper noun count
  - NOUN_count - noun count
  - PRON_count - pronoun count
  - PART_count - particle count
  - VERB_count - verb count
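As a rough illustration of how the non-POS columns above could be derived for a single sentence, here is a minimal sketch. The stop word set, the tokenising regex and the function name `sentence_features` are assumptions made for this example; the notebook may compute these values differently.

```python
import re

# Hypothetical, tiny stop word list; the notebook's actual list is not shown here.
STOP_WORDS = {"the", "a", "an", "of", "to", "is", "and", "in", "that", "it"}


def sentence_features(sentence: str) -> dict:
    """Compute simple per-sentence metadata similar to the columns listed above."""
    # Lowercase and keep alphabetic tokens (apostrophes allowed inside words).
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", sentence.lower())
    word_count = len(words)
    stop_words_count = sum(1 for w in words if w in STOP_WORDS)
    return {
        "word_count": word_count,
        "mean_word_length": sum(len(w) for w in words) / word_count if word_count else 0.0,
        "stop_words_ratio": stop_words_count / word_count if word_count else 0.0,
        "stop_words_count": stop_words_count,
    }


features = sentence_features("The world is the totality of facts")
print(features)
```

Applied to every row of the DataFrame, a function like this would yield one metadata column per key.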
Shows the share of data for each author:
Seems like Kant's share is too big.
Distribution of word length by author:
Distribution of sentence length by author:
Why does Hume have so many sentences?
Number of sentences, median sentence length, unique vocabulary count and median stop word ratio:
Hume not only has a lot of sentences, but also very long ones.
Presents the ratio of each word class to the author's total word count:
Plato's sentences seem different from the others, probably because most of his texts are debates.
Gives an overview of the number of sentences containing one of the 20 most common words:
I would have expected 'reason' in one of the first places.
To understand the structure of a sentence, you can visualize it:
Classical Nietzsche
This step prepares the data for the TensorFlow model. To process the text data it needs to be tokenized and encoded. Keras preprocessing methods are used for this.
texts_to_sequences encodes the text to a sequence of integers. Each sequence is padded to the length of the longest available sequence using pad_sequences. The collected metadata (e.g. number of stop words) is normalized and unused columns are removed. Afterwards the two data frames are concatenated.
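The following dependency-free sketch illustrates the idea behind encoding, padding and normalizing. It is not the Keras implementation: the helper names are hypothetical, front padding with zeros is chosen to mirror what I understand to be the pad_sequences default, and min-max scaling is an assumption, since the normalization method is not stated here.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer, most frequent first; 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending frequency, then alphabetically for a deterministic order.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ordered, start=1)}


def encode(texts, word_index):
    """Replace every known word by its integer index; unknown words are skipped."""
    return [[word_index[w] for w in t.lower().split() if w in word_index] for t in texts]


def pad(sequences):
    """Left-pad every sequence with zeros to the length of the longest one."""
    max_len = max(len(s) for s in sequences)
    return [[0] * (max_len - len(s)) + s for s in sequences]


def min_max_normalize(values):
    """Scale a metadata column to the range [0, 1] (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


sentences = ["god is dead", "reason is the slave of the passions"]
word_index = build_word_index(sentences)
padded = pad(encode(sentences, word_index))   # all rows now have equal length
word_counts = min_max_normalize([3, 7])       # e.g. the word_count column
```

After this step every sentence is a fixed-length row of integers, and every metadata column lies in the same numeric range, so both can be fed to the model side by side.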
Afterwards scikit-learn's train_test_split method is used to split the data. At the end two sets of train, validation and label arrays are created, one for the hyperparameter search and one for training the model.
Instead of manually searching for the best hyperparameters for the model, this project uses Keras Tuner.
There are two different ways to create the weights for the embedding layer. You may create your own Word2Vec model using embeddings_trainer.ipynb. For the English language it is also possible to use the weights from the Word2Vec model provided by TensorFlow Hub. At the beginning of the step the Word2Vec model is loaded.
The hypermodel function contains the definition of the model and the ranges for tuning the hyperparameters. The following parameters can be tuned:
- hp_dense_units - Number of units in dense layers
- hp_lstm_units - Number of units in LSTM layers
- hp_dropout - Dropout rate
- hp_learning_rate - Learning rate parameter for the optimizer
- hp_adam_epsilon - Epsilon parameter for Adam
- executions_per_trial - Number of models that should be built and fit for each trial for robustness purposes
- max_epochs - The maximal number of epochs. This number should be slightly bigger than the epochs for the fitting process
- hyperband_iterations - The number of times to iterate over the full Hyperband algorithm
The best hyperparameters are returned by get_best_hyperparameters. A collection of the best models is returned by get_best_models.
This image shows a possible model found by the Keras Tuner search:
The model has two inputs. One passes the encoded and padded sentences to the embedding layer. The other handles the generated metadata. Both branches are concatenated later, and the model ends with a Dense layer whose number of units equals the number of available classes (authors). The selected model is then trained with its fit method on the train and validation data. Three different callbacks are used:
- TensorBoard - For collecting the data for presentation in TensorBoard
- ReduceLROnPlateau - Reduce learning rate when a metric has stopped improving
- EarlyStopping - Stops the training if there is no progress in learning
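To make the behaviour of the last two callbacks more concrete, here is a simplified, framework-free sketch of the underlying logic. It replays a list of validation losses instead of training a model. The function name and default values are assumptions for illustration, and the real Keras callbacks offer more options (for example min_delta and cooldown) that are ignored here.

```python
def train_with_plateau_handling(val_losses, lr=1e-3, factor=0.5,
                                lr_patience=2, stop_patience=4):
    """Replay per-epoch validation losses and apply both plateau rules.

    Returns (epochs_run, final_learning_rate).
    """
    best = float("inf")
    epochs_since_best = 0       # drives early stopping
    epochs_since_lr_step = 0    # drives learning-rate reduction
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Improvement: remember it and reset both counters.
            best = loss
            epochs_since_best = 0
            epochs_since_lr_step = 0
            continue
        epochs_since_best += 1
        epochs_since_lr_step += 1
        if epochs_since_lr_step >= lr_patience:
            lr *= factor            # metric stalled: shrink the learning rate
            epochs_since_lr_step = 0
        if epochs_since_best >= stop_patience:
            return epoch, lr        # no progress for too long: stop training
    return len(val_losses), lr


losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.60, 0.63, 0.64, 0.5]
epochs_run, final_lr = train_with_plateau_handling(losses)
print(epochs_run, final_lr)
```

In this replay the learning rate is halved twice before training stops, so the later, better loss of 0.5 is never reached. That trade-off is why the patience values matter.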
The embeddings_trainer notebook contains a collection of functions to train Word2Vec, Doc2Vec and FastText models. After some tests, the Word2Vec embedding model turned out to work best for this case. The ruakspider folder contains everything needed to crawl certain websites for text, both as training data for the classification model and for training the Word2Vec embedding model.