This is an implementation of document classification using four algorithms: Multinomial Naive Bayes, Support Vector Machines, Random Forests and Logistic Regression.
Validation Sore is calculated as ( #correct - #incorrect )/( total )
SVM has shown to have the highest validation score of 94%
- Python 2 or 3
- sklearn python module
- nllk python module and download stop words beforehand if possible
- pickle python module
- sys python module
- time python module
- pathlib python module
- future python module
ALL of these variables can be found in the main() function
- cachedStopWords = False - list of stop words to remove.
you can define your own stop words. if "False" it will download from NLTK and use stopwords provided from them.
cachedStopwords = stopwords.words("english") - training_file = "trainingdata.txt" - this is the path of training file.
- testing_file = "testingdata.txt" - this is the path of the testing file.
- save_model_file = "document_classifier.sav" - this is the file in which the model wll be saved in.
- save_vectorizer_file = "tfidf_vectorizer.sav" - path of the file to save the vectorizer model in (should be a .sav file)
- load_model_status = True - if true it will load the model from the file with the name stored in the variable load_model_from, if False it will create a new model.
- Load_model_from = "document_classifier.sav" - this is the path of the file from which the model will be loaded from.
- load_vectorizer_from = "tfidf_vectorizer.sav" - path of the file to load the vectorizer from. if the file does not exist, it will use the tfidf vectoizer from scikit-learn by default.
- validation_size = 0.25 - this is the split ratio for training vs validation data.
-> size(validation_data) = validation_size * size(total_data) - classifier_to_use = "svm" - this is the classifier to use.
Choose from: "Multinomialnb", "svm", "random_forest" and "logistic_regression"
score() function that will calculate the score as 100*(#correct-#incorrect)/(T)
load_test_file() function to load the testing file data given the testing_file would be a certain format
help() function to display help please note for this function to work you will need the help.txt (path stored in the variable help_file)
"document_classifier" is a class with the following member functions: init(), train(), predict() and save_model()
python document_classification.py
-> normal
python document_classification.py -h
-> enables help
python document_classification.py -help
-> enables help
- The first line contains the number of lines that will follow.
- Each following line will contain a number (1-8), which is the category number. The number will be followed by a space then some space seperated words which is the processed document.
- The first line in the input file will contain T the number of documents.
- T lines will follow each containing a series of space seperated words which represents the processed document.