Text Classification using Scikit-Learn (sklearn)
This notebook classifies emails received on a mass distribution group by their subject line, using hand-labelled categories (supervised learning). The solution covers preprocessing (stopword removal and lemmatization with nltk) and feature extraction with a count vectorizer and a TF-IDF transformer. It is a vanilla implementation that can serve as a starting point for other text classification problems.
Things that can be tweaked to improve accuracy...
- Add more parameter configurations to GridSearchCV
- Increase the number of folds (K) used by GridSearchCV through its cv parameter. The default was 3 when this was written and is 5 from scikit-learn 0.22 onward.
- Increase the dataset (current dataset is only 500 emails)
- The classes in the dataset are skewed, with varying proportions. The dataset can be balanced by oversampling, or the weight of each class can be adjusted if the classifier allows it.
- Try different classifiers or model stacking
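As a rough illustration of the oversampling idea in the list above, here is a minimal sketch that uses only the standard library. The `oversample` helper and the toy subjects are hypothetical and not part of the original notebook; it simply duplicates randomly chosen minority-class rows until every class is as large as the biggest one.

```python
import random
from collections import Counter


def oversample(subjects, labels, seed=0):
    """Duplicate random minority-class rows until all classes match the largest."""
    rng = random.Random(seed)

    # Group the subjects by their label.
    by_class = {}
    for subject, label in zip(subjects, labels):
        by_class.setdefault(label, []).append(subject)

    # Every class is grown to the size of the largest class.
    target = max(len(rows) for rows in by_class.values())

    out_subjects, out_labels = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for subject in rows + extra:
            out_subjects.append(subject)
            out_labels.append(label)
    return out_subjects, out_labels


# Toy data (hypothetical): one Real-Estate email against three Sale emails.
subjects = ["flat for rent", "car for sale", "bike for sale", "sofa for sale"]
labels = ["Real-Estate", "Sale", "Sale", "Sale"]

balanced_subjects, balanced_labels = oversample(subjects, labels)
print(Counter(balanced_labels))
```

Oversampling should be applied to the training portion only, otherwise duplicated rows can leak into the validation folds. The class-weight alternative needs no resampling: SGDClassifier accepts `class_weight='balanced'`.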
Quick Info...
- Dataset: The dataset is a CSV file with the columns 'Subject' and 'Category' (the target variable) for about 500 emails. I'm not sharing the dataset because it comes from real emails in my inbox. Replace it with your own dataset that has these two columns.
- Features: The feature matrix is created with sklearn.feature_extraction.text.CountVectorizer, which produces a matrix of token counts, and sklearn.feature_extraction.text.TfidfTransformer, which reweights and normalizes that count matrix.
- Classifier: sklearn.linear_model.SGDClassifier
- Pipeline and GridSearchCV: sklearn.pipeline.Pipeline and sklearn.model_selection.GridSearchCV are two of the most useful tools in sklearn. A Pipeline chains a series of steps over the data, so you do not have to create each object individually or manage parameters, return values and the hand-off of data between steps. GridSearchCV handles parameter tuning by trying every combination in a parameter grid and scoring each one with cross-validation (3-fold by default in the version used here, 5-fold from scikit-learn 0.22 onward). Together they reduce code complexity and improve the readability of a solution.
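To make the hand-off between steps concrete, the sketch below fits the same three-step pipeline on four made-up subject lines. The toy data and the `toy_` names are assumptions for illustration only. It also shows the `step__parameter` naming that a parameter grid relies on.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Toy data (hypothetical), standing in for the 'Subject' and 'Category' columns.
toy_subjects = [
    "car for sale",
    "selling honda city",
    "coorg trip advice",
    "weekend trip plan",
]
toy_labels = ["Automobile", "Automobile", "Travel-Fun", "Travel-Fun"]

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(random_state=0)),
])

# One call fits every step in order, feeding each step's output to the next.
toy_pipeline.fit(toy_subjects, toy_labels)

# The fitted steps remain accessible by name.
counts = toy_pipeline.named_steps["vect"].transform(["car for sale"])
tfidf = toy_pipeline.named_steps["tfidf"].transform(counts)
print(counts.shape, tfidf.shape)  # same shape: one row, one column per token

# Parameters of a step are addressed as '<step name>__<parameter name>'.
param_names = toy_pipeline.get_params()
print("vect__max_df" in param_names, "clf__alpha" in param_names)

prediction = toy_pipeline.predict(["car for sale"])
print(prediction)
```

The double-underscore names are the keys a GridSearchCV parameter grid uses to reach into individual pipeline steps.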
In [1]:
import numpy as np
import pandas as pd
from pprint import pprint
from time import time
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
# Stemming is not used because it gave no observed performance improvement.
#from nltk.stem.porter import *
In [2]:
emails = pd.read_csv('emails.csv')
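# Take an explicit copy so the 'Subject' column can be overwritten later without a SettingWithCopyWarning
em = emails.dropna(axis=0).copy()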
em.sample(3)
Out[2]:
In [3]:
em['Category'].value_counts()
Out[3]:
In [4]:
# Requires the NLTK 'stopwords' and 'wordnet' corpora (see nltk.download)
def pre_process_text(textArray):
    # If using stemming...
    # stemmer = PorterStemmer()
    wnl = WordNetLemmatizer()
    # Build the stopword set once; calling stopwords.words() for every token is very slow
    stop_words = set(stopwords.words('english'))
    processed_text = []
    for text in textArray:
        # Lowercase and split on whitespace
        words_list = str(text).lower().split()
        # Drop stopwords and lemmatize what remains
        final_words = [wnl.lemmatize(word) for word in words_list if word not in stop_words]
        # If using stemming...
        # final_words = [stemmer.stem(word) for word in words_list if word not in stop_words]
        processed_text.append(" ".join(final_words))
    return processed_text

em['Subject'] = pre_process_text(em['Subject'])
In [5]:
categories = [ 'Real-Estate', 'Automobile', 'Travel-Fun', 'Recommendation', 'Sale', 'Other', 'Relocation']
In [6]:
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
In [7]:
# Every additional parameter value multiplies the number of combinations to fit, so training time grows quickly.
# I'm running on a relatively slow computer, hence the reduced value lists.
parameters = {
    'vect__max_df': (0.5, 1.0),#0.6, 0.7, 0.8, 0.9, 1.0),
    'vect__max_features': (None, 1000, 5000),#2000, 3000, 4000, 5000, 6000, 10000, 20000, 30000, 40000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),#, (1, 3)), # unigrams or bigrams or trigrams
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.1, 0.01, 0.001),#, 0.0001, 0.00001, 0.000001, 0.0000001),
    'clf__penalty': ('l2', 'elasticnet'),
    # SGDClassifier's n_iter was renamed to max_iter (n_iter was removed in scikit-learn 0.21)
    'clf__max_iter': (10, 50)#, 100, 200, 300, 400, 500, 100),
}
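To see why each extra value is costly, the size of the grid can be computed directly. This is a small standard-library sketch, not part of the original notebook. It repeats the grid above and assumes the 3-fold cross-validation mentioned earlier.

```python
# Same grid as the notebook's `parameters` dict. The last key is spelled
# 'clf__n_iter' in older scikit-learn releases.
grid = {
    "vect__max_df": (0.5, 1.0),
    "vect__max_features": (None, 1000, 5000),
    "vect__ngram_range": ((1, 1), (1, 2)),
    "tfidf__use_idf": (True, False),
    "tfidf__norm": ("l1", "l2"),
    "clf__alpha": (0.1, 0.01, 0.001),
    "clf__penalty": ("l2", "elasticnet"),
    "clf__max_iter": (10, 50),
}
n_folds = 3  # assumption: the 3-fold default described in this notebook

# The number of candidates is the product of the list lengths.
n_candidates = 1
for values in grid.values():
    n_candidates *= len(values)

# Each candidate is fitted once per fold.
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)  # prints: 576 1728
```

Adding a single third value to `vect__max_df` alone would raise the count from 576 to 864 candidates, which is why the value lists are kept short here.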
In [8]:
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, refit=True)
print("Grid Search started\n---------------------------------------")
print("Pipeline:", [name for name, _ in pipeline.steps])
print("Grid Search Parameters:")
pprint(parameters)
t0 = time()
grid_search.fit(np.array(em['Subject']), np.array(em['Category']))
print("done in %0.3fs\n----------------------------------------------" % (time() - t0))
print("Best Score: %0.3f\n-------------------------------------------" % grid_search.best_score_)
print("Best Parameters:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
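Beyond the single best score, GridSearchCV records the cross-validation results of every candidate in `cv_results_`. The sketch below runs a deliberately tiny search on made-up data to show how those results can be ranked. The toy subjects, the two-parameter grid and `cv=2` are assumptions chosen so the example runs in well under a second.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data (hypothetical): four subjects per class so that 2-fold CV is possible.
toy_subjects = [
    "car for sale", "sofa for sale", "moving out sale", "bike for sale",
    "coorg trip advice", "weekend trip plan", "goa trip ideas", "trip to ooty",
]
toy_labels = ["Sale"] * 4 + ["Travel-Fun"] * 4

toy_search = GridSearchCV(
    Pipeline([
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", SGDClassifier(random_state=0)),
    ]),
    {"clf__alpha": (0.01, 0.001), "tfidf__use_idf": (True, False)},
    cv=2,  # set explicitly rather than relying on the version-dependent default
)
toy_search.fit(toy_subjects, toy_labels)

# One row per parameter combination, best-ranked first.
results = pd.DataFrame(toy_search.cv_results_)
ranked = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score"]
]
print(ranked)
```

Looking at the whole table shows whether the best candidate is clearly ahead or only marginally better than its neighbours, which matters on a dataset as small as 500 emails.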
In [9]:
test_set = [
'RE: items for sale',
'Coorg trip advice',
'movie tickets for sale',
'Advice needed for treatment of hair fall',
'Moving out sale',
'RE: Selling Honda City'
]
In [10]:
grid_search.best_estimator_.predict(np.array(test_set))
Out[10]: