Machine Learning, NLP: Text Classification using scikit-learn, python and NLTK
Latest Update:

I have uploaded the complete code (Python and Jupyter notebook) on GitHub: https://github.com/javedsha/text-classification


Document/text classification is one of the important and typical tasks in supervised machine learning (ML). Assigning categories to documents (which can be web pages, library books, media articles, gallery items, etc.) has many applications, such as spam filtering, email routing, and sentiment analysis. In this article, I would like to demonstrate how we can do text classification using Python, scikit-learn, and a little bit of NLTK.


Disclaimer: I am new to machine learning and also to blogging (this is my first post). So, if there are any mistakes, please do let me know. All feedback is appreciated.

Let’s divide the classification problem into the following steps:

  1. Prerequisites and setting up the environment.
  2. Loading the data set in Jupyter.
  3. Extracting features from text files.
  4. Running ML algorithms.
  5. Grid search for parameter tuning.
  6. Useful tips and a touch of NLTK.

Step 1: Prerequisites and setting up the environment

The prerequisites for this example are Python 2.7.3 and Jupyter Notebook. You can just install Anaconda and it will get everything for you. A little bit of Python and ML basics, including familiarity with text classification, is also required. We will be using scikit-learn (Python) libraries for our example.

Step 2: Loading the data set in Jupyter.

The data set we will be using for this example is the famous “20 Newsgroups” data set. About the data, from the original website:

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his Newsweeder: Learning to filter netnews paper, though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

This data set is built into scikit-learn, so we don’t need to download it explicitly.

i. Open the command prompt in Windows and type ‘jupyter notebook’. This will open the notebook in your browser and start a session for you.

ii. Select New > Python 2. You can give the notebook a name, e.g. Text Classification Demo 1.



iii. Loading the data set (this might take a few minutes, so be patient):

from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)


 Note: Above, we are only loading the training data. We will load the test data separately later in the example.

iv. You can check the target names (categories) and some data files with the following commands:

twenty_train.target_names  # prints all the categories
print("\n".join(twenty_train.data[0].split("\n")[:3]))  # prints the first three lines of the first data file

Step 3: Extracting features from text files.

Text files are actually ordered sequences of words. In order to run machine learning algorithms, we need to convert the text files into numerical feature vectors. We will be using the bag-of-words model for our example. Briefly, we segment each text file into words (for English, splitting by space), count the number of times each word occurs in each document, and finally assign each word an integer id. Each unique word in our dictionary will correspond to a (descriptive) feature.

Scikit-learn has a high-level component, ‘CountVectorizer’, which will create feature vectors for us. More about it in the scikit-learn documentation.

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

Here, by doing ‘count_vect.fit_transform(twenty_train.data)’, we are learning the vocabulary dictionary, and it returns a Document-Term matrix of shape [n_samples, n_features].
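You can also peek at the learned vocabulary to see the integer id assigned to any word (an illustrative one-liner; ‘algorithm’ is just an example token):

count_vect.vocabulary_.get('algorithm')  # integer id assigned to the word 'algorithm'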

TF: Just counting the number of words in each document has one issue: it will give more weight to longer documents than to shorter ones. To avoid this, we can use the term frequency (TF), i.e. count(word) / total words, in each document.

TF-IDF: Finally, we can even reduce the weight of more common words (the, is, an, etc.) which occur in almost all documents. This is called TF-IDF, i.e. Term Frequency times Inverse Document Frequency.
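Concretely, for a term t in document d, tf-idf(t, d) = tf(t, d) × idf(t). With scikit-learn’s default smoothing (smooth_idf=True), idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the total number of documents and df(t) is the number of documents containing the term t.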

We can achieve both using below line of code:

from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

The last line will output the dimension of the Document-Term matrix -> (11314, 130107).

Step 4. Running ML algorithms.

There are various algorithms that can be used for text classification. We will start with the simplest one, ‘Naive Bayes (NB)’ (don’t think it is too naive! 😃).

You can easily build an NB classifier in scikit-learn with the below two lines of code. (Note: there are many variants of NB, but a discussion of them is out of scope.)

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

This will train the NB classifier on the training data we provided.
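To see the trained classifier in action on unseen text, you can push a couple of toy sentences through the same fitted vectorizer and transformer and predict their categories (a small sketch; the two example sentences are made up):

# transform new documents with the already-fitted vectorizer/transformer (transform, not fit_transform)
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))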

Building a pipeline: We can write less code and do all of the above by building a pipeline as follows:

from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
                     ])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

The names ‘vect’ , ‘tfidf’ and ‘clf’ are arbitrary but will be used later.

Performance of NB Classifier: Now we will test the performance of the NB classifier on the test set.

import numpy as np
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)
predicted = text_clf.predict(twenty_test.data)
np.mean(predicted == twenty_test.target)

The accuracy we get is ~77.38%, which is not bad for a start and for a naive classifier. Also, congrats!!! You have now successfully written a text classification algorithm 👍
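Beyond a single accuracy number, scikit-learn’s metrics module gives a per-category breakdown (precision, recall, F1), which is often more informative on a 20-class problem. A minimal sketch:

from sklearn import metrics
# per-category precision/recall/F1, plus the raw confusion matrix
print(metrics.classification_report(twenty_test.target, predicted,
                                    target_names=twenty_test.target_names))
metrics.confusion_matrix(twenty_test.target, predicted)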

Support Vector Machines (SVM): Let’s try a different algorithm, SVM, and see if we can get better performance. More about it in the scikit-learn documentation.

from sklearn.linear_model import SGDClassifier
text_clf_svm = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',
                                                   alpha=1e-3, n_iter=5,  # in newer scikit-learn versions, use max_iter=5, tol=None instead of n_iter=5
                                                   random_state=42)),
                         ])
_ = text_clf_svm.fit(twenty_train.data, twenty_train.target)
predicted_svm = text_clf_svm.predict(twenty_test.data)
np.mean(predicted_svm == twenty_test.target)

The accuracy we get is ~82.38%. Yippee, a little better 👌

Step 5. Grid Search

Almost all classifiers have various parameters that can be tuned to obtain optimal performance. Scikit-learn provides an extremely useful tool for this, ‘GridSearchCV’.

from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),
              }

Here, we are creating a list of parameters for which we would like to do performance tuning. All the parameter names start with the pipeline step name (remember the arbitrary names we gave). E.g. for vect__ngram_range, we are telling the grid search to try both unigrams and unigrams+bigrams and choose whichever is optimal.

Next, we create an instance of the grid search by passing the classifier, the parameters, and n_jobs=-1, which tells it to use multiple cores of the user’s machine.

gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)

This might take a few minutes to run, depending on the machine configuration.

Lastly, to see the best mean score and the params, run the following code:

gs_clf.best_score_
gs_clf.best_params_

The accuracy has now increased to ~90.6% for the NB classifier (not so naive anymore! 😄), and the corresponding parameters are {'clf__alpha': 0.01, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}.
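Note that best_score_ is the cross-validated score on the training data. Since GridSearchCV refits the best parameter combination on the full training set by default (refit=True), the fitted gs_clf can also be used directly as a classifier on the held-out test set, e.g.:

# evaluate the refitted best estimator on the test set
predicted_gs = gs_clf.predict(twenty_test.data)
np.mean(predicted_gs == twenty_test.target)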

Similarly, we get an improved accuracy of ~89.79% for the SVM classifier with the below code. Note: you can further optimize the SVM classifier by tuning other parameters. This is left for you to explore.

from sklearn.model_selection import GridSearchCV
parameters_svm = {'vect__ngram_range': [(1, 1), (1, 2)],
                  'tfidf__use_idf': (True, False),
                  'clf-svm__alpha': (1e-2, 1e-3),
                  }
gs_clf_svm = GridSearchCV(text_clf_svm, parameters_svm, n_jobs=-1)
gs_clf_svm = gs_clf_svm.fit(twenty_train.data, twenty_train.target)
gs_clf_svm.best_score_
gs_clf_svm.best_params_

Step 6: Useful tips and a touch of NLTK.

  1. Removing stop words (the, then, etc.) from the data. You should do this only when the stop words are not useful for the underlying problem. In most text classification problems they indeed carry little signal, so removing them helps. Let’s see if removing stop words increases the accuracy. Update the code for creating the CountVectorizer object as follows:

from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
                     ])

This is the pipeline we built for the NB classifier. Run the remaining steps as before. This improves the accuracy from 77.38% to 81.69% (a nice improvement). You can try the same for SVM and also while doing grid search.

2. fit_prior=False: When set to False for MultinomialNB, a uniform prior is used. This doesn’t help that much, but it increases the accuracy from 81.69% to 82.14% (not much gain). Try it and see if it works for your data set, as sketched below.
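For reference, here is a sketch of the stop-word pipeline from tip 1 with this flag flipped; only the MultinomialNB argument changes:

text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB(fit_prior=False)),  # uniform class prior instead of the empirical one
                     ])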

3. Stemming: From Wikipedia, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form. E.g. A stemming algorithm reduces the words “fishing”, “fished”, and “fisher” to the root word, “fish”.

We need NLTK, which can be installed from the NLTK website. NLTK comes with various stemmers (details of how stemmers work are out of scope for this article) which can help reduce words to their root form. Again, use this only if it makes sense for your problem.

Below I have used the Snowball stemmer, which works very well for the English language.

import nltk
nltk.download('stopwords')  # the ignore_stopwords option of the Snowball stemmer needs the stopwords corpus
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english", ignore_stopwords=True)

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # wrap the standard analyzer so every token is stemmed
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: ([stemmer.stem(w) for w in analyzer(doc)])

stemmed_count_vect = StemmedCountVectorizer(stop_words='english')

text_mnb_stemmed = Pipeline([('vect', stemmed_count_vect),
                             ('tfidf', TfidfTransformer()),
                             ('mnb', MultinomialNB(fit_prior=False)),
                             ])
text_mnb_stemmed = text_mnb_stemmed.fit(twenty_train.data, twenty_train.target)
predicted_mnb_stemmed = text_mnb_stemmed.predict(twenty_test.data)
np.mean(predicted_mnb_stemmed == twenty_test.target)

The accuracy we get with stemming is ~81.67%, a marginal improvement in our case with the NB classifier. You can also try it with SVM and other algorithms.
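If you want to try the stemmed features with SVM, the same custom vectorizer drops into the SVM pipeline unchanged (a sketch using the same SGDClassifier settings as before):

text_svm_stemmed = Pipeline([('vect', stemmed_count_vect),
                             ('tfidf', TfidfTransformer()),
                             ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',
                                                       alpha=1e-3, n_iter=5, random_state=42)),
                             ])
text_svm_stemmed = text_svm_stemmed.fit(twenty_train.data, twenty_train.target)
predicted_svm_stemmed = text_svm_stemmed.predict(twenty_test.data)
np.mean(predicted_svm_stemmed == twenty_test.target)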

Conclusion: We have worked through a classic NLP problem: text classification. We learned about important concepts like bag of words and TF-IDF, and two important algorithms, NB and SVM. We saw that, for our data set, both algorithms were almost equally matched when optimized. Sometimes, if the data set is large enough, the choice of algorithm can make hardly any difference. We also saw how to perform grid search for performance tuning, and used the NLTK stemming approach. You can use this code on your own data set and see which algorithm works best for you.

Update: If anyone tries a different algorithm, please share the results in the comment section, it will be useful for everyone.

Please let me know if there were any mistakes and feedback is welcome ✌️

Recommend, comment, share if you liked this article.

References:

http://scikit-learn.org/ (code)

http://qwone.com/~jason/20Newsgroups/ (data set)


 

Clear Memory in Python

  1. Clear Memory in Python Using the gc.collect() Method
  2. Clear Memory in Python Using the del Statement

This tutorial will look into methods to free or clear memory in Python during program execution. When a program has to deal with large files, process a large amount of data, or keep a lot of data in memory, it can often run out of memory.

To prevent the program from running out of memory, we have to free or clear memory by clearing variables or data that are no longer needed in the program. We can clear memory in Python using the following methods.

Clear Memory in Python Using the gc.collect() Method

The gc.collect(generation=2) method is used to clear or release unreferenced memory in Python. Unreferenced memory is memory that is inaccessible and cannot be used. The optional argument generation is an integer whose value ranges from 0 to 2. It specifies the generation of objects to collect with the gc.collect() method.

In Python, short-lived objects are stored in generation 0, and objects with a longer lifetime are stored in generation 1 or 2. The lists maintained by the garbage collector are cleared whenever gc.collect() is called with the default generation value of 2.

The gc.collect() method can help decrease memory usage and clear unreferenced memory during program execution. It can prevent the program from running out of memory and crashing by clearing the inaccessible data in memory.
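As a small illustration (a sketch; the exact numbers will vary with interpreter state), gc.collect() returns the number of unreachable objects it found, and you can restrict a pass to a single generation:

import gc

print(gc.get_count())       # objects currently tracked in generations 0, 1 and 2

collected = gc.collect(0)   # collect only the youngest generation
print("Gen-0 pass collected", collected, "objects")

collected = gc.collect()    # full collection (default generation=2)
print("Full pass collected", collected, "objects")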

Clear Memory in Python Using the del Statement

Along with the gc.collect() method, the del statement can be quite useful for clearing memory during a Python program’s execution. The del statement deletes a variable in Python. We can delete variables, such as a large list or array, that we are sure are no longer required by the program.

The below example code demonstrates how to use the del statement to delete the variable.

import numpy as np

a = np.array([1, 2, 3])
del a

Suppose we try to use or access the variable after deleting it. In that case, the program will raise a NameError exception, as the variable we are trying to access no longer exists in the namespace.

Example code:

import numpy as np

a = np.array([1, 2, 3])
del a
print(a)

Output:

NameError: name 'a' is not defined

The del statement removes the variable from the namespace, but it does not necessarily clear it from memory (an object is freed only once no references to it remain). Therefore, after deleting a variable with the del statement, we can use the gc.collect() method to clear any remaining unreferenced objects from memory.

The below example code demonstrates how to use the del statement with the gc.collect() method to clear the memory in Python.

import numpy as np
import gc

a = np.array([1, 2, 3])
del a
gc.collect()

Bradley Cooper: 10 Awesome Things You Probably Never Knew
From being named People’s “Sexiest Man Alive” to earning three consecutive Oscar nominations, Bradley Cooper has had quite an interesting career. His big blue eyes have always won over the hearts of swooning fans, but Cooper had to pay his dues for many years to lead up to the critical acclaim he has earned today. With roles such as the distant war hero Chris Kyle in the controversial American Sniper to a struggling bipolar man just released from a psychiatric hospital in Silver Linings Playbook, Cooper has shown a range of acting skills in the last few years that arguably had not been expected in the past.


The 40-year-old actor has had his ups and downs, appearing in minor roles as a diverse assortment of characters. Living past the pretty-boy persona can be a difficult feat, but despite Cooper’s own personal struggles, he has far surpassed that, showing the world that he is a capable and talented actor. His days of being the beau to some famous actress are over, and now he is front and center. So here are 10 awesome things you probably never knew about Bradley Cooper.

He Auditioned For Green Lantern
Though talk of Bradley Cooper being in the running for an upcoming Green Lantern role is currently traveling around the internet, this wouldn’t be the first time the actor auditioned for the part. Back in 2009, Cooper was a frontrunner against Ryan Reynolds for the titular role in Green Lantern. He told Conan O’Brien on The Tonight Show (via MTV) that during his audition he couldn’t help but imitate Christian Bale’s Batman:
I couldn't not do Christian Bale's Batman when I was doing the audition. I don't know what it was! I put a mask on and the director was like, 'Okay Bradley, be regular and talk.' And I was like, 'Yeah, got it... [in a deep, gravely Batman voice] listen, Sally, we're going to have to take your family away if you don't listen to me!' By the way, that's the worst Batman [impression] ever. I apologize.


He Gained 40 Pounds For His American Sniper Role
For his role as Chris Kyle, Bradley Cooper put on about 40 pounds of pure muscle to resemble the war hero. He did two two-hour workouts a day, and instead of having a year, Cooper had only three months of prep before shooting started. Cooper told Vanity Fair that during his workouts he listened to the exact same playlist Chris Kyle had used when he worked out between shifts as a Navy SEAL. He ate 5,000 calories a day, and by the end of his training he was able to deadlift 415 pounds for five sets of eight reps. He even learned how to hold and shoot the various weapons Kyle used from former Navy SEALs who served with him. He stayed in character for the entire shoot.

He Missed Graduation To Be In Wet Hot American Summer
While Bradley Cooper was finishing up his MFA at The New School, he was beginning his acting career. He had taken some small guest roles on TV shows and even served as a presenter for a travel-adventure series called Globe Trekker. But his film debut came in the cult classic comedy Wet Hot American Summer. The problem was that filming happened to fall right around the time Cooper graduated. He joked with GQ, saying he missed his graduation to “get fucked in the ass by Michael Ian Black”.

He Asked J.J. Abrams To Write Him Off Alias
Bradley Cooper asked J.J. Abrams to write him off of Alias because he thought Abrams was going to fire him anyway. He explained in his GQ interview that his part grew less substantial as the show progressed, and it nearly ended his career. Out of frustration, he asked to be written off despite having no future jobs lined up; within a couple of weeks he tore his Achilles while playing basketball and spent the next year on his couch debating whether or not to quit acting altogether.

His Most Difficult Role Was In The Hangover
For an actor who has played a bipolar man, an experienced war hero, and an FBI agent, it’s hard to believe that Bradley Cooper found his most difficult role to be that of a sunglasses-rocking teacher named Phil. Cooper told The Guardian that his role in the box office hit The Hangover was actually his most difficult yet. He said:
That guy is so different from me. I'm always amazed by it, actually. When I look at that character on screen, I don't see me at all.


James Lipton Knew He’d Be Famous
James Lipton, host of Inside the Actors Studio, predicted Bradley Cooper’s stardom. Not only did he sit in on the auditions for Cooper’s application to the master’s program, but he was particularly drawn to Cooper’s performance. According to Vanity Fair, after Cooper’s master’s thesis performance (in which he performed scenes from The Elephant Man), Cooper’s mother asked Lipton what he thought, and Lipton responded:
He’s going to go all the way. I never predicted that for any other student.


He Was In Sex And The City And Learned To Drive Stick
Bradley Cooper’s first TV appearance after he moved to New York was in an early episode of Sex and the City, where he played one of Sarah Jessica Parker’s hunky love interests. Cooper told The Guardian that upon landing the role there was one very specific requirement: “no tongues”. In a Backstage interview, he divulged that he had one big problem with his newly earned role, though: he didn’t know how to drive a stick shift. He quickly went to a driving school in Manhattan, but it didn’t work out too well, and a stand-in had to drive instead.

He Knew He Wanted To Be An Actor After Seeing Elephant Man
Bradley Cooper knew he wanted to be an actor after seeing David Lynch’s The Elephant Man when he was 12 years old. He told Vanity Fair that he sat on the red couch in his living room sobbing, aware of the dignity and humanity of John Merrick even though Cooper himself was still so young. Cooper recently revived the role of John Merrick in the Broadway revival of the Bernard Pomerance play The Elephant Man.

He Was A Doorman At Morgans Hotel When He First Moved To New York
When he moved to New York to study acting at The New School, Bradley Cooper worked nights at the Morgans Hotel in Manhattan. He told Esquire that every night he had to carry a bunch of matches, and as each new guest was welcomed he would relight all the votive candles and scurry to the door for them. Many celebrities stayed there as well; one night he welcomed Leonardo DiCaprio, who was hot off his Titanic role, and all Bradley could think about was how different the two actors’ lives were.

He’s Super Smart
Not only is Bradley Cooper fluent in French (which has blown up the internet), but he also graduated with honors from Georgetown with an English degree. He told GQ he wrote his thesis on Nabokov's Lolita, and that he didn’t participate in much drama in high school or at Georgetown; he was more of an athlete up until he went for his MFA. Cooper somewhat randomly applied for his master’s at the Actors Studio Drama School in New York, almost as a joke, but ended up getting in. Even during his acting career he has contemplated going back to school to get a Ph.D. in English and teach literature.

Thanks & Cheers
