IMDB MOVIE REVIEWS DATASET OVERVIEW


For this analysis we’ll be using a dataset of 50,000 movie reviews taken from IMDb. The data was compiled by Andrew Maas and can be found here: IMDb Reviews.

The original dataset is split evenly, with 25k reviews intended for training and 25k for testing. In this post we instead use 40k reviews for training and 10k for testing your classifier.

IMDb lets users rate movies on a scale from 1 to 10. To label these reviews the curator of the data labelled anything with ≤ 4 stars as negative and anything with ≥ 7 stars as positive. Reviews with 5 or 6 stars were left out.
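
The labelling rule described above can be written down as a small function. This is only a sketch of the rule as stated in the text; the function name label_from_rating is made up for illustration and is not part of the dataset's tooling.

```python
from typing import Optional


def label_from_rating(stars: int) -> Optional[str]:
    """Map a 1-10 star rating to a sentiment label, or None if the review is left out."""
    if not 1 <= stars <= 10:
        raise ValueError("IMDb ratings run from 1 to 10")
    if stars <= 4:
        return "negative"
    if stars >= 7:
        return "positive"
    return None  # 5 and 6 stars are excluded from the dataset


# Show the label assigned to every possible rating
labels = {stars: label_from_rating(stars) for stars in range(1, 11)}
print(labels)
```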

 

Step 1: Download and Combine Movie Reviews

To download the data set, go to IMDb Reviews and click on “Large Movie Review Dataset v1.0”. Once the download completes, you’ll have a file called aclImdb_v1.tar.gz in your downloads folder. Extract the archive, for example with 7-Zip on Windows 10. After extraction, the data has to be imported into Python for further analysis.

The code below imports the required libraries and reads the combined reviews from movie_reviews.csv.

 

import numpy as np
import scipy as sc
import matplotlib.pyplot as plt
import pandas as pd
import spacy
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
import re
from bs4 import BeautifulSoup
import unicodedata
from sklearn.preprocessing import LabelEncoder
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
import seaborn as sns

df=pd.read_csv('../input/movie_reviews.csv')
# LabelEncoder assigns integer codes in sorted label order
number=LabelEncoder()
df['sentiment']= number.fit_transform(df['sentiment'])
reviews=np.array(df['review'])
sentiments=np.array(df['sentiment'])
df.head()
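
The read_csv call above expects a single combined file. One way such a file could be built from the extracted archive is sketched below. It assumes the archive unpacks to a folder named aclImdb containing train/pos, train/neg, test/pos and test/neg directories of .txt files; the helper names collect_reviews and write_reviews_csv are made up for this example, and this is not necessarily how the movie_reviews.csv used in this post was produced.

```python
import csv
from pathlib import Path


def collect_reviews(root):
    """Gather (review, sentiment) rows from <root>/{train,test}/{pos,neg}/*.txt."""
    rows = []
    for split in ("train", "test"):
        for folder, sentiment in (("pos", "positive"), ("neg", "negative")):
            for path in sorted((Path(root) / split / folder).glob("*.txt")):
                rows.append((path.read_text(encoding="utf-8"), sentiment))
    return rows


def write_reviews_csv(rows, out_path):
    """Write the rows to a two-column CSV with headers 'review' and 'sentiment'."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["review", "sentiment"])
        writer.writerows(rows)


if __name__ == "__main__":
    root = Path("aclImdb")  # assumed location of the extracted archive
    if root.is_dir():
        write_reviews_csv(collect_reviews(root), "movie_reviews.csv")
```

Rows collected this way come out grouped by folder, so they would need shuffling before any split that slices the data by position.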

Step 2: Cleaning and Pre-processing of the data

 

The IMDB movie reviews data set contains two columns named ‘review’ and ‘sentiment’.

Before analyzing the data set, the sentiment labels are encoded as integers. LabelEncoder assigns codes in sorted label order, so labels named negative and positive become 0 and 1 respectively. Next we clean and pre-process the text data for further analysis.

The preprocessing consists of the following steps:

1) Removal of HTML tags

2) Removal of accented characters

3) Removal of special characters

4) Stemming (the code uses the Porter stemmer; no separate lemmatization step is applied)

5) Removal of stop words

The code for the above cleaning and pre-processing is attached below:

 

tokenizer=ToktokTokenizer()
stopwords_list=nltk.corpus.stopwords.words('english')
stopwords_list.remove('no')
stopwords_list.remove('not')

def strip_html_tags(text):
    # parse the review as HTML and keep only the visible text
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text()
    return stripped_text
df['review']=df['review'].apply(strip_html_tags)

def remove_accented_chars(text):
    # decompose accented characters and drop the non-ASCII combining marks
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text
df['review']=df['review'].apply(remove_accented_chars)

def remove_special_characters(text):
    # keep only letters, digits and whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text
df['review']=df['review'].apply(remove_special_characters)

def simple_stemmer(text):
    # reduce every word to its Porter stem
    ps=PorterStemmer()
    text=' '.join([ps.stem(word) for word in text.split()])
    return text
df['review']=df['review'].apply(simple_stemmer)
stop=set(stopwords.words('english'))
print(stop)
#removing the stopwords
def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopwords_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopwords_list]
    # join outside the if/else so that both branches return text
    filtered_text =' '.join(filtered_tokens)
    return filtered_text
#Apply function on review column
df['review']=df['review'].apply(remove_stopwords)
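
To see what two of these cleaning steps do to a single review, the sketch below runs the accent-removal and special-character functions on a made-up sentence. Only the standard library is used, and the sample text is an assumption for illustration.

```python
import re
import unicodedata


def remove_accented_chars(text):
    # Decompose accented characters, then drop the non-ASCII combining marks
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("utf-8", "ignore")


def remove_special_characters(text):
    # Keep only letters, digits and whitespace
    return re.sub(r"[^a-zA-Z0-9\s]", "", text)


sample = "The café scene wasn't bad, 10/10!"
step1 = remove_accented_chars(sample)
step2 = remove_special_characters(step1)
print(step1)  # The cafe scene wasn't bad, 10/10!
print(step2)  # The cafe scene wasnt bad 1010
```

Note that the apostrophe is stripped from "wasn't", so contractions may no longer match entries in a stop word list that is applied afterwards.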

Step 3: DIVIDING THE DATA INTO TRAIN AND TEST SETS

Here we divide the entire movie review data set into train and test sets in the ratio 80:20, i.e. we assign the first 80% of the reviews (40,000) to the training set and the rest (10,000) to the test set. The code is attached below:

#division into train and test set
norm_train_reviews=df.review[:40000]
norm_test_reviews=df.review[40000:]

train_sentiments=df.sentiment[:40000]
test_sentiments=df.sentiment[40000:]
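
Slicing by position gives a fair 80:20 split only if the rows are not ordered by sentiment. If that is in doubt, the row positions can be shuffled first. The sketch below shows one way to do this with the standard library; the function name shuffled_split_indices and the seed value are assumptions for this example, not part of the original analysis.

```python
import random


def shuffled_split_indices(n_rows, train_fraction=0.8, seed=42):
    """Return (train_idx, test_idx) after shuffling row positions with a fixed seed."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = round(n_rows * train_fraction)
    return indices[:cut], indices[cut:]


train_idx, test_idx = shuffled_split_indices(50000)
print(len(train_idx), len(test_idx))  # 40000 10000
```

The resulting position lists could then be passed to pandas, for example df.review.iloc[train_idx] and df.review.iloc[test_idx].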

Step 4: FEATURE ENGINEERING: Vectorization of the text data

Here we will be using the following method of vectorization of the text data:

1) Tf-Idf vectorizer

The code for this method is attached below:

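####FEATURE ENGINEERING
from sklearn.feature_extraction.text import TfidfVectorizer
tf=TfidfVectorizer()
# learn the vocabulary and idf weights on the training reviews only
tv_train=tf.fit_transform(norm_train_reviews)
# reuse the fitted vocabulary for the test reviews
tv_test=tf.transform(norm_test_reviews)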
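
As a rough illustration of what TfidfVectorizer computes with its default settings, the sketch below applies the smoothed idf, ln((1 + n) / (1 + df)) + 1, to raw term counts and L2-normalises each row. It is a simplification: tokens are split on whitespace, whereas scikit-learn uses its own token pattern. The function name tfidf_rows and the three toy documents are made up for this example.

```python
import math
from collections import Counter


def tfidf_rows(documents):
    """Smoothed TF-IDF with L2-normalised rows, computed on whitespace tokens."""
    tokenised = [doc.lower().split() for doc in documents]
    vocabulary = sorted({tok for doc in tokenised for tok in doc})
    n_docs = len(tokenised)
    # document frequency: in how many documents each token appears
    doc_freq = Counter(tok for doc in tokenised for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocabulary}
    rows = []
    for doc in tokenised:
        counts = Counter(doc)
        weights = [counts[tok] * idf[tok] for tok in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows


vocab, rows = tfidf_rows(["good movie", "bad movie", "good good plot"])
print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

Words that occur in fewer documents get a larger idf, so in the second toy document "bad" ends up with a higher weight than "movie".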

With the features ready, I will analyse the IMDB movie data set in two ways:

1) Traditional supervised Machine Learning Methods using Logistic Regression.

2) Unsupervised LEXICON based Models.

 

TRADITIONAL SUPERVISED MACHINE LEARNING METHOD

 

In this part of the blog I discuss how to fit a supervised machine learning model. Here I have fitted a Logistic Regression model to the feature vectors derived from the above data set.
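
For intuition, a fitted logistic regression model turns a feature vector into a probability by passing a weighted sum through the sigmoid function. The sketch below shows only that prediction step; the weights, bias and feature values are invented for illustration and are not taken from the model fitted in this post.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class for one sparse feature vector."""
    z = sum(weights.get(tok, 0.0) * val for tok, val in features.items()) + bias
    return sigmoid(z)


# Hypothetical learned weights: positive values push towards class 1
weights = {"good": 2.0, "bad": -2.5}
p_pos = predict_proba({"good": 0.7, "movie": 0.7}, weights, bias=0.0)
p_neg = predict_proba({"bad": 0.8, "movie": 0.6}, weights, bias=0.0)
print(round(p_pos, 3), round(p_neg, 3))
```

A review is assigned to class 1 when this probability is above 0.5 and to class 0 otherwise.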

 

I also tried other models, such as SVM and linear regression, but Logistic Regression performed best, with an accuracy score of 0.8941. The code is attached below:

#fitting of logistic regression
from sklearn.linear_model import LogisticRegression

lr=LogisticRegression(penalty='l2',max_iter=500)
lr.fit(tv_train,train_sentiments)
lr.score(tv_train,train_sentiments)  # accuracy on the training set
predict=lr.predict(tv_test)

lr_tfidf_score=accuracy_score(test_sentiments,predict)
print("lr_tfidf_score :",lr_tfidf_score)
# labels is a keyword argument; rows and columns are ordered 1, then 0
cm_tfidf=confusion_matrix(test_sentiments,predict,labels=[1,0])
print(cm_tfidf)

plt.figure(figsize=(9,9))
sns.heatmap(cm_tfidf, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'Blues_r')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
all_sample_title ='Accuracy Score: {0}'.format(lr_tfidf_score)
plt.title(all_sample_title, size = 15)

The confusion matrix computed above is plotted as a heatmap using the seaborn package.
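
To make the layout of the confusion matrix concrete, the sketch below counts the cells by hand for a handful of made-up predictions, with rows as actual labels and columns as predicted labels in the order 1, 0. The helper names and the toy labels are assumptions for this example.

```python
def confusion_counts(actual, predicted, labels=(1, 0)):
    """Rows are actual labels, columns are predicted labels, both in `labels` order."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Share of samples on the diagonal, i.e. predicted correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 0, 1, 0, 1]
cm = confusion_counts(actual, predicted)
print(cm)            # [[3, 1], [1, 3]]
print(accuracy(cm))  # 0.75
```

The diagonal holds the correctly classified reviews, so the accuracy score is the diagonal sum divided by the total count.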

 


Unsupervised LEXICON based Models.

In this part, we look at how to handle the IMDB movie reviews data set with unsupervised lexicon-based models. Here we implement the AFINN lexicon to analyse the sentiments underlying the data and generate visualizations.

 

With the AFINN lexicon, we calculate a sentiment score for each review using the afinn library and tag it as positive, negative or neutral based on the sign of the score. The Python code and its visualizations are attached below.

####Use of Unsupervised method of SA in nlp(AFINN LEXICONS)
from afinn import Afinn
af=Afinn()
# one AFINN score per review; the sign of the score decides the category
sentiment_scores=[af.score(article) for article in df['review']]
sentiment_category=['positive' if score>0
                    else 'negative' if score<0
                    else 'neutral' for score in sentiment_scores]
df2=pd.DataFrame([list(df['sentiment']),sentiment_scores,sentiment_category]).T
import random
random.seed(10)
df2.columns=['sentiment','sentiment_scores','sentiment_category']
df2['sentiment_scores']=df2.sentiment_scores.astype('float')
df2.head()

df2.groupby(by=['sentiment']).describe()
###visualisation of AFINN lexicons

f, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))
sp = sns.stripplot(x='sentiment', y="sentiment_scores", hue='sentiment', data=df2, ax=ax1)
bp = sns.boxplot(x='sentiment', y="sentiment_scores", hue='sentiment', data=df2, palette="Set2", ax=ax2)
t = f.suptitle('Visualizing Movie Review Sentiment', fontsize=14)
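
The idea behind lexicon scoring can be shown without the afinn library. The sketch below sums word valences from a tiny hand-made lexicon and applies the same sign rule as above. The lexicon entries and their values are hypothetical stand-ins, not the actual AFINN word list, which rates words on an integer scale from -5 to +5.

```python
# Hypothetical mini-lexicon for illustration only
TOY_LEXICON = {"good": 3, "great": 3, "bad": -3, "terrible": -3, "boring": -2}


def lexicon_score(text, lexicon=TOY_LEXICON):
    """Sum the valence of every known word; unknown words contribute 0."""
    return sum(lexicon.get(word, 0) for word in text.lower().split())


def categorise(score):
    """Tag a score by its sign, as in the AFINN code above."""
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"


for review in ["great cast but terrible boring plot", "good film", "it exists"]:
    score = lexicon_score(review)
    print(review, "->", score, categorise(score))
```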


Another lexicon-based model that we can implement here is VADER. The code is attached below:

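from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()
def sentiment_analyzer_scores(sentence):
    # polarity_scores takes one string and returns a dict of neg, neu, pos and compound scores
    score = analyser.polarity_scores(sentence)
    print("{:-<40} {}".format(sentence, str(score)))
    return score
# score each review separately rather than passing the whole column at once
scores=df['review'].apply(sentiment_analyzer_scores)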
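
VADER's compound score lies between -1 and 1, and the cut-offs suggested in the VADER documentation are 0.05 and -0.05. The sketch below applies those cut-offs to dictionaries shaped like the output of polarity_scores. The function name vader_category is made up here, and the example scores are invented rather than computed from real reviews.

```python
def vader_category(scores, threshold=0.05):
    """Label a polarity_scores-style dict by its 'compound' value."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Hand-written dicts in the shape polarity_scores returns; the numbers are made up
examples = [
    {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.81},
    {"neg": 0.55, "neu": 0.45, "pos": 0.0, "compound": -0.62},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
]
categories = [vader_category(example) for example in examples]
print(categories)  # ['positive', 'negative', 'neutral']
```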


Souptik Sarkar

M.Sc in Statistics at University of Hyderabad



Mathematica-City

Mathematica-city is an online education forum for Science students run by Kounteyo, Shreyansh and Souvik. We aim to provide articles related to Actuarial Science, Data Science, Statistics, Mathematics and their applications using different statistical software. Feel free to reach out to us for any kind of discussion on any of the related topics.
