Natural Language Processing and Sentiment Analysis using TensorFlow

Hasan Faraz Khan · Published in Level Up Coding · Jun 29, 2020

Natural Language Processing (NLP) is a field of Artificial Intelligence and Deep Learning that gives machines the ability to read, understand, and derive meaning from human languages, focusing on the interaction between data science and human language.

This article is a sneak peek that primarily focuses on performing basic sentiment analysis using TensorFlow and Keras. A commonly used approach is a Convolutional Neural Network (CNN). Keras, however, provides Embedding(), GlobalAveragePooling1D(), and LSTM layers, which can be used to build a more accurate machine learning model.

Keras’s Sequential API is easy to use and a common choice for creating a linear stack of layers. It provides an Input layer for taking input, a Dense layer for creating a single fully connected layer of neurons, built-in tf.losses with a range of loss functions to choose from, built-in tf.optimizers, built-in activation functions, and so on. We can also create custom layers, loss functions, etc.

We start by importing TensorFlow and the other requisite libraries and packages.
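A minimal sketch of the imports used throughout this article; the exact set depends on your environment:

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences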

Data Preparation and cleaning.

Import your dataset or CSV file and check value_counts() for the column with all classes.
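Something along these lines, assuming the data lives in a hypothetical reviews.csv with a rating column:

df = pd.read_csv("reviews.csv")        # filename is an assumption
print(df.head(10))                     # first ten rows
print(df["rating"].value_counts())     # class distribution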

First ten rows of the dataframe.

This dataset doesn’t have a column for sentiment classification and this is why we’ll be using df[“rating”] column to create a column for possible sentiments in this pandas dataframe. One possible way of doing this is to create a function that returns “Negative” for lower ratings and “Positive” for relatively higher ratings. The aim is to create a new dataframe which we’ll later be using for encoding.
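A sketch of one such mapping; the 1-to-5 rating scale, the threshold, and the "review" column name are assumptions:

def to_sentiment(rating):
    # Assumed: on a 1-5 scale, 3 or below is "Negative", above 3 is "Positive".
    return "Negative" if rating <= 3 else "Positive"

df["sentiment"] = df["rating"].apply(to_sentiment)
new_df = df[["review", "sentiment"]].copy()   # "review" column name assumed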

Visualizing the data, e.g. with a WordCloud, is always a good choice, as a WordCloud makes words stand out by sizing them according to their frequency of occurrence.

For generating a word cloud in Python, the required modules are matplotlib, pandas, and wordcloud. To install these packages, run the following commands:

pip install matplotlib
pip install pandas
pip install wordcloud
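With those installed, a word cloud over the review text can be generated roughly like this; the figure size and colors are arbitrary choices:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Join all reviews into one large string for the word cloud.
text = " ".join(new_df["review"].astype(str))
wordcloud = WordCloud(width=800, height=400, background_color="white").generate(text)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()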

You can check the WordCloud documentation for more options.

The next step is preprocessing and removing stopwords. The right text-cleaning procedure depends on the task: for text classification or sentiment analysis you should remove stopwords, since they don’t provide any information to the model, but for a task like language translation stopwords are useful. We’ll use NLTK (Natural Language Toolkit) to download these stopwords.
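A minimal sketch of stopword removal with NLTK, applied here to the assumed "review" column:

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

def remove_stopwords(text):
    # Drop every token that appears in NLTK's English stopword list.
    return " ".join(word for word in text.split() if word.lower() not in stop_words)

new_df["review"] = new_df["review"].astype(str).apply(remove_stopwords)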

Train-Test Split

Machine Learning is all about generalizing, so that your inferences are right on unseen data. If you don’t split the data, you end up training on everything available to you; your inferences may be right on that data, but you don’t know how good they would be on data the model has never seen. So we split off validation and test sets to get a sense of how the model performs on unseen data.
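One common way to do this is scikit-learn's train_test_split; the 80/20 ratio below is an assumption, not something fixed by the article:

from sklearn.model_selection import train_test_split

train_sentences, test_sentences, train_labels, test_labels = train_test_split(
    new_df["review"], new_df["sentiment"], test_size=0.2, random_state=42)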

Using Tokenizer and creating padded sequences.

Tokenizer allows us to vectorize a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or a vector where the coefficient for each token could be binary or based on word count.

I have used tf.keras.preprocessing.sequence.pad_sequences to create padded sequences for the sentences. This function transforms a list (of length num_samples) of sequences (lists of integers) into a 2D NumPy array of shape (num_samples, num_timesteps). num_timesteps is either the maxlen argument, if provided, or the length of the longest sequence in the list.
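Putting the two together, a sketch of fitting the Tokenizer on the training sentences and padding them; vocab_size, max_length, and the OOV token are assumed hyperparameters:

vocab_size = 10000   # assumed vocabulary size
max_length = 120     # assumed maximum sequence length

tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(train_sentences)
word_index = tokenizer.word_index

train_sequences = tokenizer.texts_to_sequences(train_sentences)
train_padded = pad_sequences(train_sequences, maxlen=max_length,
                             padding="post", truncating="post")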

After this, we’ll have to tokenize the labels and create padded sequences for our test data as well, because we’ll be using it as a validation set while training our model.
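Roughly like this; note that a label Tokenizer starts its indices at 1, which matters for the output layer size later:

label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(train_labels)
train_label_seq = np.array(label_tokenizer.texts_to_sequences(train_labels))
test_label_seq = np.array(label_tokenizer.texts_to_sequences(test_labels))

test_sequences = tokenizer.texts_to_sequences(test_sentences)
test_padded = pad_sequences(test_sequences, maxlen=max_length,
                            padding="post", truncating="post")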

Building the model using Keras
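A sketch of one such architecture using the Embedding and GlobalAveragePooling1D layers mentioned earlier; embedding_dim and the Dense layer sizes are assumptions:

embedding_dim = 16   # assumed embedding dimension

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation="relu"),
    # 3 output units because label tokens start at 1, leaving index 0 unused.
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.summary()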

After checking the model summary and the number of trainable parameters, all we have to do is train the model on the training data for a decent number of epochs.
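For example; the optimizer, loss, and epoch count are assumptions, since the article only calls for "a decent number of epochs":

model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])

num_epochs = 30   # assumed
history = model.fit(train_padded, train_label_seq,
                    epochs=num_epochs,
                    validation_data=(test_padded, test_label_seq),
                    verbose=2)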

After training completed, I got a training accuracy of 0.9032 and a validation accuracy of 0.8742, which is good. The small gap between the two also shows that this is not a case of overfitting.

Plotting accuracy and loss graphs.
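A small helper along these lines will plot both curves from the History object returned by model.fit:

def plot_graphs(history, metric):
    # Plot the training and validation curves for the given metric.
    plt.plot(history.history[metric])
    plt.plot(history.history["val_" + metric])
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend([metric, "val_" + metric])
    plt.show()

plot_graphs(history, "accuracy")
plot_graphs(history, "loss")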

This is what these graphs look like.

Making predictions on our model.

The first thing we’ll do is tokenize our classes, because we’ll be using these tokens as indices into our class list. The second step, obviously, is to create a class list with all these sentiments.

The third step is to make predictions using model.predict(), which returns an array of probabilities. To get the predicted token or index, we use np.argmax(), as it returns the index with the highest probability (this is where the softmax function plays its role).

This code block allows us to take an input review from the user and turn it into a sequence of word indices. The prediction is then made on the padded sentence, which returns the predicted category for the sample text.
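A sketch of the whole prediction step, reading the class names back out of the label tokenizer rather than hard-coding them:

# Invert the label tokenizer so a predicted index maps back to a class name.
index_to_label = {index: label for label, index in label_tokenizer.word_index.items()}

review = input("Enter a review: ")
sequence = tokenizer.texts_to_sequences([review])
padded = pad_sequences(sequence, maxlen=max_length,
                       padding="post", truncating="post")
prediction = model.predict(padded)
predicted_index = int(np.argmax(prediction))   # index 0 is unused padding
print(index_to_label.get(predicted_index, "unknown"))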

This is how predictions can be made on unseen data.

We can also use LSTMs and RNNs to improve the accuracy of our models. Long Short-Term Memory (LSTM) networks are a type of recurrent neural network capable of learning order dependence in sequence prediction problems.
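For example, swapping the pooling layer for a bidirectional LSTM gives a sequence-aware variant of the same model; the layer sizes here are assumptions:

lstm_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(24, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])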
