Using Neural Networks and NLP Word Embedding to Predict Amazon User Review Ratings
Summary
Project Purpose:
The purpose of this project was to build a model that predicts a user's star rating (scale of 1–5) on Amazon digital music albums based on the text written in their review comments. I tested several Neural Network algorithms, trained on sampled user review comments and their corresponding star ratings, and leveraged several Natural Language Processing (NLP) techniques to tokenize and vectorize the text into a "trainable" data structure. By building a supervised classification model that deciphers user opinion from text, I hope my algorithm can potentially be leveraged for other types of sentiment analysis problems, including ones that require unsupervised learning (i.e. don't have star rating classification labels to train on). This is, of course, with the assumption that star rating is a true reflection of user sentiment.
As exemplified in the table below, my model takes sample raw user review data (reviewText column), vectorizes the text from tokenized key words (tokens column), and then predicts the star rating the user would leave. I then compare the predicted star value (pred_stars) to the actual star value in the dataset (diff_pred).
Data and Code Used:
The dataset I used was an open-source dataset from Amazon of ~1.6M user reviews. Since my laptop had processing limitations and I didn't have access to a cloud server, I chose to only train/test on a small random sample (10,000 records) of the full dataset. All my code can be found within my Jupyter notebook here.
Methodology Overview:
To build this model, I ended up testing and hyperparameter tuning six different combinations of algorithms and NLP vectorization techniques. The algorithms I explored were Support Vector Machine (SVM), Deep Neural Network (DNN), and Convolutional Neural Network (CNN) models. To vectorize my text data, I tested three methods: Bag of Words (BOW), training a word embedding vector from scratch, and using the pre-trained Global Vectors for Word Representation (GloVe) word embeddings. Each of these models followed the same workflow shown in the diagram below. After sampling and text preprocessing, the data was partitioned into training and testing sets.
Summary of Results:
As seen in the results below, the DNN with BOW vectorization had the highest accuracy at 62%. While I was disappointed that the accuracy of all my models was only ~60%, I was pleased that each had ≥85% accuracy in predicting within one star rating. But as can be seen from the high training accuracy, all my models appear overfitted.
It's important to keep in mind that interpreting the performance of multi-class models (in this case, one class for each of the 5 stars) is less straightforward than for binary models. For example, it may make sense to look at weighted accuracy or performance for each class level (i.e. star). I did not do a deeper dive into performance metrics, but plan to explore this later on. That being said, I suspect a combination of culprits caused the overfitting in my models. At a high level, these include:
- Imbalanced dataset: My raw data sample was extremely skewed, with ~80% of ratings at 5 stars. While I did use resampling techniques to make the distribution more uniform, it still remained imbalanced. This impact on performance can be particularly pronounced in multi-class models. With more time, I plan to explore other over/undersampling techniques.
- Embedding and training parameters: It's possible there were deficiencies in my word embedding training, or suboptimal hyperparameters (e.g. layers, filters, etc.) in my DNN/CNN architectures. I had tested out regularization functions in Keras, but this ended up decreasing the accuracy on my training data without improving the performance on my validation data. While I used GridSearch to tune my model hyperparameters, the extensive training time limited the range and number of parameters I could reasonably test.
- Insufficient sample size: My sampled dataset was only 10,000 records. If I had more time and a cloud server, I would have considered using the entire dataset of 1.6 million records over more epochs…but I anticipate this would take several days to run. And when I did test out slightly increasing the sample size, it didn’t seem to make much difference. I had also explored using more epochs during my model training, but accuracy seemed to decrease after the first several epochs. I ended up using 10 epochs for each model.
- Not enough model features: It's important to keep in mind that user review comments are an imperfect predictor of what star rating a user may give. By only training my model on user comments, I didn't account for other predictive factors that could impact a given user's star rating. These include how a member reviews other online products, their demographics (e.g. age, location, education, etc.), or other factors relating to the music album itself. Rather than predict what star ratings were given, my model could instead be repurposed as a predictor of user sentiment on a scale of 1–5. My models' accuracy in predicting within one star suggests they could be effective in that role.
Overall, this was a great learning experience for experimenting with different Neural Net architectures in Keras and Natural Language Processing techniques. Below, I’ve provided more detail on each step I took in creating my models.
Text Data Pre-Processing:
The preprocessing and vectorization ended up being a very time-consuming component of this project, largely because of the research and experimentation involved in testing various techniques. The ultimate goal of text preprocessing is to create some type of vector(s) with a numeric representation of each word. This numeric vector is then input into a machine learning model for both training and testing purposes. The general flow of preprocessing I used was 1.) sampling the dataset, 2.) tokenizing, 3.) normalizing, and 4.) vectorizing all of my text data.
Sampling the dataset:
Early on, I realized two key concerns with the source Amazon reviews dataset. First, it was way too large (1.6 million records) to complete training of various models in a reasonable timeframe. I ended up only sampling 10,000 records, which still took a fair amount of time to train certain models. My next realization was that the distribution of star ratings was very skewed. About 80% of records had a 5-star rating, which I felt would have skewed the meaning of my accuracy score (i.e. a model that predicts all reviews as 5 stars would have 80% accuracy, which is deceptively high). To mitigate this when randomly sampling my 10k records, I assigned a heavier sampling probability to records with < 5 stars. The distribution of ratings before and after sampling is shown below.
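For illustration, a minimal sketch of this weighted sampling with pandas (the file name, the `overall` rating column, and the weight value of 4 are illustrative; my notebook's exact weights differed):

```python
import pandas as pd

# Illustrative: load the raw reviews; the actual file name/schema may differ.
reviews = pd.read_json("Digital_Music.json", lines=True)

# Give records with fewer than 5 stars a heavier sampling weight so the
# 10,000-record sample is less dominated by 5-star reviews.
weights = reviews["overall"].apply(lambda stars: 4.0 if stars < 5 else 1.0)
sample = reviews.sample(n=10000, weights=weights, random_state=42)

# Compare the rating distribution before and after sampling
print(reviews["overall"].value_counts(normalize=True).sort_index())
print(sample["overall"].value_counts(normalize=True).sort_index())
```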
Tokenization:
My next step was to tokenize the data so that each word represents a distinct token feature. When doing this, I also made sure to remove “stopwords”, which are common function words such as “the”, “is”, and “at”. Below are snippets from my data before and after tokenization. It is expected that tokens like “n’t” will be formed from contractions (e.g. don’t).
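A minimal sketch of this tokenization step using NLTK (my exact implementation may differ slightly):

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")

stop_words = set(stopwords.words("english"))

def tokenize_review(text):
    """Split a review into word tokens and drop common stopwords."""
    tokens = word_tokenize(text)
    return [tok for tok in tokens if tok.lower() not in stop_words]

print(tokenize_review("I don't think this is the best album they have made"))
# Contractions are split into pieces such as "n't", as noted above.
```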
Normalization:
After tokenizing, I further normalized the text data to reduce redundancy. This was done by removing any non-ASCII characters, removing punctuation, making everything lowercase, and replacing numbers with textual representation (500 = five hundred). I leveraged code snippets from here.
I also applied lemmatization, which is a normalization technique similar to stemming. It reduces words, particularly verbs, to their root form and present tense. A snippet of the same review record after normalization is shown below. Note how "listened" becomes "listen" and "thought" becomes "think".
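A minimal sketch of these normalization steps, assuming NLTK's WordNetLemmatizer and the inflect package for spelling out numbers (the snippets I actually leveraged may differ):

```python
import re
import unicodedata

import inflect
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")

inflect_engine = inflect.engine()
lemmatizer = WordNetLemmatizer()

def normalize_tokens(tokens):
    """Lowercase, strip non-ASCII characters and punctuation, spell out
    numbers, and lemmatize verbs to their root form (e.g. "listened" -> "listen")."""
    normalized = []
    for token in tokens:
        # Drop non-ASCII characters
        token = unicodedata.normalize("NFKD", token).encode("ascii", "ignore").decode("ascii")
        token = token.lower()
        # Remove punctuation
        token = re.sub(r"[^\w\s]", "", token)
        if not token:
            continue
        # Replace numbers with their textual representation (500 -> "five hundred")
        if token.isdigit():
            token = inflect_engine.number_to_words(token)
        normalized.append(lemmatizer.lemmatize(token, pos="v"))
    return normalized

print(normalize_tokens(["Listened", "thought", "500", "café"]))
# ['listen', 'think', 'five hundred', 'cafe']
```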
Vectorization:
The last, and most complicated step is vectorizing the text data as a numeric matrix for input into my algorithms. There are numerous techniques for doing this. The three methods I explored were Bag of Words (BOW), training a word embedding vector from scratch, and using the pre-trained word embedding vector Global Vectors for Word Representation (GloVe) algorithm.
BOW embedding: With this method, a matrix of vectors is created that represents the frequency of each word. Each vector corresponds to a review, and the width of each vector is the number of distinct words in the entire corpus across all records. In each review's vector, the frequency of each word is populated. An example is shown below. While this is the simplest approach to implement, it has several drawbacks. As the number of records and the word vocabulary increase, the vectors become very wide and sparse (mostly zero values). Also, word frequency does not account for the context or ordering of the words.
Below is a code snippet of a BOW test I did with an SVM algorithm on 1,000 sample records. It also shows performance results on the 250 records in my validation data (75/25 split).
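A minimal sketch of what that test looks like (my actual notebook code may differ; `texts` and `stars` are illustrative names for the 1,000 sampled review strings and their labels):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 75/25 split of the 1,000-record sample
X_train, X_val, y_train, y_val = train_test_split(
    texts, stars, test_size=0.25, random_state=42)

# Bag of Words: each column is a distinct word, each cell a word count
vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_train)
X_val_bow = vectorizer.transform(X_val)

# Default SVC parameters (RBF kernel, C=1.0)
svm = SVC()
svm.fit(X_train_bow, y_train)

print("Validation accuracy:", accuracy_score(y_val, svm.predict(X_val_bow)))
```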
Manually trained word embeddings: My second approach was to create word embeddings, which would be manually trained in my model. There are three general approaches to token vectorizing and word embedding:
- Word level: each word is represented as a vector
- Character level: each character is represented as a vector
- N-gram level: overlapping groups of multiple succeeding words/characters are represented as a vector. For example, you can choose N=3 to look for word combinations like "I like candy".
In general, word embeddings represent words in a similar manner if they are used in similar ways; the color blue would have a much more similar numeric representation to red than it would to fish.
I chose method #1, since I felt vectorizing each character would be inefficient for such a large dataset, and I didn't have as much time to test N-gram parameters. Within method #1, I explored both one-hot encoding and word embedding. One-hot encoding outputs, for each record, a 0/1 binary indicator of whether a word exists amongst the entire corpus of distinct words. But like BOW, a big drawback of one-hot encoding is that it creates very wide input vectors and doesn't take word ordering (i.e. context) into account.
Instead, I ultimately used a word embedding technique that creates equal-sized, condensed vectors whose maximum size is based on the number of words in the longest user comment. Each vector contains the distinct index numbers associated with the words in a record's user review. For user reviews that are shorter, the vectors are "padded" with zeros. The image below shows how tokenized data gets converted to training sequences, which are then padded to equal-sized vectors. In the example below, since the first text sequence is only three words long, the rest of its vector is padded with zeros.
Within Keras, there are helpful libraries to tokenize, pad, and vectorize data. Below is an example that tokenizes a sample review record, followed by a code snippet to pad the vectors to be equal size.
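A minimal sketch of that tokenize-and-pad workflow (the sample reviews are illustrative, not from the dataset):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

sample_reviews = [
    "great album",                                            # short review -> padded with zeros
    "listen to this album every day and love every track",
]

# Map each distinct word to an integer index
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sample_reviews)
sequences = tokenizer.texts_to_sequences(sample_reviews)

# Pad every sequence to the length of the longest review
max_length = max(len(seq) for seq in sequences)
padded = pad_sequences(sequences, maxlen=max_length, padding="post")
print(padded)
```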
GloVe embedding: My third approach was to use a pre-trained GloVe embedding. This freely downloadable embedding has initialized weights for words, largely derived from word co-occurrence matrices built over a large text corpus. Keras gives the ability to incorporate these embedding dictionaries into the embedding layer of a neural net model. While these embeddings were used to seed the model, I chose to let them be updated during training. If I had more time, other vectorization approaches I would have explored were pre-trained Word2Vec embeddings and TF-IDF weighting. As seen from the code snippet below, the GloVe library accounted for ~93% of all words in my data's text corpus when I set a max embedding dimension of 50.
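As a sketch of how this works, the GloVe vectors can be loaded into an embedding matrix aligned to the Keras tokenizer's vocabulary (this assumes the 50-dimensional glove.6B file and reuses the `tokenizer` object from the previous snippet):

```python
import numpy as np

EMBEDDING_DIM = 50  # using the 50-dimensional GloVe vectors

# Load the pre-trained GloVe vectors (file name assumes the glove.6B download)
glove_index = {}
with open("glove.6B.50d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        glove_index[values[0]] = np.asarray(values[1:], dtype="float32")

# Build an embedding matrix aligned to the tokenizer's word index;
# words missing from GloVe keep a row of zeros.
vocab_size = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((vocab_size, EMBEDDING_DIM))
hits = 0
for word, idx in tokenizer.word_index.items():
    vector = glove_index.get(word)
    if vector is not None:
        embedding_matrix[idx] = vector
        hits += 1

print(f"GloVe covered {hits / len(tokenizer.word_index):.1%} of the corpus vocabulary")
```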
Modeling:
Due to the large amount of data, particularly when the text is vectorized, I was inclined to explore using Neural Networks. I was particularly intrigued by using a CNN, which is often used in text classification problems. And I was curious how it would perform against a more traditional/less intensive ML algorithm (SVM), which is also a popular choice for text classification.
Support Vector Machines (SVM):
For my SVM model, I used the default parameters for training on my text data, which included a radial basis function kernel and the default regularization strength. The input vectors were created with the BOW encoding technique. Since SVM was not a focus of my project, I did not spend much time tuning the model's parameter values, and treated it as more of a baseline model.
The results of my SVM baseline model are below. The accuracy on 2,500 validation records was ~62%. Also, the accuracy on the training dataset was only 69.1%, which implies this algorithm was not very effective at fitting even the training data.
Deep Neural Network (DNN):
I implemented a DNN using the BOW and trained word embedding text vectorization techniques I described above. I used Keras’ Sequential Model API, where layers are added in sequence. This tutorial was a particularly helpful guide for both implementing Keras and getting a baseline understanding of Neural Networks. I’ll attempt to summarize this below, and I’ve italicized key elements of a NN that have multiple variations I experimented with.
[Very] High Level Description of Neural Networks: Neural networks consist of nodes that feed training data forward from an input layer → one or several hidden layers where all the iterative training occurs → an output layer that contains the weighted predictor probabilities. The output layer has one node for each prediction class (in this case 5). A simplistic diagram is below.
Throughout a NN, each layer of input nodes gets multiplied by calculated weights and a bias variable in order to produce the next layer's output nodes. The initial weights of a model are randomly assigned and then trained through a method called backpropagation. This method uses optimizers to reduce the error between the computed and desired target output. The goal of the optimizer is to minimize the loss function, which is how we quantify that error. During the forward pass, each layer's weighted inputs are passed through an activation function, and backpropagation then adjusts the weights to reduce the loss.
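To restate this a bit more concretely (a generic formulation, not specific to my notebook): each layer l transforms the previous layer's output with a weight matrix, a bias, and an activation function, and backpropagation nudges each weight against the gradient of the loss L at a learning rate η:

```latex
a^{(l)} = f\left(W^{(l)} a^{(l-1)} + b^{(l)}\right)
\qquad
W^{(l)} \leftarrow W^{(l)} - \eta \, \frac{\partial L}{\partial W^{(l)}}
```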
Key parameters in my model:
- Hidden layers: I built a "deep" neural network, which means there were several hidden layers to allow for additional training. In my model with manually trained embeddings, I had to add an embedding layer for my vectorization. This was not required in my BOW model. To help generalize the model, I also tested out adding a Dropout layer, which randomly drops nodes during training. However, it had limited impact on performance accuracy.
- Optimizer: I used an Adam optimizer, because it is widely used and considered to have good performance on a variety of model types.
- Loss function: I used a categorical cross-entropy loss function because this was a multi-class classification problem.
- Activation function: It is possible to apply different activation functions to different layers within a NN. For my initial layer, I used 10 neurons with a ReLU activation function. For my output layer, I used the softmax activation function, which is ideal for multi-class problems because it outputs a probability for each class label. I also tested out adding a regularizer function to my Dropout layer.
The code snippets below show how the DNN was built in Keras.
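A simplified sketch of that architecture (my notebook version may differ in exact layer sizes; `vocab_size`, `max_length`, `X_train_padded`, and the one-hot encoded labels `y_train_onehot` are assumed from the vectorization steps above):

```python
from tensorflow.keras.layers import Dense, Dropout, Embedding, Flatten
from tensorflow.keras.models import Sequential

model = Sequential([
    # Embedding layer trained from scratch (omitted in the BOW variant,
    # where the count vectors feed straight into the first Dense layer)
    Embedding(input_dim=vocab_size, output_dim=50, input_length=max_length),
    Flatten(),
    Dense(10, activation="relu"),        # initial layer: 10 neurons, ReLU
    Dropout(0.2),                        # randomly drops nodes during training
    Dense(5, activation="softmax"),      # one output node per star rating
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# y_train_onehot: star ratings one-hot encoded into 5 columns
history = model.fit(X_train_padded, y_train_onehot,
                    validation_split=0.25, epochs=10, batch_size=32)
```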
Results of DNN:
As shown below, both of my DNN models had accuracy around 60% on the 2,500 validation records. The high training accuracy (92–95%) implies that the models were very overfit. However, as mentioned above, my DNN with BOW performed the best at predicting star ratings within one star (91.2%).
When fitting the model to the training data, it seemed that accuracy on validation data actually started to decrease after ~5 epochs. Below, the X axis refers to epochs, and the Y axis is proportion of star ratings correctly predicted.
Convolutional Neural Network (CNN):
Architecture and Methodology:
CNNs are a type of DNN that are similar in architecture, but have some key additional layers and parameters that are exemplified in the diagram below.
Layer 1 - Input layer: On the far left, the input layer embeds words into low-dimensional vectors. This is done by the Embedding function in Keras, similar to how it was done with my DNNs.
In my CNN model that used pre-trained GloVe word embeddings, I invoked this by creating an embedding layer in my initial step within the CNN architecture. This was done by setting the weights to equal the embedding matrix I imported. The input length was the max length of the padded vector sequences I had created in text processing (described above). While these embeddings seeded my model, I chose to set the embedding to be trainable and have weights continuously updated during model training.
Layer 2 - Convolutional layer: This second layer convolves over the first input layer of embedded word vectors using various filters, and performs element-wise multiplications for each convolution with an activation function (like a DNN). It uses numerous filters of various sizes on subsets of tokens in my input matrix. An example is seen in the diagram below on the left, where a 2x5 filter iterates through two adjacent words at a time in an input sentence (represented in the embedding matrix). It then produces an element-wise product that is recorded in the feature-map output layer (0.51). This sequence is repeated for each cluster of words it filters over.
There are different nonlinear activation functions that can be used in this layer, such as ReLU. In my model I used a softmax activation function, with 128 filters that were 5x5 in dimension and a stride of 1. This layer filtered through various token subsets in my text string input. Like my DNN, I also used an Adam optimizer and a categorical cross-entropy loss function.
Layer 3 - Pooling Layer: The pooling layer in a CNN down-samples from the feature map output to give a summarized and generalized version of the features detected in the input. Therefore, even if the convolutional layer detects a small change in the feature input (such as ordering of words), the pooled feature map can have the same result.
Unlike the diagram above, I used Global Max pooling, which down-samples the entire feature map to a single value. Also, my model’s output (far right) had 5 classes (for each star) instead of 2.
Hyperparameter Tuning: I used scikit-learn's GridSearchCV functionality to test optimal parameter combinations for my CNN GloVe model. The output of these results for the top 20 performing combinations is below. The CNN parameters I tested were the activation function, dropout rate, learning rate, and the number of neurons in the initial layer. I ended up using the default learning rate of 0.1, 5 neurons, a 0.2 dropout rate, and a softmax activation function.
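A rough sketch of that tuning setup, assuming a hypothetical `build_cnn` function that builds and compiles the CNN below and accepts the tuned hyperparameters as keyword arguments (newer Keras versions use the scikeras package instead of the built-in wrapper):

```python
from sklearn.model_selection import GridSearchCV
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

# build_cnn is a hypothetical model-building function; parameter names
# must match its keyword arguments.
model = KerasClassifier(build_fn=build_cnn, epochs=10, batch_size=32, verbose=0)

param_grid = {
    "activation": ["relu", "softmax"],
    "dropout_rate": [0.1, 0.2, 0.3],
    "learning_rate": [0.001, 0.01, 0.1],
    "neurons": [5, 10, 20],
}

grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=3)
grid_result = grid.fit(X_train_padded, y_train_onehot)
print(grid_result.best_score_, grid_result.best_params_)
```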
Below is a code snippet of how I coded the CNN with Python’s Keras library.
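A minimal sketch of that architecture, reusing `embedding_matrix`, `vocab_size`, `EMBEDDING_DIM`, and `max_length` from the earlier snippets (layer sizes and the ReLU activation in the convolutional layer are illustrative; the grid-searched values, e.g. 5 neurons, 0.2 dropout, and softmax activation, can be swapped in):

```python
from tensorflow.keras.layers import (Conv1D, Dense, Dropout, Embedding,
                                     GlobalMaxPooling1D)
from tensorflow.keras.models import Sequential

cnn = Sequential([
    # Seed the embedding layer with the GloVe matrix but keep it trainable
    Embedding(input_dim=vocab_size, output_dim=EMBEDDING_DIM,
              weights=[embedding_matrix], input_length=max_length,
              trainable=True),
    # Convolutional layer: 128 filters sliding over 5 tokens at a time
    Conv1D(filters=128, kernel_size=5, strides=1, activation="relu"),
    # Global max pooling collapses each feature map to a single value
    GlobalMaxPooling1D(),
    Dense(5, activation="relu"),
    Dropout(0.2),
    Dense(5, activation="softmax"),  # one output node per star rating
])

cnn.compile(optimizer="adam", loss="categorical_crossentropy",
            metrics=["accuracy"])
cnn.fit(X_train_padded, y_train_onehot,
        validation_split=0.25, epochs=10, batch_size=32)
```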
Results of CNN:
Despite all of the tuning in my CNN models (and their significantly longer training time), they performed slightly worse than my DNN with BOW. My CNN with GloVe embedding had 57% accuracy on the 2,500 validation records, and my CNN with manually trained word embeddings had 56.2% accuracy. Like my DNNs, both had high accuracy on the training sample (~93%), which implies significant overfitting. However, I was once again pleased that they had ~90% accuracy in predicting within one star.
Again, like my DNNs, the learning curves below show that the accuracy on my validation data started to decrease after ~5 epochs. If given more time, I would continue testing out various parameters, specifically the filter parameters in the convolutional layer of my CNN.
As I continue to learn more about NLP and Neural Network modeling techniques, I hope to revisit this project and potentially improve model performance.