A Basic Recommendation System
Did you know that many machine learning tools and techniques are now used so often in our daily lives that we don't even realize we are interacting with them? Well, now you do! And one of the most widely used is the recommendation system.
Recommendation systems are a class of machine learning algorithms, sometimes powered by deep learning, that are used to help discover relevant items (i.e., products, services, users, etc.). In short, these systems make connections in order to predict what rating a user might give a specific item, and then return those predictions to the user.
These systems are so essential in today's world that the majority of online businesses, including Google, YouTube, and Netflix, use them, drawing on a user's data to identify and predict future interests. These systems are intended to, and generally do, improve a user's overall experience by exposing them to services (or items) they would otherwise have missed. Many systems are rather complex (the engines used by the larger companies are so hard to interpret that recommender systems are sometimes referred to as "black boxes"); however, even a basic system can produce favorable results.
In this project I will demonstrate how to construct a simple recommendation system using the Netflix Title Movie Dataset found on kaggle.com. The system takes in a movie name and produces a number of suggested movies.
In addition to the basic packages needed for analysis and visualization, the most important imports we need are TfidfVectorizer and linear_kernel from sklearn. These are what we will use to build the actual system (more on both below).
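As a minimal sketch, the imports might look like the following (assuming pandas and numpy handle the basic analysis):

```python
import numpy as np
import pandas as pd

# The two sklearn tools that power the system itself
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
```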
After examining the "movies" dataset, we can see that it has 12 features, of which I will only be using 2 for our system:
- title: Title of the movie / TV show
- description: A brief description of the movie or show
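Assuming the dataset was downloaded as netflix_titles.csv (the actual file name from kaggle.com may differ), loading it and keeping just those two columns might look like:

```python
# Load the Netflix titles dataset (file name is an assumption;
# adjust it to match the CSV downloaded from kaggle.com)
movies = pd.read_csv("netflix_titles.csv")

# Keep only the two features the system uses
movies = movies[["title", "description"]]
```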
Recommendation System (Content-Based)
The TF-IDF (Term Frequency–Inverse Document Frequency) score is the frequency of a word occurring in a document, down-weighted by the number of documents in which it occurs. This reduces the importance of words that occur frequently across plot overviews, and therefore their significance in the final similarity score. TF-IDF has many uses, most importantly in automated text analysis, and is very useful for scoring words in machine learning algorithms for Natural Language Processing (NLP).
The TF-IDF model computes the score in two simple steps:
Step 1: Multiply the local and global components
In this first step, the model multiplies a local component, TF (Term Frequency), by a global component, IDF (Inverse Document Frequency).
Step 2: Normalize the result
Once the multiplication is done, the model normalizes the result to unit length.
As a result of these two steps, words that occur frequently across the documents are down-weighted.
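Concretely, with scikit-learn's default smoothed formulation, the score of a term t in a document d, over a corpus of n documents, is

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t), \qquad \text{idf}(t) = \ln\left(\frac{1 + n}{1 + \text{df}(t)}\right) + 1$$

where df(t) is the number of documents containing t; each document vector is then divided by its Euclidean (L2) norm.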
Stopwords are those words that occur so frequently in the language that they rarely convey
information about the meaning of a particular document (such as "a", "how", "or", and "but").
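Putting this together, a sketch of vectorizing the descriptions (assuming the movies DataFrame from above) might be:

```python
# Replace any missing descriptions with an empty string
movies["description"] = movies["description"].fillna("")

# Build the TF-IDF matrix over the descriptions,
# dropping English stopwords along the way
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(movies["description"])

# Rows are movies, columns are unique words
print(tfidf_matrix.shape)
```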
Linear_kernel: a linear kernel computes the dot product between pairs of vectors, and is favored when the data can be separated by a single line (i.e., is linearly separable), which is common in datasets that contain a large number of features (such as our current dataset or other NLP projects). Because TF-IDF vectors are already normalized to unit length, their dot product is exactly the cosine similarity, so linear_kernel gives us that score with less computation.
So for the 7,787 movies in the dataset, there are approximately 17,900 unique words used to describe them (the two dimensions of the TF-IDF matrix above).
Cosine Similarity
Cosine similarity is a mathematical computation that tells us the similarity between two vectors A and B. Essentially, we are calculating the cosine of the angle theta between these two vectors, which yields a value ranging from -1, indicating completely opposite vectors, to 1, indicating identical vectors. A value of 0 indicates a lack of correlation between the vectors, and intermediate values indicate intermediate levels of similarity.
*Theta is used to represent a measured angle
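In symbols:

$$\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$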
I'll use the cosine function to compute the similarity score between movies, so that each movie has a similarity score with every other movie in our dataset. My model will thus use the movies' description metadata to calculate and find the titles most similar to the user input. I'll then define my indices by removing any duplicate movie titles and create the function for my system:
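A sketch of that function, built on the pieces above (the exact implementation details here are my assumptions, not a fixed recipe):

```python
# Pairwise similarity scores: because the TF-IDF rows are unit
# length, the dot products from linear_kernel are cosine similarities
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Map each title to its row index, keeping only the first
# occurrence of any duplicated title
indices = pd.Series(movies.index, index=movies["title"])
indices = indices[~indices.index.duplicated(keep="first")]

def get_recommendations(title, cosine_sim=cosine_sim, n=10):
    """Return the n movies whose descriptions are most similar to the given title's."""
    idx = indices[title]

    # Similarity of this movie to every movie in the dataset
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort by score, and skip the first entry (the movie itself)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1 : n + 1]

    # Look up the titles of the top-n matches
    movie_indices = [i for i, _ in sim_scores]
    return movies["title"].iloc[movie_indices]
```

Calling, for example, get_recommendations("Inception") (assuming that title appears in the dataset) returns the ten movies with the most similar descriptions.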