A Basic Recommendation System
Did you know that many machine learning tools and techniques are now used so often in our daily lives that we don't even realize we are interacting with them? Well, now you do! And one of the most widely used is the recommendation system.
Recommendation systems are a class of machine learning algorithms, sometimes powered by deep learning, that are used to help discover relevant items (i.e., products, services, users, etc.). In short, these systems make connections in order to predict what rating a user might give a specific item, and then return those predictions to the user.
These systems are so essential in today's world that the majority of online businesses, including Google, YouTube, and Netflix, use them, drawing on a user's data to identify and predict future interests. These systems are intended to, and generally do, improve a user's overall experience by exposing them to services (or items) they would otherwise have missed. Many systems are rather complex (the engines used by the larger companies are so hard to interpret that recommender systems are sometimes referred to as "black boxes"); however, even a basic system can produce favorable results.
In this project I will demonstrate how to construct a simple recommendation system using the Netflix Title Movie Dataset found on kaggle.com. The system takes in a movie name and produces a number of suggested movies.
In addition to the basic packages needed for analysis and visualization, the most important imports we need are TfidfVectorizer and linear_kernel from sklearn. These are what we will use to build the actual system (more on both below).
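As a minimal sketch, the imports might look like the following (assuming pandas and numpy handle the basic analysis):

```python
import numpy as np
import pandas as pd

# The two sklearn tools that power the system itself
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
```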
After examining the "movies" dataset, we can see that it has 12 features, of which I will only be using 2 for our system:
- title: Title of the movie / TV show
- description: A brief description of the movie or show
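Assuming the dataset was downloaded as netflix_titles.csv (the actual file name from kaggle.com may differ), loading it and keeping just those two columns might look like:

```python
# Load the Netflix titles dataset (file name is an assumption;
# adjust it to match the CSV downloaded from kaggle.com)
movies = pd.read_csv("netflix_titles.csv")

# Keep only the two features the system uses
movies = movies[["title", "description"]]
```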
Recommendation System (Content-Based)
The TF-IDF (Term Frequency–Inverse Document Frequency) score is the frequency of a word occurring in a document, down-weighted by the number of documents in which it occurs. This reduces the importance of words that occur frequently across plot overviews, and therefore their significance in the final similarity score. TF-IDF has many uses, most importantly in automated text analysis, and is very useful for scoring words in machine learning algorithms for Natural Language Processing (NLP).
The TF-IDF model computes the score in two simple steps:
Step 1: Multiply the local and global components
In this first step, the model multiplies a local component, TF (Term Frequency), by a global component, IDF (Inverse Document Frequency).
Step 2: Normalize the result
Once the multiplication is done, the model normalizes the result to unit length.
As a result of these two steps, words that occur frequently across the documents are down-weighted.
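Concretely, with scikit-learn's default smoothed formulation, the score of a term t in a document d, over a corpus of n documents, is

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t), \qquad \text{idf}(t) = \ln\left(\frac{1 + n}{1 + \text{df}(t)}\right) + 1$$

where df(t) is the number of documents containing t; each document vector is then divided by its Euclidean (L2) norm.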
Stopwords are those words that occur so frequently in the language that they rarely convey
information about the meaning of a particular document (such as "a", "how", "or", and "but").
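Putting this together, a sketch of vectorizing the descriptions (assuming the movies DataFrame from above) might be:

```python
# Replace any missing descriptions with an empty string
movies["description"] = movies["description"].fillna("")

# Build the TF-IDF matrix over the descriptions,
# dropping English stopwords along the way
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(movies["description"])

# Rows are movies, columns are unique words
print(tfidf_matrix.shape)
```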
Linear_kernel: a linear kernel computes the dot product between pairs of vectors, and is favored when the data can be separated by a single line (i.e., is linearly separable), which is common in datasets that contain a large number of features (such as our current dataset or other NLP projects). Because TF-IDF vectors are already normalized to unit length, their dot product is exactly the cosine similarity, so linear_kernel gives us that score with less computation.
So for the 7,787 movies in the dataset, there are approximately 17,900 unique words used to describe them (the two dimensions of the TF-IDF matrix above).
Cosine Similarity
Cosine similarity is a mathematical computation that tells us the similarity between two vectors A and B. Essentially, we are calculating the cosine of the angle theta between these two vectors, which yields a value ranging from -1, indicating completely opposite vectors, to 1, indicating identical vectors. A value of 0 indicates a lack of correlation between the vectors, and intermediate values indicate intermediate levels of similarity.
*Theta is used to represent a measured angle
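In symbols:

$$\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$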
I'll use the cosine function to compute the similarity score between movies, so that each movie has a similarity score with every other movie in our dataset. My model will thus use the movies' description metadata to calculate and find the titles most similar to the user input. I'll then define my indices by removing any duplicate movie titles and create the function for my system:
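A sketch of that function, built on the pieces above (the exact implementation details here are my assumptions, not a fixed recipe):

```python
# Pairwise similarity scores: because the TF-IDF rows are unit
# length, the dot products from linear_kernel are cosine similarities
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Map each title to its row index, keeping only the first
# occurrence of any duplicated title
indices = pd.Series(movies.index, index=movies["title"])
indices = indices[~indices.index.duplicated(keep="first")]

def get_recommendations(title, cosine_sim=cosine_sim, n=10):
    """Return the n movies whose descriptions are most similar to the given title's."""
    idx = indices[title]

    # Similarity of this movie to every movie in the dataset
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort by score, and skip the first entry (the movie itself)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1 : n + 1]

    # Look up the titles of the top-n matches
    movie_indices = [i for i, _ in sim_scores]
    return movies["title"].iloc[movie_indices]
```

Calling, for example, get_recommendations("Inception") (assuming that title appears in the dataset) returns the ten movies with the most similar descriptions.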