A Basic Recommendation System
Did you know that there are many interesting machine learning tools and techniques now used so often in our daily lives that we don't even realize we are interacting with them? Well, now you do! And one of the most widely used is the recommendation system.


Recommendation systems are a class of machine learning algorithms, often powered by deep learning, that help users discover relevant items (i.e. products, services, users, etc.). In short, these systems make connections to predict what rating a user might give to a specific item, and then return the top predictions to that user.


These systems are so essential in today's world that the majority of online businesses, including Google, YouTube, and Netflix, utilize them, using a user's data to identify and predict future interests. These systems are intended to, and generally do, improve a user's overall experience by exposing them to services (or items) they otherwise would have missed. Many systems are rather complex (because the engines used by the larger companies are hard to interpret, recommender systems are sometimes referred to as "black boxes"); however, even a basic system can produce favorable results.


In this project I will demonstrate how to construct a simple recommendation system using the Netflix Title Movie Dataset found on kaggle.com. This system takes in a movie title and produces a number of suggested movies.


In addition to the basic packages needed for analysis and visualization, the most important packages we need are TfidfVectorizer and linear_kernel from sklearn. These are needed to build our actual system (more on both below).
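Since the import cell isn't reproduced here, a minimal sketch of it might look like this (the sklearn module paths are as documented; pandas stands in for the basic analysis packages):

```python
# Basic packages for analysis, plus the two sklearn tools used to build the system.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
```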



After examining the "movies" dataset, we can see that it has 12 features, of which I will only be using 2 for our system:

  • title: Title of the Movie / TV Show
  • description: A brief description of the movie

Recommendation System (Content-Based)


There are 5 different types of recommendation systems: Content-Based, Collaborative, Demographic-Based, Utility-Based, and Knowledge-Based. For this project, I will be building a basic content-based system, which requires computing a TF-IDF score.

The TF-IDF (Term Frequency-Inverse Document Frequency) score is the frequency of a word occurring in a document, down-weighted by the number of documents in which it occurs. This reduces the importance of words that occur frequently across plot overviews and, therefore, their significance in computing the final similarity score. TF-IDF has many uses, most importantly in automated text analysis, and it is very useful for scoring words in machine learning algorithms for Natural Language Processing (NLP).

The TF-IDF model computes the score with the help of the following two simple steps:

Step 1: Multiply the local and global components

In this first step, the model multiplies a local component, TF (Term Frequency), with a global component, IDF (Inverse Document Frequency).

Step 2: Normalize the result

Once the multiplication is done, the model normalizes the result to unit length.

As a result of these two steps, words that occur frequently across documents get down-weighted.
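To make the two steps concrete, here is a minimal sketch on a toy corpus (the documents are invented for illustration). With scikit-learn's defaults, TfidfVectorizer multiplies TF by IDF and then l2-normalizes each row:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, invented purely for illustration.
docs = ["the karate kid", "the karate teacher", "a cooking show"]

vec = TfidfVectorizer()       # default: tf * idf, then l2 normalization
X = vec.fit_transform(docs)   # sparse matrix: one tf-idf row per document

# Step 2 (normalization) means every row ends up with unit length:
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
print(np.round(row_norms, 6))  # each norm is 1.0
```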

Stopwords are words that occur so frequently in the language that they rarely convey information about the meaning of a particular document (such as "a", "how", "or", and "but").

linear_kernel: a linear kernel computes the dot product between vectors. It is used when the data can be separated by a single line (i.e. is linearly separable), and it works well for datasets that contain a large number of features (such as our current dataset or other NLP projects).

First, I'll remove the stopwords, replace the NaN values with an empty string so that our system isn't affected by null entries, and then construct the TF-IDF (bag-of-words model) matrix by fitting and transforming the data.
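The cell itself isn't shown, so here is a hedged sketch of that step. The DataFrame name (movies) and its tiny contents are stand-ins for the Kaggle data, but the stop_words / fillna / fit_transform pattern is the one just described:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in for the Kaggle "movies" DataFrame (names and rows are illustrative).
movies = pd.DataFrame({
    "title": ["Cobra Kai", "The Next Karate Kid", "Iris"],
    "description": ["Rivals reopen their karate dojos decades later.",
                    None,  # a missing description, like the NaNs in the real data
                    "A portrait of a novelist and her marriage."],
})

tfidf = TfidfVectorizer(stop_words="english")             # drop English stopwords
movies["description"] = movies["description"].fillna("")  # NaN -> empty string
tfidf_matrix = tfidf.fit_transform(movies["description"])

# On the full dataset, this shape is (number of movies, number of unique terms).
print(tfidf_matrix.shape)
```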


Out[29]:
(7787, 17905)

So for the 7,787 movies in the dataset, there are approximately 17,900 unique words used to describe them.

Cosine Similarity

Cosine similarity is a mathematical computation that tells us the similarity between two vectors A and B. Essentially, we are calculating the cosine of the angle theta between these two vectors, which returns a value ranging from -1, indicating completely opposite vectors, to 1, indicating identical vectors. A value of 0 indicates a lack of correlation between the vectors, and intermediate values indicate intermediate levels of similarity.


*Theta is used to represent a measured angle
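Because the TF-IDF rows are already l2-normalized, the plain dot product of two rows equals the cosine of the angle between them, which is why linear_kernel can stand in for a full cosine-similarity computation. A small sketch (toy documents, illustrative only):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

docs = ["karate tournament drama", "karate kid sequel", "baking competition"]
X = TfidfVectorizer().fit_transform(docs)

# Rows are unit length, so the dot product equals the cosine similarity.
sim = linear_kernel(X, X)
assert np.allclose(sim, cosine_similarity(X))
print(np.round(sim, 3))  # 1.0 on the diagonal; larger values = more similar
```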

I'll use the cosine function to compute the similarity score between movies, where each movie will have a similarity score with every other movie in our dataset. Thus, my model will use the movies' description metadata to calculate and find the movies most similar to the user input. I'll then define my indices by removing any duplicate movie titles, and then create the function for my system:
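Since the function's cell isn't reproduced here, the following is a hedged sketch of what it might look like. The names (indices, get_recommendations) and the toy DataFrame are assumptions, but the pattern matches the description: a de-duplicated title-to-row index, sorting one row of the similarity matrix, and skipping the query movie itself:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy stand-in for the Kaggle data (titles and descriptions are illustrative).
movies = pd.DataFrame({
    "title": ["Cobra Kai", "The Next Karate Kid", "The Real Miyagi", "Iris"],
    "description": [
        "Karate rivals reopen their dojos decades after the tournament.",
        "A karate master trains a troubled teen for a tournament.",
        "A documentary about a legendary karate teacher.",
        "A portrait of a novelist and her marriage.",
    ],
})

tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(movies["description"])
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Map each title to its row position, dropping duplicate titles.
indices = pd.Series(movies.index, index=movies["title"]).drop_duplicates()

def get_recommendations(title, n=10):
    """Return the n titles whose descriptions are most similar to `title`."""
    idx = indices[title]
    scores = sorted(enumerate(cosine_sim[idx]), key=lambda s: s[1], reverse=True)
    movie_indices = [i for i, _ in scores[1 : n + 1]]  # position 0 is the movie itself
    return movies["title"].iloc[movie_indices]

print(get_recommendations("Cobra Kai", n=3))
```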


Now that we've built our very basic system, it's time to test it out. I will request movies similar to Cobra Kai:




3053                        Iris
3043                   Invisible
1986                Elstree 1976
1451              Coffee for All
6705         The Next Karate Kid
2699                Henry Danger
1600    Daniel Sloss: Live Shows
7749                   Yu-Gi-Oh!
1117                    Brothers
6805             The Real Miyagi
Name: title, dtype: object

And there we have it -

Success!


For the full project repository click here.
