The .explode() and .strftime() methods

December 17, 2021

The .explode() and .strftime() methods - A simple test case

For my first project at Flatiron I was instructed to conduct an analysis using the provided datasets to determine which strategies Microsoft could use when opening a new movie-making studio. I looked at a number of factors to better understand what contributes to the success of a movie and then offered a few recommendations on the best strategies to get started. The goal is that Microsoft can use the analysis to adjust planning, production, and marketing to hit the ground running as they enter this highly competitive space.

The goal of the project was to assist Microsoft in entering the movie-making sector while successfully standing out from fierce competition by producing an analysis that would suggest the best strategies they should use. By choosing to create films that their target audiences have shown to thoroughly enjoy, they can produce movies that are more likely to become hits, which will in turn allow them to improve on and produce even more content, setting them up to be a studio force to be reckoned with.

Using data from well-known industry sources such as IMDb.com and Rotten Tomatoes, I analyzed and explained patterns in popular movie types based on profits, as well as budgeting decisions, to help predict what audiences want from a film and thus improve its chances of success.

I used descriptive analysis, including a description of movie trends by month of the year, to provide a useful overview of the movie industry's profits and profit margins based on release timing.

Throughout this project I used a number of data cleaning techniques, such as the .explode() method. (1) After initially viewing my dataset, I began preparing and cleaning my data, starting with the movie names and genres. Since I had decided which information I wanted to work with and which to eliminate, I chose to make a new data frame using only the columns I needed, with the intent to later merge it back with the main data frame. Here they are the 'movie_title' and 'genres' columns.

Once I had done that, I renamed the new data frame to 'movies_table' using the .rename() method:
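A minimal sketch of this step, assuming a hypothetical main data frame called `df` with made-up rows. Only the column names 'movie_title' and 'genres' and the name `movies_table` come from the post; everything else is illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the main data frame (rows are made up).
df = pd.DataFrame({
    "movie_title": ["Movie A", "Movie B"],
    "genres": ["Drama, Comedy", "Action & Adventure"],
    "release_date": ["2010-02-12", "2011-07-01"],
})

# Keep only the two columns needed for the genre analysis.
# .copy() makes the new frame independent of df.
movies_table = df[["movie_title", "genres"]].copy()
print(movies_table)
```

Note that in pandas, .rename() relabels columns or index entries; giving the data frame itself a name is a plain variable assignment, as in the sketch.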

I then created a new 'genres' data frame consisting of the genre name and the number of times it occurs in the dataset.

In this particular situation, a number of my movies were classified into more than one genre, and the genres were listed as a single string separated by commas.

Take the above example: the movie "Percy Jackson & the Olympians: The Lightning Thief" is categorized under Action & Adventure, Comedy, Drama, Science Fiction. Simply using the .value_counts() method, which returns the values and number of times they appear, would have produced a result similar to:

Action & Adventure, Comedy, Drama, Science Fiction | 1

In other words, this method would have returned a value count of 1 because this entire string is counted as a single value.
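To make the problem concrete, here is a small sketch with made-up rows (only the Percy Jackson genre string comes from the example above). It shows that .value_counts() treats the whole comma-separated string as one value.

```python
import pandas as pd

# Two made-up rows; the first holds several genres in one string.
genres = pd.Series([
    "Action & Adventure, Comedy, Drama, Science Fiction",
    "Drama",
])

counts = genres.value_counts()
print(counts)

# The combined string is counted once as a whole,
# and "Comedy" never shows up as its own entry.
print("Comedy" in counts.index)
```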

So I did some research and discovered that there is a solution to reach my desired result!

I did so using the .explode() method. The .explode() method transforms each element of a list-like (such as lists, tuples, series, and numpy.ndarray) into its own row, replicating the index values.

First I converted the 'genres' column to a string:

Next, I removed the extra spaces. Because of the inconsistent spacing, some of the genres were duplicated. For example, the drama genre was entered at times as "Drama", and at other times as " Drama" (note the space after the opening quotation mark).

In addition, I removed the extra spacing surrounding the ampersands due to inconsistencies (e.g. the action and adventure genre was entered as both "Action & Adventure" and "Action&Adventure").

It is very important to carefully examine your data to determine patterns and inconsistencies. Had I not taken the steps above, I would have been given separate value counts for "Action & Adventure" and "Action&Adventure", as well as for "Drama" and " Drama", which would have produced inaccurate results since they are one and the same!
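The cleaning steps above could be sketched as follows. This is one possible way to do it, not necessarily the exact code from the project: the sample rows are made up, and the regular expressions are my assumption about how to normalize the spacing.

```python
import pandas as pd

# Made-up rows showing the inconsistencies described above.
genres = pd.Series([
    "Action & Adventure, Drama",
    "Action&Adventure,Drama",
    " Drama",
])

cleaned = (
    genres.astype(str)                          # make sure every value is a string
    .str.strip()                                # drop leading/trailing spaces
    .str.replace(r"\s*&\s*", "&", regex=True)   # no spaces around ampersands
    .str.replace(r"\s*,\s*", ",", regex=True)   # no spaces around commas
)
print(cleaned.tolist())
```

After this step, both spellings of the action and adventure genre read "Action&Adventure", and " Drama" becomes "Drama", so they will be counted together.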

After preparing my strings, I went on to use the mighty explode method.

Since this method transforms each element of a list-like into a row, and each genre needs to be its own element, I first used the .str.split() method to split each string at the commas, since that is where the genres are separated. This turns each string into a list of genres.

Then I used the .explode() method to convert each genre to its own row, and completed the table by calling the .value_counts() method to produce the number of times each genre occurs. I then renamed my table and converted it to a data frame.

Next I reset my index (to reflect only my current data frame) and renamed my columns to "Genre" and "Movie Count".
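Putting the split, explode, count, and rename steps together, a minimal sketch might look like this. The input rows are made up and assumed to be already cleaned; the column names "Genre" and "Movie Count" come from the text above, while the variable names are hypothetical.

```python
import pandas as pd

# Made-up, already-cleaned genre strings.
genres = pd.Series([
    "Action&Adventure,Drama",
    "Drama,Comedy",
    "Drama",
])

genre_counts = (
    genres.str.split(",")   # each string becomes a list of genres
    .explode()              # each list element gets its own row
    .value_counts()         # count how often each genre appears
)

# Convert to a data frame, move the genres out of the index,
# and give the two columns readable names.
genre_table = genre_counts.to_frame().reset_index()
genre_table.columns = ["Genre", "Movie Count"]
print(genre_table)
```

Assigning to `.columns` directly sidesteps the fact that the default column names produced by .value_counts() differ between pandas versions.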

And this is the result!

Simple, easy and effective!

Continuing with my analysis, I used a number of other methods, including the .strftime() method (2), to display the months of the year in which movies perform best profit-wise.

The .strftime() method takes one or more format codes and returns a formatted string. My 'release_date' column's values were the dates each movie was released.

So after copying my column to a new data frame, I used a for statement to iterate over its values and extract the month from each date. Doing so is rather easy, since this method takes predefined format codes and returns the specific value you need. Here the '%B' format code returns the full month name (e.g. January; see code cheatsheet here!)

Then I sorted my new data frame based on the month with the most movie releases (which turns out to be December!)
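Here is a small sketch of the month extraction, assuming made-up release dates stored as strings in a 'release_date' column. The `release_month` column and the variable names are hypothetical, and the English month names assume the default locale.

```python
import pandas as pd

# Made-up release dates; only the column name comes from the post.
releases = pd.DataFrame({
    "release_date": ["2010-12-17", "2011-07-01", "2012-12-25"],
})

# Parse the strings into datetime values so .strftime() is available.
dates = pd.to_datetime(releases["release_date"])

# Loop over the dates and pull out the full month name with '%B'.
months = []
for d in dates:
    months.append(d.strftime("%B"))
releases["release_month"] = months

# Count releases per month, sorted from most to fewest.
month_counts = releases["release_month"].value_counts()
print(month_counts)
```

pandas also offers a vectorized form, `dates.dt.strftime("%B")`, which gives the same month names without an explicit loop.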

Based on my analysis, which relied on the two methods explained above among others, I arrived at the following results:

The top 3 most profitable movie genres are Romance, Cult Movies, and Animation, with Romance offering an average of .8 net profit!

The best times to release a movie are the early summer months (June and July), with July taking the number one spot in profit margins, followed by the holiday season, specifically November and December.

Conclusions

These two simple methods have shown how powerful a piece of code can be! As I've stated above, it's important to thoroughly examine your data to know what methods should be applied, as well as what preparation should be done before using them (e.g. checking for inconsistencies and remedying them, such as removing the extra spaces as I've done above). Once you've done so, manipulating and analyzing your data should become easier!

References

(1) .explode() method: transforms each element of a list-like into its own row, replicating the index values. It does not split strings itself; use .str.split() first to split a string wherever the delimiter character occurs. Source: Pandas DataFrame explode() Method - Studytonight

(2) .strftime() method: converts a tuple or struct_time representing a time as returned by gmtime() or localtime() to a string as specified by the format argument. %B returns the full month name. Source: Python time strftime() Method (tutorialspoint.com)

.strftime() handy cheatsheet: Python strftime reference cheatsheet

(3) Header image [source]

To view the full GitHub repository click here.

Chronicles of An Aspiring Data Scientist