Using pandas on the MovieLens dataset

October 26, 2013 // python, pandas, sql, tutorial, data science

UPDATE: If you're interested in learning pandas from a SQL perspective and would prefer to watch a video, you can find video of my 2014 PyData NYC talk here.

This is part three of a three part introduction to pandas, a Python library for data analysis. The tutorial is primarily geared towards SQL users, but is useful for anyone wanting to get started with the library.

Part 1: Intro to pandas data structures, covers the basics of the library's two main data structures - Series and DataFrames.
Part 2: Working with DataFrames, dives a bit deeper into the functionality of DataFrames. It shows how to inspect, select, filter, merge, combine, and group your data.
Part 3: Using pandas with the MovieLens dataset, applies the learnings of the first two parts in order to answer a few basic analysis questions about the MovieLens ratings data.

Using pandas on the MovieLens dataset

To show pandas in a more "applied" sense, let's use it to answer some questions about the MovieLens dataset. Recall that we've already read our data into DataFrames and merged it.

# pass in column names for each CSV
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('ml-100k/u.user', sep='|', names=u_cols,
                    encoding='latin-1')

r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=r_cols,
                      encoding='latin-1')

# the movies file contains columns indicating the movie's genres
# let's only load the first five columns of the file with usecols
m_cols = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']
movies = pd.read_csv('ml-100k/u.item', sep='|', names=m_cols, usecols=range(5),
                     encoding='latin-1')

# create one merged DataFrame
movie_ratings = pd.merge(movies, ratings)
lens = pd.merge(movie_ratings, users)

What are the 25 most rated movies?

most_rated = lens.groupby('title').size().sort_values(ascending=False)[:25]
most_rated

title
Star Wars (1977)                             583
Contact (1997)                               509
Fargo (1996)                                 508
Return of the Jedi (1983)                    507
Liar Liar (1997)                             485
English Patient, The (1996)                  481
Scream (1996)                                478
Toy Story (1995)                             452
Air Force One (1997)                         431
Independence Day (ID4) (1996)                429
Raiders of the Lost Ark (1981)               420
Godfather, The (1972)                        413
Pulp Fiction (1994)                          394
Twelve Monkeys (1995)                        392
Silence of the Lambs, The (1991)             390
Jerry Maguire (1996)                         384
Chasing Amy (1997)                           379
Rock, The (1996)                             378
Empire Strikes Back, The (1980)              367
Star Trek: First Contact (1996)              365
Back to the Future (1985)                    350
Titanic (1997)                               350
Mission: Impossible (1996)                   344
Fugitive, The (1993)                         336
Indiana Jones and the Last Crusade (1989)    331
dtype: int64

There's a lot going on in the code above, but it's very idomatic. We're splitting the DataFrame into groups by movie title and applying the size method to get the count of records in each group. Then we order our results in descending order and limit the output to the top 25 using Python's slicing syntax.

In SQL, this would be equivalent to:

SELECT title, count(1)
FROM lens
GROUP BY title
ORDER BY 2 DESC
LIMIT 25;

Alternatively, pandas has a nifty value_counts method - yes, this is simpler - the goal above was to show a basic groupby example.

lens.title.value_counts()[:25]

Star Wars (1977)                             583
Contact (1997)                               509
Fargo (1996)                                 508
Return of the Jedi (1983)                    507
Liar Liar (1997)                             485
English Patient, The (1996)                  481
Scream (1996)                                478
Toy Story (1995)                             452
Air Force One (1997)                         431
Independence Day (ID4) (1996)                429
Raiders of the Lost Ark (1981)               420
Godfather, The (1972)                        413
Pulp Fiction (1994)                          394
Twelve Monkeys (1995)                        392
Silence of the Lambs, The (1991)             390
Jerry Maguire (1996)                         384
Chasing Amy (1997)                           379
Rock, The (1996)                             378
Empire Strikes Back, The (1980)              367
Star Trek: First Contact (1996)              365
Titanic (1997)                               350
Back to the Future (1985)                    350
Mission: Impossible (1996)                   344
Fugitive, The (1993)                         336
Indiana Jones and the Last Crusade (1989)    331
Name: title, dtype: int64

Which movies are most highly rated?

movie_stats = lens.groupby('title').agg({'rating': [np.size, np.mean]})
movie_stats.head()

	rating
	size	mean
title
'Til There Was You (1997)	9	2.333333
1-900 (1994)	5	2.600000
101 Dalmatians (1996)	109	2.908257
12 Angry Men (1957)	125	4.344000
187 (1997)	41	3.024390

We can use the agg method to pass a dictionary specifying the columns to aggregate (as keys) and a list of functions we'd like to apply.

Let's sort the resulting DataFrame so that we can see which movies have the highest average score.

# sort by rating average
movie_stats.sort_values([('rating', 'mean')], ascending=False).head()

	rating
	size	mean
title
They Made Me a Criminal (1939)	1	5
Marlene Dietrich: Shadow and Light (1996)	1	5
Saint of Fort Washington, The (1993)	2	5
Someone Else's America (1995)	1	5
Star Kid (1997)	3	5

Because movie_stats is a DataFrame, we use the sort method - only Series objects use order. Additionally, because our columns are now a MultiIndex, we need to pass in a tuple specifying how to sort.

The above movies are rated so rarely that we can't count them as quality films. Let's only look at movies that have been rated at least 100 times.

atleast_100 = movie_stats['rating']['size'] >= 100
movie_stats[atleast_100].sort_values([('rating', 'mean')], ascending=False)[:15]

	rating
	size	mean
title
Close Shave, A (1995)	112	4.491071
Schindler's List (1993)	298	4.466443
Wrong Trousers, The (1993)	118	4.466102
Casablanca (1942)	243	4.456790
Shawshank Redemption, The (1994)	283	4.445230
Rear Window (1954)	209	4.387560
Usual Suspects, The (1995)	267	4.385768
Star Wars (1977)	583	4.358491
12 Angry Men (1957)	125	4.344000
Citizen Kane (1941)	198	4.292929
To Kill a Mockingbird (1962)	219	4.292237
One Flew Over the Cuckoo's Nest (1975)	264	4.291667
Silence of the Lambs, The (1991)	390	4.289744
North by Northwest (1959)	179	4.284916
Godfather, The (1972)	413	4.283293

Those results look realistic. Notice that we used boolean indexing to filter our movie_stats frame.

We broke this question down into many parts, so here's the Python needed to get the 15 movies with the highest average rating, requiring that they had at least 100 ratings:

movie_stats = lens.groupby('title').agg({'rating': [np.size, np.mean]})
atleast_100 = movie_stats['rating'].size >= 100
movie_stats[atleast_100].sort_values([('rating', 'mean')], ascending=False)[:15]

The SQL equivalent would be:

SELECT title, COUNT(1) size, AVG(rating) mean
FROM lens
GROUP BY title
HAVING COUNT(1) >= 100
ORDER BY 3 DESC
LIMIT 15;

Limiting our population going forward

Going forward, let's only look at the 50 most rated movies. Let's make a Series of movies that meet this threshold so we can use it for filtering later.

most_50 = lens.groupby('movie_id').size().sort_values(ascending=False)[:50]

The SQL to match this would be:

CREATE TABLE most_50 AS (
    SELECT movie_id, COUNT(1)
    FROM lens
    GROUP BY movie_id
    ORDER BY 2 DESC
    LIMIT 50
);

This table would then allow us to use EXISTS, IN, or JOIN whenever we wanted to filter our results. Here's an example using EXISTS:

SELECT *
FROM lens
WHERE EXISTS (SELECT 1 FROM most_50 WHERE lens.movie_id = most_50.movie_id);

Which movies are most controversial amongst different ages?

Let's look at how these movies are viewed across different age groups. First, let's look at how age is distributed amongst our users.

users.age.plot.hist(bins=30)
plt.title("Distribution of users' ages")
plt.ylabel('count of users')
plt.xlabel('age');

Distribution of user ages

pandas' integration with matplotlib makes basic graphing of Series/DataFrames trivial. In this case, just call hist on the column to produce a histogram. We can also use matplotlib.pyplot to customize our graph a bit (always label your axes).

Binning our users

I don't think it'd be very useful to compare individual ages - let's bin our users into age groups using pandas.cut.

labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79']
lens['age_group'] = pd.cut(lens.age, range(0, 81, 10), right=False, labels=labels)
lens[['age', 'age_group']].drop_duplicates()[:10]

	age	age_group
0	60	60-69
397	21	20-29
459	33	30-39
524	30	30-39
782	23	20-29
995	29	20-29
1229	26	20-29
1664	31	30-39
1942	24	20-29
2270	32	30-39

pandas.cut allows you to bin numeric data. In the above lines, we first created labels to name our bins, then split our users into eight bins of ten years (0-9, 10-19, 20-29, etc.). Our use of right=False told the function that we wanted the bins to be exclusive of the max age in the bin (e.g. a 30 year old user gets the 30s label).

Now we can now compare ratings across age groups.

lens.groupby('age_group').agg({'rating': [np.size, np.mean]})

	rating
	size	mean
age_group
0-9	43	3.767442
10-19	8181	3.486126
20-29	39535	3.467333
30-39	25696	3.554444
40-49	15021	3.591772
50-59	8704	3.635800
60-69	2623	3.648875
70-79	197	3.649746

Young users seem a bit more critical than other age groups. Let's look at how the 50 most rated movies are viewed across each age group. We can use the most_50 Series we created earlier for filtering.

lens.set_index('movie_id', inplace=True)
by_age = lens.loc[most_50.index].groupby(['title', 'age_group'])
by_age.rating.mean().head(15)

title                 age_group
Air Force One (1997)  10-19        3.647059
                      20-29        3.666667
                      30-39        3.570000
                      40-49        3.555556
                      50-59        3.750000
                      60-69        3.666667
                      70-79        3.666667
Alien (1979)          10-19        4.111111
                      20-29        4.026087
                      30-39        4.103448
                      40-49        3.833333
                      50-59        4.272727
                      60-69        3.500000
                      70-79        4.000000
Aliens (1986)         10-19        4.050000
Name: rating, dtype: float64

Notice that both the title and age group are indexes here, with the average rating value being a Series. This is going to produce a really long list of values.

Wouldn't it be nice to see the data as a table? Each title as a row, each age group as a column, and the average rating in each cell.

Behold! The magic of unstack!

by_age.rating.mean().unstack(1).fillna(0)[10:20]

age_group	0-9	10-19	20-29	30-39	40-49	50-59	60-69	70-79
title
E.T. the Extra-Terrestrial (1982)	0	3.680000	3.609091	3.806818	4.160000	4.368421	4.375000	0.000000
Empire Strikes Back, The (1980)	4	4.642857	4.311688	4.052083	4.100000	3.909091	4.250000	5.000000
English Patient, The (1996)	5	3.739130	3.571429	3.621849	3.634615	3.774648	3.904762	4.500000
Fargo (1996)	0	3.937500	4.010471	4.230769	4.294118	4.442308	4.000000	4.333333
Forrest Gump (1994)	5	4.047619	3.785714	3.861702	3.847826	4.000000	3.800000	0.000000
Fugitive, The (1993)	0	4.320000	3.969925	3.981481	4.190476	4.240000	3.666667	0.000000
Full Monty, The (1997)	0	3.421053	4.056818	3.933333	3.714286	4.146341	4.166667	3.500000
Godfather, The (1972)	0	4.400000	4.345070	4.412844	3.929412	4.463415	4.125000	0.000000
Groundhog Day (1993)	0	3.476190	3.798246	3.786667	3.851064	3.571429	3.571429	4.000000
Independence Day (ID4) (1996)	0	3.595238	3.291429	3.389381	3.718750	3.888889	2.750000	0.000000

unstack, well, unstacks the specified level of a MultiIndex (by default, groupby turns the grouped field into an index - since we grouped by two fields, it became a MultiIndex). We unstacked the second index (remember that Python uses 0-based indexes), and then filled in NULL values with 0.

If we would have used:

by_age.rating.mean().unstack(0).fillna(0)

We would have had our age groups as rows and movie titles as columns.

Which movies do men and women most disagree on?

EDIT: I realized after writing this question that Wes McKinney basically went through the exact same question in his book. It's a good, yet simple example of pivot_table, so I'm going to leave it here. Seriously though, go buy the book.

Think about how you'd have to do this in SQL for a second. You'd have to use a combination of IF/CASE statements with aggregate functions in order to pivot your dataset. Your query would look something like this:

SELECT title, AVG(IF(sex = 'F', rating, NULL)), AVG(IF(sex = 'M', rating, NULL))
FROM lens
GROUP BY title;

Imagine how annoying it'd be if you had to do this on more than two columns.

DataFrame's have a pivot_table method that makes these kinds of operations much easier (and less verbose).

lens.reset_index('movie_id', inplace=True)
pivoted = lens.pivot_table(index=['movie_id', 'title'],
                           columns=['sex'],
                           values='rating',
                           fill_value=0)
pivoted.head()

	sex	F	M
movie_id	title
1	Toy Story (1995)	3.789916	3.909910
2	GoldenEye (1995)	3.368421	3.178571
3	Four Rooms (1995)	2.687500	3.108108
4	Get Shorty (1995)	3.400000	3.591463
5	Copycat (1995)	3.772727	3.140625

pivoted['diff'] = pivoted.M - pivoted.F
pivoted.head()

	sex	F	M	diff
movie_id	title
1	Toy Story (1995)	3.789916	3.909910	0.119994
2	GoldenEye (1995)	3.368421	3.178571	-0.189850
3	Four Rooms (1995)	2.687500	3.108108	0.420608
4	Get Shorty (1995)	3.400000	3.591463	0.191463
5	Copycat (1995)	3.772727	3.140625	-0.632102

pivoted.reset_index('movie_id', inplace=True)
disagreements = pivoted[pivoted.movie_id.isin(most_50.index)]['diff']
disagreements.sort_values().plot(kind='barh', figsize=[9, 15])
plt.title('Male vs. Female Avg. Ratings\n(Difference > 0 = Favored by Men)')
plt.ylabel('Title')
plt.xlabel('Average Rating Difference');

Of course men like Terminator more than women. Independence Day though? Really?

Additional Resources

pandas documentation
pandas videos from PyCon
pandas and Python top 10
Tom Augspurger's Modern pandas series
Video from Tom's pandas tutorial at PyData Seattle 2015

Closing

This is the point where I finally wrap this tutorial up. Hopefully I've covered the basics well enough to pique your interest and help you get started with the library. If I've missed something critical, feel free to let me know on Twitter or in the comments - I'd love constructive feedback.