3-pointers after offensive rebounds
December 26, 2013 // scraping, pandas, data science, basketball, sports, simulation
I love college basketball. A lot. My beloved Marquette Golden Eagles are probably the only sports team that can really put me in agony.
Last season, I was watching a nail-biter against Notre Dame. With 3:36 left, Notre Dame's Jack Cooley grabbed an offensive board and promptly passed out to Pat Connaughton for a successful three-pointer. It was a dagger.
One thing stuck out to me though - after the shot, Jay Bilas, who was calling the game with Bill Raftery and Sean McDonough, stated that the best time to attempt a 3-pointer is after an offensive rebound.
Intuitively, this statement makes sense - the defensive front line is crashing the boards in hopes of grabbing the rebound and ending the opponent's possession, while the defensive backcourt is out on the wings looking for an outlet pass from their teammates, likely leaving their offensive counterparts unguarded.
I've never seen any data that indicates whether three pointers are more successful after an offensive rebound though, much less whether it's the best time to shoot one. It seemed like something worth investigating.
In the following analysis, we'll try to determine whether there is a material difference between "normal" 3P% (those not shot after an offensive rebound) and 3P% when the shot was preceded by an offensive rebound.
I'll be going step by step through data collection, munging, and analysis. If you're just interested in the answer, skip to the last section.
Getting the data
ESPN has play-by-play data for almost every NCAA Division I game. I've written a Python script that will collect all of this data for a given date range. You can find the script here. If you're unfamiliar with web scraping, check out the tutorial I wrote previously.
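If you're curious what the scraping boils down to, here's a minimal sketch of the idea - the URL pattern and table layout below are assumptions about ESPN's play-by-play pages at the time, and the actual script linked above handles date ranges, headers, and error cases:
import requests
from bs4 import BeautifulSoup
# illustrative sketch only - the URL pattern and 4-cell row layout are assumptions
game_id = '330620024'
url = 'http://scores.espn.go.com/ncb/playbyplay?gameId={0}&period=0'.format(game_id)
soup = BeautifulSoup(requests.get(url).text)
with open('{0}.psv'.format(game_id), 'w') as f:
    for row in soup.find_all('tr'):
        cells = [td.get_text().strip() for td in row.find_all('td')]
        if len(cells) == 4:  # time | away event | score | home event
            f.write('|'.join(cells) + '\n')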
Analysis
Here's an example of what the game files look like:
%cd ../../_code/cbb-play-data/
!head -n 5 330620024.psv
In the output above, we see what a line referring to a three pointer looks like. Let's see what an offensive rebound line looks like:
!grep Offensive 330620024.psv | head -n 5
Now let's start our analysis using pandas.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob
import re
# read PSVs into DataFrame
games = []
files = glob.glob('*.psv')
for f in files:
    df = pd.read_csv(f, sep='|')
    df['game_id'] = f.replace('.psv', '')
    games.append(df)
print 'Read {0} games'.format(len(games))
To start, we need to find all incidents of a three pointer immediately after an offensive rebound.
This format is kind of crappy though - since events are in separate columns for the home and away teams, we'd have to write logic to check against each column. Let's munge our data into a slightly different format - one column for team, which will indicate home/away, and another for event, which will store the description of what occurred.
games_df = pd.concat(games)
# add event_id to maintain event order
# we can use the index since pandas numbers each file's rows 0 through N-1 by default
games_df['event_id'] = games_df.index
# melt data into one column for home/away and another for event
# maintain play order by sorting on event_id
melted = pd.melt(games_df, id_vars=['event_id', 'game_id', 'time', 'score'],
var_name='team', value_name='event')
melted.sort_index(by=['game_id', 'event_id'], inplace=True)
# drop rows with NaN events - an event only belongs to one team
melted = melted[melted.event.notnull()]
print melted[10:15]
We need to know whether the three pointers were missed or made - let's write a function called get_shot_result to extract the shot result from the event column. We can apply it to every row that contains a three pointer, storing the results in a new column called shot_result.
# label whether three pointers were made or missed
get_shot_result = lambda x: re.findall('(made|missed)', x)[0]
shot3 = melted.event.str.contains('Three Point')
melted['shot_result'] = melted[shot3].event.apply(get_shot_result)
Now let's write a function to label the events that meet our criteria - a three point attempt that was preceded by an offensive rebound. We can use shift(1) to reference the event on the previous row.
def criteria(df):
    """Labels if the three pointer was preceded by an offensive rebound."""
    df['after_oreb'] = (df.event.str.contains('Three Point') &
                        df.event.shift(1).str.contains('Offensive Rebound'))
    df.after_oreb.fillna(False, inplace=True)
    return df
melted = melted.groupby('game_id').apply(criteria)
melted[melted.shot_result.notnull()].head(3)
Finally, we can calculate the 3P% for our groups and plot the results.
threes = melted[melted.shot_result.notnull()]
attempts = threes.groupby(['shot_result', 'after_oreb']).size().unstack(0)
attempts['perc'] = attempts.made.astype(float) / (attempts.made + attempts.missed)
print attempts
attempts.index = ['No', 'Yes']
plt.figure(figsize=[8, 6])
attempts.perc.plot(kind='bar')
plt.ylabel('3P%')
plt.xlabel('After Offensive Rebound?')
plt.grid(False);
Digging deeper
At the most basic level, the difference is negligible. Not all post-offensive rebound three pointers are created equally though. Let's investigate whether three pointers shot shortly after the rebound are more successful by looking only at those attempted within seven seconds of the offensive rebound.
To do so, we'll need to munge our data a bit more in order to calculate the seconds elapsed between rebound and shot attempt.
melted['minutes'] = melted.time.apply(lambda x: int(x.split(':')[0]))
melted['seconds'] = melted.time.apply(lambda x: int(x.split(':')[1]))
Notice below that timeouts and end-of-period events are duplicated within the data (this is because they originally appeared as both a home and an away event).
duped_cols = ['game_id', 'event_id', 'time', 'event']
melted[melted.duplicated(cols=duped_cols)][:3]
Let's get rid of them so we can easily label events within each period (keeping them in will throw our function off a bit).
melted.drop_duplicates(cols=['game_id', 'event_id', 'event'], inplace=True)
Now we can label each period based on when the End of ... event appears - events before it are the first half; events after, the second half. To do so, we can use cumsum, which will treat True values as 1. This means we can just shift our results down a row and take the running total.
melted['period_end'] = melted.event.apply(lambda x: x.startswith('End of'))
melted[melted.period_end].head(3)
calculate_period = lambda x: x.shift(1).cumsum().fillna(0) + 1
melted['period'] = melted.groupby('game_id').period_end.apply(calculate_period)
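To make the shift-then-cumsum trick concrete, here's what it does on a toy Series (the values are made up; in the real data the flags are booleans, which cumsum adds up as 1s):
# made-up flags: 1 marks the "End of the 1st Half" row
end_flags = pd.Series([0, 0, 1, 0, 0])
# shifting keeps the "End of ..." row itself in the first period;
# the running total then counts the period-ending events seen so far
print (end_flags.shift(1).cumsum().fillna(0) + 1).tolist()
# -> [1.0, 1.0, 1.0, 2.0, 2.0]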
Now we can use the period to calculate the total time left in the game.
melted.set_index('game_id', inplace=True)
# 40min regulation game + (# periods - 2 halves) * 5min OTs
gametime = lambda x: 40 + (x - 2) * 5
melted['gametime'] = melted.groupby(level=0).period.max().apply(gametime)
melted.reset_index('game_id', inplace=True)
Setting the game_id as the index was necessary for this function since pandas naturally tries to match on indexes. If we were to try ...
melted['gametime'] = melted.groupby('game_id').period.max().apply(gametime)
It would not have worked because the Series object returned by our groupby operation uses the game_id as its index, while our DataFrame did not. pandas would have been unable to match the two and create the new gametime column.
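Here's a toy illustration of that alignment behavior (the values are made up):
# mini example to show how assignment aligns on the index
toy = pd.DataFrame({'game_id': ['a', 'a', 'b'], 'period': [1, 2, 2]})
# the groupby result is indexed by game_id...
maxes = toy.groupby('game_id').period.max()
print maxes.index.tolist()          # ['a', 'b']
# ...so assigning it to a frame with a default integer index gives all NaNs
toy['max_naive'] = maxes
print toy.max_naive.isnull().all()  # True
# with game_id as the index, the values line up as intended
toy = toy.set_index('game_id')
toy['max_aligned'] = toy.groupby(level=0).period.max()
print toy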
melted.groupby('gametime').game_id.nunique()
Notice above that some games have a total gametime of 35 minutes. All college basketball games are at least 40 minutes, so something is off.
It turns out there are some inconsistencies in ESPN's play-by-play data - a couple of games do not have the End of the 1st Half event - here's an example. This throws off our period and gametime calculations. Such is life when dealing with scraped data though.
Let's keep things simple and just assume they were normal, non-OT games.
melted.loc[melted.gametime == 35, 'gametime'] = 40
Now let's normalize the event times to seconds left in the game. This will allow us to see how much time elapsed between the offensive rebound and three point attempt.
def clock_to_secs_left(df):
    """Calculates the total seconds left in the game."""
    df['secs_left'] = np.nan
    df.loc[df.period == 1, 'secs_left'] = (df.minutes * 60) + 1200 + df.seconds
    df.loc[df.period > 1, 'secs_left'] = (df.minutes * 60) + df.seconds
    return df
clock_to_secs_left(melted)
print melted[['game_id', 'time', 'event', 'period', 'secs_left']][:5]
We can finally see how much time elapsed between offensive rebound and three point attempt.
We'll create a new field which will store the seconds elapsed since the previous event. Then we'll create a new DataFrame called threes_after_orebs, which will hold three point attempts that were shot within seven seconds of an offensive rebound.
melted['secs_elapsed'] = melted.secs_left.shift(1) - melted.secs_left
mask = (melted.secs_elapsed >= 0) & (melted.secs_elapsed <= 7)
threes_after_orebs = melted[melted.after_oreb & mask]
Finally, let's group our data by the seconds elapsed and shot result to get the three point percentage for each bucket, which we can plot.
grouped = threes_after_orebs.groupby(['shot_result', 'secs_elapsed']).size()
grouped = grouped.unstack(0).fillna(0)
grouped['attempts'] = grouped.made + grouped.missed
grouped['percentage'] = grouped.made / grouped.attempts.astype(float)
t = threes.shot_result.value_counts()
t = float(t['made']) / (t['made'] + t['missed'])
plt.figure(figsize=(12.5, 7))
plt.plot(grouped.percentage, label='O-Reb 3P%', color='#377EB8')
plt.hlines(t, 0, 7, label='"Normal" 3P%', linestyles='--')
plt.xlabel('Seconds Since Offensive Rebound')
plt.xticks(np.arange(8))
plt.ylabel('3-Point Percentage', labelpad=15)
plt.grid(False)
plt.legend(loc='lower right');
It looks like there might be some truth to Jay Bilas' statement that the best time to shoot a three pointer is after an offensive rebound. However, we can go one step further and simulate our way to a numeric value of how "correct" he was.
Simulations
To start, we'll create two Series based on the melted DataFrame that we have been using throughout the analysis. One Series, which we'll call normal, will hold the results of three pointers that were normally attempted - that is, they were not shot immediately after an offensive rebound. The other Series, after, will contain the results of those shot after an offensive rebound. True will be used to indicate a made basket.
convert = lambda x: True if x == 'made' else False
normal_criteria = (melted.after_oreb == False) & melted.shot_result.notnull()
normal = melted[normal_criteria].shot_result.apply(convert)
after_criteria = (melted.after_oreb) & melted.shot_result.notnull() & \
(melted.secs_elapsed <= 7)
after = melted[after_criteria].shot_result.apply(convert)
print "After O-Reb 3P%:", after.mean()
print "Sample Size:", len(after)
print "\n"
print "All other 3P%:", normal.mean()
print "Sample Size:", len(normal)
print "\n"
print "Absolute difference: %.4f" % (after.mean() - normal.mean())
While we have data for 2,932 games, it turns out that three pointers shot within seven seconds of an offensive rebound aren't very common - there were only 4,435 of them, compared to 96,838 "normal" three point attempts.
The much smaller population means that we're more uncertain about the "true" success rate of those after offensive rebounds.
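As a rough back-of-the-envelope check, the binomial standard error sqrt(p * (1 - p) / n) makes that gap in certainty concrete (a quick sanity check using the two Series above, not part of the simulation itself):
# rough binomial standard errors for the two sample sizes
for label, s in [('normal', normal), ('after o-reb', after)]:
    p, n = s.mean(), len(s)
    print '{0}: p = {1:.4f}, n = {2}, std. error = {3:.4f}'.format(label, p, n, np.sqrt(p * (1 - p) / n))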
Using pymc, we can run simulations to determine how likely it is that three pointers after offensive rebounds really are easier, while taking this uncertainty into account.
We'll first assume a uniform distribution between 30% and 40% for both "normal" three pointers and those after offensive rebounds. This means that we believe the "true" success rate of each to be somewhere between 30% and 40%. This seems reasonable given that the 3P% for all of NCAA Division I basketball during the 2012-2013 season was 33.89% [source].
We then generate observations using our normal and after Series. Our observations are an example of a Bernoulli distribution, meaning that the outcome is binary - the three pointer was either made (True) or missed (False).
We'll then run 20,000 simulations using our existing data and assumptions.
import pymc as pm
# no chance 3P% is out of this range
p_normal = pm.Uniform("p_normal", 0.3, 0.4)
p_after = pm.Uniform("p_after", 0.3, 0.4)
@pm.deterministic
def delta(p_normal=p_normal, p_after=p_after):
    return p_after - p_normal
# scraped observations
obs_normal = pm.Bernoulli("obs_normal", p_normal, value=normal, observed=True)
obs_after = pm.Bernoulli("obs_after", p_after, value=after, observed=True)
m = pm.MCMC([p_normal, p_after, delta, obs_normal, obs_after])
m.sample(20000)
p_normal_samples = m.trace("p_normal")[:]
p_after_samples = m.trace("p_after")[:]
delta_samples = m.trace("delta")[:]
Finally, we can plot the results of our simulations.
plt.figure(figsize=(12.5, 10))
ax = plt.subplot(311)
plt.xlim(0.3, 0.4)
plt.xticks(np.arange(0.3, 0.401, 0.01))
plt.ylim(0, 300)
plt.hist(p_normal_samples, histtype='stepfilled', bins=50, normed=True,
color='#E41A1C', label='3P% "Normal"')
plt.vlines(normal.mean(), 0, 300, linestyles='--', label='True "Normal" 3P%')
plt.legend()
plt.grid(False)
ax = plt.subplot(312)
plt.xlim(0.3, 0.4)
plt.xticks(np.arange(0.3, 0.401, 0.01))
plt.ylim(0, 300)
plt.hist(p_after_samples, histtype='stepfilled', bins=50, normed=True,
color='#4DAF4A', label='3P% After Off. Reb.')
plt.vlines(after.mean(), 0, 300, linestyles='--',
label='True 3P% After Off. Reb.')
plt.legend()
plt.grid(False)
ax = plt.subplot(313)
plt.xlim(-0.05, 0.05)
plt.xticks(np.arange(-0.05, 0.051, 0.01))
plt.ylim(0, 300)
plt.hist(delta_samples, histtype='stepfilled', bins=50, normed=True,
color='#377EB8', label='Delta')
plt.vlines(0, 0, 300, linestyles='--', label='$H_0$ (No difference)')
plt.legend()
plt.grid(False);
Notice the much narrower distribution at the top? This is because we had so many observations for "normal" three pointers - there wasn't much uncertainty! It's quite different from the second distribution, whose greater width indicates a greater level of uncertainty about three pointers attempted after offensive rebounds. This is due to the much smaller sample size.
The final distribution shows the difference between "normal" and "after" three pointers in our simulations. Much more often than not, "after" three pointers had a higher 3P% in the simulations.
In what percentage of simulations was "after" better than "normal" though?
print ("3P% after offensive rebounds was more successful "
"in {0:.1f}% of simulations").format((delta_samples > 0).mean() * 100)
Looks like Jay Bilas is pretty much right about this one. While the absolute difference in 3P% is small, three pointers after an offensive rebound were more successful in about 83% of our simulations.
Was there something I missed or got wrong? I'd love to hear from you.