As many of you know, I have been working on models for projecting future NBA performance of college players. I had a relatively finished product last month, but thanks to Zebano’s procurement of college data going back to the 80s (via Basketball-reference.com), I got tricked into re-jiggering the whole thing. Some information is lost with the new data. Older stats do not distinguish between ORBs and DRBs, and I only have combine data starting in the 21st century. Still, the addition of many new data points is a major benefit. I no longer worry about whether I am just finding which point guard looks the most like Chris Paul, or whether all centers are being underrated due to the lack of stud college centers in the past decade.
The current version of the model is not a finished product. I’ve been traveling and haven’t had much time to sit down and play with the data. However, I do not expect to make any major changes before the draft. So… expect an update of the model, but do not expect the story it tells to change dramatically.
Ultimately I want data for every player who made, or tried to make, the jump from college to the pros. The data is not quite there, but it is getting closer. Few if any college teams recorded "minutes played", "assists", or defensive statistics until the early 80s. Even after that, there are many missing data points up until the 90s. That means many older players are not included in the data. Michael Jordan is the most disappointing omission, but it appears nobody knows how many minutes he played at UNC (if anyone can find this data it would be greatly appreciated).
In addition to basic box scores, I have physical specs: height, weight, and age (in days), plus more advanced measurements like vertical and wingspan for recent prospects. I also have team info: "SRS" and "SOS" for everyone, and "pace" for more recent prospects.
Model 1 (Peak Win Shares):
The goal of the first model is to predict how many win shares a player is expected to produce in his peak season before his 26th birthday. The current iteration of this model relies solely on "Win Shares", but I have played with including an RAPM-based win generation stat and may average it into the WS measure in the future.
In order to predict peak win shares, I fit a mixed-effects model predicting "Peak Win Shares" with all of the basic box score stats, age, position, SOS, SRS, and a couple of interactions as fixed effects, and "era" (college seasons in 5-year blocks from 1980-85 to 2010+) as a random effect. While height and weight fall out of this iteration of the model, they may be included in the future. This process tells me how different quantifiables have helped explain NBA success historically, and allows me to plug in new players in order to predict future success.
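For readers who want to see the shape of that step, here is a minimal sketch of a mixed-effects fit using statsmodels. The data is synthetic and the column names (pts, age, srs, era, peak_ws) are placeholders, not the actual dataset or the full predictor list described above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the prospect dataset; columns are hypothetical.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "pts": rng.normal(15, 5, n),      # points per game
    "age": rng.normal(20, 1.5, n),    # age at the end of the season
    "srs": rng.normal(0, 5, n),       # team Simple Rating System
    "era": rng.choice(["1980-85", "1986-90", "1991-95", "1996-00",
                       "2001-05", "2006-10", "2010+"], n),
})
# Fake outcome: better scorers, younger players, tougher teams -> more WS
df["peak_ws"] = (0.3 * df["pts"] - 0.5 * (df["age"] - 19)
                 + 0.1 * df["srs"] + rng.normal(0, 2, n))

# Box-score/team predictors as fixed effects, a random intercept per era
model = smf.mixedlm("peak_ws ~ pts + age + srs", df, groups=df["era"])
fit = model.fit()

# Plug players back in to project peak win shares from the fixed effects
df["projected_ws"] = fit.predict(df)
```

The random intercept for "era" is what lets the model borrow strength across 5-year blocks rather than treating, say, 1985 and 2010 box scores as directly comparable.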
I use this model to predict future performance for a player during every college season. I then use an additional regression model that weights each season differently (the most recent season carries the most information, but past seasons improve the ultimate prediction).
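The season-weighting step can be sketched as a second-stage regression: feed each player's per-season projections in as separate predictors and let the regression learn the weights. Everything below is synthetic; the two-season setup and noise levels are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical per-season projections for players with two college seasons:
# column 0 = most recent season, column 1 = the season before it.
rng = np.random.default_rng(1)
true_ws = rng.normal(5, 3, 200)
per_season = np.column_stack([
    true_ws + rng.normal(0, 1.0, 200),   # recent season: less noisy
    true_ws + rng.normal(0, 2.5, 200),   # earlier season: noisier
])

# The second-stage regression learns how heavily to weight each season;
# the less noisy recent season should earn the larger coefficient.
stage2 = LinearRegression().fit(per_season, true_ws)
blended = stage2.predict(per_season)
```

This matches the intuition in the text: the most recent season carries the most information, but the earlier season still sharpens the final prediction.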
Finally… I plan to run an additional model that accounts for combine statistics and % of rebounds that are ORBs for the more recent players for whom I have that data. I have not yet applied this step to the retrodictions below, but I may do so before the draft. I have played with the numbers and while it improves the prediction, the improvement is relatively small and the general order of things is largely preserved.
Model 2 (Outcome likelihood):
This model attempts to capture the high-variance gambling nature of the draft. Rather than trying to peg a specific expected production to each player, it gives them percent likelihoods of being a "bust" (0 WS), "bench-warmer" (> 0, < 5 WS), "starter" (> 5, < 10 WS), or "star" (> 10 WS) at their peak performance. This model uses multinomial regression with most of the same predictors as model 1 (though it includes height and weight).
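A bare-bones version of that outcome model, again on synthetic data: bin peak win shares into the four tiers, then fit a multinomial logistic regression whose predicted probabilities are the bust/bench-warmer/starter/star likelihoods. The cutoffs follow the text; the predictors are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Bin peak win shares into the four outcome tiers described above.
def tier(ws):
    if ws <= 0:
        return "bust"
    elif ws < 5:
        return "bench-warmer"
    elif ws < 10:
        return "starter"
    return "star"

# Hypothetical predictors (box score stats, height, weight, etc.)
rng = np.random.default_rng(2)
X = rng.normal(0, 1, (300, 4))
peak_ws = 3 + 2.5 * X[:, 0] + rng.normal(0, 2, 300)
y = np.array([tier(w) for w in peak_ws])

# Multinomial logistic regression: one probability per outcome tier,
# summing to 1 for each prospect.
clf = LogisticRegression(max_iter=1000).fit(X, y)
probs = clf.predict_proba(X)
```

The appeal of this framing is exactly the "gambling" angle: two players with the same expected win shares can have very different star/bust probabilities.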
For now this model only includes the most recent college season, but I may eventually take the time to include past seasons in a similar manner to model 1.
Results (Retrodicting 1983 to 2012)
Now for the fun part!
Each player’s score is calculated using a unique model built from all other players in the dataset. This means that all retrodiction results are out of sample and thus an honest test of the model’s ability to accurately predict future prospects.
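That "unique model per player" procedure is leave-one-out prediction, which can be sketched in a few lines (toy data; a plain linear regression stands in for the real model):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data standing in for the prospect dataset
rng = np.random.default_rng(3)
X = rng.normal(0, 1, (50, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1, 50)

# Leave-one-out: score each player with a model fit on everyone else,
# so every retrodiction is out of sample.
scores = np.empty(len(y))
for i in range(len(y)):
    mask = np.arange(len(y)) != i
    model = LinearRegression().fit(X[mask], y[mask])
    scores[i] = model.predict(X[i:i + 1])[0]
```

Fitting without the player being scored is what keeps the retrodictions honest: the model never gets to "peek" at the outcome it is predicting.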
Looking through the results you will see that reliance on this model would have made some excellent picks in the past (Rondo, Lowry, Granger, Millsap, Stockton, Drexler, Zach Randolph, Artest, Ray Allen…) and avoided some clunkers (Austin Rivers, Randy Foye, Olowokandi, Joe Alexander…), but it also would have advocated some embarrassing picks (Bo Kimble, Derrick Chievous, taking three players before Patrick Ewing…) and thrown a hissy over some early picks who ended up being excellent players (Deron Williams, LaMarcus Aldridge, Dikembe Mutombo…).
Some of these errors are hard to explain and simply need to be accepted as examples of just how difficult it is to predict who will succeed in the NBA. However, looking through the data, I may have identified a few subjective heuristics to keep in the back of your mind.
1. No fatties! Sweetney, Oliver Miller, and Sean May are all near the top of the list of players this model liked more than reality did. Being overweight seems to work much better in college than it does in the pros. I haven’t tried addressing this in the model yet because weight’s effect is complicated: sometimes heavy is good, but too heavy is typically bad. It also doesn’t help that most of the real problem cases are guys who continued to expand after they were drafted. Guys like Kevin Love who go the other direction have a much better track record.
2. Beware the tween forward. This shouldn’t be a tough sell around these parts. Michael Beasley, Derrick Williams, Donyell Marshall, Glenn Robinson, and Antoine Walker were all pegged as stud prospects by this model. The level of NBA success varies across these cases, but all were considerable disappointments based on this model’s projections as well as their actual draft slots. History says to avoid college 4s who depend on outside shooting (*cough* Anthony Bennett *cough*).
3. I have not found enough data on shot locations and assisted rates to include them in the model, but my analysis of the dataset at Hoop-Math found that assisted jumpers are good, unassisted rim attempts are very good, and unassisted jumpers are very bad. Especially when gauging your opinion of my 2013 projections, I recommend seeing how your favorite player fits these criteria and subjectively factoring that in. I should also mention that ORBs are better than DRBs, but those are not distinguished in the model, so look for guys who collect more of the former.
If any of you notice other patterns in the retrodiction data, please note them in the comments. Identifying where the model errs can really help guide future improvements.
By draft year:
Here is a link to the Google document with all of my retrodiction data, sorted by draft class and expected win shares in peak season.
Top-15 at each position:
Here are the top scoring college prospects in the past 30 years. Shaq grades out as the single greatest college prospect in history. Not only that, but if he had come out following his absurd sophomore season, he would have been even more tantalizing. His likelihood of being a star player actually eclipsed 90% following that year of abusing college players as a 19-year-old.
The extremes really aren't the most interesting area with this model. There is pretty good agreement between the model and drafting teams in most of these cases. A lot of #1 and #2 picks in these lists.
What about the storied draft history of our puppies? Interestingly, the model actually would have liked some of our picks. Gerald Glass and William Avery looked like steals! Longley, Derrick, and Laettner looked like great prospects, and Donyell Marshall was supposed to be the next great thing. Unfortunately the model was consistently wrong in these cases... but nailed it in the cases where it said we reached.
(Note: Many of the following 'WINS' scores have changed to some degree after fixing a code error. However, the changes are rarely substantial, and I don't want to bother going through and editing them. The story is still the same.)
The biggest step I need to take in order to improve this model is pace-adjusting the earlier data. Currently there is a pace adjustment applied to all players available in the Draft Express dataset (ending in 2003), but not for any of the earlier prospects. I have not found publicly available pace data, but I think I can calculate a pseudo-pace using data that is scrapable from basketball-reference’s college dataset. If anyone wants to earn a bushel of Hoopus points, they can write a script that runs through this (CBBR School Index), sums the total stats of all players on each college team, and outputs each result as a row with "team name", "season", "total team fg, fga, rebounds, steals…" I’m not sure how much a pace adjustment will change, but it could be considerable and will certainly be positive.
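For anyone tempted by those Hoopus points, the aggregation half of that task looks roughly like this in pandas. The scraping itself isn't shown, the stat lines below are made up, and the possessions formula is an assumption: the usual estimate is FGA − ORB + TOV + 0.44×FTA, but with no ORB/DRB split in the older data the ORB term has to be dropped.

```python
import pandas as pd

# Hypothetical scraped player rows; real basketball-reference column
# names may differ.
players = pd.DataFrame({
    "team":   ["UNC", "UNC", "Duke", "Duke"],
    "season": [1984, 1984, 1984, 1984],
    "fg":     [220, 180, 240, 150],
    "fga":    [450, 400, 500, 320],
    "fta":    [120, 90, 150, 80],
    "tov":    [60, 55, 70, 40],
    "trb":    [200, 150, 220, 120],
})

# Sum player stats into one row per team-season, as described above
team_totals = (players
               .groupby(["team", "season"], as_index=False)
               [["fg", "fga", "fta", "tov", "trb"]]
               .sum())

# Crude possessions estimate (ORB term dropped for lack of data);
# this is the pseudo-pace ingredient.
team_totals["poss"] = (team_totals["fga"]
                       + 0.44 * team_totals["fta"]
                       + team_totals["tov"])
```

Dividing a player's counting stats by his team's estimated possessions would then put 1984 and 2004 box scores on a more comparable footing.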
Add more duds… There are a number of players who were drafted but never logged a professional minute of basketball. Before 2003, these guys are missing in my data. It is important to fix this, because these players carry information about what skillsets fail to translate to the NBA. I have started the process of manually pulling their numbers from college basketball reference, but I may or may not complete this process before the draft.
What about 2013??!!??
I have already run the numbers for the upcoming draft. However, in the interest of "content generation" I will be withholding the results for now and posting them by position throughout the next week. Stay tuned.