FanPost

[Plan] Basketball data gathering and number crunching

Hey guys,

I've been planning to write about this for quite some time. By the end of April I'll finish my second semester of mathematical statistics. I had already studied some econometrics before that, and I'm starting to "get it" (although I'll admit that I still need to improve).

I would like to start a project of my own, doing some basketball data analysis. Before starting out, I plan to gather as much of the data available on the internet as possible.

What I am planning to do:

  • Create a database from data available online, to serve as input.
  • Since I don't have the hardware (or the money) to share this data, I plan to share the scripts that build the database.
  • Research the methodology behind current analytic models.
  • Do some modelling.

The software I'm planning to use:

  • python3 for fetching the data and generating the database
  • postgresql for storing and managing the data
  • and R for analysis.
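
To make the idea of sharing "the scripts that build the database" a bit more concrete, here is a minimal sketch of what such a script could look like. Everything in it is an assumption for illustration: the table and column names (`games`, `plays` and so on) are hypothetical, and the standard-library `sqlite3` module stands in for PostgreSQL so that the sketch runs without a database server. With PostgreSQL one would use a driver such as psycopg2 and adjust the column types.

```python
import sqlite3

# Hypothetical schema: one row per game, one row per play-by-play event.
SCHEMA = """
CREATE TABLE IF NOT EXISTS games (
    game_id   TEXT PRIMARY KEY,
    game_date TEXT NOT NULL,
    home_team TEXT NOT NULL,
    away_team TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS plays (
    play_id      INTEGER PRIMARY KEY,
    game_id      TEXT NOT NULL REFERENCES games(game_id),
    period       INTEGER NOT NULL,
    seconds_left INTEGER NOT NULL,
    description  TEXT NOT NULL
);
"""


def build_database(path=":memory:"):
    """Create the (hypothetical) tables and return an open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def insert_game(conn, game_id, game_date, home_team, away_team):
    """Insert one game; values are bound as parameters, not string-formatted."""
    conn.execute(
        "INSERT INTO games VALUES (?, ?, ?, ?)",
        (game_id, game_date, home_team, away_team),
    )
    conn.commit()
```

Because the schema lives in the script, anyone who runs it gets the same empty database and only needs the loaders to fill it, so no data has to be hosted anywhere.
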

First of all, I plan to extract as much information as possible from basketball-reference, which includes:

  • parsing play-by-play data
  • and parsing shot charts (I promised this to vjl)
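
As a rough idea of what parsing play-by-play data involves, the sketch below turns one line of text into a structured record. The line format (`clock | team | description`) is made up for illustration and is not the actual basketball-reference layout; the real pages would need an HTML parser first. The `Play` fields and the `parse_play` helper are hypothetical names as well.

```python
import re
from typing import NamedTuple, Optional


class Play(NamedTuple):
    seconds_left: int  # seconds remaining in the period
    team: str
    description: str


# Assumed (made-up) format: "MM:SS | TEAM | free-text description"
LINE_RE = re.compile(r"^\s*(\d{1,2}):(\d{2})\s*\|\s*(\w+)\s*\|\s*(.+?)\s*$")


def parse_play(line: str) -> Optional[Play]:
    """Return a Play for a well-formed line, or None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    minutes, seconds, team, description = m.groups()
    return Play(int(minutes) * 60 + int(seconds), team, description)
```

Returning `None` for lines that do not match lets the calling script log and skip them instead of crashing halfway through a season of games.
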

If you have any suggestions about where I can download _raw_ data, what the DB should look like, or what to read, I'd be very thankful. Note that I don't need the data to be in a fancy format (like CSV or whatever), I just need it to be parseable. I also promise to share anything I stumble upon.

If you're interested, write to me here, or if there's a way to send a PM on this site, drop me one and I'll send you my email address.

The project is a semi-long-term one, that is, I'd like to come up with some results by the end of August. I hope something nice will come out of it for all of us.

PS. (Only very loosely related) In case you didn't already know: www.quandl.com is something of a must for anyone interested in data.