NBA Analytics Project Overview

I began learning AWS data tools this summer and wanted to create a project that would bring together much of what I learned. As a basketball fan (specifically a Nuggets fan, if you’re curious), I figured NBA data would be a great way to get my feet wet.

My plan was to start by implementing a star schema to support basic analytic queries and then expand from there.

I decided to implement two fact tables (fact_game_stats and fact_player_stats) and four dimension tables (dim_dates, dim_games, dim_players, and dim_teams). The details of the dimensional model will be covered in a future blog post.

Architecture Overview

Here is a basic overview of the architecture:

Data is obtained from the NBA API, which returns a series of JSON files. These are stored in the raw data zone of S3, partitioned by game date so that ETL jobs can read only the dates they need instead of scanning everything.
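
To make the partitioning concrete, here is a minimal sketch of how an object key for one raw file could be built. The prefix, the game_date=... layout, and the game ID are assumptions for illustration; the actual layout used in the project may differ.

```python
from datetime import date

# Hypothetical prefix for the raw data zone; not the project's confirmed layout.
RAW_PREFIX = "raw/games"


def raw_game_key(game_date: date, game_id: str) -> str:
    """Build a Hive-style partitioned S3 key for one game's raw JSON file."""
    # A "game_date=YYYY-MM-DD" path segment lets Glue and Athena treat the
    # date as a partition column and skip folders outside the requested range.
    return f"{RAW_PREFIX}/game_date={game_date.isoformat()}/{game_id}.json"


# The game ID below is made up for the example.
print(raw_game_key(date(2024, 1, 15), "0022300567"))
# prints: raw/games/game_date=2024-01-15/0022300567.json
```

With keys shaped like this, a job that only needs one night of games can list a single prefix rather than the whole bucket.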

Glue then processes the raw data and stores it in another data zone in Parquet, a columnar format that lets analytical queries read only the columns they need.
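
As a rough illustration of the kind of reshaping this step involves, the sketch below flattens one raw JSON payload into row records in plain Python. It assumes each file holds "resultSets" entries with "headers" and "rowSet" fields, and the sample values are made up. The real Glue job would do this at scale and write the result as Parquet, which is not shown here.

```python
import json


def flatten_result_sets(raw_json: str) -> list[dict]:
    """Turn header/row arrays into one dict per row, ready for a columnar writer."""
    payload = json.loads(raw_json)
    records = []
    # Assumed shape: {"resultSets": [{"headers": [...], "rowSet": [[...], ...]}]}
    for result_set in payload["resultSets"]:
        headers = result_set["headers"]
        for row in result_set["rowSet"]:
            # Pair each column name with the value in the same position.
            records.append(dict(zip(headers, row)))
    return records


# Made-up sample payload for illustration only.
sample = json.dumps({
    "resultSets": [
        {"headers": ["PLAYER_ID", "PTS"], "rowSet": [[1, 27], [2, 20]]}
    ]
})
print(flatten_result_sets(sample))
# prints: [{'PLAYER_ID': 1, 'PTS': 27}, {'PLAYER_ID': 2, 'PTS': 20}]
```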

Athena is used to query this data in place on S3, without loading it into a database. The query results are then sent to the Streamlit frontend for visualization.
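
Here is a minimal sketch of what the Athena step could look like from Python. The function takes an Athena client as a parameter (in practice, boto3.client("athena")); the database name and results location in the usage comment are placeholders, not the project's real values. It reads only the first page of results, which is enough for small dashboard queries.

```python
import time


def run_athena_query(client, sql, database, output_location, poll_seconds=1.0):
    """Run a query through an Athena client and return its rows as lists of strings."""
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until a terminal state.
    while True:
        status = client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

    # First page only; for SELECT queries the first row holds the column names.
    result = client.get_query_results(QueryExecutionId=query_id)
    return [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in result["ResultSet"]["Rows"]
    ]


# Example usage (placeholder database and bucket; requires AWS credentials):
# import boto3
# rows = run_athena_query(
#     boto3.client("athena"),
#     "SELECT * FROM fact_player_stats LIMIT 10",
#     database="nba_analytics",
#     output_location="s3://my-athena-results/",
# )
```

In a Streamlit app, the returned rows could be turned into a DataFrame and passed to a chart or table widget.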

The project has a solid foundation, and I could take it in a number of directions, such as predictive analytics or real-time updates using a streaming service like Kinesis.