Create a Comprehensive and Profitable System for Predicting the Outcomes of NBA Games.
Under the 'comprehensive' aspect, the project explores and tests multiple plausible prediction methods, with scope and priorities evolving based on which results prove most useful.
The 'profitable' aspect aims to generate actionable insights for profitable betting decisions.
Initially, the project focuses on predicting final score margins. Over the long term, it could expand to other betting markets such as over/under totals and player props, embracing a broad definition of 'outcomes'.
NBA betting uniquely aligns with my interests, knowledge, skills, and goals.
How do we predict the outcome of an NBA game? From a data science viewpoint, the challenge lies in identifying the optimal feature set and designing an effective model architecture.
The vast amount of public data available for NBA games is both a blessing and a challenge: the options are plentiful, but selecting the right ones and managing them effectively is crucial.
NBA game predictions can be approached at multiple levels, each functioning independently or as part of a higher-level framework:
Data acquisition is complicated by the time series nature of sports data. Predictive metrics vary in utility based on the timeframe considered—team history, season-to-date stats, recent performance, or data from the last game. ML models require point-in-time data, which poses challenges in terms of availability and structuring.
See also: GitHub Discussion - Frameworks and Data Sourcing
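As a minimal sketch of the point-in-time requirement, the snippet below (hypothetical file and column names) builds a rolling team scoring average and shifts it by one game so each row only uses information available before tip-off:

```python
import pandas as pd

# Hypothetical team game log: one row per team per game.
games = pd.read_csv("team_game_logs.csv", parse_dates=["game_date"])
games = games.sort_values(["team_id", "game_date"])

# Rolling 10-game scoring average, shifted by one game so the feature for a
# given game excludes that game's result (no target leakage).
games["pts_avg_10"] = (
    games.groupby("team_id")["pts"]
    .transform(lambda s: s.shift(1).rolling(10, min_periods=3).mean())
)
```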
Data and feature categories:

- Raw Data
- Traditional Statistics
- Advanced Statistics
- Subjective
- Other
The project focuses on predicting point spreads using team-level statistics. This is the most common betting market and provides a clear benchmark.
Vegas lines predict game winners and margins with an average miss of roughly 9 to 10 points per game. The graph below shows this discrepancy over time.
Vegas sets a high bar—they have extensive resources and data. The public Vegas lines also serve as both a benchmark and a feature for modeling.
The project is built around a SQLite database with data processing, feature engineering, model training, and prediction pipelines. A Flask web app and Dash dashboard provide the interface.
Note: This diagram reflects an earlier version of the project. See ARCHITECTURE.md for the current implementation.
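For illustration, reading upcoming games out of such a database might look like the sketch below; the database path and table name are placeholders, not the project's actual schema (see ARCHITECTURE.md):

```python
import sqlite3

import pandas as pd

# Placeholder path and table name; the real schema lives in ARCHITECTURE.md.
with sqlite3.connect("data/nba_betting.db") as conn:
    upcoming = pd.read_sql(
        "SELECT * FROM games WHERE game_date >= date('now')", conn
    )
```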
Data collection is the most time-intensive part of the project. The data falls into three categories:
While acquiring current data is relatively straightforward, the real challenge lies in sourcing historical data for training and testing models.
Note: This diagram reflects an earlier version. See ARCHITECTURE.md for current data sources.
Note: This diagram reflects an earlier version. See ARCHITECTURE.md for current ETL details.
The ETL pipeline prepares data for modeling. The main steps:
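As a rough, hypothetical sketch of one such step, the snippet below joins final scores to closing lines and derives the modeling targets; the actual steps and schema are documented in ARCHITECTURE.md:

```python
import pandas as pd

# Hypothetical outputs of the collection step.
scores = pd.read_csv("scores.csv")  # game_id, home_score, away_score
lines = pd.read_csv("lines.csv")    # game_id, home_spread (negative = home favored)

# Join results to closing lines, drop games without a line, derive targets.
games = scores.merge(lines, on="game_id", how="left").dropna(subset=["home_spread"])
games["home_margin"] = games["home_score"] - games["away_score"]
games["home_covered"] = (games["home_margin"] + games["home_spread"]) > 0
```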
Important considerations:
Analysis of Vegas spread accuracy across ~23,000 games from 2006-2026 shows the average miss has increased over time, from 9.12 points in the Traditional Era (2006-2016) to 10.49 points in the Modern Variance Era (2020-2026). This trend coincides with the three-point revolution and increased game-to-game variance in the NBA.
Postseason games show slightly higher prediction error (~10.18 points) compared to regular season (~9.64 points), likely due to smaller sample sizes and higher-stakes adjustments.
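A sketch of how this kind of breakdown can be computed, again with placeholder file and column names:

```python
import pandas as pd

df = pd.read_csv("games_with_lines.csv")  # hypothetical, one row per game
df["vegas_miss"] = (df["home_score"] - df["away_score"] + df["home_spread"]).abs()

# Average miss within the two eras discussed above.
eras = {"Traditional Era": (2006, 2016), "Modern Variance Era": (2020, 2026)}
for label, (start, end) in eras.items():
    era = df[df["season"].between(start, end)]
    print(f"{label}: average miss {era['vegas_miss'].mean():.2f} points")

# Regular season vs. postseason split.
print(df.groupby("is_postseason")["vegas_miss"].mean())
```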
Two approaches are explored: AutoML for quick iteration on tabular data, and a custom architecture for time-series modeling (see NBA AI).
The project uses AutoGluon for automated machine learning. AutoGluon trains and ensembles multiple models, handling feature engineering and hyperparameter tuning automatically.
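A typical AutoGluon invocation on a tabular feature set looks like the sketch below; the label and file names are placeholders rather than the project's actual configuration:

```python
import pandas as pd
from autogluon.tabular import TabularPredictor

train = pd.read_csv("train_features.csv")  # hypothetical feature exports
test = pd.read_csv("test_features.csv")

# Regression on the final home margin; AutoGluon handles model selection,
# hyperparameter tuning, and ensembling internally.
predictor = TabularPredictor(label="home_margin", eval_metric="mean_absolute_error")
predictor.fit(train, presets="medium_quality", time_limit=600)

print(predictor.leaderboard(test))
preds = predictor.predict(test.drop(columns=["home_margin"]))
```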
Consistently outperforming Vegas lines remains a challenge. This is not unexpected given the complexity of the task and the additional hurdle of overcoming the vig (the bookmaker's charge).
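To make the vig concrete: at standard -110 pricing a bettor risks 110 to win 100, so the break-even win rate is about 52.4%, not 50%:

```python
# Break-even win rate at standard -110 pricing: risk 110 to win 100,
# so p * 100 = (1 - p) * 110  =>  p = 110 / 210 (~0.5238).
risk, win = 110, 100
break_even = risk / (risk + win)
print(f"Break-even win rate at -110: {break_even:.2%}")
```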
The journey toward a custom model architecture initially centered on a traditional data science approach: combining features from different sources. While this approach aligns with conventional methods and may still hold potential, it faces several challenges, including complex data collection and the risk of simply mirroring the algorithms used by line setters. To move beyond these limitations and add an element of innovation, I am exploring a second approach with the following key requirements:
This section applies model predictions to betting decisions. The process combines model outputs with current betting lines to identify potential value bets.
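A minimal sketch of that comparison, with placeholder column names and an illustrative edge threshold:

```python
import pandas as pd

games = pd.read_csv("todays_slate.csv")  # hypothetical: predictions joined to lines

# Edge: model's predicted home margin vs. the margin implied by the spread
# (convention: a negative home spread means the home team is favored).
games["edge"] = games["predicted_home_margin"] + games["home_spread"]

# Flag only games where the disagreement exceeds an illustrative threshold.
EDGE_THRESHOLD = 3.0
value_bets = games[games["edge"].abs() >= EDGE_THRESHOLD]
```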
A simple bankroll management module tracks bet sizing and account balance using Kelly Criterion principles.
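The Kelly fraction itself is straightforward to compute; the sketch below assumes the model supplies a win probability and the bet pays standard -110 odds:

```python
def kelly_fraction(p_win: float, american_odds: int = -110) -> float:
    """Kelly stake as a fraction of bankroll: f* = (b * p - q) / b,
    where b is the net payout per unit staked and q = 1 - p."""
    b = 100 / abs(american_odds) if american_odds < 0 else american_odds / 100
    q = 1.0 - p_win
    return max((b * p_win - q) / b, 0.0)

# Example: a 55% win probability at -110 suggests staking about 5.5% of
# bankroll; many bettors scale this down (e.g., quarter Kelly) to reduce risk.
print(f"{kelly_fraction(0.55):.3f}")
```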
A Flask web app displays games and predictions. The Dash dashboard shows betting performance metrics.
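For orientation, a stripped-down version of how such an app might serve predictions (route, query, and schema are illustrative, not the project's actual code):

```python
import sqlite3

import pandas as pd
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/games")
def todays_games():
    # Hypothetical query; the real app renders templates and uses its own schema.
    with sqlite3.connect("data/nba_betting.db") as conn:
        todays = pd.read_sql(
            "SELECT * FROM predictions WHERE game_date = date('now')", conn
        )
    return jsonify(todays.to_dict("records"))
```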
```bash
# Clone the repository
git clone https://github.com/NBA-Betting/NBA_Betting.git
cd NBA_Betting

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies (choose one)
pip install -e ".[all]"  # Full install with ML, web app, and dev tools
pip install -e "."       # Core only (data collection and ETL)
pip install -e ".[ml]"   # Core + AutoGluon for modeling
pip install -e ".[web]"  # Core + Flask/Dash web app

# Set up environment variables
cp .env.example .env
# Edit .env with your ODDS_API_KEY (optional, for live odds)

# Run the data pipeline
python update_data.py

# Launch the web app
python start_app.py
```
The update_data.py script runs the full pipeline: data collection → ETL → predictions.
```bash
python update_data.py                    # Daily update (yesterday/today/tomorrow)
python update_data.py --season 2024      # Backfill full season
python update_data.py --date 2025-01-15  # Fix specific date
python update_data.py --collect-only     # Skip ETL and predictions
```