Sports World Today

Developing a March Madness Prediction Model with XGBoost

sportsworldtoday by sportsworldtoday
March 15, 2025
in NCAAF

In some of our earlier Talking Tech posts, we used a random forest classifier to predict play calls in college football. I also have a particular fondness for building artificial neural networks to forecast college football outcomes. Today we’re shifting gears to a different machine learning technique that also belongs to the family of ensemble methods. These methods combine many models to harness strength in numbers. In a random forest, we build numerous decision trees and aggregate their outputs into a single result; our focus today is a method that’s just a bit less… random.

Gradient boosting shares quite a bit of its DNA with random forest approaches—think of them as siblings in the ensemble family. Both champion decision trees, and are equally versatile, applicable for either classification or regression tasks. But what differentiates them? While a random forest relies on crafting a whirlwind of decision trees, banking on the hope that misfit trees will average themselves out and the cream of the crop emerges, gradient boosting starts with a single tree, scrutinizes its errors, and incrementally evolves its successors to master precision. This leapfrogging continues all the way through.

Eventually, this technique yields a sequence of trees, each learning from its predecessors’ errors to refine accuracy. None of these trees are discarded; each contributes to the final model, which is what makes this another ensemble method. Gradient-boosted models often outperform random forests and frequently shine on platforms like Kaggle.
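
To make the "learn from the previous tree's errors" idea concrete, here is a toy sketch in plain Python. It is not XGBoost's actual algorithm: it assumes squared-error loss, uses one-split "stumps" as the weak learners, and the helper names (fit_stump, boost) and the data are made up for illustration.

```python
# Toy gradient boosting for squared error, using one-split stumps as weak learners.
# fit_stump, boost, and the data below are illustrative assumptions, not XGBoost internals.

def fit_stump(xs, residuals):
    """Find the threshold on x that best splits the residuals into two constant predictions."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees=20, learning_rate=0.3):
    """Fit stumps one after another, each on the residuals left by the ensemble so far."""
    base = sum(ys) / len(ys)          # start from the mean prediction
    trees = []
    preds = [base] * len(ys)
    history = []                      # training MAE after each added stump
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        trees.append(stump)           # every stump is kept in the final model
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        history.append(sum(abs(y - p) for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(tree(x) for tree in trees)

    return predict, history


# Made-up data: x could be a rating gap, y a final margin
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [-12.0, -9.0, -4.0, -1.0, 2.0, 5.0, 9.0, 14.0]
predict, history = boost(xs, ys)
print(round(history[0], 2), round(history[-1], 2))
```

The printed pair is the training error after the first stump and after the last one; the second number should be noticeably smaller, which is the whole point of fitting each new tree to what the earlier ones got wrong.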

For those dabbling in gradient boosting using Python, XGBoost and LightGBM top my list of library recommendations. Both offer robust solutions, though today we’ll play around with XGBoost. However, I’d suggest checking out LightGBM when you have a moment.

Let’s pivot to practicalities: using the CBBD Python library, we’ll tap into the data from CollegeBasketballData.com’s REST API. Make sure you have these packages installed: cbbd, pandas, scikit-learn (the pip name for the sklearn import), and xgboost. Install them via pip if necessary. We’ll kick off by importing everything we need and setting our CBBD API key in the configuration. Need one? Head over to the CBBD website to grab yours.

import cbbd
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

configuration = cbbd.Configuration(
    access_token = 'your_api_key_here'
)

We’ll only make 22 API calls (11 seasons across two endpoints), comfortably within CBBD’s free allocation of 1,000 calls per month, which leaves plenty of room for repeated model runs.

Next, let’s pull all NCAA tournament games from 2014 through 2024 (the loop below stops at 2014 because the end of a range is exclusive). You’re welcome to go further back if curiosity strikes. We’ll use the tournament="NCAA" parameter to grab all the tournament games from a given season.

games = []
with cbbd.ApiClient(configuration) as api_client:
    games_api = cbbd.GamesApi(api_client)
    for season in range(2024, 2013, -1):
        results = games_api.get_games(season=season, tournament="NCAA")
        games += results
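
A quick check of which seasons that loop actually visits, since the stop value of range is exclusive. The factor of two below assumes one games call and one stats call per season, as in this walkthrough.

```python
# The stop value (2013) is excluded, so the loop runs from 2024 down to 2014
seasons = list(range(2024, 2013, -1))
print(seasons[0], seasons[-1], len(seasons))

# One games call plus one stats call per season
calls = len(seasons) * 2
print(calls)
```

To include 2013 as well, the stop value would need to be 2012.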

This returns 686 games. Curious what the data for a single game looks like? Let’s peek at one now.

GameInfo(id=12010, source_id='401638579', season_label="20232024", season=2024, season_type=, start_date=datetime.datetime(2024, 3, 19, 18, 40, tzinfo=datetime.timezone.utc), start_time_tbd=False, neutral_site=True, conference_game=False, game_type="TRNMNT", tournament="NCAA", game_notes="Men's Basketball Championship - West Region - First Four", status=, attendance=0, home_team_id=114, home_team='Howard', home_conference_id=18, home_conference="MEAC", home_seed=16, home_points=68, home_period_points=[27, 41], home_winner=False, away_team_id=341, away_team='Wagner', away_conference_id=21, away_conference="NEC", away_seed=16, away_points=71, away_period_points=[38, 33], away_winner=True, excitement=4.7, venue_id=76, venue="UD Arena", city='Dayton', state="OH")

Our next move is to pull team statistics to use as features in our model. For this, we’ll lean on the CBBD Stats API, pulling regular season stats for the same seasons as our tournament games. Remember to specify season_type="regular". This is crucial: otherwise the season stats could include the very tournament games we’re trying to predict, leaking outcomes into the training data.

Run the code below to fetch those team season stats.

stats = []
with cbbd.ApiClient(configuration) as api_client:
    stats_api = cbbd.StatsApi(api_client)
    for season in range(2024, 2013, -1):
        results = stats_api.get_team_season_stats(season=season, season_type="regular")
        stats += results

A glance at these stats reveals the plethora of metrics at our disposal:

TeamSeasonStats(season=2024, season_label="20232024", team_id=1, team='Abilene Christian', conference="WAC", games=32, wins=15, losses=17, total_minutes=1325, pace=61.1, team_stats=TeamSeasonUnitStats(field_goals=TeamSeasonUnitStatsFieldGoals(pct=43.2, attempted=1877, made=811), two_point_field_goals=TeamSeasonUnitStatsFieldGoals(pct=46.4, attempted=1393, made=646), three_point_field_goals=TeamSeasonUnitStatsFieldGoals(pct=34.1, attempted=484, made=165), free_throws=TeamSeasonUnitStatsFieldGoals(pct=73.1, attempted=729, made=533), rebounds=TeamSeasonUnitStatsRebounds(total=1070, defensive=756, offensive=314), turnovers=TeamSeasonUnitStatsTurnovers(team_total=12, total=404), fouls=TeamSeasonUnitStatsFouls(flagrant=0, technical=6, total=635), points=TeamSeasonUnitStatsPoints(fast_break=319, off_turnovers=466, in_paint=1138, total=2320), four_factors=TeamSeasonUnitStatsFourFactors(free_throw_rate=38.8, offensive_rebound_pct=29.3, turnover_ratio=0.2, effective_field_goal_pct=47.6), assists=405, blocks=65, steals=253, possessions=2028, rating=114.4, true_shooting=52.8), opponent_stats=TeamSeasonUnitStats(field_goals=TeamSeasonUnitStatsFieldGoals(pct=46.5, attempted=1792, made=833), two_point_field_goals=TeamSeasonUnitStatsFieldGoals(pct=52.6, attempted=1227, made=645), three_point_field_goals=TeamSeasonUnitStatsFieldGoals(pct=33.3, attempted=565, made=188), free_throws=TeamSeasonUnitStatsFieldGoals(pct=68.7, attempted=723, made=497), rebounds=TeamSeasonUnitStatsRebounds(total=1171, defensive=859, offensive=312), turnovers=TeamSeasonUnitStatsTurnovers(team_total=23, total=478), fouls=TeamSeasonUnitStatsFouls(flagrant=0, technical=6, total=619), points=TeamSeasonUnitStatsPoints(fast_break=316, off_turnovers=411, in_paint=1120, total=2351), four_factors=TeamSeasonUnitStatsFourFactors(free_throw_rate=40.3, offensive_rebound_pct=26.6, turnover_ratio=0.2, effective_field_goal_pct=51.7), assists=388, blocks=108, steals=206, possessions=2023, rating=116.2, true_shooting=55.7))

Now we need to attach team statistics to each game record. We’ll build a list of dict objects combining this information, which can be easily loaded into a pandas data frame.

The loop below merges the two datasets:

records = []
for game in games:
    record = game.to_dict()
    home_stats = [stat for stat in stats if stat.team_id == game.home_team_id and stat.season == game.season][0]
    away_stats = [stat for stat in stats if stat.team_id == game.away_team_id and stat.season == game.season][0]
    record['home_pace'] = home_stats.pace
    record['home_o_rating'] = home_stats.team_stats.rating
    record['home_d_rating'] = home_stats.opponent_stats.rating
    record['home_free_throw_rate'] = home_stats.team_stats.four_factors.free_throw_rate
    record['home_offensive_rebound_rate'] = home_stats.team_stats.four_factors.offensive_rebound_pct
    record['home_turnover_ratio'] = home_stats.team_stats.four_factors.turnover_ratio
    record['home_efg'] = home_stats.team_stats.four_factors.effective_field_goal_pct
    record['home_free_throw_rate_allowed'] = home_stats.opponent_stats.four_factors.free_throw_rate
    record['home_offensive_rebound_rate_allowed'] = home_stats.opponent_stats.four_factors.offensive_rebound_pct
    record['home_turnover_ratio_forced'] = home_stats.opponent_stats.four_factors.turnover_ratio
    record['home_efg_allowed'] = home_stats.opponent_stats.four_factors.effective_field_goal_pct
    record['away_pace'] = away_stats.pace
    record['away_o_rating'] = away_stats.team_stats.rating
    record['away_d_rating'] = away_stats.opponent_stats.rating
    record['away_free_throw_rate'] = away_stats.team_stats.four_factors.free_throw_rate
    record['away_offensive_rebound_rate'] = away_stats.team_stats.four_factors.offensive_rebound_pct
    record['away_turnover_ratio'] = away_stats.team_stats.four_factors.turnover_ratio
    record['away_efg'] = away_stats.team_stats.four_factors.effective_field_goal_pct
    record['away_free_throw_rate_allowed'] = away_stats.opponent_stats.four_factors.free_throw_rate
    record['away_offensive_rebound_rate_allowed'] = away_stats.opponent_stats.four_factors.offensive_rebound_pct
    record['away_turnover_ratio_forced'] = away_stats.opponent_stats.four_factors.turnover_ratio
    record['away_efg_allowed'] = away_stats.opponent_stats.four_factors.effective_field_goal_pct
    records.append(record)
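
The list comprehensions above rescan every stats row for each game and raise an IndexError if a team has no stats for that season. One possible alternative, sketched here with stand-in objects rather than real API responses, is to index the stats by (team_id, season) once and skip games with missing rows. The pace values and team id 999 are made up.

```python
from types import SimpleNamespace

# Stand-in objects carrying the same attribute names used above
stats = [
    SimpleNamespace(team_id=114, season=2024, pace=68.2),
    SimpleNamespace(team_id=341, season=2024, pace=63.5),
]
games = [
    SimpleNamespace(home_team_id=114, away_team_id=341, season=2024),
    SimpleNamespace(home_team_id=114, away_team_id=999, season=2024),  # 999 has no stats row
]

# Build the lookup once instead of scanning the full list for every game
stats_by_key = {(s.team_id, s.season): s for s in stats}

records = []
skipped = []
for game in games:
    home = stats_by_key.get((game.home_team_id, game.season))
    away = stats_by_key.get((game.away_team_id, game.season))
    if home is None or away is None:
        skipped.append(game)  # keep track of games we could not enrich
        continue
    records.append({'home_pace': home.pace, 'away_pace': away.pace})

print(len(records), len(skipped))
```

Whether to skip or fail loudly on a missing team is a judgment call; skipping keeps the pipeline running, while failing surfaces data gaps early.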

With these records in place, we’ll transition into a pandas data frame, calculating a new column to represent the final score margin derived from the home and away scores.

df = pd.DataFrame(records)
# to_dict() appears to emit camelCase keys (as with homeSeed and gameNotes used later)
df['margin'] = df['homePoints'] - df['awayPoints']

As we take stock of the data with df.head(), it’s time to engage in feature selection, deciding which columns to include in our model training.

Let’s gather a bird’s-eye view on the columns within the data frame:

df.columns

We’ll extract particular columns to train our model and identify the output we aim to predict, the margin.

features = [
    'home_o_rating',
    'home_d_rating',
    'home_pace',
    'home_free_throw_rate',
    'home_offensive_rebound_rate',
    'home_turnover_ratio',
    'home_efg',
    'home_free_throw_rate_allowed',
    'home_offensive_rebound_rate_allowed',
    'home_turnover_ratio_forced',
    'home_efg_allowed',
    'away_o_rating',
    'away_d_rating',
    'away_pace',
    'away_free_throw_rate',
    'away_offensive_rebound_rate',
    'away_turnover_ratio',
    'away_efg',
    'away_free_throw_rate_allowed',
    'away_offensive_rebound_rate_allowed',
    'away_turnover_ratio_forced',
    'away_efg_allowed',
    'homeSeed',
    'awaySeed'
]

outputs = ['margin']

df[features + outputs]
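
Since the home and away columns mirror each other, the same list can also be generated rather than typed out. This is just a convenience sketch; the team_stats name is an assumption introduced here.

```python
# Per-team stat suffixes, in the same order as the hand-written list
team_stats = [
    'o_rating',
    'd_rating',
    'pace',
    'free_throw_rate',
    'offensive_rebound_rate',
    'turnover_ratio',
    'efg',
    'free_throw_rate_allowed',
    'offensive_rebound_rate_allowed',
    'turnover_ratio_forced',
    'efg_allowed',
]

# Prefix each stat with home_ and away_, then append the two seed columns
features = [f'{side}_{stat}' for side in ('home', 'away') for stat in team_stats]
features += ['homeSeed', 'awaySeed']

print(len(features))
```

Adding or dropping a stat then means editing one entry instead of two.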

Feel free to tailor and tinker with the features here; swap stats in or out to suit your goals. Finally, we’ll split the data into training and testing sets, using earlier seasons to train the model while the 2024 games serve as our test set.

training = df.query("season != 2024").copy()
testing = df.query("season == 2024").copy()

We’ll further split the training data into training and validation subsets. Scoring the model on games it never saw during fitting gives an honest read on accuracy and helps reveal overfitting.

X_train, X_valid, y_train, y_valid = train_test_split(training[features], training[outputs], train_size=0.8, test_size=0.2, random_state=0)

And finally, the moment we’ve been gearing toward—training our model using XGBRegressor. Were we tackling a classification problem, an XGBClassifier would be our tool of choice.

model = XGBRegressor(random_state=0)
model.fit(X_train, y_train)

Feel the buzz of success—it’s alive! Now, let’s put this model to work by using it to predict our validation set.

predictions = model.predict(X_valid)
predictions

Since these games have already been played, we can use a metric like mean absolute error (MAE) to measure how close the predictions came.

mae = mean_absolute_error(y_valid, predictions)
mae
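
For intuition, MAE is just the average of |actual - predicted| across games. A hand-rolled version with made-up margins (the function name mean_abs_error is chosen here to avoid clashing with the sklearn import):

```python
def mean_abs_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up margins (home points minus away points) and predictions
actual = [3, -7, 12, -1]
predicted = [5.5, -2.0, 6.0, 1.5]

# Absolute errors are 2.5, 5.0, 6.0, 2.5, which average to 4.0
print(mean_abs_error(actual, predicted))  # 4.0
```

An MAE of 4.0 would mean the predictions miss the real margin by four points on average, in either direction.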

I get a mean absolute error of ~7.96, which is a respectable start and roughly on par with benchmark MAEs. What now? Fine-tuning is always worthwhile: adjusting model parameters or adding input features might yield better results.

For exploratory fine-tuning, consider adjusting hyperparameters like the number of estimators or the learning rate.

model = XGBRegressor(n_estimators=100, learning_rate=0.05, n_jobs=4)
model.fit(X_train, y_train)
predictions = model.predict(X_valid)
mae = mean_absolute_error(y_valid, predictions)
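
Rather than editing values by hand, a small grid search can try every combination. The sketch below is generic: grid_search and toy_score are hypothetical names, and toy_score is a stand-in so the example runs without any data.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every combination in param_grid; return (best_params, best_score), lower is better."""
    names = list(param_grid)
    best_params, best_score = None, float('inf')
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    # Stand-in for "fit the model with these params and return the validation MAE"
    return (abs(params['learning_rate'] - 0.05) * 10
            + abs(params['n_estimators'] - 200) / 100)

param_grid = {
    'n_estimators': [100, 200, 400],
    'learning_rate': [0.01, 0.05, 0.1],
}
best_params, best_score = grid_search(param_grid, toy_score)
print(best_params, best_score)
```

For the real model, the scoring function could fit XGBRegressor(**params) on X_train and y_train and return mean_absolute_error(y_valid, model.predict(X_valid)). With only a few hundred training games, expect the differences between settings to be noisy.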

While my efforts didn’t push the MAE any lower, your alterations might! It’s a process of continuous improvement.

Now, over to our test dataset: let’s predict outcomes and compare them against the real 2024 NCAA tournament results.

predictions = model.predict(testing[features])
testing['prediction'] = predictions
testing[['homeSeed', 'homeTeam', 'awaySeed', 'awayTeam', 'margin', 'prediction']]

And here’s the fun part—measuring the percentage of games our model nailed.

testing.query("(margin < 0 and prediction < 0) or (margin > 0 and prediction > 0)").shape[0] / testing.shape[0]
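
The query above counts a pick as correct when the predicted margin has the same sign as the actual margin. The same logic in plain Python, with made-up numbers:

```python
def pick_accuracy(margins, predictions):
    """Share of games where the predicted winner (sign of the margin) matches the actual winner."""
    correct = sum(
        1 for m, p in zip(margins, predictions)
        if (m > 0 and p > 0) or (m < 0 and p < 0)
    )
    return correct / len(margins)

# Made-up actual margins and predicted margins
margins = [3, -7, 12, -1, 5]
predictions = [5.5, -2.0, -6.0, 1.5, 0.4]

# Games 1, 2 and 5 have matching signs, so 3 of 5 picks are correct
print(pick_accuracy(margins, predictions))  # 0.6
```

Note that a pick can be correct even when the margin is far off (the last game), which is why this number and MAE tell you different things.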

At 64.3%, our model’s predictive prowess shines through. First-round prediction stats reveal a slightly enhanced 69.7% rate—neat, right? To zero in on these first-round numbers:

testing[testing['gameNotes'].str.contains('1st')].query("(margin < 0 and prediction < 0) or (margin > 0 and prediction > 0)").shape[0] / testing[testing['gameNotes'].str.contains('1st')].shape[0]

With results in hand, it’s convenient to save your model for future reuse. XGBoost infers the file format from the extension, so give the file an explicit one such as .json:

model.save_model('xgboostmodel.json')

When you need predictions later, loading it back up is straightforward:

model = XGBRegressor()
model.load_model('xgboostmodel.json')

Fancy predicting outcomes for theoretical matchups? Perfect for bracket strategizing—here’s how you might weave that magic:

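# Open a fresh client here: the earlier one went out of scope when its with block ended
with cbbd.ApiClient(configuration) as api_client:
    stats_api = cbbd.StatsApi(api_client)
    stats = stats_api.get_team_season_stats(season=2025, season_type="regular")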

def predict_game(model, stats, projected_home_seed, home_team, projected_away_seed, away_team):
    home_stats = [stat for stat in stats if stat.team == home_team][0]
    away_stats = [stat for stat in stats if stat.team == away_team][0]
    record = {
        'home_o_rating': home_stats.team_stats.rating,
        'home_d_rating': home_stats.opponent_stats.rating,
        'home_pace': home_stats.pace,
        'home_free_throw_rate': home_stats.team_stats.four_factors.free_throw_rate,
        'home_offensive_rebound_rate': home_stats.team_stats.four_factors.offensive_rebound_pct,
        'home_turnover_ratio': home_stats.team_stats.four_factors.turnover_ratio,
        'home_efg': home_stats.team_stats.four_factors.effective_field_goal_pct,
        'home_free_throw_rate_allowed': home_stats.opponent_stats.four_factors.free_throw_rate,
        'home_offensive_rebound_rate_allowed': home_stats.opponent_stats.four_factors.offensive_rebound_pct,
        'home_turnover_ratio_forced': home_stats.opponent_stats.four_factors.turnover_ratio,
        'home_efg_allowed': home_stats.opponent_stats.four_factors.effective_field_goal_pct,
        'away_o_rating': away_stats.team_stats.rating,
        'away_d_rating': away_stats.opponent_stats.rating,
        'away_pace': away_stats.pace,
        'away_free_throw_rate': away_stats.team_stats.four_factors.free_throw_rate,
        'away_offensive_rebound_rate': away_stats.team_stats.four_factors.offensive_rebound_pct,
        'away_turnover_ratio': away_stats.team_stats.four_factors.turnover_ratio,
        'away_efg': away_stats.team_stats.four_factors.effective_field_goal_pct,
        'away_free_throw_rate_allowed': away_stats.opponent_stats.four_factors.free_throw_rate,
        'away_offensive_rebound_rate_allowed': away_stats.opponent_stats.four_factors.offensive_rebound_pct,
        'away_turnover_ratio_forced': away_stats.opponent_stats.four_factors.turnover_ratio,
        'away_efg_allowed': away_stats.opponent_stats.four_factors.effective_field_goal_pct,
        'homeSeed': projected_home_seed,
        'awaySeed': projected_away_seed
    }
    return model.predict(pd.DataFrame([record]))[0]

predict_game(model, stats, 5, 'Michigan', 11, 'Dayton')

In this demonstration, we loaded data for the current season, wrote a function that builds a single-row DataFrame with the features the model expects, and used it to generate a prediction. Here, the model predicts that Michigan, as a 5 seed, would beat Dayton, an 11 seed, by 6.1 points. Magnifique!
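
Because the target is the home score minus the away score, a positive prediction favors the team passed in the home slot and a negative one favors the away slot. A small, hypothetical helper (describe_prediction is not part of the original code) can turn the number into a readable pick:

```python
def describe_prediction(home_team, away_team, predicted_margin):
    """Turn a predicted margin (home minus away) into a readable pick."""
    if predicted_margin > 0:
        return f'{home_team} by {predicted_margin:.1f}'
    if predicted_margin < 0:
        return f'{away_team} by {-predicted_margin:.1f}'
    return 'toss-up'

print(describe_prediction('Michigan', 'Dayton', 6.1))    # Michigan by 6.1
print(describe_prediction('Michigan', 'Dayton', -2.34))  # Dayton by 2.3
```

In practice the margin would come straight from predict_game(model, stats, 5, 'Michigan', 11, 'Dayton').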

In closing, I pass the baton over to you. We’ve laid a sturdy foundation, yet countless enhancements await: untapped Stats API features, opponent-adjusted statistics, and data from beyond the API entirely. The field is rife with potential.

I’d love to hear your thoughts, whether on Twitter, Bluesky, Discord, or beyond. Enjoy building, and may the odds favor your brackets!
