Web Scraping · Python

Web Scraping MLS 2024

Collected, cleaned, and analyzed the 2024 Major League Soccer season to uncover insights on team performance, player statistics, and the Messi economic impact on attendance.

Tools: Python · BeautifulSoup · pandas · plotly · seaborn
Matches: 513 · Tools: 5 · Year: 2024

Overview

This project explores the 2024 Major League Soccer season: team performance, player statistics, and match outcomes. The goal was to collect, clean, and analyze match data to surface actionable insights about the league.

Methodology

  1. Data Collection & Preparation: Gathered official MLS data using Python tools such as requests (HTTP requests), BeautifulSoup (HTML parsing), and glob (batch file handling), loading everything into pandas DataFrames.
  2. Data Cleaning: Addressed missing values, standardized data formats, and validated datasets to ensure accuracy.
  3. Exploratory Analysis: Used statistical summaries and visualizations to identify key patterns.
  4. Visualization: Created interactive dashboards to communicate findings effectively.
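The batch-loading part of step 1 can be sketched as follows. This is a minimal illustration, not the project's actual loader: it assumes the scraped tables were saved as CSV files, and the function name load_csv_batch and the file pattern are hypothetical.

```python
import glob

import pandas as pd


def load_csv_batch(pattern: str) -> pd.DataFrame:
    """Read every CSV matching `pattern` and stack them into one DataFrame."""
    # Sort the paths so the row order is reproducible across runs.
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r}")
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index=True gives the combined frame a clean 0..n-1 index.
    return pd.concat(frames, ignore_index=True)
```

A call such as load_csv_batch("data/week_*.csv") would then return one DataFrame covering every matching file; the directory and file names here are placeholders.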

Libraries Used

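# Standard library
import glob
import warnings

# Data handling
import numpy as np
import pandas as pd

# Visualization and table output
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
from tabulate import tabulate

# Notebook settings: default matplotlib style, warnings suppressed
plt.style.use('default')
warnings.filterwarnings('ignore')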

1. Data Collection

To gather the data, we used web scraping. The pandas read_html function extracted the structured schedule table directly from the MLS schedule page on FBref. Passing the table's id attribute (sched_all) restricted the extraction to the relevant dataset.

Wk | Day | Date | Time | Home | xG | Score | xG (Away) | Away | Attendance | Venue | Match Report
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
1 | Wed | 2024-02-21 | 20:00 | Inter Miami | 1.4 | 2–0 | 0.8 | Real Salt Lake | 21 137 | Chase Stadium | Match Report
1 | Sat | 2024-02-24 | 13:45 | LAFC | 1.5 | 2–1 | 1.8 | Seattle Sounders | 22 214 | BMO Stadium | Match Report
1 | Sat | 2024-02-24 | 14:00 | Columbus Crew | 1.8 | 1–0 | 0.5 | Atlanta Utd | 20 406 | Lower.com Field | Match Report
1 | Sun | 2024-02-25 | 14:30 | FC Cincinnati | 0.9 | 0–0 | 0.5 | Toronto FC | 25 513 | TQL Stadium | Match Report
1 | Sun | 2024-02-25 | 16:00 | Nashville SC | 0.1 | 0–0 | 1.4 | NY Red Bulls | 30 109 | Geodis Park | Match Report
1 | Sat | 2024-02-24 | 19:30 | Austin | 1.1 | 1–2 | 3.0 | Minnesota Utd | 20 738 | Q2 Stadium | Match Report
1 | Sat | 2024-02-24 | 19:30 | Orlando City | 0.9 | 0–0 | 1.0 | CF Montréal | 24 249 | Inter&Co Stadium | Match Report
1 | Sat | 2024-02-24 | 19:30 | Portland Timbers | 0.5 | 4–1 | 1.2 | Colorado Rapids | 21 411 | Providence Park | Match Report
1 | Sat | 2024-02-24 | 19:30 | Houston Dynamo | 1.0 | 1–1 | 0.2 | Sporting KC | 18 818 | Shell Energy Stadium | Match Report
1 | Sat | 2024-02-24 | 19:30 | Philadelphia Union | 2.1 | 2–2 | 1.6 | Chicago Fire | 18 734 | Subaru Park | Match Report
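The extraction step can be sketched as below. This is an illustration under assumptions rather than the project's exact code: the helper name read_schedule_table and the SAMPLE_HTML snippet are made up so the example runs offline, and pandas needs an HTML parser such as lxml installed for read_html to work.

```python
from io import StringIO

import pandas as pd

# Tiny stand-in for a downloaded page (hypothetical markup, two tables).
SAMPLE_HTML = """
<table id="other"><tr><th>x</th></tr><tr><td>1</td></tr></table>
<table id="sched_all">
  <thead><tr><th>Wk</th><th>Home</th><th>Score</th><th>Away</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>Inter Miami</td><td>2–0</td><td>Real Salt Lake</td></tr>
    <tr><td>1</td><td>LAFC</td><td>2–1</td><td>Seattle Sounders</td></tr>
  </tbody>
</table>
"""


def read_schedule_table(html_text: str, table_id: str = "sched_all") -> pd.DataFrame:
    """Return the table whose id attribute equals `table_id`."""
    # attrs limits the match to <table id="...">; read_html returns a list
    # of DataFrames, so the single match is the first element.
    tables = pd.read_html(StringIO(html_text), attrs={"id": table_id})
    return tables[0]


# Example call on the offline sample:
# schedule = read_schedule_table(SAMPLE_HTML)
```

In a live run, html_text would typically be the text of an HTTP response, for example from requests.get(url).text, with the URL supplied by the user.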

2. Data Cleaning

All values in the Date column were converted to datetime using pd.to_datetime() with errors='coerce'. Invalid or improperly formatted dates become NaT (Not a Time) instead of raising an exception.
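A small illustration of the coercion behavior, using made-up sample values; the explicit format string is an assumption based on the YYYY-MM-DD dates shown in the schedule table.

```python
import pandas as pd

# Two valid dates, one repeated header cell, and one missing value.
dates = pd.Series(["2024-02-21", "Date", None, "2024-02-24"], name="Date")

# errors="coerce" turns anything unparseable into NaT instead of raising.
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

print(parsed)
print("Unparseable rows:", int(parsed.isna().sum()))
```

The header cell and the missing value both end up as NaT, so a later notna() filter can drop them in one pass.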

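Index | Round | Wk | Day | Date | Time | Home | xG | Score | xG (Away) | Away | Attendance | Venue
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
15 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
30 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
44 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
59 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
73 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
88 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
103 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
118 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
133 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
148 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN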

To ensure the dataset contained only valid match rows, we applied two focused filters. MLS = MLS[MLS["Date"].notna()] removes blank separator rows that FBref inserts between rounds. MLS = MLS[MLS["Date"] != "Date"] drops repeated header rows. After these steps, the DataFrame contains only genuine match records.
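The two filters can be tried on a toy frame like the one below. The sample rows are illustrative, and the sketch assumes the Date column still holds raw strings when the filters run, since the comparison against "Date" only matches text values.

```python
import pandas as pd

# Toy schedule: two real matches, one blank separator row, one repeated header row.
MLS = pd.DataFrame({
    "Date": ["2024-02-21", None, "Date", "2024-02-24"],
    "Home": ["Inter Miami", None, "Home", "LAFC"],
})

MLS = MLS[MLS["Date"].notna()]      # drop blank separator rows
MLS = MLS[MLS["Date"] != "Date"]    # drop repeated header rows
MLS = MLS.reset_index(drop=True)    # tidy the index after filtering

print(MLS)
```

Resetting the index is optional; it is included here only so that later positional lookups are not affected by the gaps left by the removed rows.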

3. Exploratory Analysis: Messi Economic Impact

The table below lists a sample of MLS matches with core fields: week/day/date/time, home and away teams, each side's xG, the final score, attendance, and venue. We also derived Home Goals, Away Goals, Total Match Goals, and a Goal Bucket label.

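Wk | Day | Date | Time | Home | xG | Score | xG (Away) | Away | Attendance | Venue | Match Report | Home Goals | Away Goals | Total Goals | Goal Bucket | 0 Goals | 1 Goal | 2 Goals | 3 Goals | 4+ Goals
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
1 | Wed | 2024-02-21 | 20:00 | Inter Miami | 1.4 | 2–0 | 0.8 | Real Salt Lake | 21 137 | Chase Stadium | Match Report | 2 | 0 | 2 | 2 | 0 | 0 | 1 | 0 | 0
1 | Sat | 2024-02-24 | 13:45 | LAFC | 1.5 | 2–1 | 1.8 | Seattle Sounders | 22 214 | BMO Stadium | Match Report | 2 | 1 | 3 | 3 | 0 | 0 | 0 | 1 | 0
1 | Sat | 2024-02-24 | 14:00 | Columbus Crew | 1.8 | 1–0 | 0.5 | Atlanta Utd | 20 406 | Lower.com Field | Match Report | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0
1 | Sun | 2024-02-25 | 14:30 | FC Cincinnati | 0.9 | 0–0 | 0.5 | Toronto FC | 25 513 | TQL Stadium | Match Report | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0
1 | Sun | 2024-02-25 | 16:00 | Nashville SC | 0.1 | 0–0 | 1.4 | NY Red Bulls | 30 109 | Geodis Park | Match Report | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0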
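One way the derived columns in the table above could be produced is sketched here. The scores are taken from the sample rows shown earlier; the helper goal_bucket and the indicator-column loop are assumptions about the approach, not the project's confirmed code.

```python
import pandas as pd

matches = pd.DataFrame({
    "Home": ["Inter Miami", "LAFC", "Columbus Crew", "FC Cincinnati", "Portland Timbers"],
    "Score": ["2–0", "2–1", "1–0", "0–0", "4–1"],
})

# The Score column uses an en dash (U+2013), not a plain hyphen.
goals = matches["Score"].str.split("–", expand=True)
matches["Home Goals"] = pd.to_numeric(goals[0], errors="coerce")
matches["Away Goals"] = pd.to_numeric(goals[1], errors="coerce")
matches["Total Goals"] = matches["Home Goals"] + matches["Away Goals"]


def goal_bucket(total: int) -> str:
    """Label a match by total goals, capping the label at '4+'."""
    return "4+" if total >= 4 else str(int(total))


matches["Goal Bucket"] = matches["Total Goals"].map(goal_bucket)

# One 0/1 indicator column per bucket, so the buckets can be summed later.
bucket_columns = {"0": "0 Goals", "1": "1 Goal", "2": "2 Goals", "3": "3 Goals", "4+": "4+ Goals"}
for bucket, column in bucket_columns.items():
    matches[column] = (matches["Goal Bucket"] == bucket).astype(int)

print(matches)
```

Keeping the indicator columns as integers means a later groupby sum yields match counts per bucket directly.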

To compute the Goals by Outcome KPI, we classify each match by total goals, remove records that failed to parse, and aggregate by outcome to see how scoring intensity is distributed across results.

  1. Bucket: Turn the raw Score into a Goal Bucket (0, 1, 2, 3, or 4+) for every match.
  2. Clean: Filter out rows where parsing failed (wrong dash, missing, or non-numeric values).
  3. Group: Aggregate by Outcome (Home / Draw / Away), summing each goal bucket and counting Total Matches.
  4. Read: Quickly see scoring distribution by result and convert to percentages to support pricing, marketing, and scheduling decisions.
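KPI: Goals by Outcome

Outcome | 0 Goals | 1 Goal | 2 Goals | 3 Goals | 4+ Goals | Total Matches
--- | --- | --- | --- | --- | --- | ---
Away | 0 | 29 | 16 | 54 | 59 | 158
Draw | 30 | 0 | 46 | 0 | 44 | 120
Home | 0 | 33 | 40 | 68 | 94 | 235
Total | 30 | 62 | 102 | 122 | 197 | 513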
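The clean, group, and percentage steps can be sketched on a handful of scores. The sample includes two deliberately bad records; the regex parse and pd.crosstab are stand-ins chosen for brevity and may differ from how the project summed its bucket columns.

```python
import numpy as np
import pandas as pd

# Seven valid scores plus two bad records (plain hyphen, missing value).
matches = pd.DataFrame({
    "Score": ["2–0", "2–1", "1–0", "0–0", "1–2", "4–1", "1–1", "2-2", None],
})

# Clean: only "digits, en dash, digits" parses; everything else becomes NaN.
goals = matches["Score"].str.extract(r"^(\d+)–(\d+)$")
matches["Home Goals"] = pd.to_numeric(goals[0], errors="coerce")
matches["Away Goals"] = pd.to_numeric(goals[1], errors="coerce")
clean = matches.dropna(subset=["Home Goals", "Away Goals"]).copy()

# Outcome from the goal difference.
diff = clean["Home Goals"] - clean["Away Goals"]
clean["Outcome"] = np.select([diff > 0, diff < 0], ["Home", "Away"], default="Draw")

# Bucket: cap total goals at 4 and map to the KPI column labels.
bucket_order = ["0 Goals", "1 Goal", "2 Goals", "3 Goals", "4+ Goals"]
total = (clean["Home Goals"] + clean["Away Goals"]).astype(int)
clean["Goal Bucket"] = total.clip(upper=4).map(dict(enumerate(bucket_order)))

# Group: count matches per outcome and bucket, then add the row total.
kpi = pd.crosstab(clean["Outcome"], clean["Goal Bucket"])
kpi = kpi.reindex(columns=bucket_order, fill_value=0)
kpi["Total Matches"] = kpi.sum(axis=1)

# Read: convert each row to percentages of that outcome's matches.
share = kpi[bucket_order].div(kpi["Total Matches"], axis=0).mul(100).round(1)

print(kpi)
print(share)
```

Applied to the full table above, the same row-wise division gives, for example, 94 / 235 = 40.0% of home wins in the 4+ goal bucket.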

We filtered all matches where Inter Miami played (home or away). For each game, we calculated the team's goals scored, then grouped by Outcome and Goal Bucket. The bar chart below shows whether wins cluster in 2–3 or 4+ goal games and how the pattern shifts home vs. away, quantifying a possible "Messi effect".

Figure 2. Inter Miami goals by outcome and goal bucket, 2024 MLS season.
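The filtering and grouping behind the chart might look like the sketch below. The opponents and scores are invented placeholders, not real 2024 results, and the column name Team Goals is hypothetical.

```python
import numpy as np
import pandas as pd

TEAM = "Inter Miami"

# Placeholder fixtures with made-up scores, for illustration only.
matches = pd.DataFrame({
    "Home": [TEAM, "Opponent A", TEAM, "Opponent B", "Opponent C"],
    "Away": ["Opponent A", TEAM, "Opponent B", TEAM, "Opponent D"],
    "Home Goals": [2, 1, 5, 1, 2],
    "Away Goals": [0, 3, 1, 1, 0],
})

# Keep every match where the team played, home or away.
team = matches[(matches["Home"] == TEAM) | (matches["Away"] == TEAM)].copy()

# Goals scored by the team itself, whichever side it was on.
is_home = team["Home"] == TEAM
team["Team Goals"] = team["Home Goals"].where(is_home, team["Away Goals"])

# Match outcome labelled the same way as the league-wide KPI.
diff = team["Home Goals"] - team["Away Goals"]
team["Outcome"] = np.select([diff > 0, diff < 0], ["Home", "Away"], default="Draw")

# Bucket the team's own goals, capping the label at "4+".
team["Goal Bucket"] = team["Team Goals"].map(lambda g: "4+" if g >= 4 else str(int(g)))

counts = (
    team.groupby(["Outcome", "Goal Bucket"])
    .size()
    .reset_index(name="Matches")
)
print(counts)
```

A frame shaped like counts could then be passed to a grouped bar chart, for instance plotly.express.bar with x="Goal Bucket", y="Matches", and color="Outcome".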

Key Insights

01 Inter Miami leads in high-scoring home wins (4+ goals), consistent with Messi's attacking impact on team dynamics.
02 Home victories cluster in the 3 and 4+ goal buckets (162 of 235 home wins, about 69%); draws skew lower, with 76 of 120 ending on 0 or 2 total goals. This is consistent with home advantage encouraging attacking play.
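03 Away wins are spread somewhat differently across goal buckets, with a larger share of 1–0 results than home wins (29 of 158, about 18%, vs. 33 of 235, about 14%), which may hint at defense-first road strategies across the league.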