WebScraping MLS‑2024

Data exploration and analysis of the 2024 Major League Soccer (MLS) season. The season offered a good opportunity to explore team performance, player statistics, and match outcomes. This project collects, cleans, and analyzes match data to uncover actionable insights about the league.

Methodology

To carry out this analysis, the following stages were implemented:

  1. Data Collection & Preparation: Gathered official MLS data using Python tools such as requests (HTTP requests), BeautifulSoup (HTML parsing), and glob (batch file handling), loading everything into pandas DataFrames for further processing.
  2. Data Cleaning: Addressed missing values, standardized data formats, and validated datasets to ensure accuracy.
  3. Exploratory Analysis: Used statistical summaries and visualizations to identify key patterns.
  4. Visualization: Created interactive dashboards to communicate findings effectively.

We import the necessary libraries:


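# Standard library
import glob
import warnings

# Data handling
import numpy as np
import pandas as pd

# Visualization and table output
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
from tabulate import tabulate

# Use matplotlib's default style and silence library warnings in the output
plt.style.use('default')
warnings.filterwarnings('ignore')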

1. Data Collection

To gather the data for this project, we used web scraping. The pandas library provides the read_html function, which we used to extract the table directly from the source page (FBref). The target URL contains the scores and fixtures for the 2024 Major League Soccer season. By specifying the id attribute of the desired table (sched_all), we extract only the schedule table and ignore every other table on the page. The result is a single structured DataFrame that still needs the cleaning described in the next section.
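
As a minimal sketch of selecting a table by its id, the example below runs read_html on a small inline HTML string instead of the live page. SAMPLE_HTML and read_schedule are hypothetical names introduced for illustration, and the sample keeps only a few of the real columns. read_html also needs an HTML parser such as lxml to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the downloaded page: one unrelated table plus
# the schedule table carrying id="sched_all".
SAMPLE_HTML = """
<html><body>
<table id="other">
  <tr><th>Team</th></tr>
  <tr><td>LAFC</td></tr>
</table>
<table id="sched_all">
  <thead>
    <tr><th>Wk</th><th>Date</th><th>Home</th><th>Score</th><th>Away</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>2024-02-21</td><td>Inter Miami</td><td>2–0</td><td>Real Salt Lake</td></tr>
    <tr><td>1</td><td>2024-02-24</td><td>LAFC</td><td>2–1</td><td>Seattle Sounders</td></tr>
  </tbody>
</table>
</body></html>
"""


def read_schedule(html: str) -> pd.DataFrame:
    """Return the table whose id attribute is 'sched_all'."""
    # attrs restricts the match to tables with these HTML attributes;
    # read_html always returns a list of DataFrames.
    tables = pd.read_html(StringIO(html), attrs={"id": "sched_all"})
    return tables[0]


schedule = read_schedule(SAMPLE_HTML)
print(schedule)
```

With the real page, the same call would take the page's HTML (or its URL) in place of the inline string.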

| Wk | Day | Date | Time | Home | xG | Score | xG (Away) | Away | Attendance | Venue | Match Report |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Wed | 2024-02-21 | 20:00 | Inter Miami | 1.4 | 2–0 | 0.8 | Real Salt Lake | 21 137 | Chase Stadium | Match Report |
| 1 | Sat | 2024-02-24 | 13:45 | LAFC | 1.5 | 2–1 | 1.8 | Seattle Sounders | 22 214 | BMO Stadium | Match Report |
| 1 | Sat | 2024-02-24 | 14:00 | Columbus Crew | 1.8 | 1–0 | 0.5 | Atlanta Utd | 20 406 | Lower.com Field | Match Report |
| 1 | Sun | 2024-02-25 | 14:30 | FC Cincinnati | 0.9 | 0–0 | 0.5 | Toronto FC | 25 513 | TQL Stadium | Match Report |
| 1 | Sun | 2024-02-25 | 16:00 | Nashville SC | 0.1 | 0–0 | 1.4 | NY Red Bulls | 30 109 | Geodis Park | Match Report |
| 1 | Sat | 2024-02-24 | 19:30 | Austin | 1.1 | 1–2 | 3.0 | Minnesota Utd | 20 738 | Q2 Stadium | Match Report |
| 1 | Sat | 2024-02-24 | 19:30 | Orlando City | 0.9 | 0–0 | 1.0 | CF Montréal | 24 249 | Inter&Co Stadium | Match Report |
| 1 | Sat | 2024-02-24 | 19:30 | Portland Timbers | 0.5 | 4–1 | 1.2 | Colorado Rapids | 21 411 | Providence Park | Match Report |
| 1 | Sat | 2024-02-24 | 19:30 | Houston Dynamo | 1.0 | 1–1 | 0.2 | Sporting KC | 18 818 | Shell Energy Stadium | Match Report |
| 1 | Sat | 2024-02-24 | 19:30 | Philadelphia Union | 2.1 | 2–2 | 1.6 | Chicago Fire | 18 734 | Subaru Park | Match Report |

2. Data Cleaning

This step ensures that all values in the Date column are converted into a standard datetime format, allowing for consistent handling of dates across the dataset. By using pd.to_datetime() with errors='coerce', invalid or improperly formatted dates are replaced with NaT (Not a Time), preventing errors during the conversion process. This transformation is crucial for enabling chronological analysis, filtering by date, and performing time‑based calculations efficiently.
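
A minimal sketch of this conversion on a toy column follows. The sample values are made up for illustration: two valid dates, a stray header cell, and a blank.

```python
import pandas as pd

# Toy Date column: two valid dates, a repeated header cell, and a blank.
raw = pd.DataFrame({"Date": ["2024-02-21", "2024-02-24", "Date", None]})

# errors="coerce" turns anything that cannot be parsed into NaT
# instead of raising an exception.
raw["Date"] = pd.to_datetime(raw["Date"], errors="coerce")

# Count how many entries could not be parsed.
n_invalid = int(raw["Date"].isna().sum())
print(raw["Date"].dtype, n_invalid)
```

Counting the NaT values right after the conversion is a cheap way to see how many rows will need to be dropped or inspected.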

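Rows in which every column (Round, Wk, Day, Date, Time, Home, xG, Score, xG (Away), Away, Attendance, Venue) is NaN:

| Index | Values |
|---|---|
| 15 | all NaN |
| 30 | all NaN |
| 44 | all NaN |
| 59 | all NaN |
| 73 | all NaN |
| 88 | all NaN |
| 103 | all NaN |
| 118 | all NaN |
| 133 | all NaN |
| 148 | all NaN |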

To ensure the dataset contained only valid match rows, we applied two focused filters to the Date column. First, MLS = MLS[MLS["Date"].notna()] removes every row where the parsed value is NaT—these are blank separator lines that FBref inserts between rounds. Second, MLS = MLS[MLS["Date"] != "Date"] drops the occasional row where the literal text "Date" reappears as a repeated header inside the table. After these two steps, the DataFrame contains only genuine match records with proper datetime values, eliminating noise and preventing downstream errors in time‑based analyses.
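
The two filters can be sketched on a toy frame as shown below. The sample rows are illustrative, and the ordering is an assumption of this sketch: the header check runs while the column still holds raw text, then the dates are parsed and the NaT rows are dropped.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the scraped table: real matches, a blank separator
# row, and a header row repeated inside the table body.
MLS = pd.DataFrame({
    "Date": ["2024-02-21", np.nan, "Date", "2024-02-24"],
    "Home": ["Inter Miami", np.nan, "Home", "LAFC"],
})

# Drop repeated header rows while the column still holds raw text.
MLS = MLS[MLS["Date"] != "Date"]

# Parse the dates, then drop rows whose date could not be parsed (NaT).
MLS = MLS.assign(Date=pd.to_datetime(MLS["Date"], errors="coerce"))
MLS = MLS[MLS["Date"].notna()].reset_index(drop=True)

print(MLS)
```

Resetting the index afterwards keeps row labels contiguous once the separator rows are gone.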

3. Exploratory Analysis: Messi's economic impact on MLS

The table below lists a sample of MLS matches with the core fields we scraped and cleaned: week/day/date/time, home and away teams, each side’s xG, the final score, attendance, and venue. We also derived Home Goals, Away Goals, Total Match Goals, and a Goal Bucket label. This gives quick context for each game and shows the raw inputs we’ll use to build the KPI in the next section.

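| Wk | Day | Date | Time | Home | xG | Score | xG (Away) | Away | Attendance | Venue | Match Report | Home Goals | Away Goals | Total Match Goals | Goal Bucket | 0 Goals | 1 Goal | 2 Goals | 3 Goals | 4+ Goals |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Wed | 2024-02-21 | 20:00 | Inter Miami | 1.4 | 2–0 | 0.8 | Real Salt Lake | 21 137 | Chase Stadium | Match Report | 2 | 0 | 2 | 2 | 0 | 0 | 1 | 0 | 0 |
| 1 | Sat | 2024-02-24 | 13:45 | LAFC | 1.5 | 2–1 | 1.8 | Seattle Sounders | 22 214 | BMO Stadium | Match Report | 2 | 1 | 3 | 3 | 0 | 0 | 0 | 1 | 0 |
| 1 | Sat | 2024-02-24 | 14:00 | Columbus Crew | 1.8 | 1–0 | 0.5 | Atlanta Utd | 20 406 | Lower.com Field | Match Report | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
| 1 | Sun | 2024-02-25 | 14:30 | FC Cincinnati | 0.9 | 0–0 | 0.5 | Toronto FC | 25 513 | TQL Stadium | Match Report | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | Sun | 2024-02-25 | 16:00 | Nashville SC | 0.1 | 0–0 | 1.4 | NY Red Bulls | 30 109 | Geodis Park | Match Report | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |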
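
The derived columns in the table above could be computed along these lines. This is a sketch under assumptions: the Score strings use an en dash as in the scraped table, the bucket labels are stored as strings, and the goal_bucket helper is a hypothetical name.

```python
import pandas as pd

# A few scores taken from the sample rows shown earlier.
matches = pd.DataFrame({
    "Home": ["Inter Miami", "LAFC", "Columbus Crew", "FC Cincinnati", "Portland Timbers"],
    "Score": ["2–0", "2–1", "1–0", "0–0", "4–1"],
    "Away": ["Real Salt Lake", "Seattle Sounders", "Atlanta Utd", "Toronto FC", "Colorado Rapids"],
})

# Pull the two numbers out of "home–away"; rows that do not match the
# pattern (wrong dash, missing value) become NaN.
goals = matches["Score"].str.extract(r"^\s*(\d+)\s*–\s*(\d+)\s*$")
matches["Home Goals"] = pd.to_numeric(goals[0], errors="coerce")
matches["Away Goals"] = pd.to_numeric(goals[1], errors="coerce")
matches["Total Match Goals"] = matches["Home Goals"] + matches["Away Goals"]


def goal_bucket(total):
    """Map a goal total to '0', '1', '2', '3' or '4+' (None if missing)."""
    if pd.isna(total):
        return None
    return "4+" if total >= 4 else str(int(total))


matches["Goal Bucket"] = matches["Total Match Goals"].map(goal_bucket)

# One indicator column per bucket, named as in the table above.
labels = {"0": "0 Goals", "1": "1 Goal", "2": "2 Goals", "3": "3 Goals", "4+": "4+ Goals"}
for bucket, column in labels.items():
    matches[column] = (matches["Goal Bucket"] == bucket).astype(int)

print(matches[["Score", "Total Match Goals", "Goal Bucket"]])
```

Keeping the indicator columns as 0/1 integers means a later groupby can simply sum them.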

To compute this KPI, we classify each match by total goals, clean bad records, and aggregate by outcome to see how scoring intensity distributes across results.

  1. Bucket: Turn the raw Score into a Goal Bucket (0, 1, 2, 3, or 4+) for every match.
  2. Clean: Filter out rows where parsing failed (wrong dash, missing, or non-numeric values).
  3. Group: Aggregate by Outcome (Home / Draw / Away), summing each goal bucket and counting Total Matches.
  4. Read: Quickly see scoring distribution by result and convert to percentages to support pricing, marketing, and scheduling decisions.
KPI — Goals by Outcome
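| Outcome | 0 Goals | 1 Goal | 2 Goals | 3 Goals | 4+ Goals | Total Matches |
|---|---|---|---|---|---|---|
| Away | 0 | 29 | 16 | 54 | 59 | 158 |
| Draw | 30 | 0 | 46 | 0 | 44 | 120 |
| Home | 0 | 33 | 40 | 68 | 94 | 235 |
| Total | 30 | 62 | 102 | 122 | 197 | 513 |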
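
A table of this shape can be built with a cross-tabulation. The sketch below uses six toy matches rather than the full season, so its counts differ from the table above, and the outcome helper is a hypothetical name.

```python
import pandas as pd

# Toy input with goals already parsed from the Score column.
df = pd.DataFrame({
    "Home Goals": [2, 2, 1, 0, 1, 4],
    "Away Goals": [0, 1, 0, 0, 2, 1],
})


def outcome(row):
    """Label a match by which side won: Home, Away, or Draw."""
    if row["Home Goals"] > row["Away Goals"]:
        return "Home"
    if row["Home Goals"] < row["Away Goals"]:
        return "Away"
    return "Draw"


df["Outcome"] = df.apply(outcome, axis=1)
total = df["Home Goals"] + df["Away Goals"]
df["Goal Bucket"] = total.map(lambda t: "4+" if t >= 4 else str(t))

# Count matches per (Outcome, Goal Bucket); reindex so that buckets or
# outcomes with no matches still appear as zeros.
buckets = ["0", "1", "2", "3", "4+"]
kpi = pd.crosstab(df["Outcome"], df["Goal Bucket"]).reindex(
    index=["Away", "Draw", "Home"], columns=buckets, fill_value=0
)
kpi["Total Matches"] = kpi.sum(axis=1)
kpi.loc["Total"] = kpi.sum()

# Row percentages: share of each outcome's matches in each bucket.
share = kpi[buckets].div(kpi["Total Matches"], axis=0).mul(100).round(1)

print(kpi)
print(share)
```

The percentage view makes outcomes with different match counts directly comparable.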

We filtered all matches where Inter Miami played (home or away). For each game, we calculated the team’s goals scored, then grouped by Outcome (Home / Draw / Away) and by Goal Bucket (0, 1, 2, 3, 4+). The bar chart below counts how many club matches fall into each combination, so you can quickly see whether wins cluster in 2–3 or 4+ goal games, whether draws tend to be low-scoring, and how the pattern shifts home vs. away. This view helps quantify a possible “Messi effect”: more high-scoring matches and more favorable results.
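
The counts behind such a chart could be produced as sketched here. Only the first row is a real result from the sample table; the other rows and the "Opponent" names are invented for illustration. The sketch also assumes the bucket is applied to the club's own goals rather than to the match total.

```python
import pandas as pd

TEAM = "Inter Miami"

# First row is from the sample table; the rest are made-up placeholders.
matches = pd.DataFrame({
    "Home": ["Inter Miami", "Opponent B", "Opponent C", "Inter Miami", "Opponent E"],
    "Away": ["Real Salt Lake", "Inter Miami", "Inter Miami", "Opponent D", "Opponent F"],
    "Home Goals": [2, 0, 1, 5, 2],
    "Away Goals": [0, 3, 1, 0, 1],
})

# Keep only matches where the club played, home or away.
club = matches[(matches["Home"] == TEAM) | (matches["Away"] == TEAM)].copy()

# The club's own goals: home goals when it hosted, away goals otherwise.
is_home = club["Home"] == TEAM
club["Team Goals"] = club["Home Goals"].where(is_home, club["Away Goals"])


def outcome(row):
    """Label a match by which side won: Home, Away, or Draw."""
    if row["Home Goals"] > row["Away Goals"]:
        return "Home"
    if row["Home Goals"] < row["Away Goals"]:
        return "Away"
    return "Draw"


club["Outcome"] = club.apply(outcome, axis=1)
club["Goal Bucket"] = club["Team Goals"].map(lambda g: "4+" if g >= 4 else str(g))

# Number of club matches in each (Outcome, Goal Bucket) combination.
counts = club.groupby(["Outcome", "Goal Bucket"]).size().unstack(fill_value=0)
print(counts)
```

Calling counts.plot(kind="bar") on the result would draw one group of bars per outcome, with one bar per goal bucket.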
