This project aims to predict the outcomes of Premier League football matches using machine learning models. It explores various features to determine their importance in predicting match results—whether it’s a home win, draw, or away win.
- Midterm Project: Premier League Football Prediction
Predicting the outcomes of football matches has always been a challenging yet fascinating task for sports analysts and enthusiasts. This project focuses on the Premier League, aiming to build robust machine learning models to forecast match results—whether it’s a home win, draw, or away win. By leveraging historical match data and team statistics, the project seeks to identify key features that influence match outcomes. Despite the inherent unpredictability and dynamic nature of sports betting markets, this project aspires to provide valuable insights and potentially profitable predictions.
The raw data for this project is sourced from Football Data. The focus is exclusively on the Premier League, covering seasons from 2005/2006 to 2024/2025. The raw data files can be found here.
Over the last twenty seasons, home teams have tended to win the largest share of games. This is common in football and is usually attributed to home advantage.
During data gathering, the data_gathering function in the 01_data_gathering.py script performs the following key steps:
- Ensure the necessary directories exist.
- Download the CSV files for the specified seasons.
- Check the columns in the files.
- Concatenate the data.
- Save the processed data.
For more details, see the 01_data_gathering.py script.
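The gathering steps can be sketched with the standard library alone. This is a minimal illustration, not the code in 01_data_gathering.py: the URL pattern, the E0 league code, and the helper names season_codes, season_urls and concatenate_seasons are assumptions made for the example and should be checked against the data source and the script.

```python
import csv
import io
from typing import Dict, List

# Assumption: one CSV per season at a URL of this shape, with "E0" as the
# Premier League code. Verify against the data source before relying on it.
BASE_URL = "https://www.football-data.co.uk/mmz4281/{season}/E0.csv"


def season_codes(first_start_year: int, last_start_year: int) -> List[str]:
    """Return codes such as '0506' for the 2005/2006 season."""
    return [
        f"{year % 100:02d}{(year + 1) % 100:02d}"
        for year in range(first_start_year, last_start_year + 1)
    ]


def season_urls(first_start_year: int, last_start_year: int) -> List[str]:
    """Build one download URL per season."""
    return [
        BASE_URL.format(season=code)
        for code in season_codes(first_start_year, last_start_year)
    ]


def concatenate_seasons(csv_texts: List[str]) -> List[Dict[str, str]]:
    """Keep only the columns present in every file, then stack the rows."""
    readers = [csv.DictReader(io.StringIO(text)) for text in csv_texts]
    # Shared columns, in the order they appear in the first file.
    shared = [
        column for column in readers[0].fieldnames
        if all(column in reader.fieldnames for reader in readers)
    ]
    rows = []
    for reader in readers:
        for row in reader:
            rows.append({column: row[column] for column in shared})
    return rows


urls = season_urls(2005, 2024)
print(len(urls), "season files, first:", urls[0])
```

Restricting the result to the shared columns is one way to handle seasons whose files carry extra fields; the actual script may check columns differently.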
The 02_data_preparation.py script performs the following key steps:
- Data Cleaning: Fix column names, handle missing values, and ensure data integrity.
- Feature Engineering: Create new features such as goal difference, total shots, shot accuracy, and time-based features.
- Rolling Averages: Calculate rolling averages for various statistics over 3 and 5 game windows.
- Cumulative Points: Compute cumulative points for home and away teams.
- Normalize Betting Odds: Convert betting odds to implied probabilities.
- Save Processed Data: Save the processed data for the current season (2024/2025) and the final prepared dataset to CSV files.
For more details, see the 02_data_preparation.py script.
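Two of the preparation steps, normalizing betting odds and computing rolling averages, can be illustrated as follows. This is a sketch under assumptions rather than the code in 02_data_preparation.py: it assumes decimal odds, proportional normalization, and rolling windows that exclude the current match. The function names are hypothetical.

```python
from collections import deque
from typing import Iterable, List, Optional, Tuple


def implied_probabilities(home: float, draw: float, away: float) -> Tuple[float, ...]:
    """Convert decimal odds to probabilities that sum to 1.

    Inverse odds typically sum to more than 1 because of the bookmaker
    margin, so each inverse is divided by their total.
    """
    inverse = [1.0 / home, 1.0 / draw, 1.0 / away]
    total = sum(inverse)
    return tuple(p / total for p in inverse)


def rolling_mean_before(values: Iterable[float], window: int) -> List[Optional[float]]:
    """Mean of the previous `window` values, excluding the current one.

    Returns None until a full window of earlier matches is available.
    """
    recent = deque(maxlen=window)
    result = []
    for value in values:
        if len(recent) == window:
            result.append(sum(recent) / window)
        else:
            result.append(None)
        recent.append(value)
    return result


# Hypothetical odds and goal counts, for illustration only.
print(implied_probabilities(1.9, 3.6, 4.2))
print(rolling_mean_before([2, 0, 1, 3, 1], window=3))
```

Excluding the current match from its own window keeps the feature limited to information available before kick-off, which matters when the features feed a predictive model.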
The 03_data_eda.py script is dedicated to Exploratory Data Analysis (EDA). It includes the following key steps:
- Data Checking: Check data types, missing values, unique values, duplicates, and outliers.
- Correlation Analysis: Identify highly correlated features using a correlation matrix.
- Variance Inflation Factor (VIF): Calculate VIF to check for multicollinearity and remove features with high VIF values.
- Cluster Maps: Plot clustered heatmaps to visualize feature correlations.
- Target Distribution: Visualize the distribution of the target variable.
- Saving Data: Save the cleaned and processed data for modeling and backtesting.
For more details, see the 03_data_eda.py script.
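The correlation analysis step can be sketched without third-party libraries. The example below is illustrative only: the feature names, the sample values and the 0.9 threshold are assumptions, and the script itself may rely on a data-frame correlation matrix instead. It also assumes no column is constant, since a zero variance would make the coefficient undefined.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equally long columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def highly_correlated(
    features: Dict[str, List[float]], threshold: float = 0.9
) -> List[Tuple[str, str, float]]:
    """List feature pairs whose absolute correlation reaches the threshold."""
    pairs = []
    for (name_a, col_a), (name_b, col_b) in combinations(features.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs


# Hypothetical per-match statistics.
sample = {
    "shots": [10, 12, 8, 15, 9],
    "shots_on_target": [4, 5, 3, 7, 4],
    "corners": [7, 3, 5, 4, 6],
}
print(highly_correlated(sample))
```

VIF extends the same idea from pairs to groups of features: for feature j it equals 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features.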
The 04_train_model.py script covers the following key steps:
- Data Preprocessing: Prepare the data for modeling.
- Feature Selection: Use Recursive Feature Elimination with Cross-Validation (RFECV) to select important features.
See the scikit-learn RFECV documentation for details.
- Model Evaluation: Evaluate models using RandomForest and XGBoost classifiers.
For example, the initial model was overfitting the training data.
- Hyperparameter Tuning: Tune hyperparameters to reduce overfitting.
After hyperparameter tuning, the overfitting was reduced.
- Model Finalization: Finalize the best model using a pipeline and save it for future predictions.
For more details, see the 04_train_model.py script.
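The feature selection step can be illustrated with scikit-learn's RFECV. This is a sketch under assumptions, not the configuration used in 04_train_model.py: it requires scikit-learn to be installed, uses synthetic three-class data in place of the project's features, and the estimator settings, fold count and scoring metric are chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic three-class data standing in for home win / draw / away win.
X, y = make_classification(
    n_samples=300,
    n_features=8,
    n_informative=3,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)

# A shallow forest limits model capacity, one common way to curb overfitting.
estimator = RandomForestClassifier(n_estimators=30, max_depth=4, random_state=0)

# Recursively drop the weakest feature and score each subset by cross-validation.
selector = RFECV(
    estimator=estimator,
    step=1,
    cv=StratifiedKFold(n_splits=3),
    scoring="accuracy",
    min_features_to_select=1,
)
selector.fit(X, y)

print("features kept:", selector.n_features_)
print("selection mask:", selector.support_)
```

For match data ordered by date, a time-aware splitter may be a better fit than the stratified folds used here, since it avoids validating on matches that precede the training data.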
The 05_back_testing_market.py script includes the following key steps:
- Loading the Model and Data: Load the trained model and test datasets.
- Making Predictions: Generate predictions and prediction probabilities using the model.
- Preparing Data for Analysis: Combine predictions with actual results and market probabilities.
- Calculating Brier Scores: Compute Brier scores for both the model's predictions and the market probabilities.
- Comparing Performance: Compare the average Brier scores of the model and the market.
For more details, see the 05_back_testing_market.py script.
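The Brier score comparison can be sketched as follows. The example assumes the multi-class form of the score (sum of squared differences between the predicted probabilities and the one-hot actual result, averaged over matches); whether 05_back_testing_market.py uses this exact definition is an assumption. The probabilities below are made up for illustration.

```python
from typing import List, Sequence


def brier_score(probabilities: Sequence[float], outcome_index: int) -> float:
    """Multi-class Brier score for one match.

    `outcome_index` is the position of the actual result in `probabilities`.
    """
    return sum(
        (p - (1.0 if k == outcome_index else 0.0)) ** 2
        for k, p in enumerate(probabilities)
    )


def mean_brier(predictions: List[Sequence[float]], outcomes: List[int]) -> float:
    """Average Brier score over matches; lower values are better."""
    scores = [brier_score(p, o) for p, o in zip(predictions, outcomes)]
    return sum(scores) / len(scores)


# Hypothetical probabilities in (home, draw, away) order for two matches.
model_probs = [(0.5, 0.3, 0.2), (0.2, 0.3, 0.5)]
market_probs = [(0.6, 0.25, 0.15), (0.3, 0.3, 0.4)]
outcomes = [0, 2]  # actual results: home win, then away win

print("model :", mean_brier(model_probs, outcomes))
print("market:", mean_brier(market_probs, outcomes))
```

A model whose average score is below the market's is better calibrated on that sample, although a small sample can easily favor either side by chance.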
- Python 3.8 or higher
- Docker
- Pipenv
Use git clone to copy the repository to your local machine and navigate into the project directory.
git clone <repository-url>
cd repository
Replace <repository-url> with the actual URL of the repository (for example, from GitHub or GitLab):
git clone https://github.com/username/repository.git
cd repository
First, open a terminal and change to the directory where your Pipfile and Pipfile.lock are located.
cd /path/to/your/project
In the project directory, use pipenv install to create the virtual environment and install all dependencies specified in the Pipfile.lock.
pipenv install
This command will:
- Create a virtual environment if one doesn’t already exist.
- Install the dependencies exactly as specified in the Pipfile.lock.
To activate the virtual environment, use:
pipenv shell
Now you're in an isolated environment where the dependencies specified in the Pipfile.lock are installed.
Build the Docker image:
docker build -t <docker_image_name> .
Run the Docker container:
docker run -it --rm -p 9696:9696 <docker_image_name>
Note: If you get an error at step [ 5/11] RUN 'pipenv install --system --deploy', try turning off your VPN.
To deploy the application to AWS Elastic Beanstalk, follow these steps:
- Install the AWS Elastic Beanstalk CLI: Ensure you have the AWS CLI and Elastic Beanstalk CLI installed. You can install the Elastic Beanstalk CLI using pip:
pip install awsebcli
- Initialize Elastic Beanstalk: Navigate to your project directory and initialize Elastic Beanstalk:
eb init -p docker -r <region> <project_name>
Follow the prompts to set up your application. Choose the appropriate region and project name.
- Create an Environment and Deploy: Create a new environment and deploy your application:
eb create <project_name> --enable-spot
Replace <project_name> with your desired environment name.
The following errors may occur while the environment is being created:
- Load Balancer Configuration
  - Error Message: "At least two subnets in two different Availability Zones must be specified..."
  - Elastic Beanstalk is configured to use a load balancer, which requires at least two subnets in two different Availability Zones (AZs) to distribute traffic properly.
- Subnet Issue
  - Error Message: "No default subnet for availability zone: 'eu-west-2a'"
  - The environment is attempting to create an Auto Scaling group, but it cannot find a valid subnet in the specified AZ (eu-west-2a).
  - This typically happens when:
    - Your VPC does not have subnets in the specified AZ.
    - There is no default subnet configured for the region.
Solution Steps
- Verify Default Subnets: To confirm the existing default subnets in your VPC, run the following AWS CLI command:
aws ec2 describe-subnets --filters Name=default-for-az,Values=true
This will display all the default subnets in your VPC.
- Configure Elastic Beanstalk with the Correct Subnets: Create an options.json file with the subnet settings (add your subnet IDs):
[
  {
    "Namespace": "aws:ec2:vpc",
    "OptionName": "Subnets",
    "Value": "subnet-XXXXXXX,subnet-XXXXXXX"
  }
]
Then apply the settings to the environment:
aws elasticbeanstalk update-environment \
  --environment-name <project_name> \
  --option-settings file://options.json
- Terminate the Environment: When you are done, you can terminate the environment to stop incurring charges:
eb terminate <environment-name>
Open a new terminal and run the test script:
python tests/test_predict.py # to test locally
python tests/test_predict_aws.py # to test aws
To use the prediction service locally, send a POST request to the /predict endpoint with the following JSON payload, or configure the test script accordingly:
curl -X POST http://127.0.0.1:9696/predict \
-H "Content-Type: application/json" \
-d '{
"home_team": "arsenal",
"away_team": "liverpool",
"date": "2024-12-16"
}'
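The same request can be sent from Python using only the standard library. This is a minimal sketch, not the contents of tests/test_predict.py: it assumes the service is listening on the local address used in the curl example and that it responds with JSON. The helper names build_request and predict are hypothetical.

```python
import json
from urllib import request

# Assumption: the locally running service from the Docker step.
URL = "http://127.0.0.1:9696/predict"


def build_request(home_team: str, away_team: str, date: str) -> request.Request:
    """Build the POST request with the same payload as the curl example."""
    payload = {"home_team": home_team, "away_team": away_team, "date": date}
    return request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(home_team: str, away_team: str, date: str) -> dict:
    """Send the request and decode the response, assumed to be JSON."""
    req = build_request(home_team, away_team, date)
    with request.urlopen(req, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# With the service running:
# print(predict("arsenal", "liverpool", "2024-12-16"))
```

To target the Elastic Beanstalk deployment instead, change URL to the address of that environment.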
To run the Streamlit app locally, follow these steps:
- Ensure you have all dependencies installed and the virtual environment activated as described in the Installing Dependencies section.
- Navigate to the project directory where app.py is located.
- Run the Streamlit app using the following command:
streamlit run app.py
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.