This project analyzes Formula 1 race data to predict tyre life using machine learning models such as LightGBM, Random Forest, and Linear Regression. It includes data pipelines for session extraction, feature engineering, and model training. The codebase supports data cleaning, feature selection, and model evaluation, with a focus on tyre compound analysis and stint strategies. More of the project story: www.linkedin.com/in/dіana-antoniuk-067b28362
Deployed project: https://antoniukdin34.pythonanywhere.com/
Demo info: The demo showcases the best-performing model by plotting the entire test set in a graph, so you can see how the model works with real data. Most predictions are accurate, with a few outliers visible in the last graph. The first three graphs show the data I gathered for the project. I chose this approach to make the results and model behavior clear and understandable.
- Read up on F1 strategy and talked to people on LinkedIn to define the problem.
- Tested the FastF1 API with a single race (Monaco 2022) to get familiar with the data.
- Built a dataset covering 2022–2025, tracking every lap and tyre change.
- Tried out several models; Random Forest worked best for predicting tyre stints.
- Switched from one-hot encoding (separate columns for Soft/Medium/Hard) to numeric encoding (Soft=1, Medium=2, Hard=3).
- Cleaned up the data, normalized features, and removed outliers (like a tyre stint of only 3 laps).
- Noticed the model was focusing on irrelevant features (like year), so simplified the feature set.
- Ended up with a model using just three features: circuit length, compound, and tyre life.
- Normalized per track, since “Soft” tyres at Monaco aren’t the same as “Soft” at Monza.
- Used scatter plots to spot and remove outliers—some tyre lives just didn’t fit the trend.
- Dealt with data collisions (same features, different results) by refining the dataset.
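The encoding, per-track normalization, and collision steps above could look roughly like the sketch below. It is not the project's actual code: the row layout, the function names (`encode_compound`, `normalize_per_track`, `resolve_collisions`), the use of z-scores, and the choice to average colliding rows are all assumptions made for illustration. It uses only the standard library so it runs without the project's dependencies.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Ordinal mapping taken from the notes above (Soft=1, Medium=2, Hard=3).
COMPOUND_CODES = {"SOFT": 1, "MEDIUM": 2, "HARD": 3}


def encode_compound(name):
    """Map a compound name to its numeric code (case-insensitive)."""
    return COMPOUND_CODES[name.upper()]


def normalize_per_track(rows):
    """Add a per-circuit z-score of tyre life to each row.

    Assumption: rows are dicts with 'circuit' and 'tyre_life' keys.
    """
    by_circuit = defaultdict(list)
    for row in rows:
        by_circuit[row["circuit"]].append(row["tyre_life"])

    out = []
    for row in rows:
        values = by_circuit[row["circuit"]]
        mu, sigma = mean(values), pstdev(values)
        # Guard against circuits where every stint has the same length.
        z = (row["tyre_life"] - mu) / sigma if sigma else 0.0
        out.append({**row, "tyre_life_z": z})
    return out


def resolve_collisions(rows, feature_keys, target_key):
    """Collapse rows with identical features into one row with the mean target.

    Assumption: averaging is one possible way to handle collisions;
    the project may refine the dataset differently.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in feature_keys)].append(row[target_key])
    return [
        {**dict(zip(feature_keys, key)), target_key: mean(targets)}
        for key, targets in groups.items()
    ]
```

Normalizing within each circuit means a stint is compared only against other stints at the same track, which is the point of the Monaco versus Monza remark above.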
- Automated extraction of F1 session data using FastF1
- Data cleaning, normalization, and feature engineering scripts
- Model training with LightGBM, Random Forest, and Linear Regression
- Tools for encoding categorical features and handling outliers
- Visualization of feature importances and tyre compound statistics
- Utilities for filtering and transforming CSV datasets
- Modular pipeline for experimenting with different features and models
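As a rough illustration of the outlier handling listed above, the sketch below drops very short stints and then applies an interquartile-range filter. The function name `drop_outlier_stints`, the `min_laps` threshold, and the 1.5 x IQR rule are assumptions for illustration only; the project's own scripts may use different criteria, such as manual inspection of scatter plots.

```python
from statistics import quantiles


def drop_outlier_stints(stint_lengths, min_laps=4, iqr_factor=1.5):
    """Return stint lengths with short stints and IQR outliers removed.

    Assumptions: `min_laps` and `iqr_factor` are illustrative defaults,
    not values taken from the project.
    """
    # Step 1: discard stints too short to say anything about tyre wear.
    kept = [n for n in stint_lengths if n >= min_laps]
    if len(kept) < 4:
        return kept  # too few points to estimate quartiles reliably

    # Step 2: keep values within [Q1 - k*IQR, Q3 + k*IQR].
    q1, _, q3 = quantiles(kept, n=4)
    iqr = q3 - q1
    low, high = q1 - iqr_factor * iqr, q3 + iqr_factor * iqr
    return [n for n in kept if low <= n <= high]
```

For example, `drop_outlier_stints([3, 18, 20, 22, 24, 25, 70])` removes the 3-lap stint in the first step and the 70-lap stint in the second.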
- Dataset_Preparation/: Scripts and CSVs for data cleaning and preparation
- Model_Training/: Model training scripts and baseline models
- README.md: Project instructions and overview
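Create a virtual environment:
python -m venv venv
Activate it (Windows Command Prompt):
venv\Scripts\activate
Or activate it in PowerShell, allowing scripts for the current session first:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope Process
.\venv\Scripts\Activate.ps1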
pip install fastf1
pip install scikit-learn
pip install seaborn
pip install -r requirements.txt
python batch_pipeline.py