Rhianne's Portfolio

Building a ML-Powered F1 Race Predictor: From Data to Deployment

June 20, 2025 Category: Machine Learning & Data Science

How I built a machine learning system that predicted Formula 1 podium finishers with 91% accuracy

Formula 1 is notoriously unpredictable. With 20 drivers battling across different circuits, varying weather conditions, and constantly evolving car performance, predicting race outcomes is an interesting data challenge. I wanted to use machine learning to find patterns in this chaos.

My goal was to build a system that could predict the top 3 finishers for any F1 race with at least 80% accuracy.

🏁 The Goal

Formula 1 presents one of the more interesting prediction challenges in sports. F1 features evolving technology, rule changes, and unpredictable variables like weather and strategy calls that can completely shuffle race outcomes.

My aim was to build a system that could cut through this complexity to identify the underlying patterns that determine race results. The goal: achieve an 80%+ accuracy in predicting podium finishers.

📊 Step 1: Data Collection

Discovering the OpenF1 API

I used the OpenF1 API, which gave me free access to historical F1 data from the 2023 season onwards. It included:

  • Race results and qualifying positions
  • Detailed weather conditions
  • Driver and team information
  • Session-by-session data

The Collection Process

I built a comprehensive data collection pipeline that gathered:


# This is a sample of the data I collected
- 2023: 440 driver-race combinations (~22 races)
- 2024: 479 driver-race combinations (~23 races)  
- 2025: 220 driver-race combinations (~10 races)
- Total: 1,139 complete records

Key insight: The 2025 data revealed major changes - Lewis Hamilton moved to Ferrari, 5 new rookies joined, and McLaren emerged as the dominant team. This real-world complexity would be crucial for testing my model, since I chose to train on 2023-2024 data and test on 2025 data.

Technical Challenges

The API had rate limits (10 requests per 10 seconds), so I implemented:

  • Automatic retry logic with exponential backoff
  • 5-second delays between requests
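
Here is a minimal sketch of that retry logic; the endpoint name and exact backoff timings are illustrative, not the precise implementation:

```python
import json
import time
import urllib.error
import urllib.request

OPENF1_BASE = "https://api.openf1.org/v1"  # OpenF1's public base URL

def fetch_with_retry(endpoint, max_retries=5, delay=5):
    """GET an OpenF1 endpoint, backing off exponentially on HTTP 429."""
    url = f"{OPENF1_BASE}/{endpoint}"
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return json.load(resp)
        except urllib.error.HTTPError as err:
            if err.code != 429:  # only retry when rate-limited
                raise
            time.sleep(delay * 2 ** attempt)  # wait 5s, 10s, 20s, ...
    raise RuntimeError(f"gave up on {url} after {max_retries} attempts")
```

Doubling the wait after each rate-limited attempt keeps the collector comfortably under the 10-requests-per-10-seconds cap.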

🔧 Step 2: Feature Engineering

Although pretty comprehensive, the raw data alone wasn't enough. I wanted to engineer features that captured the essence of F1 performance.

Core Predictive Features

1. Qualifying Position (Most Important)

  • Primary predictor - starting position heavily influences the race outcome.
  • It became 22% of my model's decision-making process.

2. Driver Recent Form

  • Calculated as the rolling average of the last 5 race positions.
  • It captures current momentum and confidence.

3. Team Performance Metrics

  • Season-average team position
  • Team win/podium rates
  • Critical for understanding how competitive the car is

4. Circuit-Specific History

  • It checked how each driver historically performs at specific tracks (e.g. some drivers excel at Monaco but struggle at Monza).
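
A stripped-down sketch of how the form and circuit-history features can be computed; the field names here are illustrative:

```python
def recent_form(finishes, window=5):
    """Rolling average of a driver's last `window` finishing positions."""
    recent = finishes[-window:]
    return sum(recent) / len(recent) if recent else None

def circuit_history_avg(results, driver, circuit):
    """Average finishing position for one driver at one circuit."""
    past = [r["position"] for r in results
            if r["driver"] == driver and r["circuit"] == circuit]
    return sum(past) / len(past) if past else None
```

Returning `None` when a driver has no history (a rookie, or a new circuit) lets the pipeline decide how to impute those gaps.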

Advanced Feature Engineering


# Example: Building historical context
for driver, race_date in race_entries:
    # Only look at races completed before this race's date
    driver_history = get_races_before_this_date(driver, race_date)
    features['driver_races_completed'] = len(driver_history)
    features['driver_career_wins'] = count_wins(driver_history)
    features['driver_recent_avg_position'] = recent_5_races_avg(driver_history)

The Innovation: Time-aware feature engineering. For each race prediction, I used only data available before that race, so no information could leak in from the future.

🤖 Step 3: Model Development

Algorithm Selection

I tested three approaches:

  1. Logistic Regression: Clean, interpretable baseline
  2. Random Forest: Handles non-linear relationships and feature interactions
  3. Gradient Boosting: Often the best performer for tabular data
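
With scikit-learn, comparing the three candidates is a short loop. The hyperparameters below are placeholders, not the tuned values:

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

CANDIDATES = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

def compare_models(X_train, y_train, X_test, y_test):
    """Fit each candidate and report held-out accuracy and AUC."""
    scores = {}
    for name, model in CANDIDATES.items():
        model.fit(X_train, y_train)
        proba = model.predict_proba(X_test)[:, 1]  # P(podium) for ranking
        scores[name] = {
            "accuracy": accuracy_score(y_test, model.predict(X_test)),
            "auc": roc_auc_score(y_test, proba),
        }
    return scores
```

Scoring with AUC as well as accuracy matters here: podium prediction is really a ranking problem, and AUC measures how well the model orders drivers.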

The Training Strategy

Time-Based Split (Critical Decision):

  • Training: 2023-2024 data (919 samples)
  • Testing: 2025 data (200 samples)

This simulates real-world usage - training on historical data to predict future races.
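
In code, the split is just a filter on the season; a minimal sketch, assuming each record carries a season field:

```python
def time_based_split(records):
    """Train on 2023-2024, test on 2025 - the model never sees the future."""
    train = [r for r in records if r["season"] in (2023, 2024)]
    test = [r for r in records if r["season"] == 2025]
    return train, test
```

Unlike a random shuffle, this guarantees that every test example postdates every training example, which is the honest way to evaluate a forecaster.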

Results That Exceeded Expectations


🏆 FINAL MODEL: Random Forest
✅ Test Accuracy: 91.0%
✅ AUC Score: 0.967 (near-perfect ranking ability)
✅ Perfect races: 3 out of 10 (100% accuracy)
✅ 90%+ accuracy: 6 out of 10 races

💻 Step 4: Building the Dashboard - User Experience

I wanted users to have two ways to make a prediction. In option 1, they choose a race and the model generates the results. In option 2, they set the qualifying results and track weather, and the model predicts the outcome.

Key Interface Innovations

Tab 2 Grid-Based Layout:

Instead of 20 separate input boxes, I created a visual grid:

  • P1-P5: Pole position and front runners
  • P6-P10: Midfield competitors
  • P11-P15: Back of the midfield positions
  • P16-P20: Back of the grid

Driver Dropdowns:

Each position has a dropdown showing "Driver Name (Team)" - much more intuitive than typing positions.
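
The grid rows and dropdown labels boil down to two small helpers; this is a sketch of the idea rather than the exact dashboard code:

```python
def dropdown_labels(entries):
    """Format each (driver, team) pair as a 'Driver Name (Team)' label."""
    return [f"{driver} ({team})" for driver, team in entries]

def grid_rows(labels, row_size=5):
    """Chunk the 20 grid slots into the four five-position rows."""
    return [labels[i:i + row_size] for i in range(0, len(labels), row_size)]
```

In the Streamlit app, each row then becomes a set of columns, with one dropdown per grid slot.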

Real-Time Weather Controls:

  • Wet race toggle
  • Temperature sliders
  • Humidity controls

🚀 Step 5: Cloud Deployment - Production Ready

The Update Challenge

Keeping in mind that I wanted to run the model regularly (after each qualifying and race session), I chose to deploy the model on Streamlit Community Cloud and set up a fallback system that retrains the model if the saved one can't be loaded.

The Solution: Smart Fallback System

I implemented a robust deployment strategy:


import joblib
import streamlit as st

@st.cache_resource
def load_or_train_model():
    try:
        # Try to load the saved model first
        return joblib.load('model.pkl')
    except Exception:
        # If missing or incompatible, retrain automatically
        return train_model_with_cloud_environment()

This ensures the app always works, regardless of environment differences.

📈 Real-World Performance

Validation Results

Testing on actual 2025 races:

Race          Accuracy   Notes
Chinese GP    100%       🏆 Perfect prediction
Miami GP      100%       🏆 Perfect prediction
Monaco GP     100%       🏆 Perfect prediction
Bahrain GP    95%        ⭐ Near perfect
Canadian GP   80%        ✅ Solid performance

🎯 Key Learnings & Takeaways

Technical Lessons

1. Data Quality > Model Complexity

  • Clean, well-engineered features outperformed complex algorithms
  • Time-aware feature engineering prevented data leakage

2. Real-World Validation is Everything

  • Cross-validation gave 91% accuracy
  • Real 2025 races also gave 91% accuracy
  • The model truly learned generalizable patterns

3. Deployment Complexity

  • Version compatibility is a real challenge
  • Smart fallback systems are essential
  • User experience matters as much as model accuracy

F1 Insights

1. Qualifying Positions Are Key

  • Starting position accounts for about 22% of the model's decision-making
  • But the other 78% comes from other factors - that's where ML added value
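
A figure like that 22% can be read straight off a fitted random forest's importance scores; a sketch, assuming a fitted scikit-learn model and a matching list of feature names:

```python
def top_features(model, feature_names, k=5):
    """Rank features by a fitted forest's importance scores."""
    pairs = sorted(zip(feature_names, model.feature_importances_),
                   key=lambda p: p[1], reverse=True)
    return pairs[:k]
```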

2. Form Matters More Than History

  • Recent performance outweighs career statistics (case in point: Max Verstappen's recent results, or Hamilton at Ferrari)
  • F1 changes so rapidly with new car and engine upgrades that current form dominates

🏁 Conclusion

Building this F1 prediction system taught me that machine learning isn't just about algorithms - it's about understanding the domain, engineering meaningful features, and creating systems that work.

The 91% accuracy isn't just a number - it represents a system that genuinely understands Formula 1's complex dynamics. From Lewis Hamilton's struggles at Ferrari to McLaren's dominance, the model captured the sport's evolving narrative.

Most importantly: This project proves that with the right approach, even the most unpredictable sports can yield to data science.