From raw data to actionable predictions in 5 steps. No black box — here's exactly how the pipeline works, what data we use, and how we validate results.
Every 3-4 days, our pipeline pulls fresh data from 3 independent sources: match results, standings, and fixture schedules from football-data.org; real-time bookmaker odds from The Odds API (20+ bookmakers); and expected goals (xG), shot data, and advanced stats from understat.com.
Raw data is transformed into 70+ predictive features per match. These include rolling form metrics (last 5/10 games), Elo ratings updated after every match, expected goals differentials, head-to-head records, home/away performance splits, probabilities implied by market odds, and more.
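To make one of these features concrete, here is a minimal sketch of a standard Elo update applied after a single match. The K-factor of 20, the optional home-advantage offset, and the function names are illustrative assumptions, not the values or code used in our pipeline.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation: probability-like score for side A against side B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    home: float,
    away: float,
    result: float,
    k: float = 20.0,              # assumed K-factor, for illustration only
    home_advantage: float = 0.0,  # hypothetical rating bonus for the home side
) -> tuple[float, float]:
    """Return updated (home, away) ratings.

    result is the home side's score: 1.0 = home win, 0.5 = draw, 0.0 = away win.
    """
    exp_home = expected_score(home + home_advantage, away)
    delta = k * (result - exp_home)
    # Elo is zero-sum: whatever the home side gains, the away side loses.
    return home + delta, away - delta


# Two evenly rated teams, home side wins.
new_home, new_away = update_elo(1500.0, 1500.0, result=1.0)
print(new_home, new_away)
```

Because the update runs after every match, the rating a model sees for an upcoming fixture only reflects games that have already been played.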
We train separate LightGBM gradient-boosted models for each league. The Bundesliga is high-scoring and suits aggressive models; Serie A is defensive and rewards conservative predictions. Per-league training captures these differences. Hyperparameters are tuned using Optuna with 100-200 trials on GPU.
For each upcoming fixture, the league-specific LightGBM model outputs win/draw/loss probabilities, an over/under 2.5 goals prediction, and the most likely exact scoreline.
All predictions are evaluated using strict temporal splits: we only test on matches played after the training period, which the model has never seen. Value bets are flagged by comparing model probabilities against bookmaker odds. Smart coupons are generated by combining the picks with the highest expected value (EV). Results are pushed to the app.
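The value-bet check can be sketched as follows. This is a simplified illustration: the proportional margin removal, the 5% EV threshold, and the function names are assumptions chosen for the example, not a description of our production logic.

```python
def implied_probabilities(decimal_odds: list[float]) -> list[float]:
    """Convert decimal odds to probabilities, removing the bookmaker margin.

    Raw inverse odds sum to more than 1 (the overround); dividing by the sum
    is one simple way to normalize them.
    """
    raw = [1.0 / o for o in decimal_odds]
    total = sum(raw)
    return [p / total for p in raw]


def expected_value(model_prob: float, decimal_odds: float) -> float:
    """Expected profit per unit staked if model_prob is the true probability."""
    return model_prob * decimal_odds - 1.0


def find_value_bets(
    model_probs: list[float],
    decimal_odds: list[float],
    min_ev: float = 0.05,  # hypothetical threshold
) -> list[tuple[int, float]]:
    """Return (outcome index, EV) for outcomes whose EV exceeds min_ev."""
    bets = []
    for i, (p, o) in enumerate(zip(model_probs, decimal_odds)):
        ev = expected_value(p, o)
        if ev > min_ev:
            bets.append((i, ev))
    return bets


# Home / draw / away odds for one fixture, with made-up model probabilities.
odds = [2.0, 3.5, 4.0]
model = [0.60, 0.20, 0.20]
print(implied_probabilities(odds))
print(find_value_bets(model, odds))
```

In this made-up example only the home win is flagged: the model gives it 60%, while the odds of 2.0 imply a probability below 50%.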
Match results, standings, fixtures, and team data for all 8 leagues. Our primary source for historical match outcomes and scheduling.
Real-time bookmaker odds from 20+ bookmakers worldwide. We use these to calculate implied probabilities and detect value bets where our model disagrees with the market.
Expected goals (xG), shot data, and advanced match statistics. xG is one of the most powerful predictive features in our model — it measures the quality of chances created.
We calculate 70+ features per match. Raw data is transformed into signals that capture team strength, momentum, scoring patterns, and market expectations. Here are the main categories:
A gradient-boosted decision tree model optimized for tabular data. We train one model per league with Optuna hyperparameter tuning (100-200 GPU-accelerated trials).
Outputs: Win/Draw/Loss probabilities, Over/Under 2.5 goals prediction
Exact scoreline estimation derived from LightGBM probability outputs. Accounts for home/away goal distributions and league-specific scoring patterns.
Outputs: Most likely exact scoreline (e.g. 1-0, 2-1)
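One common way to turn goal expectations into a scoreline is an independent Poisson model for each side. The sketch below uses that approach purely as an illustration; it assumes per-team expected goal rates are already available and is not a description of how our scoreline step is actually implemented. The names and the 6-goal cap are hypothetical.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k goals when the expected number of goals is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def scoreline_grid(
    lam_home: float, lam_away: float, max_goals: int = 6
) -> dict[tuple[int, int], float]:
    """Probability of each (home, away) scoreline, assuming independent goal counts."""
    return {
        (h, a): poisson_pmf(h, lam_home) * poisson_pmf(a, lam_away)
        for h in range(max_goals + 1)
        for a in range(max_goals + 1)
    }


def most_likely_scoreline(
    lam_home: float, lam_away: float, max_goals: int = 6
) -> tuple[int, int]:
    """Return the scoreline with the highest probability in the grid."""
    grid = scoreline_grid(lam_home, lam_away, max_goals)
    return max(grid, key=grid.get)


# Made-up expected goals: home side 1.6, away side 0.9.
print(most_likely_scoreline(1.6, 0.9))
```

Even the most likely scoreline is usually a low-probability event on its own, which is why it is reported alongside the win/draw/loss probabilities rather than instead of them.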
Many prediction sites claim high accuracy but test their models on data the models have already seen during training. This is a form of data leakage and produces misleadingly high accuracy numbers.
We use strict temporal splits: the model trains only on past data and predicts only on future matches. The last 20% of time-ordered data is reserved for evaluation. This mirrors how the model operates in production — it never has access to future information.
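A temporal split of this kind can be sketched in a few lines. The 20% test fraction comes from the description above; the record layout (a dict with a "date" key) and the function name are assumptions made for the example.

```python
from datetime import date


def temporal_split(
    matches: list[dict], test_fraction: float = 0.2
) -> tuple[list[dict], list[dict]]:
    """Sort matches by date and hold out the most recent fraction for evaluation."""
    ordered = sorted(matches, key=lambda m: m["date"])
    n_test = max(1, round(len(ordered) * test_fraction))
    cut = len(ordered) - n_test
    # Everything before the cut is training data; everything after is unseen.
    return ordered[:cut], ordered[cut:]


# Ten made-up matches in shuffled order, one per day.
matches = [
    {"date": date(2024, 1, d), "id": d} for d in (5, 1, 9, 3, 7, 2, 10, 4, 8, 6)
]
train, test = temporal_split(matches)
print([m["id"] for m in train])
print([m["id"] for m in test])
```

Unlike a random shuffle, this guarantees that every evaluation match was played after every training match. Features such as rolling form or ratings must likewise be computed only from matches before each fixture's kickoff.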
All published accuracy numbers come from 140+ days of this genuine out-of-sample evaluation. Check the actual results on our model performance page.