Our Data Science Team are sharing another resource for the World Cup Datathon. This model will also be entered into the competition.

The Simple Approach


  • The fundamental idea required to build a machine learning model is to structure your data in the following format:
Target        | Feature 1 (e.g. goals scored in last match) | Feature 2 (e.g. rolling average of margin for 5 games) | More Features
Win/Lose/Draw | 2                                           | +1.3                                                   | ...
  • This can be a little more difficult in a sports model, because you will want to include the stats for both teams, and the fixture is commonly structured as home team vs away team in a single row
  • In ML-DiMaria we created a simple set of 4 features for each team: the margin in their last game, a rolling average of their last 5 game margins, the maximum goals they’ve scored in the last 10 games, and the maximum goals they’ve allowed in the last 10 games
  • These were chosen arbitrarily for this exercise; you could use your soccer knowledge and / or an iterative process to choose more and better features to improve your model
  • Then we push this data set (which looks something like the table above) through a random forest machine learning model. We use the randomForest package in R to do this, but you could take a similar approach in Python or with any of the many other ML packages in both languages
  • Finally, once we have a trained model, we predict the games in the 2014 World Cup and compare the predictions with the actual results using the MultiLogLoss function from the MLmetrics package
  • Turns out ML-DiMaria did a lot worse than makelELO!
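
The four features described above can be sketched in a few lines. This is a minimal illustration in plain Python rather than the R script itself: the match tuples, the column names and the choice to label the target from the home team's point of view are all assumptions made for the example. Each team's features are computed only from games played before the fixture, so the row never sees the result it is trying to predict.

```python
from collections import defaultdict

# Hypothetical match records in date order: (home, away, home_goals, away_goals).
matches = [
    ("A", "B", 2, 0),
    ("B", "C", 1, 1),
    ("C", "A", 0, 3),
    ("A", "B", 1, 2),
]

def team_features(history):
    """Four features from a team's past (goals_for, goals_against) pairs, oldest first."""
    if not history:
        return None  # no earlier games, so nothing to compute
    margins = [gf - ga for gf, ga in history]
    last5 = margins[-5:]
    last10 = history[-10:]
    return {
        "last_margin": margins[-1],
        "avg_margin_5": sum(last5) / len(last5),
        "max_scored_10": max(gf for gf, _ in last10),
        "max_allowed_10": max(ga for _, ga in last10),
    }

def build_rows(matches):
    """One row per fixture: target plus the features of both teams."""
    history = defaultdict(list)
    rows = []
    for home, away, hg, ag in matches:
        hf = team_features(history[home])
        af = team_features(history[away])
        if hf and af:  # skip fixtures where either team has no earlier games
            # Assumption: the target is the result from the home team's side.
            target = "Win" if hg > ag else "Lose" if hg < ag else "Draw"
            row = {"target": target}
            row.update({f"home_{k}": v for k, v in hf.items()})
            row.update({f"away_{k}": v for k, v in af.items()})
            rows.append(row)
        # Update histories only after the row is built.
        history[home].append((hg, ag))
        history[away].append((ag, hg))
    return rows

rows = build_rows(matches)
for row in rows:
    print(row)
```

Rows in this shape can then be handed to whichever classifier you prefer; the early fixtures are dropped because the teams have no history yet.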


  • ML-DiMaria leaves us a lot more room to move and improve
  • We can add and remove feature columns or change our ML architecture / approach, among the many other things you can find online to improve your ML scores
  • We suggest that combining ML-DiMaria and makelELO might also be a good thing to try!
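
One simple way to try the combination idea is to average the two models' win / draw / lose probabilities and score the result with a multi-class log loss. The sketch below is written in plain Python and is only an illustration: the function names, the equal weighting and the probabilities are assumptions, and the log loss here is a hand-rolled version of the metric rather than the MLmetrics function.

```python
import math

def multi_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log of the probability given to the true class.

    y_true holds class indices, y_prob holds one probability list per match.
    """
    total = 0.0
    for true_idx, probs in zip(y_true, y_prob):
        p = min(max(probs[true_idx], eps), 1 - eps)  # avoid log(0)
        total -= math.log(p)
    return total / len(y_true)

def blend(probs_a, probs_b, weight=0.5):
    """Weighted average of two models' probabilities, renormalised per match."""
    blended = []
    for a, b in zip(probs_a, probs_b):
        mixed = [weight * x + (1 - weight) * y for x, y in zip(a, b)]
        total = sum(mixed)
        blended.append([m / total for m in mixed])
    return blended

# Made-up probabilities for two matches, ordered as [win, draw, lose].
model_a = [[0.6, 0.2, 0.2], [0.3, 0.3, 0.4]]
model_b = [[0.2, 0.6, 0.2], [0.1, 0.3, 0.6]]
actual = [0, 2]  # index of the observed outcome for each match

combined = blend(model_a, model_b)
print("model A :", multi_log_loss(actual, model_a))
print("model B :", multi_log_loss(actual, model_b))
print("blended :", multi_log_loss(actual, combined))
```

Lower is better. The `weight` argument is worth tuning on held-out games rather than on the matches you are scoring.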

The Final Product


  • This script helps with a few things you’ll encounter when you move from the lab / training to actually creating a submission
  • First you’ll want to expand your training set to all the data we’ve provided to improve your model
  • Then you’ll need to get your hands dirty creating the features for the 2018 World Cup. Merging two different datasets together makes things a bit awkward.
  • A lot of the work is done in this script though and you should be able to shape it for your needs and create a bigger, better machine learning model
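
The awkward part of that merge is usually that the fixture list and the historical data do not agree on team names. Here is a minimal sketch of the idea in plain Python, assuming you already have one feature snapshot per team. The team names, feature keys and the `name_map` lookup are hypothetical placeholders, not the structure of the provided datasets.

```python
# Hypothetical latest feature snapshot per team, built from the historical data.
latest_features = {
    "Team A": {"last_margin": 1, "avg_margin_5": 0.4},
    "Team B": {"last_margin": -1, "avg_margin_5": -0.2},
}

# Hypothetical fixture list whose spellings may differ from the historical data.
fixtures = [("Team A", "Team-B"), ("Team C", "Team A")]

# Hypothetical manual fixes for names that differ between the two datasets.
name_map = {"Team-B": "Team B"}

def build_fixture_rows(fixtures, latest_features, name_map=None):
    """Attach each team's latest features to its fixture row.

    Returns the rows that could be built and a sorted list of unmatched names.
    """
    name_map = name_map or {}
    rows, missing = [], set()
    for team_1, team_2 in fixtures:
        names = [name_map.get(t, t) for t in (team_1, team_2)]
        unknown = [n for n in names if n not in latest_features]
        if unknown:
            missing.update(unknown)  # report instead of silently guessing
            continue
        row = {"team_1": names[0], "team_2": names[1]}
        for prefix, name in zip(("team_1", "team_2"), names):
            for key, value in latest_features[name].items():
                row[f"{prefix}_{key}"] = value
        rows.append(row)
    return rows, missing_sorted(missing)

def missing_sorted(names):
    """Small helper so the unmatched names come back in a stable order."""
    return sorted(names)

fixture_rows, unmatched = build_fixture_rows(fixtures, latest_features, name_map)
print(fixture_rows)
print("unmatched teams:", unmatched)
```

Checking the unmatched list before predicting is a cheap way to catch fixtures that would otherwise drop out of your submission.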
