Construction of a public dataset of Blood Bowl matches played on FUMBBL.com
This blog post is about Blood Bowl, a board game of fantasy football that can also be played online at FUMBBL.com. The goal of this post is to use Python, combining API calls and HTML scraping, to fetch online Blood Bowl match outcome data, and to create a publicly available dataset ready for analysis and visualization. In a separate blog post I’ll showcase some analyses of this dataset (this type of analysis is also known as Nufflytics, in reference to Nuffle, the god of Blood Bowl). The dataset is primarily constructed to analyze rule changes that were introduced on FUMBBL.com last year, and how these affected the relative strengths of the different teams coaches can pick to field against each other.
N.b. an updated version of this post is available as a Jupyter Notebook in a separate GitHub repository, fumbbl_datasets.
In writing this blog post, I took inspiration from a relatively new development in Open Science: the Data paper (Chavan & Penev, 2011). A data paper is an (ideally peer-reviewed) publication of a dataset as a stand-alone research output. A data paper can be thought of as similar to the “Methods” section of a traditional research article, though with greater detail. It describes the contents of the dataset and the data acquisition process, and includes a discussion of the motivation and considerations regarding experimental design (if applicable). Data papers do not provide any analysis, results, or conclusions. The dataset itself should be available online at a data repository such as Zenodo, Figshare or Dryad.
To give an impression of how this works out in practice, here are two examples. The first data paper, A public data set of spatio-temporal match events in soccer competitions, was published in “Scientific Data”, with the dataset hosted at Figshare. The second is collected from Twitter: A large-scale COVID-19 Twitter chatter dataset for open scientific research – an international collaboration, with the open dataset published on Zenodo.
Chavan, V., and Penev, L. (2011). The data paper: a mechanism to incentivize data publishing in biodiversity science. BMC Bioinformatics 12(Suppl. 15):S2. doi:10.1186/1471-2105-12-S15-S2
Software needed to reproduce this blog post
This blog post is written as a Jupyter notebook containing Python code, and is fully reproducible. The idea is to make Blood Bowl data analysis accessible to others: using open source tooling lowers the barrier for others to build on this work.
import random
import time
import requests # API library
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
Blood Bowl online: FUMBBL
The FUMBBL website (https://fumbbl.com) contains a large amount of data. From coach pages, with their teams, to team rosters, with players, and match histories. It’s all there.
To obtain FUMBBL data, we need to fetch it match by match, team by team. To do so, the site creator Christer Kaivo-oja, from Sweden, has made an API that allows us to easily fetch data. What follows is a short demonstration of how the API works, before we fetch the FUMBBL match and team data of the last 12 months.
Behold, the power of Requests
I use the Python Requests library to make the API call over HTTPS and obtain the response from the FUMBBL server. The response is in JSON, a lightweight data-interchange format that is both easy for humans to read and write, and easy for computers to parse and generate. This makes it a natural choice for an API.
Here is an example of what is available at the coach level. The full documentation of the API can be found at https://fumbbl.com/apidoc/.
response = requests.get("https://fumbbl.com/api/coach/teams/gsverhoeven")
# display the complete JSON object {}
response.json()
{'id': 255851,
'name': 'gsverhoeven',
'teams': [{'id': 1003452,
'coachId': 255851,
'name': 'Hillywood Hellraisers',
'rosterId': 50,
'race': 'Human',
'teamValue': 1090000,
'canLfg': 'Yes',
'isLfg': 'Yes',
'games': '3',
'divisionId': 1,
'division': 'Ranked',
'leagueId': 0,
'league': None,
'status': 'Retired',
'raceLogos': [{'size': 32, 'logo': 486290},
{'size': 48, 'logo': 486291},
{'size': 64, 'logo': 486292},
{'size': 96, 'logo': 486293},
{'size': 128, 'logo': 486294},
{'size': 192, 'logo': 486295}]},
{'id': 1035833,
'coachId': 255851,
'name': 'Pharaoh Munchers',
'rosterId': 51,
'race': 'Tomb Kings',
'teamValue': 960000,
'canLfg': 'Yes',
'isLfg': 'No',
'games': '0',
'divisionId': 1,
'division': 'Ranked',
'leagueId': 0,
'league': None,
'status': 'Active',
'raceLogos': [{'size': 32, 'logo': 486296},
{'size': 48, 'logo': 486297},
{'size': 64, 'logo': 486298},
{'size': 96, 'logo': 486299},
{'size': 128, 'logo': 486300},
{'size': 192, 'logo': 486301}]},
{'id': 1035835,
'coachId': 255851,
'name': 'Blackbox Bastards',
'rosterId': 50,
'race': 'Human',
'teamValue': 840000,
'canLfg': 'No',
'isLfg': 'No',
'games': '1',
'divisionId': 10,
'division': 'Blackbox',
'leagueId': 0,
'league': None,
'status': 'Retired',
'raceLogos': [{'size': 32, 'logo': 486290},
{'size': 48, 'logo': 486291},
{'size': 64, 'logo': 486292},
{'size': 96, 'logo': 486293},
{'size': 128, 'logo': 486294},
{'size': 192, 'logo': 486295}]},
{'id': 1036599,
'coachId': 255851,
'name': 'Seven cities of Gold',
'rosterId': 52,
'race': 'Lizardmen',
'teamValue': 1000000,
'canLfg': 'Yes',
'isLfg': 'No',
'games': '0',
'divisionId': 1,
'division': 'Ranked',
'leagueId': 0,
'league': None,
'status': 'Active',
'raceLogos': [{'size': 32, 'logo': 486302},
{'size': 48, 'logo': 486303},
{'size': 64, 'logo': 486304},
{'size': 96, 'logo': 486305},
{'size': 128, 'logo': 486306},
{'size': 192, 'logo': 486307}]},
{'id': 1038960,
'coachId': 255851,
'name': 'Hillywood Hellraisers 2.0',
'rosterId': 50,
'race': 'Human',
'teamValue': 1110000,
'canLfg': 'Yes',
'isLfg': 'Yes',
'games': '4',
'divisionId': 1,
'division': 'Ranked',
'leagueId': 0,
'league': None,
'status': 'Active',
'raceLogos': [{'size': 32, 'logo': 486290},
{'size': 48, 'logo': 486291},
{'size': 64, 'logo': 486292},
{'size': 96, 'logo': 486293},
{'size': 128, 'logo': 486294},
{'size': 192, 'logo': 486295}]},
{'id': 1050267,
'coachId': 255851,
'name': 'Gooische Heidebeukers',
'rosterId': 4974,
'race': 'Orc',
'teamValue': 1330000,
'canLfg': 'Yes',
'isLfg': 'Yes',
'games': '6',
'divisionId': 5,
'division': 'League',
'leagueId': 14883,
'league': 'Benelux Hate Bowl',
'status': 'Post Match Sequence',
'raceLogos': [{'size': 32, 'logo': 486332},
{'size': 48, 'logo': 486333},
{'size': 64, 'logo': 486334},
{'size': 96, 'logo': 486335},
{'size': 128, 'logo': 486336},
{'size': 192, 'logo': 486337}]},
{'id': 1052980,
'coachId': 255851,
'name': '[2020] Hillywood Hellraisers',
'rosterId': 4964,
'race': 'Human',
'teamValue': 1000000,
'canLfg': 'Yes',
'isLfg': 'No',
'games': '0',
'divisionId': 2,
'division': 'Competitive',
'leagueId': 0,
'league': None,
'status': 'Active',
'raceLogos': [{'size': 32, 'logo': 486290},
{'size': 48, 'logo': 486291},
{'size': 64, 'logo': 486292},
{'size': 96, 'logo': 486293},
{'size': 128, 'logo': 486294},
{'size': 192, 'logo': 486295}]},
{'id': 1053065,
'coachId': 255851,
'name': '[2020] Grrrl power',
'rosterId': 5141,
'race': 'Amazon',
'teamValue': 960000,
'canLfg': 'Yes',
'isLfg': 'No',
'games': '0',
'divisionId': 2,
'division': 'Competitive',
'leagueId': 0,
'league': None,
'status': 'Active',
'raceLogos': [{'size': 32, 'logo': 486194},
{'size': 48, 'logo': 486192},
{'size': 64, 'logo': 486195},
{'size': 96, 'logo': 486191},
{'size': 128, 'logo': 486193},
{'size': 192, 'logo': 486190}]},
{'id': 1060696,
'coachId': 255851,
'name': 'Hakflem support team',
'rosterId': 4978,
'race': 'Underworld Denizens',
'teamValue': 730000,
'canLfg': 'Yes',
'isLfg': 'No',
'games': '0',
'divisionId': 2,
'division': 'Competitive',
'leagueId': 0,
'league': None,
'status': 'Active',
'raceLogos': [{'size': 32, 'logo': 603383},
{'size': 48, 'logo': 603381},
{'size': 64, 'logo': 603378},
{'size': 96, 'logo': 603382},
{'size': 128, 'logo': 603380},
{'size': 192, 'logo': 603379}]}]}
Parsing JSON data
Let’s have a closer look at the JSON data structure here. We have a list of key-value pairs. Using brackets [] we can choose keys and retrieve their values. Some keys contain simple values, such as name,
response.json()['name']
'gsverhoeven'
but some return as value a new list of key-value pairs, such as teams. Actually this is a “list of lists of key-value pairs”, since we have a separate list for each team. Even the list for a single team contains further structure, for example under the key raceLogos.
response.json()['teams'][2]['raceLogos']
[{'size': 32, 'logo': 486290},
{'size': 48, 'logo': 486291},
{'size': 64, 'logo': 486292},
{'size': 96, 'logo': 486293},
{'size': 128, 'logo': 486294},
{'size': 192, 'logo': 486295}]
response.json()['teams'][2]['name']
'Blackbox Bastards'
What data do we need? And in what shape?
Now that we know how the data comes in, we need to think about which variables we want, and how to structure them. The most straightforward level at which to analyze race strength is match outcomes. At its core, the data consists of matches played by teams, commanded by coaches. Furthermore, we expect race strength to change over time, as players discover new strategies or new rules get introduced. So the time dimension is important as well.
So, let’s go with a flat data frame with rows for each match, and columns for the various variables associated with each match. These would include:
- Coach ids
- Team races
- Team ids
- Date of the match
- Outcome (Touchdowns of both teams)
With this basic structure, we can add as many match-related variables as we like in the future, keeping the basic structure (each row is a match) unchanged.
So let’s get the match data!
Step 1: API scraping the match data: df_matches
We are mostly interested in the current ruleset, BB2020. This ruleset became available for play on FUMBBL on September 1st, 2021, and two months later some 5,000 games had already been played. We also want to compare with the previous ruleset, BB2016, for which we have much more data available. How far do we go back? Let’s go for roughly 12 months of BB2016 ruleset matches, plus the first few months of BB2020 matches.
The easiest way to collect match data over a particular period of time is to simply loop over match_id. The most recent match at the time of writing was 4,347,800, and since roughly 100,000 matches are played each year, we can fiddle about a bit and find that match 4,216,258 was played on August 1st, 2020. So that means we need to collect some 130K matches.
VERY IMPORTANT: We do not want to overload the FUMBBL server, so we make only three API requests per second. In this way, the server load is hardly affected and it can continue functioning properly for all the Blood Bowl coaches playing their daily games!
To collect 130K matches, we will need 130000*0.333/3600 = 12 hours.
# estimated hours fetching data
(4347800-4216257)*0.333/3600
12.16773
df_matches = pd.DataFrame(columns=['match_id', 'match_date', 'match_time',
                                   'team1_id', 'team1_coach_id', 'team1_roster_id', 'team1_race_name', 'team1_value',
                                   'team2_id', 'team2_coach_id', 'team2_roster_id', 'team2_race_name', 'team2_value',
                                   'team1_score', 'team2_score'])
target = 'data/df_matches_' + time.strftime("%Y%m%d_%H%M%S") + '.h5'
print(target)
end_match = 4347800
begin_match = 4216258
n_matches = end_match - begin_match
full_run = 0
print(n_matches)
if full_run:
    for i in range(n_matches):
        api_string = "https://fumbbl.com/api/match/get/" + str(end_match - i)
        # wait 0.33 s on average between each API call
        wait_time = (random.uniform(0.5, 1) + 0.25)/3
        time.sleep(wait_time)
        match = requests.get(api_string)
        match = match.json()
        if match: # fix for matches that do not exist
            match_id = match['id']
            match_date = match['date']
            match_time = match['time']
            team1_id = match['team1']['id']
            team2_id = match['team2']['id']
            team1_score = match['team1']['score']
            team2_score = match['team2']['score']
            team1_roster_id = match['team1']['roster']['id']
            team2_roster_id = match['team2']['roster']['id']
            team1_coach_id = match['team1']['coach']['id']
            team2_coach_id = match['team2']['coach']['id']
            team1_race_name = match['team1']['roster']['name']
            team2_race_name = match['team2']['roster']['name']
            team1_value = match['team1']['teamValue']
            team2_value = match['team2']['teamValue']
            df_matches.loc[i] = [match_id, match_date, match_time,
                                 team1_id, team1_coach_id, team1_roster_id, team1_race_name, team1_value,
                                 team2_id, team2_coach_id, team2_roster_id, team2_race_name, team2_value,
                                 team1_score, team2_score]
        else:
            # no data for this match: create an empty row with only the match_id filled in
            match_id = int(end_match - i)
            df_matches.loc[i] = [np.NaN] * 15
            # use a single .loc with row and column label: chained indexing
            # (df.loc[i]['match_id'] = ...) writes to a copy and may not update the dataframe
            df_matches.loc[i, 'match_id'] = match_id
        if i % 100 == 0:
            # write tmp data as hdf5 file
            print(i, end='')
            print(".", end='')
            df_matches.to_hdf(target, key='df_matches', mode='w')
    # write data as hdf5 file
    df_matches.to_hdf(target, key='df_matches', mode='w')
else:
    # read from hdf5 file
    df_matches = pd.read_hdf('data/df_matches_20211215_212935.h5')
df_matches.shape
data/df_matches_20211230_192705.h5
131542
(131542, 15)
Since we manually filled the pandas DataFrame, most of the columns are now of the object datatype. We need to change this to be able to work properly with the data, as well as to store it properly. Here I convert each column manually; I later found out about DataFrame.infer_objects(), which can detect the proper dtype automatically. I will try that next time.
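A minimal sketch of that alternative (untested on this dataset): infer_objects() handles the numeric columns, but the match_date strings would still need an explicit conversion.
# let pandas infer proper dtypes for the object columns
df_matches = df_matches.infer_objects()
df_matches['match_date'] = pd.to_datetime(df_matches.match_date)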
# convert object dtype columns to proper pandas dtypes datetime and numeric
df_matches['match_date'] = pd.to_datetime(df_matches.match_date) # Datetime object
df_matches['match_id'] = pd.to_numeric(df_matches.match_id)
df_matches['team1_id'] = pd.to_numeric(df_matches.team1_id)
df_matches['team1_coach_id'] = pd.to_numeric(df_matches.team1_coach_id)
df_matches['team1_roster_id'] = pd.to_numeric(df_matches.team1_roster_id)
df_matches['team2_id'] = pd.to_numeric(df_matches.team2_id)
df_matches['team2_coach_id'] = pd.to_numeric(df_matches.team2_coach_id)
df_matches['team2_roster_id'] = pd.to_numeric(df_matches.team2_roster_id)
df_matches['team1_score'] = pd.to_numeric(df_matches.team1_score)
df_matches['team2_score'] = pd.to_numeric(df_matches.team2_score)
# calculate match score difference
df_matches['team1_win'] = np.sign(df_matches['team1_score'] - df_matches['team2_score'])
df_matches['team2_win'] = np.sign(df_matches['team2_score'] - df_matches['team1_score'])
# mirror match
df_matches['mirror_match'] = 0
df_matches.loc[df_matches['team1_race_name'] == df_matches['team2_race_name'], 'mirror_match'] = 1
# 5K mirror matches
df_matches.query('mirror_match == 1').shape
(5081, 18)
# Make the BB2020 Khorne race name equal to its BB2016 name to compare them more easily
df_matches.loc[df_matches['team1_race_name'] == "Khorne", 'team1_race_name'] = 'Daemons of Khorne'
df_matches.loc[df_matches['team2_race_name'] == "Khorne", 'team2_race_name'] = 'Daemons of Khorne'
Dataprep: transforming the team values
In Blood Bowl, teams develop over the course of multiple matches. The winnings of each match can be spent on buying new, stronger players, or on replacing players that ended up injured or even killed. In addition, players receive so-called star player points (SPP) for important events, such as scoring or inflicting a casualty on the opponent. Therefore, a balancing mechanism is needed when a newly created “rookie” team faces a highly developed opposing team with lots of extra skills and strong players.
Blood Bowl solves this by calculating a current team value for both teams. The team value difference for a match determines the amount of gold that the weaker team can use to buy so-called inducements. These inducements are temporary, and can consist of a famous “star player” who joins the team just for this match. Another popular option is to hire a wizard, who can turn one of the opposing players into a frog.
It is well known that the win rates of the teams depend on how developed a team is. For example, Amazons are thought to be strongest at low team value, as they already start out with lots of Block and Dodge skills, whereas a Chaos team starts out with almost no skills. So if we compare win rates, we would like to take the current team value into account. As this can differ between the two teams in a match-up, I reasoned that the higher of the two team values is most informative about the average strength level of both teams, because of the inducement mechanism described above.
In the dataset, we have for each match the current team values of both teams as a text string. We transform the text string 1100k into the integer 1100, so that we can calculate the difference as tv_diff, and pick for each match the maximum team value and store it as tv_match. Finally, we create a team value bin tv_bin to be able to compare win rates for binned groups of matches where races have comparable team strength / team development.
# convert team value string 1100k to the integer 1100
df_matches['team1_value'] = df_matches['team1_value'].str.replace('k$', '', regex=True)
df_matches['team1_value'] = df_matches['team1_value'].fillna(0).astype(np.int64)
df_matches['team2_value'] = df_matches['team2_value'].str.replace('k$', '', regex=True)
df_matches['team2_value'] = df_matches['team2_value'].fillna(0).astype(np.int64)
df_matches['tv_diff'] = np.abs(df_matches['team2_value'] - df_matches['team1_value'])
df_matches['tv_match'] = df_matches[["team1_value", "team2_value"]].max(axis=1)
df_matches['tv_bin'] = pd.cut(df_matches['tv_match'],
bins = [0, 950, 1250,1550, 1850, float("inf")],
labels=['< 950', '1.1K', '1.4K', '1.7K', '> 1850']
)
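As a quick sanity check on the binning, we can count the matches per team value bin:
# number of matches in each team value bin
df_matches['tv_bin'].value_counts(sort=False)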
Dropping empty matches
Some match_ids do not have match information attached to them; presumably these matches were never played, or some real-life event interfered. These match_ids are dropped from the dataset to get rid of the NAs in all the columns.
df_matches = df_matches.dropna(subset=['match_date'])
# 131K matches
len(df_matches)
131509
Dataprep: getting the dates right
To see time trends, its useful to aggregate the data by week. For this we add week_number
for each date, and from this week number, we convert back to a date to get a week_date
. This last part is useful for plotting with plotnine
, as this treats dates in a special way. We use the ISO definition of week, this has some unexpected behavior near the beginning / end of each year, that we fix manually.
The data starts in week 36 (september) of 2020, and stops halfway week 44 in 2021.
df_matches['week_number'] = df_matches['match_date'].dt.isocalendar().week
# cannot serialize numpy int OR Int64 when writing HDF5 file, so we go for plain int as all NAs are gone now
df_matches['week_number'] = df_matches['week_number'].fillna(0).astype(int)
# add year based on match date (but want it based on match ISO week)
df_matches['year'] = pd.DatetimeIndex(df_matches['match_date']).year
# manual fix year for ISO week 2020-53 (2020 has 53 ISO weeks, including a few days in jan 2021)
df_matches.loc[(df_matches['year'] == 2021) & (df_matches['week_number'] == 53), 'year'] = 2020
df_matches['week_year'] = df_matches['year'].astype(str) + '-' + df_matches['week_number'].astype(str)
df_matches['week_date'] = pd.to_datetime(df_matches['week_year'].astype("string") + '-1', format = "%Y-%U-%w")
# manual fix of week date (grrrr)
df_matches.loc[(df_matches['week_date'] == '2021-01-04') & (df_matches['week_number'] == 53), 'week_date'] = pd.to_datetime('2020-12-31')
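In hindsight, these manual fixes could be avoided with the ISO-week directives %G, %V and %u that strptime supports since Python 3.6. A sketch of that alternative, not used in the run above:
# alternative week_date: the Monday of the ISO (year, week) pair, no manual fixes needed
from datetime import datetime
iso = df_matches['match_date'].dt.isocalendar()
df_matches['week_date'] = [datetime.strptime(f"{y}-W{w}-1", "%G-W%V-%u")
                           for y, w in zip(iso['year'], iso['week'])]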
Step 2: HTML Scraping the inducements and coach rankings for each match
Next, we collect for all the matches in df_matches the inducements and coach rankings. This information is not available through the API, but is presented on an HTML page at https://fumbbl.com/FUMBBL.php?page=match&id=4350014, which summarizes the information for (in this case) match 4350014.
I highly recommend this tutorial for a great introduction to BeautifulSoup. In addition, to clean up the scraped text, I used the re Python module (regular expressions), part of the Python standard library, to extract the actual inducements from the text string that contains them.
from bs4 import BeautifulSoup
import re
df_inducements = pd.DataFrame(columns=['match_id', 'team1_inducements', 'team2_inducements', 'coach1_ranking', 'coach2_ranking'])
target = 'data/df_inducements_' + time.strftime("%Y%m%d_%H%M%S") + '.h5'
print(target)
end_match = 4347800
begin_match = 4216257
n_matches = end_match - begin_match
full_run = 0
print(n_matches)
if full_run:
    for i in range(n_matches):
        match_id = end_match - i
        api_string = "https://fumbbl.com/FUMBBL.php?page=match&id=" + str(match_id)
        # wait 0.33 s on average between each GET call
        wait_time = (random.uniform(0.5, 1) + 0.25)/3
        time.sleep(wait_time)
        response = requests.get(api_string)
        soup = BeautifulSoup(response.content, 'html.parser')
        if soup.find("div", {"class": "matchrecord"}) is not None:
            # match record is available
            inducements = soup.find_all("div", class_="inducements")
            pattern = re.compile(r'\s+Inducements: (.*)\n')
            match = re.match(pattern, inducements[0].get_text())
            if match:
                team1_inducements = match.group(1)
            else:
                team1_inducements = ''
            match = re.match(pattern, inducements[1].get_text())
            if match:
                team2_inducements = match.group(1)
            else:
                team2_inducements = ''
            coach_rankings = soup.find_all("div", class_="coach")
            coach1_ranking = coach_rankings[0].get_text()
            coach2_ranking = coach_rankings[1].get_text()
            df_inducements.loc[i] = [match_id, team1_inducements, team2_inducements, coach1_ranking, coach2_ranking]
        if i % 100 == 0:
            # write tmp data as hdf5 file
            print(i, end='')
            print(".", end='')
            df_inducements.to_hdf(target, key='df_inducements', mode='w')
    # write data as hdf5 file
    df_inducements.to_hdf(target, key='df_inducements', mode='w')
else:
    # read from hdf5 file
    df_inducements = pd.read_hdf('data/df_inducements_20211215_112148.h5')
df_inducements.info()
data/df_inducements_20211230_192709.h5
131543
<class 'pandas.core.frame.DataFrame'>
Int64Index: 131509 entries, 0 to 131508
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 match_id 131509 non-null object
1 team1_inducements 131509 non-null object
2 team2_inducements 131509 non-null object
3 coach1_ranking 131509 non-null object
4 coach2_ranking 131509 non-null object
dtypes: object(5)
memory usage: 6.0+ MB
df_inducements.query("match_id == 4347799")
 | match_id | team1_inducements | team2_inducements | coach1_ranking | coach2_ranking |
---|---|---|---|---|---
1 | 4347799 | 1 bloodweiser babe, Mercenary Merc Thrall Line… | | CR 154.6 (+0.97) Emerging Star Vini | motay666 Emerging Star CR 151.12 (-0.97) |
Dataprep: coach rankings
We want to extract the part CR 152.53 from the scraped coach information field. Just as we matched on Inducements:, we can match on CR and grab the contents directly after it, stopping when we encounter a whitespace. We first play around a bit and test until we discover the proper regular expression to use :-)
#pattern = re.compile(r'\s+Inducements: (.*)\n')
pattern = re.compile(r'.*CR (.*)\s\(.*')
match = re.match(pattern, df_inducements.loc[0, 'coach2_ranking'])
if match is not None:
print(match.group(1)) # group(0) is the whole string
else:
print("match is none")
df_inducements.loc[0, 'coach2_ranking']
144.51
'LoxodonP Veteran CR 144.51 (-0.76)'
Got ’m! Now that we have figured it out, we can write the code that extracts the coach rankings:
# Dataprep fix match_id
df_inducements['match_id'] = pd.to_numeric(df_inducements.match_id)
# Dataprep: add the coach rankings as separate cols
df_inducements['coach1_CR'] = df_inducements['coach1_ranking'].str.extract(r'.*CR (.*)\s\(.*')
df_inducements['coach2_CR'] = df_inducements['coach2_ranking'].str.extract(r'.*CR (.*)\s\(.*')
df_inducements['coach1_CR'] = pd.to_numeric(df_inducements['coach1_CR'])
df_inducements['coach2_CR'] = pd.to_numeric(df_inducements['coach2_CR'])
df_inducements['CR_diff'] = np.abs(df_inducements['coach1_CR'] - df_inducements['coach2_CR'])
df_inducements['CR_diff'] = df_inducements['CR_diff'].astype(float)
df_inducements['cr_diff2'] = df_inducements['coach1_CR'] - df_inducements['coach2_CR']
df_inducements['cr_bin'] = pd.cut(df_inducements['cr_diff2'], bins = [-1*float("inf"), -30, -20, -10, -5, 5, 10, 20, 30, float("inf")],
labels=['{-Inf,-30]', '[-30,-20]', '[-20,-10]', '[-10,-5]', '[-5,5]', '[5,10]', '[10,20]', '[20,30]', '[30,Inf]'])
df_inducements.dtypes
match_id int64
team1_inducements object
team2_inducements object
coach1_ranking object
coach2_ranking object
coach1_CR float64
coach2_CR float64
CR_diff float64
cr_diff2 float64
cr_bin category
dtype: object
Dataprep match inducements for each team
The next trick is to use the pandas explode() method (similar to separate_rows() in tidyverse R) to give each inducement its own row in the dataset. This creates a dataframe (inducements) similar to the df_mbt dataframe created in Step 3 below, with each match generating at least two rows.
team1_inducements = df_inducements[['match_id', 'team1_inducements']]
team2_inducements = df_inducements[['match_id', 'team2_inducements']]
# make column names equal
team1_inducements.columns = team2_inducements.columns = ['match_id', 'inducements']
# row bind the two dataframes
inducements = pd.concat([team1_inducements, team2_inducements], ignore_index = True)
# convert comma separated string to list
inducements['inducements'] = inducements['inducements'].str.split(',')
# make each element of the list a separate row
inducements = inducements.explode('inducements')
# strip leading and trailing whitespaces
inducements['inducements'] = inducements['inducements'].str.strip()
# create "star player" label
inducements['star_player'] = 0
inducements.loc[inducements['inducements'].str.contains("Star player"), 'star_player'] = 1
# create "card" label
inducements['special_card'] = 0
inducements.loc[inducements['inducements'].str.contains("Card"), 'special_card'] = 1
Add inducement info to df_matches
Here we add df_inducements to df_matches. This contains the inducements of each team as a single string, which is not convenient for analysis. The inducements dataframe cannot easily be added to df_matches. We can, however, extract information from inducements at the match level and add this to df_matches. Here, I show how to add a 1/0 flag has_sp that codes whether a match included any star player.
df_matches = pd.merge(df_matches, df_inducements, on='match_id', how='left')
df_sp = (inducements
.groupby("match_id")
.agg(has_sp = ("star_player", "max"))
.reset_index()
)
df_matches = pd.merge(df_matches, df_sp, on = "match_id", how = "left")
df_matches['match_id'] = pd.to_numeric(df_matches.match_id)
Step 3: Create matches by team DataFrame
When analyzing the data, we would also like to have a dataframe df_mbt (df_matches_by_team) that contains, for each match, a separate row for each team participating in that match. This structure is nicely visualized at the Nufflytics blog. Such a dataset is suitable for analysis by team, e.g. win rates. We can extend it further by adding, at the match level, data that is specific to each team / coach pair, such as team value, coach rating etc. For example, we can imagine adding more team-level data, such as casualties caused during the match, or the team composition at the start of the match.
# make two copies, one for each team in the match
team1_data = df_matches[['match_id', 'match_date', 'week_number', 'year', 'week_year', 'week_date', 'team1_id',
'team1_coach_id', 'team1_race_name', 'team1_value', 'team1_win', 'tv_diff', 'tv_match',
'tv_bin', 'mirror_match', 'coach1_CR', 'CR_diff', 'has_sp']].copy()
team2_data = df_matches[['match_id', 'match_date', 'week_number', 'year', 'week_year', 'week_date', 'team2_id',
'team2_coach_id', 'team2_race_name', 'team2_value', 'team2_win', 'tv_diff', 'tv_match',
'tv_bin','mirror_match', 'coach2_CR', 'CR_diff', 'has_sp']].copy()
team1_data.columns = team2_data.columns = ['match_id', 'match_date', 'week_number', 'year', 'week_year', 'week_date', 'team_id',
'coach_id', 'race_name', 'team_value', 'wins', 'tv_diff', 'tv_match',
'tv_bin', 'mirror_match', 'coach_CR', 'CR_diff', 'has_sp']
# combine both dataframes
df_mbt = pd.concat([team1_data, team2_data])
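Since every match contributes one row per team, a quick consistency check is that df_mbt contains exactly twice as many rows as df_matches:
# each match appears twice: once for team 1, once for team 2
assert len(df_mbt) == 2 * len(df_matches)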
Adding outcome weights
One way to measure team strength is to calculate a win rate. If we want to calculate win rates, we need to decide how to weigh a draw. In Blood Bowl data analysis, a 2:1:0 (W/D/L) weighting scheme seems to be most commonly used, so if we want to compare with others, it makes sense to adopt this scheme as well. If we divide these weights by two, we get 1 / 0.5 / 0, and the average of these values can be interpreted as a win rate.
This scheme has the advantage that the weighted average win percentage over all matches is always 50%, creating a nice reference point that allows conclusions such as “this team has an x percent above average win percentage”.
df_mbt.loc[df_mbt['wins'] == 0, 'wins'] = 0.5
df_mbt.loc[df_mbt['wins'] == -1, 'wins'] = 0
# convert to float
df_mbt['wins'] = df_mbt['wins'].astype(float)
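We can check the 50% reference point directly: the two rows of a match have wins values that sum to 1 (1 + 0 for a decided match, 0.5 + 0.5 for a draw), so the overall mean must be 0.5.
# overall weighted win rate, should be exactly 0.5
print(df_mbt['wins'].mean())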
At this point, let’s have a look at our dataset again:
df_mbt.query("coach_id == 255851").sort_values('match_date')
 | match_id | match_date | week_number | year | week_year | week_date | team_id | coach_id | race_name | team_value | wins | tv_diff | tv_match | tv_bin | mirror_match | coach_CR | CR_diff | has_sp |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
99692 | 4248074 | 2020-11-13 | 46 | 2020 | 2020-46 | 2020-11-16 | 1003452.0 | 255851.0 | Human | 1000 | 0.0 | 0 | 1000 | 1.1K | 0 | 150.00 | 0.80 | 0.0 |
77673 | 4270093 | 2021-01-17 | 2 | 2021 | 2021-2 | 2021-01-11 | 1003452.0 | 255851.0 | Human | 1000 | 0.0 | 40 | 1000 | 1.1K | 1 | 149.22 | 3.63 | 0.0 |
32991 | 4314794 | 2021-07-01 | 26 | 2021 | 2021-26 | 2021-06-28 | 1003452.0 | 255851.0 | Human | 980 | 0.0 | 10 | 990 | 1.1K | 0 | 147.60 | 1.37 | 0.0 |
31599 | 4316193 | 2021-07-10 | 27 | 2021 | 2021-27 | 2021-07-05 | 1035835.0 | 255851.0 | Human | 980 | 0.0 | 20 | 1000 | 1.1K | 0 | 146.62 | 9.49 | 0.0 |
25968 | 4321824 | 2021-08-16 | 33 | 2021 | 2021-33 | 2021-08-16 | 1038960.0 | 255851.0 | Human | 970 | 0.5 | 30 | 1000 | 1.1K | 1 | 145.93 | 9.30 | 0.0 |
23609 | 4324190 | 2021-09-01 | 35 | 2021 | 2021-35 | 2021-08-30 | 1038960.0 | 255851.0 | Human | 1000 | 0.0 | 10 | 1000 | 1.1K | 0 | 145.90 | 10.19 | 0.0 |
21018 | 4326781 | 2021-09-13 | 37 | 2021 | 2021-37 | 2021-09-13 | 1038960.0 | 255851.0 | Human | 1010 | 1.0 | 80 | 1010 | 1.1K | 0 | 144.24 | 6.02 | 0.0 |
12941 | 4334858 | 2021-10-11 | 41 | 2021 | 2021-41 | 2021-10-11 | 1038960.0 | 255851.0 | Human | 1110 | 0.0 | 150 | 1260 | 1.4K | 0 | 145.22 | 16.80 | 0.0 |
10050 | 4337749 | 2021-10-22 | 42 | 2021 | 2021-42 | 2021-10-18 | 1050267.0 | 255851.0 | Orc | 1080 | 0.0 | 10 | 1090 | 1.1K | 0 | NaN | NaN | 0.0 |
9027 | 4338772 | 2021-10-26 | 43 | 2021 | 2021-43 | 2021-10-25 | 1050267.0 | 255851.0 | Orc | 1040 | 1.0 | 50 | 1040 | 1.1K | 0 | NaN | NaN | 0.0 |
3783 | 4344017 | 2021-11-15 | 46 | 2021 | 2021-46 | 2021-11-15 | 1050267.0 | 255851.0 | Orc | 1150 | 0.0 | 170 | 1150 | 1.1K | 0 | NaN | NaN | 0.0 |
1186 | 4346614 | 2021-11-25 | 47 | 2021 | 2021-47 | 2021-11-22 | 1050267.0 | 255851.0 | Orc | 1020 | 0.5 | 150 | 1170 | 1.1K | 0 | NaN | NaN | 0.0 |
Great! Almost there. There is still something missing though: we need to know, for all the teams in our matches dataset, in which division or league they are playing, and which version of the rules they use. For this we turn to the API again, to fetch more data, now at the team level.
Step 4: Fetch data on team division and ruleset
Let’s grab, for all teams in df_mbt, the team division and ruleset.
A limitation of the FUMBBL API is that it only shows the latest version of the teams and leagues data. This hides the fact that leagues may have changed their rules since they were first created. For example, the NAF used the BB2016 rules up until the summer of 2021, and thereafter switched to the new BB2020 ruleset for their latest online tournament. So we have to use our “domain knowledge” here to interpret the data properly.
# make list of all teams that need to be fetched
team_ids = list(df_mbt['team_id'].dropna())
# get unique values by converting to a Python set and back to list
team_ids = list(set(team_ids))
len(team_ids)
46831
So we have to fetch data for 47K different teams. We use the same approach as above, looping over all team_ids and making a separate API call for each team.
IMPORTANT: here too, we limit ourselves to a maximum of 3 API calls per second to avoid overloading the FUMBBL server.
df_teams = pd.DataFrame(columns=['team_id', 'division_id', 'division_name', 'league',
                                 'ruleset', 'roster_id', 'race_name', 'games_played'])
target = 'data/df_teams_' + time.strftime("%Y%m%d_%H%M%S") + '.h5'
print(target)
fullrun = 0
if fullrun:
    print('fetching team data for ', len(team_ids), ' teams')
    for t in range(len(team_ids)):
        api_string = "https://fumbbl.com/api/team/get/" + str(int(team_ids[t]))
        # wait 0.33 s on average between each API call
        wait_time = (random.uniform(0.5, 1) + 0.25)/3
        time.sleep(wait_time)
        team = requests.get(api_string)
        team = team.json()
        # grab fields
        team_id = team['id']
        division_id = team['divisionId']
        division_name = team['division']
        ruleset = team['ruleset']
        league = team['league']
        roster_id = team['roster']['id']
        race_name = team['roster']['name']
        games_played = team['record']['games']
        # add to dataframe
        df_teams.loc[t] = [team_id, division_id, division_name, league, ruleset, roster_id, race_name, games_played]
        if t % 100 == 0:
            # write tmp data as hdf5 file
            print(t, end='')
            print(".", end='')
            df_teams.to_hdf(target, key='df_teams', mode='w')
    df_teams.to_hdf(target, key='df_teams', mode='w')
else:
    # read from hdf5 file
    df_teams = pd.read_hdf('data/df_teams_20211215_211746.h5')
df_teams['roster_name'] = df_teams['roster_id'].astype(str) + '_' + df_teams['race_name']
df_teams.shape
data/df_teams_20211230_192715.h5
(46831, 9)
Dataprep: Add ruleset_version and division_name
FUMBBL allows coaches to create their own rulesets to play their own leagues and tournaments with. For example, there is a so-called “Secret League” where coaches can play with “Ninja halflings”, “Ethereal” spirits etc. Instead of plain normal regular “Halflings” and “Shambling Undead” :-)
Since we want the team strength for the official rulesets BB2016 and BB2020, we need to distinguish those matches from the matches that are played under different rules.
Let’s have a look at the various divisions and leagues, which rulesets they use, and which races are played how often. There are a lot of small leagues being played on FUMBBL. We only look at divisions and leagues with a sufficient volume of matches, as otherwise we do not have sufficient statistics for each race.
So I aggregated the data by division, league and ruleset, and filtered on at least 150 different teams that have played at least once last year.
Apart from the main “Divisions” that are part of FUMBBL, there were a few user-run leagues present in this table, so I looked up their names on FUMBBL and what ruleset is used (BB2016, BB2020 or some other variant). This information (contained in an xlsx) is added to the dataset below.
# add ruleset_version and division_name from xlsx
ruleset_division_names = pd.read_excel('data/ruleset_division_names.xlsx', engine='openpyxl')
df_teams = pd.merge(df_teams, ruleset_division_names, on=['league', 'ruleset', 'division_id'], how='left')
df_teams['division_name'] = df_teams['new_division_name']
df_teams = df_teams.drop('new_division_name', axis=1)
df_teams['division_id'] = pd.to_numeric(df_teams.division_id)
df_teams['roster_id'] = pd.to_numeric(df_teams.roster_id)
df_teams['team_id'] = pd.to_numeric(df_teams.team_id)
df_teams['games_played'] = pd.to_numeric(df_teams.games_played)
df_teams['league'] = pd.to_numeric(df_teams.league)
df_teams['ruleset'] = pd.to_numeric(df_teams.ruleset)
(df_teams
.groupby(['ruleset', 'league', 'division_id', 'division_name', 'ruleset_version'], dropna=False)
.agg( n_teams = ('ruleset', 'count')
)
.sort_values('n_teams', ascending = False)
.query('n_teams > 150')['n_teams']
.reset_index()
)
 | ruleset | league | division_id | division_name | ruleset_version | n_teams |
---|---|---|---|---|---|---
0 | 1 | 0 | 10 | Blackbox | bb2016 | 10439 |
1 | 1 | 0 | 1 | Ranked | bb2016 | 7944 |
2 | 4 | 0 | 2 | Competitive | bb2020 | 6532 |
3 | 6 | 0 | 5 | Regular_league | bb2016 | 5942 |
4 | 303 | 10263 | 5 | Secret League | bb2016 | 2554 |
5 | 2228 | 9298 | 5 | NAF | mixed | 1466 |
6 | 2 | 0 | 3 | Stunty Leeg | bb2016 | 607 |
7 | 2198 | 14708 | 5 | SL BB2020 | bb2020 | 452 |
8 | 888 | 11676 | 5 | LegaGladio | mixed | 345 |
9 | 4 | 14713 | 5 | Test Open League BB2020 | bb2020 | 296 |
10 | 1049 | 12026 | 5 | NAF 7s | bb2016 | 196 |
11 | 432 | 10455 | 5 | CIBBL | bb2016 | 189 |
Dataprep: Merging the match data with the team data
For each match in the df_mbt DataFrame we can now add the team-level information from df_teams. As both datasets contain race_name, we drop one of them.
df_mbt = pd.merge(df_mbt, df_teams.drop('race_name', axis=1), on='team_id', how='left')
df_matches = pd.merge(df_matches, df_teams.drop(['race_name', 'roster_id', 'roster_name', 'games_played'], axis=1), left_on='team1_id', right_on='team_id', how='left')
df_matches['team1_id'] = pd.to_numeric(df_matches.team1_id)
df_matches = df_matches.drop('team_id', axis=1)
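With ruleset_version in place, restricting an analysis to the official rulesets becomes a one-liner, using the labels from the table above:
# example: keep only the team-match rows played under an official ruleset
df_mbt_official = df_mbt.query("ruleset_version in ['bb2016', 'bb2020']")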
Step 5: adding team tiers
According to this article from the NAF from 2017, efforts have been made since 2010 to balance things out a bit between the different team strengths. For example, the weaker teams get more gold to spend on players, or get more so-called “star player points” to spend on skilling up players. According to the NAF, team tiering traditionally consists of three groups, with Tier 1 being the strongest teams and Tier 3 the weakest. The GW BB2020 rule book also contains three tier groups, which are similar to the NAF tiers except for Humans and Old World Alliance. And in November 2021, Games Workshop published an update of the three tier groups, with High Elves moving from Tier 2 to Tier 1, and Old World Alliance moving back to Tier 2.
This information is most naturally added to the df_mbt dataframe, as the tier can differ for each team in a match.
race_tiers = pd.read_excel('data/race_tiers_mapping.xlsx', engine='openpyxl')
race_tiers = race_tiers[ ['race_name', 'bb2020_tier', 'naf_tier', 'bb2020_nov21_tier']]
race_tiers = race_tiers.dropna()
# add bb2020 tiers
df_mbt = pd.merge(df_mbt, race_tiers, on='race_name', how='left')
Save all prepped datasets as HDF5 files
target = 'data/df_inducements_final.h5'
df_inducements.to_hdf(target, key='df_inducements', mode='w', format = 't', complevel = 9)
target = 'data/inducements_final.h5'
inducements.to_hdf(target, key='inducements', mode='w', format = 't', complevel = 9)
target = 'data/df_matches_final.h5'
df_matches.to_hdf(target, key='df_matches', mode='w', format = 't', complevel = 9)
target = 'data/df_mbt_final.h5'
df_mbt.to_hdf(target, key='df_mbt', mode='w', format = 't', complevel = 9)
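Anyone who downloads these files can read them back with a single call, for example:
# load the matches-by-team dataset, ready for analysis
df_mbt = pd.read_hdf('data/df_mbt_final.h5', key='df_mbt')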
Choosing a license for the public dataset
An important part of making data publicly available is being explicit about what people are allowed to do with the dataset. Before we get to that, however, we have to check whether we are actually allowed to publish the data at all. This is explained nicely in a blogpost by Elizabeth Wickes.
Since our data comes from the FUMBBL.com website, we check the privacy policy that all users, including myself, agreed to when signing up. It contains the following parts, which cover the unauthenticated API that we use to fetch the data, as well as the additional public match data we scrape, such as the inducements used in a match and the coach rankings of the playing coaches at the time the match was played.
Content you provide through the website
All the information you provide through the website is processed by FUMBBL. This includes things such as forum posts, private message posts, blog entries, team and player names and biographies and news comments. Data provided this way is visible by other people on the website and in most cases public even to individuals without accounts (not including private messages), and as such are considered of public interest. If direct personal information is posted in public view, you can contact moderators to resolve this. Match records are also considered content in this context, and is also considered of public interest. This data is collected as the primary purpose of the website and it is of course entirely up to you how much of this is provided to FUMBBL.
Third party sharing
Some of the public data is available through a public (*i.e. unauthenticated*) API, which shares some of the information provided by FUMBBL users in a way suitable for third-party websites and services to process.
The data available through the unauthenticated API is considered non-personal as it only reflects information that is public by its nature on the website. The authenticated API will only show information connected to the authenticated account.
I conclude that since the match data is already considered public content, there is no harm in collecting this public data in a structured dataset and placing it in a public repository. The final step is then to decide what others are allowed to do with this data. In practice, this means choosing a license under which to release the dataset. I decided on a CC0 license: this places the data in the public domain, and people can use the dataset as they wish. Citing or mentioning the source of the data would of course still be appreciated.