Raygun: A Bayesian Analysis
The 2024 Paris Olympics saw the debut of a new Olympic sport: breakdancing. The sport made headlines after Australian competitor Raygun performed an unusual routine that was ridiculed by many internet commentators. Is Raygun a misguided academic of the sort I’ve written about before? Or was she making a profound statement about the cultural politics of breakdancing? I’m not qualified to say, since I know nothing about breakdancing. But thankfully, there are people who are qualified to say, namely the Olympic judges. The IOC, perhaps wary of previous scandals involving judges, has posted all their scores online, so I’ve scraped them for further analysis.

Was Raygun really that much worse than the rest of the field? Could she have won a fight against one of the other competitors on a good day? Was the judging fair? Read on to find out!

Scoring system

For future reference, it is important to understand how the scoring worked. The competition was decided by a series of one-on-one battles, each lasting two rounds in the heats and three rounds from the quarterfinals onwards. Each round was scored by nine judges across five categories: Technique, Vocabulary, Originality, Execution, and Musicality. For each category, a judge gave a single number between -20 and 20: a positive number meant the first dancer (red) was better, and a negative number meant the second dancer (blue) was better. Adding a judge's five numbers gives a total between -100 and 100. Each judge with a positive total awarded one point to the red contestant, and each judge with a negative total awarded one point to the blue contestant. The contestant with more points won the round, and the contestant who won more rounds won the battle.
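As a small illustration of the arithmetic (the numbers below are made up, not actual competition scores):

set.seed(1)
# One round: 9 judges x 5 categories, each score between -20 and 20
# (positive favours red, negative favours blue).
round_scores <- matrix(sample(-20:20, 9 * 5, replace = TRUE), nrow = 9,
                       dimnames = list(paste0("judge", 1:9),
                                       c("Technique", "Vocabulary", "Originality",
                                         "Execution", "Musicality")))
judge_totals <- rowSums(round_scores)   # each judge's total, between -100 and 100
red_points   <- sum(judge_totals > 0)   # judges whose total favours red
blue_points  <- sum(judge_totals < 0)   # judges whose total favours blue
# The dancer with more judge points wins the round (ties ignored in this toy example).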

Raygun lost all three of her heats 9-0, but we can learn more about her performance by diving deeper into the scores.

The data

Here’s an example of what the data looks like:

The dancers in the pos and neg columns are red and blue respectively. So here all judges preferred 671 overall, although judge A thought Sunny was better in the Vocabulary, Originality and Musicality categories. The NaN values in the table appear to represent zero (i.e. the judge had no preference and gave no score in that category).
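In R, those missing values can simply be replaced by zeros; a minimal sketch, assuming the scraped table has already been read into a data frame called bgirls (the name used in the inference script further down), with one row per judge and round:

cats <- c("Technique", "Vocabulary", "Originality", "Execution", "Musicality")
bgirls[cats][is.na(bgirls[cats])] <- 0   # no preference = a score of zero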

Average score per dancer

We can get a first impression of the dancers’ performance by simply taking the average of their scores in all categories.
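A sketch of that computation (column names as above; each row's total counts for the red dancer and against the blue one):

bgirls$total <- rowSums(bgirls[cats])    # one total per judge and round
per_side <- rbind(data.frame(dancer = bgirls$pos, score =  bgirls$total / 5),
                  data.frame(dancer = bgirls$neg, score = -bgirls$total / 5))
sort(tapply(per_side$score, per_side$dancer, mean), decreasing = TRUE)  # average category score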

However, this is a little unfair: since a negative score for one dancer is also a positive score for her opponent, it inflates the average of a dancer like Nicka, who was up against a low-scoring dancer like Raygun. The resulting ranking does not really reflect how the event went: India and 671, for instance, ended up in the bronze medal match. We will come up with a better way to rank the dancers below.

The judges

The following graph shows how each judge scored the dancers across the 72 rounds of the competition. Remember that each score measures how much better one dancer was than the other, so the totals should average out around zero. The variance of the scores decreased as the competition progressed and the weaker dancers were eliminated.
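A sketch of how such a plot can be drawn from the data (the judge column name is an assumption; total is the per-round sum computed above):

library(ggplot2)
ggplot(bgirls, aes(x = judge, y = total)) +
  geom_boxplot() +
  labs(x = "Judge", y = "Total score per round (red minus blue)")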

It seems that the Ukrainian judge, Intact, had strong opinions about many of the contestants, especially Raygun, who received very low scores from him.

On the other hand, Raygun did receive positive reviews from some judges in certain categories (especially the Originality category).

Analysis

I examined the scores to see whether the categories contributed equally to the overall score. The scoring system seemed fair overall: the first principal component of the scores had a positive weight on each category, indicating that simply adding the scores together is probably a reasonable way to rank the dancers. The method of awarding one point per judge also helps to limit the influence of very large positive or negative scores. Overall, the scoring system seemed fit for purpose.
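The check itself is a small principal component analysis on the five category columns; a sketch, using the names from the example above:

pca <- prcomp(bgirls[cats], scale. = TRUE)
pca$rotation[, 1]   # loadings of the first principal component on each category
summary(pca)        # share of variance explained by each component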

But that doesn’t tell us who the best dancer was. Was the gold medal awarded to the right person? How much worse was Raygun than the others?

One question that particularly interested me was: What if Raygun had fought one of the weakest dancers, like Elmamouny? Could she have won? And if so, with what probability? To answer this question, I turned to Bayesian statistics.

I assumed that dancer $A$ has an unknown skill level $\mu_A$, and that there is an unknown standard deviation $\sigma$, so that the scores in the battle between $A$ and $B$ are draws from a normal distribution \(s_{AB} \sim N(\mu_A - \mu_B, \sigma^2)\), where $s_{AB}$ is the $(9 \times \text{number of rounds}) \times 5$ matrix of scores in the battle between $A$ and $B$. This very simple model assumes that these scores (whether they represent Technique, Originality, Musicality, Vocabulary, or Execution) are all drawn from a normal distribution centered on the difference between the dancers' skill levels.

The normality assumption doesn't really hold when you include Intact, because of his extreme scores, so I decided to model only the other eight judges.

The following R function simulates a battle.

# Simulate one battle between dancers with skill levels mu1 and mu2
# (a sketch consistent with the model above; 3 rounds and the 8 modelled
# judges are assumed here). A judge's point goes to whoever is on the
# positive side of their total; the function returns the judge points
# per round for both dancers.
simulate_battle <- function(mu1, mu2, sigma, nrounds = 3, njudges = 8, ncat = 5) {
  res <- numeric(0)
  for (r in seq_len(nrounds)) {
    scores <- matrix(rnorm(njudges * ncat, mean = mu1 - mu2, sd = sigma),
                     nrow = njudges)
    judge_totals <- rowSums(scores)
    comp1_score <- sum(judge_totals > 0)
    comp2_score <- sum(judge_totals < 0)
    res <- c(res, comp1_score, comp2_score)
  }
  res
}

The advantage of using the normal distribution is that you can write a Gibbs sampler with explicit formulas for the updates for $\mu$ and $\sigma$.
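For completeness, here is one standard derivation of those updates. Under flat priors on the $\mu_A$ and the improper prior $p(\sigma^2) \propto 1/\sigma^2$ (one convenient choice), the full conditionals are

\[ \sigma^2 \mid \mu, s \sim \text{InvGamma}\left(\frac{n}{2},\ \frac{1}{2}\sum \big(s - (\mu_{\text{red}} - \mu_{\text{blue}})\big)^2\right) \]

\[ \mu_A \mid \mu_{-A}, \sigma, s \sim N\left(\frac{1}{n_A}\left(\sum_{A\ \text{red}} (s + \mu_B) + \sum_{A\ \text{blue}} (\mu_B - s)\right),\ \frac{\sigma^2}{n_A}\right) \]

where $n$ is the total number of scores, $n_A$ is the number of scores from battles involving dancer $A$, and $B$ denotes $A$'s opponent in the battle each score comes from.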

I decided to use the first competitor in alphabetical order, 671, as a reference point, so I assumed that $\mu_{671} =0$.

Inference

The following script downloads the data, temporarily removes the scores from Intact, and calculates how many scores there are for each competitor.

# Sketch of the data-preparation step: the file name is a placeholder for the
# scraped scores, and the column names (pos, neg, judge and the five categories)
# follow the example shown earlier.
bgirls <- read.csv("breaking_scores.csv", stringsAsFactors = FALSE)
bgirls[is.na(bgirls)] <- 0                           # missing scores mean "no preference"
bgirls_model <- subset(bgirls, judge != "Intact")    # set Intact aside for the model
table(c(bgirls_model$pos, bgirls_model$neg))         # judge-round rows per competitor (x 5 = scores)

The update for $\sigma$ uses a sample from an inverse gamma distribution

# Draw sigma from its full conditional. With the improper prior
# p(sigma^2) ~ 1/sigma^2 (a choice of mine), sigma^2 given everything else is
# inverse gamma with shape n/2 and rate SS/2, where SS is the sum of squared
# residuals s - (mu_red - mu_blue).
sample_sigma <- function(resid) {
  sqrt(1 / rgamma(1, shape = length(resid) / 2, rate = sum(resid^2) / 2))
}

and the update for $\mu$ follows a normal distribution. If you calculate everything out on paper, some adjustments have to be made depending on whether the dancer was on the positive (red) or negative (blue) side of the battle.

# Draw mu_A from its normal full conditional (flat prior on mu_A assumed).
# pos_scores: scores from rounds where A was on the red side, with the
# opponents' current mu in pos_opp_mu (one value per score);
# neg_scores / neg_opp_mu: the same for rounds where A was on the blue side.
sample_mu <- function(pos_scores, pos_opp_mu, neg_scores, neg_opp_mu, sigma) {
  # s = mu_A - mu_B + noise when A is red, s = mu_B - mu_A + noise when A is blue,
  # so each score yields a pseudo-observation of mu_A:
  obs <- c(pos_scores + pos_opp_mu, neg_opp_mu - neg_scores)
  rnorm(1, mean = mean(obs), sd = sigma / sqrt(length(obs)))
}

These updates are then put inside a loop in order to perform the Gibbs sampling. At each step, the result of a simulated Raygun-Elmamouny battle is also recorded.

# Gibbs sampler (a sketch: the data is assumed to be in the wide format shown
# earlier, with columns pos, neg and the five category columns, one row per
# judge and round). At each iteration, every dancer's mu and the common sigma
# are updated, and a Raygun-Elmamouny battle is simulated.
gibbs <- function(data, n_iter = 1000) {
  cats <- c("Technique", "Vocabulary", "Originality", "Execution", "Musicality")
  # Stack the five category columns into one long vector of scores.
  long <- data.frame(pos   = rep(as.character(data$pos), times = length(cats)),
                     neg   = rep(as.character(data$neg), times = length(cats)),
                     score = unlist(data[cats], use.names = FALSE))
  dancers <- sort(unique(c(long$pos, long$neg)))
  mu <- setNames(rep(0, length(dancers)), dancers)
  sigma <- 10
  mus <- matrix(NA, n_iter, length(dancers), dimnames = list(NULL, dancers))
  battles <- matrix(NA, n_iter, 6)
  for (i in seq_len(n_iter)) {
    for (d in setdiff(dancers, "671")) {          # 671 is the reference: mu_671 = 0
      pos <- long[long$pos == d, ]
      neg <- long[long$neg == d, ]
      mu[d] <- sample_mu(pos$score, mu[pos$neg], neg$score, mu[neg$pos], sigma)
    }
    sigma <- sample_sigma(long$score - (mu[long$pos] - mu[long$neg]))
    mus[i, ] <- mu
    battles[i, ] <- simulate_battle(mu["Raygun"], mu["Elmamouny"], sigma)
  }
  list(mu = mus, battles = battles)
}

Finally, here is the code to perform the inference.

set.seed(2024)
G <- gibbs(bgirls_model)   # 1000 iterations with the defaults above

The values of $\mu_A$ for each dancer are plotted in the figure below. These values can be thought of as a skill level calculated from the ratings received, rather than from the knockout system used in the actual competition.
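A sketch of how that figure can be reproduced from the sampler output (object names as in the code above; discarding the first 100 iterations as burn-in is my own choice):

mu_hat <- sort(colMeans(G$mu[-(1:100), ]), decreasing = TRUE)   # posterior mean skill levels
dotchart(mu_hat, xlab = "Estimated skill level (relative to 671)")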

This puts the gold medallist Ami at the top, followed by silver medallist Nicka. The next two competitors, India and 671, battled for the bronze medal, with 671 winning. So these skill ratings do seem to give a fair reflection of the competition.

But wait! If these numbers reflect the outcome of the competition, what’s the point of calculating them at all? We have the results of the competition! Why do we need a skill level for each competitor?

Well, remember that I wanted to see how well Raygun would have performed in a theoretical battle against her closest rival Elmamouny. And we can now calculate that!

battles <- G$battles
# Columns 1-2 hold the judge points of Raygun and Elmamouny in round 1,
# columns 3-4 round 2, columns 5-6 round 3 (the layout returned by
# simulate_battle above). The "+ 1" is the Ukrainian judge's point,
# which we hand to Elmamouny.
win1  <- battles[, 1] > battles[, 2] + 1    # Raygun wins round 1
win2  <- battles[, 3] > battles[, 4] + 1    # Raygun wins round 2
draw1 <- battles[, 1] == battles[, 2] + 1   # round 1 is tied

(We need to remember to add in one point for Elmamouny from the Ukrainian judge.)
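From these indicators one can compute the probabilities quoted below; a minimal sketch, assuming Raygun needs to win a majority of the three rounds to take the battle:

win3 <- battles[, 5] > battles[, 6] + 1   # Raygun wins round 3
mean(win1 + win2 + win3 >= 2)             # probability that Raygun wins the battle
mean(win1 | win2 | win3)                  # probability that she wins at least one round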

Raygun wins 65 of the 1000 simulated battles, a probability of 6.5%. She also has a 9.5% chance of winning at least one round, which is significantly more than zero.

So, although Raygun was by far the worst competitor, perhaps she was not a total write-off. If she had battled against the second-worst, she would have had an outside chance of winning a round.

I am reminded of my own brief career as a professional dancer, which happened while I was looking for my first job after leaving academia. Despite not being very good at dancing, I was somehow selected to appear in a movie (I made the final cut, too! Tommy Lee Jones is in it!). The other dancers wondered whether the film crew had perhaps meant to pick someone else with a similar name instead of me. I said to one of the film crew: "You know, I'm not as good at this as the other dancers." She replied: "Don't worry. If everyone was really good, it wouldn't be realistic."

Perhaps the same could be said of the Olympic breaking final?