Perturbing a non-symmetric probability distribution by @ellis2013nz
I’ll admit it: I was among the 34% of people who picked the first, wrong answer on this quiz on Mastodon:

The original toot and accompanying responses are available here.

My reasoning (and my only excuse is that I didn’t think about it very much) was simplistic. I imagined a distribution with the same mean and the same median, and the two or three examples that came to mind were all symmetric, like the normal or uniform distributions. And I correctly thought that if you add some random noise to such a distribution, the mean and the median stay the same.

But even when Thomas gave the correct answer (“not necessarily”) and explained that the median can move if there is a small local skew in the original distribution around the mean/median, I struggled to understand why this was the case.

“To keep the mean and the median the same, you need at least local symmetry around the mean, in principle. Otherwise the mean stays the same and the median moves”

Ultimately I decided to think about it like this:

  • First, adding random noise with a mean of zero leaves the mean of the original distribution unchanged – this is elementary probability theory.
  • When you add noise, half of the observations you are perturbing were originally below the median and half were above it, and on average the magnitude of the noise is the same on both sides. But if your original distribution has local skew, then observations on one side of the median have a different chance of being perturbed far enough to “cross” the median (and thus change the median of the resulting distribution). So the median of the new distribution will change, in a direction that depends on the direction and degree of skew near the original mean/median; the quick sketch below illustrates this.
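Here is a minimal self-contained sketch of both points (my own check, not from the original post), using a plainly skewed distribution, the standard exponential:

set.seed(42)
z  <- rexp(1e6)           # positively skewed: mean 1, median log(2) ~ 0.69
zn <- z + rnorm(1e6)      # add symmetric, zero-mean N(0, 1) noise
c(mean(z), mean(zn))      # the mean is (essentially) unchanged
c(median(z), median(zn))  # the median is pulled up towards the mean

Because the exponential density is higher just below its median than just above it, more points get pushed up across the old median than get pushed down across it.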

To understand this, I did some data simulations. I built on Thomas’ suggestion in the Mastodon thread that (with a continuous distribution) “the examples look a little contrived, but nothing fundamental changes. Take a positively skewed distribution, add a small bump far to the left to bring the mean down until it equals the median. X+E will have a median that is higher than the mean.”

First, there is the task of taking a skewed distribution and adding “a little bump on the left side.” I chose to do this with a mixture distribution: 10 parts standard log-normal (i.e. e to the power of an N(0, 1) normal random variable) and 1 part normal, with parameters chosen such that the mean and median of the mixture are identical. Now, there might be a way to calculate the correct parameters analytically, but it is much easier (for me) to find them numerically, so I created a function to generate the mixture given a set of parameters and used the optim() function for Nelder-Mead general-purpose numerical optimization:

library(tidyverse)

# ------------------perturbing a skewed continuous distribution---------------
# Make a mixture distribution, 1 part normal and 10 parts standard log-normal,
# and return the absolute difference between its median and mean
mixture <- function(par, n = 1e5, seed = 123){
  set.seed(seed)  # fixed seed so the objective is deterministic for optim()
  x <- c(exp(rnorm(n * 10)), rnorm(n, mean = par[1], sd = par[2]))
  abs(mean(x) - median(x))
}

best <- optim(par = c(-6, 7), fn = mixture)  # starting values are a guess

This gives us:

> best
$par
[1] -6.421733  6.757088

$value
[1] 3.841277e-09

We can now generate our data from that – in this case 100,000 observations, to be sure – and we see that the mean and median really are very close:

# Generate 100,000 observations from the mixture (10:1 parts; reconstruction)
n <- 1e5
set.seed(123)
x <- c(exp(rnorm(round(n * 10 / 11))),
       rnorm(round(n / 11), best$par[1], best$par[2]))

> c(mean(x), median(x))
[1] 0.9121877 0.9124763

All very well. Now I just want to add some random noise – say a standard N(0, 1) normal distribution – to each observation…

# perturb it a little with standard normal noise
y <- x + rnorm(length(x))

…and look what it does to the mean and median:

> # now the median has shifted but mean has stayed the same:
> c(mean(y), median(y))
[1] 0.9128076 1.0989388
> # compared to original:
> c(mean(x), median(x))
[1] 0.9121877 0.9124763

OK, as predicted. Can a visualization help? Here’s one showing the original distribution and, superimposed, the version perturbed with white noise. In this version, the horizontal axis has been given a modulus transformation (one of the very first things I blogged about – a great way to visualize data that feels like it needs a logarithm or Box-Cox transformation, but which inconveniently contains negative values). This transformation is good for seeing the “bump” on the left, present in both the original and perturbed distributions:
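For reference, here is a quick sketch of the transformation itself; this is my own definition, following the modulus transformation of John and Draper as implemented by scales::modulus_trans() with its default offset of 1 (with p = 0 it behaves like a signed log):

# modulus transform: log-like, but defined for zero and negative values too
modulus <- function(x, p = 0) {
  if (p == 0) {
    sign(x) * log(abs(x) + 1)
  } else {
    sign(x) * ((abs(x) + 1) ^ p - 1) / p
  }
}

modulus(c(-40, -5, 0, 5, 40))  # symmetric around zero, compresses the tails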

The modulus transformation makes it harder to understand the skewness, though, so here’s the same graph with an untransformed x-axis. This time we can see the original skewness around the original mean and median, and perhaps that helps us understand what’s going on.

It’s helpful for me, at least, to think of it this way in the original scale. Imagine adding random noise to the original distribution: many of the points just to the left of the original median get pushed over it (pulling the median up), while at least some of the points to the right of the median aren’t perturbed far enough to the left to cross it and pull it back down.
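We can count those crossings directly in the simulated data (my own quick check, not in the original post):

# how many points crossed the old median in each direction?
m <- median(x)
c(crossed_up   = sum(x < m & y > m),  # pushed rightwards over the old median
  crossed_down = sum(x > m & y < m))  # pushed leftwards over it

More points cross upwards than downwards, which is exactly why the median rises while the mean stays put.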

Here is the code for drawing these graphs:

set.seed(124)
p3 <- tibble(original = x, perturbed = y) |>  # reconstructing the truncated line
  sample_n(10000) |>
  gather(variable, value) |>
  ggplot(aes(x = value, colour = variable, fill = variable)) +
  #geom_rug() +
  geom_density(alpha = 0.5) +
  scale_x_continuous(transform = scales::modulus_trans(p=0),
                     breaks = c(-40, -20, -10, -5, -2, -1,  0, 1, 2, 5, 10, 20, 40)) +
  geom_vline(xintercept = mean(x), colour = "red") +
  geom_vline(xintercept = median(y), colour = "steelblue")  +
  annotate("text", x = 1.3, y = 0.6, label = "Post-jitter median", hjust = 0,
           colour = "steelblue") +
  annotate("text", x = 0.86, y = 0.6, 
           label = "Original equal mean and median, also post-jitter mean", 
           hjust = 1, colour = "red") +
  labs(x = "Value (modulus transformed scale)", y = "Density", colour = "", fill = "",
       title = "Adding jitter to a mixture of skewed and symmetrical distributions",
       subtitle = "The mean stays the same with the jitter, but the median moves if the distribution wasn't symmetrical around the original mean.",
       caption = "Based on an idea in a toot by Thomas Lumley") 

# with transform:
print(p3)       

# without transform (a fresh default x scale overrides the modulus one):
p4 <- p3 + scale_x_continuous() + labs(x = "Value")
print(p4)

That’s all folks. What to do with those pesky distributions? While the simulated example above, with its exactly matching mean and median, is clearly contrived, a distribution that mixes a skewed log-normal with a chunk of something far to the left is not uncommon in some areas of economics, such as corporate profits or individual incomes: mostly shaped like a strictly positive log-normal, but with a small subset making losses, and definitely a mix of two distributions rather than a single easily described mathematical function.
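As a final, purely illustrative sketch (my own, with made-up parameters), such a profits-like mixture might be simulated like this:

# hypothetical firm profits: mostly log-normal gains, a minority of losses
set.seed(321)
profits <- c(rlnorm(9000, meanlog = 12, sdlog = 1),   # 90% of firms profitable
             -rlnorm(1000, meanlog = 11, sdlog = 1))  # 10% making losses
c(mean = mean(profits), median = median(profits))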