Single pass mean and standard deviation questions

zhgoh

21 Apr 2022

Recently I have came across an interview question that is quite interesting which asked for the computation of math operations like mean, standard deviation in a most effective manner.

Question

The original question was something like this, given a list of numbers, design a solution that is able to support adding of new numbers and,

Get the current mean.
Get the standard deviation.

In case you are still wondering, the restrictions for the solution have to be as optimized as possible, in terms of space and computations, that means our solution will not store the original numbers when we add it. I will briefly discuss the solution to part 1 and move to our topic of discuss about 2.

Snippet

We can assume the snippet below is provided for us to work on.

class Solution:
  def __init__(self):
    pass

  def add(self, num: float) -> None:
    pass

  def mean(self) -> float:
    pass

  def std_dev(self) -> float:
    pass

Solutions

Part 1

Initial Design: Brute force

My initial design was to store a list of numbers as I add them and computing the mean is simply looping through all elements and dividing them by them by the len of the list.

class Solution:
  def __init__(self):
    self.numbers = []

  def add(self, num: float) -> None:
    self.numbers.append(num)

  def mean(self) -> float:
    return sum(self.numbers) / len(self.numbers)

Which is less than ideal and not as optimized as because of the time complexity of getting mean which is o(n), can we do better than that?

Better Design: Cache the sum

Well after giving some thoughts, when we analyze the add function, we realized that we only ever need 2 items to derive the mean, which is the total sum divided by the total numbers we have added. Therefore making some slight tweaks, this is our next attempt. The time complexity is now O(1) where we just have to perform the math operations.

class Solution:
  def __init__(self):
    self.count = 0
    self.sum = 0

  def add(self, num: float) -> None:
    self.count += 1
    self.sum += num

  def mean(self) -> float:
    return self.sum / self.count

Part 2

The standard deviation formula was provided as such,

SD = sqrt((Σ(x_i - x_mean)^2) / n)
SD^2 = (Σ(x_i - x_mean)^2) / n

We can try our best to simply the above formula

SD^2 = (Σ(x_i - x_mean)^2) / n

Rewrite it as
SD^2 = (1/n) * Σ(x_i - x_mean)^2

and expand the terms inside the summation
SD^2 = (1/n) * Σ(x_i^2 - 2 * x_i * x_mean + x_mean^2)

and then we can put the summation inside and simplify
SD^2 = (1/n) * (Σ(x_i^2) - Σ(2 * x_i * x_mean) + Σ(x_mean^2))

since, constants (those without reliance of i) in summation are not affected by the variables, we can move it out
SD^2 = (1/n) * (Σ(x_i^2) - 2 * x_mean * Σ(x_i)  + x_mean^2 * Σ(1))

we can Σ(1) it as n, as we are summing up 1 n times
SD^2 = (1/n) * (Σ(x_i^2) - 2 * x_mean * Σ(x_i)  + x_mean^2 * n)

there is one more trick we can do to simplify, Σ(x_i) is really just n * x_mean,
multiplying n with the x_mean will also gives us the full sum, same as the summation of x_i
SD^2 = (1/n) * (Σ(x_i^2) - 2 * x_mean * n * x_mean  + x_mean^2 * n)

We can now introduce the (1/n) into the expression again and simplify further
SD^2 = (1/n) * Σ(x_i^2) - x_mean^2

Finally we can get the SD as, notice we do not need to recompute anything, however we do need to compute the sum squared
every iteration
SD = sqrt((1/n) * Σ(x_i^2) - x_mean^2)

And when we write the math expression as code,

import math
class Solution:
  def __init__(self):
    self.count = 0
    self.sum = 0

    # Additional sum squared
    self.sum_sq = 0

  def add(self, num: float) -> None:
    self.count += 1
    self.sum += num

    # Compute sum squared when adding the number
    self.sum_sq += num ** 2

  def mean(self) -> float:
    return self.sum / self.count

  def std_dev(self) -> float:
    variance = self.sum_sq / self.count - self.mean() ** 2
    return math.sqrt(variance)

This question is quite interesting because I did not manage to come up with the solution during the interview, however after some research, I got to understand about this manner of computing standard deviation in one pass.

Thanks to the following page for helping me figure the standard deviation part: