Single pass mean and standard deviation questions
Recently I have came across an interview question that is quite interesting which asked for the computation of math operations like mean, standard deviation in a most effective manner.
Question
The original question was something like this, given a list of numbers, design a solution that is able to support adding of new numbers and,
Get the current mean.
Get the standard deviation.
In case you are still wondering, the restrictions for the solution have to be as optimized as possible, in terms of space and computations, that means our solution will not store the original numbers when we add it. I will briefly discuss the solution to part 1 and move to our topic of discuss about 2.
Snippet
We can assume the snippet below is provided for us to work on.
Solutions
Part 1
Initial Design: Brute force
My initial design was to store a list of numbers as I add them and computing the mean is simply looping through all elements and dividing them by them by the len of the list.
class Solution: def __init__(self): self.numbers = [] def add(self, num: float) -> None: self.numbers.append(num) def mean(self) -> float: return sum(self.numbers) / len(self.numbers)
Which is less than ideal and not as optimized as because of the time complexity of getting mean which is o(n), can we do better than that?
Better Design: Cache the sum
Well after giving some thoughts, when we analyze the add function, we realized that we only ever need 2 items to derive the mean, which is the total sum divided by the total numbers we have added. Therefore making some slight tweaks, this is our next attempt. The time complexity is now O(1) where we just have to perform the math operations.
Part 2
The standard deviation formula was provided as such,
We can try our best to simply the above formula
SD^2 = (Σ(x_i - x_mean)^2) / n Rewrite it as SD^2 = (1/n) * Σ(x_i - x_mean)^2 and expand the terms inside the summation SD^2 = (1/n) * Σ(x_i^2 - 2 * x_i * x_mean + x_mean^2) and then we can put the summation inside and simplify SD^2 = (1/n) * (Σ(x_i^2) - Σ(2 * x_i * x_mean) + Σ(x_mean^2)) since, constants (those without reliance of i) in summation are not affected by the variables, we can move it out SD^2 = (1/n) * (Σ(x_i^2) - 2 * x_mean * Σ(x_i) + x_mean^2 * Σ(1)) we can Σ(1) it as n, as we are summing up 1 n times SD^2 = (1/n) * (Σ(x_i^2) - 2 * x_mean * Σ(x_i) + x_mean^2 * n) there is one more trick we can do to simplify, Σ(x_i) is really just n * x_mean, multiplying n with the x_mean will also gives us the full sum, same as the summation of x_i SD^2 = (1/n) * (Σ(x_i^2) - 2 * x_mean * n * x_mean + x_mean^2 * n) We can now introduce the (1/n) into the expression again and simplify further SD^2 = (1/n) * Σ(x_i^2) - x_mean^2 Finally we can get the SD as, notice we do not need to recompute anything, however we do need to compute the sum squared every iteration SD = sqrt((1/n) * Σ(x_i^2) - x_mean^2)
And when we write the math expression as code,
import math class Solution: def __init__(self): self.count = 0 self.sum = 0 # Additional sum squared self.sum_sq = 0 def add(self, num: float) -> None: self.count += 1 self.sum += num # Compute sum squared when adding the number self.sum_sq += num ** 2 def mean(self) -> float: return self.sum / self.count def std_dev(self) -> float: variance = self.sum_sq / self.count - self.mean() ** 2 return math.sqrt(variance)
This question is quite interesting because I did not manage to come up with the solution during the interview, however after some research, I got to understand about this manner of computing standard deviation in one pass.
Thanks to the following page for helping me figure the standard deviation part: