This lecture is based on the problem called “Simple Skewness” (8VC Venture Cup 2016 - Elimination Round, Codeforces; Problem E), and you can find a link to it here.
In this problem we are trying to find a subset which maximizes the difference between the mean and the median! You can find the explanation that this lecture is based on over at Petr’s blog.
We claim that for arrays of even size with non-negative skewness, you can remove the larger of the two middle elements to ensure that you get to an array with an odd number elements whose skewness is no less than in the original array. This can be shown with the following computation, which I skipped in class.
Recall that the simple skewness of a set of numbers is defined as the difference between the mean and the median of .
Claim. Let be a set of integers with non-negative simple-skewness, and let be the upper middle element, i.e, the -th element if the elements of are written in sorted order. We claim that the simple skewness of the set is at least the simple skewness of the original set .
Proof (version 1; from this comment).
Let be the mean, let and be the center elements, and let be the median.
By the assumption that has non-negative skewness, we have that .
Now, after dropping , the median decreases to by , which is .
The mean decreases to , which is a decrease by . Since (given that is even), we have that:
and by the assumption that , we have that:
Thus, the mean decreases by at most as much as the median, as desired.
(To see that this indeed implies the claim, see the slightly more elaborate explanation below.)
Proof (version 2, in the search of a more visual/intuitive approach).
Let’s establish some notation. Using to denote the mean of a set, to denote the median of a set, and to denote the simple skewness of a set, we have:
We want to prove that:
under the assumption that:
Let and denote the -th and -th elements of , respectively, when the elements of are listed in sorted order. Then the assumption above translates to:
Notice that , and since , we know that the mean decreases when we “drop ” from , i.e, .
So rewriting what we want to prove:
in other words, the claim above is equivalent to saying the following:
dropping from decreases the median by at least as much as it decreases the mean, assuming the mean is at least as large as the median,
and showing this will imply the claim.
Let’s work out what these decreases amount to. Let denote the sum of all elements in except for and . First, consider the change of mean:
Simplifying further, we have:
On the other hand, consider the change of median:
Now let us write , which is the sum of all the integers in , in terms of , as follows:
where is the sum of for all such that and is the sum of for all such that .
Observe that by our assumption that:
we have the following:
Let’s now go back to what we wanted to show: the decrease in the median is at least the decrease in the mean, substituting :
But this simplifies to:
Since (recall that is even), we see that the above is equivalent to:
which is indeed true from the assumption that the mean is at least the median, as shown above. 🥳
Remark. Do we really need the assumption that the simple skewness of is non-negative to begin with? It turns out that yes, we do! Without this assumption the statement may not be true, here's a counter example (h/t: Harshil Mittal): and . Then but . I am not sure if a weaker version of the claim holds without the assumption; i.e, does there always exist some element whose removal preserves (i.e, does not decrease) the simple skewness of the original set?
If you have either a visual proof for the claim made or some ideas for this more general question, please join the conversation I’ve opened on Twitter inspired by this situation here.
The code discussed in the video can be found here.