The power of selection

$\newcommand{\Var}{\mathrm{Var}}$ $\newcommand{\second}{2\text{nd}}$ $\newcommand{\kth}{k\text{-th}}$ $\newcommand{\R}{\mathbb{R}}$ $\newcommand{\Ltwo}[1]{\|#1\|_2}$ $\newcommand{\tightlist}{\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}$

If you put in work to select additive components of some random variable, how far out can you get in the distribution of that variable?

This post will focus on normally distributed variables, which is handy since the sum of many individually small random variables is roughly normally distributed by the Central Limit Theorem.

(Note: In places this post is long-ish and discursive (and explains an error I made) because it's trying to get a mathematical understanding of selection that can inform mathematical intuitions about more complicated kinds of selection. If you just want a summary of the numerical situation, look at the tables and graphs.)

Code for the tables and diagrams is in this GitHub repository.

Thanks to Sam Eisenstat for many conversations and ideas about the math here.

Table of Contents:

1. Simple selection
2. Selecting one component among others
3. Selecting all of many components
4. Selecting the top two
5. Selecting the top two of each of many components
6. Selecting for multiple scores
7. The limits of selection

1. Simple selection

Normal distribution

Say we have a real-valued random variable $X$ that's normally distributed. That is, the probability density function (PDF) is:

$$\phi_{\sigma, \mu}(x) = {\frac {1}{\sigma {\sqrt {2\pi }}}} e^{-{\dfrac {1}{2}}\left({\dfrac {x-\mu }{\sigma }}\right)^{\textstyle 2}}$$

This looks like (from Wikipedia):

We may as well assume that the mean $\mu$ is 0 and the standard deviation $\sigma$ is 1, which is equivalent to measuring the outcomes in units of $\sigma$. Then we have:

$$\phi(x) = {\frac {1}{{\sqrt {2\pi }}}} e^{-{\dfrac {x^2}{2}}}$$

This is the "standard normal distribution".

Now, say we sample $X$ a bunch, and pick the highest value. What value do we get?

Chance of at least...

To ask the question another way, what's the chance that we get something that's at least $c$ if we take $n$ samples?

We can read off the diagram above that the chance of a sample being greater than 1 is about 0.15. For the maximum of $n$ samples $\{X_i\}_{i \in [n]}$ to be greater than 1, that's the same thing as at least one of the samples being greater than 1. What's the chance of that?

We can invert: the chance of at least one sample greater than 1 is the complement of the chance that all samples are less than 1, i.e. $1-P(\bigwedge_{i=1}^n X_i < 1)$. By independence, that's just $1-(1-0.15)^n = 1-0.85^n$. So e.g. after 10 samples we have about a 0.8 chance of getting one greater than 1.

So for a given threshold $c$ and a number of samples $n$, the chance we get something at least $c$ is roughly $1-(1-P(X > c))^n$. But what do we actually get if we take $n$ samples?
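Here's a minimal sketch of this calculation in Python (not the post's repository code); `norm.sf(c)` is scipy's survival function, i.e. $P(X > c)$:

```python
# Minimal sketch: chance that the best of n standard-normal samples exceeds c.
from scipy.stats import norm

def prob_max_exceeds(c, n):
    """P(max of n standard-normal samples > c) = 1 - (1 - P(X > c))^n."""
    return 1 - (1 - norm.sf(c)) ** n  # norm.sf(c) = P(X > c)

print(prob_max_exceeds(1, 10))   # ~0.82 (the rough 0.85^10 estimate above gives ~0.80)
print(prob_max_exceeds(3, 100))  # ~0.13: a 3-sigma result is still unlikely with 100 samples
```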

Rarity of 1 in $n$

If we sample $n$ times independently, the expected number of samples greater than $c$ is just $n$ times the expected number with one sample, which is $P(X > c)$. So if $P(X > c) \approx 1/n$, the expected number is $nP(X > c) \approx 1$. That doesn't necessarily mean the probability is high, since it could be that the expectation of the number of big enough samples is high because there's a low chance of a high number, rather than a high chance of one or two samples being big enough. But the higher-order terms, the probabilities of getting 2 or 3 or more samples bigger than $c$, are not too big. And anyway, now we have a reasonable guess: $n$ samples gives you something that's at least as big as something that you're $1/n$ likely to get in one sample.

When is the expression $1-(1-P(X > c))^n$ reasonably big? When $(1-P(X > c))^n$ is small. If $c$ is small, this expression is small, i.e. with $n$ samples we easily surpass a small threshold. When do we stop having a good chance of surpassing $c$?

When $P(X > c) \approx 1/n$, the expression is $(1-1/n)^n$. As $n \to \infty$, this expression limits to $1/e$ (which can be seen by noting that the leading terms in the binomial expansion match up with the Taylor series for $e^z$ evaluated at $-1$). Therefore if we take $n$ samples, we have a probability roughly $1-1/e$ of getting at least one sample that's big enough to be rare at the $1/n$ level.

This kind of makes sense: if you take a million samples, the biggest one you get is something that's roughly one-in-a-million big.

How big is that though? And how many samples do you have to take to get something at least $c$ big?

Cumulative distribution function

If you take a million samples, the biggest one you get is something that's roughly one-in-a-million big. So what's something that's one-in-a-million big? In other words, for what $c$ does $P(X > c) = 1/1000000$?

We know the probability density function $\phi$ of $X$. What that means is, we know the probability of $X$ being in some interval by integrating $\phi$ over that interval:

$$P(X > c) = \int_{c}^{\infty} \phi(x) dx$$

It turns out we can't express this integral simply. So we just give it a name. By convention, we name the function that asks how likely it is for $X$ to be less than some number:

$$P(X < c) = \int_{-\infty}^{c} \phi(x) dx = \Phi(c)$$

This function $\Phi$ is the cumulative distribution function (CDF) of $\phi$. (Note: $\Phi$ is sometimes expressed in terms of the error function, but Python provides us with $\Phi$ and $\Phi^{-1}$ as well as $\text{erf}$, so I don't know of a strong reason to use $\text{erf}$.)
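Concretely, scipy exposes $\Phi$ and $\Phi^{-1}$ directly (a minimal sketch, not the post's repository code):

```python
# Minimal sketch: the standard normal CDF and its inverse, as exposed by scipy.
from scipy.stats import norm

Phi = norm.cdf       # Phi(c) = P(X < c)
Phi_inv = norm.ppf   # Phi^{-1}: the quantile function

print(Phi(0.0))            # 0.5
print(Phi_inv(1 - 1e-6))   # ~4.75: the one-in-a-million threshold
```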

Samples to standard deviations

For what $c$ does $P(X > c) = 1/1000000$? That is, for what $c$ does $P(X < c) = 999999/1000000$? In general, given $n$, what value of $c$ is rare at the $1/n$ level? That's just:

$$\Phi^{-1}\left(\dfrac{n-1}{n}\right)$$

Here's a graph of this function:

This says for example that what's rare at the one-in-a-million level is about 4.7 standard deviations above the mean. Reading the other way, the rarity of 4 standard deviations above the mean is about one in 30,000. Some values, where "e" means "10^" (my computer exploded after 8 standard deviations because the CDF incorrectly gives 1.0):

The polynomial shown is a rough estimate for mental calculation; it will break down outside this range, though.

Order of magnitude rarity to standard deviations:

So for example something that's 1 in 1000 rare is 3.1 standard deviations above the mean.
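A sketch of how values like these can be computed; using `norm.isf` (the inverse survival function) works directly with the tail probability, which sidesteps the precision problem where the CDF numerically rounds to 1.0 far out in the tail:

```python
# Minimal sketch: standard deviations corresponding to one-in-10^k rarity.
from scipy.stats import norm

for k in range(1, 16):
    print(f"1 in 1e{k}: {norm.isf(10.0**-k):.2f} SDs above the mean")
```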

If you take $n$ samples, the biggest one you get is something that's roughly one-in-$n$ big, which is $\Phi^{-1}\left(\frac{n-1}{n}\right)$, so these values also tell us roughly what to expect when taking the maximum value out of a sample of values.

The rest of this section will go into more depth in a way that will help later with estimating second-largest samples, and that will fill out the above conclusions.

Maximum sample out of $n$

Above we gave a heuristic argument for estimating the maximum of $n$ samples. What, more precisely, is the distribution of the maximum of $n$ samples? What are the CDF and PDF of the distribution of $\max_{i \in [n]} X_i$? Can we justify our estimates?

So what's the probability that $\max_{i \in [n]} X_i < c$? It's the probability that all the $X_i$ are less than $c$, and since they're independent (and all have the same distribution, $\phi$, and CDF $\Phi$), that's just:

$$P\left(\max_{i \in [n]} X_i < c\right) = \prod_{i \in [n]} P(X_i<c) = \Phi(c)^n $$

This gives us the CDF of the maximum: $\Phi_{n,\max}(x) = \Phi(x)^n$. So we can get the PDF $\phi_{n,\max}$ by taking a derivative:

$$\phi_{n,\max}(x) = \dfrac{d}{dx} \Phi_{n,\max}(x) = \dfrac{d}{dx} \Phi(x)^n = n \Phi(x)^{n-1} \phi(x) $$

where $\phi(x) = \dfrac{d}{dx}\Phi(x)$ is the PDF of the normal distribution. Dotted lines show CDFs, solid lines show PDFs:

These distributions have substantial variance. It would be reasonable to make point estimates by taking the expected value of our max sample, $E(\max X_i) = \int_{-\infty}^{\infty} x \phi_{n,\max}(x) dx$. I don't know how to compute that in closed form, so instead we can take the median. That's just the value of $x$ such that the CDF satisfies $\Phi_{n,\max}(x) = 1/2$. So it's:

$$\Phi_{n, \max}^{-1}(1/2) = \Phi^{-1}\left((1/2)^{1/n}\right)$$

This looks like:

Is this a reasonable estimate? We can compare to our previous estimate and to samples:

Pretty close. The median is slightly higher, and matches up well with the sample median and mean. The samples give a sense of the distribution (each translucent dot over $x$ is the maximum from a separate set of $x$ samples from the normal distribution). Zooming in for low sample sizes, on a linear scale, including a numerical estimate of $E(\max X_i)$:
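Here's a sketch of how the median approximation can be checked against simulation (not the post's plotting code):

```python
# Minimal sketch: the median of the max of n standard-normal samples,
# Phi^{-1}((1/2)^(1/n)), compared against a Monte Carlo estimate.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def median_of_max(n):
    return norm.ppf(0.5 ** (1.0 / n))

def monte_carlo_max(n, trials=5_000):
    samples = rng.standard_normal((trials, n))
    return samples.max(axis=1)

for n in (10, 100, 1000):
    mc = monte_carlo_max(n)
    print(n, median_of_max(n), np.median(mc), mc.mean())
```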

2. Selecting one component among others

Say we have (normally distributed) random variables $X$ and $Y$. If we can control $X$ by selecting from $n$ samples, but we can't control $Y$, how much can we control $Z = X+Y$?

(Technical note: talking about "controlling $Z$" doesn't make formal sense if $Z$ is just a random variable; formally $Z$ could be taken as a function of whatever random variable takes the place of $X$ in the expression $X+Y$, outputting a random variable.)

Controlling one additive component, in general

Our question can be broken up into: how far do you get by selecting the maximum from some samples of a variable; and how far do you get in the sum $Z$ by pushing out one of the components $X$?

To answer the second question first, suppose we have a value $x$ which is $s$ standard deviations out from the mean of $X$. That is, $x=s\sigma_X$, where $\sigma_X = \sqrt{\Var(X)}$ is the standard deviation of $X$.

Note that we're measuring "how far out" the value we got for $X$ is in terms of standard deviations of $X$, rather than, say, its actual numerical value. So if $X$ is distributed as $\phi_{1/2}$, i.e. normally distributed with standard deviation $1/2$, we'd say that the numerical value $1.5$ is $s=3$ (standard deviations) out. The reason for doing this is that we can then analyze $X$ as though its distribution is standard normal, i.e. $\phi_1$, and then re-scale the results when we have to relate $X$ to other variables.

Since variance is additive for independent variables, we have

$$\Var(Z) = \Var(X) + \Var(Y)$$

So, if we have a numerical value $s\sigma_X$ which is $s$ standard deviations of $X$ above the mean of $X$, then how much does that contribute to $Z$? Numerically it contributes $s\sigma_X$, and in terms of the scale of $Z$, it's $s\sigma_X/\sigma_Z$ standard deviations of $Z$.

In other words, if you have a value that's $s$ standard deviations of $X$, then it's

$$s\frac{\sqrt{\Var(X)}}{\sqrt{\Var(Z)}}$$

standard deviations of $Z$.

Controlling one component by selecting

To answer our question about selecting one component, we can combine the above formula relating the variances with our estimate of how far you get by selecting from some samples. In detail:

For example, $Z$ might be the normalized sum of $k$ coin flips, $Z = \frac{1}{\sqrt{k}}\sum_{i \in [k]} C_i$, where each $C_i$ is a coin flip with an equal chance of being $1$ or $-1$. The variance of $C_i$ is $\Var(C_i) = E(C_i^2) = 1$, so since variance is additive for independent variables, $\Var(Z) = \frac{1}{k}k\Var(C_0) = 1$. If $k$ is large, then $Z$ is approximately distributed like the standard normal, i.e. it has PDF $\phi$ and CDF $\Phi$. Now let $X = \frac{1}{\sqrt{k}}\sum_{i \in [m]} C_i$ and let $Y$ be the corresponding normalized sum of the remaining $k-m$ flips. Then we have $\Var(X) + \Var(Y) = \frac{m}{k} + \frac{k-m}{k} = 1 = \Var(Z)$, as expected.
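A quick simulation of this setup (a sketch, not the post's code), checking that the variances split as claimed:

```python
# Minimal sketch: the normalized coin-flip sum behaves like a standard normal,
# and Var(X) + Var(Y) = Var(Z) for the split above.
import numpy as np

rng = np.random.default_rng(0)
k, m, trials = 100, 40, 50_000

flips = rng.choice([-1, 1], size=(trials, k))
Z = flips.sum(axis=1) / np.sqrt(k)
X = flips[:, :m].sum(axis=1) / np.sqrt(k)
Y = flips[:, m:].sum(axis=1) / np.sqrt(k)

print(Z.var(), X.var() + Y.var())   # both ~1
print(X.var(), m / k)               # ~0.4
```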

So in general, say $Z$ is standard normally distributed with $Z= X+Y$, and $X$ and $Y$ are independent and each normally distributed. How far can we push $Z$ if we can optimize $X$ but not $Y$? If we take the maximum of $n$ samples from $X$ and add that to one sample from $Y$, what do we get?

Answering this fully in terms of the distribution of the result might be complicated, since we're asking about the variable $X_{n,\max} + Y$, where $X_{n,\max}$ is distributed as $\phi_{n, \max}$, so to get the PDF we'd have to convolve together (scaled versions of) $\phi_{n, \max}$ and $\phi$. But we can be happy with just estimating the mean, which simplifies by linearity of expectation:

$$E(X_{n,\max} + Y) = E(X_{n,\max}) + E(Y) = E(X_{n,\max})$$

And we can keep in mind that the variance of the sum will be larger than the variance of $X_{n,\max}$ by $\Var(Y)$, so in that way the distribution is different from the distribution of $X_{n,\max}$.

We've already estimated $X_{n,\max}$ as the median of its distribution, and seen that empirically that's a good estimate of its mean. Well, we estimated $X_{n,\max}$ when $X$ is standard normally distributed, but here $X$ is distributed as $\phi_{\sqrt{r}}$, i.e. normal with variance $r = m/k < 1$. So $X_{n,\max}$ is here a version of the max of samples from the standard normal, scaled by $\sqrt{r}$.

To say it another way, we know $X_{n,\max}$ in units of the standard deviation of $X$, $\sqrt{\Var(X)}$. What we want is to know $X_{n,\max}$ in units of the standard deviation of $Z$, so that we can say stuff like "If we take the maximum of $n$ samples from $X$, then we'll get something that's so-and-so many standard deviations out compared to a random sample from $Z$.". So to correct our estimate, we simply multiply by the standard deviation of $X$ (since we've assumed $\Var(Z) = 1$; otherwise also divide by $\sqrt{\Var(Z)}$):

$$E(X_{n,\max} + Y) = E(X_{n,\max}) \approx \sqrt{\Var(X)}\, \Phi_{n, \max}^{-1}(1/2)$$

To say it all more concretely, if $Z$ is the normalized sum of some coin flips and we divide them up in half, so $X$ is the part of the sum coming from the first half, then the variance of $X$ is $1/2$ (in units where $\Var(Z) = 1$). So the standard deviation of $X$ is $1/\sqrt{2}$. So if we could optimize $X$ to be $s$ standard deviations (that is, standard deviations of $X$) above the mean, we'd have optimized the whole thing, the sum of all the coin flips, to be $s/\sqrt{2}$ standard deviations above the mean (standard deviations of the whole thing, now). If $X$ is $2/3$ of the coin flips, then $s$ gives us $s\sqrt{2/3}$; if it's $1/10$, we get $s/\sqrt{10}$; etc. We can plug numbers in to this to see how strong processes such as gamete selection are: $X$ is the set of SNPs that our PGS knows about on the gamete we control, and $Y$ is the other DNA in that gamete, the DNA in the other gamete, and environmental factors affecting a trait.
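A sketch of this calculation in Python (the `share` parameter is my name for the fraction of $Z$'s variance carried by $X$; it's not from the post's code):

```python
# Minimal sketch: expected gain in Z (in SDs of Z) from taking the best of n
# samples of a component X carrying a fraction `share` of Z's variance,
# using the median approximation for the max.
from math import sqrt
from scipy.stats import norm

def one_component_gain(n, share):
    """share = Var(X)/Var(Z); returns the approximate SDs of Z gained."""
    return sqrt(share) * norm.ppf(0.5 ** (1.0 / n))

print(one_component_gain(100, 0.5))   # X is half the flips, best of 100 samples
print(one_component_gain(100, 2/3))   # X is 2/3 of the variance
```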

3. Selecting all of many components

What if we can select both $X$ and $Y$? If we're taking $n$ samples of each, then we get:

$$E(X_{n,\max} + Y_{n,\max}) = \left(\sqrt{\Var(X)} + \sqrt{\Var(Y)}\right) E(Z_{n,\max})$$

$$\approx \left(\sqrt{\Var(X)} + \sqrt{\Var(Y)}\right) \Phi_{n, \max}^{-1}(1/2)$$

So for example if $X$ is $2/3$ of the variance of $Z$, and we take $100$ samples from each of $X$ and $Y$, then we get

$$\left(\sqrt{2/3} + \sqrt{1/3}\right) \Phi_{100, \max}^{-1}(1/2) \approx \frac{1+\sqrt{2}}{\sqrt{3}} 2.45 \approx 3.4$$

If we'd taken $100$ samples from $Z$, we'd have gotten about $2.45$ standard deviations. So instead of getting something that's one in $10^{2.16}$, we get something that's one in $10^{3.49}$, which is more than an order of magnitude rarer.
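The example above can be reproduced with a few lines (a sketch, using the median approximation for the max; not the post's code):

```python
# Minimal sketch: select the best of n samples from each of X and Y,
# where X carries a fraction share_x of Z's variance.
from math import sqrt
from scipy.stats import norm

def both_components_gain(n, share_x):
    m = norm.ppf(0.5 ** (1.0 / n))  # approx. max of n standard-normal samples
    return (sqrt(share_x) + sqrt(1 - share_x)) * m

print(both_components_gain(100, 2/3))   # ~3.4 SDs of Z
print(norm.ppf(0.5 ** (1.0 / 100)))     # ~2.45: selecting Z directly instead
```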

Generalizing

What if we take $k$ equal parts of $Z$? I.e. $Z=\sum_{i \in [k]} X_i$, where the $X_i$ are i.i.d. normal? So then the variance of each $X_i$ is $1/k$. If we take the best of $n$ samples from each $X_i$, we get above the mean by this many standard deviations:

$$ k \sqrt{\Var(X_0)} E(Z_{n,\max}) = \sqrt{k} E(Z_{n,\max})$$

Our median approximation gives this as roughly $\sqrt{k} \Phi_{n, \max}^{-1}(1/2)$.

This formula can be used to compute approximately the power of chromosome selection for one gamete. If we assume that we are selecting, from a set of $n$ haploid genomes (i.e. $n/2$ full diploid genomes), the haploid genome made of all the top-scoring chromosomes of each kind, then we'll get something that's roughly this many standard deviations out:

$$ \sqrt{23}  E(Z_{n,\max}) \approx 4.8 E(Z_{n,\max}) $$

In other words, for a given number of samples $n$, you get $4.8$ times as far with chromosome selection as with haploid genome selection (which would give you $E(Z_{n,\max})$, since here $Z$ is the score of a haploid genome sampled randomly). In the same way we can preliminarily estimate that you get about $\sqrt{46} \approx 6.8$ times as far with chromosome selection as with full genome selection.

In general, if we take the max of $n_i$ samples of each $X[i]$ in a family of variables (and where $Z$ is the sum of the $X_i$), we get

$$E\left(\sum_i \max_{ k \in [n_i]} X[i]_k\right) = \sum_i\sqrt{\Var(X[i])}\, E\left(\max_{ k \in [n_i]} Z_k \right)$$

Bump in the road

That formula (with all the $n_i$ equal) can be used to compute more exactly the power of chromosome selection for one gamete, correcting for the fact that chromosomes vary in length by a factor of 5, the fact that a gamete only accounts for $1/2$ of the total genetic variance, and the fact that chromosome 23 can't be selected for (if a parent is male). (See next section about the full genome.) Using the estimate

$$E\left(\max_{ k \in [n]} Z_k\right) \approx \Phi_{n, \max}^{-1}(1/2) = \Phi^{-1}\left((1/2)^{1/n}\right)$$

we have this graph for the standard deviations you get if you start with a number of full genomes shown on the x-axis:

There's something wrong here. The estimate was supposed to be an exact computation of the median of the distribution. Indeed, $\Phi^{-1}\left((1/2)^{1/n}\right)$ is an exact computation of the median of the distribution of $\max_{ k \in [n]} Z_k$, as shown in an earlier graph. So how can it be that the sample median is consistently larger here, than the median we computed? If you take many samples, their median has exactly equal chance of being above or below the median of the distribution.

The issue is that we added together a bunch of medians of the $\max_{ k \in [n]} X_k$, and treated that as though it were the median of the sum $\max_{ k \in [n]} Z_k$ of the variables. This would be correct if the variables being summed had symmetric distributions. Then the median would be equal to the expected value, and expected values sum. But in this case, as shown in a graph above, the maximums skew to the right, so their medians don't add. (E.g., the estimate above that

$$\left(\sqrt{2/3} + \sqrt{1/3}\right) E\left(\max_{ k \in [100]} Z_k\right) \approx 3.4$$

is off by about $.1$ from the real value of about $3.5$.)

So, to get accurate numbers (at least, so the error is not visible on this graph!), we have to compute $E(Z_{n,\max})$ more exactly. See Gwern on order statistics for detailed thoughts about this. We'll just compute it numerically in Python, using scipy's quadrature (see code). By definition,

$$E(Z_{n,\max}) = \int_{-\infty}^{\infty} x \phi_{n,\max}(x) dx$$
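A minimal version of that computation (the repository's code presumably differs in details):

```python
# Minimal sketch: E(Z_{n,max}) by numerical integration of
# x * phi_{n,max}(x) = x * n * Phi(x)^(n-1) * phi(x).
from scipy.stats import norm
from scipy.integrate import quad

def expected_max(n):
    integrand = lambda x: x * n * norm.cdf(x) ** (n - 1) * norm.pdf(x)
    value, _err = quad(integrand, -10, 10)  # the tails beyond +-10 are negligible
    return value

print(expected_max(100))   # ~2.51, a bit above the median estimate of ~2.46
```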

So from here on, in general we'll just take as given the expected value of a variable as long as we have the PDF of the variable. Redoing the graph using the numerical estimate:

That looks better. This says, for example, that using one donor's genome, a gamete could be selected such that the child is predicted to have +1.7 SDs in terms of measured genetic variation in a trait (where the other gamete is assumed random). If parents choose to use two contributors to the constructed gamete, that gives another +1.5 SDs; and so on. (To model the effect on phenotypes, all these numbers have to be corrected downward, by multiplying by something like 1/2 or 2/3 for highly heritable polygenic traits, to account for genes not modeled by the PGS used and for environmental influence.) Looking further:

Tables of values

Say $X$ and $Y$ have the same variance; if we take $n$ samples of $X$ and $m$ samples of $Y$, in terms of standard deviations we get:

$$\frac{1}{\sqrt{2}} \left(E(Z_{m,\max}) + E(Z_{n,\max}) \right)$$

Here's a table of values for various numbers of samples $n$ and $m$. (Computed numerically.) The table shows what a one-in-[row label] value of $X$ plus a one-in-[column label] value of $Y$ gets you, expressed as standard deviations of $Z$ (meaning, a normal distribution with variance $\Var(X) + \Var(Y)$):

This can be approximated (with $R$ and $C$ the row and column rarities) as:

$$1 + \left(\log_{10}(R) + \log_{10}(C)\right)/2$$

It's an okay approximation:

So e.g. you can do in your head that if you take two independent things of equal variance and take the best of 100 samples of one and 10,000 samples of the other, the sum will be roughly $1+(2+4)/2 = 4$ standard deviations out. (Really it's $4.5$.)
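Here's a sketch comparing the mental-math approximation to the numerically computed value (not the post's table-generating code):

```python
# Minimal sketch: mental approximation 1 + (log10 R + log10 C)/2 vs. the
# numerically computed value for equal-variance X and Y.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def expected_max(n):
    f = lambda x: x * n * norm.cdf(x) ** (n - 1) * norm.pdf(x)
    return quad(f, -10, 10)[0]

def exact(n, m):
    return (expected_max(n) + expected_max(m)) / np.sqrt(2)

def approx(n, m):
    return 1 + (np.log10(n) + np.log10(m)) / 2

print(exact(100, 10_000), approx(100, 10_000))   # ~4.5 vs 4.0
```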

This can be used for example to compute the power of selecting both gametes that go into an embryo; we have to scale down to account for genes not picked up by our PGS and for environmental effects. If the PGS "accounts for fraction $f\in [0,1]$ of the variance", assuming that means that the PGS describes the phenotype $Z$ as $X+Y$ where $X$ is perfectly predicted and has variance $f$, then we scale down by multiplying the standard deviations by $\sqrt{\Var(X)} = \sqrt{f}$. There's another subtle but major point: if the gametes to be selected on are derived from a single donor, they're drawn from a tighter distribution. See the next subsection; the upshot is that the expectation of the variance within a donor's gametes is somewhere around $1/6$, probably a little more, of the variance of a gamete drawn randomly from the population. [NOTE: this is incorrect; it's $1/2$, not $1/6$.]

What if we take $k$ equal parts of $Z$? As described above, if we take the best of $n$ samples from each subdivision, we get above the mean by this many standard deviations:

$$\sqrt{k} E(Z_{n,\max})$$

This table has the number $n$ of samples on the x-axis and the number $k$ of divisions of $Z$ on the y-axis:

Note that this implies that knitting together the genomes of two parents very tightly (that is, selecting with many subdivisions) can get very far. Also, induced hyperrecombination would get fairly far.

More precisely, subdividing up every summand into $p$ parts multiplies the selection power by $\sqrt{p}$. That is, say we select some parts and get $s$ SDs, like so:

$$E\left(\sum_i \max_{ k \in [n_i]} X[i]_k\right) = s$$

Then say we divide up each $X[i]$ into $p$ i.i.d. parts, $X[i][j]$ for $j<p$, so that $X[i] = \sum_j X[i][j]$ and $\Var(X[i][j]) = \Var(X[i])/p$. The power of selection on each $X[i]$ is amplified by $\sqrt{p}$, as we saw above, and this factors out:

$$E\left(\sum_i \sum_{j \in [p]} \max_{ k \in [n_i]} X[i][j] _k\right) = $$ $$\sum_i \sum_{j \in [p]} E\left(\max_{ k \in [n_i]} X[i][j] _k\right) = $$ $$\sum_i \sum_{j \in [p]} \frac{1}{\sqrt{p}} E\left(\max_{ k \in [n_i]} X[i] _k\right) = $$ $$\sqrt{p}\sum_i  E\left(\max_{ k \in [n_i]} X[i] _k\right) = \sqrt{p} s$$
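A quick simulation of the $\sqrt{p}$ amplification (a sketch, not the post's code):

```python
# Minimal sketch: subdividing one standard-normal variable into p i.i.d. parts
# (each with variance 1/p) and selecting each part separately amplifies the
# expected gain by sqrt(p).
import numpy as np

rng = np.random.default_rng(0)
n, p, trials = 100, 4, 20_000

# Best of n samples of X (standard normal), no subdivision.
whole = rng.standard_normal((trials, n)).max(axis=1).mean()

# Subdivide X into p parts with variance 1/p each; select each part separately.
parts = (rng.standard_normal((trials, p, n)) / np.sqrt(p)).max(axis=2).sum(axis=1)
print(whole, parts.mean(), np.sqrt(p) * whole)   # the last two should roughly agree
```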

Aside: gamete variance

[NOTE: this section is incorrect. Inter-sibling variance is half of population variance.]

Each individual gamete simply has the variance of the whole population, but gametes derived from a single donor are far from independent of each other. So maximizing over such a sample doesn't have as much variance to work with and doesn't go as far. How much variance is there in a gamete derived from a fixed donor?

To simplify, restrict attention to a single chromosome number. The donor then has two chromosomes of that number, and the question is about the variance in a single chromosome derived from those two chromosomes through meiosis.

To further simplify, suppose that crossover always happens exactly once, in the middle of the chromosome, so that the only question is whether the result is the first half of one and the second half of the other chromosome, or vice versa. (What happens if we don't make this assumption?) In this context, this is equivalent to (induces the same distribution as) just choosing one of the two chromosomes.

So, given two chromosomes, we have their two scores, $C_0$ and $C_1$. These are drawn independently from the population distribution. Conditioning on $C_0$ and $C_1$, what is the variance in $C_2$, the score of a random selection between the two chromosomes? It's:

$$E(C_2^2) - E(C_2)^2 = \frac{C_0^2 + C_1^2}{2} - \left(\frac{C_0 + C_1}{2}\right)^2$$

which could be written as

$$= \frac{C_0^2 - 2C_0C_1 +  C_1^2}{4} = \left(\frac{C_0 - C_1}{2}\right)^2$$

What, then, is the expectation, over the $C_i$, of the variance in $C_2$ conditional on the $C_i$? It's:

$$E\left(\left(\frac{C_0 - C_1}{2}\right)^2\right) - E\left(\frac{C_0 - C_1}{2}\right)^2$$

Since the $C_i$ are i.i.d., this rearranges to:

$$\Var(C_0/2) + \Var(C_1/2) = \Var(C_0)/2$$

In other words, on these simplifying assumptions, the expected variance within a donor's gametes is half of the variance in the population. But what about with a more realistic uniform crossover model? We get:

$$E(C_2^2) - E(C_2)^2 = \int_{[0,1]} dx (xC_0 + (1-x)C_1)^2 - \left(\int_{[0,1]} dx (xC_0 + (1-x)C_1)\right)^2$$ $$=\int dx (x(C_0 - C_1) + C_1)^2 - \left(\int dx (x(C_0 - C_1) + C_1)\right)^2$$ $$=\left. (C_0 - C_1)^2x^3/3 + 2C_1(C_0-C_1)x^2/2 + C_1^2x - \left((C_0-C_1)x^2/2 + C_1x\right)^2\right|^1_0 $$ $$= (C_0 - C_1)^2/3 + C_1(C_0-C_1) + C_1^2 - \left((C_0-C_1)/2 + C_1\right)^2 $$ $$= (C_0 - C_1)^2/3 + C_1(C_0-C_1) + C_1^2 - (C_0-C_1)^2/4 - C_1^2 - C_1(C_0-C_1)$$ $$= (C_0 - C_1)^2/3  + C_1^2 - (C_0^2-2C_0C_1+C_1^2)/4 - C_1^2 $$ $$= (C_0 - C_1)^2/3  - (C_0-C_1)^2/4 $$ $$= (C_0 - C_1)^2/12$$

This is just $1/3$ of the result with the "always in the middle" crossover model. (WolframAlpha agrees with this calculation.)

So the upshot is that, on a somewhat more realistic model, the expected variance within a donor's gametes is $1/6$ of the variance in the population. An even more realistic model might say there's a bit more variance, since crossover is concentrated somewhat away from the telomeres. However, the active DNA (coding or regulatory) is more concentrated where there are crossovers, so that the effects somewhat cancel, and the induced distribution on a polygenic score will be more like the result for uniform crossover.

4. Selecting the top two

What happens if we pick the two largest samples out of $n$ samples?

Analyzing the second largest

First, how big is the second largest sample? For large sample sizes, the second largest will be roughly as big as the largest, so we can just use $E(Z_{n,\max})$ as an estimate. To understand what's happening, and get estimates for small sample sizes and $n$-th largest, we'll analyze the situation more.

We can analyze the CDF of the second largest sample. What's the probability that the second largest of $n$ samples of $X$ is less than $c$? Either the largest sample is $<c$ or not. The probability $P(\max_i X_i < c)$ is just $$P\left(\bigwedge_{i=1}^n (X_i < c)\right) = \prod_{i=1}^n P(X<c) = \Phi(c)^n$$ and this event implies that the second largest is $<c$. If instead the largest is $\ge c$, what's the probability that the second largest is $<c$? That happens exactly when exactly one sample is $\ge c$ and the other $n-1$ are $<c$. There are $n$ samples which can be the largest, so the probability of this is

$$ n P\left(\bigwedge_{i=1}^{n-1} (X_i < c)\right) P(X>c) = n\Phi(c)^{n-1}(1-\Phi(c))$$

So the total probability that the second largest is $<x$ is:

$$\Phi_{n,\second}(x) = \Phi(x)^n + n\Phi(x)^{n-1}(1-\Phi(x)) = (1-n)\Phi(x)^n + n\Phi(x)^{n-1}$$

Taking a derivative, we get the PDF:

$$\phi_{n, \second}(x) = n(1-n)\Phi(x)^{n-1}\phi(x) + n(n-1)\Phi(x)^{n-2}\phi(x)$$ $$= n(n-1)\phi(x)\left(\Phi(x)^{n-2}-\Phi(x)^{n-1}\right)$$

Digression: $k$-th largest

By generalizing the above reasoning, we can get the CDF for the $k$-th largest of a sample of $n$. What is the probability that the $k$-th largest is $<c$? That happens exactly when there's some $j<k$ such that exactly $j$ samples are $\ge c$. That is,

$$P(Z_{n,\kth} <c) = \sum_{j<k} P(\text{$j$ of the samples are $\ge c$, $n-j$ are $<c$})$$

There are ${n \choose j}$ many selections of the indices of the $j$ samples to be $\ge c$. For a given selection to happen, we have to have $n-j$ events that are $\Phi(c)$ likely and $j$ events that are $1-\Phi(c)$ likely to happen, all independent. So:

$$P(Z_{n,\kth} <c) = \sum_{j<k} {n \choose j} \Phi(c)^{n-j}(1-\Phi(c))^{j}$$

Graphing the second largest

Here are the PDFs for the second largest, along with those for the first largest:

Also with the CDFs:

We can numerically estimate the expectation like so:

$$E(Z_{n,\second}) = \int_{-\infty}^{\infty} x \phi_{n,\second}(x) dx$$
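A minimal version of that computation, using the PDF derived above (not the post's plotting code):

```python
# Minimal sketch: expected second-largest of n samples, integrating against
# phi_{n,2nd}(x) = n(n-1) phi(x) (Phi(x)^(n-2) - Phi(x)^(n-1)).
from scipy.stats import norm
from scipy.integrate import quad

def expected_second(n):
    pdf = lambda x: n * (n - 1) * norm.pdf(x) * (norm.cdf(x) ** (n - 2) - norm.cdf(x) ** (n - 1))
    return quad(lambda x: x * pdf(x), -10, 10)[0]

print(expected_second(100))   # ~2.1, a bit below the expected maximum of ~2.51
```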

Here's what this gives (showing both $Z_{n,\max}$ and $Z_{n,\second}$) (recalling that each dot represents the largest (green) or second largest (red) sample from one experiment drawing $x$ samples; samples are taken separately for largest and second largest):

Zooming in for low sample sizes, on a linear scale:

Sum of top two

What happens if we take the sum of the top two largest samples? By the linearity of expectation, we have:

$$E(Z_{n,\max} + Z_{n,\second}) = E(Z_{n,\max}) + E(Z_{n,\second})$$

and we already know how to compute the terms on the right. (Since $Z_{n,\max}$ and $Z_{n,\second}$ are not independent, the variance of their sum is not necessarily the sum of their variances.) Graphing, and rescaling by $1/\sqrt{2}$ so that the y-axis is in terms of standard deviations of a sum $Z + Z'$, with $Z$ and $Z'$ i.i.d.:

Zooming in for low sample sizes, on a linear scale:

5. Selecting the top two of each of many components

Combining the computation of the top two of $n$ samples and controlling additive components, we can compute

$$\sum_{i \in [\text{# of autosome types}]} \sqrt{\dfrac{\text{length of $i$-th autosome}}{\text{total length of genome}}}\left( E(Z_{n,\max}) + E(Z_{n,\second})\right)$$

This excludes the sex chromosome and only accounts for the effect of selecting the autosomes, so this is a slight underestimate. (How much selection can be applied to the sex chromosome depends on the sexes of the input genomes.) The result:

Note that this graph can be approximately gotten from the graph of the sum of the top two samples, taking into account that this graph has full genomes on the x-axis and hence gives twice the samples, and multiplying by $22/\sqrt{23} \approx 4.6$. That gives a slight overestimate because it doesn't account for the chromosomes having different lengths.
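Here's a sketch of the formula above under the simplification that all 23 chromosome types have equal length, so each autosome carries $1/23$ of the haploid variance; the post's actual computation uses the real chromosome lengths, which would replace the equal weights below:

```python
# Minimal sketch: top-two selection on each of the 22 autosomes, equal-length
# assumption (weight sqrt(1/23) per autosome); real lengths should be substituted.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def expected_from_pdf(pdf):
    return quad(lambda x: x * pdf(x), -10, 10)[0]

def e_max(n):
    return expected_from_pdf(lambda x: n * norm.cdf(x) ** (n - 1) * norm.pdf(x))

def e_second(n):
    return expected_from_pdf(
        lambda x: n * (n - 1) * norm.pdf(x) * (norm.cdf(x) ** (n - 2) - norm.cdf(x) ** (n - 1))
    )

def top_two_autosomes(n):
    """n = number of chromosomes of each type available to select among."""
    weights = [np.sqrt(1 / 23)] * 22   # 22 autosomes, equal-length assumption
    return sum(w * (e_max(n) + e_second(n)) for w in weights)

print(top_two_autosomes(10))
```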

6. Selecting for multiple scores

Say we select a random element for multiple independent scores. How does selection power trade off between the different scores? If we select enough to get $s$ SDs above the mean of a normal distribution, and we just select based on one score $S_0$, then in expectation, we could get an element that scores $s$ above the mean according to $S_0$, and scores at the mean according to the other scores $S_i$. But what if we select for a weighted sum of the scores? E.g. if we get $s-1$ SDs on $S_0$, how many SDs do we get on $S_1$?

Formal setup

Formally we have a random element $X$ taking values in a set $\Omega$, and we have some scores $S_i:\Omega \to \R$ for $i \in I$. For $x \in \Omega$, let $S(x)$ denote the vector of scores, so $S(x)_i = S_i(x)$. For simplicity, we'll assume that the scores are independent and standard normally distributed. That is: $$\forall  x \in \Omega, v \in \R^{|I|}: P(S(x) = v) = \prod_{i \in I} P(S_i(x) = v_i) =  \prod_{i \in I} \phi(v_i)$$

Given a weighting, i.e. a vector $\alpha \in \R^{|I|}$ such that $\Ltwo{\alpha} = \sqrt{\sum_i \alpha_i^2} = 1$, define the weighted score $S_\alpha(x) = \sum_i \alpha_i S_i(x)$.

We're asking about the scores $S_i$ conditioned on the score $S_\alpha$. E.g., what is

$$P(S_i(x) = y \bigm|  S_\alpha(x) = s)$$

Multivariate gaussians

We have the prior probability

$$P(S_0(x) = s_0, S_1(x) = s_1) = \phi(s_0)\phi(s_1) \propto e^{\left(-{\dfrac {s_0^2}{2}}\right)} e^{\left(-{\dfrac {s_1^2}{2}}\right)} =  e^{\left(-{\dfrac {s_0^2+s_1^2}{2}}\right)}$$

We can write this as $$e^{\left(-{\dfrac {r^2}{2}}\right)}$$

where $r=\sqrt{s_0^2+s_1^2}$ is the distance of $(s_0, s_1)$ from the origin. So the probability depends only on the radius. Here's the density (source: modified from this post by whuber):

If we conditioned on the event $S_0(x) = s_0$, that gives us

$$P(S_1(x) = s_1 \bigm| S_0(x) = s_0 ) = \frac{\phi(s_0)\phi(s_1)}{\phi(s_0)} = \phi(s_1)$$

In other words, the distribution on the green line is just the standard normal again, since $S_1$ is independent of $S_0$ and is standard normally distributed:

Here's a cutaway to show what the conditional distribution looks like (source: copied from this post by whuber):

If instead we condition on the event $S_\alpha(x) = \alpha_0 S_0(x) + \alpha_1 S_1(x) = s_\alpha$, what do we get? We're looking at a line defined by $$S_1(x) = \frac{s_\alpha}{\alpha_1}-\frac{\alpha_0 }{\alpha_1}S_0(x)$$

What's the conditional distribution on this line?

Since the joint distribution is rotationally symmetric, we can just rotate down to the horizontal axis, without changing the values on the green line:

This shows that the distribution restricted to the diagonal green line is the same as the distribution on the vertical green line, which is just a standard normal distribution. The mean of this distribution is the point on the green line that passes closest to the origin. That happens at the point $(x,y)$ where a line (yellow) that's drawn from there to the origin is orthogonal to the green line. So to find the mean, we want to find that point.

The case with two scores

We can solve this algebraically, using that the slope of the yellow line is $\alpha_1/\alpha_0$. That gives the point $(s_\alpha\alpha_0, s_\alpha\alpha_1)$ as the closest point.

What does this imply? In order to avoid rescaling values of $S(x)$, we take $\Ltwo{\alpha} = \sqrt{\alpha_0^2 + \alpha_1^2} = 1$, so that

$$\Var(S_\alpha(x)) = \sum_i \Var(\alpha_i S_i(x)) = \sum_i \alpha_i^2 \Var(S_i(x)) = \sum_i \alpha_i^2 = 1$$

So, if we got a value of $S_\alpha(x)$ that's $s_\alpha$ above the mean (hence, $s_\alpha$ standard deviations above the mean), and the mean value for $S_0(x)$ that we get is $s_0$ standard deviations above the mean of $S_0(x)$, then we have $\alpha_0 = s_0/s_\alpha$. Then $\alpha_1 = \sqrt{1-\alpha_0^2} = \sqrt{1-(s_0/s_\alpha)^2}$, so we get a mean score for $S_1(x)$ that's above the mean by $s_\alpha\sqrt{1-(s_0/s_\alpha)^2} = \sqrt{s_\alpha^2-s_0^2}$.

So for example, say you sample enough to get +10 SDs. If the weighting is $$S_\alpha(x) = \frac{9}{10}S_0(x) + \sqrt{1-\left(\dfrac{9}{10}\right)^2}S_1(x)$$ then you get $10\frac{9}{10} = 9$ SDs on $S_0(x)$, and $\sqrt{10^2-9^2}\approx 4.3$ SDs on $S_1(x)$.

The case with $n$ scores

Above we gestured at finding the nearest point algebraically. It's easier to see what's going on by approaching the general case geometrically. So we have the weighting $\alpha$ such that $\Ltwo{\alpha} = 1$, and we have $S_\alpha(x) = \alpha^{\intercal} x$ (where we're conflating $S(x)$ with $x$, so $x_i = S_i(x)$). Then we set $S_\alpha(x) = \alpha^{\intercal} x = s_\alpha$, and ask to find $x$ closest to the origin that lies in that hyperplane.

That is, we want to minimize $\Ltwo{x}$, or equivalently minimize $\Ltwo{x}^2 = \sum_i x_i^2$. For a given $x$, the force pushing on $x_i$ to decrease is proportional to $x_i$ (since $\dfrac{d}{dx_i}\Ltwo{x}^2 \propto x_i$). The cost of decreasing $x_i$, which to stay on the plane has to be compensated for by increasing other coordinates, is proportional to the coefficient in front of $x_i$ in the equation for the plane, i.e. $\alpha_i$. So for all these forces to balance, we can guess that $x \propto \alpha$.

To say it another way, if, say, $x_0 > \tau\alpha_0$ but also $x_1 \le \tau\alpha_1$ for some $\tau>0$ (taking the $\alpha_i$ positive), then we can move around on the plane by adding to $x$ an infinitesimal proportional to $(-\alpha_1, \alpha_0, 0, 0, ...)$. This adds to $\Ltwo{x}^2$ something proportional to $-\alpha_1 x_0 + \alpha_0 x_1 < -\tau\alpha_1 \alpha_0 + \tau\alpha_0 \alpha_1 = 0$, i.e. it decreases $\Ltwo{x}^2$. So at the minimum we have $x \propto \alpha$.

Say $x = \tau \alpha$. Then, to have that $\alpha^{\intercal} x = \tau \Ltwo{\alpha}^2 = s_\alpha$, we just take $\tau = s_\alpha$, since $\Ltwo{\alpha} = 1$. Thus we have:

$$x=s_\alpha\alpha$$

Cashing out numbers

What does it mean that $x=s_\alpha\alpha$? To put it another way, if we're selecting so that we get $s_\alpha$ SDs on $S_\alpha$, we can pick any vector $s$ of Euclidean norm $s_\alpha$, and then make that vector be the mean of the scores that we get, i.e. $s_i = E(S_i(x))$. For example, say we have $m$ scores and we weight them equally. That means we're taking

$$s_\alpha= \Ltwo{s} = \sqrt{\sum_i s_i^2}= s_0\sqrt{m}$$ which means that we score $s_\alpha/\sqrt{m}$ on each of $m$ scores. E.g., instead of scoring +10 SDs on one score, you can score $10/\sqrt{2} \approx +7$ SDs on two scores, or $10/\sqrt{3} \approx +5.7$ on three scores. Scoring small amounts above the mean is extremely cheap. E.g. if you select to the +10 SD level, you can score +9 SDs on one score and then +1 SD on $10^2-9^2 = 19$ other scores, or +2 SDs on $(10^2-9^2)/2^2 \approx 5$ other scores.
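Here's a small sketch of these budget calculations (the function names are mine, not from the post's code):

```python
# Minimal sketch: trading off a fixed selection budget s_alpha across scores.
# A vector s of per-score SDs is achievable as the mean outcome iff ||s||_2 = s_alpha.
import numpy as np

def equal_split(s_alpha, m):
    """SDs gained on each of m equally weighted scores."""
    return s_alpha / np.sqrt(m)

def remaining_budget(s_alpha, s0):
    """SDs left for other scores after taking s0 SDs on the first score."""
    return np.sqrt(s_alpha**2 - s0**2)

print(equal_split(10, 2), equal_split(10, 3))   # ~7.1 and ~5.8
print(remaining_budget(10, 9))                  # ~4.4 to spread over other scores
```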

7. The limits of selection

The distribution of a sum of coin flips is close to a normal distribution, but not exactly equal to one. How does this affect selection power?

Unbiased coins

How close is the binomial distribution to the normal distribution? With $2^{10}$ coin flips with an equal probability of +1 or -1, the variance is 1024 and the standard deviation is $2^5$. We could look at a graph of the two PDFs, but they would be indistinguishable except for the discreteness of the binomial. Instead we can look at a graph of the logarithm of the PDFs:

Farther out, the probability densities do diverge very far, in relative terms. And by the time we get to 32 standard deviations away from the mean, the binomial distribution's PDF hits 0 and stops showing up on the graph: there's no chance of getting 1025 heads or tails.

If we were getting selection power by selecting from extremely many samples (like, taking the best of $10^n$, where say $n>15$) from a binomial distribution, we'd hit more quickly diminishing returns than if we were selecting from a normal distribution. That doesn't matter, though, if we're only ever taking relatively small numbers of samples, instead getting selection power e.g. by combining components selected on or by selecting in stages that make cumulative progress.

In some contexts, it may be more useful to think of the coin flips as being more fundamental than the normal distribution they approximate. If $n$ coin flips coming up heads has some effect, then $10n$ flips coming up heads might have approximately 10 times that effect, and the fact that a random sample being that extreme is much less likely according to the binomial distribution than according to the normal distribution doesn't really bear on the question of what the effect of $10n$ heads would be.

This leaves the question of how much room there is. In the present model with 1024 coins, the mean is 0 points, 1 SD is 32 points, and the maximum number of points is 1024, so the limit of selection is $1024/32=32$ SDs. (Note: It may seem strange that there are 64 SDs of range of outcome, from -32 SDs to +32 SDs. This happens because in the model where heads gives +1 and tails gives -1, changing a coin from tails to heads adds 2 to the total. That is, the mean point score is 0 but the mean heads count is 512; the standard deviation of the score in points is 32 but the standard deviation of the heads count is 16; and the minimum points is -1024, but the minimum heads is 0.)

In general, with $n$ fair coins, the variance is $n$, the standard deviation is $\sqrt{n}$, and the range of possible outcomes is from $n/\sqrt{n} = \sqrt{n}$ to $-\sqrt{n}$ standard deviations. (The range of standard deviations of outcomes isn't affected by the choice of +1 and -1 for the values of heads and tails. Changing those values amounts to shifting the mean and scaling the standard deviation, but scaling the standard deviation doesn't affect the number of standard deviations in the range. If heads is worth $+x$ and tails worth $-x$, then the variance is $x^2n$, the standard deviation is $x\sqrt{n}$, and the maximum outcome is $xn / x\sqrt{n} = \sqrt{n}$ SDs.)

Biased coins

What if the coins are not fair coins? Qualitatively, the situation is like this: If tails is rare, then each tails decreases the sum substantially (by more than 2). So the mean score happens when there aren't very many tails. That means there's much less room to decrease the number of tails, so the range of outcomes has a smaller upper limit. Since there's much more room to increase the number of tails, the lower limit of the range of outcomes is far smaller than in the case of fair coins.

More precisely, say the probability of heads is $p$ and there are $n$ coins. We can still approximate this as a normal distribution, but to make that distribution mean 0 and variance $n$, we have to choose the right values for heads and tails. The value of heads should be

$$\sqrt{\dfrac{1-p}{p}}$$

and the value of tails should be

$$-\sqrt{\dfrac{p}{1-p}}$$

What does the resulting biased binomial distribution look like? Like this:

(The binomial distributions are discrete; the actual values are at the center of each horizontal interval in the "densities" shown. The intervals are so long because the value of switching from tails to heads is much larger than in the fair case.)

That shows the biased binomial distributions being slightly but noticeably skewed. What about the limits? We can look at the logarithm of the probability densities:

In this graph, most of the lines end because of limited numerical precision or graph boundaries, but the key ends are the ends on the right side (other than the bright gaussian). Those show where the biased binomial PDFs end because they actually become 0, having hit their limits. What exactly are the limits?

The limits are achieved when all coins are tails or when all coins are heads. When all coins are tails, the score is

$$-n\sqrt{\dfrac{p}{1-p}}$$

points. Since a standard deviation is $\sqrt{n}$ points, that amounts to

$$\sqrt{\dfrac{np}{1-p}}$$

standard deviations below the mean score. Likewise, when all the coins are heads, that amounts to $$\sqrt{\dfrac{n(1-p)}{p}}$$ standard deviations above the mean score. For example if $p=.9$ and $n=10000$, the upper limit is $$\sqrt{\dfrac{10000\times .1}{.9}} = 100/3 \approx 33$$ standard deviations. If instead $p=.99$, the upper limit is $$\sqrt{\dfrac{10000\times .01}{.99}} \approx 10$$ standard deviations. The upper limit as a function of $p$:
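A sketch of the computation behind these limits (not the post's plotting code):

```python
# Minimal sketch: upper and lower selection limits (in SDs) for n coins with
# heads probability p, scored so each flip has mean 0 and variance 1.
import numpy as np

def limits_in_sds(n, p):
    upper = np.sqrt(n * (1 - p) / p)   # all coins heads
    lower = -np.sqrt(n * p / (1 - p))  # all coins tails
    return upper, lower

print(limits_in_sds(10_000, 0.5))    # (100, -100): the fair-coin case, sqrt(n) SDs
print(limits_in_sds(10_000, 0.9))    # upper limit ~33 SDs
print(limits_in_sds(10_000, 0.99))   # upper limit ~10 SDs
```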

On the other side, the lower limit gets much more negative as $p$ approaches 1, since the mean score involves only a few tails, leaving a lot of room to change heads to tails for a large decrease in score:

Mixed bias

What if some coins are biased towards heads and some biased towards tails (still keeping the variance of the score of each coin flip fixed at 1)? Qualitatively we have that, as $p$ approaches 1, the raised upper limit from the rare high-scoring heads will compensate for the decreased upper limit from the common medium-scoring heads. So as $p$ varies, the upper limit of selection doesn't go as far down as it did in the case above where all the coins were biased towards heads.

Quantitatively, suppose the mix is half and half, with the common probability being $p>0.5$ and the rare probability being $1-p < 0.5$. That is, half the coins come up heads with probability $p$ and increase the score by $\sqrt{\frac{1-p}{p}}$, and the other half of the coins come up heads with probability $1-p$ and increase the score by $\sqrt{\frac{p}{1-p}}$, and mutatis mutandis for tails. Then with $n$ coins, the overall upper limit of the score is the score of the all-heads outcome, which is just a half and half mix of the score of the all-heads outcome for the all-$p$-coins case and the all-$(1-p)$-coins case. So (dividing through by the standard deviation $\sqrt{n}$) the upper limit of the score, in standard deviations, is:

$$\frac{1}{2}\sqrt{\dfrac{n(1-p)}{p}} + \frac{1}{2}\sqrt{\dfrac{np}{(1-p)}}$$

Or in general, with a fraction $f$ of the coins having probability $p>0.5$ of heads, the upper limit is:

$$f\sqrt{\dfrac{n(1-p)}{p}} + (1-f)\sqrt{\dfrac{np}{1-p}}$$
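A quick sketch for tabulating this upper limit over a few values of $p$ and $f$ (not the post's plotting code):

```python
# Minimal sketch: upper limit (in SDs) with a fraction f of coins biased toward
# heads with probability p, and the remaining fraction biased the other way.
import numpy as np

def mixed_upper_limit(n, p, f):
    return f * np.sqrt(n * (1 - p) / p) + (1 - f) * np.sqrt(n * p / (1 - p))

for f in (0.5, 0.9, 0.99, 1.0):
    print(f, [round(mixed_upper_limit(10_000, p, f), 1) for p in (0.6, 0.9, 0.99)])
```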

Graphing for a few values of $f$:

Except for very extreme mixes, the compensation from the uncommon heads puts a solid floor on the upper limit.

Therefore we can conclude that, for 10,000 coins, there are many (say, >30) standard deviations of room for selection to operate on a binomial distribution, unless it's the case that both

  • nearly all (say, >.99) of the coins have heads being more common, and also
  • the probability of the more common flip outcome is high (say, >.9).