# Inferring demography with LD

As described in the linkage disequilibrium and LD Parsing sections, we use a family of normalized LD and heterozygosity statistics to compare between model expectations and data. We optimize demographic model parameters to find the expected binned LD and heterozygosity statistics that maximize a composite likelihood over all pairs of SNPs and recombination bins.

In this section, we’ll describe the likelihood framework, how to define
demographic models that can be used in inference, how to run optimization using
`moments`

’ built-in inference functions, and how to compute confidence
intervals. We include a short example, following the parsing of data simulated under an isolation-with-migration
model, to illustrate the main features and options.

## Likelihood framework

For a given recombination distance bin indexed by \(i\), we have a set of
computed LD statistic means \(D_i\) from data along with the
variance-covariance matrix \(\Sigma_i\) as returned by
`moments.LD.Parsing.bootstrap_data`

. We assume a multivariate Guassian
likelihood function, so that a model parameterized by \(\Theta\) that has
expected statistics \(\mu_i(\Theta)\) has likelihood

The likelihood is computed similarly for heterozygosity statistics, given their variance-covariance matrix. Then the composite likelihood of two-locus data across recombination bins and single-locus heterozygosity (indexed by \(i=n+1\) where \(n\) is the total number of recombination bins), is

In practice, we work with the log of the likelihood, so that products turn to sums and we can drop constant factors:

As the data \(\{D_i, \Sigma_i\}\) is fixed, we search for the model parameters \(\Theta\) that provide \(\{\mu_i\}\) that maximizes \(\log\mathcal{L}\).

## Defining demographic models

There are a handful of built-in demographic models for one-, two-, and
three-population scenarios that can be used in inference (see here). However, these are far from comprehensive and it is
likely that custom demographic models will need to be written for a given
inference problem. For inspiration, `moments.LD.Demographics1D`

,
`Demographics2D`

, and `Demographics3D`

can be used as starting points and
as illustrations of how to structure model functions.

Demographic models all require a `params`

positional argument and `rho`

and
(optionally) `theta`

keyword arguments. `theta`

, the population-size scaled
mutation rate, does not play a role in inference using relative statistics, as
the mutation rate cancels in \(\sigma_d^2\)-type statistics.

For example, the IM model we simulated data under in the LD Parsing section could be parameterized as

```
def model_func(params, rho=None, theta=0.001):
nu0, nu1, T, M = params
y = moments.LD.Numerics.steady_state(rho=rho, theta=theta)
y = moments.LD.LDstats(y, num_pops=1)
y = y.split(0)
y.integrate([nu0, nu1], T, m=[[0, M], [M, 0]], rho=rho, theta=theta)
return y
```

In the input demographic model to the simulations, we had the ancestral
effective population size as 10,000, the size of deme0 was 2,000, and the size
of deme1 was 20,000. The populations split 1,500 generations ago, and exchanged
migrants symmetrically at a rate of 0.0001 per-generation. Converted into
genetic units, `nu0 = 0.2`

, `nu1 = 2`

, `T=1500 / 2 / 10000 = 0.075`

, and
`M = 2 * 10000 * 0.0001 = 2.0`

.

## Running optimization

Optimization with `moments.LD`

, much like `moments`

optimization with the
SFS, includes a handful functions that serve as wrappers for `scipy`

optimizers with options specific to working with LD statistics. The two primary
functions in `moments.LD.Inference`

are

`optimize_log_fmin`

: Uses the downhill simplex algorithm on the log of the parameters.`optimize_log_powell`

: Uses the modified Powell’s method, which optimizes slices of parameter space sequentially.

Each optimization method accepts the same arguments. Required positional arguments are

`p0`

: The initial guess for the parameters in`model_func`

.`data`

: Structured as a list of lists of data means and data var-cov matrices. I.e.,`data = [[means[0], means[1], ...], [varcovs[0], varcovs[1], ...]]`

, with the final entry of the lists the means and varcovs of the heterozygosity statistics.`model_func`

: The demographic model to be fit (see above section). Importantly, this is a*list*, where the first entry is the LD model, which is always used, and the optional second entry is a demographic model for the SFS (which is a rarely used option and can be ignored). So usually, we would set`model_func`

as`[model_func_ld]`

.

Additionally, we will almost always pass the list of unscaled recombination
bin edges as `rs = [r0, r1, ..., rn]`

, which defines *n* recombination bins.

The effective population size plays a different role in LD inference than it does in SFS inference. For the site frequency spectrum, \(N_e\) merely acts as a linear scaling factor and is absorbed by the scaled mutation rate \(\theta\), which is treated as a free parameter. Here, \(N_e\) instead rescales recombination rates, and because we use a recombination map to determine the binning of data by recombination distances separating loci, \(N_e\) is a parameter that must be either passed as a fixed value or simultaneously fit in the optimization.

If `Ne`

is a fixed value, we specify the population size using that keyword
argument. Otherwise, if `Ne`

is to be fit, our list of parameters to fit by
convention includes `Ne`

in the final position in the list. Typically, `Ne`

is not a parameter of the demographic model, as we work in rescaled genetic
units, so the parameters that get passed to `model_func`

are `params[:-1]`

.
However, it is also possible to write a demographic model that also uses `Ne`

as a parameter. In this case we set `pass_Ne`

to `True`

, so that `Ne`

both rescales recombination rates and is a model parameter, and all `params`

are passed to `model_func`

.

`Ne`

: The effective population size, used to rescale`rs`

to get`rhos = 4 * Ne * rs`

.`pass_Ne`

: Defaults to`False`

. If`True`

, the demographic model includes`Ne`

as a parameter (in the final position of input parameters).

Other commonly used options include

`fixed_params`

: Defaults to`None`

. To fix some parameters, this should be a list of equal length as`p0`

, with`None`

for parameters to be fit and fixed values at corresponding indexes.`lower_bound`

: Defaults to`None`

. Constraints on the lower bounds during optimization. These are given as lists of the same length of the parameters.`upper_bound`

: Defaults to`None`

. Constraints on the upper bounds during optimization. These are given as lists of the same length of the parameters.`statistics`

: Defaults to`None`

, which assumes that all statistics are present and in the conventional default order. If the data is missing some statistics, we must specify which statistics are present using the subset of statistic names given by`moments.LD.Util.moment_names(num_pops)`

.`normalization`

: Defaults to`0`

. The index of the population to normalize by, which should match the population index that we normalized by when parsing the data.`verbose`

: If an integer greater than 0, prints updates of the optimization procedure at intervals given by that spacing.

### Example

Using the data simulated in the Parsing section, we can
refit the demographic model under a parameterized IM model. For this, we could
use the `moments.LD.Demographics2D.split_mig`

model as our `model_func`

,
which is equivalent to the function we defined above (which we use in this
example). After loading the data and setting up the inference options, we’ll
use `optimize_log_fmin`

to fit the model.

```
import moments.LD
import pickle
data = pickle.load(open("data/means.varcovs.split_mig.100_reps.bp", "rb"))
rs = [0, 1e-6, 2e-6, 5e-6, 1e-5, 2e-5, 5e-5, 1e-4, 2e-4, 5e-4, 1e-3]
p_guess = [0.1, 2.0, 0.075, 2.0, 10000]
p0 = moments.LD.Util.perturb_params(p_guess, fold=0.2)
# run optimization
opt_params, LL = moments.LD.Inference.optimize_log_fmin(
p_guess,
[data["means"], data["varcovs"]],
[model_func],
rs=rs,
verbose=40,
)
# get physical units, rescaling by Ne
physical_units = moments.LD.Util.rescale_params(
opt_params, ["nu", "nu", "T", "m", "Ne"]
)
print("best fit parameters:")
print(f" N(deme0) : {physical_units[0]:.1f}")
print(f" N(deme1) : {physical_units[1]:.1f}")
print(f" Div. time (gen) : {physical_units[2]:.1f}")
print(f" Migration rate : {physical_units[3]:.6f}")
print(f" N(ancestral) : {physical_units[4]:.1f}")
```

```
40 , -214.296 , array([ 0.140269 , 1.95203 , 0.0514585 , 2.07543 , 12430 ])
```

```
80 , -130.758 , array([ 0.153478 , 2.00214 , 0.0541354 , 2.11845 , 11248.2 ])
```

```
120 , -77.1877 , array([ 0.184635 , 1.80666 , 0.0673507 , 2.0472 , 11068.5 ])
```

```
160 , -76.8557 , array([ 0.186388 , 1.80029 , 0.0685854 , 2.05002 , 11008.4 ])
```

```
200 , -76.823 , array([ 0.185232 , 1.80143 , 0.0679905 , 2.04797 , 11010.4 ])
```

```
240 , -76.8106 , array([ 0.185808 , 1.80374 , 0.0683159 , 2.05049 , 11000.2 ])
```

```
280 , -76.4522 , array([ 0.188075 , 1.81258 , 0.0692172 , 2.02497 , 10970.3 ])
```

```
320 , -74.2099 , array([ 0.183888 , 1.92246 , 0.0656908 , 1.82861 , 10918.1 ])
```

```
360 , -72.9169 , array([ 0.191533 , 2.01901 , 0.0680872 , 1.70206 , 10638.2 ])
```

```
400 , -72.7291 , array([ 0.192878 , 2.06349 , 0.0681534 , 1.66157 , 10576 ])
```

```
440 , -72.7222 , array([ 0.193385 , 2.06172 , 0.0684375 , 1.67061 , 10567.8 ])
```

```
best fit parameters:
N(deme0) : 2042.8
N(deme1) : 21785.4
Div. time (gen) : 1445.7
Migration rate : 0.000079
N(ancestral) : 10569.8
```

These should be pretty close to the input demographic parameters from the simulations! They won’t be spot on, as this was only using 100Mb of simulated data, but we should be in the ballpark.

## Computing confidence intervals

When running demographic inference, we get a point estimate for the *best fit*
demographic parameters. However, for an unknown underlying true value, it’s
important to also estimate what’s called a confidence interval. The CI tells us
the probability that the true value lies within some range, and provides some
information about which parameters in our demographic model are tightly
constrained and which parameters we have little power to pin down.

`moments.LD`

can estimate confidence intervals using either the Fisher
Information Matrix (FIM) or the Godambe Information Matrix (GIM). In almost all
cases when using real data (or even most simulated data), the FIM will estimate
a much smaller CI than the GIM. This occurs because the FIM assumes all data
points that we’ve used are independent, when in reality there is linkage that
causes data points to be sometimes highly correlated between pairs of loci and
between recombination bins. The Godambe method uses bootstrap-resampled
replicates of the data to account for this correlation and does a much better
job at estimating the true underlying CIs [Coffman2016].

Note

If you use the Godambe approach to estimate confidence intervals, please
cite [Coffman2016]. Alec originally implemented this approach in `dadi`

,
and `moments`

has more-or-less used this same implementation here.

To create bootstrap replicates from the dictionary of data sums computed over
regions, where `rep_data = {0: ld_stats_0, 1: ld_stats_1, ...}`

, e.g., we use

```
num_boots = 100
norm_idx = 0
bootstrap_sets = moments.LD.Parsing.get_bootstrap_sets(
rep_data, num_bootstraps=num_boots, normalization=norm_idx)
```

These bootstrap sets can then be used as the inputs to the `moments.LD.Godambe`

methods. The two CI estimation methods are

`FIM_uncert`

: Uses the Fisher Information Matrix. Usage is`FIM_uncert(model_func, opt_params, means, varcovs, r_edges=rs)`

.`GIM_uncert`

: Uses the Godambe Information Matrix. Usage is`GIM_uncert(model_func, bootstrap_sets, opt_params, means, varcovs, r_edges=rs)`

.

In each case, the model function is the same as used in inference (some
manipulation may be needed if we had any fixed parameters), means and varcovs
are the same data as input to the inference function, and `r_edges`

are the
bin edges used in the inference. Additional options for some corner cases are
described in the API reference for LD methods.

### Example

We’ll use both the FIM and GIM to compute uncertainties from the above example inference.

Using the FIM approach:

```
# using FIM
uncerts_FIM = moments.LD.Godambe.FIM_uncert(
model_func,
opt_params,
data["means"],
data["varcovs"],
r_edges=rs,
)
# lower and upper CIs, in genetic units
lower = opt_params - 1.96 * uncerts_FIM
upper = opt_params + 1.96 * uncerts_FIM
# convert to physical units
lower_pu = moments.LD.Util.rescale_params(lower, ["nu", "nu", "T", "m", "Ne"])
upper_pu = moments.LD.Util.rescale_params(upper, ["nu", "nu", "T", "m", "Ne"])
print("95% CIs:")
print(f" N(deme0) : {lower_pu[0]:.1f} - {upper_pu[0]:.1f}")
print(f" N(deme1) : {lower_pu[1]:.1f} - {upper_pu[1]:.1f}")
print(f" Div. time (gen) : {lower_pu[2]:.1f} - {upper_pu[2]:.1f}")
print(f" Migration rate : {lower_pu[3]:.6f} - {upper_pu[3]:.6f}")
print(f" N(ancestral) : {lower_pu[4]:.1f} - {upper_pu[4]:.1f}")
```

```
95% CIs:
N(deme0) : 1828.5 - 2267.9
N(deme1) : 18863.7 - 24873.3
Div. time (gen) : 1269.6 - 1631.3
Migration rate : 0.000069 - 0.000088
N(ancestral) : 10165.2 - 10974.3
```

And using the GIM approach:

```
bootstrap_sets = pickle.load(open("data/bootstrap_sets.split_mig.100_reps.bp", "rb"))
# using GIM
uncerts_GIM = moments.LD.Godambe.GIM_uncert(
model_func,
bootstrap_sets,
opt_params,
data["means"],
data["varcovs"],
r_edges=rs,
)
# lower and upper CIs, in genetic units
lower = opt_params - 1.96 * uncerts_GIM
upper = opt_params + 1.96 * uncerts_GIM
# convert to physical units
lower_pu = moments.LD.Util.rescale_params(lower, ["nu", "nu", "T", "m", "Ne"])
upper_pu = moments.LD.Util.rescale_params(upper, ["nu", "nu", "T", "m", "Ne"])
print("95% CIs:")
print(f" N(deme0) : {lower_pu[0]:.1f} - {upper_pu[0]:.1f}")
print(f" N(deme1) : {lower_pu[1]:.1f} - {upper_pu[1]:.1f}")
print(f" Div. time (gen) : {lower_pu[2]:.1f} - {upper_pu[2]:.1f}")
print(f" Migration rate : {lower_pu[3]:.6f} - {upper_pu[3]:.6f}")
print(f" N(ancestral) : {lower_pu[4]:.1f} - {upper_pu[4]:.1f}")
```

```
95% CIs:
N(deme0) : 1581.2 - 2551.3
N(deme1) : 17105.0 - 26931.2
Div. time (gen) : 1030.5 - 1906.9
Migration rate : 0.000059 - 0.000096
N(ancestral) : 9854.5 - 11285.0
```

We can see above that the FIM uncertainties are considerably smaller (i.e. more constrained) than the GIM uncertainties. However, the GIM uncertainties are to be preferred here, as they more accurately estimate the underlying true uncertainty in the demographic inference.