Inferring demography with LD

As described in the linkage disequilibrium and LD Parsing sections, we use a family of normalized LD and heterozygosity statistics to compare between model expectations and data. We optimize demographic model parameters to find the expected binned LD and heterozygosity statistics that maximize a composite likelihood over all pairs of SNPs and recombination bins.

In this section, we’ll describe the likelihood framework, how to define demographic models that can be used in inference, how to run optimization using moments’ built-in inference functions, and how to compute confidence intervals. We include a short example, following the parsing of data simulated under an isolation-with-migration model, to illustrate the main features and options.

Likelihood framework

For a given recombination distance bin indexed by \(i\), we have a set of computed LD statistic means \(D_i\) from data along with the variance-covariance matrix \(\Sigma_i\) as returned by moments.LD.Parsing.bootstrap_data. We assume a multivariate Guassian likelihood function, so that a model parameterized by \(\Theta\) that has expected statistics \(\mu_i(\Theta)\) has likelihood

\[\mathcal{L}_i(\Theta | D_i) = P(D_i | \mu_i, \Sigma_i) = \frac{\exp\left(-\frac{1}{2}(D_i-\mu_i)^T\Sigma_i^{-1}(D_i-\mu_i)\right)}{(2\pi)^{k/2}|\Sigma_i|^{1/2}}.\]

The likelihood is computed similarly for heterozygosity statistics, given their variance-covariance matrix. Then the composite likelihood of two-locus data across recombination bins and single-locus heterozygosity (indexed by \(i=n+1\) where \(n\) is the total number of recombination bins), is

\[\mathcal{L} = \prod_{i=1}^{n+1}\mathcal{L}_i.\]

In practice, we work with the log of the likelihood, so that products turn to sums and we can drop constant factors:

\[\log\mathcal{L} \propto -\frac{1}{2}\sum_{i=1}^{n+1} (D_i-\mu_i)^T\Sigma_i^{-1}(D_i-\mu_i).\]

As the data \(\{D_i, \Sigma_i\}\) is fixed, we search for the model parameters \(\Theta\) that provide \(\{\mu_i\}\) that maximizes \(\log\mathcal{L}\).

Defining demographic models

There are a handful of built-in demographic models for one-, two-, and three-population scenarios that can be used in inference (see here). However, these are far from comprehensive and it is likely that custom demographic models will need to be written for a given inference problem. For inspiration, moments.LD.Demographics1D, Demographics2D, and Demographics3D can be used as starting points and as illustrations of how to structure model functions.

Demographic models all require a params positional argument and rho and (optionally) theta keyword arguments. theta, the population-size scaled mutation rate, does not play a role in inference using relative statistics, as the mutation rate cancels in \(\sigma_d^2\)-type statistics.

For example, the IM model we simulated data under in the LD Parsing section could be parameterized as

def model_func(params, rho=None, theta=0.001):
    nu0, nu1, T, M = params
    y = moments.LD.Numerics.steady_state(rho=rho, theta=theta)
    y = moments.LD.LDstats(y, num_pops=1)
    y = y.split(0)
    y.integrate([nu0, nu1], T, m=[[0, M], [M, 0]], rho=rho, theta=theta)
    return y

In the input demographic model to the simulations, we had the ancestral effective population size as 10,000, the size of deme0 was 2,000, and the size of deme1 was 20,000. The populations split 1,500 generations ago, and exchanged migrants symmetrically at a rate of 0.0001 per-generation. Converted into genetic units, nu0 = 0.2, nu1 = 2, T=1500 / 2 / 10000 = 0.075, and M = 2 * 10000 * 0.0001 = 2.0.

Running optimization

Optimization with moments.LD, much like moments optimization with the SFS, includes a handful functions that serve as wrappers for scipy optimizers with options specific to working with LD statistics. The two primary functions in moments.LD.Inference are

optimize_log_fmin: Uses the downhill simplex algorithm on the log of the parameters.
optimize_log_powell: Uses the modified Powell’s method, which optimizes slices of parameter space sequentially.

Each optimization method accepts the same arguments. Required positional arguments are

p0: The initial guess for the parameters in model_func.
data: Structured as a list of lists of data means and data var-cov matrices. I.e., data = [[means[0], means[1], ...], [varcovs[0], varcovs[1], ...]], with the final entry of the lists the means and varcovs of the heterozygosity statistics.
model_func: The demographic model to be fit (see above section). Importantly, this is a list, where the first entry is the LD model, which is always used, and the optional second entry is a demographic model for the SFS (which is a rarely used option and can be ignored). So usually, we would set model_func as [model_func_ld].

Additionally, we will almost always pass the list of unscaled recombination bin edges as rs = [r0, r1, ..., rn], which defines n recombination bins.

The effective population size plays a different role in LD inference than it does in SFS inference. For the site frequency spectrum, \(N_e\) merely acts as a linear scaling factor and is absorbed by the scaled mutation rate \(\theta\), which is treated as a free parameter. Here, \(N_e\) instead rescales recombination rates, and because we use a recombination map to determine the binning of data by recombination distances separating loci, \(N_e\) is a parameter that must be either passed as a fixed value or simultaneously fit in the optimization.

If Ne is a fixed value, we specify the population size using that keyword argument. Otherwise, if Ne is to be fit, our list of parameters to fit by convention includes Ne in the final position in the list. Typically, Ne is not a parameter of the demographic model, as we work in rescaled genetic units, so the parameters that get passed to model_func are params[:-1]. However, it is also possible to write a demographic model that also uses Ne as a parameter. In this case we set pass_Ne to True, so that Ne both rescales recombination rates and is a model parameter, and all params are passed to model_func.

Ne: The effective population size, used to rescale rs to get rhos = 4 * Ne * rs.
pass_Ne: Defaults to False. If True, the demographic model includes Ne as a parameter (in the final position of input parameters).

Other commonly used options include

fixed_params: Defaults to None. To fix some parameters, this should be a list of equal length as p0, with None for parameters to be fit and fixed values at corresponding indexes.
lower_bound: Defaults to None. Constraints on the lower bounds during optimization. These are given as lists of the same length of the parameters.
upper_bound: Defaults to None. Constraints on the upper bounds during optimization. These are given as lists of the same length of the parameters.
statistics: Defaults to None, which assumes that all statistics are present and in the conventional default order. If the data is missing some statistics, we must specify which statistics are present using the subset of statistic names given by moments.LD.Util.moment_names(num_pops).
normalization: Defaults to 0. The index of the population to normalize by, which should match the population index that we normalized by when parsing the data.
verbose: If an integer greater than 0, prints updates of the optimization procedure at intervals given by that spacing.

Example

Using the data simulated in the Parsing section, we can refit the demographic model under a parameterized IM model. For this, we could use the moments.LD.Demographics2D.split_mig model as our model_func, which is equivalent to the function we defined above (which we use in this example). After loading the data and setting up the inference options, we’ll use optimize_log_fmin to fit the model.

import moments.LD
import pickle

data = pickle.load(open("data/means.varcovs.split_mig.100_reps.bp", "rb"))

rs = [0, 1e-6, 2e-6, 5e-6, 1e-5, 2e-5, 5e-5, 1e-4, 2e-4, 5e-4, 1e-3]

p_guess = [0.1, 2.0, 0.075, 2.0, 10000]
p0 = moments.LD.Util.perturb_params(p_guess, fold=0.2)

# run optimization
opt_params, LL = moments.LD.Inference.optimize_log_fmin(
    p_guess,
    [data["means"], data["varcovs"]],
    [model_func],
    rs=rs,
    verbose=40,
)

# get physical units, rescaling by Ne
physical_units = moments.LD.Util.rescale_params(
    opt_params, ["nu", "nu", "T", "m", "Ne"]
)

print("best fit parameters:")
print(f"  N(deme0)         :  {physical_units[0]:.1f}")
print(f"  N(deme1)         :  {physical_units[1]:.1f}")
print(f"  Div. time (gen)  :  {physical_units[2]:.1f}")
print(f"  Migration rate   :  {physical_units[3]:.6f}")
print(f"  N(ancestral)     :  {physical_units[4]:.1f}")

40      , -214.296    , array([ 0.140269   ,  1.95203    ,  0.0514585  ,  2.07543    ,  12430      ])

80      , -130.758    , array([ 0.153478   ,  2.00214    ,  0.0541354  ,  2.11845    ,  11248.2    ])

120     , -77.1877    , array([ 0.184635   ,  1.80666    ,  0.0673507  ,  2.0472     ,  11068.5    ])

160     , -76.8557    , array([ 0.186388   ,  1.80029    ,  0.0685854  ,  2.05002    ,  11008.4    ])

200     , -76.823     , array([ 0.185232   ,  1.80143    ,  0.0679905  ,  2.04797    ,  11010.4    ])

240     , -76.8106    , array([ 0.185808   ,  1.80374    ,  0.0683159  ,  2.05049    ,  11000.2    ])

280     , -76.4522    , array([ 0.188075   ,  1.81258    ,  0.0692172  ,  2.02497    ,  10970.3    ])

320     , -74.2099    , array([ 0.183888   ,  1.92246    ,  0.0656908  ,  1.82861    ,  10918.1    ])

360     , -72.9169    , array([ 0.191533   ,  2.01901    ,  0.0680872  ,  1.70206    ,  10638.2    ])

400     , -72.7291    , array([ 0.192878   ,  2.06349    ,  0.0681534  ,  1.66157    ,  10576      ])

440     , -72.7222    , array([ 0.193385   ,  2.06172    ,  0.0684375  ,  1.67061    ,  10567.8    ])

best fit parameters:
  N(deme0)         :  2042.8
  N(deme1)         :  21785.4
  Div. time (gen)  :  1445.7
  Migration rate   :  0.000079
  N(ancestral)     :  10569.8

These should be pretty close to the input demographic parameters from the simulations! They won’t be spot on, as this was only using 100Mb of simulated data, but we should be in the ballpark.

Computing confidence intervals

When running demographic inference, we get a point estimate for the best fit demographic parameters. However, for an unknown underlying true value, it’s important to also estimate what’s called a confidence interval. The CI tells us the probability that the true value lies within some range, and provides some information about which parameters in our demographic model are tightly constrained and which parameters we have little power to pin down.

moments.LD can estimate confidence intervals using either the Fisher Information Matrix (FIM) or the Godambe Information Matrix (GIM). In almost all cases when using real data (or even most simulated data), the FIM will estimate a much smaller CI than the GIM. This occurs because the FIM assumes all data points that we’ve used are independent, when in reality there is linkage that causes data points to be sometimes highly correlated between pairs of loci and between recombination bins. The Godambe method uses bootstrap-resampled replicates of the data to account for this correlation and does a much better job at estimating the true underlying CIs [Coffman2016].

Note

If you use the Godambe approach to estimate confidence intervals, please cite [Coffman2016]. Alec originally implemented this approach in dadi, and moments has more-or-less used this same implementation here.

To create bootstrap replicates from the dictionary of data sums computed over regions, where rep_data = {0: ld_stats_0, 1: ld_stats_1, ...}, e.g., we use

num_boots = 100
norm_idx = 0
bootstrap_sets = moments.LD.Parsing.get_bootstrap_sets(
    rep_data, num_bootstraps=num_boots, normalization=norm_idx)

These bootstrap sets can then be used as the inputs to the moments.LD.Godambe methods. The two CI estimation methods are

FIM_uncert: Uses the Fisher Information Matrix. Usage is FIM_uncert(model_func, opt_params, means, varcovs, r_edges=rs).
GIM_uncert: Uses the Godambe Information Matrix. Usage is GIM_uncert(model_func, bootstrap_sets, opt_params, means, varcovs, r_edges=rs).

In each case, the model function is the same as used in inference (some manipulation may be needed if we had any fixed parameters), means and varcovs are the same data as input to the inference function, and r_edges are the bin edges used in the inference. Additional options for some corner cases are described in the API reference for LD methods.

Example

We’ll use both the FIM and GIM to compute uncertainties from the above example inference.

Using the FIM approach:

# using FIM
uncerts_FIM = moments.LD.Godambe.FIM_uncert(
    model_func,
    opt_params,
    data["means"],
    data["varcovs"],
    r_edges=rs,
)

# lower and upper CIs, in genetic units
lower = opt_params - 1.96 * uncerts_FIM
upper = opt_params + 1.96 * uncerts_FIM

# convert to physical units
lower_pu = moments.LD.Util.rescale_params(lower, ["nu", "nu", "T", "m", "Ne"])
upper_pu = moments.LD.Util.rescale_params(upper, ["nu", "nu", "T", "m", "Ne"])

print("95% CIs:")
print(f"  N(deme0)         :  {lower_pu[0]:.1f} - {upper_pu[0]:.1f}")
print(f"  N(deme1)         :  {lower_pu[1]:.1f} - {upper_pu[1]:.1f}")
print(f"  Div. time (gen)  :  {lower_pu[2]:.1f} - {upper_pu[2]:.1f}")
print(f"  Migration rate   :  {lower_pu[3]:.6f} - {upper_pu[3]:.6f}")
print(f"  N(ancestral)     :  {lower_pu[4]:.1f} - {upper_pu[4]:.1f}")

95% CIs:
  N(deme0)         :  1828.5 - 2267.9
  N(deme1)         :  18863.7 - 24873.3
  Div. time (gen)  :  1269.6 - 1631.3
  Migration rate   :  0.000069 - 0.000088
  N(ancestral)     :  10165.2 - 10974.3

And using the GIM approach:

bootstrap_sets = pickle.load(open("data/bootstrap_sets.split_mig.100_reps.bp", "rb"))

# using GIM
uncerts_GIM = moments.LD.Godambe.GIM_uncert(
    model_func,
    bootstrap_sets,
    opt_params,
    data["means"],
    data["varcovs"],
    r_edges=rs,
)

# lower and upper CIs, in genetic units
lower = opt_params - 1.96 * uncerts_GIM
upper = opt_params + 1.96 * uncerts_GIM

# convert to physical units
lower_pu = moments.LD.Util.rescale_params(lower, ["nu", "nu", "T", "m", "Ne"])
upper_pu = moments.LD.Util.rescale_params(upper, ["nu", "nu", "T", "m", "Ne"])

print("95% CIs:")
print(f"  N(deme0)         :  {lower_pu[0]:.1f} - {upper_pu[0]:.1f}")
print(f"  N(deme1)         :  {lower_pu[1]:.1f} - {upper_pu[1]:.1f}")
print(f"  Div. time (gen)  :  {lower_pu[2]:.1f} - {upper_pu[2]:.1f}")
print(f"  Migration rate   :  {lower_pu[3]:.6f} - {upper_pu[3]:.6f}")
print(f"  N(ancestral)     :  {lower_pu[4]:.1f} - {upper_pu[4]:.1f}")

95% CIs:
  N(deme0)         :  1581.2 - 2551.3
  N(deme1)         :  17105.0 - 26931.2
  Div. time (gen)  :  1030.5 - 1906.9
  Migration rate   :  0.000059 - 0.000096
  N(ancestral)     :  9854.5 - 11285.0

We can see above that the FIM uncertainties are considerably smaller (i.e. more constrained) than the GIM uncertainties. However, the GIM uncertainties are to be preferred here, as they more accurately estimate the underlying true uncertainty in the demographic inference.

References

[Coffman2016] (1,2)

Coffman, Alec J., et al. “Computationally efficient composite likelihood statistics for demographic inference.” Molecular biology and evolution 33.2 (2016): 591-593.