Parsing LD statistics

As described in the multi-population LD section, we are interested in \(\sigma_d^2\)-type statistics, which are ratios of expectations of \(D^2\) and \(\pi_2 = p(1-p)q(1-q)\). Here, \(p\) and \(q\) are the allele frequencies at the left and right loci, respectively.
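Written out, the normalized statistic underlying these ratios is

\[\sigma_d^2 = \frac{\mathbb{E}[D^2]}{\mathbb{E}[\pi_2]} = \frac{\mathbb{E}[D^2]}{\mathbb{E}[p(1-p)q(1-q)]}.\]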

To estimate these statistics from data, we take the average of each LD statistic over all pairs of observed (biallelic) SNPs at a given recombination distance, and then divide by the observed \(\pi_2\) in one of the populations (the “normalizing” population). As described below, we also use a block bootstrap approach to estimate variances and covariances of the observed statistics at each recombination distance, which are used in inference and in computing confidence intervals.

Binned LD decay

To estimate LD decay curves from the data, we bin all pairs of observed SNPs by recombination distance. While we can bin by the physical distance (in bp) separating SNPs, genetic maps are non-uniform, and physical distance does not perfectly correlate with genetic distance at small scales. If a recombination map is available, it is preferable to compute and compare statistics using that map.

Recombination rate bins are defined by bin edges, given as a list or array whose length equals the number of desired bins plus one. Bin edges must be monotonically increasing, so bins are adjacent with no gaps between them. Bins are defined as semi-open intervals:

import moments.LD
import numpy as np

bin_edges = np.array([0, 1e-6, 1e-5, 1e-4])
print("Bins:")
for b_l, b_r in zip(bin_edges[:-1], bin_edges[1:]):
    print(f"[{b_l}, {b_r})")
Bins:
[0.0, 1e-06)
[1e-06, 1e-05)
[1e-05, 0.0001)

There are a few considerations to keep in mind. In practice, very short distances can be problematic, because “non-standard” evolutionary processes can distort allele frequency correlations for tightly linked loci. For example, our evolutionary model does not include multi-nucleotide mutations [Harris2014] or gene conversion [Ardlie2001], both of which operate at short distances.

Thus, when working with real data we recommend omitting bins at very short recombination distances. In practice, we typically drop bins below \(r = 5\times 10^{-6}\), which corresponds to roughly a few hundred bp on average in humans.
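For example, a set of bin edges that skips the shortest distances:

# bin edges omitting pairs at distances below r = 5e-6
bin_edges = np.array([5e-6, 1e-5, 2e-5, 5e-5, 1e-4, 2e-4, 5e-4, 1e-3])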

Parsing from a VCF

The primary function of the Parsing module is computing LD statistics from an input VCF. There are a number of options available, but the primary inputs are the path to the VCF file and the bins of distances separating loci. Typically, we work in recombination distance, in which case a recombination map is also required. If we do not have a recombination map available, we can bin by base pair distances instead.

The function moments.LD.Parsing.compute_ld_statistics returns a dictionary with the bins, the names of the computed statistics, the populations, and the sums of each statistic within each of the provided bins. For example:

r_bins = np.logspace(-6, -3, 7)
ld_stats = moments.LD.Parsing.compute_ld_statistics(
    vcf_path, r_bins=r_bins, rec_map_file=map_path)
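As noted above, if no recombination map is available we can bin by physical distance instead. A sketch, assuming the bp_bins argument takes bin edges in base pairs in place of r_bins:

# bin pairs of SNPs by base-pair separation instead of recombination distance
bp_bins = np.array([0, 10000, 20000, 50000, 100000])
ld_stats = moments.LD.Parsing.compute_ld_statistics(vcf_path, bp_bins=bp_bins)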

Using a recombination map

The input recombination map is specified as a text file, with the first column giving positions along the chromosome and additional column(s) defining the cumulative map(s), typically in units of cM. The header line is “Pos Map1 Map2 (etc)”, and we can use any map in the file by specifying its map_name. If no map name is given, or the specified name does not match a genetic map in the header, the first map column is used.

Typically, maps are given in units of centi-Morgans, and the default behavior is to assume cM units. If the map is given in units of Morgans, we need to set cM=False.
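As a sketch, selecting a named map column and overriding the unit assumption (the map name here is hypothetical):

ld_stats = moments.LD.Parsing.compute_ld_statistics(
    vcf_path,
    r_bins=r_bins,
    rec_map_file=map_path,
    map_name="Map1",  # use the map column labeled "Map1" in the header
    cM=False,         # only needed if this map is in Morgans rather than cM
)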

Populations and pop-file

We often have data from more than one population, so we need to be able to specify which samples in the VCF correspond to which populations. This is handled by passing a file that assigns each sample to a population. For example, the population file is written as

sample  pop
sid_0   pop_A
sid_1   pop_B
sid_2   pop_A
sid_3   pop_A
sid_4   pop_B
...

To use this population information, we also pass a list of the populations to compute statistics for. Samples from populations not in this list are dropped from the data.

pops = ["pop_A", "pop_B"]
ld_stats = moments.LD.Parsing.compute_ld_statistics(
    vcf_path,
    r_bins=r_bins,
    rec_map_file=map_path,
    pop_file=pop_file_path,
    pops=pops
)

Masking and using bed files

If there are multiple chromosomes or contigs included in the VCF, we specify which chromosome to compute statistics for by setting the chromosome flag. We can also subset a chromosome by including a bed file, which will filter out all SNPs that fall outside the region intervals given in the bed file. Bed files have the format {chrom}\t{left_pos}\t{right_pos}, which defines a semi-open interval. The path to the bed file is provided with the bed_file argument.
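For example, to restrict parsing to intervals on a single chromosome (the file names here are hypothetical):

ld_stats = moments.LD.Parsing.compute_ld_statistics(
    vcf_path,
    r_bins=r_bins,
    rec_map_file=map_path,
    chromosome="1",               # compute statistics for this chromosome only
    bed_file="regions.chr1.bed",  # keep only SNPs within these intervals
)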

Computing a subset of statistics

Sometimes we may wish to compute only a subset of the possible LD statistics. By default, the parsing function computes all statistics possible for the number of populations provided. Instead, we can specify stats_to_compute, a list (of length 2) of lists: the first contains the LD statistics to return, and the second the heterozygosity statistics to return. Statistic names follow the convention in moments.LD.Util.moment_names(num_pops) and should be formatted accordingly, as in the sketch below.
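For example, with two populations we could restrict to a few cross-population statistics (a minimal sketch; the subset chosen here is arbitrary):

# valid statistic names for two populations, split into LD and
# heterozygosity statistics
ld_names, h_names = moments.LD.Util.moment_names(2)

# an arbitrary subset: three LD statistics and one heterozygosity statistic
stats_to_compute = [["DD_0_1", "Dz_0_1_1", "pi2_0_1_0_1"], ["H_0_1"]]
ld_stats = moments.LD.Parsing.compute_ld_statistics(
    vcf_path,
    r_bins=r_bins,
    rec_map_file=map_path,
    pop_file=pop_file_path,
    pops=pops,
    stats_to_compute=stats_to_compute,
)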

Phased vs unphased data

We can compute LD statistics from either phased or unphased data. The default is to assume phasing is unknown, with use_genotypes=True. To compute LD from phased data instead, we set use_genotypes=False, and parsing uses the phased haplotypes. In general, phasing errors can bias LD statistics, sometimes significantly, while using genotypes instead of haplotypes only slightly increases uncertainty in most cases. We therefore usually recommend leaving use_genotypes=True.
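For example, if the VCF contains reliably phased haplotypes:

ld_stats = moments.LD.Parsing.compute_ld_statistics(
    vcf_path,
    r_bins=r_bins,
    rec_map_file=map_path,
    use_genotypes=False,  # treat the VCF genotypes as phased haplotypes
)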

Computing averages and covariances over regions

From moments.LD.Parsing.compute_ld_statistics(), we get LD statistic sums from the regions in a VCF, perhaps constrained by a bed file. Our strategy is to divide the data into some large number of roughly equally sized chunks, for example 500 regions across all 22 autosomes in human data. We then compute LD statistics independently for each region (it helps to parallelize this step on a compute cluster). From those outputs, we can compute average statistics genome-wide, as well as covariances of statistics within each bin. These covariances are needed to compute likelihoods and run optimization.

The outputs of compute_ld_statistics are compiled in a dictionary, where the keys are unique region identifiers and the values are the outputs of that function. For example:

region_stats = {
    0: moments.LD.Parsing.compute_ld_statistics(VCF, bed_file="region_0.bed", ...),
    1: moments.LD.Parsing.compute_ld_statistics(VCF, bed_file="region_1.bed", ...),
    2: moments.LD.Parsing.compute_ld_statistics(VCF, bed_file="region_2.bed", ...),
    ...
}

Mean and variance-covariance matrices are computed by calling bootstrap_data, passing the region statistics dictionary, and optionally the index of the population to normalize \(\sigma_d^2\) statistics by. By default, the normalizing population is the first (index 0).

mv = moments.LD.Parsing.bootstrap_data(region_stats)

mv contains the bins, statistics, and populations, as well as lists of mean statistics and variance-covariance matrices. This data can then be directly compared to model expectations and used in inference.
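For example (assuming the key names match the fields described above; "means" and "varcovs" appear in the plotting example further below):

means = mv["means"]      # list of mean statistics, one entry per bin
varcovs = mv["varcovs"]  # list of variance-covariance matrices, one per bin
# the final entry of each list holds the heterozygosity statistics, which is
# why the plotting calls below slice these lists with [:-1]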

Example

Using msprime [Kelleher2016], we’ll simulate data under an isolation-with-migration (IM) model and then compute LD and heterozygosity statistics using the LD.Parsing methods. The simulation uses the demes-msprime interface, and the simulated data are then written to a VCF.

The YAML-file specifying the model is

description: A simple isolation-with-migration model
time_units: generations
demes:
- name: anc
  epochs: [{start_size: 10000, end_time: 1500}]
- name: deme0
  ancestors: [anc]
  epochs: [{start_size: 2000}]
- name: deme1
  ancestors: [anc]
  epochs: [{start_size: 20000}]
migrations:
- demes: [deme0, deme1]
  rate: 1e-4

We then use msprime to simulate 1 Mb of data, with constant recombination and mutation rates.

import msprime
import demes
import os

# set up simulation parameters
L = 1e6
u = r = 1.5e-8
n = 10

g = demes.load("data/im-parsing-example.yaml")
demog = msprime.Demography.from_demes(g)

trees = msprime.sim_ancestry(
    {"deme0": n, "deme1": n},
    demography=demog,
    sequence_length=L,
    recombination_rate=r,
    random_seed=321,
)

trees = msprime.sim_mutations(trees, rate=u, random_seed=123)

with open("data/im-parsing-example.vcf", "w+") as fout:
    trees.write_vcf(fout)

This simulation had 10 diploid individuals per population, and msprime/tskit writes their IDs as tsk_0, tsk_1, etc:

##fileformat=VCFv4.2
##source=tskit 0.3.5
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=1000000>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	tsk_0	tsk_1	tsk_2	tsk_3	tsk_4	tsk_5	tsk_6	tsk_7	tsk_8	tsk_9	tsk_10	tsk_11	tsk_12	tsk_13	tsk_14	tsk_15	tsk_16	tsk_17	tsk_18	tsk_19
1	221	.	A	T	.	PASS	.	GT	1|1	1|0	1|1	1|1	0|1	0|1	0|0	1|1	0|1	1|1	1|1	1|1	1|1	1|1	1|1	1|1	1|1	1|1	1|1	1|1
1	966	.	A	C	.	PASS	.	GT	0|0	0|0	0|0	0|0	0|0	0|0	0|0	1|1	0|0	0|0	0|0	0|0	0|1	0|0	0|0	0|0	0|0	0|0	0|1	1|0
1	1082	.	G	A	.	PASS	.	GT	0|0	0|1	0|0	0|0	1|0	1|0	1|1	0|0	1|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0
1	1133	.	G	T	.	PASS	.	GT	0|0	0|1	0|0	0|0	1|0	1|0	1|1	0|0	1|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0

To parse these data, we need the file mapping samples to populations and the recombination map file (the total map length is \(1 \times 10^{6} \text{ bp} \times 1.5 \times 10^{-8} \text{ M/bp} \times 100 \text{ cM/M} = 1.5\) cM). The population file is

sample  pop
tsk_0   deme0
tsk_1   deme0
tsk_2   deme0
tsk_3   deme0
tsk_4   deme0
tsk_5   deme0
tsk_6   deme0
tsk_7   deme0
tsk_8   deme0
tsk_9   deme0
tsk_10  deme1
tsk_11  deme1
tsk_12  deme1
tsk_13  deme1
tsk_14  deme1
tsk_15  deme1
tsk_16  deme1
tsk_17  deme1
tsk_18  deme1
tsk_19  deme1

and the recombination map file is

Pos Map(cM)
0   0
1000000 1.5

With all this, we can now compute LD based on recombination distance bins:

r_bins = np.array(
    [0, 1e-6, 2e-6, 5e-6, 1e-5, 2e-5, 5e-5, 1e-4, 2e-4, 5e-4, 1e-3]
)

vcf_file = "data/im-parsing-example.vcf"
map_file = "data/im-parsing-example.map.txt"
pop_file = "data/im-parsing-example.samples.pops.txt"
pops = ["deme0", "deme1"]
ld_stats = moments.LD.Parsing.compute_ld_statistics(
    vcf_file,
    rec_map_file=map_file,
    pop_file=pop_file,
    pops=["deme0", "deme1"],
    r_bins=r_bins,
    report=False,
)

The output, ld_stats, is a dictionary with the keys bins, stats, pops, and sums. To get the average statistics over multiple regions (here, we only have a single region that we simulated), we use means_from_region_data:

means = moments.LD.Parsing.means_from_region_data(
    {0: ld_stats}, ld_stats["stats"], norm_idx=0
)

This provides \(\sigma_d^2\)-type statistics normalized by \(\pi_2\) in deme0, along with relative heterozygosities (also relative to deme0). These statistics were computed from only a single, relatively small region, so they will be quite noisy. But we can still compare them to expectations under the input IM demographic model.

import demes
g = demes.load("data/im-parsing-example.yaml")

y = moments.Demes.LD(
    g,
    sampled_demes=["deme0", "deme1"],
    rho=4 * g["anc"].epochs[0].start_size * r_bins,
)

# stats are computed at the bin edges - average to get midpoint estimates
y = moments.LD.LDstats(
    [(y_l + y_r) / 2 for y_l, y_r in zip(y[:-2], y[1:-1])] + [y[-1]],
    num_pops=y.num_pops,
    pop_ids=y.pop_ids,
)
y = moments.LD.Inference.sigmaD2(y)

# plot LD decay curves for some statistics
moments.LD.Plotting.plot_ld_curves_comp(
    y,
    means[:-1],
    [],
    rs=r_bins,
    stats_to_plot=[
        ["DD_0_0", "DD_0_1", "DD_1_1"],
        ["Dz_0_0_0", "Dz_0_1_1", "Dz_1_1_1"],
        ["pi2_0_0_1_1", "pi2_0_1_0_1", "pi2_1_1_1_1"]
    ],
    labels=[[r"$D_0^2$", r"$D_0 D_1$", r"$D_1^2$"],
        [r"$Dz_{0,0,0}$", r"$Dz_{0,1,1}$", r"$Dz_{1,1,1}$"],
        [r"$\pi_{2;0,0,1,1}$", r"$\pi_{2;0,1,0,1}$", r"$\pi_{2;1,1,1,1}$"]
    ],
    plot_vcs=False,
    fig_size=(8, 3),
    show=True,
)
(Figure: binned LD decay curves for the \(D^2\)-, \(Dz\)-, and \(\pi_2\)-type statistics from the single simulated region, compared to model expectations.)

Bootstrapping over multiple regions

Normally, we’ll want more data than a single 1 Mb region to compute averages and variances of statistics. Using the same approach as the example above, we computed ld_stats for 100 replicates (see the example in the moments repository). Each replicate’s set of statistics was placed in a dictionary, as rep_stats = {0: ld_stats_0, 1: ld_stats_1, ..., 99: ld_stats_99}. This dictionary can then be used to compute means and covariances of statistics.

mv = moments.LD.Parsing.bootstrap_data(rep_stats)

With more simulated data, the LD decay curves are much less noisy, and with multiple replicates, we can also compute the variance-covariance matrices for each bin and include standard errors in the plots.

# plot LD decay curves for some statistics
moments.LD.Plotting.plot_ld_curves_comp(
    y,
    mv["means"][:-1],
    mv["varcovs"][:-1],
    rs=r_bins,
    stats_to_plot=[
        ["DD_0_0", "DD_0_1", "DD_1_1"],
        ["Dz_0_0_0", "Dz_0_1_1", "Dz_1_1_1"],
        ["pi2_0_0_1_1", "pi2_0_1_0_1", "pi2_1_1_1_1"]
    ],
    labels=[[r"$D_0^2$", r"$D_0 D_1$", r"$D_1^2$"],
        [r"$Dz_{0,0,0}$", r"$Dz_{0,1,1}$", r"$Dz_{1,1,1}$"],
        [r"$\pi_{2;0,0,1,1}$", r"$\pi_{2;0,1,0,1}$", r"$\pi_{2;1,1,1,1}$"]
    ],
    plot_vcs=True,
    fig_size=(8, 3),
    show=True,
)
(Figure: the same LD decay curves averaged over 100 replicate regions, with confidence intervals computed from the variance-covariance matrices.)

Note

The means-covariances data is required for inference using LD statistics. In Inferring demography with LD, we’ll use the same mv data dictionary to refit the IM model as an example.

LD statistics in genotype blocks

moments.LD.Parsing also includes some functions for computing LD from genotype (or haplotype) blocks. Genotype blocks are arrays of shape \(L \times n\), where \(L\) is the number of loci and \(n\) is the sample size. We assume a single population, and compute \(D^2\), \(Dz\), \(\pi_2\), and \(D\), either pairwise or averaged over all pairwise comparisons.

If we have a genotype matrix containing n diploid samples, genotypes are coded as 0, 1, and 2, and we set genotypes=True. If we have a haplotype matrix with data from n haploid copies, genotypes are coded as 0 and 1 only, and we set genotypes=False.

For example, given a single genotype matrix, we compute all pairwise statistics and average statistics as shown below:

L = 10
n = 5
G = np.random.randint(3, size=L * n).reshape(L, n)

# all pairwise comparisons:
D2_pw, Dz_pw, pi2_pw, D_pw = moments.LD.Parsing.compute_pairwise_stats(G)

# averages:
D2_ave, Dz_ave, pi2_ave, D_ave = moments.LD.Parsing.compute_average_stats(G)
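The same functions accept haplotype matrices. Assuming the genotypes keyword described above applies to these functions, a sketch:

# haplotype matrix of 0/1 calls at the same loci
H = np.random.randint(2, size=L * n).reshape(L, n)

# pairwise statistics from phased haplotype data
D2_pw, Dz_pw, pi2_pw, D_pw = moments.LD.Parsing.compute_pairwise_stats(
    H, genotypes=False
)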

Similarly, we can compute the pairwise or average statistics between two genotype matrices. The matrices can have differing numbers of loci, but they must have the same number of samples, as the genotype matrices are assumed to come from different regions within the same set of samples.

L2 = 12
n = 5

G2 = np.random.randint(3, size=L2 * n).reshape(L2, n)

# all pairwise comparisons:
D2_pw, Dz_pw, pi2_pw, D_pw = moments.LD.Parsing.compute_pairwise_stats_between(G, G2)

# averages:
D2_ave, Dz_ave, pi2_ave, D_ave = moments.LD.Parsing.compute_average_stats_between(G, G2)

Note

Computing LD in genotype blocks relies on C extensions that are not built by default. To use these methods, build the extensions with the --ld_extensions flag, as in python setup.py build_ext --ld_extensions -i.

References

[Ardlie2001]

Ardlie, Kristin, et al. “Lower-than-expected linkage disequilibrium between tightly linked markers in humans suggests a role for gene conversion.” The American Journal of Human Genetics 69.3 (2001): 582-589.

[Harris2014]

Harris, Kelley, and Rasmus Nielsen. “Error-prone polymerase activity causes multinucleotide mutations in humans.” Genome Research 24.9 (2014): 1445-1454.

[Kelleher2016]

Kelleher, Jerome, Alison M. Etheridge, and Gilean McVean. “Efficient coalescent simulation and genealogical analysis for large sample sizes.” PLoS Computational Biology 12.5 (2016): e1004842.