inferGenotypeBayesian - Infer a subject-specific genotype using a Bayesian approach

Description

inferGenotypeBayesian infers a subject’s genotype by applying a Bayesian framework with a Dirichlet prior for the multinomial distribution. Up to four distinct alleles are allowed in an individual’s genotype. Four likelihood distributions were generated by empirically fitting three high-coverage genotypes from three individuals (Laserson and Vigneault et al, 2014). A posterior probability is calculated for the four most common alleles. The certainty of the highest-probability model is calculated using a Bayes factor (the likelihood of the most likely model divided by that of the second-most likely model). The larger the Bayes factor (K), the greater the certainty in the model.

Usage

inferGenotypeBayesian(data, germline_db = NA, novel = NA,
v_call = "V_CALL", find_unmutated = TRUE, priors = c(0.6, 0.4, 0.4,
0.35, 0.25, 0.25, 0.25, 0.25, 0.25))

Arguments

data
a data.frame containing V allele calls from a single subject. If find_unmutated is TRUE, then the sample IMGT-gapped V(D)J sequence should be provided in a column named "SEQUENCE_IMGT".
germline_db
named vector of sequences containing the germline sequences named in the allele calls found in the v_call column of data. Only required if find_unmutated is TRUE.
novel
an optional data.frame of the type novel returned by findNovelAlleles containing germline sequences that will be utilized if find_unmutated is TRUE. See Details.
v_call
column in data with V allele calls. Default is "V_CALL".
find_unmutated
if TRUE, use germline_db to find which samples are unmutated. Not needed if the allele calls in data only represent unmutated samples.
priors
a numeric vector of priors for the multinomial distribution. The priors vector must contain nine values that define the priors for the heterozygous (two allele), trizygous (three allele), and quadrozygous (four allele) distributions. The first two values of priors define the prior for the heterozygous case, the next three values are for the trizygous case, and the final four values are for the quadrozygous case. Each set of priors should sum to one. Note that each distribution prior is actually defined internally by a set of four numbers, with the unspecified final values assigned to 0; e.g., the heterozygous case is c(priors[1], priors[2], 0, 0). The prior for the homozygous distribution is fixed at c(1, 0, 0, 0).
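The layout of the priors vector can be illustrated with a minimal sketch. The helper split_priors below is hypothetical and is not part of the package; it only mirrors the description above (two, three, and four values, each padded with zeros to length four), and makes no claim about how the function is implemented internally.

```r
# Hypothetical helper (not part of the package): split a nine-value priors
# vector into the per-model priors described above, each of length four.
split_priors <- function(priors) {
  stopifnot(is.numeric(priors), length(priors) == 9)
  list(
    homozygous   = c(1, 0, 0, 0),          # fixed, not taken from priors
    heterozygous = c(priors[1:2], 0, 0),   # first two values, zero-padded
    trizygous    = c(priors[3:5], 0),      # next three values, zero-padded
    quadrozygous = priors[6:9]             # final four values
  )
}

# The default priors from the Usage section
model_priors <- split_priors(c(0.6, 0.4, 0.4, 0.35, 0.25, 0.25, 0.25, 0.25, 0.25))

# Each set of priors should sum to one
prior_sums <- vapply(model_priors, sum, numeric(1))
```

A check of this kind may be useful before passing a custom priors vector, since a set that does not sum to one is easy to produce by accident.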

Value

A data.frame of alleles denoting the genotype of the subject with the log10 of the likelihood of each model and the log10 of the Bayes factor. The output contains the following columns:

  • GENE: The gene name without allele.
  • ALLELES: Comma separated list of alleles for the given GENE.
  • COUNTS: Comma separated list of observed sequences for each corresponding allele in the ALLELES list.
  • TOTAL: The total count of observed sequences for the given GENE.
  • NOTE: Any comments on the inference.
  • KH: log10 likelihood that the GENE is homozygous.
  • KD: log10 likelihood that the GENE is heterozygous.
  • KT: log10 likelihood that the GENE is trizygous.
  • KQ: log10 likelihood that the GENE is quadrozygous.
  • K_DIFF: log10 ratio of the highest to second-highest zygosity likelihoods.
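Because the likelihood columns are on the log10 scale, the log10 ratio in K_DIFF is the difference between the two largest of KH, KD, KT, and KQ. The sketch below uses a hypothetical helper, k_diff, to reproduce that arithmetic from the IGHV1-58 row of the output shown in the Examples section; it illustrates the column definition rather than the package's internal code.

```r
# Hypothetical helper (not part of the package): difference between the
# largest and second-largest log10 likelihoods, i.e. the log10 of their ratio.
k_diff <- function(kh, kd, kt, kq) {
  k <- sort(c(kh, kd, kt, kq), decreasing = TRUE)
  k[1] - k[2]
}

# Values copied from the IGHV1-58 row of the example output
k <- k_diff(kh = -20.3932114156223, kd = 3.60009261357983,
            kt = -1.38512929425796, kq = -8.47869574581951)

# Bayes factor on the linear scale
bayes_factor <- 10^k
```

Here the heterozygous model (KD) is the most likely and the trizygous model (KT) is second, so k agrees with the K_DIFF value of about 4.985 reported for that gene.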

Details

Allele calls representing cases where multiple alleles have been assigned to a single sample sequence are rare among unmutated sequences but may result if nucleotides for certain positions are not available. Calls containing multiple alleles are treated as belonging to all groups. If novel is provided, all sequences that are assigned to the same starting allele as any novel germline allele will have the novel germline allele appended to their assignment prior to searching for unmutated sequences.
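The treatment of multi-allele calls can be pictured with a small base R sketch. The calls below are made up for illustration, and the approach (split on commas, then tabulate) is an assumption about how "belonging to all groups" might be counted, not the package's actual implementation.

```r
# Made-up V allele calls; the second call is ambiguous between two alleles
calls <- c("IGHV1-2*02", "IGHV1-2*02,IGHV1-2*04", "IGHV1-2*04", "IGHV1-2*02")

# Split each call on commas and count every allele it mentions, so that an
# ambiguous call contributes to each of the alleles it lists
allele_counts <- table(unlist(strsplit(calls, ",", fixed = TRUE)))
```

Under this counting scheme the four sequences yield five allele observations, because the ambiguous sequence is counted once for each of its two alleles.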

Note

This method works best with data derived from blood, where a large portion of sequences are expected to be unmutated. Ideally, there should be hundreds of allele calls per gene in the input.

References

  1. Laserson U and Vigneault F, et al. High-resolution antibody dynamics of vaccine-induced immune responses. PNAS. 2014;111(13):4928-33.

Examples

# Infer IGHV genotype, using only unmutated sequences, including novel alleles
inferGenotypeBayesian(SampleDb, germline_db=GermlineIGHV, novel=SampleNovel, 
find_unmutated=TRUE)
        GENE     ALLELES         COUNTS TOTAL NOTE                KH
1    IGHV1-2       02,04        664,302   966                  -1000
2    IGHV1-3          01            226   226       4.20089197988625
3    IGHV1-8 01,02_G234T        467,370   837                  -1000
4   IGHV1-18          01           1005  1005      -3.76643736033536
5   IGHV1-24          01            105   105       4.75335701924247
6   IGHV1-46          01            624   624      0.457455409315221
7   IGHV1-58       01,02          23,18    41      -20.3932114156223
8   IGHV1-69 01,04,06,02 515,469,280,15  1279                  -1000
9 IGHV1-69-2          01             31    31       4.16107190423977
                 KD                KT                KQ           K_DIFF
1 -7.92846809405969 -139.556367176944 -313.583949130729 131.627899082884
2 -45.2911957825576 -84.2865868763307 -128.991761853586 49.4920877624439
3 -1.04759115960507 -102.524664723923 -247.193958844361 101.477073564318
4  -223.85293382607             -1000             -1000 220.086496465735
5 -18.2407545518045 -36.3580822723628 -57.1281856909991 22.9941115710469
6 -136.193264784335 -243.861955237939             -1000  136.65072019365
7  3.60009261357983 -1.38512929425796 -8.47869574581951  4.9852219078378
8 -277.291087469703  3.55051520054669 -143.380669247128 146.931184447674
9 -2.62766579768837 -7.97659112471034 -14.1087168959268 6.78873770192814

See also

See plotGenotype for a colorful visualization and genotypeFasta to convert the genotype to nucleotide sequences. See inferGenotype to infer a subject-specific genotype using a frequency method.