inferGenotypeBayesian - Infer a subject-specific genotype using a Bayesian approach
Description¶
inferGenotypeBayesian infers a subject’s genotype by applying a Bayesian framework
with a Dirichlet prior for the multinomial distribution. Up to four distinct alleles are
allowed in an individual’s genotype. Four likelihood distributions were generated by
empirically fitting three high coverage genotypes from three individuals
(Laserson and Vigneault et al, 2014). A posterior probability is calculated for the
four most common alleles. The certainty of the highest probability model is
calculated using a Bayes factor (the likelihood of the most likely model divided by that of the second-most likely model).
The larger the Bayes factor (K), the greater the certainty in the model.
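As a sketch of the Bayes factor described above, written in notation that is assumed here rather than taken from the package: let D be the observed allele counts for a gene, and let M_1 and M_2 be the most likely and second-most likely zygosity models. The k_diff column of the output is documented as the log10 of this ratio.

```latex
K = \frac{P(D \mid M_1)}{P(D \mid M_2)},
\qquad
\log_{10} K = \log_{10} P(D \mid M_1) - \log_{10} P(D \mid M_2)
```

On the log10 scale the ratio becomes a difference, so a value of 2 means the best model is 100 times more likely than the runner-up.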
Usage¶
inferGenotypeBayesian(
    data,
    germline_db = NA,
    novel = NA,
    v_call = "v_call",
    seq = "sequence_alignment",
    find_unmutated = TRUE,
    priors = c(0.6, 0.4, 0.4, 0.35, 0.25, 0.25, 0.25, 0.25, 0.25)
)
Arguments¶
- data: a data.frame containing V allele calls from a single subject. If find_unmutated is TRUE, then the sample IMGT-gapped V(D)J sequence should be provided in column sequence_alignment.
- germline_db: named vector of sequences containing the germline sequences named in the allele calls. Only required if find_unmutated is TRUE.
- novel: an optional data.frame of the type novel returned by findNovelAlleles containing germline sequences that will be utilized if find_unmutated is TRUE. See Details.
- v_call: column in data with V allele calls. Default is "v_call".
- seq: name of the column in data with the aligned, IMGT-numbered, V(D)J nucleotide sequence. Default is "sequence_alignment".
- find_unmutated: if TRUE, use germline_db to find which samples are unmutated. Not needed if the allele calls only represent unmutated samples.
- priors: a numeric vector of priors for the multinomial distribution. The priors vector must contain nine values that define the priors for the heterozygous (two allele), trizygous (three allele), and quadrozygous (four allele) distributions. The first two values of priors define the prior for the heterozygous case, the next three values are for the trizygous case, and the final four values are for the quadrozygous case. Each set of priors should sum to one. Note that each distribution prior is actually defined internally by a set of four numbers, with the unspecified final values assigned to 0; e.g., the heterozygous case is c(priors[1], priors[2], 0, 0). The prior for the homozygous distribution is fixed at c(1, 0, 0, 0).
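The layout of the priors vector can be illustrated with a short sketch. The helper expand_priors below is hypothetical and not part of the package; it only mirrors the description above (two, three, and four values, each padded with zeros to length four) and checks that each set sums to one.

```r
# Default priors from the Usage section
priors <- c(0.6, 0.4, 0.4, 0.35, 0.25, 0.25, 0.25, 0.25, 0.25)

# Hypothetical helper (not a package function): split the nine values
# into the four zero-padded vectors described in the argument text
expand_priors <- function(priors) {
  stopifnot(is.numeric(priors), length(priors) == 9)
  list(
    homozygous   = c(1, 0, 0, 0),           # fixed, not user-specified
    heterozygous = c(priors[1:2], 0, 0),    # values 1-2
    trizygous    = c(priors[3:5], 0),       # values 3-5
    quadrozygous = priors[6:9]              # values 6-9
  )
}

expanded <- expand_priors(priors)

# Each set should sum to one (compared with a floating-point tolerance)
sums <- vapply(expanded, sum, numeric(1))
stopifnot(all(abs(sums - 1) < 1e-8))
```

Running a check like this before calling inferGenotypeBayesian with custom priors can catch a vector of the wrong length or a set that does not sum to one.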
Value¶
A data.frame of alleles denoting the genotype of the subject with the log10
of the likelihood of each model and the log10 of the Bayes factor. The output
contains the following columns:
- gene: The gene name without allele.
- alleles: Comma separated list of alleles for the given gene.
- counts: Comma separated list of observed sequences for each corresponding allele in the alleles list.
- total: The total count of observed sequences for the given gene.
- note: Any comments on the inference.
- kh: log10 likelihood that the gene is homozygous.
- kd: log10 likelihood that the gene is heterozygous.
- kt: log10 likelihood that the gene is trizygous.
- kq: log10 likelihood that the gene is quadrozygous.
- k_diff: log10 ratio of the highest to second-highest zygosity likelihoods.
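To make the relationship between the likelihood columns and k_diff concrete, the following minimal sketch recomputes k_diff for one gene from its kh, kd, kt, and kq values. The numbers are copied from the IGHV1-58 row of the example output on this page; the variable names are illustrative only.

```r
# log10 likelihoods for IGHV1-58, copied from the example output
k <- c(kh = -20.3932114156223,
       kd = 3.60009261357983,
       kt = -1.38512929425796,
       kq = -8.47869574581951)

# Rank the four zygosity models from most to least likely
ranked <- sort(k, decreasing = TRUE)
best_model <- names(ranked)[1]

# On the log10 scale the ratio of likelihoods is a difference
k_diff <- unname(ranked[1] - ranked[2])
```

Here the heterozygous model (kd) ranks first and the difference agrees with the reported k_diff of about 4.985, which matches the two alleles listed for that gene.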
Details¶
Allele calls representing cases where multiple alleles have been
assigned to a single sample sequence are rare among unmutated
sequences but may result if nucleotides for certain positions are
not available. Calls containing multiple alleles are treated as
belonging to all groups. If novel is provided, all
sequences that are assigned to the same starting allele as any
novel germline allele will have the novel germline allele appended
to their assignment prior to searching for unmutated sequences.
Note¶
This method works best with data derived from blood, where a large portion of sequences are expected to be unmutated. Ideally, there should be hundreds of allele calls per gene in the input.
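One way to check whether the input meets this guideline is to count allele calls per gene before running the inference. The sketch below uses base R only and toy data standing in for the v_call column; the threshold min_calls and the decision to count only the first call of a multi-allele entry are assumptions made for illustration, not package behavior.

```r
# Toy allele calls standing in for the v_call column (hypothetical data)
v_call <- c("IGHV1-2*02", "IGHV1-2*04", "IGHV1-2*02,IGHV1-2*04",
            "IGHV1-3*01", "IGHV1-69*01,IGHV1-69D*01")

# Keep the first call of each entry, then drop the allele suffix
first_call <- sub(",.*$", "", v_call)
gene <- sub("\\*.*$", "", first_call)

# Number of calls observed for each gene
calls_per_gene <- table(gene)

# Flag genes below an illustrative threshold
min_calls <- 100
low_coverage <- names(calls_per_gene)[calls_per_gene < min_calls]
```

Genes listed in low_coverage may yield less reliable genotype calls and are worth reviewing by hand.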
References¶
- Laserson U and Vigneault F, et al. High-resolution antibody dynamics of vaccine-induced immune responses. PNAS. 2014 111(13):4928-33.
Examples¶
# Infer IGHV genotype, using only unmutated sequences, including novel alleles
inferGenotypeBayesian(AIRRDb, germline_db=SampleGermlineIGHV, novel=SampleNovel,
find_unmutated=TRUE, v_call="v_call", seq="sequence_alignment")
gene alleles counts total
1 IGHV1-2 02,04 664,302 966
2 IGHV1-3 01 226 226
3 IGHV1-8 01,02_G234T 467,370 837
4 IGHV1-18 01 1005 1005
5 IGHV1-24 01 105 105
6 IGHV1-46 01 624 624
7 IGHV1-58 01,02 23,18 41
8 IGHV1-69 01,04,06,02 515,469,280,15 1279
9 IGHV1-69-2 01 31 31
note kh
1 -1000
2 4.20089197988625
3 -1000
4 -3.76643736033536
5 4.75335701924247
6 0.457455409315221
7 -20.3932114156223
8 Cannot distinguish IGHV1-69*01 and IGHV1-69D*01 -1000
9 4.16107190423977
kd kt kq k_diff
1 -7.92846809405969 -139.556367176944 -313.583949130729 131.627899082884
2 -45.2911957825576 -84.2865868763307 -128.991761853586 49.4920877624439
3 -1.04759115960507 -102.524664723923 -247.193958844361 101.477073564318
4 -223.85293382607 -1000 -1000 220.086496465735
5 -18.2407545518045 -36.3580822723628 -57.1281856909991 22.9941115710469
6 -136.193264784335 -243.861955237939 -1000 136.65072019365
7 3.60009261357983 -1.38512929425796 -8.47869574581951 4.9852219078378
8 -277.291087469703 3.55051520054669 -143.380669247128 146.931184447674
9 -2.62766579768837 -7.97659112471034 -14.1087168959268 6.78873770192814
See also¶
plotGenotype for a colorful visualization and genotypeFasta to convert the genotype to nucleotide sequences. See inferGenotype to infer a subject-specific genotype using a frequency method.