inferGenotype - Infer a subject-specific genotype using a frequency method
Description¶
inferGenotype
infers an subject’s genotype using a frequency method.
The genotype is inferred by finding the minimum number set of alleles that
can explain the majority of each gene’s calls. The most common allele of
each gene is included in the genotype first, and the next most common allele
is added until the desired fraction of alleles can be explained. In this
way, mistaken allele calls (resulting from sequences which
by chance have been mutated to look like another allele) can be removed.
Usage¶
inferGenotype(
data,
germline_db = NA,
novel = NA,
v_call = "v_call",
seq = "sequence_alignment",
fraction_to_explain = 0.875,
gene_cutoff = 1e-04,
find_unmutated = TRUE
)
Arguments¶
- data
data.frame
containing V allele calls from a single subject.- germline_db
- named vector of sequences containing the
germline sequences named in
allele_calls
. Only required iffind_unmutated
isTRUE
. - novel
- optional
data.frame
of the type novel returned by findNovelAlleles containing germline sequences that will be utilized iffind_unmutated
isTRUE
. See Details. - v_call
- column in
data
with V allele calls. Default is"v_call"
. - seq
- name of the column in
data
with the aligned, IMGT-numbered, V(D)J nucleotide sequence. Default issequence_alignment
. - fraction_to_explain
- the portion of each gene that must be explained by the alleles that will be included in the genotype.
- gene_cutoff
- either a number of sequences or a fraction of
the length of
allele_calls
denoting the minimum number of times a gene must be observed inallele_calls
to be included in the genotype. - find_unmutated
- if
TRUE
, usegermline_db
to find which samples are unmutated. Not needed ifallele_calls
only represent unmutated samples.
Value¶
A data.frame
of alleles denoting the genotype of the subject containing
the following columns:
gene
: The gene name without allele.alleles
: Comma separated list of alleles for the givengene
.counts
: Comma separated list of observed sequences for each corresponding allele in thealleles
list.total
: The total count of observed sequences for the givengene
.note
: Any comments on the inferrence.
Details¶
Allele calls representing cases where multiple alleles have been
assigned to a single sample sequence are rare among unmutated
sequences but may result if nucleotides for certain positions are
not available. Calls containing multiple alleles are treated as
belonging to all groups. If novel
is provided, all
sequences that are assigned to the same starting allele as any
novel germline allele will have the novel germline allele appended
to their assignent prior to searching for unmutated sequences.
Note¶
This method works best with data derived from blood, where a large portion of sequences are expected to be unmutated. Ideally, there should be hundreds of allele calls per gene in the input.
Examples¶
# Infer IGHV genotype, using only unmutated sequences, including novel alleles
inferGenotype(AIRRDb, germline_db=SampleGermlineIGHV, novel=SampleNovel,
find_unmutated=TRUE)
gene alleles counts total
1 IGHV1-2 02,04 664,302 966
2 IGHV1-3 01 226 226
3 IGHV1-8 01,02_G234T 467,370 837
4 IGHV1-18 01 1005 1005
5 IGHV1-24 01 105 105
6 IGHV1-46 01 624 624
7 IGHV1-58 01,02 23,18 41
8 IGHV1-69 01,04,06 515,469,280 1279
9 IGHV1-69-2 01 31 31
note
1
2
3
4
5
6
7
8 Cannot distinguish IGHV1-69*01 and IGHV1-69D*01
9
See also¶
plotGenotype for a colorful visualization and genotypeFasta to convert the genotype to nucleotide sequences. See inferGenotypeBayesian to infer a subject-specific genotype using a Bayesian approach.