inferGenotype - Infer a subject-specific genotype
inferGenotype infers a subject’s genotype by finding the minimum
set of alleles that can explain the majority of each gene’s calls. The
most common allele of each gene is included in the genotype first, and the
next most common allele is added until the desired fraction of alleles can be
explained. In this way, mistaken allele calls (resulting from sequences which
by chance have been mutated to look like another allele) can be removed.
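The selection rule described above can be sketched for a single gene in base R. This is only an illustration under simplifying assumptions, not the package's implementation: pick_alleles is a hypothetical helper, the calls are toy data, and the sketch ignores gene_cutoff, the unmutated-sequence filter, and calls that list several alleles.

```r
# Hypothetical helper, not part of the package: keep the fewest alleles of
# one gene whose combined calls reach the requested fraction.
pick_alleles <- function(calls, fraction_to_explain = 0.875) {
  counts <- sort(table(calls), decreasing = TRUE)  # most common allele first
  explained <- cumsum(counts) / sum(counts)        # running fraction explained
  n_keep <- which(explained >= fraction_to_explain)[1]
  counts[seq_len(n_keep)]
}

# Toy calls for one gene: two frequent alleles plus a rare, likely mistaken call.
calls <- c(rep("IGHV1-2*02", 66), rep("IGHV1-2*04", 30), rep("IGHV1-2*05", 4))
genotype <- pick_alleles(calls)
names(genotype)
```

With these toy counts the two most common alleles already explain more than 87.5% of the calls, so the rare third allele is left out of the genotype.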
inferGenotype(clip_db, fraction_to_explain = 0.875, gene_cutoff = 1e-04, find_unmutated = TRUE, germline_db = NA, novel_df = NA)
clip_db - a data.frame containing V allele calls from a single subject. If find_unmutated is TRUE, then the sample IMGT-gapped V(D)J sequence should be provided in a column of clip_db.
fraction_to_explain - the portion of each gene that must be explained by the alleles that will be included in the genotype.
gene_cutoff - either a number of sequences or a fraction of the length of allele_calls denoting the minimum number of times a gene must be observed in allele_calls to be included in the genotype.
find_unmutated - if TRUE, use germline_db to find which samples are unmutated. Not needed if allele_calls only represent unmutated samples.
germline_db - named vector of sequences containing the germline sequences named in allele_calls. Only required if find_unmutated is TRUE.
novel_df - an optional data.frame of the type novel returned by findNovelAlleles containing germline sequences that will be utilized if find_unmutated is TRUE. See details.
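Because gene_cutoff accepts either a count or a fraction, it may help to see one way the two forms could be reconciled. The sketch below is an assumption for illustration only: resolve_cutoff is a hypothetical helper, and both the "values below 1 are fractions" rule and the rounding up are guesses rather than documented behavior.

```r
# Hypothetical helper: turn gene_cutoff into a minimum sequence count.
# Assumption: values below 1 are fractions of the number of allele calls,
# and fractional results are rounded up.
resolve_cutoff <- function(gene_cutoff, n_calls) {
  if (gene_cutoff < 1) ceiling(gene_cutoff * n_calls) else gene_cutoff
}

resolve_cutoff(1e-04, 50000)  # default fraction applied to 50,000 calls
resolve_cutoff(10, 50000)     # already a count, returned unchanged
```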
A table of alleles denoting the genotype of the subject
Allele calls representing cases where multiple alleles have been
assigned to a single sample sequence are rare among unmutated
sequences but may result if nucleotides for certain positions are
not available. Calls containing multiple alleles are treated as
belonging to all groups. If
novel_df is provided, all
sequences that are assigned to the same starting allele as any
novel germline allele will have the novel germline allele appended
to their assignment prior to searching for unmutated sequences.
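The handling of multi-allele calls and novel alleles described above can be illustrated with a few toy calls. This is a rough sketch, not the package's code: the calls are invented, the comma-separated call format is assumed, and "belonging to all groups" is interpreted here as counting a call toward every allele it lists.

```r
# Toy V calls; the third sequence is ambiguous between two alleles.
v_calls <- c("IGHV1-8*01", "IGHV1-8*02", "IGHV1-8*01,IGHV1-8*02", "IGHV1-8*02")

# Hypothetical novel allele and the starting allele it was derived from.
novel_allele <- "IGHV1-8*02_G234T"
start_allele <- "IGHV1-8*02"

# Append the novel allele to every call that lists its starting allele.
has_start <- vapply(strsplit(v_calls, ","),
                    function(a) start_allele %in% a, logical(1))
v_calls[has_start] <- paste(v_calls[has_start], novel_allele, sep = ",")

# Count each call toward every allele it lists.
allele_counts <- table(unlist(strsplit(v_calls, ",")))
allele_counts
```

After this step each affected sequence is a candidate for both the starting allele and the novel allele, so the later search for unmutated sequences can decide which germline it matches.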
This method works best with data derived from blood, where a large portion of sequences are expected to be unmutated. Ideally, there should be hundreds of allele calls per gene in the input.
# Infer the IGHV genotype, using only unmutated sequences, including any
# novel alleles
data(sample_db)
data(germline_ighv)
data(novel_df)
inferGenotype(sample_db, find_unmutated = TRUE, germline_db = germline_ighv, novel_df = novel_df)
        GENE     ALLELES      COUNTS TOTAL NOTE
1    IGHV1-2       02,04     664,302   966
2    IGHV1-3          01         226   226
3    IGHV1-8 01,02_G234T     467,370   837
4   IGHV1-18          01        1005  1005
5   IGHV1-24          01         105   105
6   IGHV1-46          01         624   624
7   IGHV1-58       01,02       23,18    41
8   IGHV1-69    01,04,06 515,469,280  1279
9 IGHV1-69-2          01          31    31
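The returned table packs several alleles and counts into one comma-separated string per gene. For downstream work it can be convenient to expand it to one row per allele. The sketch below rebuilds the first two rows of the output above by hand and assumes that ALLELES and COUNTS are character columns; the names geno and long are illustrative only.

```r
# First two rows of the example output, rebuilt by hand for illustration.
geno <- data.frame(GENE = c("IGHV1-2", "IGHV1-3"),
                   ALLELES = c("02,04", "01"),
                   COUNTS = c("664,302", "226"),
                   stringsAsFactors = FALSE)

# Split the comma-separated fields and expand to one row per allele.
alleles <- strsplit(geno$ALLELES, ",")
long <- data.frame(
  ALLELE = paste0(rep(geno$GENE, lengths(alleles)), "*", unlist(alleles)),
  COUNT = as.integer(unlist(strsplit(geno$COUNTS, ","))),
  stringsAsFactors = FALSE
)
long
```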