generateEvidence - Generate evidence
generateEvidence builds a table of evidence metrics for the final novel V
allele detection and genotyping inferences.
generateEvidence(data, novel, genotype, genotype_db, germline_db, fields = NULL)
data: a data.frame containing sequence data that has been passed through reassignAlleles to correct the allele assignments.
novel: the data.frame returned by findNovelAlleles.
genotype: a data.frame of alleles generated with inferGenotype denoting the genotype of the subject.
genotype_db: a vector of named nucleotide germline sequences in the genotype. Returned by genotypeFasta.
germline_db: the original uncorrected germline database used by findNovelAlleles to identify novel alleles.
fields: character vector of column names used to split the data to identify novel alleles, if any. If NULL then the data is not divided by grouping variables.
data.frame with the following additional columns
providing supporting evidence for each inferred allele:
FIELD_ID: Data subset identifier, defined with the input parameter fields.
- A variable number of columns, specified with the input parameter fields.
POLYMORPHISM_CALL: The novel allele call.
NOVEL_IMGT: The novel allele sequence.
CLOSEST_REFERENCE: The closest reference gene and allele in the germline_db database.
CLOSEST_REFERENCE_IMGT: Sequence of the closest reference gene and allele in the germline_db database.
GERMLINE_CALL: The input (uncorrected) V call.
GERMLINE_IMGT: Germline sequence for GERMLINE_CALL.
NT_DIFF: Number of nucleotides that differ between the new allele and the closest reference (CLOSEST_REFERENCE) in the germline_db database.
NT_SUBSTITUTIONS: A comma-separated list of specific nucleotide differences (e.g. 112G>A) in the novel allele.
AA_DIFF: Number of amino acids that differ between the new allele and the closest reference (CLOSEST_REFERENCE) in the germline_db database.
AA_SUBSTITUTIONS: A comma-separated list of specific amino acid differences (e.g. 96A>N) in the novel allele.
SEQUENCES: Number of sequences unambiguously assigned to this allele.
UNMUTATED_SEQUENCES: Number of records with the unmutated novel allele sequence.
UNMUTATED_FREQUENCY: Proportion of records with the unmutated novel allele sequence (UNMUTATED_SEQUENCES / SEQUENCES).
ALLELIC_PERCENTAGE: Percentage at which the (unmutated) allele is observed in the sequence dataset compared to other (unmutated) alleles.
UNIQUE_JS: Number of unique J sequences found associated with the novel allele. The sequences counted are those that have been unambiguously assigned to the novel allele (POLYMORPHISM_CALL).
UNIQUE_CDR3S: Number of unique CDR3s associated with the inferred allele. The sequences counted are those that have been unambiguously assigned to the novel allele (POLYMORPHISM_CALL).
MUT_MIN: Minimum number of mutations considered by the algorithm.
MUT_MAX: Maximum number of mutations considered by the algorithm.
POS_MIN: First position of the sequence considered by the algorithm (IMGT numbering).
POS_MAX: Last position of the sequence considered by the algorithm (IMGT numbering).
Y_INTERCEPT: The y-intercept above which positions were considered potentially polymorphic.
ALPHA: Significance threshold to be used when constructing the confidence interval for the y-intercept.
min_seqs: The minimum number of total sequences (within the desired mutational range and nucleotide range) required for the samples to be considered.
j_max: The maximum fraction of sequences perfectly aligning to a potential novel allele that are allowed to utilize a particular combination of junction length and J gene.
min_frac: The minimum fraction of sequences that must have usable nucleotides in a given position for that position to be considered.
NOTE: Comments regarding the novel allele inference.
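To make the NT_DIFF and NT_SUBSTITUTIONS columns concrete, the following base R sketch compares two aligned sequences position by position. It is only an illustration: compareAligned is a hypothetical helper, not a function of the package, the sequences are toy strings, and the assumption that a substitution reads as position, reference base, ">", novel base is inferred from the 112G>A example above.

```r
# Hypothetical helper (not part of the package): compare two aligned
# sequences of equal length and report the differing positions.
compareAligned <- function(reference, novel) {
    stopifnot(nchar(reference) == nchar(novel))
    ref_chars <- strsplit(reference, "")[[1]]
    nov_chars <- strsplit(novel, "")[[1]]

    # Positions at which the two sequences disagree
    pos <- which(ref_chars != nov_chars)

    # Assumed format: <position><reference base>><novel base>, comma separated
    subs <- if (length(pos) > 0) {
        paste0(pos, ref_chars[pos], ">", nov_chars[pos], collapse = ",")
    } else {
        ""
    }
    list(diff = length(pos), substitutions = subs)
}

# Toy sequences, not real germline data
nt <- compareAligned("ATGGCCAGT", "ATGACCAGT")
nt$diff           # analogous to NT_DIFF
nt$substitutions  # analogous to NT_SUBSTITUTIONS
```

The same comparison applied to translated sequences would give values analogous to AA_DIFF and AA_SUBSTITUTIONS.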
# Generate input data
novel <- findNovelAlleles(SampleDb, GermlineIGHV)
genotype <- inferGenotype(SampleDb, find_unmutated=TRUE, germline_db=GermlineIGHV, novel=novel)
genotype_db <- genotypeFasta(genotype, GermlineIGHV, novel)
data_db <- reassignAlleles(SampleDb, genotype_db)

# Assemble evidence table
evidence <- generateEvidence(data_db, novel, genotype, genotype_db, GermlineIGHV)
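Once an evidence table exists, its columns can be used to screen alleles by their level of support. The sketch below uses a small hand-made data.frame with the documented column names so that it runs without the package; the allele labels, counts, and both thresholds are invented for illustration and are not recommended values.

```r
# Toy stand-in for the table returned by generateEvidence
# (all values are made up for illustration)
evidence_toy <- data.frame(
    POLYMORPHISM_CALL = c("ALLELE_A", "ALLELE_B", "ALLELE_C"),
    SEQUENCES = c(120, 15, 60),
    UNMUTATED_SEQUENCES = c(90, 3, 30),
    UNIQUE_JS = c(5, 1, 4),
    UNIQUE_CDR3S = c(80, 2, 25),
    stringsAsFactors = FALSE
)

# Recompute the unmutated proportion as described for UNMUTATED_FREQUENCY
evidence_toy$UNMUTATED_FREQUENCY <-
    evidence_toy$UNMUTATED_SEQUENCES / evidence_toy$SEQUENCES

# Hypothetical thresholds, chosen only for this example
min_unique_js <- 2
min_unmutated_frequency <- 0.4

# Keep alleles seen with several J genes and mostly unmutated
well_supported <- evidence_toy[
    evidence_toy$UNIQUE_JS >= min_unique_js &
        evidence_toy$UNMUTATED_FREQUENCY >= min_unmutated_frequency,
]
well_supported$POLYMORPHISM_CALL
```

With a real evidence table, the same subsetting can be applied directly to the returned data.frame.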