subsampleDb - Subsample repertoire
Description¶
subsampleDb
will sample the same number of sequences for each gene, family
or allele (specified with mode
) in data
. Samples or subjects can
be subsampled indepently by setting group
.
Usage¶
subsampleDb(
data,
gene = "v_call",
mode = c("gene", "allele", "family"),
min_n = 1,
max_n = NULL,
group = NULL
)
Arguments¶
- data
data.frame
containing repertoire data.- gene
- name of the column in
data
with allele calls. Default isv_call
. - mode
- one of
c("gene", "family", "allele")
defining the degree of specificity regarding allele calls when subsetting sequences. Determines howdata
will be split into subsets from which the same number of sequences will be subsampled. See alsogroup
. - min_n
- minimum number of observations to sample from each groupe. A group with less observations than the minimum is excluded.
- max_n
- maximum number of observations to sample for all
mode
groups. IfNULL
, it will be set automatically to the size of the smallest group. Ifmax_n
is larger than the availabe number of sequences for anymode
group, if will be automatically adjusted and the efectivemax_n
used will be the size of the smallestmode
group. - group
- columns containing additional grouping variables, e.g. sample_id.
These groups will be subsampled independently. If
max_n
isNULL
, amax_n
will be automatically set for eachgroup
.
Value¶
Subsampled version of the input data
.
Details¶
data
will be split into gene, allele or family subsets (mode
) from
which the same number of sequences will be subsampled. If mode=gene
,
for each gene in the field gene
from data
, a maximum of
max_n
sequences will be subsampled. Input sequences
that have multiple gene calls (ties), can be subsampled from any of their calls,
but these duplicated samplings will be removed, and the final
subsampled data
will contain unique rows.
Examples¶
subsampleDb(AIRRDb)
# A tibble: 139 × 26
sequence_id sequence rev_comp productive v_call d_call j_call
<chr> <chr> <lgl> <lgl> <chr> <chr> <chr>
1 077294_2726_1483_length=42… NNNNNNN… NA TRUE IGHV1… IGHD3… IGHJ5…
2 240679_3698_3066_length=42… NNNNNNN… NA TRUE IGHV1… IGHD3… IGHJ3…
3 014026_2186_3677_length=45… NNNNNNN… NA TRUE IGHV1… IGHD4… IGHJ6…
4 087866_2612_1813_length=47… NNNNNNN… NA TRUE IGHV1… IGHD3… IGHJ6…
5 186255_1506_0138_length=45… NNNNNNN… NA TRUE IGHV1… IGHD2… IGHJ6…
6 094811_0577_1785_length=45… NNNNNNN… NA TRUE IGHV1… IGHD3… IGHJ6…
7 200273_3493_1744_length=45… NNNNNNN… NA TRUE IGHV1… IGHD2… IGHJ6…
8 008329_2221_1510_length=43… NNNNNNN… NA TRUE IGHV1… IGHD2… IGHJ6…
9 098375_3215_3198_length=43… NNNNNNN… NA TRUE IGHV1… IGHD6… IGHJ5…
10 011171_2123_0412_length=44… NNNNNNN… NA TRUE IGHV1… IGHD3… IGHJ4…
# ℹ 129 more rows
# ℹ 19 more variables: sequence_alignment <chr>, germline_alignment <lgl>,
# junction <chr>, junction_aa <lgl>, v_cigar <lgl>, d_cigar <lgl>,
# j_cigar <lgl>, vj_in_frame <lgl>, stop_codon <lgl>, v_germline_end <lgl>,
# np1_length <dbl>, np2_length <dbl>, j_germline_end <lgl>,
# junction_length <dbl>, mutated_invariant <lgl>, indels <lgl>,
# d_5_trim <dbl>, d_3_trim <dbl>, j_5_trim <dbl>