R/clustering-methods.R
canonicalize_cluster.Rd
Find a canonical contig to represent a cluster
An expression passed to dplyr::filter(). Unlike filter, multiple criteria must be combined with & rather than separated by commas. These act on ccdb$contig_tbl.
(Optional) character vector naming fields in contig_tbl that are used to sort the contig table in descending order. Used to break ties if contig_filter_args does not return a unique contig for each cluster.
The rank order, based on tie_break_keys, of the contig to return. If tie_break_keys included an ordered factor (such as chain), this could be used to return the second chain.
An optional field from contig_tbl that will be made unique. Serves as a surrogate cluster_pk.
Optional fields from contig_tbl that will be copied into the cluster_tbl from the canonical contig.
Logical: should non-key fields in y be overwritten using x, or should a suffix (".y") be added?
ContigCellDB() with some number of clusters/contigs/cells but with "canonical" values copied into cluster_tbl.
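The filter-then-tie-break behaviour described above can be sketched with plain dplyr on a toy table. This is only an illustration under stated assumptions, not the package's implementation: the table, its values, the `cluster_idx` column name, and the `pick_rank` variable are all hypothetical stand-ins.

```r
library(dplyr)

# Toy stand-in for ccdb$contig_tbl; every value here is made up.
contig_tbl <- tibble(
  cluster_idx = c(1, 1, 1, 2, 2),
  chain       = c("TRA", "TRA", "TRB", "TRA", "TRB"),
  length      = c(520, 610, 700, 480, 530),
  umis        = c(4, 9, 7, 3, 5)
)

# Criteria joined with `&` form a single filter expression.
filtered <- contig_tbl %>% filter(chain == "TRA" & length > 500)

# Break ties: sort descending by the key, then keep the row at the chosen rank.
pick_rank <- 1  # hypothetical stand-in for the rank-order argument
canonical <- filtered %>%
  group_by(cluster_idx) %>%
  arrange(desc(umis), .by_group = TRUE) %>%
  slice(pick_rank) %>%
  ungroup()

# Clusters with no surviving contig (cluster 2 here) end up with NA fields.
cluster_tbl <- tibble(cluster_idx = c(1, 2)) %>%
  left_join(canonical, by = "cluster_idx")
```

In this sketch two TRA contigs in cluster 1 pass the filter, and sorting by `umis` in descending order picks the one with the higher UMI count; cluster 2 has no qualifying contig, so its copied fields are missing.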
library(dplyr)
#>
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#>
#> filter, lag
#> The following objects are masked from ‘package:base’:
#>
#> intersect, setdiff, setequal, union
data(ccdb_ex)
ccdb_ex_small = ccdb_ex
ccdb_ex_small$cell_tbl = ccdb_ex_small$cell_tbl[1:200,]
ccdb_ex_small = cdhit_ccdb(ccdb_ex_small,
  sequence_key = 'cdr3_nt', type = 'DNA', cluster_name = 'DNA97',
  identity = .965, min_length = 12, G = 1)
ccdb_ex_small = fine_clustering(ccdb_ex_small, sequence_key = 'cdr3_nt', type = 'DNA')
#> Calculating intradistances on 329 clusters.
#> Summarizing
# Canonicalizing with the medoid contig is probably the most common choice
ccdb_medoid = canonicalize_cluster(ccdb_ex_small)
#> Filtering `contig_tbl` by `is_medoid`, override by setting `contig_filter_args == TRUE`
# But there are other possibilities.
# To pass multiple "AND" filter arguments, you must use &
ccdb_umi = canonicalize_cluster(ccdb_ex_small,
  contig_filter_args = chain == 'TRA' & length > 500, tie_break_keys = 'umis',
  contig_fields = c('chain', 'length'))
#> Subset of `contig_tbl` has 157 rows for 329 clusters. Filling missing values and breaking ties
#> with umis.
ccdb_umi$cluster_tbl %>% dplyr::select(chain, length) %>% summary()
#>     chain               length
#>  Length:329         Min.   : 503.0
#>  Class :character   1st Qu.: 558.0
#>  Mode  :character   Median : 607.0
#>                     Mean   : 620.0
#>                     3rd Qu.: 665.5
#>                     Max.   :1006.0
#>                     NA's   :186
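The idea of using an ordered factor together with the rank-order argument to return "the second chain" can also be illustrated on toy data. The sketch below is an assumption about the general approach, written in plain dplyr; the helper `nth_contig()`, the `cluster_idx` column, and all values are hypothetical and not part of the package.

```r
library(dplyr)

# Toy contigs; chain is an ordered factor with TRA < TRB.
toy <- tibble(
  cluster_idx = c(1, 1, 2, 2),
  chain = factor(c("TRA", "TRB", "TRB", "TRA"),
                 levels = c("TRA", "TRB"), ordered = TRUE),
  umis = c(5, 8, 2, 6)
)

# Hypothetical helper: sort each cluster by chain (descending), keep the k-th row.
nth_contig <- function(tbl, k) {
  tbl %>%
    group_by(cluster_idx) %>%
    arrange(desc(chain), .by_group = TRUE) %>%
    slice(k) %>%
    ungroup()
}

first_chain  <- nth_contig(toy, 1)  # highest-ranked chain in each cluster
second_chain <- nth_contig(toy, 2)  # next chain in each cluster
```

Because the sort is descending, rank 1 selects the highest factor level present in each cluster and rank 2 selects the next one.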