Find a canonical contig to represent a cluster

canonicalize_cluster(
  ccdb,
  contig_filter_args,
  tie_break_keys = character(),
  order = 1,
  representative = ccdb$cluster_pk[1],
  contig_fields = c("cdr3", "cdr3_nt", "chain", "v_gene", "d_gene", "j_gene"),
  overwrite = TRUE
)

Arguments

ccdb

ContigCellDB()

contig_filter_args

An expression passed to dplyr::filter(), acting on ccdb$contig_tbl. Unlike filter(), multiple criteria must be combined with & rather than separated by commas.
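As a small illustration of the `&` requirement, the sketch below uses plain dplyr on a toy table. The table and its values are hypothetical stand-ins for ccdb$contig_tbl, not package data. In dplyr itself, comma-separated conditions and a single `&` expression select the same rows, so writing the filter with `&` loses nothing.

```r
library(dplyr)

# Toy stand-in for ccdb$contig_tbl (hypothetical values)
contig_tbl <- tibble(
  cluster_id = c(1, 1, 2, 2),
  chain = c("TRA", "TRB", "TRA", "TRA"),
  length = c(520, 610, 480, 700)
)

# Plain dplyr: comma-separated conditions are combined with AND
with_commas <- filter(contig_tbl, chain == "TRA", length > 500)

# The same criteria written as one expression joined with `&`,
# which is the form contig_filter_args expects
with_amp <- filter(contig_tbl, chain == "TRA" & length > 500)

# Both keep the rows where chain is "TRA" and length exceeds 500
all.equal(with_commas, with_amp)
```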

tie_break_keys

(optional) character vector naming fields in contig_tbl that are used to sort the contig table in descending order. Used to break ties if contig_filter_args does not return a unique contig for each cluster.

order

The rank of the contig to return, based on tie_break_keys. If tie_break_keys includes an ordered factor (such as chain), this can be used to return the second chain.
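The interplay of tie_break_keys and order can be pictured with a minimal dplyr sketch. This is an assumption about the general idea (sort each cluster in descending order of the key, then take the row at the requested rank), not the package's actual implementation. The helper `pick_by_rank()`, its `rank_order` argument, and the toy table are hypothetical.

```r
library(dplyr)

# Toy stand-in for a filtered contig table (hypothetical values)
contig_tbl <- tibble(
  cluster_id = c(1, 1, 1, 2, 2),
  contig_id = c("a", "b", "c", "d", "e"),
  umis = c(3, 10, 7, 5, 2)
)

# Hypothetical helper: within each cluster, sort by `key` in descending
# order and keep the row at position `rank_order` (mirrors `order`)
pick_by_rank <- function(tbl, key, rank_order = 1) {
  tbl %>%
    group_by(cluster_id) %>%
    arrange(desc(.data[[key]]), .by_group = TRUE) %>%
    slice(rank_order) %>%
    ungroup()
}

# Highest-umi contig per cluster: "b" for cluster 1, "d" for cluster 2
top <- pick_by_rank(contig_tbl, "umis", rank_order = 1)

# Second-ranked contig per cluster: "c" for cluster 1, "e" for cluster 2
runner_up <- pick_by_rank(contig_tbl, "umis", rank_order = 2)
```

In this sketch, a cluster with fewer rows than `rank_order` simply drops out of the result.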

representative

An optional field from contig_tbl that will be made unique. Serves as a surrogate cluster_pk.

contig_fields

Optional fields from contig_tbl that will be copied into the cluster_tbl from the canonical contig.
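Copying contig_fields into the cluster table can be thought of as a left join from clusters to their chosen contigs. The sketch below is only an illustration under that assumption; the tables and column values are hypothetical, and the package may do this differently. It also shows why a cluster with no selected contig ends up with missing values in the copied fields.

```r
library(dplyr)

# Hypothetical cluster table and one chosen contig per cluster
cluster_tbl <- tibble(cluster_id = c(1, 2, 3))
canonical <- tibble(
  cluster_id = c(1, 2),
  chain = c("TRA", "TRB"),
  length = c(520, 610),
  umis = c(4, 9)
)

# Only the requested fields are carried over
contig_fields <- c("chain", "length")

annotated <- left_join(
  cluster_tbl,
  select(canonical, cluster_id, all_of(contig_fields)),
  by = "cluster_id"
)

# Cluster 3 has no chosen contig, so its chain and length are NA
annotated
```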

overwrite

logical -- should non-key fields in y be overwritten using x, or should a suffix (".y") be added

Value

ContigCellDB() with the same number of clusters/contigs/cells, but with "canonical" values copied into cluster_tbl.

Examples

library(dplyr)
#> 
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#> 
#>     filter, lag
#> The following objects are masked from ‘package:base’:
#> 
#>     intersect, setdiff, setequal, union
data(ccdb_ex)
ccdb_ex_small = ccdb_ex
ccdb_ex_small$cell_tbl = ccdb_ex_small$cell_tbl[1:200,]
ccdb_ex_small = cdhit_ccdb(ccdb_ex_small,
  sequence_key = 'cdr3_nt', type = 'DNA', cluster_name = 'DNA97',
  identity = .965, min_length = 12, G = 1)
ccdb_ex_small = fine_clustering(ccdb_ex_small, sequence_key = 'cdr3_nt', type = 'DNA')
#> Calculating intradistances on 329 clusters.
#> Summarizing

# Canonicalizing with the medoid contig is probably the most common choice
ccdb_medoid = canonicalize_cluster(ccdb_ex_small)
#> Filtering `contig_tbl` by `is_medoid`, override by setting `contig_filter_args == TRUE`

# But there are other possibilities.
# To pass multiple "AND" filter arguments, combine them with &
ccdb_umi = canonicalize_cluster(ccdb_ex_small,
  contig_filter_args = chain == 'TRA' & length > 500, tie_break_keys = 'umis',
  contig_fields = c('chain', 'length'))
#> Subset of `contig_tbl` has 157 rows for 329 clusters. Filling missing values and breaking ties 
#> with umis.
ccdb_umi$cluster_tbl %>% dplyr::select(chain, length) %>% summary()
#>     chain               length      
#>  Length:329         Min.   : 503.0  
#>  Class :character   1st Qu.: 558.0  
#>  Mode  :character   Median : 607.0  
#>                     Mean   : 620.0  
#>                     3rd Qu.: 665.5  
#>                     Max.   :1006.0  
#>                     NA's   :186