See
https://github.com/weizhongli/cdhit/wiki/3.-User's-Guide#CDHIT for details
on other potential arguments that can be passed via "...".
These will override any default values.
cdhit_ccdb(
  ccdb,
  sequence_key,
  type = c("DNA", "AA"),
  cluster_pk = "cluster_idx",
  ...
)
ccdb
An object of class ContigCellDB()
sequence_key
character naming the column in the contig_tbl containing the sequence to be clustered
type
one of 'DNA' or 'AA'
cluster_pk
character specifying the key and name for the clustering.
...
Arguments passed on to cdhit
identity
minimum proportion identity
kmerSize
word size. If NULL, it will be chosen automatically based on the identity. You may need to lower it below 5 for AA sequences with identity less than 0.7.
min_length
Minimum length for sequences to be clustered. An error is raised if a shorter sequence is passed.
s
fraction of shorter sequence covered by alignment.
showProgress
show a status bar
G
1 for global alignment, 0 for local. If in doubt, pick global.
data(ccdb_ex)
res = cdhit_ccdb(ccdb_ex, 'cdr3_nt', type = 'DNA',
    cluster_name = 'DNA97', identity = .965, min_length = 12, G = 1)
res$cluster_tbl
#> # A tibble: 1,354 × 1
#> cluster_idx
#> <dbl>
#> 1 1094
#> 2 784
#> 3 785
#> 4 388
#> 5 125
#> 6 786
#> 7 126
#> 8 1322
#> 9 389
#> 10 1095
#> # … with 1,344 more rows
res$contig_tbl
#> # A tibble: 1,508 × 23
#> anno_file pop sample barcode is_cell contig_id high_confidence length chain
#> <chr> <chr> <chr> <chr> <lgl> <chr> <lgl> <dbl> <chr>
#> 1 /Users/a… b6 4 AAAGTA… TRUE AAAGTAGT… TRUE 611 TRB
#> 2 /Users/a… b6 4 AAAGTA… TRUE AAAGTAGT… TRUE 609 TRB
#> 3 /Users/a… b6 4 AAAGTA… TRUE AAAGTAGT… TRUE 538 TRA
#> 4 /Users/a… b6 4 AACCAT… TRUE AACCATGC… TRUE 799 TRA
#> 5 /Users/a… b6 4 AACTGG… TRUE AACTGGTG… TRUE 634 TRB
#> 6 /Users/a… b6 4 AACTGG… TRUE AACTGGTG… TRUE 923 TRA
#> 7 /Users/a… b6 4 AAGCCG… TRUE AAGCCGCA… TRUE 693 TRB
#> 8 /Users/a… b6 4 AAGTCT… TRUE AAGTCTGG… TRUE 658 TRB
#> 9 /Users/a… b6 4 AAGTCT… TRUE AAGTCTGG… TRUE 558 TRA
#> 10 /Users/a… b6 4 ACACCA… TRUE ACACCAAA… TRUE 614 TRB
#> # … with 1,498 more rows, and 14 more variables: v_gene <chr>, d_gene <chr>,
#> # j_gene <chr>, c_gene <chr>, full_length <lgl>, productive <chr>,
#> # cdr3 <chr>, cdr3_nt <chr>, reads <dbl>, umis <dbl>, raw_clonotype_id <chr>,
#> # raw_consensus_id <chr>, celltype <chr>, cluster_idx <dbl>
res$cluster_pk
#> [1] "cluster_idx"