See https://github.com/weizhongli/cdhit/wiki/3.-User's-Guide#CDHIT for details on other arguments that can be passed through the ... argument. These override any default values.

cdhit_ccdb(
  ccdb,
  sequence_key,
  type = c("DNA", "AA"),
  cluster_pk = "cluster_idx",
  ...
)

Arguments

ccdb

An object of class ContigCellDB()

sequence_key

character naming the column in the contig_tbl containing the sequence to be clustered

type

one of 'DNA' or 'AA'

cluster_pk

character specifying the key for, and the name of, the clustering.

...

Arguments passed on to cdhit

identity

minimum proportion identity

kmerSize

word size. If NULL, it will be chosen automatically based on the identity. You may need to lower it below 5 for AAseq with identity less than .7.

min_length

Minimum length for sequences to be clustered. An error is raised if a shorter sequence is passed.

s

fraction of shorter sequence covered by alignment.

showProgress

show a status bar

G

1 for global alignment, 0 for local. If in doubt, pick global.
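
The kmerSize note above can be made concrete with a small sketch. It assumes the word-size ranges given in the CD-HIT user guide for protein clustering (5 for identity 0.7 to 1.0, 4 for 0.6 to 0.7, 3 for 0.5 to 0.6, 2 for 0.4 to 0.5). The helper name suggest_aa_word_size is hypothetical and is not part of the package; the automatic choice made when kmerSize is NULL may differ.

```r
# Hypothetical helper (not part of the package): suggest a CD-HIT word size
# for amino-acid clustering, following the ranges in the CD-HIT user guide.
# At a shared boundary (e.g. 0.7) the larger word size is returned.
suggest_aa_word_size <- function(identity) {
  stopifnot(is.numeric(identity), length(identity) == 1,
            identity >= 0.4, identity <= 1)
  if (identity >= 0.7) {
    5L
  } else if (identity >= 0.6) {
    4L
  } else if (identity >= 0.5) {
    3L
  } else {
    2L
  }
}

suggest_aa_word_size(0.8)   # 5
suggest_aa_word_size(0.65)  # 4
```

The returned value could then be supplied as kmerSize when clustering with type = 'AA' at a low identity threshold.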

Examples

data(ccdb_ex)
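# Note: cluster_name is not a formal argument of cdhit_ccdb(), so it is
# absorbed by ...; the clustering key keeps its default name, cluster_idx.
res = cdhit_ccdb(ccdb_ex, 'cdr3_nt', type = 'DNA',
                 cluster_name = 'DNA97', identity = .965, min_length = 12, G = 1)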
res$cluster_tbl
#> # A tibble: 1,354 × 1
#>    cluster_idx
#>          <dbl>
#>  1        1094
#>  2         784
#>  3         785
#>  4         388
#>  5         125
#>  6         786
#>  7         126
#>  8        1322
#>  9         389
#> 10        1095
#> # … with 1,344 more rows
res$contig_tbl
#> # A tibble: 1,508 × 23
#>    anno_file pop   sample barcode is_cell contig_id high_confidence length chain
#>    <chr>     <chr> <chr>  <chr>   <lgl>   <chr>     <lgl>            <dbl> <chr>
#>  1 /Users/a… b6    4      AAAGTA… TRUE    AAAGTAGT… TRUE               611 TRB  
#>  2 /Users/a… b6    4      AAAGTA… TRUE    AAAGTAGT… TRUE               609 TRB  
#>  3 /Users/a… b6    4      AAAGTA… TRUE    AAAGTAGT… TRUE               538 TRA  
#>  4 /Users/a… b6    4      AACCAT… TRUE    AACCATGC… TRUE               799 TRA  
#>  5 /Users/a… b6    4      AACTGG… TRUE    AACTGGTG… TRUE               634 TRB  
#>  6 /Users/a… b6    4      AACTGG… TRUE    AACTGGTG… TRUE               923 TRA  
#>  7 /Users/a… b6    4      AAGCCG… TRUE    AAGCCGCA… TRUE               693 TRB  
#>  8 /Users/a… b6    4      AAGTCT… TRUE    AAGTCTGG… TRUE               658 TRB  
#>  9 /Users/a… b6    4      AAGTCT… TRUE    AAGTCTGG… TRUE               558 TRA  
#> 10 /Users/a… b6    4      ACACCA… TRUE    ACACCAAA… TRUE               614 TRB  
#> # … with 1,498 more rows, and 14 more variables: v_gene <chr>, d_gene <chr>,
#> #   j_gene <chr>, c_gene <chr>, full_length <lgl>, productive <chr>,
#> #   cdr3 <chr>, cdr3_nt <chr>, reads <dbl>, umis <dbl>, raw_clonotype_id <chr>,
#> #   raw_consensus_id <chr>, celltype <chr>, cluster_idx <dbl>
res$cluster_pk
#> [1] "cluster_idx"
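
Because sequences shorter than min_length cause an error, it may help to check lengths before clustering. The following is a minimal base-R sketch of such a check; the sequences are made up for illustration, and how the package itself treats missing values is not covered here.

```r
# Hypothetical CDR3 nucleotide sequences, for illustration only.
seqs <- c("TGTGCCAGCAGT", "TGTGCT", "TGTGCCAGCAGCTTAGC")
min_length <- 12

# Length of each sequence in characters, and which ones fall below min_length.
seq_nchar <- nchar(seqs)
too_short <- seq_nchar < min_length

# Sequences that would need to be filtered out (or min_length lowered).
seqs[too_short]
```

In practice the same check could be applied to the column named by sequence_key in the contig_tbl.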