Use cdhit() to cluster a ContigCellDB() — cdhit

See https://github.com/weizhongli/cdhit/wiki/3.-User's-Guide#CDHIT for details on other arguments that can be passed through `...`. These override any default values.

cdhit_ccdb(
  ccdb,
  sequence_key,
  type = c("DNA", "AA"),
  cluster_pk = "cluster_idx",
  ...
)

Arguments

ccdb

An object of class ContigCellDB()

sequence_key

character naming the column in the contig_tbl containing the sequence to be clustered

type

one of 'DNA' or 'AA'

cluster_pk

character specifying the key and name for the clustering

...

Arguments passed on to cdhit

identity: minimum proportion identity
kmerSize: word size. If NULL, it is chosen automatically based on the identity. You may need to lower it below 5 for amino acid sequences with identity less than 0.7.
min_length: minimum length of sequences to be clustered. An error is raised if a shorter sequence is passed.
s: fraction of the shorter sequence covered by the alignment.
showProgress: show a status bar
G: 1 for global alignment, 0 for local. If in doubt, pick global.
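
The way values supplied through `...` take precedence over defaults can be pictured with a small base-R sketch. The helper `resolve_args()` and the default values in it are hypothetical, chosen only for illustration; this is not necessarily how cdhit_ccdb() or cdhit() handle their arguments internally.

```r
# Hypothetical helper: merge user-supplied arguments over a list of defaults.
resolve_args <- function(...) {
  # Illustrative defaults only; not the package's actual values.
  defaults <- list(identity = 0.9, G = 1, min_length = 6)
  supplied <- list(...)
  # modifyList() replaces entries with matching names and appends new ones.
  utils::modifyList(defaults, supplied)
}

# identity and min_length are overridden; G keeps its default.
args <- resolve_args(identity = 0.965, min_length = 12)
str(args)
```

Anything not named in the call keeps its default, which is the behavior the description above attributes to `...`.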

Value

ContigCellDB()

Examples

data(ccdb_ex)
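# Note: cluster_name is not a formal argument of cdhit_ccdb(), so it travels
# through `...`; the clustering key keeps its default (see res$cluster_pk below).
res = cdhit_ccdb(ccdb_ex, 'cdr3_nt', type = 'DNA',
                 cluster_name = 'DNA97', identity = .965, min_length = 12, G = 1)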
res$cluster_tbl
#> # A tibble: 1,354 × 1
#>    cluster_idx
#>          <dbl>
#>  1        1094
#>  2         784
#>  3         785
#>  4         388
#>  5         125
#>  6         786
#>  7         126
#>  8        1322
#>  9         389
#> 10        1095
#> # … with 1,344 more rows
res$contig_tbl
#> # A tibble: 1,508 × 23
#>    anno_file pop   sample barcode is_cell contig_id high_confidence length chain
#>    <chr>     <chr> <chr>  <chr>   <lgl>   <chr>     <lgl>            <dbl> <chr>
#>  1 /Users/a… b6    4      AAAGTA… TRUE    AAAGTAGT… TRUE               611 TRB  
#>  2 /Users/a… b6    4      AAAGTA… TRUE    AAAGTAGT… TRUE               609 TRB  
#>  3 /Users/a… b6    4      AAAGTA… TRUE    AAAGTAGT… TRUE               538 TRA  
#>  4 /Users/a… b6    4      AACCAT… TRUE    AACCATGC… TRUE               799 TRA  
#>  5 /Users/a… b6    4      AACTGG… TRUE    AACTGGTG… TRUE               634 TRB  
#>  6 /Users/a… b6    4      AACTGG… TRUE    AACTGGTG… TRUE               923 TRA  
#>  7 /Users/a… b6    4      AAGCCG… TRUE    AAGCCGCA… TRUE               693 TRB  
#>  8 /Users/a… b6    4      AAGTCT… TRUE    AAGTCTGG… TRUE               658 TRB  
#>  9 /Users/a… b6    4      AAGTCT… TRUE    AAGTCTGG… TRUE               558 TRA  
#> 10 /Users/a… b6    4      ACACCA… TRUE    ACACCAAA… TRUE               614 TRB  
#> # … with 1,498 more rows, and 14 more variables: v_gene <chr>, d_gene <chr>,
#> #   j_gene <chr>, c_gene <chr>, full_length <lgl>, productive <chr>,
#> #   cdr3 <chr>, cdr3_nt <chr>, reads <dbl>, umis <dbl>, raw_clonotype_id <chr>,
#> #   raw_consensus_id <chr>, celltype <chr>, cluster_idx <dbl>
res$cluster_pk
#> [1] "cluster_idx"
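
Once each contig carries a cluster index, one natural follow-up is to count contigs per cluster. The sketch below uses base R on a toy data frame standing in for `res$contig_tbl`; the contig names and cluster assignments are invented for illustration and are not taken from `ccdb_ex`.

```r
# Toy stand-in for res$contig_tbl; values are made up for illustration.
contig_tbl <- data.frame(
  contig_id = paste0("contig_", 1:6),
  cluster_idx = c(1, 1, 2, 3, 3, 3)
)

# Count contigs per cluster and sort from largest to smallest.
cluster_sizes <- sort(table(contig_tbl$cluster_idx), decreasing = TRUE)
cluster_sizes
```

The same call should apply to the real table by substituting `res$contig_tbl$cluster_idx`, assuming the clustering key is left at its default name.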
