CD-HIT is a greedy algorithm that clusters amino acid or DNA sequences based on a minimum identity. By default, in this package it is configured to perform ungapped, global alignments with no clipping at the start or end. The identity is then the number of identical characters in the alignment divided by the full length of the shorter sequence. Set s < 1 to lower the minimum coverage of the shorter sequence, which allows clipping at the start or end. Setting G = 0 changes the meaning of the identity to the number of identical characters in the alignment divided by the length of the alignment. In this case, you must also set the alignment coverage controls aL, AL, aS, AS.
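
To make the two identity definitions concrete, here is a minimal sketch in base R. The helper toy_identity and its aligned_length argument are hypothetical and are not part of this package or of CD-HIT; it simply compares two strings position by position from their first character, which is a simplification of what CD-HIT actually does.

```r
# Hypothetical helper for illustration only; not part of CellaRepertorium or CD-HIT.
# Compares two sequences position by position from their first character
# (a fixed, ungapped alignment with no search over offsets).
toy_identity <- function(a, b, aligned_length = min(nchar(a), nchar(b))) {
  a_chars <- strsplit(a, "")[[1]]
  b_chars <- strsplit(b, "")[[1]]
  shorter <- min(length(a_chars), length(b_chars))
  stopifnot(aligned_length >= 1, aligned_length <= shorter)
  idx <- seq_len(aligned_length)
  matches <- sum(a_chars[idx] == b_chars[idx])
  c(
    global = matches / shorter,        # G = 1: divide by length of the shorter sequence
    local  = matches / aligned_length  # G = 0: divide by length of the alignment
  )
}

# Alignment spans the whole shorter sequence: both definitions agree.
toy_identity("CASSLGQGA", "CASSLGQGAYEQYF")

# Alignment spans only the first 6 characters: the local identity is still 1,
# while the global identity drops to 6/9.
toy_identity("CASSLGQGA", "CASSLGQGAYEQYF", aligned_length = 6)
```

The second call suggests why the coverage controls matter when G = 0: without them, a very short perfect match can reach a local identity of 1.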

cdhit(
  seqs,
  identity = NULL,
  kmerSize = NULL,
  min_length = 6,
  s = 1,
  G = 1,
  only_index = FALSE,
  showProgress = interactive(),
  ...
)

Arguments

seqs

AAseq or DNAseq

identity

minimum proportion identity

kmerSize

word size. If NULL, it will be chosen automatically based on the identity. You may need to lower it below 5 for AAseq with identity less than 0.7.

min_length

Minimum length for sequences to be clustered. An error is raised if a shorter sequence is passed.

s

fraction of shorter sequence covered by alignment.

G

1 for global alignment, 0 for local. If in doubt, pick global.

only_index

if TRUE, return only the integer cluster indices; otherwise return a tibble.

showProgress

show a status bar

...

other arguments that can be passed to cdhit, see https://github.com/weizhongli/cdhit/wiki/3.-User's-Guide#CDHIT for details. These will override any default values.

Value

integer vector with one element per sequence in seqs, giving the cluster ID of each sequence (when only_index = TRUE), or a tibble otherwise. See details.
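
As a rough illustration of working with the integer form of the return value, the sketch below summarises a vector of cluster IDs in base R. The vector cluster_idx is made up for the example and is not real cdhit output.

```r
# Hypothetical cluster indices for six sequences (not real cdhit output).
cluster_idx <- c(1L, 2L, 1L, 3L, 2L, 1L)

# Number of sequences assigned to each cluster.
cluster_sizes <- table(cluster_idx)

# Positions of the input sequences that belong to each cluster.
members <- split(seq_along(cluster_idx), cluster_idx)
```

Because the vector is in the same order as seqs, the positions in members can be used to subset the original sequences.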

Details

CD-HIT is by Fu, Niu, Zhu, Wu and Li (2012). The R interface was originally written by Thomas Lin Pedersen and was transcribed here because it is not exported from the package FindMyFriends, which is orphaned.

Examples

fasta_path = system.file('extdata', 'demo.fasta', package='CellaRepertorium')
aaseq = Biostrings::readAAStringSet(fasta_path)
# 100% identity, global alignment
cdhit(aaseq, identity = 1, only_index = TRUE)[1:10]
#>  [1] 100 101 162 102   6 245 103  49 163 164
# 100% identity, local alignment that must cover all of both sequences
cdhit(aaseq, identity = 1, G = 0, aL = 1, aS = 1, only_index = TRUE)[1:10]
#>  [1] 100 101 162 102   6 245 103  49 163 164
# 100% identity, local alignment that must cover at least 90% of both sequences
cdhit(aaseq, identity = 1, G = 0, aL = .9, aS = .9, only_index = TRUE)[1:10]
#>  [1] 100 101 162 102   6 245 103  49 163 164
# a tibble
tbl = cdhit(aaseq, identity = 1, G = 0, aL = .9, aS = .9, only_index = FALSE)