CDHIT is a greedy algorithm to cluster amino acid or DNA sequences based on a
minimum identity.
By default, in this package it is configured to perform ungapped, global
alignments with no clipping at the start or end.
The identity is the number of identical characters in the alignment
divided by the full length of the shorter sequence.
Set s < 1 to lower the minimum coverage of the shorter sequence, which
allows clipping at the start or end.
Setting G = 0 changes the meaning of the identity to be the number of
identical characters in the alignment divided by the length of the alignment.
In this case, you must also set the alignment coverage controls aL, AL, aS, AS.
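To make the two identity definitions and the coverage controls concrete, here is a minimal base-R sketch. It does not call CD-HIT: n_identical, identity_global, identity_local and coverage_ok are hypothetical helpers, and the alignment is assumed to be supplied as two equal-length aligned strings. The coverage semantics (aL and aS as minimum covered fractions, AL and AS as maximum numbers of unaligned residues, for the longer and shorter sequence respectively) follow the CD-HIT user's guide.

```r
# Count positions at which two equal-length aligned strings agree.
n_identical <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] == strsplit(b, "")[[1]])
}

# G = 1: matches divided by the full length of the shorter sequence.
identity_global <- function(n_match, len_short) n_match / len_short

# G = 0: matches divided by the length of the alignment itself.
identity_local <- function(n_match, len_aln) n_match / len_aln

# Coverage check for one sequence (hypothetical helper):
# min_frac plays the role of aL / aS, max_unaligned the role of AL / AS.
coverage_ok <- function(len_aln, len_seq, min_frac = 0, max_unaligned = Inf) {
  len_aln >= min_frac * len_seq && (len_seq - len_aln) <= max_unaligned
}

# Toy case: an 8-residue alignment between a 10-residue (shorter)
# and a 12-residue (longer) sequence, with all aligned positions identical.
m <- n_identical("CASSLGQG", "CASSLGQG")

identity_global(m, len_short = 10)     # 8 matches over the shorter sequence
identity_local(m, len_aln = 8)         # 8 matches over the alignment
coverage_ok(8, 12, min_frac = 0.9)     # aL = .9: alignment too short
coverage_ok(8, 12, max_unaligned = 4)  # AL = 4: exactly 4 residues unaligned
```

The same pair of sequences passes an identity threshold of 1 under G = 0 but not under G = 1, which is why the coverage controls matter once local alignment is enabled.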
cdhit(
  seqs,
  identity = NULL,
  kmerSize = NULL,
  min_length = 6,
  s = 1,
  G = 1,
  only_index = FALSE,
  showProgress = interactive(),
  ...
)
seqs: AAseq or DNAseq
identity: minimum proportion identity
kmerSize: word size. If NULL, it will be chosen automatically based on the identity. You may need to lower it below 5 for AAseq with identity less than .7.
min_length: minimum length for sequences to be clustered. An error is raised if a shorter sequence is passed.
s: fraction of the shorter sequence covered by the alignment.
G: 1 for global alignment, 0 for local. If in doubt, pick global.
only_index: if TRUE, only return the integer cluster indices; otherwise return a tibble.
showProgress: show a status bar
...: other arguments that can be passed to cdhit, see https://github.com/weizhongli/cdhit/wiki/3.-User's-Guide#CDHIT for details. These will override any default values.
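For orientation when setting kmerSize by hand, the CD-HIT user's guide lists recommended word sizes per identity threshold for amino acid sequences. The sketch below encodes that table in a hypothetical helper, suggest_word_size. It is not necessarily the rule this package applies when kmerSize = NULL, and it does not apply to DNA sequences.

```r
# Word size recommended by the CD-HIT user's guide for protein clustering.
# Hypothetical helper; not part of this package.
suggest_word_size <- function(identity) {
  stopifnot(is.numeric(identity), length(identity) == 1,
            identity >= 0.4, identity <= 1)
  if (identity >= 0.7) {
    5L   # thresholds 0.7 to 1.0
  } else if (identity >= 0.6) {
    4L   # thresholds 0.6 to 0.7
  } else if (identity >= 0.5) {
    3L   # thresholds 0.5 to 0.6
  } else {
    2L   # thresholds 0.4 to 0.5
  }
}

suggest_word_size(0.65)
```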
An integer vector with one element per sequence in seqs giving the cluster
ID for each sequence (if only_index = TRUE), or a tibble otherwise. See details.
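As a sketch of how the integer form of the return value might be used, the base-R snippet below summarises a made-up vector of cluster IDs. The values in cluster_id are invented for illustration and are not output from cdhit.

```r
# Hypothetical cluster IDs, shaped like the result of only_index = TRUE:
# element i is the cluster of the i-th input sequence.
cluster_id <- c(3L, 1L, 3L, 2L, 1L, 3L)

# Number of sequences in each cluster.
cluster_size <- table(cluster_id)

# Positions in the input of the members of each cluster.
members <- split(seq_along(cluster_id), cluster_id)

# IDs of clusters that contain more than one sequence.
multi <- names(cluster_size)[cluster_size > 1]
```

Because the vector is parallel to the input, members can be used directly to subset the original sequence object.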
CDHit is by Fu, Niu, Zhu, Wu and Li (2012). The R interface is originally by Thomas Lin Pedersen and was transcribed here because it is not exported from the package FindMyFriends, which is orphaned.
fasta_path = system.file('extdata', 'demo.fasta', package='CellaRepertorium')
aaseq = Biostrings::readAAStringSet(fasta_path)
# 100% identity, global alignment
cdhit(aaseq, identity = 1, only_index = TRUE)[1:10]
#> [1] 100 101 162 102 6 245 103 49 163 164
# 100% identity, local alignment with no padding of endpoints
cdhit(aaseq,identity = 1, G = 0, aL = 1, aS = 1, only_index = TRUE)[1:10]
#> [1] 100 101 162 102 6 245 103 49 163 164
# 100% identity, local alignment with .9 padding of endpoints
cdhit(aaseq,identity = 1, G = 0, aL = .9, aS = .9, only_index = TRUE)[1:10]
#> [1] 100 101 162 102 6 245 103 49 163 164
# a tibble
tbl = cdhit(aaseq, identity = 1, G = 0, aL = .9, aS = .9, only_index = FALSE)