R/clustering-methods.R
fine_cluster_seqs.Rd
The distance between AA sequences is defined as (1 - score/max(score)) times the median length of the input sequences. The distance between nucleotide sequences is defined as edit_distance/max(edit_distance) times the median length of the input sequences.
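As a rough illustration of the nucleotide rescaling described above, the following base R sketch computes Levenshtein distances with `utils::adist()` on a made-up set of sequences and rescales them by the median sequence length. The helper name `rescale_edit_distance` and the toy sequences are hypothetical, and the package's internal implementation may differ.

```r
# Hypothetical helper, for illustration only; not part of the package API.
rescale_edit_distance <- function(seqs) {
  # Pairwise Levenshtein (edit) distances between all sequences
  edit_distance <- utils::adist(seqs)
  # Normalize by the largest observed distance, then scale by the
  # median length of the input sequences
  edit_distance / max(edit_distance) * stats::median(nchar(seqs))
}

# Toy nucleotide sequences (assumed data, all of length 8)
seqs <- c("ACGTACGT", "ACGTTCGT", "ACGAACGA", "TTTTACGT")
scaled <- rescale_edit_distance(seqs)
print(round(scaled, 2))
```

With this scaling, the most dissimilar pair ends up at a distance equal to the median sequence length. The AA case follows the same pattern, with (1 - score/max(score)) in place of edit_distance/max(edit_distance).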
fine_cluster_seqs(
seqs,
type = "AA",
big_memory_brute = FALSE,
method = "levenshtein",
substitution_matrix = "BLOSUM100",
cluster_fun = "none",
cluster_method = "complete"
)
character vector, DNAStringSet or AAStringSet
character, either "AA" or "DNA", specifying the type of seqs
logical; attempt to cluster more than 4000 sequences? Clustering is quadratic in the number of sequences, so this will take a long time and might exhaust memory
one of 'substitutionMatrix' or 'levenshtein'
a character vector naming a substitution matrix available in Biostrings, or a substitution matrix itself
character, one of "hclust" or "none", determining whether distance matrices should also be clustered with hclust
character passed to hclust
list
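# Read the demo FASTA shipped with the package and keep the first 100 sequences
fasta_path = system.file('extdata', 'demo.fasta', package='CellaRepertorium')
aaseq = Biostrings::readAAStringSet(fasta_path)[1:100]
# Compute distances and also cluster them with hclust
cls = fine_cluster_seqs(aaseq, cluster_fun = 'hclust')
# Plot the clustering stored in the returned list
plot(cls$cluster)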
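For readers unfamiliar with the hclust step, the sketch below shows hierarchical clustering of a sequence distance matrix using only base R. It is an assumption-laden illustration, not the package's code: the toy sequences are made up, `utils::adist()` stands in for the package's distance computation, and the choice of `k = 2` groups is arbitrary.

```r
# Toy nucleotide sequences (assumed data, not the package's demo file)
seqs <- c("ACGTACGT", "ACGTTCGT", "ACGAACGA", "TTTTACGT", "TTTTACGA")

# Pairwise Levenshtein distances, converted to a 'dist' object for hclust
d <- stats::as.dist(utils::adist(seqs))

# Complete-linkage hierarchical clustering; "complete" matches the
# default value of cluster_method in the usage above
hc <- stats::hclust(d, method = "complete")

# Cut the tree into an arbitrary number of groups (k = 2 is an assumption)
clusters <- stats::cutree(hc, k = 2)
print(clusters)
```

Calling `plot(hc)` on the result draws the dendrogram, which is analogous to the `plot(cls$cluster)` call in the example above.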