The distance between AA sequences is defined as (1 - score/max(score)) times the median length of the input sequences. The distance between nucleotide sequences is defined as edit_distance/max(edit_distance) times the median length of the input sequences.
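
The two rescalings described above can be sketched in base R. This is an illustration under stated assumptions, not the package's internal code: the helper names `scale_edit_distance` and `scale_score_distance` are hypothetical, `utils::adist` stands in for whatever edit-distance routine is actually used, and the score matrix is assumed to have a positive maximum.

```r
# Hypothetical helpers for illustration; not part of the package API.

# Rescale pairwise edit distances by their maximum, then multiply by
# the median sequence length.
scale_edit_distance <- function(edit_distance, seq_lengths) {
  max_d <- max(edit_distance)
  if (max_d == 0) return(edit_distance)  # all sequences identical
  edit_distance / max_d * stats::median(seq_lengths)
}

# Turn pairwise similarity scores into distances on the same scale.
# Assumes max(score) > 0.
scale_score_distance <- function(score, seq_lengths) {
  (1 - score / max(score)) * stats::median(seq_lengths)
}

# Toy nucleotide input; adist() computes Levenshtein distances.
toy_dna <- c("ACGT", "ACGA", "TTTT", "ACGTAC")
raw <- utils::adist(toy_dna)
scaled <- scale_edit_distance(raw, nchar(toy_dna))
```

With this scaling the largest pairwise distance always equals the median sequence length, so distances from inputs of different typical lengths stay on a roughly comparable scale.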

Usage

fine_cluster_seqs(
  seqs,
  type = "AA",
  big_memory_brute = FALSE,
  method = "levenshtein",
  substitution_matrix = "BLOSUM100",
  cluster_fun = "none",
  cluster_method = "complete"
)

Arguments

seqs

character vector, DNAStringSet or AAStringSet

type

character, either "AA" or "DNA", specifying the type of seqs

big_memory_brute

logical. Should clustering be attempted on more than 4000 sequences? Clustering is quadratic in the number of sequences, so this will take a long time and might exhaust memory

method

one of 'substitutionMatrix' or 'levenshtein'

substitution_matrix

a character vector naming a substitution matrix available in Biostrings, or a substitution matrix itself

cluster_fun

character, one of "hclust" or "none", determining if distance matrices should also be clustered with hclust

cluster_method

character, the clustering method passed to hclust (for example "complete")
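
To illustrate what clustering a distance matrix with hclust involves, here is a minimal base R sketch. It does not call fine_cluster_seqs; the toy sequences are made up, and `utils::adist` is used as an assumed stand-in for the Levenshtein distance step.

```r
# Toy amino acid sequences (made up for illustration).
toy <- c(a = "CASSLG", b = "CASSLA", c = "CASRPG", d = "CAWWWWYT")

# Pairwise Levenshtein distances as a square matrix.
d <- utils::adist(toy)

# hclust() expects a "dist" object; "complete" is the linkage method,
# playing the role of the cluster_method argument.
hc <- stats::hclust(stats::as.dist(d), method = "complete")

# Cut the tree into two groups: a, b and c fall together, d is alone.
groups <- stats::cutree(hc, k = 2)
```

Complete linkage merges clusters by their largest pairwise distance, so it tends to produce compact groups; other hclust methods such as "average" or "single" can be substituted in the same call.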

Value

list

Examples

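# Read the demo amino acid sequences shipped with the package; keep the first 100
fasta_path = system.file('extdata', 'demo.fasta', package='CellaRepertorium')
aaseq = Biostrings::readAAStringSet(fasta_path)[1:100]
# Compute pairwise distances and cluster them with hclust
cls = fine_cluster_seqs(aaseq, cluster_fun = 'hclust')
# Plot the resulting clustering
plot(cls$cluster)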