Calculate distances and perform hierarchical clustering on a set of sequences

The distance between AA sequences is defined as (1 - score/max(score)) times the median length of the input sequences. The distance between nucleotide sequences is defined as edit_distance/max(edit_distance) times the median length of the input sequences.
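The nucleotide definition above can be sketched in base R. This is only an illustration of the scaling, not the package's implementation: it assumes `utils::adist()` as a stand-in for the edit distance, and the toy sequences are hypothetical.

```r
# Hypothetical toy nucleotide sequences (not from the package's demo data).
seqs <- c("ACGTACGT", "ACGTTCGT", "ACGAACG", "TTTTACGT")

# Pairwise Levenshtein (edit) distances, computed with base R's adist().
edit_distance <- utils::adist(seqs)

# Rescale by the largest observed distance, then multiply by the median
# sequence length, following the definition in the text above.
median_length <- median(nchar(seqs))
scaled <- edit_distance / max(edit_distance) * median_length

print(round(scaled, 2))
```

Under this scaling the most distant pair always sits at exactly the median sequence length, so distances are comparable in units of "residues" across inputs of different lengths.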

fine_cluster_seqs(
  seqs,
  type = "AA",
  big_memory_brute = FALSE,
  method = "levenshtein",
  substitution_matrix = "BLOSUM100",
  cluster_fun = "none",
  cluster_method = "complete"
)

Arguments

seqs: character vector, DNAStringSet or AAStringSet
type: character, either "AA" or "DNA", specifying the type of seqs
big_memory_brute: logical; attempt to cluster more than 4000 sequences? Clustering is quadratic in the number of sequences, so this will take a long time and might exhaust memory
method: one of 'substitutionMatrix' or 'levenshtein'
substitution_matrix: a character vector naming a substitution matrix available in Biostrings, or a substitution matrix itself
cluster_fun: character, one of "hclust" or "none", determining if distance matrices should also be clustered with hclust
cluster_method: character passed to hclust

Value

list

Examples

fasta_path = system.file('extdata', 'demo.fasta', package='CellaRepertorium')
aaseq = Biostrings::readAAStringSet(fasta_path)[1:100]
cls = fine_cluster_seqs(aaseq, cluster_fun = 'hclust')
plot(cls$cluster)
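To illustrate what cluster_fun = "hclust" and cluster_method control, here is a minimal base R sketch that clusters a distance matrix with `hclust` and cuts the tree into groups. It does not call `fine_cluster_seqs`; the toy sequences, the use of `utils::adist()` for distances, and the choice of k = 2 are all assumptions made for illustration.

```r
# Hypothetical toy amino acid sequences forming two visibly distinct groups.
seqs <- c("CASSLGQAYEQYF", "CASSLGQGYEQYF", "CASSPGTGYEQYF",
          "CAWSVDRNTEAFF", "CAWSVDRGTEAFF")

# Pairwise edit distances, converted to a 'dist' object for hclust().
d <- stats::as.dist(utils::adist(seqs))

# Complete-linkage hierarchical clustering; "complete" mirrors the default
# value of cluster_method shown in the usage above.
hc <- stats::hclust(d, method = "complete")

# Cut the dendrogram into two groups (k = 2 is an arbitrary choice here).
groups <- stats::cutree(hc, k = 2)
print(groups)
```

Calling plot(hc) draws the dendrogram, in the same way the example above plots cls$cluster.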
