Information

Is there a simple way to generate sets of sequences that re-capitulate the topology of pre-defined phylogenetic tree

Is there a simple way to generate sets of sequences that re-capitulate the topology of pre-defined phylogenetic tree



We are searching data for your request:

Forums and discussions:
Manuals and reference books:
Data from registers:
Wait the end of the search in all databases.
Upon completion, a link will appear to access the found materials.

I would like to generate ensembles of N sequences in such a way that the phylogenetic trees obtained from my ensembles (e.g., by neighbor joining) have the same (or nearly the same) topology as a specified "real" tree obtained from an alignment of N proteins.

My starting point would be an empirically determined amino acid substitution matrix. The matrix would be used to generate a "random" sequence representing the root of my simulated tree. The root sequence would then be evolved, "split", and its products evolved, iteratively until N sequences are generated. Somehow, the topology of the "real" tree would be used to guide the branching process.

This problem arises in determining the significance of correlations between mutations at different sites in a protein (specifically, the "mutual information" measure). One would like to determine the contribution of "pure phylogenetic" correlations to the mutual information for a given alignment of proteins.

The answer to my question seems connected to the "parametric bootstrap procedure" commonly used in phylogeny reconstruction. In fact, the reference below seems to use this procedure to generate the ensembles I describe above. However, their procedure is not explained in detail. This makes me think that there may be a simple answer to my question.

reference :

"Separation of phylogenetic and functional associations in biological sequences by using the parametric bootstrap" K. R. Wollenberg and W. R. Atchley (2000) PNAS 97, 3288-3291


Yes, what you are looking for seems to be an alignment simulator, which is quite commonly used in phylogenetics. Given a phylogenetic tree and a substitution model, you can use for instance INDELible or DAWG to simulate the extant sequences (N sequences at the tips). These programs can also simulate the indel process (gaps), which means your final sequences won't have, by default, the same length -- but you can change the settings to exclude insertions/deletions.

There is no absolute guarantee that the tree estimated from the simulated sequences will follow the "true" one (used to simulate the sequences), but they should be close enough if:

  1. you don't simulate gaps
  2. the branch lengths are not too short
  3. the sequences are large enough

Chapter 4 Visualization and annotation of phylogenetic trees: ggtree

There are many software packages and webtools that are designed for displaying phylogenetic trees, such as TreeView (Page 2002) , FigTree 7 , TreeDyn (Chevenet et al. 2006) , Dendroscope (Huson and Scornavacca 2012) , EvolView (Z. He et al. 2016) and iTOL (Letunic and Bork 2007) , etc.. Only several of them, such as FigTree, TreeDyn and iTOL, allow users to annotate the trees with coloring branches, highlighted clades and tree features. However, their pre-defined annotating functions are usually limited to some specific phylogenetic data. As phylogenetic trees become more widely used in multidisciplinary studies, there is an increasing need to incorporate various types of the phylogenetic covariates and other associated data from different sources into the trees for visualizations and further analyses. For instance, influenza virus has a wide host range, diverse and dynamic genotypes and characteristic transmission behaviors that are mostly associated with the virus evolution and essentially among themselves. Therefore, in addition to standalone applications that focus on each of the specific analysis and data type, researchers studying molecular evolution need a robust and programmable platform that allows the high levels integration and visualization of many of these different aspects of data (raw or from other primary analyses) over the phylogenetic trees to identify their associations and patterns.

To fill this gap, we developed ggtree, a package for the R programming language (R Core Team 2016) released under the Bioconductor project (Gentleman et al. 2004) . The ggtree is built to work with phylogenetic data object generated by treeio, and display tree graphics with ggplot2 package (Wickham 2016) that was based on the grammar of graphics (Wilkinson et al. 2005) .

The R language is increasingly being used in phylogenetics. However, a comprehensive package, designed for viewing and annotating phylogenetic trees, particularly with complex data integration, is not yet available. Most of the R packages in phylogenetics focus on specific statistical analyses rather than viewing and annotating the trees with more generalized phylogeny-associated data. Some packages, including ape (Paradis, Claude, and Strimmer 2004) and phytools (Revell 2012) , which are capable of displaying and annotating trees, are developed using the base graphics system of R. In particular, ape is one of the fundamental package for phylogenetic analysis and data processing. However, the base graphics system is relatively difficult to extend and limits the complexity of tree figure to be displayed. OutbreakTools (Jombart et al. 2014) and phyloseq (McMurdie and Holmes 2013) extended ggplot2 to plot phylogenetic trees. The ggplot2 system of graphics allows rapid customization and exploration of design solutions. However these packages were designed for epidemiology and microbiome data respectively and did not aim to provide a general solution for tree visualization and annotation. The ggtree package also inherits versatile properties of ggplot2, and more importantly allows constructing complex tree figures by freely combining multiple layers of annotations using the tree associated data imported from different sources (importation using treeio package is detailed in Chapter 2).


Abstract

The fourth industrial revolution (I 4.0) is paving the way for change in manufacturing systems. A logical enabler for dynamic and adaptive manufacturing systems including smart automated guided vehicles (AGVs) is presented. It can respond to requests for changing operations sequences received digitally or via distributed sensors, and change the processing route according to pre-planned flow sequences and pre-determined alternatives. A novel method for generating a master assembly network with alternative sequences based on legacy assembly data for a product family is developed. A master assembly network is a generic multiple alternative assembly sequences for a group of product variants belonging to a family where they share some parts and have common product structure. The assembly network with alternative sequences for a new variant is extracted from the master assembly network. These alternative sequences increase the flexibility and adaptability of the assembly system to handle workshop disruptions such as change orders, machine breakdowns and tool failures. The developed method is inspired by the phylogenetic networks used in biology, namely the soft-wired galled network. A Genetic Algorithm is developed to generate the master assembly network that summarizes a set of conflicting rooted assembly sequence trees. A family of three control valves is used as a case study. The proposed method can be utilized in any manufacturing system that use alternative assembly sequence including those utilizing smart AGVs in and Industry 4.0 dynamic environment. The developed method decreases the time and cost of introducing a new product variant as well as increases the responsiveness of the manufacturing system.


Author Summary

There are millions of sequences in the human genome that perform essential functions, such as protein-coding exons, noncoding RNAs, and regulatory sequences that control the transcription of genes. However, these functional sequences are embedded in a background of DNA that serves no discernible function. Thus, a major challenge in the field of genomics is the accurate identification of functional sequences in the human genome. One approach to identify functional sequences is to align the genome sequences of many divergent species and search for sequences whose similarity has been maintained during evolution. We have developed GERP++, a software tool that utilizes this “comparative genomics” approach to identify putatively functional sequences. Given a multiple sequence alignment, GERP++ identifies sites under evolutionary constraint, i.e., sites that show fewer substitutions than would be expected to occur during neutral evolution. GERP++ then aggregates these sites into longer, potentially functional sequences called constrained elements. Using GERP++ results in improved resolution of functional sequence elements in the human genome and reveals that a higher proportion of the human genome is under evolutionary constraint (∼7%) than was previously estimated.

Citation: Davydov EV, Goode DL, Sirota M, Cooper GM, Sidow A, Batzoglou S (2010) Identifying a High Fraction of the Human Genome to be under Selective Constraint Using GERP++. PLoS Comput Biol 6(12): e1001025. https://doi.org/10.1371/journal.pcbi.1001025

Editor: Wyeth W. Wasserman, University of British Columbia, Canada

Received: August 27, 2010 Accepted: November 8, 2010 Published: December 2, 2010

Copyright: © 2010 Davydov et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: The work was partly supported by an Encode subcontract to AS and SB (PI, Richard Myers), NIH/NHGRI. This work was supported by US National Library of Medicine (K22 LM008261 and T15 LM007033). This work has been supported in part by NSF grant #0347952. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.


Consensus tree method for generating master assembly sequence

An assembly process plan for a given product provides the sequence of assembly operations, their times as well as the required tools and fixtures for each operation. Much research has been done on automating and optimizing assembly sequence generation as the most important part of an assembly process plan. A novel method for generating the assembly sequence of a given product based on available assembly sequence data of similar products is presented. The proposed method uses a binary tree form to represent the assembly sequences of an existing family of products. A Genetic Algorithm is employed to find the consensus tree that represents the set of all assembly sequence trees with minimum total dissimilarity distance. This is similar to defining Generic Bill-of-Material. The generated consensus tree serves as a master assembly sequence for the product family. The assembly sequence for a new product variant that falls within, or significantly overlaps with, the scope of the considered family of products can be directly extracted from the derived master assembly sequence tree. The developed method is demonstrated using a family of three control valves. This novel method greatly simplifies and enhances automatic assembly sequence generation and minimizes subsequent modifications, hence, reduces assembly planning cost and improves productivity.

This is a preview of subscription content, access via your institution.


Is there a simple way to generate sets of sequences that re-capitulate the topology of pre-defined phylogenetic tree - Biology

Authors: Gopalacharyulu, Lindfors, Bounsaythip, Kivioja, Yetukuri, Hollmen, Oresic

Motivation: Integration of heterogeneous data in life sciences is a growing and recognized challenge. The problem is not only to enable the study of such data within the context of a biological question, but also more fundamentally how to represent the available knowledge and make it accessible for mining.

Results: Our integration approach is based on the premise that relationships between biological entities can be represented as a complex network. The context dependency is achieved by a judicious use of distance measures on these networks. The biological entities and the distances between them are mapped for the purpose of visualization into the lower-dimensional space using the Sammon&rsquos mapping. The system implementation is based on a multi-tier architecture using a native XML database and a software tool for querying and visualizing complex biological networks. The functionality of our system is demonstrated with two examples: (1) A multiple pathway retrieval, in which, given a pathway name, the system finds all the relationships related to the query by checking available metabolic pathway, transcriptional, signaling, protein-protein interaction, and ontology annotation resources and (2) A protein neighborhood search, in which given a protein name, the system finds all its connected entities within a specified depth. These two examples show that our system is able to conceptually traverse different databases to produce testable hypotheses and lead towards answers to complex biological questions.

Authors: Sharan, Myers

Motivation: Many signals in biological sequences are based on the presence or absence of base signals and their spatial combinations. One of the best known examples in this regard is the signal identifying a core promoter---the site at which the basal transcription machinery starts the transcription of a gene. Our goal is a fully automatic pattern recognition system for a family of sequences that simultaneously discovers the base signals, their spatial relationships and a classifier based upon them.

Results: In this paper we present a general method for characterizing a set of Sequences by their recurrent motifs. Our approach relies on novel probabilistic models for DNA binding sites and modules of binding sites, on algorithms to learn them from data, and on a support vector machine that uses the learned models to classify a set of sequences. We demonstrate the applicability of our approach to diverse instances, ranging from families of promoter sequences to a data set of intronic sequences flanking alternatively spliced exons. On a core promoter data set our results are comparable to the state-of-the-art McPromoter. On a data set of alternatively spliced exons we outperform a previous approach. We also achieve high success rates in recognizing cell cycle regulated genes. These results demonstrate that a fully automatic pattern recognition algorithm can meet or exceed the performance of hand-crafted approaches.

Availability: The software and data sets are available from the authors upon request.

Authors: Dolan, Ni, Camon, Blake

Motivation: The Gene Ontology [GO] is widely used to annotate molecular attributes of genes and gene products. Multiple groups undertaking functional annotations of genomes contribute their annotation sets to the GO database resource and these data are subsequently used in comparative functional analysis research. Although GO curators adhere to the same protocols and standards while assigning GO annotations, the specific procedure followed by each annotation group can vary. Since differences in application of annotation standards would dilute the effectiveness of comparative analysis, methods for assessing annotation consistency are essential. The development of methodologies that are broadly applicable for the assessment of GO annotation consistency is an important issue for the comparative genomics community.

Results: We have developed a methodology for assessing the consistency of GO annotations provided by different annotation groups. The method is completely general and can be applied to compare any two sets of GO annotations. This is the first attempt to assess cross-species GO annotation consistency. Our method compares annotation sets utilizing the hierarchical structure of the GO to compare GO annotations between orthologous gene pairs. The method produces a report on the annotation consistency and inconsistency for each orthologous pair. We present results obtained by comparing GO annotations for mouse and human gene sets.

Availability: The complete current MGI_GOA GO annotation consistency report is available online at http://www.spatial.maine.edu/

Authors: Käll, Krogh, Sonnhammer

When predicting sequence features like transmembrane topology, signal peptides, coil-coil structures, protein secondary structure, or genes, extra support can be gained from homologs. We here present a general HMM decoding algorithm that combines probabilities for sequence features of homologs by considering the average of the posterior label probability of each position in a global sequence alignment. The algorithm is an extension of the previously described "optimal accuracy" decoder, allowing homology information to be used. It was benchmarked using an HMM for transmembrane topology and signal peptide prediction, Phobius. We found that performance was substantially increased when incorporating information from homologs.

Authors: Cheung, Yip, Smith, deKnikker, Masiar, Gerstein

Motivation: As the semantic web technology is maturing and the need for life sciences data integration over the web is growing, it is important to explore how data integration needs can be addressed by the semantic web. The main problem that we face in data integration is a lack of widely-accepted standards for expressing the syntax and semantics of the data. We address this problem by exploring the use of semantic web technologies — including Resource Description Framework (RDF), RDF Site Summary (RSS), relational-database-to-RDF mapping (D2RQ), and native RDF data repository — to represent, store, and query both metadata and data across life sciences datasets.

Results: As many biological datasets are presently available in tabular format, we introduce an RDF structure into which they can be converted. Also, we develop a prototype web-based application called YeastHub that demonstrates how a life sciences data warehouse can be built using a native RDF data store (Sesame). This data warehouse allows integration of different types of yeast genome data provided by different resources in different formats including the tabular and RDF formats. Once the data are loaded into the data warehouse, RDF-based queries can be formulated to retrieve and query the data in an integrated fashion.

Availability: The YeastHub web site is accessible via the following URL: http://yeasthub.gersteinlab.org

Authors: Mahony, Golden, Smith, Benos

Motivation: One of the limiting factors in deciphering transcriptional regulatory networks is the effectiveness of motif-finding software. An emerging avenue for improving motif-finding accuracy aims to incorporate generalised binding constraints of related transcription factors (TFs), named familial binding profiles (FBPs), as priors in motif identification methods. A motif-finder can thus be “biased” towards finding motifs from a particular TF family. However, current motif-finders allow only a single FBP to be used as a prior in a given motif-finding run. In addition, current FBP construction methods are based on manual clustering of position specific scoring matrices (PSSMs) according to the known structural properties of the TF proteins. Manual clustering assumes that the binding preferences of structurally similar TFs will also be similar. This assumption is not true, at least not for some TF families. Automatic PSSM clustering methods are thus required for augmenting the usefulness of FBPs.

Results: A novel method is developed for automatic clustering of PSSM models. The resulting FBPs are incorporated into the SOMBRERO motif-finder, significantly improving its performance when finding motifs related to those that have been incorporated. SOMBRERO is thus the only existing de novo motif-finder that can incorporate knowledge of all known PSSMs in a given motif-finding run.

Availability: The methods outlined will be incorporated into the next release of SOMBRERO, which is available from http://bioinf.nuigalway.ie/sombrero.

Authors: Kou, Cohen, Murphy

Protein name extraction is an important step in mining biological literature. We describe two new methods for this task: semiCRFs and dictionary HMMs. SemiCRFs are a recently-proposed extension to conditional random fields that enables more effective use of dictionary information as features. Dictionary HMMs are a technique in which a dictionary is converted to a large HMM that recognizes phrases from the dictionary, as well as variations of these phrases. Standard training methods for HMMs can be used to learn which variants should be recognized. We compared the performance of our new approaches to that of Maximum Entropy (MaxEnt) and normal CRFs on three datasets, and improvement was obtained for all four methods over the best published results for two of the datasets. CRFs and semiCRFs achieved the highest overall performance according to the widely-used F-measure, while the dictionary HMMs performed the best at finding entities that actually appear in the dictionary—the measure of most interest in our intended application.

Authors: Prakash, Tompa

BLAST [Altschul et al., 1990] statistics have been shown to be extremely useful for searching for significant similarity hits, for amino acid and nucleotide sequences. While these statistics are well understood for pairwise comparisons, there has been little success developing statistical scores for multiple alignments. In particular, there is no score for multiple alignment that is well founded and treated as a standard. We extend the BLAST theory to multiple alignments. Following some simple assumptions, we present and justify a significance score for multiple segments of a local multiple alignment. We demonstrate its usefulness in distinguishing high and moderate quality multiple alignments from low quality ones, with supporting experiments on orthologous vertebrate promoter sequences.

Authors: Narayanaswamy, Ravikumar, Vijay-Shanker

Phosphorylation is an important biochemical reaction that plays a critical Role in signal transduction pathways and cell-cycle processes. A text mining system to extract the phosphorylation relation from the literature is reported. The focus of this paper is on the new methods developed and implemented to connect and merge pieces of information about phosphorylation mentioned in different sentences in the text. The effectiveness and accuracy of the system as a whole as well as that of the methods for extraction beyond a clause/sentence is evaluated using an independently annotated dataset, the Phospho.ELM database. The new methods developed to merge pieces of information from different sentences are shown to effective in significantly raising the recall without much difference in precision.

Authors: Nagarajan, Jones, Keich

Motivation: The efficient and accurate computation of p-values is an essential requirement for motif-finding and alignment tools. We show that the approximation algorithms used in two popular motif-finding programs, MEME and Consensus, can fail to accurately compute the p-value.

Results: We present two new algorithms: one for the evaluation of the p-values of a range of motif scores, and a faster one for the evaluation of the p-value of a single motif score. Both exhibit more reliability than existing algorithms, and the latter algorithm is comparable in speed to the fastest existing method.

Availability: The algorithms described in this paper are available from http://www.cs.cornell.edu/

Authors: Batt, Ropers, de Jong, Geiselmann, Mateescu, Page, Schneider

Motivation: The modeling and simulation of genetic regulatory networks has created the need for tools for model validation. The main challenges of model validation are the achievement of a match between the precision of model predictions and experimental data, as well as the efficient and reliable comparison of the predictions and observations.

Results: We present an approach towards the validation of models of genetic regulatory networks addressing the above challenges. It combines a method for qualitative modeling and simulation with techniques for model checking, and is supported by a new version of the computer tool GNA. The model-validation approach has been applied to the analysis of the network controlling the nutritional stress response in Escherichia coli. Availability: GNA and the model of the stress response network are available at http://www-helix.inrialpes.fr/gna

Authors: Hahn, Lee

Motivation: The recent release of the draft sequence of the chimpanzee genome is an invaluable resource for finding genome-wide genetic differences that might explain phenotypic differences between humans and chimpanzees.

Results: In this paper, we describe a simple procedure to identify potential human-specific frameshift mutations that occurred after the divergence of human and chimpanzee. The procedure involves collecting human coding exons bearing insertions or deletions compared to the chimpanzee genome and identification of homologs from other species supporting that the mutations are human-specific. Using this procedure, we identified nine genes, BASE, DNAJB3, FLJ33674, HEJ1, NTSR2, RPL13AP, SCGB1D4, WBSCR27, and ZCCHC13, that show human-specific alterations including truncations of the C-terminus. In some cases, the frameshift mutation results in gene inactivation or decay. In other cases, the altered protein seems to be functional. This study demonstrates that even the unfinished chimpanzee genome sequence can be useful in identifying modification of genes that are specific to the human lineage and therefore could potentially be relevant to the study of the acquisition of human-specific traits.

Authors: Ben-Hur, Noble Paper

Despite advances in high throughput methods for discovering protein-protein interactions, the interaction networks of even well-studied model organisms are sketchy at best, highlighting the continued need for computational methods to help direct experimentalists in the search for novel interactions.

We present a kernel method for predicting protein-protein interactions using a combination of data sources, including protein sequences, Gene Ontology annotations, local properties of the network, and homologous interactions in other species. Whereas protein kernels proposed in the literature provide a similarity between single proteins, prediction of interactions requires a kernel between pairs of proteins. We propose a pairwise kernel that converts a kernel between single proteins into a kernel between pairs of proteins, and we illustrate the kernel's effectiveness in conjunction with a support vector machine classifier. Furthermore, we obtain improved performance by combining several sequence-based kernels based on k-mer frequency, motif and domain content and by further augmenting the pairwise sequence kernel with features that are based on other sources of data.

We apply our method to predict physical interactions in yeast using data from the BIND database. At a false positive rate of 1% the classifier retrieves close to 80% of a set of trusted interactions. We thus demonstrate the ability of our method to make accurate predictions despite the sizeable fraction of false positives that are known to exist in interaction databases.

Authors: Tuller, Chor

Motivation: Maximum likelihood (ML) is an increasingly popular optimality criterion for selecting evolutionary trees (Felsenstein, 1981). Yet the computational complexity of ML was open for over 20 years, and only recently resolved by the authors (2005) for the Jukes Cantor model of substitution and its generalizations. It was proved that reconstructing the ML tree is computationally intractable (NP hard). In this work we explore three directions, which extend that result.

  1. We show that ML under the assumption of molecular clock is still computationally intractable (NP hard).
  2. We show that not only is it computationally intractable to find the exact ML tree, even approximating the logarithm of the maximum likelihood for any multiplicative factor smaller than 1.00175 is computationally intractable.
  3. We develop an algorithm for approximating log-likelihood under the condition that the input sequences are sparse. It employs any approximation algorithm for parsimony, and asymptotically achieves the same approximation ratio. We note that ML reconstruction for sparse inputs is still hard under this condition, and furthermore many real datasets satisfy it.

Authors: Ye, Osterman, Overbeek, Godzik

Motivation: Proteins work together in pathways and networks, collectively comprising the cellular machinery. A subsystem (a generalization of pathway concept) is a group of related functional roles (such as enzymes) jointly involved in a specific aspect of the cellular machinery. Subsystems provide a natural framework for comparative genome analysis and functional annotation. A subsystem may be implemented in a number of different functional variants in individual species. In order to reliably project functional assignments across multiple genomes, we have to be able to identify the variants implemented in each genome. The analysis of such variants across diverse species is an interesting problem by itself and may provide new evolutionary insights. However, no computational techniques are presently available for an automated detection and analysis of subsystem variants.

Results: Here we formulate the subsystem variant detection problem as finding the minimum number of subgraphs of a subsystem, which is represented as a graph, and solve the optimization problem by integer programming approach. The performance of our method was tested on subsystems encoded in the SEED, a genomic integration platform developed by the Fellowship for Interpretation of Genomes (FIG) as a component of a large-scale effort on comparative analysis and annotation of multiple diverse genomes. Here we illustrate the results obtained for two expert-encoded subsystems of the biosynthesis of Coenzyme A and FMN/FAD cofactors. Applications of variant detection, to support genomic annotations and to assess divergence of species, are briefly discussed in the context of these universally conserved and essential metabolic subsystems.

Authors: Dimmic, Hubisz, Bustamante, Nielsen

Motivation: The evolution of protein sequences is constrained by complex interactions between amino acid residues. Because harmful substitutions may be compensated by other substitutions at neighboring sites, residues can coevolve. We describe a Bayesian phylogenetic approach to the detection of coevolving residues in protein families. This method, Bayesian mutational mapping (BMM), assigns mutations to the branches of the evolutionary tree stochastically, and then test statistics are calculated to determine whether a coevolutionary signal exists in the mapping. Posterior predictive P-values provide an estimate of significance, and specificity is maintained by integrating over uncertainty in the estimation of the tree topology, branch lengths, and substitution rates. A coevolutionary Markov model for codon substitution is also described, and this model is used as the basis of several test statistics.

Results: Results on simulated coevolutionary data indicate that the BMM method can successfully detect nearly all coevolving sites when the model has been correctly specified, and that nonparametric statistics such as mutual information are generally less powerful than parametric statistics. On a dataset of eukaryotic proteins from the phosphoglycerate kinase (PGK) family, interdomain site contacts yield a significantly greater coevolutionary signal than interdomain non-contacts, an indication that the method provides information about interacting sites. Failure to account for the heterogeneity in rates across sites in PGK resulted in a less discriminating test, yielding a marked increase in the number of reported positives at both contact and non-contact sites.

Authors: Jönsson, Heisler, Reddy, Agrawal, Gor, Shapiro, Mjolsness, Meyerowitz

Motivation: The above ground tissues of higher plants are generated from a small region of cells situated at the plant apex called the shoot apical meristem. An important genetic control circuit modulating the size of the Arabidopsis thaliana meristem is a feed-back network between the CLAVATA3 and WUSCHEL genes. Although the expression patterns for these genes do not overlap, WUSCHEL activity is both necessary and sufficient (when expressed ectopically) for the induction of CLAVATA3 expression. However, upregulation of CLAVATA3 in conjunction with the receptor kinase CLAVATA1 results in the down regulation of WUSCHEL. Despite much work, experimental data for this network is incomplete and additional hypotheses are needed to explain the spatial locations and dynamics of these expression domains. Predictive mathematical models describing the system should provide a useful tool for investigating and discriminating among possible hypotheses, by determining which hypotheses best explain observed gene expression dynamics.

Results: We are developing a method using in vivo live confocal microscopy to capture quantitative gene expression data and create templates for computational models. We present two models accounting for the organization of the WUSCHEL expression domain. Our preferred model uses a reaction-diffusion mechanism, in which an activator induces WUSCHEL expression. This model is able to organize the WUSCHEL expression domain. In addition, the model predicts the dynamical reorganization seen in experiments where cells, including the WUSCHEL domain, are ablated, and also predicts the spatial expansion of the WUSCHEL domain resulting from removal of the CLAVATA3 signal.

Authors: Song, Wu, Gusfield

Results: In this paper, we present efficient, practical methods for computing both upper and lower bounds on the minimum number of needed recombinations, and for constructing evolutionary histories that explain the input sequences. We extensively study the efficiency and accuracy of these algorithms on both simulated and real data sets. The algorithms produce very close upper and lower bounds, which match exactly in a surprisingly wide range of data. Thus, with the use of new, very effective lower bounding methods and an efficient algorithm for computing upper bounds, this approach allows the efficient exact computation of the minimum number of needed recombinations, with high frequency in a large range of data. When upper and lower bounds match, evolutionary histories found by our algorithm correspond to most parsimonious histories.

Availability: HapBound and SHRUB, programs implementing the new algorithms discussed in this paper, are available at http://wwwcsif.cs.ucdavis.edu/

Authors: Yu, Zaveljevski, Stevens, Yackovich, Reifman

The classification of protein sequences obtained from patients with various immunoglobulin-related conformational diseases may provide insight into structural correlates of pathogenicity. However, clinical data are very sparse and, in the case of antibody-related proteins, the collected sequences have large variability with only a small subset of variations relevant to the protein pathogenicity (function). On this basis, these sequences represent a model system for development of strategies to recognize the small subset of function-determining variations among the much larger number of primary structure diversifications introduced during evolution. Under such conditions, most protein classification algorithms have limited accuracy. To address this problem, we propose a Support Vector Machine (SVM)-based classifier that combines sequence and 3D structural averaging information. Each amino acid in the sequence is represented by a set of six physicochemical properties: hydrophobicity, hydrophilicity, volume, surface area, bulkiness, and refractivity. Each position in the sequence is described by the properties of the amino acid at that position and the properties of its neighbors in 3D space or its neighbors in the sequence. A structure template is selected to determine neighbors in 3D space and a window size is used to determine the neighbors in the sequence. The test data consists of 209 proteins of human antibody immunoglobulin light chains, each represented by aligned sequences of 120 amino acids. The methodology is applied to the classification of protein sequences collected from patients with and without amyloidosis, and indicates that the proposed modified classifiers are more robust to sequence variability than standard SVM classifiers, improving classification error between 5-25% and sensitivity between 9-17%. The classification results might also suggest possible mechanisms for the propensity of immunoglobulin light chains to amyloid formation.

Authors: Cline, Blume, Cawley, Clark, Hu, Lu, Salomonis, Wang, Williams

Motivation: Many or most mammalian genes undergo alternative splicing, generating a variety of transcripts from a single gene. New information on splice variation is becoming available through technology for measuring expression levels of several exons or splice junctions per gene. We have developed a statistical method, ANOSVA, to detect alternative splicing from expression data. ANOSVA requires no transcript information, so can be applied when the level of annotation data is poor. When validated against spiked clone data, it generated no false positives and few false negatives. We demonstrated ANOSVA with data from a prototype mouse alternative splicing array, yielding a set of genes with evidence of tissue-specific splice variation. These results are available at http://bioinfo.affymetrix.com/Papers/ANOSVA.

Authors: Borgwardt, Ong, Schoenauer, Vishwanathan, Smola, Kriegel

Motivation: Computational approaches to protein function prediction infer protein function by finding proteins with similar sequence, structure, surface clefts, chemical properties, amino acid motifs, interaction partners or phylogenetic profiles. We present a new approach that combines sequential, structural and chemical information into one graph model of proteins. We predict functional class membership of enzymes and non-enzymes using graph kernels and Support Vector Machine classification on these protein graphs.

Results: Our graph model, derivable from protein sequence and structure only, is competitive with vector models that require additional protein information such as the size of surface pockets. If we include this extra information into our graph model, our classifier yields significantly higher accuracy levels than the vector models. Hyperkernels allow us to select and to optimally combine the most relevant node attributes in our protein graphs. We have laid the foundation for a protein function prediction system that integrates protein information from various sources efficiently and effectively.

Availability: More information available via www.dbs.ifi.lmu.de/Mitarbeiter/borgwardt.html.

Authors: Soukup, Lee, Cho

Genome-wide microarray data are often used in challenging classification problems of subtypes of human diseases. However, the identification of a parsimonious robust prediction model that performs consistently well on future independent data has not been successful due to the biased model selection from an extremely large number of candidate models during the classification model search and construction. Furthermore, common criteria of prediction model performance such as classification error rates do not provide a sensitive measure for valuating performance of such astronomic competing models. Also, even though several different classification approaches have been utilized to tackle such classification problems, no direct comparison on these methods have been made. We introduce a novel measure for assessing the performance of a prediction model, the misclassification-penalized posterior (MiPP), the sum of the posterior classification probabilities penalized by the number of incorrectly classified samples. Using MiPP, we implement a forward step-wise cross-validated procedure to find our optimal prediction models with different numbers of features on a training set. Our final robust classification model and its dimension are determined based on a completely independent test data set. This MiPP-based classification modeling approach enables us to identify the most parsimonious robust prediction models only with two or three features on well-known microarray data sets. These models show superior performance to other models in the literature that often have more than 40--100 features in their model construction. Our classification algorithm is available at the Bioconductor web site as the MiPP package.

Authors: Garay-Malpartida, Occhiucci, Alves, Belizário

Motivation: The in vitro studies have shown that the most remarkable catalytic features of caspases, a family of cys-teineproteases, are their stringent specificity to Asp (D) in the S1 subsite and at least four amino acid to the left of scis-sile bound. However, there is little information about the substraterecognition patterns in vivo. The prediction and-characterization of proteolytic cleavage sites in natural sub-strates could be useful for uncovering these structural rela-tionships.

Results: PEST-like sequences rich in the amino acids Ser (S), Thr (T), Pro (P), Glu or Asp (E/D), including Asn (N) and Gln (Q) are adjacent structural/sequential elements in the majority of cleavage site regions of the natural caspase sub-strates described in the literature, supporting its possible implication in the substrate selection by caspases. We de-veloped CaSPredictor, a software which incorporated a PEST-like index and the position-dependent amino acid ma-trices for prediction of caspase cleavage sites in individual proteins and protein datasets. The program predicted suc-cessfully 81% (111/137) of the cleavage sites in experimen-tally verified caspase substrates not annotated in its internal data file. Its accuracy and confidence was estimated as 80% using ROC methodology. The program was much more effi-cient to predict caspase substrates when compared to Pep-tideCutter and PEPS softwares. Finally, the program de-tected potential cleavage sites in the primary sequences of 1644 proteins in a dataset containing 9986 protein entries.

Authors: Woo, Krueger, Kaur, Churchill

Motivation: Compared to two-color microarrays, three-color microarrays can increase design efficiency and power to detect differential expression without additional samples and arrays. Furthermore, three-color microarray technology is currently available at a reasonable cost. Despite the potential advantages, clear guidelines for designing and analyzing three-color experiments do not exist.

Results: We propose a three- and four-color cyclic design (loop) and a complementary graphical representation to help design experiments that are balanced, efficient, and robust to hybridization failures. In theory, three-color loop designs are more efficient than two-color loop designs. Experiments using both two- and three-color platforms were performed in parallel and its outputwere analyzed using linearmixedmodel analysis in R/MAANOVA. These results demonstrate that three-color experiments using the same number of samples (and fewer arrays) will perform no less efficiently than two-color experiments. The improved efficiency of the design is somewhat offset by a reduced dynamic range and increased variability in the three-color experimental system. This result suggests that, with minor technological improvements, three-color microarrays using loop designs could detect differential expression more efficiently than two-color loop designs.

Supplementary Information: Multi-color cyclic design construction methods and examples along with additional results of the experiment are provided at:
http://www.jax.org/staff/churchill/labsite/pubs/yong.

Keywords: three-color microarray, four-color microarray, cyclic design, directed graph, efficiency, Fs-statistics

Authors: Sadka, Linial

In eukaryotes, membranous proteins account for 20-30% of the proteome. Most of these proteins contain one or more transmembrane (TM) domains. These are short segments that transverse the bilayer lipid membrane. Various properties of the TM domains, such as their number, their topology and their arrangement within the membrane are closely related to the protein’s cellular functions. Properties of the TM domains also determine the cellular targeting and localization of these proteins. It is not known, however, whether the information encoded by TM domains suffices for the purpose of classifying proteins into their functional families. This is the question we address here. We introduce an algorithm that creates a profile of each functional family of membranous proteins based only on the amino-acid composition of their TM domains. This is complemented by a classifier program for each such family (to determine whether a given protein belongs to it or not). We find that in most instances TM domains contain enough information to allow an accurate discrimination of

90% specificity amon g unrelated polytopic functional families with the same number of TM domains.

Authors: Li, Meyer, Liu

Motivation: Transcription factors (TFs) regulate gene ex-pression by recognizing and binding to specific regulatory regions on the genome, which in higher eukaryotes can oc-cur far away from the regulated genes. Recently Affymetrix developed the high-density oligonucleotide arrays that tile all the non-repetitive sequences of the human genome at 35-bp resolution. This new array platform allows for the unbiased mapping of in vivo TF binding sequences (TFBSs) using Chromatin ImmunoPrecipitation followed by microarray ex-periments (ChIP-chip). The massive data set generated from these experiments pose great challenges for data analysis.

Results: We developed a fast, scalable and sensitive method to extract TFBSs from ChIP-chip experiments on genome tiling arrays. Our method takes advantage of tiling array data from many experiments to normalize and model the behavior of each individual probe, and identifies TFBSs using a Hidden Markov Model (HMM). When applied to the data of p53 ChIP-chip experiments (Cawley et al., 2004), our method discovered many new high confidence p53 targets including all the regions verified by quantitative PCR . Using a de novo motif finding algorithm MDscan (Liu et al., 2002), we also recovered the p53 motif from our HMM identified p53 target regions. Furthermore, we found substantial p53 motif enrichment in these regions comparing with both ge-nomic background and the TFBSs identified by Cawley et al (2004). Several of the newly identified p53 TFBSs are in known genes’ promoter regions or associated with previ-ously characterized p53-responsive genes.

Authors: Mettu, Lilien, Donald

We cast the problem of identifying protein-protein interfaces, using only unassigned NMR spectra, into a geometric clustering problem. Identifying protein-protein interfaces is critical to understanding inter- and intra-cellular communication, and NMR allows the study of protein interaction in solution. However it is often the case that NMR studies of a protein complex are very time-consuming, mainly due to the bottleneck in assigning the chemical shifts, even if the apo structures of the constituent proteins are known. We study whether it is possible, in a high-throughput manner, to identify the interface region of a protein complex using only unassigned chemical shift and residual dipolar coupling (RDC) data.

We introduce a geometric optimization problem where we must cluster the cells in an arrangement on the boundary of a 3-manifold, where the arrangement is induced by a spherical quadratic form. We show that this formalism derives directly from the physics of RDCs. We present an optimal algorithm for this problem that runs in O(n^3 log n) time for an n-residue protein. We then use this clustering algorithm as a subroutine in a practical algorithm for identifying the interface region of a protein complex from unassigned NMR data. We present the results of our algorithm on NMR data for 7 proteins from 5 protein complexes and show that our approach is useful for high-throughput applications in which we seek to rapidly identify the interface region of a protein complex.

Authors: Dueck, Morris, Frey

We address the problem of multi-way clustering of microarray data using a generative model. Our algorithm, probabilistic sparse matrix factorization (PSMF), is a probabilistic extension of a previous hard-decision algorithm for this problem. PSMF allows for varying levels of sensor noise in the data, uncertainty in the hidden prototypes used to explain the data, and uncertainty as to which prototypes are selected to explain each data vector. We present experimental results demonstrating that our method can better recover functionally-relevant clusterings in mRNA expression data than standard clustering techniques, including hierarchical agglomerative clustering, and we show that by computing probabilities instead of point estimates, our method avoids converging to poor solutions.

Authors: Halperin, Kimmel, Shamir

Motivation: The search for genetic regions associated with complex disease, such as cancer or Alzheimer's disease, is an important challenge that may lead to better diagnosis and treatment. The existence of millions of DNA variations, primarily single nucleotide polymorphisms (SNPs), may allow the fine dissection of such associations. However, studies seeking disease association are limited by the cost of genotyping SNPs. Therefore, it is essential to find a small subset of informative SNPs (tag SNPs) that may be used as good representatives of the rest of the SNPs.

Results: We define a new natural measure for evaluating the prediction accuracy of a set of tag SNPs, and use it to develop a new method for tag SNPs selection. Our method is based on a novel algorithm that predicts the values of the rest of the SNPs given the tag SNPs. In contrast to most previous methods, our prediction algorithm uses the genotype information and not the haplotype information of the tag SNPs. Our method is very efficient, and it does not rely on having a block partition of the genomic region.

We compared our method to two state of the art tag SNP selection algorithms on 58 different genotype data sets from four different sources. Our method consistently found tag SNPs with considerably better prediction ability than the other methods.

Authors: Cheng,Baldi

Motivation: Protein beta-sheets play a fundamental role in protein structure, function, evolution, and bio-engineering. Accurate prediction and assembly of protein beta-sheets, however, remains challenging because protein beta-sheets require formation of hydrogen bonds between linearly distant residues. Previous approaches for predicting beta-sheet topological features, such as beta-strand alignments, in general have not exploited the global covariation and constraints characteristic of beta-sheet architectures.

Results: We propose a modular approach to the problem of predicting/assembling protein beta-sheets in a chain by integrating both local and global constraints in three steps. The first step uses recursive neural networks to predict pairing probabilities for all pairs of inter-strand beta-residues from profile, secondary structure, and solvent accessibility information. The second step applies dynamic programming techniques to these probabilities to derive binding pseudo-energies and optimal alignments between all pairs of beta-strands. Finally, the third step, uses graph matching algorithms to predict the beta-sheet architecture of the protein by optimizing the global pseudo-energy while enforcing strong global beta-strand pairing constraints. The approach is evaluated using cross-validation methods on a large non-homologous dataset and yields significant improvements over previous methods.

Authors: Smith, Sumazin, Das, Zhang

We describe methodology to identify de novo individual and interacting pairs of binding site motifs from ChIP-chip data, using an algorithm that integrates localization data directly into the motif discovery process. We combine matrix-enumeration-based motif discovery with multi-variate regression to evaluate candidate motifs and identify motif interactions. When applied to the HNF localization data of Odom et al. (2004) in liver and pancreatic islets, our methods produce motifs that are either novel or improved known motifs. All motif pairs identified to predict localization are further evaluated according to how well they predict expression in liver and islets, and according to how conserved are the relative positions of their occurrences. We find that interaction models of HNF1 and CDP motifs provide excellent prediction of both HNF1 localization and gene expression in liver. Our results demonstrate that ChIP-chip data can be used to identify interacting binding site motifs.

Authors: Brunette, Brock

De novo protein structure prediction can be formulated as search in a high-dimensional space. One of the most frequently used computational tools to solve such search problems is the Monte Carlo method. We present a novel search technique, called model-based search. This method samples the high-dimensional search space to build an approximate model of the underlying function. This model is incrementally refined in areas of interest, while areas that are not of interest are excluded from further exploration. Model-based search derives its efficiency from the fact that the information obtained during the exploration of the search space is used to guide further exploration. In contrast, Monte Carlo-based techniques are memory-less and exploration is performed based on random walks, ignoring the information obtained in previous steps. Model-based search is applied to protein structure prediction, where search is employed to find the global minimum of the protein's energy landscape. We show that model-based search uses computational resources more efficiently to find lower-energy conformations of proteins when compared to one of the leading protein structure prediction methods, which relies on a tailored Monte Carlo method to perform search. The performance improvements become more pronounced as the dimensionality of the search space increases. We show that model-based search enables more accurate protein structure prediction than previously possible. Furthermore, we believe that similar performance improvements can be expected in other problems that are currently solved using Monte Carlo-based search methods.

Authors: Huang, Morris, Frey Paper

Microarray designs containing millions to hundreds of millions of probes that tile entire genomes are currently being released. Within the next 2 months, our group will release a microarray data set containing over 12,000,000 microarray measurements taken from 37 mouse tissues. A problem that will become increasingly significant in the upcoming era of genome-wide exon-tiling microarray experiments is the removal of cross-hybridization noise. We present a probabilistic generative model for cross-hybridization in microarray data and a corresponding variational learning method for cross-hybridization compensation, GenXHC, that reduces cross-hybridization noise by taking into account multiple sources for each mRNA expression level measurement and prior knowledge of hybridization similarities between the nucleotide sequences of microarray probes and their target cDNAs. The algorithm is applied to a subset of an exon-resolution genome-wide Agilent microarray data set for chromosome 16 of Mus musculus and is found to produce statistically significant reductions in cross-hybridization noise. The denoised data is found to produce enrichment in multiple Gene Ontology-Biological Process (GO-BP) functional groups. The algorithm is found to outperform Robust Multi-array Analysis, another method for cross-hybridization compensation.

Authors: Ko, Murga, Ondrechen

Identification of functional information for a protein from its three-dimensional structure is a major challenge in genomics. The power of theoretical microscopic titration curves (THEMATICS), when coupled with a statistical analysis, provides a method for high-throughput screening for identification of catalytic sites and binding sites with high accuracy and precision. The method requires only the 3D structure of the query protein as input, but performs as well as other methods that depend on sequence alignments and structural similarities.

Authors: Raetsch, Sonnenburg, Schoelkopf

Pre-mRNA alternative splicing greatly increases the complexity of gene expression. Estimates show that more than half of the human genes and at least a third of the genes of less complex organisms are alternatively spliced. In this work we consider one major form of alternative splicing: the exclusion of exons from the transcript. It was shown that alternatively spliced exons have properties that distinguish them from other exons. While most computational studies on alternative splicing only apply to exons which are conserved among species, our method only uses information available to the splicing machinery. We employ advanced machine learning techniques in order to answer two questions: a) Is a certain exon alternatively spliced? b) How to identify yet unidentified exons within verified introns?

We designed a SVM kernel well suited for the task of classifying sequences with motifs having positional preferences. In order to solve the task a), we combine the kernel with additional local sequence information. The resulting SVM based classifier achieves a true positive rate of 48,5% at a false positive rate of 1%. By scanning over EST-confirmed exons we identified 215 potentially alternatively spliced exons. For 10 randomly selected exons we successfully performed biological verification experiments and confirmed 3 novel alternatively spliced exons. To answer question b), we used SVM based predictions to recognize acceptor and donor splice sites. Combined with the abovementioned features we were able to identify 85,2% of skipped exons within verified introns at a false positive rate of 1%.

Authors: Nimrod, Glaser, Steinberg, Ben-Tal, Pupko

Motivation: In silico prediction of functional regions on protein surfaces, i.e., sites of interaction with DNA, ligands, substrates and other proteins, is of utmost importance in various applications in the emerging fields of proteomics and structural genomics. When a sufficient number of homologs is found, powerful prediction schemes can be based on the observation that evolutionarily conserved regions are often functionally important. However, it is challenging to unambiguously identify the boundaries of the functional regions typically, only the principal functionally-important region of the protein is detected, while secondary functional regions with weaker conservation signals are overlooked.

Methods: We present a new methodology, called PatchFinder, that automatically identifies patches of conserved residues that are located in close proximity to each other on the protein surface. PatchFinder is based on the following steps: (1) Assignment of conservation scores to each amino acid position on the protein surface. (2) Assignment of a score to each putative patch, based on its likelihood to be functionally important. The patch of maximum likelihood is considered to be the main functionally-important region, and the search is continued for non-overlapping patches of secondary importance.
Results: We examined the accuracy of the method using the IGPS enzyme, the SH2 domain and a benchmark set of 112 proteins. These examples demonstrated that PatchFinder is capable of identifying both the main and secondary functional patches.

Availability: The PatchFinder program is available at: http://ashtoret.tau.ac.il/

Authors: Hannenhalli, Wang

Motivation: Positional weight matrix (PWM) is derived from a set of experimentally determined binding sites. Here we explore whether there exist subclasses of binding sites, and if the mixture of these subclass-PWMs can improve the binding site prediction. Intuitively, the subclasses correspond to either distinct binding preference of the same transcription factor in different contexts or distinct subtypes of the transcription factor.

Result: We report an Expectation Maximization algorithm adapting the mixture model of Baily and Elkan. We assessed the relative merit of using two subclass-PWMs. The resulting PWMs were evaluated with respect to preferred conservation (relative to mouse) of potential sites in human promoters and expression coherence of the potential target genes. Based on 64 JASPAR vertebrate PWMs, 61%-81% of the cases resulted in a higher conservation using the mixture model. Also in 98% of the cases the expression coherence was higher for the target genes of one of the subclass-PWMs. Our analysis of REB1 sites is consistent with previously discovered subtypes using independent methods. Additionally application of our method to mutated sites for transcription factor LEU3 reveals subclasses that segregate into strongly binding and weakly binding sites with p-value of 0.008. This is the first study which attempts to quantify the subtly different binding specificities of a transcription factor at a large scale and suggests the use of a mixture of PWMs instead of the current practice of using a single PWM for a transcription factor.

Authors: Winstanley, Abeln, Deane

Motivation: At present there exists no age estimate for the different protein structures found in nature. It has become clear from occurrence studies that different folds arose at different points in evolutionary time. An estimation of the age of different folds would be a starting point for many investigations into protein structure evolution: how we arrived at the set of folds we see today. It would also be a powerful tool in protein structure classification allowing us to reassess the available hierarchical methods and perhaps suggest improvements.

Results: We have created the first relative age estimation technique for protein folds. Our method is based on constructing parsimonious scenarios which can describe occurrence patterns in a phylogeny of species. The ages presented are shown to be robust to the different trees or data types used for their generation. They show correlations with other previously used protein age estimators, but appear to be far more discriminating than any previously suggested technique. The age estimates given are not absolutes but they already offer intriguing insights, like the very different age patterns of alpha/beta folds compared to small folds. The alpha/beta folds appear on average to be far older than their small fold counterparts.

Availability: Example trees and additional material are available at:
http://www.stats.ox.ac.uk/

Authors: Price, Jones, Pevzner

De novo repeat family identification is a challenging algorithmic problem of great practical importance. As the number of genome sequencing projects increases, there is a pressing need to identify the repeat families present in large, newly sequenced genomes. We develop a new method for de novo identification of repeat families via extension of consensus seeds our method enables a rigorous definition of repeat boundaries, a key issue in repeat analysis.

Our RepeatScout algorithm is more sensitive and orders of magnitude faster than RECON, the dominant tool for de novo repeat family identification in newly sequenced genomes. Using RepeatScout, we estimate that roughly 2% of the human genome and 4% of mouse and rat genomes consist of previously unannotated repetitive sequence.

Authors: Shmygelska

The problem of finding folding nuclei (a set of native contacts that play an important role in folding) along with identifying folding pathways (a time ordered sequence of folding events) of proteins is one of the most important problems in protein chemistry. Here we propose a novel and simple approach to address this problem as follows: given the topology of the native state, identify native contacts that form folding nuclei based on the graph theoretical approach that considers effective contact order (effective loop closure) as its objective function.

Motivation: A number of computational metho ds for the prediction of folding nuclei already exists in the literature, but most of them rely on restrictive assumptions about the nature of nuclei or the process of folding. Our motivation is to develop a simple, efficient and robust algorithm to find an ensemble of pathways with the lowest effective contact order and to identify contacts that are crucial for folding.

Results: Our approach is different from the previously used methods in that it uses efficient graph algorithms, and does not formulate restrictive assumptions about folding nuclei. Our predictions provide more details concerning the protein folding pathway than most other methods in literature. We demonstrate the success of our approach by predicting folding nuclei for a data set of proteins for which experimental kinetic data is available. We show that our method compares favourably with other methods in literature and its results agree with experimental results.

Authors: Tang, Mechref, Novotny

The emerging glycomics and glycoproteomics projects aim to characterize all forms of glycoproteins in different tissues and organisms. Tandem mass spectrometry is the key experimental methodology for high throughput glycan identification and characterization. Fragmentation of glycans from high energy collision-induced dissociation (CID) generates ions from glycosidic as well as internal cleavages. The cross-ring ions resulting from internal cleavages provide additional information that is important to reveal the type of linkage between monosaccharides. This information, however, is not incorporated by the current programs for analyzing glycan mass spectra. As a result, they can rarely distinguish isomeric oligosaccharides from the mass spectra, which have the same saccharide composition but different types of sequences, branches or linkages.

In this paper, we describe a novel algorithm for glycan characterization using tandem mass spectrometry (MS/MS). This algorithm consists of three steps. First we develop a scoring scheme to identify potential bond linkages between monosaccharides, based on the appearance pattern of cross-ring ions. Next, we use a dynamic programming algorithm to determine the most probable oligosaccharide structures from the mass spectrum. Finally, we re-evaluate these oligosaccharide structures, taking into account the double fragmentation ions. We also show the preliminary results of testing our algorithm on several MS/MS spectra of oligosaccharides.

Authors: Cortes, Simeon, Ruiz de Angulo, Guieysse, Remaud-Simeon, Tran

Motivation: Motion is inherent in molecular interactions. Molecular flexibility must be taken into account in order to develop accurate computational techniques for predicting interactions. Energy-based methods currently used in molecular modeling (i.e. molecular dynamics, Monte Carlo algorithms) are, in practice, only able to compute local motions while accounting for molecular flexibility. However, large-amplitude motions often occur in biological processes. We investigate the application of geometric path planning algorithms to compute such large motions in flexible molecular models. Our purpose is to exploit the efficacy of a geometric conformational search as a filtering stage before subsequent energy refinements.

Results: Two kinds of large-amplitude motion are treated in this paper: protein loop conformational changes (involving protein backbone flexibility) and ligand trajectories to deep active sites in proteins (involving ligand and protein side-chain flexibility). First studies performed using our two-stage approach (geometric search followed by energy refinements) show that, compared to classical molecular modeling methods, quite similar results can be obtained with a performance gain of several orders of magnitude. Furthermore, our results also indicate that the geometric stage can provide by itself highly valuable information to biologists.

Availability: The algorithms have been implemented in the general-purpose motion planning software Move3D (Simeon et al., 2001). We are currently working on an optimized stand-alone library that will be available to the scientific community.

Contact: @laas.fr

Authors: Ernst, Nau, Bar-Joseph

Motivation: Time series expression experiments are used to study a wide range of biological systems. More than 80% of all time series expression datasets are short (8 time points or fewer). These datasets present unique challenges. Due to the large number of genes profiled (often tens of thousands) and the small number of time points many patterns are expected to arise at random. Most clustering algorithms are unable to distinguish between real and random patterns.

Results: We present an algorithm specifically designed for clustering short time series expression data. Our algorithm works by assigning genes to a pre-defined set of model profiles that capture the potential distinct patterns that can be expected from the experiment. We discuss how to obtain such a set of profiles and how to determine the significance of each of these profiles. Significant profiles are retained for further analysis and can be combined to form clusters. We tested our method on both simulated and real biological data. Using immune response data we show that our algorithm can correctly detect the temporal profile of relevant functional categories. Using GO analysis we
show that our algorithm outperforms both general clustering algorithms and algorithms designed specifically for clustering time series gene expression data.

Availability: Information on obtaining a Java implementation with a Graphical User Interface (GUI) is available from http://www.cs.cmu.edu/

Authors: Edgar, Myers

Repeated elements such as satellites and transposons are ubiquitous in eukaryotic genomes. De novo computational identification and classification of such elements is a challenging problem, so repeat annotation of sequenced genomes has historically largely relied on sequence similarity to hand-curated libraries of known repeat families. We present a new approach to de novo repeat annotation that exploits characteristic patterns of local alignments induced by certain classes of repeats. We describe PILER, a package of efficient search algorithms for identifying such patterns. Novel repeats found using PILER are reported for H. sapiens, A. thalania and D. melanogaster. The software is freely available at http://www.drive5.com/piler.

Authors: Noble, Kuehn, Thurman, Yu, Stamatoyannopoulos

In the living cell nucleus, genomic DNA is packaged into chromatin. DNA sequences that regulate transcription and other chromosomal processes are associated with local disruptions, or "openings," in chromatin structure caused by the cooperative action of regulatory proteins. Such perturbations are extremely specific for em cis-regulatory elements, and occur over short stretches of DNA (typically approximately 250 bp). They can be detected experimentally as DNaseI hypersensitive sites (HSs) in vivo, though the process is extremely laborious and costly. The ability to discriminate DNaseI HSs computationally would have a major impact on the annotation and utilization of the human genome.

We found that a supervised pattern recognition algorithm, trained using a set of 280 DNaseI HS and 737 non-HS control sequences from erythroid cells, was capable of de novo prediction of HSs across the human genome with surprisingly high accuracy determined by prospective in vivo validation. Systematic application of this computational approach will greatly facilitate discovery and analysis of functional non-coding elements in the human and other complex

Authors: Tharakaraman, Marino-Ramirez, Sheetlin, Landsman, Spouge

Motivation: The transcription start site (TSS) has been located for an increasing number of genes across several organisms. Statistical tests have shown that some cis-acting regulatory elements have positional preferences with respect to the TSS, but few strategies have emerged for locating elements by their positional preferences. This paper elaborates such a strategy. First, we align promoter regions without gaps, anchoring the alignment on each promoter’s TSS. Second, we apply a novel word-specific mask. Third, we apply a clustering test related to gapless BLAST statistics. The test examines whether any specific word is placed unusually consistently with respect to the TSS. Finally, our program A-GLAM, an extension of the GLAM program, uses significant word positions as new “anchors” to realign the sequences. A Gibbs sampling algorithm then locates putative cis-acting regulatory elements. Usually, Gibbs sampling requires a preliminary masking step, to avoid convergence onto a dominant but uninteresting signal from a DNA repeat. The positional anchors focus A-GLAM on the motif of interest, however, so masking DNA repeats during Gibbs sampling becomes unnecessary.

Results: In a set of human DNA sequences with experimentally characterized TSSs, the placement of 801 octonucleotide words was unusually consistent (multiple-test corrected p < 0.05). Alignments anchored on these words sometimes located statistically significant motifs inaccessible to GLAM or AlignACE.

Availability: The A-GLAM program and a list of statistically significant words are available at: ftp://ftp.ncbi.nih.gov/pub/spouge/papers/archive/AGLAM.

Authors: Jothi, Kann, Przytycka

Motivation: Uncovering protein-protein interaction network is a fundamental step in the quest for understanding the molecular machinery of a cell. This motivates the search for efficient computational methods for predicting such interactions. Among the available predictors are those that are based on the co-evolution hypothesis "evolutionary trees of protein families (that are known to interact) are expected to have similar topologies." Many of these methods are limited by the fact that they can only handle a small number of protein sequences. Also, details on evolutionary tree topology are missing as they use similarity matrices in lieu of the trees.

Results: We introduce MORPH, a new algorithm for predicting protein interaction partners between members of two protein families that are known to interact. Our approach can also be seen as a new method for searching the best superposition of the corresponding evolutionary trees based on tree automorphism group. We discuss relevant facts related to predictability of protein-protein interaction based on their co-evolution. When compared to related computational approaches, our method reduces the search space by about $3 imes 10^5$-fold, and at the same time increases the prediction accuracy of correct binding partners.

Authors: Zheng, Lenert, Sankoff

Motivation: The total order of the genes or markers on a chromosome inherent in its representation as a signed permutation must often be weakened to a partial order in the case of real data. This is due to lack of resolution, where several genes are mapped to the same chromosomal position, to missing data from some of the datasets used to compile a gene order, and to conflicts between these datasets. The available genome rearrangement algorithms, however, require total orders as input. A more general ap-proach is needed to handle rearrangements of gene partial orders.

Results: We formalize the uncertainty in gene order data by representing a chromosome from each genome as a partial order, summarized by a directed acyclic graph (DAG). The rearrangement problem is then to infer a minimal sequence of reversals for transforming any topological sort of one DAG to any one of the other DAG. Each topological sort represents a possible linearization compatible with all the datasets on the chromosome. The set of all possible topological sorts is embedded in each DAG by appropriately augmenting the edge set, so that it becomes a general directed graph (DG). The DGs representing chromosomes of two genomes are combined to produce a bicoloured graph from which we extract a maximal decomposition into alternating coloured cycles, from which an optimal sequence of reversals can usually be identified. We test this approach on simulated incomplete comparative maps and on cereal chromosomal maps drawn from the Gramene browser.

Authors: Yamanishi, Vert, Kanehisa

The metabolic network is an important biological network which relates enzyme proteins and chemical compounds. A large number of metabolic pathways remain unknown nowadays, and many enzymes are missing even in known metabolic pathways. There is therefore an incentive to develop methods to reconstruct the unknown parts of the metabolic network, and to identify genes coding for missing enzymes.

This paper presents new methods to infer enzyme networks from the integration of multiple genomic data and chemical information, in the framework of supervised graph inference. The originality of the methods is the introduction of chemical compatibility as a constraint for refining the network predicted by the network inference engine. The chemical compatibility between two enzymes is obtained automatically from the information encoded by their EC numbers. The proposed method are tested and compared on their ability to infer the enzyme network of the yeast Saccharomyces cerevisiae from four datasets for enzymes with assigned EC numbers: gene expression data, protein localization data, phylogenetic profiles, and chemical compatibility information. It is shown that the prediction accuracy of the network reconstruction consistently improves due to the introduction of chemical constraints, the use of a supervised approach, and the weighted integration of multiple datasets. Finally, we conduct a comprehensive prediction of a global enzyme network consisting all enzyme candidate proteins of the yeast to obtain new biological findings.

Authors: Brejova, Brown, Li, Vinar

We present ExonHunter, a new and comprehensive gene finding system that outperforms existing systems and features several new ideas and approaches. Our system combines numerous sources of information (genomic sequences, ESTs, and protein databases of related species) into a gene finder based on a hidden Markov model in a novel and systematic way. In our framework, various sources of information are expressed as partial probabilistic statements about positions in the sequence and their annotation. We then combine these into the final prediction via a quadratic programming method, which we show is an extension of existing methods. Allowing only partial statements is key to our transparent handling of missing information and coping with the heterogeneous character of individual sources of information. As well, we give a new method for modeling the length distribution of intergenic regions in hidden Markov models.
Results: On a commonly used test set, ExonHunter performs significantly better than the existing gene finders ROSETTA, SLAM, or TWINSCAN, with more than two thirds of genes predicted completely correctly.

Authors: Nabieva, Jim, Agarwal, Chazelle, Singh

Motivation: Determining protein function is one of the most important problems in the post-genomic era. For the typical proteome, there are no functional annotations for one-third or more of its proteins. Recent high-throughput experiments have determined proteome-scale protein physical interaction maps for several organisms. These physical interactions are complemented by an abundance of data about other types of functional relationships between proteins, including genetic interactions, knowledge about co-expression, and shared evolutionary history. Taken together, these pairwise linkages can be used to build whole-proteome protein interaction maps.

Results: We develop a network-flow based algorithm, FunctionalFlow, that exploits the underlying structure of protein interaction maps in order to predict protein function. In cross-validation testing on the yeast proteome, we show that FunctionalFlow has improved performance over previous methods in predicting the function of proteins with few (or no) annotated protein neighbors. By comparing several methods that use protein interaction maps to predict protein function, we demonstrate that FunctionalFlow performs well because it takes advantage of both network topology and some measure of locality. Finally, we show that performance can be improved substantially as we consider multiple data sources and use them to create weighted interaction networks.

Authors: Apostolico, Comin, Parida

The discovery of motifs in biosequences is frequently torn between the rigidity of the model on the one hand and the abundance of candidates on the other. Part of the problem seems rooted, in the various characterizations offered for the notion of a motif, that are typically based either on syntax or on statistics alone. It seems worthwhile to consider alternatives that result from a prudent combination of these two aspects in the model. We introduce and study a notion of extensible motif in a sequence which tightly combines the structure of the motif pattern, as described by its syntactic specification, with the statistical measure of its occurrence count. We show that a combination of appropriate saturation conditions (expressed in terms of minimum number of don't cares compatible with a given list of of occurrences) and the monotonicity of probabilistic scores over regions of constant frequency afford us significant parsimony in the generation and testing of candidate over-represented motifs. The merits of the method are documented by results obtained in implementation, which specifically targeted protein sequence families. In all cases tested, the motif reported in PROSITE as most important in terms of functional/structural relevance emerges among the top thirty extensible motifs returned by our algorithm, often right at the top. Of equal importance seems the fact that the sets of all surprising motifs returned in each experiment are extracted faster and come in much more manageable sizes than would be obtained in the absence of saturation constrains.

Authors: Hu, Yan, Huang, Han, Zhou

Motivation: The rapid accumulation of biological network data translates into an urgent need for computational meth-ods for graph pattern mining. One important problem is to identify recurrent patterns across multiple networks to dis-cover biological modules. However, existing algorithms for frequent pattern mining become very costly in time and space as the pattern sizes and network numbers increase. Currently, no efficient algorithm is available for mining re-current patterns across large collections of genome-wide networks.

Results: We developed a novel algorithm, CODENSE, to efficiently mine frequent coherent dense subgraphs across large numbers of massive graphs. Compared to previous methods, our approach is scalable in the number and size of the input graphs, and adjustable in terms of exact or ap-proximate pattern mining. Applying CODENSE to 39 co-expression networks derived from microarray data sets, we discovered a large number of functionally homogenous clusters and made functional predictions for 169 uncharac-terized yeast genes.

Authors: Beissbarth, Tye-Din, Smyth, Speed, Anderson

Motivation: T-cell response to peptides bound on MHC Class

II molecules is essential for immune recognition of pathogens. T-cells are activated by specific peptide epitopes that are determined within the antigen processing pathways and presented on the surface of other cells bound to MHC molecules. To determine which part of allergenic or pathogenic proteins can stimulate T-cells is important for the treatment of diseases. We sought to take advantage of the falling cost of synthetic, screening grade peptides, and devise a comprehensive, non-hypothesis-driven screen for T-cell epitopes. We were interested in the study of celiac disease (CD) and used the ELISPOT technique to perform a large number of T-cell assays. We therefore needed to compensate for the lack of statistical data analysis methods for ELISPOT assays.

Results: We describe a method to comprehensively screen for T-cell epitopes within a family or group of proteins. We have implemented an algorithm to generate a set of unique short peptide sequences that incorporate all possible epitopes within a group of proteins. T-cell assays were performed in 96 well plates using the ELISPOT assay to screen for responses in CD patients against any epitopes in glutens. We describe a statistical model to fit the data and an EM (Expectation Maximization) algorithm to estimate the parameters of interest and analyze the resulting data.

Availability: Implementations of our algorithms in R or Perl are available at
http://bioinf.wehi.edu.au/folders/immunol.

Authors: Yu, Chen

The classification of high-dimensional data is always a challenge to statistical machine learning. We propose a novel method named shallow feature selection (SFS) that assigns each feature a probability of being selected based on the structure of training data itself. Independent of particular classifiers, the high dimension of biodata can be fleetly reduced to an applicable case for consequential processings. Moreover, to improve both efficiency and performance of classification, these prior probabilities will be further used to specify the distributions of top-level hyperparameters in hierarchical models of Bayesian neural network (BNN), as well as the parameters in Gaussian process models. Three BNN approaches are derived and then applied to identify ovarian cancer from NCI's high-resolution mass spectrometry data, which yielded an excellent performance in 1,000 independent k-fold cross validations (k=2. 10). For instance, an average sensitivity and specificity of 98.56% and 98.42% have been achieved in the 2-fold cross validations, furthermore, only one control and one cancer are misclassified in the leave-one-out cross validation. Some other popular classifiers are also tested as a comparison.

Authors: Ralaivola, Chen, Phung, Bruand, Swamidass, Baldi

Motivation: Small molecules play a fundamental role in organic chemistry and biology. They can be used to probe biological systems and to discover new drugs and other useful compounds. As large datasets of small molecules become increasingly available, it is necessary to develop computational methods that can deal with molecules of variable size and structure and predict their chemical, physical, or biological properties. Results: Here we develop several new classes of kernels for small molecules using their 1D, 2D, and 3D representations. In 1D, we consider string kernels based on SMILES strings. In 2D, we introduce several similarity kernels based on conventional or generalized fingerprints. Generalized fingerprints are derived by counting in different ways sub-paths contained in the graph of bonds, using depth-first searches. In 3D, we consider similarity measures between histograms of pairwise distances between atom classes. These kernels can be computed efficiently and are applied to problems of classification and prediction of mutagenicity, toxicity, and anti-cancer activity on three publicly available datasets. The results derived using cross validation methods are state-of-the-art. Tradeoffs between various kernels are briefly discussed.


A framework for phylogenetic sequence alignment

A phylogenetic alignment differs from other forms of multiple sequence alignment because it must align homologous features. Therefore, the goal of the alignment procedure should be to identify the events associated with the homologies, so that the aligned sequences accurately reflect those events. That is, an alignment is a set of hypotheses about historical events rather than about residues, and any alignment algorithm must be designed to identify and align such events. Some events (e.g., substitution) involve single residues, and our current algorithms can successfully align those events when sequence similarity is great enough. However, the other common events (such as duplication, translocation, deletion, insertion and inversion) can create complex sequence patterns that defeat such algorithms. There is therefore currently no computerized algorithm that can successfully align molecular sequences for phylogenetic analysis, except under restricted circumstances. Manual re-alignment of a preliminary alignment is thus the only feasible contemporary methodology, although it should be possible to automate such a procedure.

This is a preview of subscription content, access via your institution.


References

Li, W.-H. & Graur, D. Fundamentals of Molecular Evolution (Sinauer Associates, Sunderland, Massachusetts, 1991 ).

Yang, Z. H. & Bielawski, J. P. Statistical methods for detecting molecular adaptation. Trends Ecol. Evol. 15, 496–503 (2000).

Smith, N. H., Maynard Smith, J. & Spratt, B. G. Sequence evolution of the porB gene of Neisseria gonorrhoeae and Neisseria meningitidis: evidence of positive Darwinian selection. Mol. Biol. Evol. 12, 363– 370 (1995).

Yamaguchi-Kabata, Y. & Gojobori, T. Reevaluation of amino acid variability of the human immunodeficiency virus type 1 gp120 envelope glycoprotein and prediction of new discontinuous epitopes. J. Virol. 74, 4335–4350 (2000).

Bush, R. M., Fitch, W. M., Bender, C. A. & Cox, N. J. Positive selection on the H3 hemagglutinin gene of human influenza virus A . Mol. Biol. Evol. 16, 1457– 1465 (1999).

Hughes, A. L. & Nei, M. Pattern of nucleotide substitution at major histocompatibility complex class I loci reveals overdominant selection . Nature 335, 167–170 (1988).

Wang, G. L. et al. Xa21D encodes a receptor-like molecule with a leucine-rich repeat domain that determines race-specific recognition and is subject to adaptive evolution. Plant Cell 10, 765– 779 (1998).

Meyers, B. C., Shen, K. A., Rohani, P., Gaut, B. S. & Michelmore, R. W. Receptor-like genes in the major resistance locus of lettuce are subject to divergent selection. Plant Cell 10, 1833–1846 (1998).

Bishop, J. G., Dean, A. M. & Mitchell-Olds, T. Rapid evolution in plant chitinases: molecular targets of selection in plant–pathogen coevolution. Proc. Natl Acad. Sci. USA 97, 5322–5327 ( 2000).

Lee, Y. H., Ota, T. & Vacquier, V. D. Positive selection is a general phenomenon in the evolution of abalone sperm lysin. Mol. Biol. Evol. 12, 231–238 (1995).

Hellberg, M. E. & Vacquier, V. D. Rapid evolution of fertilization selectivity and lysin cDNA sequences in teguline gastropods . Mol. Biol. Evol. 16, 839– 848 (1999).

Yang, Z. H., Swanson, W. J. & Vacquier, V. D. Maximum-likelihood analysis of molecular adaptation in abalone sperm lysin reveals variable selective pressures among lineages and sites. Mol. Biol. Evol. 17, 1446– 1455 (2000).

Kresge, N., Vacquier, V. D. & Stout, C. D. Abalone lysin: the dissolving and evolving sperm protein. Bioessays 23, 95– 103 (2001).

Metz, E. C. & Palumbi, S. R. Positive selection and sequence rearrangements generate extensive polymorphism in the gamete recognition protein bindin. Mol. Biol. Evol. 13, 397– 406 (1996).

Vacquier, V. D., Swanson, W. J. & Lee, Y. H. Positive Darwinian selection on two homologous fertilization proteins: what is the selective pressure driving their divergence? J. Mol. Evol. 44, S15–S22 (1997).

Biermann, C. H. The molecular evolution of sperm bindin in six species of sea urchins (Echinoida: Strongylocentrotidae). Mol. Biol. Evol. 15, 1761–1771 (1998).

Palumbi, S. R. All males are not created equal: fertility differences depend on gamete recognition polymorphisms in sea urchins. Proc. Natl Acad. Sci. USA 96, 12632–12637 (1999).

Bush, R. M., Bender, C. A., Subbarao, K., Cox, N. J. & Fitch, W. M. Predicting the evolution of human influenza A. Science 286, 1921– 1925 (1999).

Fitch, W. M., Leiter, J. M. E., Li, X. Q. & Palese, P. Positive Darwinian evolution in human influenza A viruses. Proc. Natl Acad. Sci. USA 88, 4270–4274 (1991).

Ina, Y. & Gojobori, T. Statistical analysis of nucleotide sequences of the hemagglutinin gene of human influenza A viruses. Proc. Natl Acad. Sci. USA 91, 8388– 8392 (1994).

Fitch, W. M., Bush, R. M., Bender, C. A. & Cox, N. J. Long term trends in the evolution of H3) HA1 human influenza type A. Proc. Natl Acad. Sci. USA 94, 7712– 7718 (1997).

Suzuki, Y. & Gojobori, T. A method for detecting positive selection at single amino acid sites. Mol. Biol. Evol. 16, 1315–1328 (1999).

Yang, Z., Nielsen, R., Goldman, N. & Pedersen, A. M. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155, 431–449 ( 2000).

Yang, Z. Maximum likelihood estimation on large phylogenies and analysis of adaptive evolution in human influenza virus A. J. Mol. Evol. 51, 423–432 (2000).

Strunnikova, N., Ray, S. C., Livingston, R. A., Rubalcaba, E. & Viscidi, R. P. Convergent evolution within the V3 loop domain of human immunodeficiency virus type 1 in association with disease progression. J. Virol. 69, 7548– 7558 (1995).

Wichman, H. A., Badgett, M. R., Scott, L. A., Boulianne, C. M. & Bull, J. J. Different trajectories of parallel evolution during viral adaptation. Science 285, 422–424 (1999).

Wichman, H. A., Scott, L. A., Yarber, C. D. & Bull, J. J. Experimental evolution recapitulates natural evolution. Phil. Trans. R. Soc. Lond. B 355, 1677–1684 (2000).

Bush, R. M., Smith, C. B., Cox, N. J. & Fitch, W. M. Effects of passage history and sampling bias on phylogenetic reconstruction of human influenza A evolution. Proc. Natl Acad. Sci. USA 97, 6974–6980 (2000).

Bush, R. M., Fitch, W. M., Smith, C. B. & Cox, N. J. in Options for the Control of Influenza IV (ed. Osterhaus, A. D. M. E.) (Elsevier, Amsterdam, in the press).

Felsenstein, J. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39, 783–791 ( 1985).

Sullivan, N. et al. CD4-induced conformational changes in the human immunodeficiency virus type 1 gp120 glycoprotein: consequences for virus entry and neutralization . J. Virol. 72, 4694–4703 (1998).

Martínez, M. A., Verdaguer, N., Mateu, M. G. & Domingo, E. Evolution subverting essentiality: dispensability of the cell attachment Arg-Gly-Asp motif in multiply passaged foot-and-mouth disease virus. Proc. Natl Acad. Sci. USA 94, 6798–6802 (1997).

Martín, M. J., Núñez, J. I., Sobrino, F. & Dopazo, J. A procedure for detecting selection in highly variable viral genomes: evidence of positive selection in antigenic regions of capsid protein VP1 of foot-and-mouth disease virus. J. Virol. Methods 74, 215 –221 (1998).

Ishimizu, T. et al. Identification of regions in which positive selection may operate in S-RNase of Rosaceae: implication for S-allele-specific recognition sites in S-RNase. FEBS Lett. 440, 337– 342 (1998).

Nei, M. & Gojobori, T. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 3, 418–426 (1986).

Nielsen, R. & Yang, Z. Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 148, 929–936 ( 1998).

Nei, M. & Kumar, S. Molecular Evolution and Phylogenetics (Oxford Univ. Press, Oxford and New York, 2000).


Results

To assess the quality of our new approach and to compare it to other alignment-free methods, we used artificially generated as well as real-world protein sequences. For the test runs, we used the default parameters of our program, namely, 6 match positions and 40 don’t care positions, i.e., a total pattern length of 46 –, a threshold of T = 0 to discard background spaced-word matches, and sets of m = 5 patterns. We compared our program to the following four other alignment-free methods that can be run on protein sequences: Average Common Substring Approach (ACS) [ 21], FFP [ 36, 8], kmacs [ 22], and CVTree [ 11]. Here, we used version 3.19 of FFP, the other programs that we evaluated did not have version numbers at the time of writing. Since the original implementation of ACS is not publicly available, we used our own implementation of this approach by running kmacs with k = 0. The competing tools were used with their default parameters. In addition to evaluating these tools on protein sequences, we ran filtered spaced word matches on the complete genome sequences of the same taxa. All test runs were done on a 10 x Intel(R) Xeon(R) central processing using E7-4850 with 2.00 GHz with four cores each, equaling 40 cores and 1,000 GB of random access memory.

Distance estimation on simulated sequences

To evaluate the distances estimated by our program, we simulated sequences with the tool pyvolve [ 43]. Pyvolve simulates sequences along an evolutionary tree using continuous-time Markov models. It can use various substitution models such as JTT [ 44] and other models. Since there are no reliable stochastic models for insertions and deletions (indels) in protein sequences, the program produces indel-free sequences. We simulated pairs of sequences of length 100,000 with distances between 0 and 2 substitutions per position in steps of 0.05 using the JTT model. To evaluate the estimated distance values, we generated 1,000 sequence pairs for each distance value and plotted the average of the estimated distances against the real Kimura distance of the respective sequence pairs, calculated with the program protdist from the PHYLIP package [ 45]. To study the robustness of the estimated distances, we added error bars representing standard deviations to the plot. In addition to running Prot-SpaM with default parameters, i.e., with sets |$$| of m = 5 patterns , we did a second series of test runs with m = 1, i.e., with single patterns. Figure 2 shows the results of these test runs.

Distances calculated by Prot-SpaM and four other alignment-free methods calculated for pairs of simulated protein sequences plotted against their distances calculated with the Kimura model. Error bars denote standard deviations. Note that Prot-SpaM estimates phylogenetic distances in terms of substitutions that have happened since two sequences evolved from their last common ancestor. The programs kmacs,CVTree,FFP, and ACS, by contrast, do not estimate distances in a rigorous way but rather use ad hoc measures of sequence dissimilarity that are not linear functions of the real distances. Also, the absolute values of these distance measures are rather arbitrary for these four other programs. We therefore normalized the distances calculated by kmacs, CVTree, FFP, and ACS such that they have a value of one for sequence pairs with a Kimura distance of one.

Distances calculated by Prot-SpaM and four other alignment-free methods calculated for pairs of simulated protein sequences plotted against their distances calculated with the Kimura model. Error bars denote standard deviations. Note that Prot-SpaM estimates phylogenetic distances in terms of substitutions that have happened since two sequences evolved from their last common ancestor. The programs kmacs,CVTree,FFP, and ACS, by contrast, do not estimate distances in a rigorous way but rather use ad hoc measures of sequence dissimilarity that are not linear functions of the real distances. Also, the absolute values of these distance measures are rather arbitrary for these four other programs. We therefore normalized the distances calculated by kmacs, CVTree, FFP, and ACS such that they have a value of one for sequence pairs with a Kimura distance of one.

Phylogenetic tree reconstruction

Next, we applied the above alignment-free methods to calculate phylogenetic trees from real-world protein sequences. For four different groups of species, we downloaded all available protein sequences from GenBank [ 46], in addition we used two data sets from Wolbachia [55], see Table 1 for details. Within each group, we calculated all pairwise distances between the species. We used the distance matrices obtained in this way as input for neighbor-joining [ 37] and compared the resulting trees to reference trees that we assume reflect the respective correct phylogeny for each group. The Robinson-Foulds (RF) distances [ 47] between the reconstructed trees and the respective reference trees are shown in Table 2.

Datasets used in this study to evaluate alignment-free methods, with number of taxa, total size, and source of the reference tree

Taxa . # taxa . Total size (MB) . Source .
Escherichia coli/Shigella29 56.41 Zhou et al. [ 48]
Wolbachia I, 252 proteins 19 1.15 Gerth et al. [ 49]
Wolbachia I, whole proteomes 19 7.96 Gerth et al. [ 49]
Wolbachia II47 14.78 See Supplementary Material
Plants 11 245.05 Hatje and Kollmar [ 50]
Prokaryotes 813 784.86 Lang et al. [ 51]
Metazoa 36 585.0 Borowiec et al. [ 52]
Taxa . # taxa . Total size (MB) . Source .
Escherichia coli/Shigella29 56.41 Zhou et al. [ 48]
Wolbachia I, 252 proteins 19 1.15 Gerth et al. [ 49]
Wolbachia I, whole proteomes 19 7.96 Gerth et al. [ 49]
Wolbachia II47 14.78 See Supplementary Material
Plants 11 245.05 Hatje and Kollmar [ 50]
Prokaryotes 813 784.86 Lang et al. [ 51]
Metazoa 36 585.0 Borowiec et al. [ 52]

Datasets used in this study to evaluate alignment-free methods, with number of taxa, total size, and source of the reference tree

Taxa . # taxa . Total size (MB) . Source .
Escherichia coli/Shigella29 56.41 Zhou et al. [ 48]
Wolbachia I, 252 proteins 19 1.15 Gerth et al. [ 49]
Wolbachia I, whole proteomes 19 7.96 Gerth et al. [ 49]
Wolbachia II47 14.78 See Supplementary Material
Plants 11 245.05 Hatje and Kollmar [ 50]
Prokaryotes 813 784.86 Lang et al. [ 51]
Metazoa 36 585.0 Borowiec et al. [ 52]
Taxa . # taxa . Total size (MB) . Source .
Escherichia coli/Shigella29 56.41 Zhou et al. [ 48]
Wolbachia I, 252 proteins 19 1.15 Gerth et al. [ 49]
Wolbachia I, whole proteomes 19 7.96 Gerth et al. [ 49]
Wolbachia II47 14.78 See Supplementary Material
Plants 11 245.05 Hatje and Kollmar [ 50]
Prokaryotes 813 784.86 Lang et al. [ 51]
Metazoa 36 585.0 Borowiec et al. [ 52]

Robinson-Foulds (RF) distances and relativeRF distances between trees generated with alignment-free methods and the respective reference trees for various sets of taxa.

Taxa . Prot-SpaM . FSWM . CVTree . FFP, k = 4 . kmacs, k = 10 . ACS .
RF distances
Escherichia coli/Shigella4.00 6 24 40 42 38
Wolbachia I, 252 proteins 7.68 8 6 4 8 4
Wolbachia I, whole proteomes 6.00 6 8 16 8 12
Wolbachia II19.62 20 44 54 26 16
Plants 0.82 0 6 8 2 6
Prokaryotes 1,020 1,348 886 1,452 880 960
Metazoa 27.1 - 40 62 30 36
Relative RF distances
E. coli/Shigella0.08 0.12 0.46 0.77 0.81 0.73
Wolbachia I, 252 proteins 0.24 0.25 0.19 0.13 0.25 0.13
Wolbachia I, whole proteomes 0.19 0.19 0.25 0.50 0.25 0.38
Wolbachia II0.23 0.23 0.50 0.61 0.30 0.18
Plants 0.05 0.00 0.37 0.50 0.12 0.37
Prokaryotes 0.63 0.83 0.55 0.90 0.54 0.59
Metazoa 0.41 - 0.61 0.94 0.45 0.55
Taxa . Prot-SpaM . FSWM . CVTree . FFP, k = 4 . kmacs, k = 10 . ACS .
RF distances
Escherichia coli/Shigella4.00 6 24 40 42 38
Wolbachia I, 252 proteins 7.68 8 6 4 8 4
Wolbachia I, whole proteomes 6.00 6 8 16 8 12
Wolbachia II19.62 20 44 54 26 16
Plants 0.82 0 6 8 2 6
Prokaryotes 1,020 1,348 886 1,452 880 960
Metazoa 27.1 - 40 62 30 36
Relative RF distances
E. coli/Shigella0.08 0.12 0.46 0.77 0.81 0.73
Wolbachia I, 252 proteins 0.24 0.25 0.19 0.13 0.25 0.13
Wolbachia I, whole proteomes 0.19 0.19 0.25 0.50 0.25 0.38
Wolbachia II0.23 0.23 0.50 0.61 0.30 0.18
Plants 0.05 0.00 0.37 0.50 0.12 0.37
Prokaryotes 0.63 0.83 0.55 0.90 0.54 0.59
Metazoa 0.41 - 0.61 0.94 0.45 0.55

See the main text for details. Since Prot-SpaM uses a probabilistic algorithm, different program runs may produce slightly different results. Therefore, we performed 100 program runs on each dataset and report the average RF distances, except for the large prokaryote dataset where we did only a single program run. All programs were run on protein sequences or whole proteomes, respectively, except for FSWM, which was run on whole-genome sequences of the same species (or on the gene sequences coding for the 252 selected proteins from Wolbachia I). We were unable to run FSWM on the whole genomes of the 31 metazoan species since this dataset was too large. Since the original implementation of ACS is not publicly available, we ran our own implementation, kmacs, with k = 0 instead.

Robinson-Foulds (RF) distances and relativeRF distances between trees generated with alignment-free methods and the respective reference trees for various sets of taxa.

Taxa . Prot-SpaM . FSWM . CVTree . FFP, k = 4 . kmacs, k = 10 . ACS .
RF distances
Escherichia coli/Shigella4.00 6 24 40 42 38
Wolbachia I, 252 proteins 7.68 8 6 4 8 4
Wolbachia I, whole proteomes 6.00 6 8 16 8 12
Wolbachia II19.62 20 44 54 26 16
Plants 0.82 0 6 8 2 6
Prokaryotes 1,020 1,348 886 1,452 880 960
Metazoa 27.1 - 40 62 30 36
Relative RF distances
E. coli/Shigella0.08 0.12 0.46 0.77 0.81 0.73
Wolbachia I, 252 proteins 0.24 0.25 0.19 0.13 0.25 0.13
Wolbachia I, whole proteomes 0.19 0.19 0.25 0.50 0.25 0.38
Wolbachia II0.23 0.23 0.50 0.61 0.30 0.18
Plants 0.05 0.00 0.37 0.50 0.12 0.37
Prokaryotes 0.63 0.83 0.55 0.90 0.54 0.59
Metazoa 0.41 - 0.61 0.94 0.45 0.55
Taxa . Prot-SpaM . FSWM . CVTree . FFP, k = 4 . kmacs, k = 10 . ACS .
RF distances
Escherichia coli/Shigella4.00 6 24 40 42 38
Wolbachia I, 252 proteins 7.68 8 6 4 8 4
Wolbachia I, whole proteomes 6.00 6 8 16 8 12
Wolbachia II19.62 20 44 54 26 16
Plants 0.82 0 6 8 2 6
Prokaryotes 1,020 1,348 886 1,452 880 960
Metazoa 27.1 - 40 62 30 36
Relative RF distances
E. coli/Shigella0.08 0.12 0.46 0.77 0.81 0.73
Wolbachia I, 252 proteins 0.24 0.25 0.19 0.13 0.25 0.13
Wolbachia I, whole proteomes 0.19 0.19 0.25 0.50 0.25 0.38
Wolbachia II0.23 0.23 0.50 0.61 0.30 0.18
Plants 0.05 0.00 0.37 0.50 0.12 0.37
Prokaryotes 0.63 0.83 0.55 0.90 0.54 0.59
Metazoa 0.41 - 0.61 0.94 0.45 0.55

See the main text for details. Since Prot-SpaM uses a probabilistic algorithm, different program runs may produce slightly different results. Therefore, we performed 100 program runs on each dataset and report the average RF distances, except for the large prokaryote dataset where we did only a single program run. All programs were run on protein sequences or whole proteomes, respectively, except for FSWM, which was run on whole-genome sequences of the same species (or on the gene sequences coding for the 252 selected proteins from Wolbachia I). We were unable to run FSWM on the whole genomes of the 31 metazoan species since this dataset was too large. Since the original implementation of ACS is not publicly available, we ran our own implementation, kmacs, with k = 0 instead.

As mentioned above, Prot-SpaM uses a probabilistic algorithm to generate pattern sets, so the results of different program runs on a sequence set can differ slightly, even if the same parameter values are used. We therefore performed 100 program runs on each dataset. Table 2 lists the average RF distances for these 100 program runs. An exception was the large prokaryotic dataset where we only performed a single program run. Since absolute RF distances are not easy to interpret, Table 2 also reports the relative RF distances, which are obtained from the absolute RF distances by dividing by the maximum possible RF distance for a given dataset. The maximum possible RF distances for a set of n taxa is 2 · n − 6 [ 53]. Program run times for the different approaches are shown in Table 3. Trees were visualized with iTOL [ 54]. Neighbor-joining trees and RF distances were calculated with the phylip package [ 45].

Program run time in seconds for different alignment-free approaches on our benchmark datasets.

Taxa . Prot-SpaM . FSWM . CVTree . FFP, k = 4 . kmacs, k = 10 . ACS .
Escherichia coli/Shigella55 110 125 10 2,518 193
Wolbachia II19 68 46 9 5,302 135
Wolbachia I, 252 proteins 3 5 2 1 36 3
Wolbachia I, whole proteomes 11 22 21 2 178 26
Plants 464 1,107,720 365 17 17,693 850
Prokaryotes 5,502 244,139 5,492 1,929 915,635 123,520
Metazoa 1,719 - 1,973 43 151,612 9,512
Taxa . Prot-SpaM . FSWM . CVTree . FFP, k = 4 . kmacs, k = 10 . ACS .
Escherichia coli/Shigella55 110 125 10 2,518 193
Wolbachia II19 68 46 9 5,302 135
Wolbachia I, 252 proteins 3 5 2 1 36 3
Wolbachia I, whole proteomes 11 22 21 2 178 26
Plants 464 1,107,720 365 17 17,693 850
Prokaryotes 5,502 244,139 5,492 1,929 915,635 123,520
Metazoa 1,719 - 1,973 43 151,612 9,512

Prot-SpaM and FSWM were run on 40 threads. The other tools do not support multi-threading therefore, they were run single threaded.

Program run time in seconds for different alignment-free approaches on our benchmark datasets.

Taxa . Prot-SpaM . FSWM . CVTree . FFP, k = 4 . kmacs, k = 10 . ACS .
Escherichia coli/Shigella55 110 125 10 2,518 193
Wolbachia II19 68 46 9 5,302 135
Wolbachia I, 252 proteins 3 5 2 1 36 3
Wolbachia I, whole proteomes 11 22 21 2 178 26
Plants 464 1,107,720 365 17 17,693 850
Prokaryotes 5,502 244,139 5,492 1,929 915,635 123,520
Metazoa 1,719 - 1,973 43 151,612 9,512
Taxa . Prot-SpaM . FSWM . CVTree . FFP, k = 4 . kmacs, k = 10 . ACS .
Escherichia coli/Shigella55 110 125 10 2,518 193
Wolbachia II19 68 46 9 5,302 135
Wolbachia I, 252 proteins 3 5 2 1 36 3
Wolbachia I, whole proteomes 11 22 21 2 178 26
Plants 464 1,107,720 365 17 17,693 850
Prokaryotes 5,502 244,139 5,492 1,929 915,635 123,520
Metazoa 1,719 - 1,973 43 151,612 9,512

Prot-SpaM and FSWM were run on 40 threads. The other tools do not support multi-threading therefore, they were run single threaded.

Escherichia coli/Shigella

Our first dataset consists of 29 strains of Escherichia coli and Shigella. For each strain, we were able to download about 4,000-5,000 protein sequences the total size of this dataset is around 41 MB. Figure 5 shows the reference tree that we used and the tree obtained with the algorithm described here. The reference tree was published by Zhou et al. [ 48] and is based on a multiple sequence alignment of 2,034 core genes and a maximum likelihood method. As can be seen in Table 2, our approach produced a tree with a topology almost identical to that of the reference tree. All of the 100 program runs that we performed with Prot-SpaM produced the same tree topology the RF distance between these trees and the reference tree was 4. The other protein-based alignment-free methods led to phylogenies with RF distances to the reference tree of between 24 and 42, while the genome-based tree obtained with FSWM had an RF distance of 6 to the reference tree. These trees are shown in the Supplementary Material .

Wolbachia

As a second test case for benchmarking, we analyzed the phylogeny of Wolbachia strains, a group of Alphaproteobacteria that are intracellular endosymbionts of arthropods and nematodes [ 55]. Within Wolbachia, 16 distinct genetic lineages (“supergroups”) are currently distinguished (named by letters A-F and H-Q) that may differ in host specificity and type of symbiosis [ 56]. We re-analyzed a phylogenomic dataset by [ 57], thereby focusing on relationships of strains within supergroups (Wolbachia I). A tree generated with Prot-SpaM from this dataset is shown Fig. 4.

For a second Wolbachia benchmarking dataset, we analyzed relationships between supergroups based on available (draft) genomes see below (Wolbachia II). For within supergroup relationships (Wolbachia I), a program run of Prot-SpaM on the whole proteome recovered a tree that is largely congruent in topology and branch lengths in comparison to a phylogenomic supermatrix analysis of 252 single-copy orthologs that excluded genes that showed signs of recombination. A comparison based on RF distances showed that our new method outcompetes other available alignment-free programs (Table 2). Interestingly, when analyzing only the 252 ortholog dataset of [ 57] instead of whole proteomes, RF distances become bigger and other alignment-free methods perform better (Table 2).

Analyzing relationships between supergroups has been historically regarded as a difficult phylogenetic problem [ 58, 49]. Analyzing all annotated proteins from available genomes with Prot-SpaM supported the monophyly of all supergroups. Moreover, this analysis found the same Wolbachia strains basally branching as recent analyses suggested. Surprisingly, the phylogenomic supermatrix analysis of 252 single-copy orthologs that excluded genes that showed signs of recombination of this dataset recovered a topology that differs from the previous study in not supporting the sister group relationship of supergroups A and B. In contrast, as found in previous analyses, the sister group relationship of supergroups A and B is supported by the Prot-SpaM analysis. The Prot-SpaM analysis also recovered some relationships between supergroups that differ from the topologies of our phylogenomic analysis or expectations from a recently published phylogenomic study [ 59]. However, it is known that supergroups differ in their base (and amino acid) composition, and it is currently unknown how this may impact alignment-free methods. More sophisticated evolutionary models could alleviate these differences in future studies. Nevertheless, in this test case, Prot-SpaM also outperforms other alignment-free methods when comparing the resulting phylogenetic tree with a phylogenomic analyses based on a concatenated supermatrix (Table 2).

For the Wolbachia II dataset, we downloaded (if available) proteomes for all available Wolbachia draft and fully assembled genomes (47 in total see Supplementary Material for details). Proteins for Wolbachia strains that were lacking this information in the National Center for Biotechnology Information GenBank were derived from translations using GeneMark version 2.5 [ 60]. We predicted groups of orthologous genes between these proteomes using Orthofinder version 2.1.2 [ 61] running under default parameters. Single-copy genes present in all analyzed strains (83 in total) were aligned using MAFFT version 7.271 with the ‘L-INS-i’ algorithm [ 62] and tested for evidence of recombination using the pairwise homoplasy index [ 63] with window sizes of 10, 20, 30, and 50. Recombining loci were subsequently removed from the dataset and the remaining loci concatenated using FasConCat version 1.0 [ 64]. The resulting supermatrix (68 loci, 20,787 amino acid positions) was subject to partitioned maximum likelihood analysis following best model and partition scheme selection in IQ-TREE version 1.6.2 [ 65, 66, 67].

For the whole-proteome sequences of the dataset Wolbachia I, the RF distance to the reference tree was 6 for each of the 100 program runs. By contrast, for the 100 runs on the selected protein sequences of the same set of taxa, the average RF distance was 7.68 and the standard deviation was 0.736. For Wolbachia II, the average RF distance was 19.62 and the standard deviation was 0.89.

Large-scale microbial phylogeny reconstruction

In 2013, J. Eisen’s group published a paper on the phylogeny of the microbial genomes that were available at the time [ 51]. As a basis of their study, they selected 24 single-copy marker genes and a non-redundant subset of taxa. To obtain such a subset, they used a greedy algorithm by M. Steel [ 68], making sure that marker genes from different taxa in the resulting subset had a distance to each other of at least 2 substitutions per 100 positions. This way, they obtained a non-redundant subset of 841 bacterial and archeal genomes from the more than 3,000 microbial genomes that were publicly available. Multiple sequence alignments of the marker genes were calculated with hmmalign [ 69] and were concatenated to a supermatrix that was used as input for the phylogeny programs RAxML [ 70] and MrBayes [ 2]. In addition, the authors used the Bayesian tree-reconciliation program BUCKy [ 71] to the same set of marker genes. The trees they obtained with these different methods were found to be similar to trees obtained based on 16S RNA genes.

To evaluate Prot-SpaM, we used the 841 microbial genomes from Lang et al. [ 51] and downloaded all protein sequences from these taxa that were available through GenBank. For 28 of the 841 taxa, we were unable to obtain protein sequences, so we obtained a slightly reduced subset of 813 taxa compared to the taxa used by Lang et al. First, we applied Prot-SpaM to all available protein sequences from these 813 taxa. In addition, we ran Prot-SpaM on the protein sequences encoded by the 24 marker genes from Lang et al. and, finally, we applied our previous approach FSWM [ 33] to the 841 genome sequences. The trees that we obtained with our different alignment-free approaches are shown in Fig. 6, together with the maximum likelihood tree from [ 51], which we considered as a reliable reference. Clades from this reference tree are color coded in Fig. 6. As can be see from the color coding, the tree obtained with Prot-SpaM from the available protein sequences contains essentially the same clades as the reference tree. There are some differences within the clades that should be further investigated (J. Eisen, personal communication). The RF distance between the tree obtained with Prot-SpaM and the reference tree was 1,020.

Plants

Next, we used a set of plant taxa that has been studied by Hatje and Kollmar [ 50] and that we had used in previous studies to evaluate alignment-free approaches to genome sequence comparison [ 17, 22, 33]. The dataset that we used in these previous studies consisted of 14 Brassicales species. In GenBank, however, the proteomes could be downloaded for only 11 of the 14 species, so we had to limit our test runs to these 11 species. To obtain a reference tree, we used a tree that has been obtained with multiple sequence alignment and maximum likelihood, as published by Hatje and Kollmar [ 50 Figure 3 B]. From this tree, we removed the three species for which we could not obtain the proteome sequences in GenBank. Figure 7 shows the reference tree of the 14 original species, together with trees of the 11 species with available proteomes, calculated with the alignment-free methods that we evaluated here. For the 100 program runs with Prot-SpaM, the average RF distance between the resulting trees and the reference tree from [ 50] was 0.82 and the standard deviation was 0.9.

Metazoa

Finally, we used a set of 36 proteomes from 34 metazoan and 2 choanoflagellate taxa. These taxa have been previously used by Borowiec et al. [ 52] to study the position of the Ctenophora within the phylogenetic tree of the metazoan kingdom. The same set of taxa has also been used in a study by Zhou et al. [ 72] to evaluate maximum-likelihood programs for phylogeny reconstruction. As a reference tree, we used the tree published in [ 52]. The average RF distance of the Prot-SpaM trees to this reference tree was 27.1, with a standard deviation of 1.51.

Parameter values and number of selected spaced-word matches

Prot-SpaMhas four major parameters that can be adjusted by the user: the weight w (=number of match positions) of the binary patterns and spaced words, their length ℓ, the number m of different binary patterns used by the program, and the cutoff value T to separate homologous from background spaced-word matches. To see how these parameters influence the results of our software and to find suitable default values, we ran Prot-SpaM with varying values of these four parameters. Here, we modified one parameter at a time, using the respective default values of the remaining three parameters. The results are summarized in Tables 4 and 5. As can be seen in these tables, there are no values for w, ℓ, and T that work best for all datasets, but our default values seem to be a reasonable compromise. Using sets of m = 5 binary patterns does not improve the quality of the produced trees in terms of their RF distances to the reference trees compared to program runs with single patterns. Table 3 shows, however, that the distance values estimated by Prot-SpaM become statistically more stable if multiple patterns are used.

Distances calculated by Prot-SpaM for pairs of simulated protein sequences with a single binary pattern (m = 1, left) and with the default multiple-pattern option (m = 5, right). We performed 1,000 program runs for each value of m. The plot shows the average of the calculated distances standard deviations are shown as error bars.

Distances calculated by Prot-SpaM for pairs of simulated protein sequences with a single binary pattern (m = 1, left) and with the default multiple-pattern option (m = 5, right). We performed 1,000 program runs for each value of m. The plot shows the average of the calculated distances standard deviations are shown as error bars.

Reference tree for our dataset Wolbachia I (top) and tree calculated with Prot-SpaM using whole-proteome sequences of the same taxa (bottom) (see main text for details). Topological differences between the two trees are shown in red in the Prot-SpaM tree.

Reference tree for our dataset Wolbachia I (top) and tree calculated with Prot-SpaM using whole-proteome sequences of the same taxa (bottom) (see main text for details). Topological differences between the two trees are shown in red in the Prot-SpaM tree.

Reference tree (A) from [ 48] and tree calculated with Prot-SpaM with default parameters (B) for a set of 29 Escherichia coli and Shigella strains. Differences in the topologies between the two trees are marked in red.

Reference tree (A) from [ 48] and tree calculated with Prot-SpaM with default parameters (B) for a set of 29 Escherichia coli and Shigella strains. Differences in the topologies between the two trees are marked in red.

Program runtime and RF distances to reference trees for different parameter values with Prot-SpaM for the E. coli/Shigella and Wolbachia proteomes.

E. coli/Shigella .
Weight w 68 10
Runtime [s] 55.447.4 46.9
RF distance 46.02 6.76
Length ℓ 36 4656 66
Runtime [s] 47.5 55.460.3 66.9
RF distance 4 44.02 4.84
# patterns m1 3 57
Runtime [s] 13.5 34.4 55.475.94
RF distance 4.12 4.02 44
Threshold T-50 -25 025 50 75 100 125
Runtime [s] 55.2 55.4 55.455.1 55.3 55.3 55.1 55.2
RF distance 11.92 12 412 12 12 12 12
Wolbachia II
Weight w4 68 10
Runtime [s] 112.4 19.418 17.8
RF distance 20.38 19.6822 22
Length ℓ 36 4656 66
Runtime [s] 17 19.421.5 23.9
RF distance 19.78 19.6819.7 19.78
# patterns m1 3 57
Runtime [s] 5.4 12.5 19.426.5
distance 19.06 19.12 19.6819.82
Threshold T-50 -25 025 50 75 100 125
Runtime [s] 19.6 19.6 19.419.4 19.4 19.4 19.4 19.4
RF distance 22.18 18 19.6819.86 20.16 20.7 21.04 22
E. coli/Shigella .
Weight w 68 10
Runtime [s] 55.447.4 46.9
RF distance 46.02 6.76
Length ℓ 36 4656 66
Runtime [s] 47.5 55.460.3 66.9
RF distance 4 44.02 4.84
# patterns m1 3 57
Runtime [s] 13.5 34.4 55.475.94
RF distance 4.12 4.02 44
Threshold T-50 -25 025 50 75 100 125
Runtime [s] 55.2 55.4 55.455.1 55.3 55.3 55.1 55.2
RF distance 11.92 12 412 12 12 12 12
Wolbachia II
Weight w4 68 10
Runtime [s] 112.4 19.418 17.8
RF distance 20.38 19.6822 22
Length ℓ 36 4656 66
Runtime [s] 17 19.421.5 23.9
RF distance 19.78 19.6819.7 19.78
# patterns m1 3 57
Runtime [s] 5.4 12.5 19.426.5
distance 19.06 19.12 19.6819.82
Threshold T-50 -25 025 50 75 100 125
Runtime [s] 19.6 19.6 19.419.4 19.4 19.4 19.4 19.4
RF distance 22.18 18 19.6819.86 20.16 20.7 21.04 22

We ran our program with different values for the weight w and length ℓ of the spaced words for different numbers of patterns and for different values of the threshold T. Here, we modified the value of one of these parameters at a time and used the default values for the remaining three parameters. Default values of the modified parameters and the resulting runtimes and RF distances are shown in bold font. Since Prot-SpaM uses a probabilistic algorithm to generate pattern sets, we performed 100 program runs for each set of parameters the table shows the average RF distances of these 100 runs.

Program runtime and RF distances to reference trees for different parameter values with Prot-SpaM for the E. coli/Shigella and Wolbachia proteomes.

E. coli/Shigella .
Weight w 68 10
Runtime [s] 55.447.4 46.9
RF distance 46.02 6.76
Length ℓ 36 4656 66
Runtime [s] 47.5 55.460.3 66.9
RF distance 4 44.02 4.84
# patterns m1 3 57
Runtime [s] 13.5 34.4 55.475.94
RF distance 4.12 4.02 44
Threshold T-50 -25 025 50 75 100 125
Runtime [s] 55.2 55.4 55.455.1 55.3 55.3 55.1 55.2
RF distance 11.92 12 412 12 12 12 12
Wolbachia II
Weight w4 68 10
Runtime [s] 112.4 19.418 17.8
RF distance 20.38 19.6822 22
Length ℓ 36 4656 66
Runtime [s] 17 19.421.5 23.9
RF distance 19.78 19.6819.7 19.78
# patterns m1 3 57
Runtime [s] 5.4 12.5 19.426.5
distance 19.06 19.12 19.6819.82
Threshold T-50 -25 025 50 75 100 125
Runtime [s] 19.6 19.6 19.419.4 19.4 19.4 19.4 19.4
RF distance 22.18 18 19.6819.86 20.16 20.7 21.04 22
E. coli/Shigella .
Weight w 68 10
Runtime [s] 55.447.4 46.9
RF distance 46.02 6.76
Length ℓ 36 4656 66
Runtime [s] 47.5 55.460.3 66.9
RF distance 4 44.02 4.84
# patterns m1 3 57
Runtime [s] 13.5 34.4 55.475.94
RF distance 4.12 4.02 44
Threshold T-50 -25 025 50 75 100 125
Runtime [s] 55.2 55.4 55.455.1 55.3 55.3 55.1 55.2
RF distance 11.92 12 412 12 12 12 12
Wolbachia II
Weight w4 68 10
Runtime [s] 112.4 19.418 17.8
RF distance 20.38 19.6822 22
Length ℓ 36 4656 66
Runtime [s] 17 19.421.5 23.9
RF distance 19.78 19.6819.7 19.78
# patterns m1 3 57
Runtime [s] 5.4 12.5 19.426.5
distance 19.06 19.12 19.6819.82
Threshold T-50 -25 025 50 75 100 125
Runtime [s] 19.6 19.6 19.419.4 19.4 19.4 19.4 19.4
RF distance 22.18 18 19.6819.86 20.16 20.7 21.04 22

We ran our program with different values for the weight w and length ℓ of the spaced words for different numbers of patterns and for different values of the threshold T. Here, we modified the value of one of these parameters at a time and used the default values for the remaining three parameters. Default values of the modified parameters and the resulting runtimes and RF distances are shown in bold font. Since Prot-SpaM uses a probabilistic algorithm to generate pattern sets, we performed 100 program runs for each set of parameters the table shows the average RF distances of these 100 runs.

Program runtime and RF distances to reference trees for different parameter values with Prot-SpaM for the plant and metazoan proteomes

Plants .
Weight w4 68 10
Runtime [s] 57,578 464320 325
RF distance 2 04 6
Length ℓ 36 4656 66
Runtime [s] 383 464441 494
RF distance 2 00 0
# patterns m1 3 57
Runtime [s] 91 255 464572
RF distance 0 0 02
Threshold T-50 -25 025 50 75 100 125
Runtime [s] 383 402 464409 391 459 439 430
RF distance 2 2 00 0 0 2 4
Metazoa
Weight w 68 10
Runtime [s] 1,7191,584 1,518
RF distance 3026 30
Length ℓ 36 4656 66
Runtime [s] 1,351 1,7191,584 2,089
RF distance 26 3024 26
# patterns 1 3 57
Runtime [s] 427 890 1,7192,078
RF distance 24 28 3026
Threshold T-50 -25 025 50 75 100 125
Runtime [s] 2,539 2,337 1,7192,269 2,150 1,906 1,783 1,797
RF distance 30 24 3026 28 28 30 34
Plants .
Weight w4 68 10
Runtime [s] 57,578 464320 325
RF distance 2 04 6
Length ℓ 36 4656 66
Runtime [s] 383 464441 494
RF distance 2 00 0
# patterns m1 3 57
Runtime [s] 91 255 464572
RF distance 0 0 02
Threshold T-50 -25 025 50 75 100 125
Runtime [s] 383 402 464409 391 459 439 430
RF distance 2 2 00 0 0 2 4
Metazoa
Weight w 68 10
Runtime [s] 1,7191,584 1,518
RF distance 3026 30
Length ℓ 36 4656 66
Runtime [s] 1,351 1,7191,584 2,089
RF distance 26 3024 26
# patterns 1 3 57
Runtime [s] 427 890 1,7192,078
RF distance 24 28 3026
Threshold T-50 -25 025 50 75 100 125
Runtime [s] 2,539 2,337 1,7192,269 2,150 1,906 1,783 1,797
RF distance 30 24 3026 28 28 30 34

Parameter values as in Table 4. Because of the size of these datasets, we performed only one program run per parameter set.

We ran our program with different values for the weight w and length ℓ of the spaced words for different numbers of patterns and for different values of the threshold T. Here, we modified the value of one of these parameters at a time and used the default values for the remaining three parameters. Default values of the modified parameters and the resulting runtimes and RF distances are shown in bold font.

Program runtime and RF distances to reference trees for different parameter values with Prot-SpaM for the plant and metazoan proteomes

Plants .
Weight w4 68 10
Runtime [s] 57,578 464320 325
RF distance 2 04 6
Length ℓ 36 4656 66
Runtime [s] 383 464441 494
RF distance 2 00 0
# patterns m1 3 57
Runtime [s] 91 255 464572
RF distance 0 0 02
Threshold T-50 -25 025 50 75 100 125
Runtime [s] 383 402 464409 391 459 439 430
RF distance 2 2 00 0 0 2 4
Metazoa
Weight w 68 10
Runtime [s] 1,7191,584 1,518
RF distance 3026 30
Length ℓ 36 4656 66
Runtime [s] 1,351 1,7191,584 2,089
RF distance 26 3024 26
# patterns 1 3 57
Runtime [s] 427 890 1,7192,078
RF distance 24 28 3026
Threshold T-50 -25 025 50 75 100 125
Runtime [s] 2,539 2,337 1,7192,269 2,150 1,906 1,783 1,797
RF distance 30 24 3026 28 28 30 34
Plants .
Weight w4 68 10
Runtime [s] 57,578 464320 325
RF distance 2 04 6
Length ℓ 36 4656 66
Runtime [s] 383 464441 494
RF distance 2 00 0
# patterns m1 3 57
Runtime [s] 91 255 464572
RF distance 0 0 02
Threshold T-50 -25 025 50 75 100 125
Runtime [s] 383 402 464409 391 459 439 430
RF distance 2 2 00 0 0 2 4
Metazoa
Weight w 68 10
Runtime [s] 1,7191,584 1,518
RF distance 3026 30
Length ℓ 36 4656 66
Runtime [s] 1,351 1,7191,584 2,089
RF distance 26 3024 26
# patterns 1 3 57
Runtime [s] 427 890 1,7192,078
RF distance 24 28 3026
Threshold T-50 -25 025 50 75 100 125
Runtime [s] 2,539 2,337 1,7192,269 2,150 1,906 1,783 1,797
RF distance 30 24 3026 28 28 30 34

Parameter values as in Table 4. Because of the size of these datasets, we performed only one program run per parameter set.

We ran our program with different values for the weight w and length ℓ of the spaced words for different numbers of patterns and for different values of the threshold T. Here, we modified the value of one of these parameters at a time and used the default values for the remaining three parameters. Default values of the modified parameters and the resulting runtimes and RF distances are shown in bold font.

The number of spaced-word matches in a pairwise sequence comparison depends on how similar the two sequences are to each other see [ 18] for details. Consequently, the number of spaced-word matches that are selected by our program to estimate phylogenetic distances also depends on the degree of similarity between the compared sequences. We found two extreme cases with our test data, one in the E. coli/Shigella dataset where most taxa are closely related to each other and another one in the Metazoan dataset that contains taxa with very large evolutionary distances. In the pairwise comparison of E. coliO157:H7 strain EDL933 with E. coliO157:H7 Sakai (EHEC), Prot-SpaM selected more than 6,000,000 spaced-word matches. These two proteomes have less than 1,600,000 amino acids each, so in this case >3.75 spaced-word matches per sequence position were selected. By contrast, less than 13,000 spaced-word matches were selected in the comparison of Brugia malayi and Homo sapiens. The latter proteome has a length of more than 75,000,000 amino acids, so here less than 0.00017 spaced-word matches per sequence position were selected.


Tong, S. Y. et al. Genome sequencing defines phylogeny and spread of methicillin-resistant Staphylococcus aureus in a high transmission setting. Genome Res 25, 111–118, 10.1101/gr.174730.114 (2015).

Dunn, C. W. et al. Broad phylogenomic sampling improves resolution of the animal tree of life. Nature 452, 745–749, 10.1038/nature06614 (2008).

Skippington, E. & Ragan, M. A. Within-species lateral genetic transfer and the evolution of transcriptional regulation in Escherichia coli and Shigella. BMC Genomics 12, 532, 10.1186/1471-2164-12-532 (2011).

Jarvis, E. D. et al. Whole-genome analyses resolve early branches in the tree of life of modern birds. Science 346, 1320–1331, 10.1126/science.1253451 (2014).

Darling, A. E., Miklós, I. & Ragan, M. A. Dynamics of genome rearrangement in bacterial populations. PLoS Genet 4, e1000128, 10.1371/journal.pgen.1000128 (2008).

Chan, C. X. & Ragan, M. A. Next-generation phylogenomics. Biol Direct 8, 3, 10.1186/1745-6150-8-3 (2013).

Puigbò, P., Wolf, Y. I. & Koonin, E. V. The tree and net components of prokaryote evolution. Genome Biol Evol 2, 745–756, 10.1093/gbe/evq062 (2010).

Beiko, R. G., Harlow, T. J. & Ragan, M. A. Highways of gene sharing in prokaryotes. Proc Natl Acad Sci USA 102, 14332–14337, 10.1073/pnas.0504068102 (2005).

Stiller, J. W. Experimental design and statistical rigor in phylogenomics of horizontal and endosymbiotic gene transfer. BMC Evol Biol 11, 259, 10.1186/1471-2148-11-259 (2011).

Wong, K. M., Suchard, M. A. & Huelsenbeck, J. P. Alignment uncertainty and genomic analysis. Science 319, 473–476, 10.1126/science.1151532 (2008).

Shih, P. M. et al. Improving the coverage of the cyanobacterial phylum using diversity-driven genome sequencing. Proc Natl Acad Sci USA 110, 1053–1058, 10.1073/pnas.1217107110 (2013).

Wu, D. et al. A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature 462, 1056–1060, 10.1038/nature08656 (2009).

Eyre, D. W. et al. A pilot study of rapid benchtop sequencing of Staphylococcus aureus and Clostridium difficile for outbreak detection and surveillance. BMJ Open 2, e001124, 10.1136/bmjopen-2012-001124 (2012).

Bonham-Carter, O., Steele, J. & Bastola, D. Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis. Brief Bioinform 15, 890–905, 10.1093/bib/bbt052 (2014).

Haubold, B. Alignment-free phylogenetics and population genetics. Brief Bioinform 15, 407–418, 10.1093/bib/bbt083 (2014).

Vinga, S. & Almeida, J. Alignment-free sequence comparison-a review. Bioinformatics 19, 513–523 (2003).

Chan, C. X., Bernard, G., Poirion, O., Hogan, J. M. & Ragan, M. A. Inferring phylogenies of evolving sequences without multiple sequence alignment. Sci Rep 4, 6504, 10.1038/srep06504 (2014).

Felsenstein, J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17, 368–376 (1981).

Huelsenbeck, J. P., Ronquist, F., Nielsen, R. & Bollback, J. P. Bayesian inference of phylogeny and its impact on evolutionary biology. Science 294, 2310–2314, 10.1126/science.1065889 (2001).

Huelsenbeck, J. P. & Ronquist, F. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17, 754–755 (2001).

Fan, H., Ives, A. R., Surget-Groba, Y. & Cannon, C. H. An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data. BMC Genomics 16, 522, 10.1186/s12864-015-1647-5 (2015).

Ren, J. et al. Inference of Markovian properties of molecular sequences from NGS data and applications to comparative genomics. Bioinformatics 32, 993–1000, 10.1093/bioinformatics/btv395 (2016).

Song, K. et al. Alignment-free sequence comparison based on next-generation sequencing reads. J Comput Biol 20, 64–79, 10.1089/cmb.2012.0228 (2013).

Miller, R. G. Jackknife - Review. Biometrika 61, 1–15 (1974).

Wan, L., Reinert, G., Sun, F. & Waterman, M. S. Alignment-free sequence comparison (II): theoretical power of comparison statistics. J Comput Biol 17, 1467–1490, 10.1089/cmb.2010.0056 (2010).

Yi, H. & Jin, L. Co-phylog: an assembly-free phylogenomic approach for closely related organisms. Nucleic Acids Res 41, e75, 10.1093/nar/gkt003 (2013).

Wang, H., Xu, Z., Gao, L. & Hao, B. A fungal phylogeny based on 82 complete genomes using the composition vector method. BMC Evol Biol 9, 195, 10.1186/1471-2148-9-195 (2009).

Jun, S. R., Sims, G. E., Wu, G. A. & Kim, S. H. Whole-proteome phylogeny of prokaryotes by feature frequency profiles: An alignment-free method with optimal feature resolution. Proc Natl Acad Sci USA 107, 133–138, 10.1073/pnas.0913033107 (2010).

Sims, G. E., Jun, S. R., Wu, G. A. & Kim, S. H. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proc Natl Acad Sci USA 106, 2677–2682, 10.1073/pnas.0813249106 (2009).

Sims, G. E. & Kim, S. H. Whole-genome phylogeny of Escherichia coli/Shigella group by feature frequency profiles (FFPs). Proc Natl Acad Sci USA 108, 8329–8334, 10.1073/pnas.1105168108 (2011).

Horwege, S. et al. Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches. Nucleic Acids Res 42, W7–11, 10.1093/nar/gku398 (2014).

Leimeister, C. A., Boden, M., Horwege, S., Lindner, S. & Morgenstern, B. Fast alignment-free sequence comparison using spaced-word frequencies. Bioinformatics 30, 1991–1999, 10.1093/bioinformatics/btu177 (2014).

Ulitsky, I., Burstein, D., Tuller, T. & Chor, B. The average common substring approach to phylogenomic reconstruction. J Comput Biol 13, 336–350, 10.1089/cmb.2006.13.336 (2006).

Russell, D. J., Way, S. F., Benson, A. K. & Sayood, K. A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences. BMC Bioinformatics 11, 601, 10.1186/1471-2105-11-601 (2010).

Leimeister, C. A. & Morgenstern, B. Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics 30, 2000–2008, 10.1093/bioinformatics/btu331 (2014).

Haubold, B., Pfaffelhuber, P., Domazet-Lošo, M. & Wiehe, T. Estimating mutation distances from unaligned genomes. J Comput Biol 16, 1487–1500, 10.1089/cmb.2009.0106 (2009).

Robinson, D. F. & Foulds, L. R. Comparison of phylogenetic trees. Math Biosci 53, 131–147, 10.1016/0025-5564(81)90043-2 (1981).

Beiko, R. G. & Charlebois, R. L. A simulation test bed for hypotheses of genome evolution. Bioinformatics 23, 825–831, 10.1093/bioinformatics/btm024 (2007).

Bryant, D. & Steel, M. Computing the distribution of a tree metric. IEEE/ACM Trans Comput Biol Bioinform 6, 420–426, 10.1109/TCBB.2009.32 (2009).

Ragan, M. A. Phylogenetic inference based on matrix representation of trees. Mol Phylogenet Evol 1, 53–58, 10.1016/1055-7903(92)90035-F (1992).

Ragan, M. A., Bernard, G. & Chan, C. X. Molecular phylogenetics before sequences: oligonucleotide catalogs as k-mer spectra. RNA Biol 11, 176–185, 10.4161/rna.27505 (2014).

Gordon, D. M., Clermont, O., Tolley, H. & Denamur, E. Assigning Escherichia coli strains to phylogenetic groups: multi-locus sequence typing versus the PCR triplex method. Environ Microbiol 10, 2484–2496, 10.1111/j.1462-2920.2008.01669.x (2008).

Ochman, H. & Selander, R. K. Standard reference strains of Escherichia coli from natural populations. J Bacteriol 157, 690–693, 0021-9193/84/020690-04$02.00/0 (1984).

Lecointre, G., Rachdi, L., Darlu, P. & Denamur, E. Escherichia coli molecular phylogeny using the incongruence length difference test. Mol Biol Evol 15, 1685–1695 (1998).

Farris, J. S., Albert, V. A., Källersjö, M., Lipscomb, D. & Kluge, A. G. Parsimony jackknifing outperforms neighbor-joining. Cladistics 12, 99–124 (1996).

Shi, J., Zhang, Y., Luo, H. & Tang, J. Using jackknife to assess the quality of gene order phylogenies. BMC Bioinformatics 11, 168, 10.1186/1471-2105-11-168 (2010).

Woese, C. R. & Fox, G. E. Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc Natl Acad Sci USA 74, 5088–5090, 10.1073/pnas.74.11.5088 (1977).

Fuerst, J. A. & Sagulenko, E. Beyond the bacterium: planctomycetes challenge our concepts of microbial structure and function. Nat Rev Microbiol 9, 403–413, 10.1038/nrmicro2578 (2011).

Gunasinghe, U., Alahakoon, D. & Bedingfield, S. Extraction of high quality k-words for alignment-free sequence comparison. J Theor Biol 358, 31–51, 10.1016/j.jtbi.2014.05.016 (2014).

Dalquen, D. A., Anisimova, M., Gonnet, G. H. & Dessimoz, C. ALF–a simulation framework for genome evolution. Mol Biol Evol 29, 1115–1123, 10.1093/molbev/msr268 (2012).

Tavaré, S. Some probabilistic and statistical problems in the analysis of DNA sequences. Lectures on Mathematics in the Life Sciences 17, 57–86 (1986).

Beiko, R. G., Doolittle, W. F. & Charlebois, R. L. The impact of reticulate evolution on genome phylogeny. Syst Biol 57, 844–856, 10.1080/10635150802559265 (2008).

Kupczok, A., Schmidt, H. A. & von Haeseler, A. Accuracy of phylogeny reconstruction methods combining overlapping gene data sets. Algorithms Mol Biol 5, 37, 10.1186/1748-7188-5-37 (2010).

Bryant, D. & Steel, M. Computing the distribution of a tree metric. IEEE/ACM Trans. Comput. Biol. Bioinform. 6, 420–426, 10.1109/TCBB.2009.32 (2009).