Consensus symbols in multiple sequence alignment

I was using MultAlin for multiple alignment of a set of sequences. The output I came across included the following documentation (English corrected):

Consensus symbols:
! is any of IV
$ is any of LM
% is any of FY
# is any of NDQEBZ

I understand what it says but I don't know how to interpret it further. What impact does this have on the sequence alignment?

The use of these symbols in the consensus line of the output is purely presentational, to help you inspect the alignment and identify regions of partial conservation.
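To make the presentational role concrete, here is a minimal Python sketch that derives such a symbol line from an alignment that already exists. The residue groups are copied from the legend above. The decision rule (every non-gap residue in the column must fall into one group) and the toy sequences are assumptions for illustration, not MultAlin's actual thresholds or code.

```python
# Residue groups copied from the MultAlin legend quoted above.
GROUP_SYMBOLS = {
    "!": set("IV"),
    "$": set("LM"),
    "%": set("FY"),
    "#": set("NDQEBZ"),
}


def column_symbol(column):
    """Return a display symbol for one alignment column.

    Assumed rule (not MultAlin's): gaps are ignored, a fully identical
    column shows the residue, a column whose residues all belong to one
    group shows that group's symbol, anything else shows '.'.
    """
    residues = {r.upper() for r in column if r != "-"}
    if not residues:
        return "."
    if len(residues) == 1:
        return next(iter(residues))
    for symbol, group in GROUP_SYMBOLS.items():
        if residues <= group:
            return symbol
    return "."


def consensus_line(aligned):
    """Build the symbol line for a list of equal-length aligned sequences."""
    return "".join(column_symbol(col) for col in zip(*aligned))


# Hypothetical aligned sequences, made up for this example.
aligned = ["ILFNK", "VMYDK", "ILFEA"]
print(consensus_line(aligned))
```

The function only reads the finished alignment; nothing it computes feeds back into how the sequences were aligned.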


This is quite independent of the logic (algorithm) of the program, whose parameters are reported in the documentation comments of the output:

Symbol comparison table: blosum62
Gap weight: 12
Gap length weight: 2

This also gives a reference to the original paper you should read to find more information about the program.

You would also be advised to read the Wikipedia entry on Multiple Sequence Alignment and on Sequence Alignment in general. I would mention that this is an old method (as the appearance of the site also indicates). If it gives useful results, fine. Otherwise you might try the more popular and more frequently updated Clustal, which is available online or as a standalone program.

Consensus sequence

In molecular biology and bioinformatics, a consensus sequence is a way of representing the results of a multiple sequence alignment, where related sequences are compared to each other, and similar functional sequence motifs are found. The consensus sequence shows which residues are conserved (are always the same), and which residues are variable.

Developing software for pattern recognition is a major topic in genetics, molecular biology, and bioinformatics. Specific sequence motifs can function as regulatory sequences controlling biosynthesis, or as signal sequences that direct a molecule to a specific site within the cell or regulate its maturation. Since the regulatory function of these sequences is important, they are thought to be conserved across long periods of evolution. In some cases, evolutionary relatedness can be estimated by the amount of conservation of these sites.

The conserved sequence motifs are called consensus sequences and they show which residues are conserved and which residues are variable. Consider the following example DNA sequence:

A[CT]N{A}YR

In this notation, A means that an A is always found in that position, [CT] stands for either C or T, N stands for any base, and {A} means any base except A. Y represents any pyrimidine, and R indicates any purine.
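The notation maps naturally onto a regular expression. The following Python sketch translates a pattern of this kind into a regex and tests candidate sites against it. The pattern string A[CT]N{A}YR and the test sites are illustrative assumptions, and only the symbols described above (N, Y, R, square brackets and curly braces) are handled.

```python
import re

# Only the ambiguity codes mentioned in the text are supported here.
AMBIGUITY = {"N": "[ACGT]", "Y": "[CT]", "R": "[AG]"}


def pattern_to_regex(pattern):
    """Translate a consensus pattern such as 'A[CT]N{A}YR' into a regex string."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "[":                      # [CT]: any one of the listed bases
            j = pattern.index("]", i)
            parts.append("[" + pattern[i + 1:j] + "]")
            i = j + 1
        elif ch == "{":                    # {A}: any base except the listed ones
            j = pattern.index("}", i)
            excluded = pattern[i + 1:j]
            allowed = "".join(b for b in "ACGT" if b not in excluded)
            parts.append("[" + allowed + "]")
            i = j + 1
        else:                              # plain base or ambiguity code
            parts.append(AMBIGUITY.get(ch, ch))
            i += 1
    return "".join(parts)


regex = pattern_to_regex("A[CT]N{A}YR")
print(regex)
print(bool(re.fullmatch(regex, "ACGTCA")))   # fourth base is T, allowed
print(bool(re.fullmatch(regex, "ACGACA")))   # fourth base is A, excluded
```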

In this example, the notation [CT] does not give any indication of the relative frequency of C or T occurring at that position. An alternative method of representing a consensus sequence uses a sequence logo. This is a graphical representation of the consensus sequence, in which the size of a symbol is related to the frequency with which a given nucleotide (or amino acid) occurs at a certain position. In sequence logos, the more conserved a residue is, the larger its symbol is drawn; the less frequent it is, the smaller the symbol. Sequence logos can be generated using the Gestalt Workbench, a publicly available visualization tool written by Gustavo Glusman at the Institute for Systems Biology.

A consensus sequence may be a short sequence of nucleotides which is found several times in the genome and is thought to play the same role in its different locations. For example, many transcription factors recognise particular consensus sequences in the promoters of the genes they regulate. In the same way restriction enzymes usually have palindromic consensus sequences, usually corresponding to the site where they cut the DNA. Transposons act in much the same manner in their identification of target sequences for transposition. Finally splice sites (sequences immediately surrounding the exon-intron boundaries) can also be considered as consensus sequences.

Thus a consensus sequence defines a putative DNA recognition site: it is obtained by aligning all known examples of a certain recognition site and is defined as the idealized sequence that represents the predominant base at each position. No actual example should differ from the consensus by more than a few substitutions.
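The definition above (predominant base at each position, with real sites differing by only a few substitutions) can be sketched in a few lines of Python. The four aligned sites are made up for illustration and do not come from any database.

```python
from collections import Counter


def majority_consensus(sites):
    """Return the most frequent base at each position of equal-length aligned sites."""
    consensus = []
    for column in zip(*sites):
        # most_common(1) returns the predominant base; ties go to the base seen first.
        consensus.append(Counter(column).most_common(1)[0][0])
    return "".join(consensus)


def mismatches(site, consensus):
    """Count the substitutions separating one site from the consensus."""
    return sum(a != b for a, b in zip(site, consensus))


# Hypothetical aligned recognition sites (illustrative only).
sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
consensus = majority_consensus(sites)
print(consensus)
print([mismatches(s, consensus) for s in sites])
```

Note that a consensus built this way can be a sequence that never occurs exactly among the real sites; it is an idealization.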

Any mutation that makes a nucleotide in the core promoter sequence look more like the consensus sequence is known as an up mutation. This kind of mutation will generally make the promoter stronger: RNA polymerase binds more tightly to the DNA to be transcribed, and transcription is up-regulated. Conversely, mutations that destroy conserved nucleotides in the consensus sequence are known as down mutations. These mutations down-regulate transcription, since RNA polymerase can no longer bind as tightly to the core promoter sequence.

Multiple Sequence Analysis

One of the most popular programs for performing multiple sequence alignments is clustalw. EMBOSS has an interface to clustal called emma. clustal (and thus emma) creates a multiple sequence alignment from a group of related sequences using progressive pairwise alignments. It can also produce a dendrogram showing the clustering relationships used to create the alignment. The dendrogram shows the order of the pairwise alignments of sequences and clusters of sequences that together generate the final alignment, but it is not an evolutionary tree, although the length of the branches is related to the relative distance of the sequences.

clustal produces global alignments. Each pairwise step is a global alignment, but the progressive strategy as a whole is a heuristic and does not guarantee the best possible multiple alignment. The alignment procedure begins with the pairwise alignment of the two most similar sequences, producing a cluster of two aligned sequences. This cluster can then be aligned to the next most related sequence or cluster of aligned sequences. Two clusters of sequences can be aligned by a simple extension of the pairwise alignment of two individual sequences. The final alignment is achieved by a series of progressive, pairwise alignments that include increasingly dissimilar sequences and clusters, until all sequences have been included in the final pairwise alignment. When gaps are inserted into a sequence to produce an alignment, they are inserted at the same position in all the sequences of the cluster. Each pairwise alignment uses the method of Needleman and Wunsch extended for use with clusters of aligned sequences.
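The pairwise building block named above, the Needleman-Wunsch method, can be sketched as follows. This is a minimal version for two single sequences with a linear gap penalty and toy match/mismatch scores; all parameter values are assumptions, and it does not reproduce clustal's extension to clusters, its substitution matrices or its gap model.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of sequences a and b.

    Toy scoring (assumed): +1 match, -1 mismatch, -2 per gap position.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diagonal, up, left)
    return score[-1][-1]


print(needleman_wunsch_score("GATTACA", "GATTACA"))
print(needleman_wunsch_score("ACGT", "AGT"))
```

Because the recurrence always runs from the first to the last residue of both sequences, the result is a global alignment score, which is why the aligned sequences end up padded to a common length.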

pscan has told us that our sequence belongs to the rhodopsin family. This is a very large family of sequences - for example, you can see the Pfam entry for rhodopsin by doing a keyword search at

We will now retrieve some further members of the family from SwissProt and produce a multiple alignment. We'll then use this multiple alignment to produce a profile of this group of sequences and use that to align them all to our original sequence.

First, let's retrieve the sequences using seqret :

unix % seqret
Reads and writes (returns) a set of sequences all at once
Input sequence: sw:ops2_*
Output sequence [ops2_drome.fasta]: ops2.fasta

Note our use of the wild card character * to retrieve all swissprot sequences whose identifiers begin ops2_ .

unix % emma
Multiple alignment program - interface to ClustalW program
Input sequence: ops2.fasta
Output sequence [ops2_drome.aln]: ops2.aln
Output file [ops2_drome.dnd]: ops2.dnd
..clustalw -infile=21665A -outfile=21665B -align
-type=protein -output=gcg -pwmatrix=blosum -pwgapopen=10.000
-pwgapext=0.100 -newtree=21665C -matrix=blosum -gapopen=10.000
-gapext=5.000 -gapdist=8 -hgapresidues=GPSNDQEKR -maxdiv=30..

CLUSTAL W (1.74) Multiple Sequence Alignments

Sequence type explicitly set to Protein
Sequence format is Pearson
Sequence 1: OPS2_DROME 381 aa
Sequence 2: OPS2_DROPS 381 aa
Sequence 3: OPS2_HEMSA 377 aa
Sequence 4: OPS2_LIMPO 376 aa
Sequence 5: OPS2_PATYE 399 aa
Sequence 6: OPS2_SCHGR 380 aa
Start of Pairwise alignments
Sequences (1:2) Aligned. Score: 91
Sequences (1:3) Aligned. Score: 37
Sequences (1:4) Aligned. Score: 48
Sequences (1:5) Aligned. Score: 20
Sequences (1:6) Aligned. Score: 32
Sequences (2:3) Aligned. Score: 37
Sequences (2:4) Aligned. Score: 48
Sequences (2:5) Aligned. Score: 22
Sequences (2:6) Aligned. Score: 31
Sequences (3:4) Aligned. Score: 40
Sequences (3:5) Aligned. Score: 23
Sequences (3:6) Aligned. Score: 32
Sequences (4:5) Aligned. Score: 20
Sequences (4:6) Aligned. Score: 34
Sequences (5:6) Aligned. Score: 18
Guide tree file created: [21665C]
Start of Multiple Alignment
There are 5 groups
Group 1: Sequences: 2 Score:6084
Group 2: Sequences: 3 Score:3046
Group 3: Sequences: 4 Score:2772
Group 4: Sequences: 5 Score:2489
Group 5: Delayed
Sequence:5 Score:2819
Alignment Score 11778
GCG-Alignment file created [21665B]
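The pairwise scores in the log determine the order in which sequences are merged. As a rough illustration, the Python sketch below takes the fifteen scores printed above and finds the most similar pair and the sequence with the lowest average score. The helper names are made up for this example, and the averaging is only an informal summary, not the clustering method clustalw uses to build its guide tree.

```python
# Sequence numbers and pairwise scores copied from the clustalw log above.
names = {
    1: "OPS2_DROME", 2: "OPS2_DROPS", 3: "OPS2_HEMSA",
    4: "OPS2_LIMPO", 5: "OPS2_PATYE", 6: "OPS2_SCHGR",
}
pairwise_scores = {
    (1, 2): 91, (1, 3): 37, (1, 4): 48, (1, 5): 20, (1, 6): 32,
    (2, 3): 37, (2, 4): 48, (2, 5): 22, (2, 6): 31,
    (3, 4): 40, (3, 5): 23, (3, 6): 32,
    (4, 5): 20, (4, 6): 34,
    (5, 6): 18,
}


def closest_pair(scores):
    """Return the pair of sequence numbers with the highest pairwise score."""
    return max(scores, key=scores.get)


def mean_score(scores, seq):
    """Average score of one sequence against all the others."""
    values = [s for pair, s in scores.items() if seq in pair]
    return sum(values) / len(values)


first, second = closest_pair(pairwise_scores)
print(names[first], names[second])
most_divergent = min(names, key=lambda s: mean_score(pairwise_scores, s))
print(names[most_divergent], mean_score(pairwise_scores, most_divergent))
```

The two fruit fly sequences come out as the closest pair, and sequence 5 has the lowest average score, which is consistent with its alignment being delayed in the log.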

We have aligned ops2 sequences from two fruit fly species, two crab species, locust and scallop. Let's see what emma made of them:

The sequences are very similar, but there are some differences - note the gaps that have been inserted. Also note that since this is a global alignment algorithm, gaps have been inserted to make all the sequences the same length.

Differences in alignment can be very difficult to see in this format. The program prettyplot can enhance visualisation of your results, by aligning the sequences on top of one another.

unix % prettyplot
Displays aligned sequences, with colouring and boxing
Input sequence set: ops2.aln
Graph type [x11]:

A graphic display will appear on your screen detailing your alignment. Identical residues are shown in red, and similar residues in green. This type of display can give you a first impression of regions of conservation.

As with all EMBOSS graphical programs you can capture the output in a file rather than just viewing it on screen. The output is controlled by the -graph family of associated qualifiers (type prettyplot -help -verbose to get a complete listing of options).

We will save our pretty plot to a file in colour postscript format. To do this we use -graph cps and -goutfile rhodopsin .

unix % prettyplot ops2.aln -goutfile rhodopsin -graph cps
Displays aligned sequences, with colouring and boxing

This has created a file that can be printed on a postscript printer or turned into a PDF document with ps2pdf (not an EMBOSS program but commonly found on many UNIX/Linux systems). PDF documents can then be viewed with a PDF viewer such as Acrobat Reader.

To adjust the output of prettyplot (e.g. to increase the number of residues per line) there are a number of options that can be set. Read the help file and try to plot with/without a consensus, different numbers of residues per line and so on. (hint: prettyplot -help )

prophecy is an EMBOSS program for creating a profile from a set of multiply aligned sequences. We'll use our ops2 alignment to show you prophecy.

unix % prophecy
Creates matrices/profiles from multiple alignments
Input sequence: ops2.aln
Profile type
F : Frequency
G : Gribskov
H : Henikoff
Select type [F]: g
Enter a name for the profile [My matrix]: ops2 sequences
Scoring matrix [Epprofile]:
Gap opening penalty [3.0]:
Gap extension penalty [0.3]:
Output file [outfile.prophecy]: ops2.prophecy
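To give a feel for what a profile holds, here is a minimal Python sketch of the simplest kind, a per-column residue frequency table, together with a naive score for a new sequence against it. The three aligned sequences are made up, and neither the data structure nor the scoring corresponds to the prophecy file format or to the Gribskov method selected above.

```python
from collections import Counter


def frequency_profile(aligned):
    """Return one {residue: frequency} dict per column of the alignment."""
    profile = []
    for column in zip(*aligned):
        counts = Counter(column)
        profile.append({res: n / len(column) for res, n in counts.items()})
    return profile


def score_sequence(profile, seq):
    """Naive ungapped score: sum of the column frequencies of the residues in seq."""
    return sum(col.get(res, 0.0) for col, res in zip(profile, seq))


# Hypothetical aligned sequences (illustrative only); '-' marks a gap.
aligned = ["MKV-L", "MRV-L", "MKIAL"]
profile = frequency_profile(aligned)
print(profile[1])
print(round(score_sequence(profile, "MKVAL"), 3))
```

A profile keeps the variability of each column, which a single consensus string discards; that is what lets a profile-based aligner such as prophet reward a residue that is common but not predominant at a position.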

Now let's use the profile we just created to align xlrhodop.pep to our opsin2 sequences.

unix % prophet
Gapped alignment for profiles
Input sequence(s): xlrhodop.pep
Profile or matrix file: ops2.prophecy
Gap opening coefficient [1.0]:
Gap extension coefficient [0.1]:
Output file [ops2.prophet]:

The vertical bars ( | ) represent residues that are identical between the ops2 consensus and our rhodopsin, while the colons ( : ) represent conservative substitutions. We hope you can see that aligning members of a family can reveal conserved regions that may be important for structure and/or function.

Generating consensus sequences from partial order multiple sequence alignment graphs

Motivation: Consensus sequence generation is important in many kinds of sequence analysis ranging from sequence assembly to profile-based iterative search methods. However, how can a consensus be constructed when its inherent assumption (that the aligned sequences form a single linear consensus) is not true?

Results: Partial Order Alignment (POA) enables construction and analysis of multiple sequence alignments as directed acyclic graphs containing complex branching structure. Here we present a dynamic programming algorithm (heaviest_bundle) for generating multiple consensus sequences from such complex alignments. The number and relationships of these consensus sequences reveal the degree of structural complexity of the source alignment. This is a powerful and general approach for analyzing and visualizing complex alignment structures, and can be applied to any alignment. We illustrate its value for analyzing expressed sequence alignments to detect alternative splicing, reconstruct full length mRNA isoform sequences from EST fragments, and separate paralog mixtures that can cause incorrect SNP predictions.
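The core idea of reading a consensus off an alignment graph can be illustrated with a plain maximum-weight path through a small directed acyclic graph. This Python sketch is a simplification under stated assumptions: edge weights stand for the number of sequences traversing an edge, nodes are supplied in topological order, and the graph is made up. It is not the published heaviest_bundle procedure, which extracts several consensus sequences.

```python
def heaviest_path(nodes, edges):
    """Return the maximum-weight path in a DAG.

    nodes: node ids listed in topological order (assumed by this sketch).
    edges: dict mapping (source, target) to a non-negative weight.
    """
    incoming = {}
    for (u, v), w in edges.items():
        incoming.setdefault(v, []).append((u, w))
    best = {n: 0 for n in nodes}      # best path weight ending at each node
    prev = {n: None for n in nodes}   # predecessor on that best path
    for v in nodes:
        for u, w in incoming.get(v, []):
            if best[u] + w > best[v]:
                best[v] = best[u] + w
                prev[v] = u
    # Trace back from the node where the heaviest path ends.
    node = max(nodes, key=lambda n: best[n])
    path = []
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]


# Hypothetical graph: three sequences share A-C, then two read G and one reads T.
nodes = ["A1", "C2", "G3", "T3", "A4"]
edges = {("A1", "C2"): 3, ("C2", "G3"): 2, ("C2", "T3"): 1,
         ("G3", "A4"): 2, ("T3", "A4"): 1}
path = heaviest_path(nodes, edges)
print("".join(node[0] for node in path))
```

The branch through T3 is the kind of structure a single linear consensus would hide; a method that reports multiple consensus paths can surface it.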

Assessment of significance

Sequence alignments are useful in bioinformatics for identifying sequence similarity, producing phylogenetic trees, and developing homology models of protein structures. However, the biological relevance of sequence alignments is not always clear. Alignments are often assumed to reflect a degree of evolutionary change between sequences descended from a common ancestor; however, it is formally possible that convergent evolution can occur to produce apparent similarity between proteins that are evolutionarily unrelated but perform similar functions and have similar structures.

In database searches such as BLAST, statistical methods can determine the likelihood of a particular alignment between sequences or sequence regions arising by chance given the size and composition of the database being searched. These values can vary significantly depending on the search space. In particular, the likelihood of finding a given alignment by chance increases if the database consists only of sequences from the same organism as the query sequence. Repetitive sequences in the database or query can also distort both the search results and the assessment of statistical significance; BLAST automatically filters such repetitive sequences in the query to avoid apparent hits that are statistical artifacts.

Scoring functions

The choice of a scoring function that reflects biological or statistical observations about known sequences is important to producing good alignments. Protein sequences are frequently aligned using substitution matrices that reflect the probabilities of given character-to-character substitutions. A series of matrices called PAM matrices (Point Accepted Mutation matrices, originally defined by Margaret Dayhoff and sometimes referred to as "Dayhoff matrices") explicitly encode evolutionary approximations regarding the rates and probabilities of particular amino acid mutations. Another common series of scoring matrices, known as BLOSUM (Blocks Substitution Matrix), encodes empirically derived substitution probabilities. Variants of both types of matrices are used to detect sequences with differing levels of divergence, thus allowing users of BLAST or FASTA to restrict searches to more closely related matches or expand to detect more divergent sequences. Gap penalties account for the introduction of a gap (on the evolutionary model, an insertion or deletion mutation) in both nucleotide and protein sequences, and therefore the penalty values should be proportional to the expected rate of such mutations. The quality of the alignments produced therefore depends on the quality of the scoring function.

It can be very useful and instructive to try the same alignment several times with different choices for scoring matrix and/or gap penalty values and compare the results. Regions where the solution is weak or non-unique can often be identified by observing which regions of the alignment are robust to variations in alignment parameters.
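The effect of varying gap parameters can be seen by scoring two candidate alignments of the same pair of sequences under different settings. The Python sketch below assumes toy match/mismatch scores and an affine gap cost of gap_open + gap_extend for the first gap position and gap_extend for each further one; programs differ in their exact convention, and the sequences and values are made up.

```python
def score_alignment(row_a, row_b, gap_open, gap_extend, match=4, mismatch=-2):
    """Score a given pairwise alignment under an assumed affine gap model."""
    total = 0.0
    gap_row = None  # which row the currently open gap is in, if any
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue  # ignore columns that are gaps in both rows
        current = "a" if x == "-" else "b" if y == "-" else None
        if current is None:
            total += match if x == y else mismatch
        elif current == gap_row:
            total -= gap_extend               # continuing an open gap
        else:
            total -= gap_open + gap_extend    # opening a new gap
        gap_row = current
    return total


one_long_gap = ("GATTACA", "GAT--CA")
two_short_gaps = ("GATTACA", "GA-T-CA")
for gap_open in (10, 0):
    print(gap_open,
          score_alignment(*one_long_gap, gap_open=gap_open, gap_extend=1),
          score_alignment(*two_short_gaps, gap_open=gap_open, gap_extend=1))
```

With a high gap-opening penalty the single long gap wins clearly; with no opening penalty the two placements tie, which marks this region as one where the alignment is not uniquely determined.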


In recent years, RNA molecules have gained increasing interest since a huge variety of functions associated with them has been found. Consequently, research on small RNAs was elected as the scientific breakthrough of the year 2002 by the readers of Science magazine (Couzin, 2002). The function of an RNA molecule is mainly determined by its (secondary) structure. It is assumed that the structure of an RNA is often more conserved than its sequence (even more than for proteins). Hence, one cannot use standard multiple sequence alignment techniques such as Clustal W (Thompson et al., 1994), Dialign (Morgenstern, 1998) or T-Coffee (Notredame et al., 2000), since they completely neglect structural information.

Multiple sequence- and structure-based alignments of RNAs can be divided into two major classes, the probabilistic and the non-probabilistic approaches. Probabilistic approaches are based on stochastic context-free grammars (SCFG) and require an initial multiple alignment as input. The quality of the outputs crucially depends on this initial alignment. They are used to model RNA-families and/or to predict a secondary structure via comparative analysis [e.g. Cove ( Eddy and Durbin, 1994), RNACAD ( Brown, 1999) and Pfold ( Knudsen and Hein, 2003)]. A non-probabilistic, comparative approach is e.g. given by RNAlign ( Corpet and Michot, 1994) that performs an alignment between a bank of aligned sequences and a new sequence.

In this paper, we propose a non-probabilistic approach to align a set of more than two RNAs with or without known conformations. The standard approach is to perform direct pairwise alignments of RNAs using sequence and (secondary) structure information and to combine the pairwise alignments into a multiple alignment. No general approach yet exists albeit there is a wealth of approaches for pairwise alignment of RNAs (see below). The reason is that the results of the pairwise sequence/structure alignments cannot simply be aligned in a progressive way (like profiles for sequence alignments). To the best of our knowledge, there are only two exceptions, namely PMcomp/PMmulti (Hofacker et al., 2004) and RNAforester (Höchsmann et al., 2003). PMcomp aligns RNA base pairing probability matrices and predicts a common folding structure between two sequences. PMmulti uses PMcomp in a progressive alignment strategy and provides multiple alignments with good qualities. However, it has a high complexity of $O(n^6)$ time and $O(n^4)$ space for the pairwise comparisons. In RNAforester, secondary structures are interpreted as trees, and a tree-based alignment is applied.

We solved the problem of combining pairwise alignments of RNAs as follows. First, alignment edges between RNAs reflecting sequence and structure similarities are generated based on an algorithm published by Jiang et al., 2002. In a second step, these edges are collected in a library, which is given as input to the multiple sequence alignment method T-Coffee (Notredame et al., 2000). Structural positions that are supported by several pairwise comparisons are strengthened. Hence, the result comprises sequence and structure similarities of RNAs albeit the progressive alignment strategy is in principle not structure-based.

We have used the algorithm of Jiang et al., 2002 since it provides the greatest scoring flexibility and has moderate complexity. But any other sequence- and structure-based pairwise alignment method can also be adapted to our approach. The computational problem of pairwise alignment of RNAs was first addressed by Sankoff, 1985, who proposed a dynamic programming algorithm that aligns a set of RNA sequences while predicting their common fold at the same time. Subsequently, a variety of pairwise sequence-structure alignment approaches have been developed. Lenhof et al., 1998 address the problem of optimally aligning a given RNA sequence of unknown structure to one of known sequence and structure. Local pairwise RNA alignments using the same scoring scheme as Jiang et al., 2002 are considered by Backofen and Will, 2004. Besides the approaches listed above, there are several approaches that work on a tree-based representation of RNAs (see e.g. Jiang et al., 1995; Höchsmann et al., 2003; Shapiro and Zhang, 1990).

We tested our approach on eukaryotic SECIS-elements, on tRNA-like 3′ UTR elements from Tymovirus/Pomovirus, and on the Hammerhead ribozyme (type III). We compared our MARNA results with the manual alignments taken from the Rfam database and with the alignments generated by PMmulti.


PROMALS3D (12) is a progressive method that clusters similar sequences and aligns them in a fast way, and uses more elaborate techniques to align the relatively divergent clusters to each other. In the first alignment stage, PROMALS3D aligns similar sequences using a scoring function of weighted sum-of-pairs of BLOSUM62 (13) scores. The first stage is fast and results in a number of prealigned groups (clusters) that are relatively distant from each other. In the second alignment stage, one representative sequence is selected for each prealigned group. Representative sequences (also called targets or target sequences below) are subject to PSI-BLAST searches for additional homologs from the UNIREF90 (14) database and to PSIPRED (15) secondary structure prediction. Then a hidden Markov model of profile–profile alignments with predicted secondary structure scoring is applied to pairs of representatives to derive sequence-based constraints. Structure-based constraints are derived from homologs with known structures (see details below) and are combined with sequence-based constraints to derive a probabilistic consistency scoring function (16). The representative sequences are progressively aligned using such a consistency scoring function, and the prealigned groups obtained in the first stage are merged into the alignment of representatives to form the final multiple sequence alignment.

In PROMALS3D, structural constraints are derived for representative sequences that have homologs with known structures. First, the program identifies homologs with 3D structures (homolog3D) for representative sequences. For each representative sequence, the profile from the PSI-BLAST search against the UNIREF90 database (stored as a checkpoint file) is used to initiate a new PSI-BLAST search (one iteration, with -C option) against the SCOP40 domain database (17, 18) that contains protein domain sequences with known structures. Only structural domains that pass certain similarity criteria (default: e-value < 0.001 and sequence identity not below 20%) are kept. Multiple homolog3Ds could be identified and used for one target sequence if it contains several distinct domains with known structures. Pairwise residue match constraints for two representative target sequences are derived from sequence-based target-to-homolog3D alignments and structure-based homolog3D-to-homolog3D alignments. For example, if residue A in target S1 is aligned to residue B in homolog3D T1, residue B in homolog3D T1 is aligned with residue C in homolog3D T2 according to a structure comparison program, and residue C in homolog3D T2 is aligned with residue D in target S2, then we deduce that residue A in sequence S1 is aligned with residue D in sequence S2, and this pair (A, D) is used as a structure-derived constraint (Figure 1). The alignment between a target sequence and its homolog3D can be the PSI-BLAST alignment, or they can be re-aligned by the profile–profile comparison routine used in PROMALS. The structure constraints among target sequences are combined with those constraints derived from profile–profile comparisons in the original PROMALS to deduce a consistency-based scoring function that integrates database sequence profiles, predicted secondary structures and 3D structural information. We used an empirical weight ratio of 1.5 (can be modified in server) for structure constraints relative to the sequence constraints of profile–profile comparison in the original PROMALS.

Figure 1. Deducing alignment constraints using homologs with 3D structures (homolog3Ds). S1 and S2 are two target sequences. T1 and T2 are their homolog3Ds. The alignment between two sequences S1 and S2 is deduced from two sequence-based sequence-to-homolog3D alignments and one structure-based homolog3D-to-homolog3D alignment. The three aligned residue pairs (A, B), (B, C) and (C, D) indicate that the pair (A, D) is aligned in the deduced alignment between two targets.
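The chain of aligned pairs (A, B), (B, C) and (C, D) amounts to composing three residue-to-residue mappings. The Python sketch below shows that composition with made-up residue indices; the dictionary representation and the function name are assumptions for illustration, not PROMALS3D code.

```python
def compose(*mappings):
    """Chain residue mappings; keep only residues that survive every step."""
    first, *rest = mappings
    result = {}
    for start, value in first.items():
        for mapping in rest:
            value = mapping.get(value)
            if value is None:       # chain broken: no constraint for this residue
                break
        if value is not None:
            result[start] = value
    return result


# Hypothetical residue index pairs (illustrative only).
s1_to_t1 = {10: 12, 11: 13, 12: 14}   # sequence-based: target S1 -> homolog3D T1
t1_to_t2 = {12: 40, 13: 41}           # structure-based: T1 -> T2
t2_to_s2 = {40: 7, 41: 8, 42: 9}      # sequence-based: homolog3D T2 -> target S2

constraints = compose(s1_to_t1, t1_to_t2, t2_to_s2)
print(constraints)
```

Residue 12 of S1 yields no constraint because its partner in T1 has no structural equivalent in T2, so only residues supported along the whole chain contribute.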

Human Genomics in Immunology

Robert L. Nussbaum , Jennifer M. Puck , in Clinical Immunology (Fifth Edition) , 2019

Genome Annotation

A consensus sequence of the human genome is only the first step in furthering our understanding of normal biological functions and how mutations lead to abnormal functions that cause disease. The Human Genome Project has now matured into a number of important basic and applied research areas: (i) acquiring a comprehensive catalogue of human variation and the impact of such variation on phenotype, including disorders of human development; (ii) comparing the genomes of humans with those of other organisms, including model organisms and human ancestors; and (iii) learning how to interpret all the sequence elements within the genome, not just the codons. Even now, over a dozen years after “completion” of the human genome sequence, a complete, accurate, and single contiguous stretch representing a reference human haploid genome is still being constructed, and updated versions of the genome sequence continue to be released. As described below, the greatest challenges to completing the human genome sequence are posed by regions that contain segmental duplications of nearly the same sequence.

SIAM Journal on Applied Mathematics

The study and comparison of sequences of characters from a finite alphabet is relevant to various areas of science, notably molecular biology. The measurement of sequence similarity involves the consideration of the different possible sequence alignments in order to find an optimal one for which the “distance” between sequences is minimum. By associating a path in a lattice to each alignment, a geometric insight can be brought into the problem of finding an optimal alignment. This problem can then be solved by applying a dynamic programming algorithm. However, the computational effort grows rapidly with the number $N$ of sequences to be compared ($O(l^N)$, where $l$ is the mean length of the sequences to be compared).

It is proved here that knowledge of the measure of an arbitrarily chosen alignment can be used in combination with information from the pairwise alignments to considerably restrict the size of the region of the lattice in consideration. This reduction implies fewer computations and less memory space needed to carry out the dynamic programming optimization process. The observations also suggest new variants of the multiple alignment problem.
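The growth in effort is easy to see by counting lattice points. In the short Python sketch below, a full dynamic programming lattice for sequences of lengths l_1, ..., l_N is assumed to have (l_1 + 1) × ... × (l_N + 1) points; the sequence length of 100 is an arbitrary illustrative choice.

```python
import math


def lattice_points(lengths):
    """Number of points in the full alignment lattice for the given sequence lengths."""
    return math.prod(n + 1 for n in lengths)


for n_sequences in (2, 3, 4):
    print(n_sequences, lattice_points([100] * n_sequences))
```

Each added sequence multiplies the lattice size by roughly the sequence length, which is why restricting the region that must be explored matters so much.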

Bioinformatics explained: Sequence logo

In DNA, promoter sites or other DNA binding sites are highly conserved (see figure 20.8). This is also the case for repressor sites as seen for the Cro repressor of bacteriophage λ.

When aligning such sequences, regardless of whether they are highly variable or highly conserved at specific sites, it is very difficult to generate a consensus sequence which covers the actual variability of a given position. In order to better understand the information content or significance of certain positions, a sequence logo can be used. The sequence logo displays the information content of all positions in an alignment as residues or nucleotides stacked on top of each other (see figure 20.8). The sequence logo provides a far more detailed view of the entire alignment than a simple consensus sequence. Sequence logos can help identify protein binding sites on DNA sequences and conserved residues in aligned domains of protein sequences, and have a wide range of other applications.

Each position of the alignment, and consequently of the sequence logo, shows the sequence information as a computed score based on Shannon entropy [Schneider and Stephens, 1990]. The height of the individual letters represents the sequence information content in that particular position of the alignment.
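The entropy-based score can be sketched in Python as follows. For a DNA column the information content is taken as log2(4) minus the Shannon entropy of the observed base frequencies, and each letter's height as its frequency times that value. This is a minimal version under stated assumptions: it omits the small-sample correction used by Schneider and Stephens, and the example columns are made up.

```python
import math
from collections import Counter


def information_content(column, alphabet_size=4):
    """Information content in bits of one ungapped alignment column."""
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total)
                   for n in Counter(column).values())
    return math.log2(alphabet_size) - entropy


def letter_heights(column, alphabet_size=4):
    """Height of each letter in the logo stack: frequency times information content."""
    info = information_content(column, alphabet_size)
    total = len(column)
    return {base: (n / total) * info for base, n in Counter(column).items()}


# Hypothetical columns (illustrative only).
print(information_content("AAAA"))   # fully conserved column
print(information_content("ACGT"))   # all four bases equally frequent
print(letter_heights("AAGG"))
```

A fully conserved DNA column reaches the maximum of 2 bits, while a column with all four bases at equal frequency carries no information and would be drawn with zero height.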

A sequence logo is a much better visualization tool than a simple consensus sequence. Consider an alignment where, in one position, a particular residue is found in 70% of the sequences. A consensus sequence typically displays only that single residue with 70% coverage. In figure 20.8 an ungapped alignment of 11 E. coli start codons including flanking regions is shown. In this example, a consensus sequence would only display ATG as the start codon in position 1, but the sequence logo shows that GTG is also allowed as a start codon.

Figure 20.8: Ungapped sequence alignment of eleven E. coli sequences defining a start codon. The start codons start at position 1. Below the alignment is shown the corresponding sequence logo. As seen, a GTG start codon and the usual ATG start codons are present in the alignment. This can also be visualized in the logo at position 1.