How can I predict DNA binding affinities from a protein sequence?

Are there any computational tools to predict the binding affinity to specific DNA motifs from protein domain sequence information?

Those that come up from a Google search ("predict DNA binding from protein domain tool") seem pretty well suited to your question.

Depending on your computational/statistical know-how, you might also find these papers relevant:

  • Huang et al, Predicting and analyzing DNA-binding domains using a systematic approach to identifying a set of informative physicochemical and biochemical properties
  • Gao et al, A threading-based method for the prediction of DNA-binding proteins with application to the human genome

Protein-DNA binding specificity predictions with structural models

Protein-DNA interactions play a central role in transcriptional regulation and other biological processes. Investigating the mechanism of binding affinity and specificity in protein-DNA complexes is thus an important goal. Here we develop a simple physical energy function, which uses electrostatics, solvation, hydrogen bonds and atom-packing terms to model direct readout and sequence-specific DNA conformational energy to model indirect readout of DNA sequence by the bound protein. The predictive capability of the model is tested against another model based only on the knowledge of the consensus sequence and the number of contacts between amino acids and DNA bases. Both models are used to carry out predictions of protein-DNA binding affinities which are then compared with experimental measurements. The nearly additive nature of protein-DNA interaction energies in our model allows us to construct position-specific weight matrices by computing base pair probabilities independently for each position in the binding site. Our approach is less data intensive than knowledge-based models of protein-DNA interactions, and is not limited to any specific family of transcription factors. However, native structures of protein-DNA complexes or their close homologs are required as input to the model. Use of homology modeling can significantly increase the extent of our approach, making it a useful tool for studying regulatory pathways in many organisms and cell types.
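The step from additive energies to a position-specific weight matrix can be illustrated with a short sketch. It assumes that each position contributes independently, so base probabilities are Boltzmann weights normalised within a position. The function name, the value of kT and the example energies are illustrative assumptions, not taken from the paper.

```python
import numpy as np

BASES = "ACGT"
KT = 0.593  # kcal/mol at roughly 298 K (assumed temperature)


def energies_to_pwm(ddg, kt=KT):
    """Convert per-position binding energies (L x 4, kcal/mol) to base probabilities.

    Assumes additivity: every position is treated independently, so
    p(base | position) is a Boltzmann weight normalised within that position.
    """
    ddg = np.asarray(ddg, dtype=float)
    # Subtracting the row minimum keeps exp() well behaved; it cancels on normalisation.
    weights = np.exp(-(ddg - ddg.min(axis=1, keepdims=True)) / kt)
    return weights / weights.sum(axis=1, keepdims=True)


# Hypothetical energies for a 3-bp site (rows: positions, columns: A, C, G, T).
example_ddg = [
    [0.0, 1.5, 1.5, 1.5],  # A is favoured
    [2.0, 0.0, 2.0, 2.0],  # C is favoured
    [0.5, 0.5, 0.5, 0.5],  # no preference
]
pwm = energies_to_pwm(example_ddg)
consensus = "".join(BASES[i] for i in pwm.argmax(axis=1))
```

A position with equal energies for all four bases yields a uniform column (0.25 each), which is how an uninformative position appears in the resulting matrix.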


Figure: ΔΔG predictions (ddG comp) versus experimental measurements (ddG exp). (…

Figure: ΔΔG predictions (ddG comp) versus experimental measurements (ddG exp). (…

Figure: ΔΔG predictions (ddG comp) versus experimental measurements (ddG exp). Static…

Figure: Experimental binding affinities conferred by indirect readout can be explained with DNA conformational…

Figure: Degree of pairwise additivity in binding energies predicted with the dynamic model. Comparison…


The binding affinities of protein-nucleic acid interactions could be altered due to missense mutations occurring in DNA- or RNA-binding proteins, therefore resulting in various diseases. Unfortunately, a systematic comparison and prediction of the effects of mutations on protein-DNA and protein-RNA interactions (these two mutation classes are termed MPDs and MPRs, respectively) is still lacking. Here, we demonstrated that these two classes of mutations could generate similar or different tendencies for binding free energy changes in terms of the properties of mutated residues. We then developed regression algorithms separately for MPDs and MPRs by introducing novel geometric partition-based energy features and interface-based structural features. Through feature selection and ensemble learning, similar computational frameworks that integrated energy- and nonenergy-based models were established to estimate the binding affinity changes resulting from MPDs and MPRs, but the selected features for the final models were different and therefore reflected the specificity of these two mutation classes. Furthermore, the proposed methodology was extended to the identification of mutations that significantly decreased the binding affinities. Extensive validations indicated that our algorithm generally performed better than the state-of-the-art methods on both the regression and classification tasks. The webserver and software are freely available at and

SemanticBI is a combined convolutional neural network (CNN) and recurrent neural network (RNN) model trained on an ensemble of protein binding microarray (PBM) data sets covering multiple TFs (the DREAM5 PBM data sets). Described in Quan et al., 2021.

TFaffinity is MATLAB code that calculates TF-DNA binding affinities using the TRAP algorithm. It is described in Whiehle et al., 2019.

ChIPanalyser is an R package that calculates ChIP-seq-like profiles based on a statistical thermodynamic framework. The model relies on four considerations: TF binding sites can be scored using a position weight matrix (PWM); DNA accessibility plays a role in transcription factor binding; binding profiles depend on the number of transcription factors bound to DNA; and binding energy (another way of describing PWMs) or binding specificity should be modulated (hence the introduction of a binding specificity modulator). ChIPanalyser produces profiles that simulate real ChIP-seq profiles and provides accuracy measurements for these predicted profiles by comparing them to real ChIP-seq data. Described in Martin and Zabet, 2019.

The DeepBind algorithm is based on convolutional neural networks and can discover new patterns even when the locations of patterns within sequences are unknown. For training, DeepBind uses a set of sequences and, for each sequence, an experimentally determined binding score. Sequences can have varying lengths, and binding scores can be real-valued measurements or binary class labels. The authors (Alipanahi et al., 2015) claimed that this algorithm outperforms all 26 existing methods for protein-DNA specificity prediction previously compared by Weirauch et al., 2013. This is a standalone application, available for Windows and Linux.
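To make the idea of finding a pattern at an unknown location concrete, here is a minimal sketch of the core operation in a convolutional motif detector: slide a filter along a one-hot encoded sequence, rectify, and take the maximum. This is not DeepBind's implementation; the filter is a hand-written hypothetical one rather than a learned one, and the function names are assumptions.

```python
import numpy as np

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as an (L, 4) array; unknown bases become all-zero rows."""
    index = {base: i for i, base in enumerate(BASES)}
    encoded = np.zeros((len(seq), 4))
    for pos, base in enumerate(seq.upper()):
        if base in index:
            encoded[pos, index[base]] = 1.0
    return encoded


def scan_and_pool(seq, motif_filter, bias=0.0):
    """Slide one filter along the sequence, rectify, and max-pool over positions.

    Returns (best_score, best_position). Max-pooling is what makes the score
    independent of where the pattern sits in the sequence.
    """
    x = one_hot(seq)
    width = motif_filter.shape[0]
    if len(seq) < width:
        raise ValueError("sequence is shorter than the filter")
    scores = np.array(
        [np.sum(x[i:i + width] * motif_filter) for i in range(len(seq) - width + 1)]
    ) + bias
    rectified = np.maximum(scores, 0.0)
    best = int(rectified.argmax())
    return float(rectified[best]), best


# Hypothetical filter that simply counts matches to the 4-mer TGAC.
tgac_filter = one_hot("TGAC")
score, position = scan_and_pool("AATGACGG", tgac_filter)
```

In a trained network the filter weights, the bias and the layers after pooling are all learned from the sequence/score pairs; only the scan-and-pool step is shown here.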

BayesPI-BAR (Bayesian method for Protein-DNA Interaction with Binding Affinity Ranking) uses biophysical modeling of protein-DNA interaction to predict single nucleotide polymorphisms (SNPs) that cause significant changes in the binding affinity of a regulatory region for transcription factors (TFs). It includes TF chemical potentials or protein concentrations, and direct TF binding targets as input. The authors claimed that the method compares favorably to existing programs such as sTRAP and is-rSNP, when evaluated on the same SNPs. The method is described here.

A web tool that implements a flexible and extensible algorithm for predicting TFBS. The algorithm makes use of both direct (the sequence) and several indirect readout features of protein-DNA complexes (biophysical properties such as bendability or the solvent-excluded surface of the DNA). This algorithm significantly outperforms state-of-the-art approaches for in silico identification of TFBS. Users can submit FASTA sequences for analysis.

TRAP calculates binding affinity based on the matrix description of a given TF and a set of DNA sequences to be annotated (input). It requires the specification of two biophysically-motivated parameters. The freely available program code is written in C. Further details are available in the paper by Roider et al., 2007.
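A rough sketch of an occupancy-style affinity calculation of this kind is given below. It is a simplification written for illustration and not the published TRAP code: the mismatch-energy formula, the pseudocount, forward-strand-only scanning, and the parameter names `lam` (energy scale) and `r0` (concentration-like factor) are all assumptions.

```python
import numpy as np

BASES = "ACGT"


def mismatch_energy(site, counts, lam, pseudocount=1.0):
    """Energy of a site relative to the consensus, in units of the scale lam.

    counts is an (L, 4) matrix of base counts (columns A, C, G, T).
    The consensus base at each position contributes zero energy.
    Sites must contain only A, C, G or T.
    """
    counts = np.asarray(counts, dtype=float) + pseudocount
    cols = [BASES.index(base) for base in site]
    rows = np.arange(len(site))
    return float(np.sum(np.log(counts.max(axis=1) / counts[rows, cols])) / lam)


def expected_occupancy(seq, counts, lam=0.7, r0=1.0):
    """Sum of per-site binding probabilities along the forward strand."""
    width = len(counts)
    total = 0.0
    for start in range(len(seq) - width + 1):
        energy = mismatch_energy(seq[start:start + width], counts, lam)
        boltzmann = r0 * np.exp(-energy)
        total += boltzmann / (1.0 + boltzmann)  # saturating occupancy of one site
    return total


# Hypothetical count matrix with consensus ACG.
acg_counts = [[10, 0, 0, 0], [0, 10, 0, 0], [0, 0, 10, 0]]
with_site = expected_occupancy("TACGT", acg_counts)
without_site = expected_occupancy("TTTTT", acg_counts)
```

Because every window contributes, the result is a graded affinity for the whole sequence rather than a yes/no call on individual sites.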

STAP uses a biophysical model to analyze transcription factor (TF)-DNA binding data, such as ChIP-chip or ChIP-seq data. The program assumes that the measured affinity of a sequence to a TF (TF_exp) in some ChIP-chip or ChIP-seq experiment is determined by: 1) the number and strength of binding sites of TF_exp in this sequence; 2) the presence of other sites in the neighborhood that may interact cooperatively with the sites of TF_exp. Specifically, it takes as input a set of DNA sequences, their binding affinities to some TF as measured by experiments (TF_exp), and the position weight matrices (PWMs) of a set of TFs, including TF_exp. It learns the relevant parameters of the biophysical model, including those of TF-DNA interaction and those of TF-TF cooperative interactions.

The input to MatrixREDUCE is a sequence file in FASTA format and an expression data file in tab-delimited text format (missing values are allowed). Output data include PSAMs in numeric and graphical format, parameters of the fitted model, and an HTML summary page.

BayesPI integrates Bayesian model regularization with biophysical modeling of protein-DNA interactions and nucleosome positioning to study protein-DNA interactions, using a high-throughput dataset.

The scoring function calibrated against crystallographic data on protein-DNA contacts can recover PWMs, sometimes outperforming experimental PWMs.

Prediction of Protein-Protein Binding Affinity through their Amino Acid Sequence

Protein-protein interactions (PPIs) are central to many biological processes, and studying them often requires knowing the binding affinity between the interacting proteins. Measuring binding affinities experimentally requires an expensive setup and is tedious. Computational methods are therefore used to predict binding affinity, as they are faster and can give accurate results.

Predicting the binding affinity of protein complexes is a problem that has been studied for the past two decades [1,2]. Various computational methods for binding affinity prediction have been proposed using empirical scoring functions [3,4,5], knowledge-based methods [6,7,8,9], and QSARs [10,11]. These methods have limitations: they can handle only small amounts of data, and their results are not very accurate [12].

Yugandhar and Gromiha (2014) proposed a novel method that predicts binding affinity from amino acid sequence and that they report to be more accurate [12]. In this method, the protein-protein complexes are first classified by molecular weight, function, and percentage of binding site residues; the relation between the sequence and the structural properties is then analyzed and used to predict the binding affinity.

The sequence-based features include predicted binding site residues [13] and property values of the 20 amino acids from the AAindex database [14]. The structure-based features include predicted binding site residues using the SPPIDER webserver [15], the number of hydrogen bonds [16], accessible surface area [17], non-bonded interaction energy [18], electrostatic energy, and energy due to bond length, bond angle, and torsion angle [19]. Only a reduced set of properties is used, because several of them are inter-related, which could bias the model [12]. After comparing the correlation between all possible pairs of properties, the authors were left with 113 features/properties [12]. To make it easier to identify the features affecting binding affinity, the protein complexes are classified into the following groups [12]:

  • Antigen-Antibody: Complex formed by interaction between antigen and antibody.
  • Enzyme-Inhibitor: Complex formed by interaction between enzyme and inhibitor.
  • Other enzymes: Complexes in which one of the interacting proteins is an enzyme and the other is anything other than an inhibitory protein.
  • G-protein containing: Complexes in which one of the interacting proteins is a G-protein.
  • Receptor containing: Complexes in which one of the interacting proteins functions as a receptor.
  • Miscellaneous: Complexes that do not fall into any of the above classes.

How is the binding affinity predicted using amino acid sequences?

An independent regression model is generated for each of the classified groups by combining more than one feature using multiple regression [20]. The performance of each model is validated with a jack-knife test (a leave-one-out resampling test commonly used for machine learning models). After that, a step-wise least-squares fit is performed using multiple regression to identify the combinations of features that predict the binding affinity with high accuracy [12], and a P-value is estimated to assess the significance of the result for each combination. If the P-value is <0.05, the combination is considered statistically significant; otherwise, other combinations of features are tried with the same procedure.
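The regression and jack-knife steps can be sketched as follows. This is an illustration of the general technique on synthetic data, not the authors' code or features; the feature matrix, weights and noise level are made up, and the function names are assumptions.

```python
import numpy as np


def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [intercept, w1, w2, ...]."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict(coef, X):
    """Apply fitted coefficients to one feature vector or a matrix of them."""
    return coef[0] + np.asarray(X) @ coef[1:]


def jackknife_predictions(X, y):
    """Leave-one-out test: predict each complex from a model fitted on all the others."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    predictions = np.empty_like(y)
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        predictions[i] = predict(fit_linear(X[keep], y[keep]), X[i])
    return predictions


# Synthetic stand-in for one class of complexes: 30 complexes, 3 features.
rng = np.random.default_rng(0)
features = rng.normal(size=(30, 3))
true_weights = np.array([1.5, -2.0, 0.5])
affinity = features @ true_weights - 9.0 + rng.normal(scale=0.1, size=30)

loo = jackknife_predictions(features, affinity)
r = np.corrcoef(loo, affinity)[0, 1]  # jack-knife correlation
```

In the procedure described above, one such model would be fitted per class of complexes, and the step-wise search would repeat this evaluation for different feature combinations.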


In recent years, full atomic resolution MM approaches for the binding partners combined with an implicit continuum solvent model have been applied to evaluate protein–protein complexes. In contrast to the scoring of single complexes, an ensemble of complex conformations is evaluated in MM Poisson–Boltzmann/surface area (MM-PBSA) or MM-GBSA (using the generalized Born method instead of the Poisson–Boltzmann) approaches. In most cases, the ensemble is generated using MD simulations with an explicit solvent representation. To limit the computational demand and to also keep a narrow distribution of conformations near an initial state, usually simulations of a few nanoseconds are performed. 79 Due to the large energy fluctuations of the explicit solvent molecules, the reanalysis of the trajectory (ensemble) is performed after removing the explicit water molecules (sometimes interface waters are retained) employing an implicit solvent model.

For each evaluated complex structure, the mean partner interaction energy can be calculated. In the most basic single trajectory approach, this is achieved by taking the ensemble average energies of the complex and subtracting the corresponding energies of the partners from the same trajectory. This approximation implies that ligand and receptor do not undergo significant conformational changes upon binding and that changes in intramolecular energies can be neglected. The mean interaction energy consists of pairwise electrostatic Coulomb interactions and electrostatic (polar) solvation contributions obtained using the finite-difference Poisson–Boltzmann or generalized Born (GB) equations. Nonpolar solvation (or desolvation) is usually calculated from the buried surface area (BSA) upon complex formation using an empirical surface tension parameter that represents both cavity creation and van der Waals interaction of the protein with the solvent. Alternatively, the nonpolar part can also be split further into a surface area dependent cavity or hydrophobic term and a change in van der Waals interaction between proteins and solvent. The latter contribution can be estimated from a solvent grid representation around the complex 79 or from a surface integral approach. 82 In order to include changes in intramolecular contributions, it is possible to run MD simulations not only for the complex but also for the separate (unbound) partners and evaluate each ensemble separately. However, the single trajectory approach is used much more frequently and typically gives better convergence of the mean energies due to cancellation of the intramolecular contributions. Nevertheless, interaction energies obtained from MM-PBSA or MM-GBSA can show significant statistical errors, because numerically small interaction energies have to be calculated by subtracting numerically large and slowly converging mean energies (of the complex and the individual partners).
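The single trajectory bookkeeping amounts to a per-frame subtraction followed by an ensemble average. A minimal sketch, assuming the per-frame total energies (MM plus solvation terms) of the complex and of each partner have already been computed by some MM-PBSA or MM-GBSA tool; the function name and the numbers are hypothetical.

```python
import numpy as np


def single_trajectory_binding_energy(e_complex, e_receptor, e_ligand):
    """Mean interaction energy from one trajectory of the complex.

    All three arrays hold per-frame energies (same frames, same units);
    receptor and ligand energies come from the complex frames with the
    partner removed. Returns (mean, standard error of the mean).
    """
    per_frame = (
        np.asarray(e_complex, dtype=float)
        - np.asarray(e_receptor, dtype=float)
        - np.asarray(e_ligand, dtype=float)
    )
    mean = per_frame.mean()
    # This standard error assumes uncorrelated frames, which MD frames rarely are,
    # so treat it as a lower bound on the real uncertainty.
    sem = per_frame.std(ddof=1) / np.sqrt(len(per_frame))
    return float(mean), float(sem)


# Hypothetical energies in kcal/mol for three frames.
complex_e = [-5030.0, -5042.0, -5036.0]
receptor_e = [-3500.0, -3508.0, -3504.0]
ligand_e = [-1490.0, -1492.0, -1491.0]
mean_e, sem_e = single_trajectory_binding_energy(complex_e, receptor_e, ligand_e)
```

The example also shows the convergence issue noted above: an interaction energy of about -41 kcal/mol is obtained as the difference of totals in the thousands, so small relative fluctuations in the totals translate into large relative errors in the result.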

In addition to interaction energies, changes in the conformational entropy of binding partners can be estimated from a NM analysis of the complex and the isolated partners. This term is often neglected due to its large computational costs or replaced using alternative approaches based on the energy fluctuations within the ensemble or a quasi-harmonic (QH) analysis of the trajectories. Methods have also been developed to just obtain the change in translational and orientational (external) entropy of one partner with respect to the other. 83, 84

MM-GBSA and MM-PBSA have been used both for the calculation of absolute protein–protein binding affinities and to evaluate docked protein–protein complex structures (reviewed in Reference 79 ). For example, Gohlke et al. 85, 86 investigated the Ras–Raf and Ras–RalGDS complex and reported a binding free energy in good agreement with experiment, however, depending on how conformational entropy was estimated and with errors of about several kcal/mol. MM-GBSA and MM-PBSA were successfully used in several studies for reranking protein–protein docking solutions. 87-91 A very systematic study to predict the free energy of binding and to score docked complexes was recently performed by Chen et al. 89 The authors compared various force fields, protocols for performing MD simulations and using the Poisson–Boltzmann or the GB solvation model (not including conformational entropy effects) for 46 protein–protein complexes. The highest correlation between the predicted binding affinities and the experimental data was −0.64 using MM-GBSA, a low interior dielectric constant of 1 and the AMBER ff02 force field. This correlation was better than using MM-PBSA for which the highest correlation was −0.523.

Often water molecules form specific water mediated contacts at protein–protein interfaces that may not be accurately represented in an implicit solvent model. 92 It is straightforward to include interface water molecules in the calculations and only treat the bulk water as a dielectric continuum. 90 The inclusion of an explicit water model has been proven to give good results in various protein–protein binding affinity predictions for single complexes (e.g., Ulucan et al. 93 ). For example, the correlation of MM-GBSA results to experimental binding affinities of 20 native protein–protein complexes was shown to increase significantly (up to 30%) by inclusion of 30 explicit water molecules at the binding interface. 90 On a smaller test set of four proteins, crystal water molecules were added in the evERdock approach that improved water mediated contacts so that the identification of near-native binding decoys could be improved. 94

The changes in conformational entropy upon protein–protein complex formation are usually neglected in the MM-PBSA evaluation due to the large computational costs to perform NM analysis on protein–protein complexes and on the isolated partners. However, alternative methods to estimate the conformational entropy contribution have recently been introduced and tested also for evaluating predicted protein–protein complexes. In the interaction entropy (IE) approach, the protein–ligand or protein–protein interaction energy fluctuations during the MD trajectories are used to estimate the conformational entropy. 95 This method does not allow to calculate absolute entropy values but is applicable to calculate relative entropy changes, for example, after protein–ligand binding. As no extra computational effort is needed, it has been concluded that for receptor–ligand binding affinity prediction using IE is superior to the standard NM analysis for estimating entropy effects. IE has been successfully applied in studies in combination with MM-PBSA and MM-GBSA calculations of protein–protein 96, 97 or protein–ligand 95, 97 binding affinities. Interestingly, quite substantial differences in the resulting free energies of binding using NM and IE were encountered in several studies. 95, 97, 98 In a recent study of Sun et al., 97 entropy effects on the performance of endpoint methods of over 1,500 protein–ligand systems were assessed. The best correlation to the experimental binding affinities was gained with IE, whereas the absolute binding free energy values had the highest correspondence to experimental values using NM calculations. Recently, however, Kohut et al. 98 proposed that the reproducibility of IE is less robust than that of NM or QH, especially for flexible systems. The calculated entropy value is mainly determined by the highest spikes of interaction energy and it is argued that the calculated entropies are difficult to converge as the simulations are prolonged. 
Aldeghi et al. 99 also found a higher sensitivity of the IE term to the simulated ensemble than the other MM-PBSA terms for three sets of bromodomain-inhibitor pairs. Hence, further testing could be useful to check the robustness of the IE approach. In a study of 20 protein–protein systems using IE with MM/GBSA, the mean absolute error to experimental binding affinities could be substantially reduced by optimizing the residue type-specific dielectric constants; the errors were notably lower than with NM analysis using a standard dielectric constant of 1. 91
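The IE estimate and its sensitivity to energy spikes can be illustrated with a short sketch. It assumes the commonly quoted form of the interaction entropy, −TΔS = kT ln⟨exp(β ΔE_int)⟩ with ΔE_int = E_int − ⟨E_int⟩; the function name and the test trajectories are hypothetical.

```python
import numpy as np

KB = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def interaction_entropy(e_int, temperature=298.15):
    """Return -T*dS (kcal/mol) from per-frame interaction energies (kcal/mol)."""
    e_int = np.asarray(e_int, dtype=float)
    beta = 1.0 / (KB * temperature)
    x = beta * (e_int - e_int.mean())
    # log of the mean of exp(x), shifted by the maximum to avoid overflow.
    shift = x.max()
    log_mean_exp = shift + np.log(np.mean(np.exp(x - shift)))
    return float(log_mean_exp / beta)


# Hypothetical trajectories: constant, mildly fluctuating, and one with a single spike.
flat = interaction_entropy([-40.0] * 100)
mild = interaction_entropy([-41.0, -40.0, -39.0] * 33)
spiked = interaction_entropy([-40.0] * 99 + [-30.0])
```

Because the average is taken over an exponential, a single frame with an unusually unfavourable interaction energy dominates the result, which is consistent with the convergence concern raised above.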

Formally, the solute entropy change during association can be split into an external entropic contribution due to the reduction of motion in external degrees of freedom (relative position and orientation of ligand and receptor) and internal entropy (conformational) upon complex formation. Although a full decoupling of external and internal entropy is not in general possible, one can still compute the lowest upper bound of the external entropy. 83 The number of configurations needed to obtain converged results is, however, quite high using the approach with simulations of over 1 μs required for a Barnase–Barstar complex. Furthermore, an external entropy correction alone has been shown to not necessarily improve the correlation to experimental binding affinities for protein–ligand systems. 100

4.1 Mutations influencing protein–protein binding affinities

Mutagenesis of residues at protein–protein interfaces has demonstrated that the contributions to binding affinity are not uniformly distributed but can often be attributed to a small number of residues called hot spots. 101, 102 For protein engineering, it is of significant interest to predict changes in binding affinity of protein–protein complexes due to mutations and to identify important residues for the interaction (hotspots). Several approaches based on just single complex structures are available to estimate the effect of interface mutations (recently reviewed in WIREs Computational Molecular Science 103 ). The ensembles-based MM/PBSA or MM/GBSA approaches can also be used to identify hot spots and to calculate the change in the binding free energy upon mutation of interface residues. 104 The hot spots of 15 protein–protein complexes were calculated recently with MM/PBSA using residue type specific dielectric constants (11 for charged residues and 7 for nonpolar and polar residues). 77 In this study, a mean SE of 1.1 kcal/mol was achieved in 210 mutations after geometry optimization and subsequent MD simulations in explicit solvent. Using an extension of the MM/PBSA method with residue specific dielectric constants, Petukh et al. 105 achieved a high correlation (correlation coefficient of −0.62) with experiment for a set of 1,300 mutations in 43 protein–protein complexes (several other applications are reviewed in Reference 79 ).

Besides estimating binding free energy changes due to mutations using single complex conformations or changes in mean interaction energies obtained from the MM/PBSA or MM/GBSA methods, it is also possible to perform alchemical transformations to mutate residues in silico. In alchemical free energy simulations, one represents the selected amino acid side chain by two force fields, one representing the wild type (State A) and the other the mutated residue (State B). During a series of MD simulations, the force field for State A is switched off (decoupled from the interaction with other parts of the system) whereas the force field of State B is switched on. The changes in free energy can be calculated by integrating the generalized force along the switching pathway (thermodynamic integration [TI] 106 ), by free energy perturbation (FEP) 107 or by alternative methods such as Bennett's acceptance ratio (BAR) method. 108 In order to obtain the effect of a mutation on binding affinity, the transformation needs to be performed in the complex and for the unbound solvated partner. The advantage of the approach is that all energetic as well as entropic contributions that may influence the change in binding affinity are accurately included (within the limits of a molecular mechanics force field description of the system). The disadvantage is the typically higher computational cost compared to the endpoint methods described above. Due to methodological progress and increased computational power, alchemical free energy methods are increasingly being used to study the effect of mutations on protein stability and protein–protein binding. 109-114 Although typically performed in explicit solvent, it is also possible to perform alchemical transformations in implicit solvent 115 with computational costs similar to MM/PBSA but avoiding the endpoint approximations inherent to endpoint ensemble approaches. 
A recent systematic assessment of more than 100 mutations (including charge changing mutations) in four protein–protein complexes resulted in a better performance and higher correlation of the alchemical FEP approach (root-mean-square error [RMSE] = 1.2 kcal/mol) than MM/GBSA (RMSE = 1.5 kcal/mol) in reproducing experimental affinities. 116
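The thermodynamic cycle behind such calculations can be sketched in a few lines. The example assumes the ensemble averages ⟨dU/dλ⟩ at each λ window are already available from simulations; it uses simple trapezoidal integration, and the gradient values are made up for illustration.

```python
import numpy as np


def ti_free_energy(lambdas, mean_du_dlambda):
    """Thermodynamic integration: integrate <dU/dlambda> over lambda (trapezoid rule)."""
    lam = np.asarray(lambdas, dtype=float)
    grad = np.asarray(mean_du_dlambda, dtype=float)
    return float(np.sum(0.5 * (grad[1:] + grad[:-1]) * np.diff(lam)))


def ddg_binding(dg_mutation_in_complex, dg_mutation_unbound):
    """Close the cycle: change in binding free energy caused by the mutation A -> B."""
    return dg_mutation_in_complex - dg_mutation_unbound


# Hypothetical <dU/dlambda> values (kcal/mol) at five lambda windows.
lambdas = np.linspace(0.0, 1.0, 5)
dg_complex = ti_free_energy(lambdas, [4.0, 4.0, 4.0, 4.0, 4.0])
dg_unbound = ti_free_energy(lambdas, [0.0, 1.0, 2.0, 3.0, 4.0])
ddg = ddg_binding(dg_complex, dg_unbound)
```

With this sign convention, a positive value means the mutation weakens binding. Production workflows typically use more λ windows and estimators such as BAR, with error estimates for each leg of the cycle.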

High-throughput assay for determining specificity and affinity of protein-DNA binding interactions

Limited information exists for the binding specificities of many important transcription factors. To address this, we have previously developed a microwell-based assay for directly measuring the affinity of DNA-protein binding interactions. We describe here the detailed protocol for determining sequence specificities of DNA-binding proteins using this assay. The described method is rapid; after preparation of the reagents, the assay can be run in a single day, and its throughput can be increased further by automation. The method is quantitative but requires prior knowledge of one high-affinity binding site for the protein of interest. The protocol can be adapted for determining the effect of protein modifications and protein-protein interactions on DNA-binding specificity, and for engineering proteins with new DNA-binding specificities. In addition, the method is suitable for high-throughput screening to identify proteins or small molecules that modulate protein-DNA binding interactions.

D5. DNA Binding Proteins

  • Contributed by Henry Jakubowski
  • Professor (Chemistry) at College of St. Benedict/St. John's University

Given the relative structural simplicity and repetitiveness of DNA, it would follow that proteins that bind specifically to it might have common DNA binding domain motifs but with specific amino acid side chains allowing for specific binding interactions.

The figures show two such proteins, the cro repressor from bacteriophage 434 and the lambda repressor from bacteriophage lambda. (Bacteriophages are viruses that infect bacteria.) Notice how specificity is achieved, in part, by the formation of specific H-bonds between the protein and the major groove of the operator DNA.

Figure: Lambda Repressor/DNA Complex

Figure: H Bond interactions between λ repressor and DNA

  • zinc finger: (eukaryotes) These proteins have a common sequence motif of X3-Cys-X2-4-Cys-X12-His-X3-4-His-X4- in which X is any amino acid. Zn2+ is tetrahedrally coordinated with the Cys and His side chains, which are on one of two antiparallel beta strands, and an alpha helix, respectively. The zinc finger, stabilized by the zinc, binds to the major groove of DNA.
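The sequence motif above translates directly into a regular expression, which gives a quick (if crude) way to scan a protein sequence for candidate zinc fingers. This is only a pattern match, not a binding predictor; the function name is an assumption and the test sequence is an illustrative fragment written to fit the pattern.

```python
import re

# X3-Cys-X2-4-Cys-X12-His-X3-4-His-X4, with X = any amino acid (one-letter code).
ZINC_FINGER = re.compile(r"[A-Z]{3}C[A-Z]{2,4}C[A-Z]{12}H[A-Z]{3,4}H[A-Z]{4}")


def find_zinc_fingers(protein_seq):
    """Return (start_index, matched_segment) for each non-overlapping motif match."""
    return [(m.start(), m.group()) for m in ZINC_FINGER.finditer(protein_seq.upper())]


# Illustrative sequence containing one segment that fits the motif.
example_protein = "MKPYACPVESCDRRFSRSDELTRHIRIHTGQKGS"
hits = find_zinc_fingers(example_protein)
```

A hit only says that the Cys and His spacing is compatible with the motif; it says nothing about which DNA sequence, if any, the domain would recognise.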

Zn finger proteins, of which 900 are encoded in the human genome (including the eukaryotic insulator-binding protein CTCF described above), can be mobilized to actually repair specific mutations in cells, which, if carried out in a high enough percentage of mutant cells, could cure specific genetic diseases such as some forms of severe combined immunodeficiency disease. In this new technique (Urnov et al, 2005), multiple linked Zn finger binding domains (either naturally occurring ones or mutant forms produced in the lab), each one specific for a certain nucleotide sequence, are linked to a nonspecific endonuclease derived from the enzyme FokI. The nuclease is active in dimeric form, so the active complex requires two endonuclease domains, each bound to four different Zn finger domains, to assemble at the target site. Specificity of binding is achieved by the Zn finger domains. A nick is then made in the DNA by the nuclease, and host cell repair mechanisms ensue. This process involves strand separation, homologous recombination of the nicked region with complementary DNA within the cell, and repair of the nick. If excess wild type (non-mutated) DNA is added to the cells and used as the template, the normal DNA repair mechanism would fix the mutation. Urnov et al have shown that up to 20% of cultured cells containing a mutation can be repaired in the lab. If these cells gain a selective growth advantage, the mutated cells would eventually be replaced with wild type cells.

  • steroid hormone receptors: (eukaryotes) In contrast to most hormones, which bind to cell surface receptors, steroid hormones (derivatives of cholesterol) pass through the cell membrane and bind to cytoplasmic receptors through a hormone binding domain. This changes the shape of the receptor, which then binds to a specific site on the DNA (hormone response element) through a DNA binding domain. In a structure analogous to the zinc finger, Zn2+ is tetrahedrally coordinated to 4 Cys, in a globular-like structure which binds as a dimer to two identical, but reversed, sequences of DNA (a palindrome) within the major groove. (Examples of palindromes: "Able was I ere I saw Elba"; "Dennis and Edna dine, said I, as Enid and Edna sinned.")

Consider the glucocorticoid receptor (GR) as a specific example. It binds DNA as a dimer. The two DNA binding domains of the dimer associate with two adjacent major grooves of the DNA in the GR binding sequence (GBS), a short sequence of DNA within the promoter. Meijsing, et al. have found that not only does the GBS act as a binding site for GR, allowing transcription of genes, but it also affects the conformation of the receptor, causing gene transcription to be regulated in another way. The group constructed luciferase "reporter genes", in which a GBS is linked to the gene for the protein luciferase (which emits light), so that luciferase is expressed when the gene is transcribed under control of the GBS. They found that relative transcriptional activity did not correlate with relative binding affinity of GR to the GBS. GBSs which were much more active than others bound comparably with those of lower activity, while GBSs with similar transcriptional activity bound with different affinities. This shows that the GBS confers unique function on the GR associated with it (i.e. transcription is not simply affected by whether or not the GR is bound to the GBS). A "lever arm" of the receptor was found to undergo conformational changes when bound to DNA, with changes specific to the sequence to which it was bound. A mutant protein, GR-γ, which is identical to the wild-type protein, GR-α, except in the lever arm, was found to have different transcriptional activity even though both bind to the same site on the DNA, showing that the lever arm and its conformation affect transcription.

  • leucine zippers (or scissors): (eukaryotes) These proteins contain stretches of 35 amino acids in which Leu is found repeatedly at 7 amino acid intervals. These regions of the protein form amphiphilic helices, with Leu on one face, one Leu after every two turns of the helix. Two of these proteins can form a dimer, stabilized by the binding of these nonpolar, leucine-rich amphiphilic helices to one another, forming a coiled coil, much as in the muscle protein myosin. The leucine zipper represents the protein binding domain of the protein. The DNA binding domain is found in the first 30 N-terminal amino acids, which are basic and form an alpha helix when the protein binds to DNA. The leucine zipper then functions to bring two DNA binding proteins together, allowing the N-terminal basic helices to interact with the major groove of DNA in a base-specific fashion. Valine and isoleucine, along with leucine, are often found in stretches of amino acids that can interact to form other types of coiled coils.
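The 7-residue spacing of leucines can be checked with a simple scan. The sketch below looks for a run of leucines at positions i, i+7, i+14, and so on; the function name, the default of four repeats and the test sequence are assumptions chosen for illustration, and real zippers tolerate substitutions that this strict scan would miss.

```python
def heptad_leucine_starts(protein_seq, repeats=4):
    """Return start indices where Leu occurs at every 7th residue for `repeats` steps."""
    seq = protein_seq.upper()
    span = 7 * (repeats - 1)  # distance from the first to the last leucine
    return [
        i
        for i in range(len(seq) - span)
        if all(seq[i + 7 * k] == "L" for k in range(repeats))
    ]


# Hypothetical sequence: leucines at indices 3, 10, 17 and 24, alanine elsewhere.
example_seq = "AAA" + "LAAAAAA" * 4 + "AAA"
starts = heptad_leucine_starts(example_seq)
```

Sequences shorter than the required span simply return an empty list.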

Figure: leucine zippers (made with VMD)


Just as zinc finger nucleases (ZFNs) have been used to induce repair of mutations, another study, in the rat genome, used specially designed ZFNs to cause double-strand breaks in DNA. Inaccurate repair of these breaks by non-homologous end joining (NHEJ) left specific mutations at the target site (Geurts, et al. 2009). This process, "knockout of the gene," prevents the production of the protein normally encoded by the target gene. Five- and six-finger ZFNs were used to achieve a high level of specificity in the targeted binding to the genes for three different proteins: green fluorescent protein (GFP), immunoglobulin M (IgM) and Rab38. The knockout was successful in 12% of the rats tested; these animals had no wild-type protein and no expression. The ZFNs were sufficiently specific that no mutations were observed at any of 20 predicted non-target sites. This study supports the viability of controlling transcription and expression for the treatment of disease, and the importance of specific binding.
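
The link between finger number and specificity can be made concrete with a back-of-the-envelope calculation. The sketch below assumes that each zinc finger reads roughly 3 bp, that the four bases occur independently and with equal frequency, and that the genome size passed in is only approximate. None of these figures come from the study cited above, and the function name `expected_random_sites` is hypothetical.

```python
def expected_random_sites(n_fingers, genome_bp, bp_per_finger=3):
    """Expected number of chance matches to one ZFN recognition site.

    Assumes each finger reads `bp_per_finger` base pairs and that the four
    bases occur independently with equal frequency (a crude approximation).
    Both strands are counted, hence the factor of 2.
    """
    site_length = n_fingers * bp_per_finger
    return 2 * genome_bp / 4 ** site_length


# Assumed, approximate genome size of 2.75e9 bp, used only for illustration.
for fingers in (3, 5, 6):
    print(fingers, expected_random_sites(fingers, 2.75e9))
```

Under these assumptions the expected number of chance matches falls by a factor of 64 with each added finger. Because ZFNs cut as pairs, the combined site that must be matched is longer still than the site of a single finger array.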

We have seen that two main factors contribute to the specific recognition of DNA by proteins: the formation of hydrogen bonds to specific nucleotide donors and acceptors in the major groove, and sequence-dependent deformations of the DNA helix into altered shapes with increased affinity for protein ligands. For example, the TATA-binding protein (TBP) can interact with a widened minor groove in the TATA box. New findings indicate that, in addition, proteins are able to use information in minor grooves that have become "narrowed" depending on the nucleotide sequence.

Tracts of DNA enriched in A can adopt twisted conformations that allow inter-base-pair hydrogen bonding in the major groove, which results in narrowing of the minor groove. AT base pairs are concentrated in narrow minor grooves (width <5.0 Å), and CG base pairs are found more frequently in wide minor grooves.
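
A-rich tracts can be located in a sequence with a simple pattern search. The sketch below is a rough illustration only: the minimum length of four and the function name `find_a_tracts` are assumptions made for this example, and finding a tract does not by itself predict the groove width.

```python
import re


def find_a_tracts(dna, min_len=4):
    """Return (start, end, tract) for runs of A or runs of T of at least min_len.

    Coordinates are 0-based and end-exclusive. A run of T on the given strand
    corresponds to a run of A on the complementary strand.
    """
    pattern = re.compile(r"A{%d,}|T{%d,}" % (min_len, min_len))
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(dna.upper())]


# Hypothetical sequence for illustration.
print(find_a_tracts("GCGCAAAAAGCGTTTTGC"))
# → [(4, 9, 'AAAAA'), (12, 16, 'TTTT')]
```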

How does minor groove narrowing affect DNA recognition? Narrow minor grooves enhance the negative electrostatic potential of the DNA, making it a more specific and recognizable site. The backbone phosphates of the DNA are closer to the middle of the groove when it is narrow, which is why narrow minor grooves correlate with a more negative electrostatic potential.

The minor groove-interacting parts of proteins contain arginine, whose side chain can be accommodated in the narrower and more negative minor groove. Arginines can bind and in some cases insert themselves as part of short sequence motifs, which enhance the specificity of DNA shape recognition. Arg is preferred over Lys since the effective radius of the charged group in Arg is greater than that of the charge carrier in Lys. This would lead to a lower desolvation energy for Arg, which would promote its binding to the narrowed minor groove. This discovery shows that "the role of DNA shape must be taken into consideration when annotating the entire genome and predicting transcription-factor-binding sites".
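
One way to see why a larger effective radius lowers the desolvation penalty is the Born model of ion solvation. It is swapped in here as a simplified illustration and is not the calculation used in the work quoted above. The radii below are hypothetical placeholders chosen to show the trend, not measured values for the charged groups of Arg or Lys.

```python
COULOMB_KCAL_ANGSTROM = 332.06  # e^2 / (4*pi*eps0) in kcal*Angstrom/mol


def born_solvation_energy(charge, radius_angstrom, dielectric=78.5):
    """Born estimate of the free energy (kcal/mol) of moving an ion of the given
    charge and radius from vacuum into a continuum of the given dielectric."""
    return (-COULOMB_KCAL_ANGSTROM * charge ** 2 / (2.0 * radius_angstrom)
            * (1.0 - 1.0 / dielectric))


# Hypothetical radii chosen only to show the trend.
small_ion = born_solvation_energy(+1, 2.0)
large_ion = born_solvation_energy(+1, 3.0)

# The desolvation cost is the negative of the solvation energy.
print(-small_ion, -large_ion)
```

In this model the cost of stripping water from the ion scales as 1/radius, so the ion with the larger effective radius pays the smaller penalty.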

Figure: Arg in T3c Transposase binding in Narrowed Minor Groove of T3c Transposon

eLife digest

Proteins help to copy DNA, transport nutrients and perform many other important roles in cells. To perform these tasks, proteins often interact with other proteins and work together. These interactions can be very complex because each protein has a three-dimensional shape that may change when it binds to other proteins. Also, two proteins may form several connections with each other.

It is possible to carry out experiments to measure how likely it is that two proteins will physically interact with each other and how strong their connections will be. However, these measurements are time-consuming and costly. Some researchers have developed computer models to help predict the interactions between proteins, but these models are often incorrect because they leave out some of the chemical or physical properties that influence the ability of proteins to interact.

With the aim of making a better model, Vangone and Bonvin examined 122 different combinations of proteins whose abilities to interact had previously been experimentally measured. Vangone and Bonvin found that the number of connections between each pair of proteins was a strong predictor of how tightly the proteins connect to each other. Particular features of the surface of the proteins—specifically, the region defined as the non-interacting surface—can also influence how strong the interaction is.

Vangone and Bonvin used this information to develop a new model that predicts how tightly proteins interact with each other based on the number of connections between the two proteins and the characteristics of the non-interacting surface. The model is simple, and Vangone and Bonvin show that it is more accurate than previous models. Defects in the interactions between proteins can lead to many diseases in humans, so this model may be useful for the development of new drugs to treat these conditions.
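
The "number of connections" between two proteins can be illustrated by counting residue pairs that lie close together across the interface. The sketch below is a toy version under stated assumptions: the 5.5 Å distance cutoff, the dictionary-based input format and the function name `count_interface_contacts` are all choices made for this example, and the code does not reproduce the published model or its coefficients.

```python
from math import dist


def count_interface_contacts(chain_a, chain_b, cutoff=5.5):
    """Count residue pairs (one from each chain) with any atoms within `cutoff` Å.

    Each chain is a dict mapping a residue label to a list of (x, y, z) atom
    coordinates. The 5.5 Å default is an assumption for this sketch.
    """
    contacts = 0
    for atoms_a in chain_a.values():
        for atoms_b in chain_b.values():
            if any(dist(p, q) <= cutoff for p in atoms_a for q in atoms_b):
                contacts += 1
    return contacts


# Hypothetical toy coordinates, not taken from a real structure.
chain_a = {"ARG10": [(0.0, 0.0, 0.0)], "GLU11": [(20.0, 0.0, 0.0)]}
chain_b = {"ASP5": [(3.0, 0.0, 0.0)], "LYS6": [(0.0, 4.0, 0.0), (30.0, 0.0, 0.0)]}
print(count_interface_contacts(chain_a, chain_b))  # → 2
```

For real structures the coordinates would come from a structure file parser, and the all-pairs loop would usually be replaced by a spatial index.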

Methods for Detecting Protein–DNA Interactions

The chromatin immunoprecipitation (ChIP) method can be used to monitor transcriptional regulation through histone modification (epigenetics) or transcription factor–DNA binding interactions. The ChIP assay allows analysis of DNA–protein interactions in living cells by treating the cells with formaldehyde or other crosslinking reagents in order to stabilize the interactions for downstream purification and detection. Performing ChIP assays requires knowledge of the target protein and DNA sequence that will be analyzed, as researchers must provide an antibody against the protein of interest and PCR primers for the DNA sequence of interest. The antibody is used to selectively precipitate the protein–DNA complex from the other genomic DNA fragments and protein–DNA complexes. The PCR primers allow specific amplification and detection of the target DNA sequence. Quantitative PCR (qPCR) allows the amount of target DNA sequence to be quantified. The ChIP assay is amenable to array-based formats (ChIP-on-chip) or direct sequencing of the DNA captured by the immunoprecipitated protein (ChIP-seq).

Strengths:

  • capture a snapshot of specific protein–DNA interactions as they occur in living cells
  • quantitative when coupled with qPCR analysis
  • ability to profile a promoter for different proteins

Limitations:

  • researcher needs to source ChIP-grade antibodies
  • requires designing specific primers
  • difficult to adapt for high-throughput screening
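
ChIP-qPCR results are commonly reported as a percentage of the input chromatin. The sketch below shows that calculation under stated assumptions: primer efficiency is taken as roughly 100% (product doubles each cycle), the function name `percent_input` is hypothetical, and the Ct values are invented for illustration.

```python
from math import log2


def percent_input(ct_ip, ct_input, input_fraction):
    """Percent-of-input enrichment from ChIP-qPCR Ct values.

    input_fraction is the share of chromatin set aside as input (0.01 for 1%).
    Assumes ~100% primer efficiency, i.e. a doubling of product per cycle.
    """
    # Adjust the input Ct to what it would be if 100% of the chromatin were used.
    adjusted_input_ct = ct_input - log2(1.0 / input_fraction)
    return 100.0 * 2.0 ** (adjusted_input_ct - ct_ip)


# Hypothetical Ct values for illustration.
print(percent_input(ct_ip=28.0, ct_input=25.0, input_fraction=0.01))
```

With a 1% input sample, an IP that amplifies three cycles later than the input corresponds to about 0.125% of input. Comparing this value with a control antibody or a control genomic region is what makes it interpretable.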


The DNA electrophoretic mobility shift assay (EMSA) is used to study proteins binding to known DNA oligonucleotide probes and can be used to assess the affinity or specificity of the interaction. The technique is based on the observation that protein–DNA complexes migrate more slowly than free DNA molecules when subjected to non-denaturing polyacrylamide or agarose gel electrophoresis. Because the rate of DNA migration is shifted or retarded upon protein binding, the assay is also referred to as a gel shift or gel retardation assay. Adding a protein-specific antibody to the binding components creates an even larger complex (antibody–protein–DNA), which migrates even more slowly during electrophoresis. This is known as a “supershift”, and it can be used to confirm protein identities. Before the EMSA was developed, protein–DNA interactions were studied primarily by nitrocellulose filter–binding assays using radioactively labeled probes.

Strengths:

  • detect low abundance DNA binding proteins from lysates
  • test binding site mutations using many probe configurations with the same lysate
  • test binding affinity through DNA probe mutational analysis
  • non-radioactive EMSA possible using biotinylated or fluorescently labeled DNA probes

Limitations:

  • analyze protein–DNA interactions in vitro
  • difficult to quantitate
  • need to perform supershift assay with antibody to be certain of protein identity in a complex
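
When band intensities from a protein titration can be quantified, the fraction of probe shifted at each protein concentration can be fitted to a binding curve to estimate affinity. The sketch below is one simple way to do this, under stated assumptions: a single-site binding model, a probe concentration well below the dissociation constant, and a coarse least-squares scan in place of a proper curve-fitting routine. The function names and the titration data are hypothetical.

```python
def fraction_bound(protein_conc, kd):
    """Single-site binding isotherm, valid when probe concentration << Kd."""
    return protein_conc / (kd + protein_conc)


def fit_kd(protein_concs, fractions, kd_min=1e-12, kd_max=1e-3, steps=2000):
    """Estimate Kd by a least-squares scan over log-spaced candidate values."""
    best_kd, best_err = None, float("inf")
    for i in range(steps + 1):
        kd = kd_min * (kd_max / kd_min) ** (i / steps)
        err = sum((fraction_bound(p, kd) - f) ** 2
                  for p, f in zip(protein_concs, fractions))
        if err < best_err:
            best_kd, best_err = kd, err
    return best_kd


# Hypothetical titration (molar) generated from a known Kd of 10 nM.
concs = [1e-9, 3e-9, 1e-8, 3e-8, 1e-7]
fracs = [fraction_bound(c, 1e-8) for c in concs]
print(fit_kd(concs, fracs))
```

The scan recovers a value close to the 10 nM used to generate the data, limited by the spacing of the candidate grid. Real gel data are noisier, and complexes can dissociate during electrophoresis, which is one reason the assay is listed above as difficult to quantitate.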

Traditionally, DNA probes have been radiolabeled with ³²P by incorporating an [α-³²P]dNTP during a 3' fill-in reaction using Klenow fragment or by 5' end labeling using [γ-³²P]ATP and T4 polynucleotide kinase. Following electrophoresis, the gel is exposed to X-ray film to document the results. The Thermo Scientific LightShift Chemiluminescent EMSA Kit is a non-radioactive assay that provides robust and sensitive performance. The kit includes reagents for setting up and customizing DNA-binding reactions, a control set of DNA and protein extract to test the kit system, stabilized streptavidin–HRP conjugate to probe for the biotin-labeled DNA target, and an exceptionally sensitive chemiluminescent substrate module for detection.

Figure: Chemiluminescent EMSA of four different DNA–protein complexes. Biotin-labeled target duplexes ranged in size from 21 to 25 bp. The Oct-1, AP1 and NF-κB transcription factors were derived from HeLa nuclear extract. EBNA-1 extract is provided as a control in the LightShift Chemiluminescent EMSA Kit. Unlabeled specific competitor sequences (where used) were present at a 200-fold molar excess over labeled target. X-ray film exposure times ranged from 2 minutes for EBNA, Oct-1 and AP1 to 5 minutes for NF-κB.

