Problems in Protein Evolution

October 1, 2001, revised October 14

Some very serious problems in the evolution of proteins threaten the theory of evolution, and appear to disprove it. A demonstration of the seriousness of these problems therefore constitutes a disproof of the theory of evolution.

Although there are other arguments against the theory of evolution, the present argument differs in several ways. Some arguments against evolution involve the improbability of abiogenesis, that is, the origin of the first life forms. These arguments are convincing, but biologists will say that some as yet undiscovered mechanism resulted in the first life forms. The biologists' argument is hard to refute in a formal way. Other arguments point to the complexity of life, and the implausibility of such a complex system evolving. For example, Behe documents the complexity of flagella and argues that they are "irreducibly complex," meaning that the system cannot function unless many parts appear, and all these parts could not have arisen at once by evolutionary processes. This is also a convincing argument, but it is hard to formalize because it does not specify mathematical probabilities. Biologists will say that some as yet undiscovered mechanism resulted in the evolution of the flagellum and other structures that appear to be irreducibly complex. Other arguments involve the fact that information of the kind found in life forms does not appear by natural processes. However plausible this argument is, it does not have a formal mathematical justification. Similar comments apply to the lack of transitional forms in the fossil record, and many other arguments commonly used against the theory of evolution. The argument presented here is different, in that it involves existing genetic mechanisms, not hypothesized ones, and it involves the calculation of mathematical probabilities (or rather, improbabilities). This therefore appears to be the first argument that qualifies formally as a disproof of the theory of evolution.

This argument also involves redundancies in many of its aspects. That is, under several models, the change of shape of proteins is impossible, and each model has its own set of reasonable assumptions.

The difficulties in changing the shape of proteins by evolution involve both probabilities and laws of protein structure. The evolution of proteins of new shapes by point mutations is not possible, because the change of shape of a protein would require too many mutations. If the probability of a mutation is high enough to change the shape of the protein, then many other mutations will also occur that will essentially randomize the rest of the gene and cause the newly shaped protein to be harmful to the organism. One might argue that large scale changes to a gene could result in a protein of a new shape more readily than could point mutations. However, other arguments based on laws of protein structure prevent this. The kinds of amino acids that appear on the inside and the outside of a protein are different. There can only be a small number of insertions of a part of one protein into another that do not violate the distinction between the inside and outside of the protein, and the chance that any one of these will be beneficial is very small.

The present argument is based on assumptions from the theory of evolution, according to which life began as a simple reproducing system that gradually developed into the life forms we see today. This system (or systems, if life developed multiple times) must have been very simple, because it had to originate without the benefit of evolution. In particular, it could only have had one or a very small number of proteins. From these few proteins, those in current life forms must have evolved.

Each protein is produced by one (or possibly several) genes. The evolution of new proteins must have occurred by mutations to these genes. Since genetic mechanisms in current life forms are strikingly uniform, with a few modifications, it is reasonable to assume that these mechanisms have been in operation for hundreds of millions or billions of years, in the accepted evolutionary scenario. Thus the evolution of many proteins from a few must have occurred by genetic mechanisms that are still in existence. Even if special mechanisms operated in the evolution of one-celled creatures, there are undoubtedly many proteins and shapes of proteins that only appear in multicellular organisms, and these must have evolved from others by currently existing genetic mechanisms. Furthermore, because all known one-celled organisms have similar genetic mechanisms, it is reasonable to assume that these mechanisms were operating for a considerable portion of the time that these one-celled organisms evolved from simpler organisms having many fewer proteins. During this time, proteins having new shapes must have appeared.

Each protein is composed of a sequence of amino acids that join together and are then called "residues." The "side chains" of the residues determine their chemical properties. Some side chains are hydrophobic (oily) and tend to cluster together inside the protein. Others are hydrophilic (water loving) and tend to occur on the outside of the protein. Thus a given shape of a protein will tend to be associated with a particular sequence of hydrophobic and non-hydrophobic side chains. Changing the shape of the protein requires changing this sequence of side chains.

A problem with the evolution of proteins having new shapes is that proteins are highly constrained, and producing a functional protein from a functional protein having a significantly different shape would typically require many mutations of the gene producing the protein. None of the proteins produced during this transition would be functional, that is, none would be beneficial to the organism, or at best they would retain their original function without conferring any advantage on the organism. It turns out that this scenario has severe mathematical problems that call the theory of evolution into question. Unless these problems can be overcome, the theory of evolution is in trouble.

The typical mechanism proposed to explain the evolution of new proteins is that an existing gene is duplicated, and one of the copies of the gene then begins a series of mutations that eventually result in a gene able to produce a new protein. If the mutations result in a change in the shape of the protein, then the protein will probably no longer have a function in the organism, because the function of a protein is closely related to its shape. The mutating duplicated gene is still able to produce a protein, but the protein has no function in the organism. We call such a gene "useless" to indicate that it does produce a protein but the protein has no function in the organism. This is distinct from "pseudogenes," which no longer produce proteins at all because mutations have corrupted a control region or something else necessary for the gene to function.

Let us estimate how many mutations would be required to produce a functional protein of a new shape from an existing functional protein. In an article "Laws of Form Revisited" by Michael Denton and Craig Marshall, posted April 4, 2001 on the Creation Science Resource Bulletin Board, it is stated that there are probably only a small number of protein folds, perhaps not more than a thousand or so altogether:

Consideration of these 'constructional laws' suggests that the total number of permissible folds is bound to be restricted to a very small number -- about 4,000, according to one estimate. Confirmation that this is probably so is provided by a different type of estimate, based on the discovery rate of new folds. Using this method, Cyrus Chothia of Britain's Medical Research Council estimated that the total number of folds utilized by living organisms may not be more than 1,000. Subsequent estimates have given figures of between 500 and 1,000.

References given include Chothia, C. One thousand families for the molecular biologist, Nature 357, 543-544 (1992) and Lindgard, P. & Bohr, H. How many protein fold classes are to be found? in Protein Folds (eds Bohr, H. & Brunak, S.) 98-102 (CRC Press, New York, 1996). This small number of folds is evidence that protein folds, and functional proteins, are highly constrained. Thus the probability that a random sequence of amino acids would produce a properly folding protein is very small, because a random sequence would be highly unlikely to have the proper sequence of hydrophobic and hydrophilic side chains. This in turn implies that it would typically require many mutations to produce a new protein shape from an existing one.

An estimate of the number of mutations needed to produce a protein of a new shape is provided by the article Protein Structure Prediction and Structural Genomics (Science, Vol. 294, 5 October 2001, pp. 93-96) in which it is stated that prediction of the structure of a protein is difficult if there is less than a 30% amino acid sequence identity. If there is 30% or more agreement in the amino acid sequences, the structure of the proteins will be similar. This implies that proteins of significantly different structure will differ by more than 70% of their amino acids. If a protein has 1000 coding base pairs, or 333 amino acids, then at least 233 of these must differ in order to have a 70% difference and a new shape. Each such difference will require one, and possibly two, point mutations, for probably well over 300 point mutations in all.

One can also get an estimate by noting that protein structures tend to bury hydrophobic side chains. Thus each protein fold will correspond to a particular pattern of hydrophobic and non-hydrophobic side chains. It seems that on the average one would have to change about half of the side chains from hydrophobic to non-hydrophobic and vice versa to get a new fold. Half of 333 amino acids would be 166 amino acids. Each such change would require one or two mutations, for well over 200 point mutations to change from one protein fold to another.
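The two estimates above are simple arithmetic on a 1000 base pair gene. The following is only a sketch of that arithmetic in Python; the function names are illustrative, and the range of one to two point mutations per changed residue is the assumption stated in the text.

```python
import math

CODING_BASE_PAIRS = 1000            # gene length used in the text
RESIDUES = CODING_BASE_PAIRS // 3   # 333 amino acids

def residues_to_change(fraction):
    """Whole number of residues that must differ for a given fractional difference."""
    return math.floor(RESIDUES * fraction)

def point_mutation_range(n_residues, min_per_residue=1, max_per_residue=2):
    """Lower and upper bounds, assuming one or two point mutations per changed residue."""
    return n_residues * min_per_residue, n_residues * max_per_residue

# Estimate from the 30% sequence-identity threshold (70% of residues differ).
identity_based = residues_to_change(0.70)
# Estimate from changing about half of the hydrophobic/non-hydrophobic pattern.
hydrophobicity_based = residues_to_change(0.50)

print(identity_based, point_mutation_range(identity_based))
print(hydrophobicity_based, point_mutation_range(hydrophobicity_based))
```

The figures of "well over 300" and "well over 200" point mutations quoted in the text fall inside the ranges this prints.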

Another way to justify the fact that many mutations are needed to change the shape of a protein is found in (Cordes MH, Walsh NP, McKnight CJ, Sauer RT, Evolution of a protein fold in vitro, Science 1999 Apr 9;284(5412):325-328):

Mutagenesis experiments show that limited changes in sequence can have large effects on stability and activity, but generally do not lead to large shifts in structure. For example, highly disruptive mutations such as insertions in elements of regular secondary structure or hydrophobic-to-charged substitutions at core positions lead to only minor structural differences in bacteriophage T4 lysozyme and staphylococcal nuclease, pointing to a strong drive to preserve the basic native fold.

This implies that many mutations are needed to produce a new fold. It also implies that in the transition between folds, a protein passes through a region of instability. Since natural proteins tend to be stable, it must be that instability is detrimental to the organism and (under evolutionary assumptions) is eliminated from the population. Therefore proteins tend to remain in regions of stability, and many mutations are required to change their shape. Thus mutations along the path of change would be harmful to the organism and would tend to be eliminated from the population.

To get another estimate of the number, consider a couple of recent articles on the evolution of new protein shapes provided by a reader of this web site:

Science 1999 Apr 9;284(5412):325-328
Evolution of a protein fold in vitro.

Cordes MH, Walsh NP, McKnight CJ, Sauer RT.

Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 
02139, USA.

A "switch" mutant of the Arc repressor homodimer was constructed by 
interchanging the sequence positions of a hydrophobic core residue, leucine 
12, and an adjacent surface polar residue, asparagine 11, in each strand of 
an intersubunit beta sheet. The mutant protein adopts a fold in which each 
beta strand is replaced by a right-handed helix and side chains in this 
region undergo significant repacking. The observed structural changes allow 
the protein to maintain solvent exposure of polar side chains and optimal 
burial of hydrophobic side chains. These results suggest that new protein 
folds can evolve from existing folds without drastic or large-scale 
mutagenesis.

Nat Struct Biol 2000 Dec;7(12):1129-1132
An evolutionary bridge to a new protein fold.

Cordes MH, Burton RE, Walsh NP, McKnight CJ, Sauer RT.

Department of Biology, Massachusetts Institute of Technology, Cambridge, 
Massachusetts 02139, USA.

Arc repressor bearing the N11L substitution (Arc-N11L) is an evolutionary 
intermediate between the wild type protein, in which the region surrounding 
position 11 forms a beta-sheet, and a double mutant 'switch Arc', in which 
this region is helical. Here, Arc-N11L is shown to be able to adopt either 
the wild type or mutant conformations. Exchange between these structures 
occurs on the millisecond time scale in a dynamic equilibrium in which the 
relative populations of each fold depend on temperature, solvent conditions 
and ligand binding. The N11L mutation serves as an evolutionary bridge from 
the beta-sheet to the helical fold because in the mutant, Leu is an 
integral part of the hydrophobic core of the new structure but can also 
occupy a surface position in the wild type structure. Conversely, the polar 
Asn 11 side chain serves as a negative design element in wild type Arc 
because it cannot be incorporated into the core of the mutant fold.

In the second article, the authors show that two point mutations (if I understand correctly) can cause a protein to rapidly oscillate between two configurations, preserving the overall shape of the protein. This must be so, because of the rapid oscillation between the two configurations. Therefore, this does not qualify as a change in the overall shape of the protein. In the first article, the authors present a mutation that exchanges one fold of a protein for another. This occurs by exchanging the functions of two residues in each strand of a beta sheet, and each ASN-LEU exchange requires at least four substitutions, for a total of at least eight point mutations. But these appear to happen in both monomers of the protein, which are coded for by the same gene, so only four point mutations would be needed. The new fold leaves most of the protein structure intact, so it is not clear that this qualifies as a change of the shape of the protein either. What happens is that a small portion of the two ends of the protein changes its configuration, but the middle of the protein does not. It would seem easier to change the configuration of the ends of a protein without modifying the overall structure than to change the middle.
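The count of four substitutions for exchanging asparagine 11 and leucine 12 can be checked against the standard genetic code. The sketch below lists the DNA codons for the two amino acids and finds the smallest number of single-base changes between them; the helper names are illustrative and not taken from the cited papers.

```python
from itertools import product

# Standard genetic code, DNA alphabet.
ASN_CODONS = ["AAT", "AAC"]
LEU_CODONS = ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"]

def hamming(a, b):
    """Number of positions at which two equal-length codons differ."""
    return sum(x != y for x, y in zip(a, b))

def min_substitutions(src_codons, dst_codons):
    """Fewest single-base changes needed to turn some source codon into some target codon."""
    return min(hamming(s, d) for s, d in product(src_codons, dst_codons))

asn_to_leu = min_substitutions(ASN_CODONS, LEU_CODONS)
leu_to_asn = min_substitutions(LEU_CODONS, ASN_CODONS)

# Exchanging the two adjacent residues needs both conversions.
swap_total = asn_to_leu + leu_to_asn
print(asn_to_leu, leu_to_asn, swap_total)
```

Each conversion needs at least two base changes, so the exchange of the two residues needs at least four, in agreement with the count used in the text.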

We will suppose for concreteness that eight point mutations can produce a new shape of a protein that folds properly, with not too many hydrophobic side chains on the outside, et cetera, so that the protein could conceivably be a functional protein. However, it is still unlikely that the new protein has a function in the organism, because almost all mutations are harmful. Therefore it probably would require at least several such sets of mutations to produce a new protein shape that folds properly and has a useful function in the organism, or, 24 point mutations in all. In addition, a number of mutations to one or more active sites or other places in the protein involved in chemical interactions would probably be necessary, which we will estimate as at least 10 point mutations, for 34 in all. In reality, this is being very generous, and considerably more would be needed. Estimates given above were in the range 200-300 or more. Thus the figure of 34 point mutations is highly conservative. To get an estimate on the number, one could compare the amino acid sequences of functional proteins in an existing organism to determine the minimum percent difference between proteins having a significantly different shape.

Now, one problem with producing proteins of new shapes by evolution is that the mutations of a useless gene would be neutral mutations, because they would confer no benefit to the organism. From population genetics considerations it follows that almost all neutral mutations are eliminated from a population, and most such mutations do not last very long. According to the talk.origins archive, neutral mutations are eliminated from a population on the average in 2(Ne/N)ln(2N) generations (if I understand the matter correctly), where Ne is the effective population size, N is the population size, and ln is natural logarithm. Note that Ne/N is at most 1. For a population of a billion, this would be about 43 generations. For a population of a trillion, it would be about 57 generations. The chance to accumulate a significant number (even 2) of neutral mutations to a gene within 43 to 57 generations is negligible. A typical mutation rate for a gene is one mutation for every 10⁵ generations, that is, a mutation somewhere in the gene typically occurs about this frequently. In order to get 34 mutations, the gene would have to persist in the population for about 34*10⁵ generations, which is highly improbable because neutral mutations are quickly lost, as a rule. However, the effect of this is difficult to assess: even though most neutral mutations are quickly lost, those that remain may spread to a larger number of individuals.
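The persistence formula and the waiting time for 34 mutations can be evaluated directly. This is a minimal sketch, assuming Ne/N = 1 (the upper bound noted in the text) and the mutation rate of one per 10⁵ generations per gene; the function names are made up for illustration.

```python
import math

def neutral_persistence_generations(n, ne_over_n=1.0):
    """Average generations before a neutral mutation is lost: 2 (Ne/N) ln(2N)."""
    return 2.0 * ne_over_n * math.log(2.0 * n)

def generations_for_mutations(k, rate_per_generation=1e-5):
    """Expected generations for k mutations at a fixed per-gene mutation rate."""
    return k / rate_per_generation

for population in (1e9, 1e12):
    print(f"N = {population:.0e}: {neutral_persistence_generations(population):.1f} generations")

print(f"34 mutations: about {generations_for_mutations(34):.2e} generations")
```

The gap between the two printed scales, tens of generations against millions, is the point the paragraph makes.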

Another difficulty with this scenario for protein evolution arises from the small number of genes. If many of these genes are useless, then the number of useful genes would be even smaller than the number of discovered genes, which seems highly unlikely. Therefore the average number of useless genes (as opposed to pseudogenes) in an organism is very small, reducing the probability that this kind of evolution can occur. Furthermore, a useless gene produces a protein that either fails to fold properly or has no useful function in the organism. Producing this protein requires extra energy without producing any benefit, and is therefore detrimental to the organism. In addition, misfolded proteins have to be removed from the cell, requiring extra energy. Furthermore, misfolded or useless proteins are actually likely to have some harmful effect on the organism. This means that these useless genes are likely actually harmful to the organism, more harmful than pseudogenes. Harmful mutations tend to be eliminated from a population, making it even more unlikely that a useless gene would persist long in a population. Finally, it is likely that some mutation to a useless gene would render it nonfunctional, producing a pseudogene, which would be unlikely to result in the evolution of a new protein of benefit to the organism. For all of these reasons, the evolution of new proteins through neutral mutations is highly unlikely.

To get a better estimate on the probabilities, 33 of these 34 mutations would be neutral or harmful; the last one might be beneficial and more likely to be retained in the population. Assume that a particular set of 34 mutations is necessary to produce a beneficial protein having a new shape. Here we are being a little bit too severe on the theory of evolution, because there may be more than one such set of mutations. A typical gene has about 1000 base pairs of coding DNA (not counting introns, etc.), though this number can vary a lot. Let N be the total number of subsets of 34 of the possible 1000 sites that can mutate. Assume that each subset is equally likely to occur. Then the probability of these particular 34 mutations is at most 1/N. This is somewhat unrealistic, because fewer than 34 mutations may occur, or more can occur; the extra mutations may change the shape of the protein in an undesirable way. This computation also ignores the fact that there are three mutations possible at each site, and it is unlikely that all of them will lead to the specified new protein shape. In this way the calculation is being overly generous to the theory of evolution.

The number of subsets of 34 objects taken from 1000 is 1000 * 999 * 998 * ... * (1000 - 33) / (1 * 2 * 3 * ... * 34). This is roughly (1000)³⁴ / (34!), or roughly (1000)³⁴/(3 * 10³⁸), which is about 10^(102-38)/3, or 3*10⁶³. Obtaining these specific 34 mutations would require roughly 10⁶³ trials; this could be accomplished by a population of 10³⁰ individuals lasting for 10³³ generations, roughly speaking, or a population of 10³³ individuals lasting for 10³⁰ generations, et cetera. Clearly this requires an astronomically large population or a time much longer than available, or both.
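The size of this count can be checked with exact integer arithmetic. The sketch below compares the exact binomial coefficient with the rough approximation (1000)³⁴/34! used above; it assumes Python 3.8 or later for math.comb.

```python
import math

SITES = 1000      # coding base pairs in the gene
REQUIRED = 34     # specified mutation sites

# Exact number of 34-element subsets of 1000 sites.
exact = math.comb(SITES, REQUIRED)

# Rough approximation from the text: 1000^34 / 34!
approx = SITES**REQUIRED / math.factorial(REQUIRED)

print(f"exact  ~ 10^{math.log10(exact):.2f}")
print(f"approx ~ 10^{math.log10(approx):.2f}")
```

The approximation overstates the exact count somewhat, because it replaces each factor 999, 998, ... with 1000, but both values are of order 10⁶³.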

We now relax the assumption that exactly 34 mutations occur. Suppose that the probability is p that a mutation will occur at any site in a gene. Suppose that if this mutation occurs to a site outside the specified set of 34 sites, the mutation has a probability of one half of permitting the specified protein fold to form and for the protein to satisfy the constraints necessary for a functional protein (the right hydrogen bonding occurs, etc.) and be beneficial to the organism. If the protein is not beneficial to the organism, the gene will eventually be eliminated from the population. (Our computation is not sensitive to this value of 1/2. Especially if 34 is replaced by a larger number, 3/4 or 7/8 would likely work as well.) However, every one of the specified 34 sites must mutate for the specified protein fold to form and for the constraints to be satisfied. Then the probability that this fold can form and the constraints can be satisfied and the protein can be beneficial is (1 - p/2)^(1000 - 34)*p³⁴. Suppose p is about 1/30, meaning that the expected number of mutations is about 33. Then the expression becomes (1 - 1/60)^(1000 - 34)*(1/30)³⁴ or (8.9 * 10^-8)*1/(1.7 * 10⁵⁰), or 5.2 * 10^-58, a little better than the previous computation but still highly improbable. If p is larger still, the probability will be even higher, but since population genetics limits the time that a neutral mutation will persist in a population, a large value of p seems unrealistic, and even p = 1/30 requires the gene to last much too long, according to population genetics.

For p = 1/15, one obtains a probability of about 5.963*10^-55. For p = 1/10, one obtains 2.876*10^-56. For p = 1/5, one obtains 9.717*10^-69. Therefore even a larger probability of mutations does not help much.
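The expression (1 - p/2)^(1000 - 34)*p³⁴ can be evaluated for each of the values of p considered. The following is a sketch under the stated assumptions (1000 sites, 34 required sites, and a one-half chance that a mutation elsewhere is tolerated); the function and parameter names are illustrative. Results computed this way may differ slightly in the leading digits from the rounded figures quoted in the text.

```python
def fold_probability(p, sites=1000, required=34, tolerated=0.5):
    """Probability that all required sites mutate and every other site is acceptable.

    Each of the other sites is acceptable if it does not mutate (1 - p) or if it
    mutates and the mutation is tolerated (p * tolerated), giving 1 - p * (1 - tolerated)
    per site, which is 1 - p/2 for tolerated = 0.5.
    """
    per_site_ok = 1.0 - p * (1.0 - tolerated)
    return per_site_ok ** (sites - required) * p ** required

for denominator in (30, 15, 10, 5):
    p = 1.0 / denominator
    print(f"p = 1/{denominator}: {fold_probability(p):.3e}")
```

Sweeping p in this way shows that the probability peaks near p = 1/15 and falls off on either side, which is why raising the mutation rate further does not help.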

It may appear unreasonable to require mutations at 34 specified sites, but we are being too generous in allowing such a small number of mutations to suffice, and we are ignoring many other factors mentioned above that make this probability much, much smaller. All in all, the true probability is likely to be very much smaller. However, an exact calculation is a nontrivial matter.

Is it reasonable to assume that mutations must occur at a specified set of sites in order to change a protein fold? Cordes et al. state:

Burial of hydrophobic groups is widely acknowledged as a principal source of protein-folding stability, whereas burial of polar groups inevitably decreases stability.

Therefore it is reasonable to assume that the pattern of polar and hydrophobic side chains of residues in a protein must be changed in a specific way in order to obtain a new protein fold. This implies that mutations must occur to a specific subset of the amino acids in the protein, which justifies our assumption (except for the fact that some amino acids can be specified by more than one codon).

Is it reasonable to assume that a single mutation has a probability of one half of causing the protein to misfold or fail to be beneficial to the organism? Kimura (cited in ReMine, The Biotic Message, page 246) estimates that a mutation which alters an amino acid is ten times more likely to be harmful than neutral or beneficial. James Crow (The high spontaneous mutation rate: Is it a health risk?, PNAS Vol. 94, pp. 8380-8386, August 1997) states that "most mutations if they have effects large enough to be detected phenotypically are deleterious. ... the evidence is strong that the great majority of mutations are partially dominant, so that heterozygotes show some decrease in fitness." This implies that these mutations are harmful because they add some deleterious function to a protein. Thus one can expect most mutations that change an amino acid to be deleterious due to adding a harmful function to the protein, so it is not unreasonable to assign a probability of 1/2 to this possibility, even if these mutations do not change the fold of the protein. Such harmful mutations would tend to cause the mutating useless gene to be eliminated from the population even before the desired new protein fold is achieved. This would happen because single mutations are likely to occur before all of the mutations that cause the specified change of shape of the protein, and these mutations will tend to cause the gene to be eliminated from the population before the shape change occurs. Even the mutations contributing to the shape change are also likely to introduce deleterious additional functions to the protein.

Another possible mechanism for obtaining new protein shapes is that material is transposed from one gene into another by "transposable elements," producing a considerable stretch of genetic material at once and possibly more quickly producing a new protein. This possibility should also be considered.

In an article in the Sacramento Bee from March 19, 2001, "Much DNA just 'junk' -- or is it? Human Genome Project spurs new look at mystery material," it is stated

For example, the body seems to have crafted 50 genes out of junk sequences known as transposons, so named because they are transposable, moving around the genome like text copied and duplicated in a computer file.

implying that this mechanism cannot explain most of the genes. Also, from "a priori" probability considerations, there is no reason why material transposed into a gene would be more likely to lead to a useful protein than random mutations. In addition, material inserted by a transposable element will consist of adjacent sites in the gene, meaning that contiguous sites will all be inserted at the same time. This results in a greater change than having isolated point mutations, and therefore would tend to decrease the probability of the evolution of new proteins. Finally, transposable elements newly appearing in genes are likely to render the genes useless. Thus such new appearances of transposable elements will likely be harmful, and therefore must be very rare, further reducing their probability. We will assume that this happens in a gene only about once in 10⁵ generations on the average.

We attempt a probability computation for transposable elements. Suppose that on the average, each time one of the specified 34 sites is introduced by a transposable element, 10 other sites are introduced as well. Then when the 34 sites have all been introduced, 340 other sites will have been introduced as well. Each such new base pair has a probability of one-half of prohibiting the specified fold, for a total probability of at most (1/2)³⁴⁰ that this fold can form. This probability is at most 1/(2.3 * 10^102), which is impractically small.
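The bound for transposable elements is again simple arithmetic. The sketch below works in logarithms to keep the numbers readable; the ratio of 10 extra sites per required site and the one-half probability are the assumptions made in the text, and the variable names are illustrative.

```python
import math

REQUIRED_SITES = 34
EXTRA_SITES_PER_REQUIRED = 10   # assumption from the text
P_SITE_TOLERATED = 0.5          # chance an extra site does not prohibit the fold

extra_sites = REQUIRED_SITES * EXTRA_SITES_PER_REQUIRED

# log10 of (1/2)^340, the upper bound on the probability that the fold can form.
log10_probability = extra_sites * math.log10(P_SITE_TOLERATED)

print(extra_sites)
print(f"probability ~ 10^{log10_probability:.2f}")
```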

Is it reasonable to assume that the genetic material inserted by a transposon (or retrotransposon) is random, as we have done? There are a number of cases to consider. Transposons often create "direct repeat" sequences on their ends and may have one or more genes in the middle. If the transposon operates by "cut and paste," then it will eventually leave the place it entered and leave behind only the direct repeats. These patterns are too regular to generate new protein structures. If the transposon operates by a "copy and paste" mode, then the main body of the transposon will be left behind. If this part contains no genes, then it will tend to be randomized by point mutations over time, justifying our assumption of randomness. If the body of the transposon consists of simple repetitive sequences, then it does not have enough variety to generate new protein shapes. If the body of the transposon contains genes, then these genes will contain sequences that will probably cause the original gene to lose its functionality. If the transposon causes a frame shift in either its own DNA or the DNA beyond itself, this will tend to randomize the gene. If the (retro)transposon contains a pseudogene, it is a LINE, and these seem to have a mechanism for avoiding insertion into functional genes. Genes can also be inserted by viruses or passed from one bacterium to another. In this case the inserted gene would be functional and it does not seem possible to have two functional overlapping genes, so the original gene would lose its function. All in all, it does not seem that there is any way for transposons or retrotransposons to contribute non-random DNA sequences to the evolution of proteins except for simple sequences that could not explain the evolution of proteins. Therefore the assumption that material inserted by transposable elements is random appears to be correct.

Even if transposable elements could insert DNA from one gene into another, it would not help. Even proteins with similar shapes may differ in as many as 70% of their amino acids at corresponding positions in the protein. One would expect, therefore, that proteins of different shapes would differ in considerably more than 70% of their amino acids at different positions in the protein. This shows that the amino acid sequences of proteins having different shapes are significantly different throughout their whole extent. This means that there is little to be gained in protein evolution by concatenating portions of existing protein sequences to generate proteins having new shapes. Therefore genetic material from another gene, inserted in the middle of a gene, would behave essentially the same as random DNA, and the previous analysis would apply to it. Whether this insertion of material came from a transposable element or some other kind of a mutation, it would not help much.

Proteins are often composed of "domains" that fold independently, and the same or similar domains can occur in different proteins. Therefore two proteins sharing a domain might share a subsequence that is largely similar. The question then becomes how domains with new shapes could evolve. However, even similar domains in different proteins are likely to have different parts buried and exposed, so their amino acid sequences are likely to be significantly different.

One might say that existing protein classes do not provide intermediates between one protein shape and another, but perhaps other protein folds existed in the past that served as intermediates, so that the mutations required to pass between one fold and another were fewer. Or perhaps different protein folds existing in the past shared long common subsequences of amino acids, permitting new protein folds to be created by concatenating together long subsequences of existing folds. However, the references quoted above imply that there are only a small number of protein folds possible in principle, because of the laws of physics. Therefore many additional folds not only did not exist, they could not exist in principle. Not only this, but in order for a new fold to form, it has to be beneficial to the organism, or at least not harmful. Thus the new protein must interact in a useful way with other existing proteins in the organism. This constrains the possible protein folds even more. Furthermore, two proteins A and B of different shapes are not likely to share long common subsequences S of amino acids. The reason for this is that predicting the structure of a protein from its amino acid sequence is a very hard problem. This in turn implies that the shape in which the subsequence S folds is determined not only by S but by the other amino acids in the protein. Therefore the sequence S is likely to fold into a different shape in proteins A and B, and it is also likely to have different parts buried and exposed in proteins A and B. If protein A is stable, this means that the pattern of hydrophobic and non-hydrophobic side chains in the sequence S is suited for the protein shape A. Since the shape of B (and of S in B) is so much different, and different parts of S are buried in B than in A, the pattern of side chains in S will probably be unsuited to shape B, and B will be unstable.

It should be possible at any rate to check if some of the genes in an organism can be obtained by splicing together a small number of subsequences of other genes coding for proteins having different shapes, and determine how many such subsequences are needed. If even three such subsequences are needed, then in order to obtain such a gene, a useless gene would probably have to persist for something on the order of 10^5 generations between two transposable element insertions, having no additional function during this time. This is highly unlikely due to population genetics considerations.
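The check proposed above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not in the text: genes are plain strings, pieces must match exactly, and the requirement that the source genes code for proteins of different shapes is ignored. The function name and the toy sequences are hypothetical. The sketch uses greedy longest matching, which gives the minimum number of pieces for exact matches, because any prefix or suffix of a usable piece is itself usable.

```python
def min_splice_pieces(target, sources):
    """Split `target` into the fewest pieces that each occur, exactly and
    contiguously, in at least one string of `sources`.

    Returns the list of pieces, or None if some base of `target`
    occurs in no source at all.
    """
    pieces = []
    pos = 0
    while pos < len(target):
        # Extend the current piece one base at a time for as long as
        # it still occurs somewhere in one of the source genes.
        best = 0
        length = 1
        while pos + length <= len(target) and any(
            target[pos:pos + length] in s for s in sources
        ):
            best = length
            length += 1
        if best == 0:
            return None  # this base cannot be copied from any source
        pieces.append(target[pos:pos + best])
        pos += best
    return pieces


# Toy example with made-up sequences (assumed for illustration only).
toy_target = "ATGGCCTTAGGA"
toy_sources = ["CCATGGCA", "TTCCTTAGG", "GGAT"]
toy_pieces = min_splice_pieces(toy_target, toy_sources)
print(len(toy_pieces), toy_pieces)
```

For real genes one would want approximate matching and a minimum piece length, since very short pieces match almost anywhere by chance; the sketch leaves both out.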

The sheer number of subsequences is another problem. Suppose an organism has at least 200 genes, and each gene has about 1000 base pairs of coding DNA. Each gene then has about 5*10^5 subsequences of coding DNA, so the total number of such subsequences in all the genes is about 10^8, and the probability of choosing the right one is about 10^-8. Each such subsequence could be inserted in about 1000 sites in a gene, so the probability of getting the right site is about 10^-3. The probability of getting the right subsequence in the right site is then about 10^-11. A typical organism will have at most about 1/10 of its DNA coding for proteins, so the chance that both ends of the sequence will come from coding DNA is about 10^-2. The chance of getting the right sequence, all of coding DNA, in the right site is then about 10^-13. Suppose five such insertions are needed to get a new stable beneficial protein shape. The chance of this is about 10^-65, much too small, even ignoring many other factors such as the fact that most neutral mutations are soon lost from a population, the need for mutations to the active sites on the protein, and the fact that such insertions would be rare. Thus one cannot expect the insertion of DNA from one gene into another to facilitate the evolution of new protein shapes, independent of prohibitions on long common subsequences in different proteins.
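The arithmetic in this estimate can be reproduced with a short Python sketch. The figures are the ones given in the paragraph; the function and parameter names are hypothetical, and the count of subsequences per gene assumes that "subsequence" means a contiguous stretch of bases, which gives n(n+1)/2 for a gene of n base pairs. The sketch, like the text, treats the five insertions as independent events.

```python
import math


def insertion_odds(genes=200, bases_per_gene=1000, sites_per_gene=1000,
                   coding_fraction=0.1, insertions_needed=5):
    """Reproduce the order-of-magnitude estimate from the text."""
    # Contiguous subsequences of a gene of n bases: n(n+1)/2 (assumption).
    subseq_per_gene = bases_per_gene * (bases_per_gene + 1) // 2
    total_subseq = genes * subseq_per_gene       # about 10^8

    p_right_subseq = 1 / total_subseq            # about 10^-8
    p_right_site = 1 / sites_per_gene            # about 10^-3
    p_both_ends_coding = coding_fraction ** 2    # about 10^-2

    p_one = p_right_subseq * p_right_site * p_both_ends_coding
    p_all = p_one ** insertions_needed           # insertions taken as independent
    return {
        "subseq_per_gene": subseq_per_gene,
        "total_subseq": total_subseq,
        "p_one_insertion": p_one,
        "p_all_insertions": p_all,
    }


odds = insertion_odds()
for name, value in odds.items():
    print(f"{name}: about 10^{round(math.log10(value))}")
```

Changing the parameters shows how strongly the final figure depends on the assumed number of required insertions, since that number appears as an exponent.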

Thus two or three insertions to obtain a protein having a new function may be the upper limit on what is feasible. The only way to obtain a functional protein with two or three insertions from another protein is if the inserted sequences are domains. In general, if the insertion of a long subsequence S from protein A to protein B, creating protein B', could produce a stable protein, then S would most likely be a domain of A and B. It is reasonable to assume that a domain can fold in only one way. Therefore the surrounding domains of S in A and B' would have to have the same configuration for S to fit into B'. Then B, without S, would have a hole in it lined with hydrophobic side chains. This would destabilize B and would also likely cause B to have undesirable reactions with other proteins. Therefore the gene producing B would be harmful and would be eliminated from the population. If there were two or more insertions of domains from A to B the problem would be even more severe, because there would be even more exposed hydrophobic side chains in B. Thus even independent of probability considerations, transferring a domain from one protein into another by an insertion is not plausible.

Since typical proteins have few hydrophobic side chains on the outside, it must be (in evolutionary terms) that this configuration confers some advantage on the organism. Therefore if B did have many hydrophobic side chains on the outside, one would expect B to gradually mutate to replace these side chains with hydrophilic side chains. Therefore when the insertion of S from A occurred, the "hole" in B would have many hydrophilic side chains lining it. These would be buried inside B', destabilizing it. Since typical proteins have few hydrophilic side chains on the inside, proteins with such side chains must be harmful to the organism, and thus B' would be eliminated from the population.

Also, an insertion of a domain S into a protein B, producing B', requires tight restrictions on the geometry of the ends of the polypeptide chains in B and S. Together with the fact that the interface between B' and S would have to coincide with the part of S that was buried before, there would only be a small number of such insertions possible in an organism. The chance that any of these would be beneficial to the organism is very small. If one considers the reactions involving existing active sites on these proteins, the total number of possibilities will be very small, and it will be very unlikely that any of these new proteins will be beneficial. According to the theory of evolution, proteins of new shapes evolved many times, so the probability of this would have to be very nearly one. If one considers adding a new active site, this will require many mutations to the new shape protein, and possibly to another protein as well, and even if these occur, it only guarantees that a new reaction can occur. The probability that the reaction will be beneficial will be very small. Multiplying all these improbabilities together yields a mathematical impossibility.

For the remaining scenarios, some terminology is necessary. The protein produced by a single gene is called a "polypeptide chain." These can adhere together in some cases and act as a single unit, also called a protein, in which the polypeptide chains are called "subunits." Each polypeptide chain will tend to fold so that hydrophobic side chains are on the inside and non-hydrophobic side chains are on the outside.

Another problem with insertions of a domain S into a protein is that the geometry of the domains surrounding S would have to be essentially identical in A and B'. This is unlikely, and cannot explain the evolution of proteins in which a domain appears in a different surrounding geometry. The only way this could happen is if S and B join together as subunits to produce a protein essentially like B', even before the insertion. But in this case, no new functionality is being produced in the organism, so this cannot explain evolution of new protein functions. It also cannot explain where domains of new shapes come from. Furthermore, the interface between subunits S and B would have largely non-hydrophobic side chains; when they were joined, these would be buried inside the protein, destabilizing it.

There are mechanisms that can explain how domains, once existing, can join together. For example, two domains A and B existing separately could mutate to have side chains on their surfaces which could cause them to stick together (as subunits), while retaining their shape. This might give them a new function in the organism. Then if the geometry of the ends of A and B is just right, an insertion can cause a single gene to produce a protein with both A and B in it. If the geometry is not just right, it can be modified by point mutations to enable such an insertion. Maybe the joined protein would be more stable than A and B produced by separate genes. However, mechanisms for joining proteins into larger structures require modularity of protein structure, so that A and B fold much the same when joined as they do separately. Such modularity does not appear to exist below the domain level, for otherwise predicting protein structure would not be such a hard problem. Therefore these mechanisms cannot explain where the domains came from.

Even at the domain level, the joining of polypeptide chains into the same gene has a number of problems. Suppose polypeptide chains A and B react (stick together) as subunits to produce protein C having a new function. For this, A and B would need to mutate to have side chains on the surface so they would stick together. This would produce A' and B' that could react (join together). However, these new side chains would probably be harmful to the existing functions of A and B. Therefore there would have to be two useless genes, one for A and one for B, that could mutate to A' and B'. It is also unlikely that just joining two polypeptide chains would serve a new function in the organism. Probably several chains would have to join, reducing the probability. Furthermore, protein-protein reactions are very specific, and typically require close agreement between 10-15 side chains for two proteins to interact. To get this many side chains to have the right properties for two proteins to interact would probably require at least 7 or 8 amino acid changes, and probably 12 substitutions on each protein, or 24 substitutions in all. But probably just two domains joining would still not benefit the organism, so at least 48 substitutions would be necessary just to get three proteins to join together, not to mention mutations to other active sites. As shown above, this is a mathematical impossibility. In addition, many non-hydrophobic side chains would be buried when A' and B' become part of the same polypeptide chain, which would destabilize it.

It also seems peculiar that subunits A and B would ever become parts of the same gene due to an insertion if they were already fulfilling a new function as separate genes. The joining would be highly improbable because it would require just the right insertion, and it would probably actually hinder the operation of AB. It would also require tight restrictions on the geometry of the ends of A and B. Therefore this scenario does not explain why there are so many proteins with multiple domains coded by the same gene.

Even though protein structure below the domain level is not modular, it may sometimes be true that combining folds A and B produces a combined fold largely preserving the structure of A and B. This might be true often enough to permit the evolution of new domain folds. However, the problems with combining domains apply even more strongly to combining sequences below the domain level.

The exchange of domains between genes can happen by different mechanisms, but has problems as well. Suppose a protein is composed of two parts A and B, and two more subunits C and D coded by separate genes are joined together in one protein. Then the interface between A and B would consist largely of hydrophobic side chains. The interface between C and D would consist largely of non-hydrophobic side chains. If B and D have similar geometries on their ends, and B is deleted from AB, A and D might join together and have a new function in the organism. Deletions are more probable than insertions because they require less information to specify. But it seems highly unlikely that AD would have a beneficial new function without many point mutations, which as shown above is a mathematical impossibility. It is also unlikely that A and D would adhere to one another, since protein-protein reactions are very specific. The fact that A would have hydrophobic side chains and D would have non-hydrophobic side chains at their interface would also prohibit their adhering to one another. This mechanism also cannot explain how large proteins (large genes) are formed from small ones.

Some mutations can exchange one part B of a gene AB for a part D of another gene CD, producing AD. This could also exchange domains between one gene and another, and preserve the parts that are buried and exposed, avoiding problems mentioned above. However, this does not explain where the domains came from, or how they joined together in the first place. It also only permits insertions in which the interface between A and B is about the same as that between C and D, and cannot explain how a domain can appear in different proteins with substantially different parts buried and exposed. It also requires very tight restrictions on the locations of the ends of the polypeptide chains in A and D, reducing the probability that this could occur. Altogether there would be only a small number of ways that such an exchange could occur in an organism, because of so many restrictions. And of course, the chance that any of these exchanges would produce a protein with a new, beneficial function in the organism is very small, whether one considers reactions involving existing active sites or the generation of new active sites by further mutations. Since the theory of evolution requires that the probability of generating proteins of new shapes be nearly one, exchanges of domains cannot explain protein evolution.

There are other problems with evolution of proteins by insertion of domains from one protein to another, or an exchange of domains between proteins. Even for point mutations, at most about one in a thousand substitutions is beneficial to an organism, and the number may be much smaller. In a highly organized system such as life, the chance that a random change will be beneficial decreases rapidly with the magnitude of the change. Therefore the chance that a large change such as adding one or more domains to a protein will produce a benefit to the organism is very small, and possibly zero. In addition, each domain will have evolved to be adapted to the protein in which it already exists. Moving it to a new protein will place it in a role for which it is not adapted, which will almost certainly result in harm to the organism. Finally, there are many constraints on protein folding in addition to the requirements that hydrophobic side chains be in the interior and non-hydrophobic side chains be on the exterior. The chance that a domain, moving from one protein to another, will satisfy these additional constraints, is very small.

As evidence of this, most mutations that change an amino acid are harmful. This implies that even exchanging one hydrophobic side chain on the inside of a protein for another is likely to be harmful. Also, even the same domain (folding the same way) in two different organisms or proteins is likely to have many amino acids different, sometimes almost all of them, including the hydrophobic side chains on the inside. Therefore, taking a domain from one protein and putting it in another is going to result in hydrophobic side chains at the interface that do not mesh with each other. The chance that the resulting protein will be stable and fold correctly is very small.

Another problem is that the kinds of mutations that can move a domain from one protein to another are either very improbable or almost certainly harmful to the organism.

Unless the evolution of proteins of new shapes is possible, evolution is blocked. All scenarios for protein evolution have been shown to be mathematically impossible, under reasonable assumptions.
