Problems in Protein Evolution

October 1, 2001, revised November 19

Some very serious problems in the evolution of proteins threaten the theory of evolution, and appear to disprove it. A demonstration of the seriousness of these problems therefore constitutes a disproof of the theory of evolution. In particular, the evolution of proteins having significantly different shapes (tertiary structures) than previously existing proteins appears to be impossible.

Although there are other arguments against the theory of evolution, the present argument differs in several ways. Some arguments against evolution involve the improbability of abiogenesis, that is, the origin of the first life forms. These arguments are convincing, but biologists will say that some as yet undiscovered mechanism resulted in the first life forms. The biologists' argument is hard to refute in a formal way. Other arguments point to the complexity of life, and the implausibility of such a complex system evolving. For example, Behe documents the complexity of flagella and argues that they are "irreducibly complex," meaning that the system cannot function unless many parts appear, and all these parts could not have arisen at once by evolutionary processes. This is also a convincing argument, but it is hard to formalize because it does not specify mathematical probabilities. Biologists will say that some as yet undiscovered mechanism resulted in the evolution of the flagellum and other structures that appear to be irreducibly complex. Other arguments involve the fact that information of the kind found in life forms does not appear by natural processes. However plausible this argument is, it does not have a formal mathematical justification. Similar comments apply to the lack of transitional forms in the fossil record, and many other arguments commonly used against the theory of evolution. The argument presented here is different, in that it involves existing genetic mechanisms, not hypothesized ones, and it involves the calculation of mathematical probabilities (or rather, improbabilities). This therefore appears to be the first argument that qualifies formally as a disproof of the theory of evolution.

This argument also involves redundancies in many of its aspects. That is, under several models, the change of shape of proteins is impossible, and each model has its own set of reasonable assumptions.

The difficulties presented here do not imply that proteins cannot acquire new functions by mutations; this can happen by mutations that do not significantly affect the shape of the protein. It is only the change of the shape of a protein that presents difficulties.

Grishin [Grishin 01] proposes some mechanisms for the evolution of new protein folds without an explicit computation of probabilities, but this process is still problematic. The difficulties in changing the shape of proteins by evolution involve both probabilities and laws of protein structure. The evolution of proteins of new shapes by point mutations is not possible, because the change of shape of a protein would require too many mutations. If the probability of a mutation is high enough to change the shape of the protein, then many other mutations will also occur that will essentially randomize the rest of the gene and cause the newly shaped protein to be harmful to the organism. One might argue that large scale changes to a gene could result in a protein of a new shape more readily than could point mutations. However, other arguments based on laws of protein structure prevent this. The kinds of amino acids that appear on the inside and the outside of a protein are different. There can only be a small number of insertions of a part of one protein into another that do not violate the distinction between the inside and outside of the protein, and the chance that any one of these will be beneficial is very small.

The present argument is based on assumptions from the theory of evolution, according to which life began as a simple reproducing system that gradually developed into the life forms we see today. This system (or systems, if life developed multiple times) must have been very simple, because it had to originate without the benefit of evolution. In particular, it could only have had one or a very small number of proteins. From these few proteins, those in current life forms must have evolved.

Each protein is produced by one (or possibly several) genes. The evolution of new proteins must have occurred by mutations to these genes. Since genetic mechanisms in current life forms are strikingly uniform, with a few modifications, it is reasonable to assume that these mechanisms have been in operation for hundreds of millions or billions of years, in the accepted evolutionary scenario. Thus the evolution of many proteins from a few must have occurred by genetic mechanisms that are still in existence. Even if special mechanisms operated in the evolution of one-celled creatures, there are undoubtedly many proteins and shapes of proteins that only appear in multicellular organisms, and these must have evolved from others by currently existing genetic mechanisms. Furthermore, because all known one-celled organisms have similar genetic mechanisms, it is reasonable to assume that these mechanisms were operating for a considerable portion of the time that these one-celled organisms evolved from simpler organisms having many fewer proteins. During this time, proteins having new shapes must have appeared.

Each protein is composed of a sequence of amino acids that join together and are then called "residues." The "side chains" of the residues determine their chemical properties. These side chains are joined together by the "backbone" of the protein. Some side chains are hydrophobic (oily) and tend to cluster together inside the protein. Others are hydrophilic (water loving) and tend to occur on the outside of the protein. Some proteins function in a hydrophobic environment, such as the cell membrane; these may have more hydrophobic residues on the outside. A given shape of a protein will tend to be associated with a particular sequence of hydrophobic and non-hydrophobic side chains. Changing the shape of the protein requires changing this sequence of side chains. Even if the shape of the protein is not changed, having too many hydrophobic side chains on the outside can cause proteins to stick together, interfering with their function, as happens in sickle cell anemia. This is caused by a single mutation replacing a hydrophilic side chain on the outside of hemoglobin with a hydrophobic side chain.

A protein can be in a folded or unfolded state, and if it is folded, it may fold into a variety of structures. The unfolded state is more flexible and less dense. In order to have a function in an organism, a protein must fold, and it must fold the same way sufficiently often. This means that the wild type fold (the one found in nature) must be significantly more stable than other possible folds. If the protein is modified, it may fold into a different structure, or simply not fold at all.

Proteins have "active sites" where they can interact with other proteins or other substances. The geometry and composition of these sites is very highly constrained, because in order for two proteins to interact, their active sites have to match very closely in their geometry and chemical properties.

A problem with the evolution of proteins having new shapes is that proteins are highly constrained, and producing a functional protein from a functional protein having a significantly different shape would typically require many mutations of the gene producing the protein. All the proteins produced during this transition would not be functional, that is, they would not be beneficial to the organism, or possibly they would still have their original function but not confer any advantage to the organism. It turns out that this scenario has severe mathematical problems that call the theory of evolution into question. Unless these problems can be overcome, the theory of evolution is in trouble.

One of the constraints on proteins is that most of the side chains on the inside of the protein are hydrophobic while most of those on the outside are hydrophilic. However, this in itself is not much of a constraint, because it would permit 49% of the side chains on the inside of the protein to be hydrophilic and 49% of the side chains on the outside of the protein to be hydrophobic. In fact, almost all of the non-polar (hydrophobic) residues are clustered in a "hydrophobic core" of the protein, so this constraint is quite severe. Russell et al [97] found that different proteins with the same three-dimensional shape tended to conserve hydrophobic and polar side chains at corresponding positions. Not only this, but for proteins with a 25 to 50 percent sequence similarity, which generally means an identical three-dimensional shape, there was also a strong preservation of other amino acid properties such as positive, negative, polar, hydrophobic, and length of side chain. Other constraints are also severe. If there is a patch of hydrophobic side chains on the outside of the protein, then two proteins can stick together, as in sickle cell anemia. Some side chains are positively or negatively charged; if these are on the surface and do not bond properly, they can bond with something outside the protein, hindering its function. Thus the side chains on the surface are highly constrained, even when they do not affect the shape of the protein. Inner side chains are also severely constrained. Some side chains are "polar," which means that they can form hydrogen bonds with water or with other parts of the protein. Having polar side chains inside the protein reduces its stability if they do not form hydrogen bonds with other parts of the protein, and charged side chains inside a protein reduce stability even more [Loladze et al 01]. If polar and charged side chains are buried, then they may be stabilized by hydrogen or ionic bonds.
However, the number of buried polar and charged side chains increases with protein size [Kajander et al 00], suggesting that even the destabilizing buried side chains serve a function in preventing the protein from becoming too stable, which would likely prevent it from folding properly. Thus both inside and outside the protein, the constraints are very severe, and replacing a side chain with one of different properties can dramatically interfere with its functionality. Not only this, but in order for a protein to be functional, it has to fold the same way fairly often, and the final fold cannot be too flexible, or else the function of the protein will be impaired. These requirements add even more constraints to functional proteins.

More evidence for constraints on amino acid sequences in proteins comes from the following statement [Berndt 96]:

Of the twenty naturally occurring amino acids a number do demonstrate statistically significant conformational preferences. For example, Glu, Ala, Leu, Met, Gln, Lys, and Arg are preferentially found in helices (from 1.6 to 1.2 times more frequent, in decreasing order) whereas Val, Ile, Tyr, Cys, Trp, Phe, and Thr are more frequently found in strands (1.9 to 1.2 times more frequent) than in non-helical segments. The residues Gly, Asn, Pro, Ser, and Asp are 1.8 to 1.2 times more likely to be found in turn segments than any other structure.

Helices, strands, and turn segments are geometrical arrangements of side chains that appear in many protein folds, and constitute the "secondary structure" of the protein. Strands combine to form "beta sheets." The "tertiary structure" describes how these secondary structures are combined in the protein. This quotation shows that different amino acids tend to occur in different secondary structures. Also, the secondary structure formed is based on the "consensus" of all the amino acids in a region. Therefore, changing enough of the amino acids is likely to change the secondary structure, and thus change the protein fold. Changing some of the amino acids in an element of secondary structure should change that secondary structure, if most of the amino acids then favor a different structure.

[Matthews 96] found that replacing a number of the exposed side chains of T4 lysozyme with the hydrophobic Ala did not reduce stability or change the structure. This suggests that surface amino acids may not have much influence on stability. However, even at the surface, replacing an amino acid by one that favors a different secondary structure can be significant. Ala favors helices, which were the regions studied. Furthermore, the conclusions of [Matthews 96] that 50% of the sequence might be replaced by Ala in this way do not appear warranted. Many hydrophobic side chains on the surface might cause some other fold to be favored over the wild type fold. There is also significant evidence [Blanco et al 99, Krowarsch and Otlewski 01] that exposed side chains do affect protein stability. In the interior of proteins, hydrophobic residues significantly favor stability [Loladze et al 01], and in addition, the size of the side chain has a significant effect [Vlassi et al 99]. Therefore both on the surface and even more inside a protein, side chains affect stability.

Finally, mutations to various proteins have been studied, and replacements of a single amino acid often cause a protein to fold incorrectly or not to fold at all. Thus it is nearly certain that a protein with a large number of random replacements of amino acids will fail to fold properly. For example, out of 33 replacements of a single amino acid in the tumor suppressor p53 protein listed in [Wright and Lim 01], 14 caused the protein to fail to fold properly. Of 70 reported mutants of T4 lysozyme [Matthews 87], 20% did not fold to the wild type fold. Some of these mutations may be insertions or deletions causing frameshifts, or replacements of amino acids by a stop codon; both kinds of mutations will almost certainly cause a protein to misfold. Another study [Blanco 99] showed that 5 or 6 relatively random amino acid replacements suffice to disrupt a protein fold. Also, even relatively conservative mutations to methylamine dehydrogenase caused it to misfold [Sun et al 01]. It seems reasonable to assume that 1/4 of all mutations including frameshifts and stop codon substitutions are sufficient by themselves to cause a protein to misfold, especially since many mutations will not be detected if they result in early mortality.

Further evidence for the sensitivity of proteins to mutations is provided by the folding energy. When a protein folds, the free energy decreases by 5 to 15 kcal/mole typically [Taverna and Goldstein 01, Matthews 87], which is not much. G is used for the free energy, so ΔG, the change in G during folding, is typically about -10 kcal/mole. A mutation that destabilizes the protein reduces this free energy decrease. Thus if ΔΔG, the change in ΔG, is less than zero, the mutation stabilizes the protein. We will call ΔΔG the amount by which a mutation destabilizes the protein. If the free energy change ΔG becomes positive, then the free energy will not decrease during folding, and the protein will not fold properly. Therefore if ΔΔG is more than 15 kcal/mole, the protein will not fold properly. Thus several mutations, each of which destabilizes the protein, can have a larger combined effect than each mutation individually. This is especially true of "disruptive" mutations, such as the replacement of a hydrophobic residue by a hydrophilic residue, and such disruptive mutations will be necessary to change the shape of the protein. Since mutations that cause proteins not to fold are common, it must be that many mutations destabilize a protein by 5 to 15 kcal/mole or more. Thus there must also be many mutations that destabilize a protein by less than this amount. Let us call a mutation disruptive if it destabilizes the protein by at least one kcal/mole or is harmful to the function of the protein. Since many mutations will destabilize the protein by more than this, ten mutations that destabilize the protein by at least 1 kcal/mole should be enough to cause any protein to fail to fold properly. Also, because of the many constraints on protein structure, mutations that destabilize a protein should be far more common than those that stabilize the protein.

Replacements of Ile by seven other hydrophobic residues at position 56 in the interior of human lysozyme all decreased the stability, and all but one by at least one kcal/mole [Funahashi et al 99]. Several decreased the stability by nearly 4 kcal/mole or caused the protein not to fold properly. At position 59, which is more flexible, results were similar: of nine replacements, all but two were disruptive, and one of the two was nearly so. Thus even replacing one hydrophobic side chain in the interior of a protein by another is nearly always disruptive; the probability that a replacement by a random amino acid will not be disruptive is about 1/20 at these positions since non-hydrophobic side chains will be at least as disruptive as hydrophobic ones [Loladze et al 01]. Also, of 13 replacements of the buried hydrophobic Ile at position 3 of bacteriophage T4 lysozyme with other side chains, all but four were disruptive, and one of these was nearly so [Matsumura et al 88]. For a random replacement at this site, the probability of a disruptive change is at least 4/5, assuming all polar replacements are disruptive. However, the volumes of the hydrophobic cores of proteins can vary widely [Tsai et al 97]. Since different proteins with the same three-dimensional structure preserve the hydrophobic and polar character of side chains at corresponding positions as well as other amino acid properties [Russell et al 97], it is reasonable to assume that even on the surface, the chance of a random replacement being disruptive is about 1/2. As another example, Rath and Davidson [Rath and Davidson 00] found several mutations of atypical side chains that stabilized a protein by up to 1.4 kcal/mole, and their effect was additive. This protein had a folding energy decrease of only 3.08 kcal/mole, but with all three mutations, the decrease was more than doubled, implying that the average contribution of each mutation was more than one kcal/mole. 
Of course, performing these mutations in reverse on the final protein would destabilize the final protein by the same amount. Another example is given in [Takano et al 01], in which ten Gly to Ala mutations are considered in a left-handed helical region. Three of them destabilized the protein by 1-2 kcal/mole, five others destabilized the protein by smaller amounts, and two mutations stabilized the protein slightly. These examples should suffice to show that disruptive mutations are common. Typical values for ΔΔG are 2-5 kcal/mole [Gray et al 95]. The combined effects on stability of several mutations tend to be additive, and it is rare for a mutation to increase the stability of a protein [Matthews 87]. This is consistent with the fact that five or six random replacements of amino acids suffice to cause a protein not to fold: typical replacements decrease stability by 2-5 kcal/mole, their effects are additive, and the energy decrease on folding is only about 10 kcal/mole in most cases.
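The additive bookkeeping in this paragraph can be sketched in a few lines of Python. This is only an illustration under the simplifying assumptions stated in the text (each mutation destabilizes the protein by the same fixed amount, effects are strictly additive, and a protein is treated as failing to fold once the folding free energy change is no longer negative). The function name mutations_until_unfolded is hypothetical and is not taken from any cited work.

```python
import math


def mutations_until_unfolded(folding_dg, destabilization_per_mutation):
    """Count how many equal, additive destabilizing mutations are needed
    before the folding free energy change is no longer negative.

    folding_dg: free energy change on folding in kcal/mole (negative for
        a protein that folds, e.g. -10).
    destabilization_per_mutation: amount in kcal/mole by which each
        mutation raises the folding free energy (assumed constant).
    """
    if folding_dg >= 0:
        return 0  # already not stably folded under this simple model
    if destabilization_per_mutation <= 0:
        raise ValueError("each mutation must destabilize the protein")
    return math.ceil(-folding_dg / destabilization_per_mutation)


# Typical folding energy from the text: about -10 kcal/mole.
low_end = mutations_until_unfolded(-10, 2)   # 2 kcal/mole per mutation
high_end = mutations_until_unfolded(-10, 5)  # 5 kcal/mole per mutation
minimal = mutations_until_unfolded(-10, 1)   # 1 kcal/mole per mutation
print(low_end, high_end, minimal)
```

Under these assumptions, between two and five typical replacements bring a protein with a 10 kcal/mole folding energy to the point where folding no longer lowers the free energy, and ten replacements suffice when each costs only 1 kcal/mole.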

This marginal stability is actually a necessary feature of proteins, and therefore has always been a property of proteins. When a protein folds, the first configuration it adopts will be determined by a variety of random factors, and therefore is unlikely to be the wild type fold. A protein will typically sample a number of configurations before the final fold. In order for the protein to fold the same way often, the wild type fold must be stable but the other folds sampled must not be. Since the stability difference between all of these folds is likely to be small, the final fold must have only marginal stability.

Differences in residues between proteins with similar structures have been studied to see which amino acids can readily substitute for one another [Russell et al 97]. For proteins with a 25 to 50 percent sequence similarity, the patterns of differences tend to group the amino acids into the following groups: 1. His, Asn (positive or polar). 2. Lys, Arg, Gln (mostly positive). 3. Glu, Asp (negative). 4. Thr, Ser, Ala, Pro, Gly (mostly small, polar). 5. Tyr, Trp, Phe (aromatic). 6. Met, Leu, Val, Ile (aliphatic). 7. Cys. Also, safe substitutions for residues on the surface and interior of proteins were given [Bordo and Argos 91], and the maximum number of safe substitutions for any residue was 5, and the average was about two. In fact, 13 safe substitutions were given for the interior of a protein and 17 were given for the surface. Thus the surface appears to be nearly as constrained as the interior. This suggests that replacing an arbitrary residue by another has a high probability of being disruptive, whether on the surface or the interior of a protein. This is not difficult to understand. In the interior, non-hydrophobic residues will usually be disruptive, and even for hydrophobic residues, the length of the side chain is important. On the surface, the positive and negative residues are sufficiently different from each other and from the remaining polar residues that a replacement between these groups is likely to be disruptive. Also, cysteine is sufficiently different from all other amino acids that replacements by it are almost always disruptive. The number of polar residues left after removing these groups is not large, and even here, the length of the side chain has an influence on stability and secondary structure. It would appear that an arbitrary replacement of one residue by another has about a 3/4 chance of being disruptive.

Bordo and Argos [91] observed which amino acid replacements occur in natural proteins at a corresponding environment of surrounding side chains. For positions in the interior of the proteins studied, only thirty replacements were observed, and twenty of these occurred only once. Ignoring these, only ten equivalent pairs of side chains were found, and since each pair corresponds to a substitution in two directions, this means that on the average one substitution is possible in any given interior environment. For positions on the surface, 58 substitutions occurred more than once, which means about 6 possible substitutions for surface positions, on the average. Also, many figures were very small, indicating the substitution was only possible in a few environments, so most environments would not permit it. Since these figures are composites from many environments, the figures for a particular environment may be much smaller. Therefore all but about 6 replacements of surface residues would be harmful to the protein, and thus disruptive. Also, Koshi and Goldstein [97] observed that hydrophobicity of surface residues is preserved between proteins of similar structure, even though this cannot be explained in terms of protein stability. This indicates that a replacement of a surface residue by one of different hydrophobicity is likely to be eliminated from the population, and therefore is harmful. This makes such a replacement disruptive, according to the definition given previously.
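The per-residue averages quoted above follow from a simple count, sketched here in Python. The sketch assumes, as the paragraph does, that each observed equivalent pair counts as a substitution in both directions and that the total is spread evenly over the 20 amino acids. The function name substitutions_per_residue is hypothetical.

```python
AMINO_ACID_COUNT = 20


def substitutions_per_residue(equivalent_pairs):
    """Average number of allowed substitutions per amino acid, assuming
    each unordered pair permits a substitution in both directions."""
    return 2 * equivalent_pairs / AMINO_ACID_COUNT


interior_average = substitutions_per_residue(10)  # ten interior pairs
surface_average = substitutions_per_residue(58)   # 58 surface pairs
print(interior_average, surface_average)
```

With ten interior pairs the average is one substitution per residue, and with 58 surface pairs it is a little under six, matching the figures in the text.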

Further evidence of the constraints on proteins is the estimate of Kimura [83] that a mutation changing an amino acid is ten times more likely to be harmful than neutral or beneficial. Since there are 20 amino acids, this implies that on the average, only two or three can occupy a given position in a protein without harm. What kind of harm can the others do? They can either cause the protein to misfold, or fold in another way, or reduce its stability significantly so its activity or lifetime is affected, or interfere with an active site, or cause a new harmful reaction such as two proteins sticking together. A decrease in stability of 1 kcal/mole is minor because protein stabilities are in the range of 5 to 15 kcal/mole anyway. Therefore if the mutation impacted the protein by reducing the stability, it would have to reduce it by more than 1 kcal/mole, which would make it a disruptive mutation according to the definition given above. Therefore at least 9/10 of the replacements would be disruptive mutations, because of their effect on either the stability or the function of the protein.

Another evidence along the same line comes from the Dayhoff matrix which shows the frequency of various amino acid replacements between similar proteins. One notable feature of this matrix is that a few entries are large and the great majority of entries are very small. If all mutations were equally likely to be preserved in the population, the entries would be much more uniform. In addition, almost all of the large entries in the Dayhoff matrix correspond to substitutions that are considered safe in [Bordo and Argos 91], or that can be obtained by composing two such substitutions. Since the remaining entries are generally so much smaller, this suggests that these remaining entries correspond to amino acid replacements that are almost always disruptive. Also, if one assumes that most mutations are neutral, the difference in magnitude of the entries shows that the great majority of amino acid replacements are harmful and eliminated from the population.

The assumption that we will make is the following: Ten random disruptive mutations to a functional protein have a probability of at least 1/2 of causing the protein to have no useful function in the organism. The justification for this assumption is as follows: Since five or six relatively random mutations suffice to cause a protein to unfold, they would also destroy its function. Furthermore, since typical replacements reduce stability by 2-5 kcal/mole, and few of them increase stability, ten of them should cause a protein to unfold. In addition, ten mutations, all of which are harmful, should be enough to destroy the functionality of a protein, in most cases. Five mutations that decrease stability by at least 1 kcal/mole and five harmful mutations should likewise destroy the functionality of the protein, simply because the first five mutations will probably cause the protein to unfold and the last five are not likely to reverse this, and because the last five in themselves are likely to destroy the functionality of the protein.

The typical mechanism proposed to explain the evolution of new proteins is that an existing gene is duplicated, and one of the copies of the gene then begins a series of mutations that eventually result in a gene able to produce a new protein. If the mutations result in a change in the shape of the protein, then the protein will probably no longer have a function in the organism, because the function of a protein is closely related to its shape. The mutating duplicated gene is still able to produce a protein, but the protein has no function in the organism. We call such a gene "useless" to indicate that it does produce a protein but the protein has no function in the organism. This is distinct from "pseudogenes," which no longer produce proteins at all because mutations have corrupted a control region or something else necessary for the gene to function.

A difficulty with this scenario for protein evolution arises from the small number of genes. If many of these genes are useless, then the number of useful genes would be even smaller than the number of discovered genes, which seems highly unlikely. Therefore the average number of useless genes (as opposed to pseudogenes) in an organism is very small, reducing the probability that this kind of evolution can occur. Furthermore, a useless gene produces a protein that either fails to fold properly or has no useful function in the organism. Producing this protein requires extra energy without producing any benefit, and is therefore detrimental to the organism. In addition, misfolded proteins have to be removed from the cell, requiring extra energy. Misfolded or useless proteins are actually likely to have some harmful effect on the organism. This means that these useless genes are likely actually harmful to the organism, more harmful than pseudogenes. Harmful mutations tend to be eliminated from a population, making it even more unlikely that a useless gene would persist long in a population. For example, if the useless gene has a selective disadvantage of 10^-4, meaning that in each generation, the proportion of the population having this gene is reduced by one part in 10,000, then in 10⁵ generations, the proportion of the population having this gene is reduced by a factor of 22,000. In 10⁶ generations, the proportion is reduced by a factor of over 10⁴³, which is astronomical. During this time, one would expect about 10 mutations to a useless gene having 1000 base pairs; this corresponds to a probability of about 1/100 of a point mutation at any site. An example of the removal of useless genes is provided by the parasitic Mycoplasma genitalium and Mycoplasma pneumoniae, which have lost all ability to synthesize amino acids [Himmelreich et al 97]. 
Finally, it is likely that some mutation to a useless gene would render it nonfunctional, producing a pseudogene, which would be unlikely to result in the evolution of a new protein of benefit to the organism. For all of these reasons, the evolution of new protein shapes through neutral mutations is highly unlikely. However, it is necessary to estimate the number of mutations needed in order to bound the probability of such new shapes evolving.
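The selection arithmetic in this paragraph can be checked with a short Python sketch. It assumes the simple model the text describes: in every generation the proportion of the population carrying the useless gene is multiplied by (1 - s), where s is the selective disadvantage. The function names remaining_fraction and expected_mutations are hypothetical.

```python
def remaining_fraction(selective_disadvantage, generations):
    """Fraction of the original carrier proportion that remains after the
    given number of generations, if each generation multiplies the
    proportion by (1 - selective_disadvantage)."""
    return (1.0 - selective_disadvantage) ** generations


def expected_mutations(base_pairs, per_site_probability):
    """Expected number of point mutations in a gene, assuming independent
    sites that each mutate with the same probability."""
    return base_pairs * per_site_probability


s = 1e-4
reduction_after_1e5 = 1.0 / remaining_fraction(s, 10**5)
reduction_after_1e6 = 1.0 / remaining_fraction(s, 10**6)
print(f"reduction after 10^5 generations: {reduction_after_1e5:.3g}")
print(f"reduction after 10^6 generations: {reduction_after_1e6:.3g}")
print("expected mutations in 1000 base pairs:", expected_mutations(1000, 0.01))
```

The first figure comes out near 22,000 and the second exceeds 10^43, in line with the numbers quoted above; a per-site probability of 1/100 over 1000 base pairs gives the ten expected mutations.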

Currently, about 420 protein folds are known [Quian et al 01]. A fold is defined as a given collection of secondary structure elements with a specified topology, in a specified geometrical orientation. A protein may have more than one fold, so the number of three dimensional protein structures can be larger than the number of folds. In [Denton and Marshall 01], it is stated that there are probably only a small number of protein folds, perhaps not more than a thousand or so altogether:

Consideration of these 'constructional laws' suggests that the total number of permissible folds is bound to be restricted to a very small number -- about 4,000, according to one estimate. Confirmation that this is probably so is provided by a different type of estimate, based on the discovery rate of new folds. Using this method, Cyrus Chothia of Britain's Medical Research Council estimated that the total number of folds utilized by living organisms may not be more than 1,000. Subsequent estimates have given figures of between 500 and 1,000.

References given include [Chothia 92] and [Lindgard and Bohr 96]. This small number of folds is evidence that protein folds, and functional proteins, are highly constrained. Thus the probability that a random sequence of amino acids would produce a properly folding protein is very small, because a random sequence would be highly unlikely to have the proper sequence of hydrophobic, polar, and charged side chains. This in turn implies that it would typically require many mutations to produce a new protein shape from an existing one.

For purposes of this argument, it is reasonable to estimate the probability of producing a specific new shape of protein from existing ones. This is because the probability of producing such a new shape decreases rapidly with the number of mutations required. Therefore the probability of producing any new shape protein is dominated by the probability of producing the particular new shape that requires the smallest number of mutations. To see this, consider f(x) = 10^-x and note that the sum of f(3) + f(5) + f(8) is almost identical to f(3), because the larger arguments hardly influence the sum at all. And among the shapes requiring a minimal number of mutations, only a small fraction will be beneficial to the organism. Even if all shapes were considered, it would probably influence the probability by a factor of at most 1000 or 10,000.
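The claim that the smallest exponent dominates such a sum is easy to verify numerically. The following Python sketch simply evaluates the example given in the paragraph, with f(x) = 10^-x; the variable names are illustrative only.

```python
def f(x):
    """Probability model from the text: f(x) = 10^-x."""
    return 10.0 ** -x


total = f(3) + f(5) + f(8)
ratio = total / f(3)  # how much larger the full sum is than its leading term
print(total, ratio)
```

The ratio is about 1.01, so the terms with larger arguments change the sum by roughly one percent.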

Random sequences of amino acids are highly unlikely to fold properly into any shape of protein; rather, they will have many hydrophobic side chains exposed and will tend to stick together and also stick to other proteins that are in the process of folding. It is possible to estimate the probability that a random sequence of amino acids will fold properly. Each fold may have one or more families of proteins; within each family, the amino acid sequences are similar. The number of families is probably not much larger than 10,000. Assume that for each protein fold or family F there is a set of typical side chains s_i(F) at each position i. Assume that an atypical side chain reduces the stability of the fold or family by at least 1 kcal/mole, or is harmful to the function of the protein, so that at most about 10 side chains can be atypical for the fold or family. Call a position i of fold or family F r constrained if the fraction of amino acids that are typical for this position is r. Thus r = |s_i(F)|/20. Call a protein fold or family F r constrained if the probability is r that a random amino acid at a random position i on the protein will be typical for F. If F is r constrained, then the expected fraction of amino acid replacements necessary to obtain the fold or family F from a random protein without harmful replacements is 1-r. Since it is generally necessary to change more than 70 percent of the amino acids to obtain a given family, most families are r constrained for r less than or equal to 3/10, if one assumes that existing members of protein families are representative of all sequences having a given shape.

If P is a particular protein, let s_i(P) be the set of residues that can replace the existing residue at position i without being disruptive. Call such a residue typical for P at position i. Call position i of a protein n/20 constrained if n of the possible amino acid replacements including the identity replacement at position i are not disruptive. Call a protein r constrained if the chance that a random replacement at a random position is not disruptive is r. Note that the typical residues for position i of a fold or family depend only on the properties of the fold, while the typical residues for a protein depend on the particular sequence of that protein. For example, a fold may permit a residue with a long or short side chain at a given interior position, since this change can be compensated for by other residues. But a particular protein will likely permit only one of the two, either a long or a short side chain, at that interior position.

Since each fold specifies that at each position, amino acids should be hydrophobic or non-hydrophobic [Russell et al 97], it is reasonable to assume that s_i(F) includes about half of the side chains, and for the hydrophobic core, the fraction is considerably less. Since almost all of the exposed side chains will be polar, the exposed positions will have r not much larger than 1/2. In addition, the exposed side chains will be constrained by their preferences for various secondary structures, so it is reasonable to assume that surface positions are 1/2 constrained or less. In the interior of the protein, even replacing one hydrophobic side chain by another is likely to be disruptive, and typically only 1/5 to 1/20 of the side chains are non-disruptive. Therefore it is reasonable to assume that buried positions in a protein are 1/4 constrained for a particular protein. The results of Bordo and Argos [91] imply that interior positions of a protein are about 1/10 constrained and exterior positions are about 1/3 constrained, as an upper limit.

The fact that members of the same family tend to agree in 30 percent or more of their sequence is actually remarkable. If residues substituted readily for one another, this would not be expected. Suppose that all positions of a particular family were 1/2 constrained; then a typical protein from this family would have 10 possible residues at each position, and thus one would expect that two proteins from this family would agree in about 10 percent of their sequence. If all positions were 1/4 constrained, then there would be five typical residues at each position, and two proteins from the family would agree on about 20 percent of their sequence. The fact that proteins of the same shape often agree on 30 percent or more of their sequence suggests that there are only three or four typical residues at each position, on the average. This is consistent with data from [Russell et al 97] in which the amino acids naturally group into groups of size three or four, on the average. This also suggests that typical positions are 1/5 constrained or even r constrained for a smaller value of r. In fact, estimates given earlier based on Dayhoff matrices and Kimura [83] suggest that proteins are 3/20 constrained.
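The relation between the number of typical residues and expected sequence agreement can be sketched as follows. This assumes, as the paragraph implicitly does, that each protein picks uniformly and independently among the typical residues at a position; the function name expected_identity is hypothetical.

```python
from fractions import Fraction

def expected_identity(typical_residues: int) -> Fraction:
    """Chance that two sequences agree at one position when each picks
    uniformly and independently among `typical_residues` allowed residues
    (an idealizing assumption, not a measured quantity)."""
    return Fraction(1, typical_residues)

for k in (10, 5, 4, 3):
    r = Fraction(k, 20)  # the position is r constrained, r = k/20
    identity = float(expected_identity(k))
    print(f"{k:2d} typical residues (r = {r}): expected identity about {identity:.0%}")
```

Under that assumption, 30 percent agreement corresponds to between three and four typical residues per position.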

One might argue that the effects of different replacements can cancel each other out. For example, in the interior of a protein, replacing a small side chain by a large one might be compensated for by replacing a neighboring large side chain by a small one. In reality, this effect should not matter much. Let p be the probability of an amino acid replacement at any position. Suppose p = 1/10. Then the probability of having a disruptive mutation at a given position on the interior is 3/40, assuming that the interior positions are 1/4 constrained. Consider two neighboring positions; the probability that at least one of them will have a disruptive replacement is about 3/20. What is the probability that their replacements will cancel each other out? The probability of having replacements at both positions is 1/100, and then if the first replacement substitutes a large side chain for a small one, the chance that the second replacement will compensate is 1/4. The probability is therefore 1/400 that the effects of both replacements will cancel out. Thus the effect of the cancellation only reduces the 3/20 probability to 3/20 - 1/400, which is a negligible change. The same comment applies to replacements by cysteine; just adding one cysteine in the interior is likely to be disruptive, but adding two can be stabilizing because they can bond together. But the chance of this happening is very small in comparison. Therefore, it is reasonable to ignore the interactions between replacements when considering their effect on the stability of a protein. Therefore it is reasonable to assume that any ten or so disruptive replacements to a protein will cause it to unfold. Folds may be as little as 1/2 or 1/4 constrained, but particular proteins may be as little as 3/20 constrained. This latter figure is the relevant one in deciding whether a set of random mutations will cause a protein to unfold.
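The arithmetic of the compensation argument can be reproduced with exact fractions. This is only a sketch of the calculation as stated above, using the same values (p = 1/10, interior positions 1/4 constrained, a 1/4 chance that the second replacement compensates); the variable names are illustrative.

```python
from fractions import Fraction

p = Fraction(1, 10)          # chance that a given position is replaced
r_interior = Fraction(1, 4)  # interior positions assumed 1/4 constrained

p_disruptive = p * (1 - r_interior)               # one position: 3/40
p_either = 2 * p_disruptive                       # two neighbours, approximately 3/20
p_both_replaced = p * p                           # both positions replaced: 1/100
p_compensate = p_both_replaced * Fraction(1, 4)   # replacements cancel: 1/400
p_net = p_either - p_compensate                   # 3/20 - 1/400

print(f"disruptive at one position: {p_disruptive}")
print(f"disruptive at either of two: about {p_either}")
print(f"compensating pair: {p_compensate}")
print(f"net after compensation: {p_net} = {float(p_net):.4f}")
```

The correction changes 0.15 to 0.1475, which is the negligible change referred to in the text.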

For simplicity, assume that a side chain is either buried or exposed. Generally about half of the amino acids in a protein are hydrophobic (including Glycine), suggesting that the hydrophobic core of a typical protein contains about half of the side chains, though this proportion can vary. If half of the side chains are buried and half are exposed, then half of the side chains will be 1/4 constrained and half will be 1/2 constrained. This means that in order to obtain family F from a random sequence of amino acids, it is necessary to change half of the surface side chains and 3/4 of the buried ones, or 62.5 percent altogether. In actual proteins, it is usually necessary to replace at least 70 percent of the amino acids to obtain a different fold, so the stated assumptions are somewhat too liberal. For folds with a small hydrophobic core, the positions will be less constrained, on the average, consistent with the fact that one can sometimes obtain a new fold by replacing less than half of the side chains [Service 97]. Also, the fold obtained there did not have any active sites, which would have constrained it substantially more and required replacement of more side chains. In addition, it is to be expected that some sequences will require fewer than the expected number of amino acid replacements. Furthermore, the report in [Service 97], in which about 40 percent of the amino acids were changed, did not specify how many hydrophobic side chains remained on the surface; these can cause problems, even if the protein folds correctly.
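The 62.5 percent figure is a weighted average of 1 - r over the buried and exposed positions. The sketch below computes it; the helper fraction_to_change is a hypothetical name. The second call reuses the Bordo and Argos upper limits quoted earlier (1/10 interior, 1/3 exterior) with the same half-and-half split, which is an assumption made here purely for illustration.

```python
from fractions import Fraction

def fraction_to_change(classes):
    """classes: iterable of (share_of_positions, r) pairs.
    Returns the expected fraction of residues to replace: sum of share * (1 - r)."""
    return sum(share * (1 - r) for share, r in classes)

half = Fraction(1, 2)

# Assumption from the text: half buried (1/4 constrained), half exposed (1/2 constrained).
text_assumption = fraction_to_change([(half, Fraction(1, 4)), (half, Fraction(1, 2))])

# Illustration only: Bordo and Argos upper limits with the same 50/50 split.
bordo_argos = fraction_to_change([(half, Fraction(1, 10)), (half, Fraction(1, 3))])

print(f"text assumption: {text_assumption} = {float(text_assumption):.1%}")
print(f"Bordo and Argos limits: {bordo_argos} = {float(bordo_argos):.1%}")
```

With the tighter constraints the expected fraction rises above 70 percent, in line with the remark that the 1/4 and 1/2 assumptions are somewhat too liberal.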

In order to make the calculations simple, and be generous to the possibilities for protein evolution, we assume that all positions are half constrained, and that 10 of them can be atypical without destroying the fold or functionality of the protein. Each amino acid is specified by a codon of three nucleotides, each of which can be one of four bases, though the third nucleotide often does not make much of a difference. Therefore for a gene of 1000 base pairs, which is typical, there will be 333 amino acids, and the chance that a random sequence will have the right properties for a particular fold is 2^-333, or about 10^-100. Since there are about 10,000 families, the chance that a random sequence will consist of typical side chains for some fold would be at most 10^-96. However, it is more reasonable to assume that some of the amino acids can deviate from the required properties without damaging the fold. Each such deviation costs 1 kcal/mole or more of free energy, and probably 2 - 5 kcal/mole, so at most about 10 such atypical side chains can be permitted without destroying the fold or the functionality of the protein.

If each fold permits 10 of the amino acids to be atypical, then one can show that the probability that a random sequence of amino acids will fold properly is about 333!/(10! * 323!) larger, which is about 4 * 10¹⁸, giving a final probability of about 4 * 10^-78, under these assumptions. This shows that the more a protein is randomized, the less likely it is to fold properly. In reality, the probability should be much smaller, because it is not enough just to have hydrophobic or non-hydrophobic side chains at specified positions; other constraints must be satisfied, as well. Furthermore, in the hydrophobic core, the constraints are more severe.
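These magnitudes can be recomputed in logarithms to avoid underflow. The sketch below follows the stated assumptions (333 residues, all positions 1/2 constrained, 10,000 families, 10 atypical positions allowed) and, like the text, counts only arrangements with exactly 10 atypical positions. The variable names are illustrative.

```python
import math

n_residues = 333      # amino acids in a 1000 base pair gene
n_families = 10_000   # assumed number of protein families
max_atypical = 10     # atypical side chains tolerated by a fold

# Chance that every position is typical for one particular family: 2^-333.
log10_single = -n_residues * math.log10(2)

# Upper bound over all families.
log10_any_family = log10_single + math.log10(n_families)

# Ways to choose which 10 positions are atypical: 333!/(10! * 323!).
arrangements = math.comb(n_residues, max_atypical)

log10_final = log10_any_family + math.log10(arrangements)

print(f"2^-333            ~ 10^{log10_single:.1f}")
print(f"any of 10^4 folds ~ 10^{log10_any_family:.1f}")
print(f"C(333, 10)        ~ {arrangements:.2e}")
print(f"final probability ~ 10^{log10_final:.1f}")
```

Without the intermediate rounding to 10^-100 and 10^-96 the result comes out near 2 * 10^-78, the same order of magnitude as the figure given above.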

An estimate of the number of mutations needed to produce a protein of a new shape is provided by [Baker and Sali 01] in which it is stated that prediction of the structure of a protein is difficult if there is less than a 30% amino acid sequence identity. If there is 30% or more agreement in the amino acid sequences, the structure of the proteins will be similar. This implies that proteins of significantly different structure will differ by more than 70% of their amino acids. Also, from [Service 97]:

pairs of natural proteins differing in up to 70% of their amino acid sequences virtually always fold up in to the same general 3D structure.

It follows that obtaining a different structure requires more than a 70 percent difference in amino acid sequence, almost always. If a protein has 1000 coding base pairs, or 333 amino acids, then at least 233 of these must differ in order to have a 70% difference and a new shape. Each such difference will require one, and possibly two, point mutations, for probably well over 300 point mutations in all. There was a competition ([Service 97]) to reduce this 70 percent figure. The winner succeeded in obtaining a new protein fold by changing only 40 percent of the amino acids! This suggests that reducing this 40 percent figure will be difficult.

One can also get an estimate by noting that protein structures tend to bury hydrophobic side chains. Thus each protein fold will correspond to a particular pattern of hydrophobic and non-hydrophobic side chains. It seems that on the average one would have to change about half of the side chains from hydrophobic to non-hydrophobic and vice versa to get a new fold. Half of 333 amino acids would be 166 amino acids. Each such change would require one or two mutations, for well over 200 point mutations to change from one protein fold to another. The necessary number of mutations would be somewhat less, since some atypical side chains are permitted. Therefore, a small number of random mutations will cause a protein to misfold or not fold at all, but a large number of mutations is needed to obtain a new fold.

Another way to justify the fact that many mutations are needed to change the shape of a protein is found in [Cordes et al 99]:

Mutagenesis experiments show that limited changes in sequence can have large effects on stability and activity, but generally do not lead to large shifts in structure. For example, highly disruptive mutations such as insertions in elements of regular secondary structure or hydrophobic-to-charged substitutions at core positions lead to only minor structural differences in bacteriophage T4 lysozyme and staphylococcal nuclease, pointing to a strong drive to preserve the basic native fold.

Despite this, just a few mutations are likely to cause a protein to misfold, or not to fold at all. In the transition between folds, a protein passes through a region of instability, and is likely not to fold at all. Since natural proteins tend to be stable, it must be that instability is detrimental to the organism and (under evolutionary assumptions) is eliminated from the population. Therefore proteins tend to remain in regions of stability, and many mutations are required to change their shape. Thus mutations along the path of change would be harmful to the organism and would tend to be eliminated from the population.

Another figure for the number of mutations to change a fold is given in [Chen et al 96]. In order to change the function of a protein, making a slight change in the secondary structure of one small loop, leaving most of the tertiary structure of the protein unchanged, it was necessary to replace one sequence of 7 amino acids by another sequence of 13 amino acids and change 4 other amino acids. Inserting 6 amino acids requires 18 insertions and changing 7+4 more requires between 11 and 22 more substitutions, for a total of between 29 and 40 mutations. If the insertions are done carefully, these bounds can be reduced by 3 and 6 substitutions, respectively. The required number of mutations is small in this case because the modified loop is on the exterior of the protein.

To get another estimate of the number of mutations needed, consider two additional references. In [Cordes et al 99], the authors present a mutation that exchanges one fold of a protein for another. This occurs by exchanging the functions of two residues in each strand of a beta sheet, and each ASP-LEU mutation requires at least four substitutions, for a total of at least eight point mutations. But these appear to happen in both monomers of the protein, which are coded for by the same gene, so only four point mutations would be needed. The new fold leaves most of the protein structure intact, so it is not clear that this qualifies as a change of the shape of the protein either. What happens is that a small portion of the two ends of the protein changes its configuration, but the middle of the protein does not. It would seem easier to change the configuration of the ends of a protein without modifying the overall structure than to change the middle. Changing the shape of a fold in the middle of a polypeptide chain would be unlikely to leave the ends of the fold in the same position as before, implying that additional changes would be necessary further on down the protein. Furthermore, the ends of the protein are on the outside of the protein in this case. For portions of a fold on the inside of the protein, changing the shape of the fold would also require changing the geometry of the surrounding portions of the protein, undoubtedly requiring many more mutations. For the small portion of the fold that is changed, 1/3 of the amino acids had to be changed. Larger folds and folds in the middle of the protein would require a larger proportion of amino acid changes and many more changes altogether.

In [Cordes et al 00], the authors discuss what happens when only half of these mutations occur. If only two of the four substitutions occur, the protein will rapidly oscillate between the two configurations, preserving the overall shape of the protein. This does not qualify as a change in the overall shape of the protein, but suggests that there are regions of instability between different folds.

We will suppose for concreteness that eight point mutations can produce a new shape of a protein that folds properly, with not too many hydrophobic side chains on the outside, et cetera, so that the protein could conceivably be a functional protein. However, it is still unlikely that the new protein has a function in the organism, because almost all mutations are harmful. Therefore it probably would require at least several such sets of mutations to produce a new protein shape that folds properly and has a useful function in the organism, or, 24 point mutations in all. In addition, a number of mutations to one or more active sites or other places in the protein involved in chemical interactions would probably be necessary, which we will estimate as at least 10 point mutations, for 34 in all. In reality, this is being very generous, and considerably more would be needed. Estimates given above were in the range 200-300 or more. If one assumes that a protein fold can tolerate disruptive mutations to ten percent of its amino acids, then at least 34 mutations would be needed to change the shape of the protein. Thus the figure of 34 point mutations is highly conservative.

Now, one problem with producing proteins of new shapes by evolution is that the mutations of a useless gene would be neutral mutations, because they would confer no benefit to the organism. From population genetics considerations it follows that almost all neutral mutations are eliminated from a population, and most such mutations do not last very long. According to the theory of neutral evolution, neutral mutations are eliminated from a population on the average in 2(Ne/N)ln(2N) generations (if I understand the matter correctly), where Ne is the effective population size, N is the population size, and ln is natural logarithm. Note that Ne/N is at most 1. For a population of a billion, this would be at most about 43 generations. For a population of a trillion, it would be at most about 57 generations. The chance to accumulate a significant number (even 2) of neutral mutations to a gene within 43 to 57 generations is negligible. A typical mutation rate for a gene is one mutation for every 10⁵ generations, that is, a mutation somewhere in the gene typically occurs about this frequently. In order to get 34 mutations, the gene would have to persist in the population for about 34 * 10⁵ generations, which is highly improbable because neutral mutations are quickly lost, as a rule. However, the effect of this is difficult to assess: even though most neutral mutations are quickly lost, those that remain may spread to a larger number of individuals. Also, there are "frequency-based selection" models under which rare alleles have an elevated chance of being retained in the population, which would tend to negate this loss of neutral mutations.
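The persistence formula quoted above is easy to evaluate directly. The following sketch takes Ne/N = 1 as the upper bound; the function name neutral_persistence is hypothetical, and the formula itself is used as stated in the text rather than independently derived.

```python
import math

def neutral_persistence(population: float, ne_over_n: float = 1.0) -> float:
    """Mean generations before a neutral mutation is lost, using the
    expression 2 * (Ne/N) * ln(2N) as quoted in the text."""
    return 2.0 * ne_over_n * math.log(2.0 * population)

billion = neutral_persistence(1e9)
trillion = neutral_persistence(1e12)

mutation_interval = 1e5        # generations between mutations in one gene
required_mutations = 34
generations_needed = required_mutations * mutation_interval

print(f"N = 10^9 : about {billion:.1f} generations")
print(f"N = 10^12: about {trillion:.1f} generations")
print(f"generations needed for 34 mutations: {generations_needed:.1e}")
```

The formula evaluates to roughly 43 and 57 generations, close to the figures quoted above and many orders of magnitude below the several million generations needed.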

It is necessary to estimate the total population size in order to bound the probability of new protein shapes evolving. A recent article ([Whitman et al 98]) estimated that there are now about 5 * 10³⁰ prokaryotes in existence and that on the average they reproduce about once every three years. This seems to give a good bound for the total population size. The fact that prokaryotes reproduce so rarely shows that their food supply is limited, which implies that elimination of a useless gene would confer a significant advantage. Each bacterium may reproduce about once every 15 minutes, which leads to about 30,000 generations per year if much food is available, and about 10¹⁴ generations in 3 billion years. Thus the total number of individuals ever existing can be bounded by about 10⁴⁵ and may well be as small as 10⁴⁰, according to standard evolutionary assumptions. Each gene experiences a mutation about once every 10⁵ generations, implying the total number of alleles generated is about 10⁴⁰. In order to generate a protein of a new shape, the probability of doing so for each allele cannot be much smaller than 10^-40. If an allele is beneficial to the organism, it will tend to spread to many individuals, reducing the total number of alleles in the population. Therefore the probability of producing a protein of a new shape is maximized by assuming that all alleles are neutral.
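The population bound can be laid out step by step. This is a rough sketch using only the figures given above (5 * 10³⁰ prokaryotes, 3 billion years, one division per 15 minutes or one per three years, one mutation per gene per 10⁵ generations); the variable names are illustrative.

```python
import math

prokaryotes_now = 5e30
years = 3e9
mutation_interval = 1e5                     # generations per mutation in a gene

gens_per_year_fast = 365 * 24 * 4           # one division every 15 minutes
gens_fast = gens_per_year_fast * years      # about 10^14 generations
gens_slow = years / 3                       # one division every three years

individuals_upper = prokaryotes_now * gens_fast
individuals_lower = prokaryotes_now * gens_slow
alleles_upper = individuals_upper / mutation_interval

print(f"generations per year (fast): {gens_per_year_fast}")
print(f"individuals, upper bound ~ 10^{math.log10(individuals_upper):.1f}")
print(f"individuals, lower bound ~ 10^{math.log10(individuals_lower):.1f}")
print(f"alleles generated, upper ~ 10^{math.log10(alleles_upper):.1f}")
```

The results land just under 10⁴⁵ individuals and 10⁴⁰ alleles, so the rounded bounds in the text err on the generous side.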

To get a better estimate on the probabilities, 33 of these 34 mutations needed to change the shape of the protein would be neutral or harmful; the last one might be beneficial and more likely to be retained in the population. Assume that a particular set of 34 mutations is necessary to produce a beneficial protein having a new shape. This assumption is somewhat too severe, because there may be more than one such set of mutations. A typical gene has about 1000 base pairs of coding DNA (not counting introns, etc.), though this number can vary a lot. Suppose that the probability is p that a mutation will occur at any site in a gene. Suppose (model A) that if this mutation occurs to a site outside the specified set of 34 sites, the mutation has a probability of one half of permitting the specified protein fold to form and for the protein to satisfy the constraints necessary for a functional protein (the right hydrogen bonding occurs, etc.) and be beneficial to the organism. If the protein is not beneficial to the organism, the gene will eventually be eliminated from the population. (The computation is not sensitive to this value of 1/2. Especially if 34 is replaced by a larger number, 3/4 or 7/8 would likely work as well.) However, every one of the specified 34 sites must mutate for the specified protein fold to form and for the constraints to be satisfied. Then the probability that this fold can form and the constraints can be satisfied and the protein can be beneficial is (1 - p/2)^(1000 - 34)*p³⁴.

Suppose p is about 1/30, meaning that the expected number of mutations is about 33. Then the expression becomes (1 - 1/60)^(1000 - 34)*(1/30)³⁴ or (8.9 * 10^-8)*1/(1.7 * 10⁵⁰), or 5.2 * 10^-58, which is highly improbable. If p is somewhat larger, the probability rises at first but then falls again. For p = 1/15, one obtains a probability of about 5.963*10^-55. For p = 1/10, one obtains 2.876*10^-56. For p = 1/5, one obtains 9.717*10^-69. Therefore even a larger probability of mutations does not help much. However, since useless genes should rapidly be eliminated from a population, a small value of p seems most realistic. These figures ignore the fact that a mutation to one of the 34 sites will only have a probability of about 1/2 of producing a side chain typical for the family. This reduces the probabilities by a factor of 2^-34, which is about 6 * 10^-11.
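The model A expression (1 - p/2)^(1000 - 34) * p^34 can be evaluated for the four values of p considered. The sketch below works in base-10 logarithms; the function name log10_model_a and its parameter names are hypothetical. Evaluated exactly as written, the results agree with the quoted figures to within about fifteen percent, and the small differences appear to come from rounding or a slightly different exponent.

```python
import math

def log10_model_a(p: float, gene_length: int = 1000, required: int = 34) -> float:
    """log10 of (1 - p/2)^(gene_length - required) * p^required."""
    tolerated = (gene_length - required) * math.log10(1.0 - p / 2.0)
    specified = required * math.log10(p)
    return tolerated + specified

results = {}
for denom in (30, 15, 10, 5):
    results[denom] = log10_model_a(1.0 / denom)
    print(f"p = 1/{denom:<2d}: probability ~ {10 ** results[denom]:.2e}")

# Extra factor for each of the 34 sites receiving a typical side chain.
typical_factor = 2.0 ** -34
print(f"2^-34 ~ {typical_factor:.2e}")
```

The probability peaks near p = 1/15 and falls off on either side, which is why increasing the mutation rate further does not help.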

Is it reasonable to assume that mutations must occur at a specified set of sites in order to change a protein fold? From [Cordes et al 99],

Burial of hydrophobic groups is widely acknowledged as a principal source of protein-folding stability, whereas burial of polar groups inevitably decreases stability.

Therefore it is reasonable to assume that the pattern of polar and hydrophobic side chains of residues in a protein must be changed in a specific way in order to obtain a new protein fold. This implies that mutations must occur to a specific subset of the amino acids in the protein, which justifies our assumption (except for the fact that some amino acids can be specified by more than one codon).

Is it reasonable to assume that a single mutation has a probability of one half of causing the protein to misfold or fail to be beneficial to the organism? Kimura [83] estimates that a mutation which alters an amino acid is ten times more likely to be harmful than neutral or beneficial. From [Crow 97], "most mutations if they have effects large enough to be detected phenotypically are deleterious. ... the evidence is strong that the great majority of mutations are partially dominant, so that heterozygotes show some decrease in fitness." This appears to imply that these mutations are harmful because they add some deleterious function to a protein. Even the mutations contributing to the shape change are also likely to introduce deleterious additional functions to the protein. Thus one can expect most mutations that change an amino acid to be deleterious due to adding a harmful function to the protein, so it is reasonable to assign a probability of 1/2 to this possibility, even if these mutations do not change the fold of the protein. Also, we are assuming that all side chains on a protein are 1/2 constrained, which justifies this calculation.

A problem with the above analysis is that harmful mutations may be eliminated from the population. Mutations that are harmful and partially dominant will be eliminated and will not appear in the final protein. Therefore p must be taken as the probability that the final protein will fail to fold properly, or will fail to be beneficial to the organism. How can such a probability be estimated?

In fact, a few random mutations suffice to cause a protein to fail to fold [Blanco et al 99]. Once the protein becomes misfolded, additional mutations will not have much of an effect. Therefore it is reasonable to assume that the chance that a mutation is retained does not depend on the nature of the mutation.

A problem with model A is that it does not recognize that 10 disruptive mutations can occur without destroying the fold. Allowing for this in model B increases the probabilities by a factor of about 4 * 10¹⁸. This more than offsets the decrease by a factor of 2^-34 (about 6 * 10^-11) due to the 34 mutations failing to achieve the desired fold, but the net gain is only a factor of about 2 * 10⁸, so the probabilities are still too small.

To be more accurate, it is necessary to consider (model C) that 1/4 of all mutations will in themselves cause a protein to misfold; this includes frameshifts and creations of stop codons, as well as other mutations. This reduces the probabilities considerably more. Call a mutation catastrophic if it in itself causes the protein to misfold. Since 1/4 of all mutations are catastrophic, it follows that for every three substitutions, about one catastrophic mutation is expected. If p = 1/30, then about 33 substitutions will be expected for a gene having 1000 base pairs. For each substitution, there will be a probability of 1/3 that it will be accompanied by a catastrophic mutation. The probability that none of these catastrophic mutations will occur is then (1 - 1/3)³³ or about 1.5 * 10^-6. For p = 1/10, the probability is (1 - 1/3)¹⁰⁰ or about 2.5 * 10^-18. For p = 1/5, the probability is about 6 * 10^-36.
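The model C penalty is a simple power of 2/3. The following sketch evaluates it for the three mutation rates mentioned, taking the expected number of substitutions as p times 1000 base pairs as in the text; the function name p_no_catastrophic is hypothetical.

```python
def p_no_catastrophic(expected_substitutions: int, p_cat_per_sub: float = 1.0 / 3.0) -> float:
    """Chance that none of the substitutions is accompanied by a
    catastrophic mutation, each with independent probability p_cat_per_sub."""
    return (1.0 - p_cat_per_sub) ** expected_substitutions

# (denominator of p, expected substitutions in a 1000 base pair gene)
cases = ((30, 33), (10, 100), (5, 200))
penalties = {}
for denom, substitutions in cases:
    penalties[denom] = p_no_catastrophic(substitutions)
    print(f"p = 1/{denom:<2d}: {substitutions:3d} substitutions, "
          f"no catastrophic mutation with probability ~ {penalties[denom]:.1e}")
```

These factors multiply the model A and B probabilities, so the penalty grows steeply as p increases.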

Not only change of protein shape, but the development of new active sites is problematical for protein evolution. Protein interactions are highly specific, meaning that a given protein will interact with only a few others. This means that these interactions have to be specified by a considerable amount of information on the protein surface. For two proteins to interact, typically 10 to 15 amino acids on each protein must come into close proximity, and their properties must match (hydrophobic with hydrophobic, polar with polar, positive with negative) to a large degree. Of course, this is in addition to the fact that these amino acids, which are typically largely buried, influence the stability of their proteins, and cannot have too many exposed hydrophobic side chains. Thus both the hydrophobicity and the shape of the side chain are significant.

The forces governing interactions between proteins are the same as those governing protein stability. However, the free energy decrease when two proteins join together cannot be as large as the folding energy decrease, or else the proteins could not separate. Also, typical replacements of amino acids destabilize proteins by 2 - 5 kcal/mole, so one or two such side chain replacements should also prevent the joining of two proteins. Interior sites of a protein are typically 3/20 constrained. Since side chains at an active site are in an interior environment when the two proteins join, they should be at least as constrained as interior residues. However, they also must function when the proteins are not joined, so they are even more constrained. Since side chains at active sites are often conserved, it must be that in many cases, only one side chain will do. Therefore it seems reasonable to assume that at a position in an active site of a protein, on the average at most two side chains can be permitted.

Active sites can only form when the geometry of the two interacting proteins or substances is closely matched. For example, one protein must be convex and another concave. If one assumes that each protein has five suitable places, then a pair of proteins would have 25 possible places to interact at active sites. If one assumes that in an early organism, when proteins would have evolved, there were at most 1000 proteins, then there would have been at most 25,000 ways that a protein could form an active site.

Suppose one assumes that a random amino acid has a 1/10 probability of having the right properties for an active site of a protein, and suppose that on the average 15 amino acids would need replacement, including those on both proteins. Also, since the side chains are so highly constrained, one can assume that two mutations are needed per side chain to get the proper amino acid, and if both mutations occur, the probability that the amino acid will have the necessary properties is 1/10. Then if the probability is p that a mutation occurs at a site in some number of generations and remains in the population a long time, the probability that the active site can form is p³⁰ * 10^-15. Multiplying by about 10⁵ pairs of surfaces to interact yields a larger figure. If p > 1/10, the protein will be too randomized to fold properly. If p < 1/10, then the probability the active site can form is at most 10^-45, which is too small. If one assumes that mutations causing misfolding will be eliminated from the population, then only a small fraction of the possible sets of mutations can remain in the population, and therefore the probability of obtaining the active site is also much smaller. This does not even consider the harmful effects of the other mutations that would have to occur for such a large p. Also, many protein reactions require at least three proteins to interact, meaning even more constraints. Therefore it does not appear reasonable to assume that new active sites can form by substitutions.
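The active site estimate can be written out in logarithms. This sketch uses the stated assumptions (15 residues, two mutations per residue, a 1/10 chance that the resulting amino acid is suitable, about 10⁵ candidate surface pairs); the function name log10_active_site and its parameters are hypothetical.

```python
import math

def log10_active_site(p: float,
                      residues: int = 15,
                      mutations_per_residue: int = 2,
                      p_suitable: float = 0.1,
                      surface_pairs: float = 1e5):
    """Return (log10 probability for one surface pair,
               log10 probability summed over all surface pairs)."""
    mutations = residues * mutations_per_residue            # 30 point mutations
    single = mutations * math.log10(p) + residues * math.log10(p_suitable)
    return single, single + math.log10(surface_pairs)

for p in (0.1, 0.05, 0.01):
    single, all_pairs = log10_active_site(p)
    print(f"p = {p:<5}: one pair ~ 10^{single:.1f}, all pairs ~ 10^{all_pairs:.1f}")

single_at_limit, pairs_at_limit = log10_active_site(0.1)
```

At the limiting value p = 1/10 the single-pair figure is 10^-45, and the factor for surface pairs raises it to about 10^-40; for any smaller p both figures drop rapidly.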

Another possible mechanism for obtaining new protein shapes is that material is transposed from one gene into another by "transposable elements," producing a considerable stretch of genetic material at once and possibly more quickly producing a new protein. This possibility should also be considered.

In an article in the Sacramento Bee from March 19, 2001, "Much DNA just "junk" -- or is it? Human Genome Project spurs new look at mystery material," it is stated

For example, the body seems to have crafted 50 genes out of junk sequences known as transposons, so named because they are transposable, moving around the genome like text copied and duplicated in a computer file.

implying that this mechanism cannot explain most of the genes. Also, from a priori probability considerations, there is no reason why material transposed into a gene would be more likely to lead to a useful protein than random mutations. In addition, material inserted by a transposable element will consist of adjacent sites in the gene, meaning that contiguous sites will all be inserted at the same time. This results in a greater change than having isolated point mutations, and therefore would tend to decrease the probability of the evolution of new proteins. Finally, transposable elements newly appearing in genes are likely to render the genes useless. Thus such new appearances of transposable elements will likely be harmful, and therefore must be very rare, further reducing their probability. We will assume that this happens in a gene only about once in 10⁵ generations on the average.

In order to compute the probability for transposable elements in model A, B, or C, we assume that on the average, each time one of the specified 34 sites is introduced by a transposable element, 10 other base pairs are introduced as well. Then when the 34 sites have all been introduced, 340 other sites will have been introduced as well. Each such new base pair has a probability of one-half of prohibiting the specified fold, for a total probability of at most (1/2)³⁴⁰ that this fold can form. This probability is about 1/(2.2 * 10^102), which is impractically small. It is actually not necessary to assume that 340 other sites have been introduced. Letting p be the probability that a transposable element will replace a particular amino acid by one in a different class, the computation is as before, with the added problem that insertions of any amino acid are likely to disrupt a protein's structure.
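The size of (1/2)³⁴⁰ can be checked with a few lines of code. This is only a sketch under the paragraph's own assumptions (340 extra base pairs, each with an independent one-half chance of leaving the fold intact); the function name `fold_survival_bound` is illustrative and not from the text.

```python
import math

def fold_survival_bound(extra_base_pairs=340, per_site_ok=0.5):
    """Return log10 of the bound per_site_ok ** extra_base_pairs."""
    return extra_base_pairs * math.log10(per_site_ok)

log10_bound = fold_survival_bound()
exact_denominator = 2**340  # Python integers are exact at this size

print(f"log10 of (1/2)^340 = {log10_bound:.2f}")
print(f"2^340 has {len(str(exact_denominator))} decimal digits")
```

Working in logarithms keeps the order of magnitude easy to read off, and the exact integer 2³⁴⁰ gives an independent check on the number of digits in the denominator.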

Is it reasonable to assume that the genetic material inserted by a transposon (or retrotransposon) is random, as we have done? There are a number of cases to consider. Transposons often create "direct repeat" sequences at their ends and may have one or more genes in the middle. If the transposon operates by "cut and paste," then it will eventually leave the place it entered and leave behind only the direct repeats. These patterns are too regular to generate new protein structures. If the transposon operates by a "copy and paste" mode, then the main body of the transposon will be left behind. If this part contains no genes, then it will tend to be randomized by point mutations over time, justifying our assumption of randomness. If the body of the transposon consists of simple repetitive sequences, then it does not have enough variety to generate new protein shapes. If the body of the transposon contains genes, then these genes will contain sequences that will probably cause the original gene to lose its functionality. If the transposon causes a frame shift in either its own DNA or the DNA beyond itself, this will tend to randomize the gene. If the (retro)transposon contains a pseudogene, it is a LINE, and these seem to have a mechanism for avoiding insertion into functional genes. Genes can also be inserted by viruses or passed from one bacterium to another. In this case the inserted gene would be functional, and it does not seem possible to have two functional overlapping genes, so the original gene would lose its function. All in all, it does not seem that there is any way for transposons or retrotransposons to contribute non-random DNA sequences to the evolution of proteins, except for simple sequences that could not explain the evolution of proteins. Therefore the assumption that material inserted by transposable elements is random appears to be correct.

Even if transposable elements or other mutations did insert DNA from one gene into another, it would not help. Even proteins with similar shapes may differ in as many as 70% of their amino acids at corresponding positions in the protein. One would expect, therefore, that proteins of different shapes would differ in considerably more than 70% of their amino acids at corresponding positions in the protein. This shows that the amino acid sequences of proteins having different shapes are significantly different throughout their whole extent. This means that there is little to be gained in protein evolution by concatenating portions of existing protein sequences to generate proteins having new shapes. Therefore genetic material from another gene, inserted in the middle of a gene, would behave essentially the same as random DNA, and the previous analysis would apply to it.

Proteins are often composed of "domains" that fold independently, and the same or similar domains can occur in different proteins. Folds are identified with the three-dimensional structures of domains, so that a protein can have more than one fold. Two proteins sharing a domain might share a subsequence that is largely similar. The question then becomes how domains with new shapes could evolve. However, even similar domains in different proteins are likely to have different parts buried and exposed, so their amino acid sequences are likely to be significantly different.

One might say that existing protein families do not provide intermediates between one protein shape and another, but perhaps other protein folds existed in the past that served as intermediates, so that the mutations required to pass between one fold and another were fewer. Or perhaps different protein folds existing in the past shared long common subsequences of amino acids, permitting new protein folds to be created by concatenating together long subsequences of existing folds. However, the references quoted above imply that there are only a small number of protein folds possible in principle, because of the laws of physics. Therefore many additional folds not only did not exist, they could not exist in principle. If many such intermediates did exist, then one wonders why they no longer do. The great majority of all functional proteins that ever existed must have been such intermediates, if they ever existed, and one would expect that many of them would still be found in existing organisms. If indeed such intermediates can be constructed, then this is evidence that life did not evolve, but was designed, because if life evolved, these intermediates should still be found in living organisms.

Not only this, but in order for a new fold to form, it has to be beneficial to the organism, or at least not harmful. Thus the new protein must interact in a useful way with other existing proteins in the organism. This constrains the possible protein folds even more. Furthermore, two proteins A and B of different shapes are not likely to share long common subsequences S of amino acids. The reason for this is that predicting the structure of a protein from its amino acid sequence is a very hard problem. This in turn implies that the shape into which the subsequence S folds is determined not only by S but also by the other amino acids in the protein. Therefore the sequence S is likely to fold into a different shape in proteins A and B, and it is also likely to have different parts buried and exposed in proteins A and B. If protein A is stable, this means that the pattern of hydrophobic and non-hydrophobic side chains in the sequence S is suited to the protein shape A. Since the shape of B (and of S in B) is so different, and different parts of S are buried in B than in A, the pattern of side chains in S will probably be unsuited to shape B, and B will be unstable. This further justifies the assumption that material inserted into a gene behaves like random DNA for purposes of evolving new protein shapes.

If one assumes that inserted sequences of DNA have random properties, the evolution of new protein shapes is highly improbable, as already shown. But there are additional difficulties.

A transfer of genetic material could take two forms. Write a gene or protein as a concatenation of sequences, so that AA' denotes a protein with sequence A of amino acids followed by sequence A'. In the first form, a protein AA' might receive a portion D from a protein CDC', resulting in a new protein ADA'. In the second, an existing protein of the form ABA' might have B replaced by D from a protein of the form CDC', again producing a protein ADA'. In fact, just such a mechanism is proposed in [Grishin 01], in which sheets can replace coils, or vice versa, to gradually modify protein structure. A third possibility is that material is simply deleted, transforming ABA' to AA'. However, deletions cannot explain how large proteins could evolve from small ones. There are problems with all of these possibilities.

First, A or A' or D may fold differently in the protein ADA' than they did originally, or the new protein may not fold properly at all. This is especially possible below the domain level. As an example, in the replacement of one secondary structure element by another reported in [Chen et al 96], involving the replacement of a string of 7 amino acids by a string of 13, it was necessary also to replace four other amino acids in order for the protein to fold properly. Second, the geometry of D may not fit the geometry of A and A' in the protein ADA'. Third, the location of the ends of D has to be very close to the exposed ends of A and A'. Fourth, D may have different portions of its surface buried and exposed in ADA' than in CDC'. Since buried portions are hydrophobic and exposed portions are polar, this is a problem. Fifth, A and A' may have different portions buried and exposed in ADA' than they did in AA' or ABA'. Polar or charged side chains cannot be buried without destabilizing a protein. When such side chains are on the surface of a protein, they form hydrogen bonds with water. When these side chains are buried, they are no longer exposed to water. If they cannot form hydrogen or ionic bonds, the protein will be destabilized. Such bonds require a close proximity of polar with polar side chains and positive or partial positive charges with negative or partial negative charges. This requires a close matching of the properties of the buried surfaces, which is unlikely to occur. Furthermore, even if the surfaces are matched, the protein is unlikely to fold with so many buried polar side chains; such side chains tend to occur at or near the surface. Also, as noted before, too many hydrophobic side chains cannot be exposed or the proteins will stick together. Sixth, even if A, D, and A' have the proper portions buried and exposed, the hydrophobic side chains on D may not mesh properly with those on A and A'. Recall that even replacing one hydrophobic side chain by another tends to destabilize a protein. Seventh, if ADA' has the same shape as AA' (or ABA') except for the portion added, and if the active sites on ADA' are not affected by D, then ADA' will have the same function as AA'. Thus this insertion or replacement does not benefit the organism, and is likely to be removed from the population. If the active sites of AA' or ABA' are affected by D, then the protein ADA' is likely to have its function impaired. Eighth, there are a large number of such insertions or replacements and not many ways a given protein ADA' can be produced. It is possible that a few of the proteins produced by such an insertion or replacement could satisfy all of these constraints. However, the probability that any of them will benefit the organism is very small. Substitutions are thought to be beneficial at most one time in 1000, and probably much less. Larger changes in a highly organized system such as life are even less likely to be beneficial. The probability that some such new protein will be beneficial is therefore substantially less than one. If proteins evolved, then many shape changes had to take place, so for evolution to account for them this probability would have to be very nearly one.

The population size is not even a factor here; the factors constraining this process depend only on the shapes of the proteins and the locations of their active sites. These will be the same in all members of a species, so if the small number of possible new shapes does not contain one that benefits an organism, this will be true no matter how large the population is.

In addition, D will have evolved to be adapted to the protein in which it already exists. Moving it to a new protein will place it in a role for which it is not adapted, which will almost certainly result in harm to the organism. Finally, there are many constraints on protein folding in addition to the requirements that hydrophobic side chains be in the interior and non-hydrophobic side chains be on the exterior. The chance that D, moving from one protein to another, will satisfy these additional constraints is very small.

As evidence of this, most mutations that change the amino acid are harmful. This implies that even exchanging one hydrophobic side chain on the inside of a protein for another is likely to be harmful. Also, even the same domain (folding the same way) in two different organisms or proteins is likely to have many amino acids different, sometimes almost all of them, including the hydrophobic side chains on the inside. Therefore, taking a domain from one protein and putting it in another is going to result in hydrophobic side chains at the interface that do not mesh with each other. The chance that the resulting protein will be stable and fold correctly is very small.

Another problem is that the kinds of mutations that can result in the moving of a domain from one protein to another are either very improbable or almost certainly harmful to the organism. The moving of a domain from one protein to another also cannot explain how domains could join together in the first place. It is reasonable to assume that domains were originally separate and covered with polar side chains. They could not have joined without burying many of these side chains. This kind of transfer between proteins also cannot explain how a domain could ever become part of a protein with different portions buried and exposed than it had originally.

The number of possible insertions of DNA from one gene to another is quite large. Suppose an organism has at least 200 genes, and each gene has about 1000 base pairs of coding DNA. Thus there would be at least 200,000 base pairs in all. Assuming an insertion of 10,000 or fewer base pairs, the portion inserted could begin at about 200,000 sites and have any of 10,000 lengths, for about 2*10⁹ such subsequences altogether. There are not likely to be many repetitions among them except for repetitive segments of DNA that do not have enough diversity to generate many new protein shapes. Thus, if a protein can be constructed by such insertions, it cannot be done in many different ways, so the number of subsequences limits the probability of this happening. Each such subsequence could be inserted in about 1000 sites in a gene. The probability of getting the right site is thus about 10^-3. The chance of getting the right sequence in the right site is then about 10^-12. Suppose the right subsequences exist somewhere in the genome to generate a new protein shape (which is highly unlikely) and five such insertions are needed to get a new stable beneficial protein shape. The chance of this is about 10^-60, much too small, even ignoring many other factors such as the fact that most neutral mutations are soon lost from a population, the need for mutations to the active sites on the protein, and the fact that such insertions would be rare.
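The counting in this paragraph can be laid out step by step. The sketch below uses only the round figures assumed above (200 genes, 1000 coding base pairs per gene, insertions of up to 10,000 base pairs, 1000 target sites, five insertions); the function and variable names are illustrative and not from the text.

```python
def insertion_estimate(genes=200, coding_bp_per_gene=1000,
                       max_insert_length=10_000, sites_per_gene=1000,
                       insertions_needed=5):
    """Count candidate insertions using the round figures assumed in the text."""
    total_bp = genes * coding_bp_per_gene          # possible start positions
    subsequences = total_bp * max_insert_length    # start position x length
    placements = subsequences * sites_per_gene     # subsequence x target site
    combined = placements ** insertions_needed     # every insertion must be right
    return total_bp, subsequences, placements, combined

total_bp, subsequences, placements, combined = insertion_estimate()
print(f"coding base pairs:       {total_bp:.1e}")
print(f"candidate subsequences:  {subsequences:.1e}")
print(f"subsequence/site pairs:  {placements:.1e}")
print(f"five-insertion outcomes: {combined:.1e}")
```

With these defaults the unrounded count of subsequence and site pairs is 2*10¹² rather than 10¹², so the five-insertion figure comes out somewhat smaller than the rounded 10^-60 quoted in the text.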

The number of possible replacements is even larger; each involves a deletion followed by an insertion. In a gene of 1000 base pairs, there are about 10⁶ possible deletions, and even more if one also considers longer deletions. As above, there are about 10¹² possible insertions. The product of these two figures is 10¹⁸, and allowing for the longer deletions gives 10¹⁹ or more such replacements altogether. The chance of getting the right one (assuming this cannot happen in many ways) is then about 10^-19. The probability of getting two such replacements right is about 10^-38, which is near the limit of 10^-40. When one also considers that such exchanges of material are very rare, and that the final protein has a small chance of being beneficial, the probabilities are too small. Therefore, only one such transfer of genetic material is feasible, so the total number of new folds obtainable this way and satisfying all constraints will be small.
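The replacement count can be sketched the same way. The names below are illustrative. The figure of 10⁷ deletions in the second call is a hypothetical allowance for longer deletions, chosen only so that the product matches the 10¹⁹ used in the text; the text itself gives no specific number for them.

```python
def replacement_estimate(deletions, insertions=10**12, replacements_needed=2):
    """Return (ways to make one replacement, ways to make all of them)."""
    per_replacement = deletions * insertions       # a deletion, then an insertion
    return per_replacement, per_replacement ** replacements_needed

# The two round figures quoted in the text: 10^6 deletions, 10^12 insertions.
short_only = replacement_estimate(deletions=10**6)

# Hypothetical: ten times as many deletions once longer ones are allowed.
with_longer = replacement_estimate(deletions=10**7)

print(short_only == (10**18, 10**36))
print(with_longer == (10**19, 10**38))
```

The two quoted figures alone multiply to 10¹⁸ per replacement, and the larger count reproduces the 10^-19 and 10^-38 probabilities given in the paragraph.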

Transfer of genes from one organism to a different one is not likely to help in the generation of new protein shapes, because such a transferred protein would have to be beneficial in both organisms, adding considerably more constraints. This could happen if the protein only involved biological mechanisms that were common among many organisms, but then the probabilities are about the same as if all these organisms belonged to the same population.

Unless the evolution of proteins of new shapes is possible, evolution is blocked. All scenarios for protein evolution have been shown to be mathematically impossible, under reasonable assumptions.

References

Baker, D. and Sali, A., Protein Structure Prediction and Structural Genomics (Science, Vol. 294, 5 October 2001), pp. 93-96.
Berndt, K., Protein Secondary Structure, in: Principles of Protein Structure using the Internet, Birkbeck College, University of London, 1996.
Blanco FJ, Angrand I, Serrano L, Exploring the conformational properties of the sequence space between two proteins with different folds: An experimental study, Journal of Molecular Biology 285 (2): 741-753, Jan 15 1999.
Bordo D, Argos P, Suggestions for safe residue substitutions in site-directed mutagenesis, Journal of Molecular Biology 217 (4): 721-729, Feb 20 1991.
Chen R, Greer A, and Dean AM, Redesigning secondary structure to invert coenzyme specificity in isopropylmalate dehydrogenase, Proc Natl Acad Sci U S A 1996 Oct 29;93(22):12171-6.
Chothia, C. One thousand families for the molecular biologist, Nature 357, 543-544 (1992)
Cordes MH, Burton RE, Walsh NP, McKnight CJ, Sauer RT, An evolutionary bridge to a new protein fold, Nat Struct Biol 2000 Dec;7(12):1129-1132.
Cordes, M., Walsh, N., McKnight, C.J, and Sauer, R., Evolution of a protein fold in vitro, Science 1999 Apr 9;284(5412):325-328.
Crow, J., The high spontaneous mutation rate: Is it a health risk?, PNAS Vol. 94, pp. 8380-8386, August 1997.
deGrado, Proteins from Scratch, Science 278:3 (3 October 1997) 80-81.
Denton, M. and Marshall, C., Laws of Form Revisited, posted April 4, 2001 on the Creation Science Resource Bulletin Board.
Funahashi, J., Takano, K., Yamagata, Y. and Yutani, K., Contribution of amino acid substitutions at two different interior positions to the conformational stability of human lysozyme, Protein Engineering 12:10 (1999) 841-850.
Terry M. Gray, Eric J. Arnoys, Stephen Blankespoor, Tim Born, Rebekah Jagar, Rebecca Everman, Darla Plowman, Angela Stair, and Daisy Zhang, Destabilizing effect of proline substitutions in two helical regions of T4 lysozyme: Leucine 66 to proline and leucine 91 to proline, Protein Science (1996), 5: 742-751.
Grishin, N.V., Fold change in evolution of protein structure, Journal of Structural Biology 134:2-3 (2001) 167-185.
Himmelreich, R., H. Plagens, H. Hilbert, B. Reiner, and R. Herrmann, "Comparative analysis of the genomes of the Bacteria Mycoplasma pneumoniae and Mycoplasma genitalium," Nucleic Acids Research 25 (1997): 701-712.
Kajander T, Kahn PC, Passila SH, Cohen DC, Lehtio L, Adolfsen W, Warwicker J, Schell U, Goldman A, Buried charged surface in proteins, Structure 8:11 (2000) 1203-1214.
Kimura, The Neutral Theory of Molecular Evolution, Cambridge University Press, Cambridge, 1983, pp. 199,210,212,296,321.
Koshi, JM and Goldstein, RA, Mutation matrices and physical-chemical properties: Correlations and implications, Proteins - Structure, Function, and Genetics 27:3 (1997) 336-344.
Krowarsch, D. and Otlewski, J., Amino-acid substitutions at the fully exposed P₁ site of bovine pancreatic trypsin inhibitor affect its stability, Protein Science 10 (2001) 715-724.
Lindgard, P. and Bohr, H. How many protein fold classes are to be found? in Protein Folds (eds Bohr, H. & Brunak, S.) 98-102 (CRC Press, New York, 1996).
Loladze, V., Ermolenko, D., and Makhatadze, G., Heat capacity changes upon burial of polar and nonpolar groups in proteins, Protein Science 10:7 (2001) 1343-1352.
Matsumura M, Becktel WJ, Matthews BW, Hydrophobic stabilization in T4 lysozyme determined directly by multiple substitutions of Ile-3, Nature 334:6181 (1988) 406-410.
BW Matthews, Structural and genetic analysis of the folding and function of T4 lysozyme, The FASEB Journal, Vol 10 (1996), 35-41.
Matthews, B.W., Genetic and structural analysis of the protein stability problem, Biochemistry 26 (1987) 6885-6888.
Qian J, Stenger B, Wilson CA, Lin J, Jansen R, Teichmann SA, Park J, Krebs WG, Yu HY, Alexandrov V, Echols N, Gerstein M, PartsList: a web-based system for dynamically ranking protein folds based on disparate attributes, including whole-genome expression and interaction information, Nucleic Acids Research 29 (8): 1750-1764, Apr 15 2001.
Rath, A. and Davidson, A.R., The design of a hyperstable mutant of the Abp1p SH3 domain by sequence alignment analysis, Protein Science (2000), 9: 2457-2469, Cambridge University Press.
Russell RB, Saqi MAS, Sayle RA, Bates PA, Sternberg MJE, Recognition of analogous and homologous protein folds: Analysis of sequence and structure conservation, Journal of Molecular Biology 269 (3): 423-439, Jun 13 1997.
Service, R., Amino Acid Alchemy Transmutes Sheets to Coils, Science Vol 277 11 July 1997 p. 179.
Srinivasan, R. and Rose, G., LINUS: A hierarchic procedure to predict the fold of a protein. Proteins: Structure, Function, and Genetics, 22: 81-99 (1995).
Dapeng Sun, Limei H. Jones, F.Scott Mathews and Victor L. Davidson, Active-site residues are critical for the folding and stability of methylamine dehydrogenase, Protein Engineering, Vol. 14, No. 9, 675-681, September 2001.
Kazufumi Takano, Yuriko Yamagata, and Katsuhide Yutani, Role of amino acid residues in left-handed helical conformation for the conformational stability of a protein, Proteins: Structure, Function, and Genetics 45:3(2001)274-280.
Darin M. Taverna and Richard A. Goldstein, Why are proteins marginally stable?, Proteins: Structure, Function, and Genetics 46:1 (2002) 105-109.
Tsai, J., Gerstein, M., and Levitt, M., Simulating the minimum core for hydrophobic collapse in globular proteins, Protein Science 6:12 (1997) 2606-2616.
Vlassi M, Cesareni G, Kokkinidis M, A correlation between the loss of hydrophobic core packing interactions and protein stability, Journal of Molecular Biology 285 (2): 817-827, Jan 15 1999.
Whitman, W., Coleman, D., and Wiebe, W., Prokaryotes: The unseen majority, PNAS 95:12, June 9, 1998, pp. 6578-6583.
J.D. Wright and C. Lim, A fast method for predicting amino acid mutations that lead to unfolding, Protein Engineering, Vol. 14, No. 7, 479-486, July 2001.
