Symmetry’s role in expanding the paradigm

The classic paradigm of the genetic code embodied in the codon table is that of a simplified computer function. The code is arranged as a look-up table, and therefore it operates as a function that accepts three integer arguments and returns an integer result.

The classic paradigm:

AA = GeneticCode (B1, B2, B3)

This says that the function of the genetic code is to return an amino acid identity given any sequence of three nucleotides. B1, B2, and B3 are three variables, each representing any of the four possible nucleotide bases found in messenger RNA (A, C, G, U), and AA is a variable to store one of twenty amino acids in the standard set, plus a null value representing termination codons. So the function of the genetic code under this paradigm is one that translates sixty-four possible inputs (4³) into twenty-one outputs (twenty amino acids plus termination), and it must therefore impose a compression algorithm.
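The look-up function described above can be sketched in a few lines of Python. This is only an illustration of the classic paradigm, not anything taken from the original pages: the names `CODON_TABLE` and `genetic_code` are hypothetical, amino acids are written as one-letter codes, and `*` stands in for the null (termination) value.

```python
from itertools import product

# Standard codon table, built in the conventional U, C, A, G ordering.
# One-letter amino acid codes; "*" marks a termination codon.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def genetic_code(b1, b2, b3):
    """Classic paradigm: three bases in, one amino acid (or stop) out."""
    return CODON_TABLE[b1 + b2 + b3]


n_inputs = len(CODON_TABLE)                  # number of distinct codons
n_outputs = len(set(CODON_TABLE.values()))   # amino acids plus the stop value

print(genetic_code("U", "A", "C"))   # Y (tyrosine)
print(n_inputs, n_outputs)           # 64 21
```

Counting the stop value as an output gives twenty-one distinct results for sixty-four inputs, which is the compression the text refers to.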

This simple function is the essence of the genetic code as we currently perceive it, regardless of how we might structure the data. This paradigm has taken us a long way in investigations of biological information, but the classic paradigm has now clearly failed. There are many phenomena demanding explanation, but no plausible explanations exist within the simple, classic perspective, and adopting it so completely has had several harmful epistemic consequences. To gain better understanding moving forward we must examine every detail and question every assumption that led us to this classic viewpoint in the first place. It is also useful to consider an alternative, expanded paradigm, one that views the genetic code as a function accepting an entire nucleotide sequence (NS) for input and returning a protein structure as output.

The expanded paradigm:

Protein = GeneticCode (NS)

This has been the mantra throughout these pages: The genetic code builds proteins, not sequences of amino acids.

The classic paradigm was historically touted as the equivalent of this expanded paradigm. In other words, the classic paradigm when repeated on every codon in a nucleotide sequence would produce the function of the expanded paradigm. However, in order for this to work an axiom must be added to the classic paradigm: the axiom that primary sequence (PS) determines protein structure. And so it was.

Sequential iteration of the classic paradigm:

NS = Codons
PS = GeneticCode (NS)
Protein = PS
Protein = GeneticCode (NS)

In the words of Anfinsen, it would be a ‘considerable simplification’ if this statement of equivalence were true. Unfortunately, it has proven false in several ways. The most conclusive disproof is the experimental evidence that ‘silent mutations’ can lead to different proteins in vivo. A NS can change but still leave PS the same, and when it does, the genetic code can actually somehow produce a different protein. It is difficult to maintain the classic paradigm in light of this evidence, because the output of the genetic code should now be seen as a discrete protein structure independent of its PS.
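To make the point about silent mutations concrete, here is a minimal sketch of the iterated classic function applied to two different nucleotide sequences. The sequences `ns_a` and `ns_b` are invented for illustration and the function name `translate` is an assumption. The sketch shows only that the classic function cannot tell the two inputs apart; it says nothing about what happens in vivo.

```python
from itertools import product

# Standard codon table (one-letter amino acid codes, "*" = stop).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(ns):
    """Iterate the classic look-up over consecutive codons of an mRNA string."""
    return "".join(CODON_TABLE[ns[i:i + 3]] for i in range(0, len(ns) - 2, 3))


# Two hypothetical sequences that differ only at third codon positions.
ns_a = "CUUACCGGA"
ns_b = "CUGACAGGC"

ps_a = translate(ns_a)
ps_b = translate(ns_b)

print(ns_a != ns_b)   # True: the nucleotide sequences differ
print(ps_a, ps_b)     # LTG LTG: the primary sequences are identical
```

Under the classic paradigm both inputs collapse to the same output, so any difference in the resulting protein has to be carried by information that this function discards.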

The classic paradigm has failed because it is based entirely on codon-amino acid correlation data from the familiar spreadsheet, and this spreadsheet is missing a large amount of critical information toward making proteins. The biological information in a protein is not equivalent to that in a primary sequence, despite longstanding dogma. There is far more information in a protein than in a sequence of amino acids, and the genetic code has some mechanism for translating at least a part of that additional information from NS. The simplest reason why the spreadsheet is misleading in this regard is that codons, in reality, do not select amino acids; rather they select tRNA molecules with the corresponding amino acids attached.

There are far more tRNA than amino acids, and for that matter there are two and one half times more anticodons than codons. We all know what the codon table looks like, but what about tRNA and the anticodon table?

Furthermore, there is more than one way for each amino acid to be oriented within a sequence. All of these quantities go into the biological information system that comprises the function of the genetic code.

Information theory is not a function of subjective value judgments; it is strictly a process of counting finite states of a system. Although there are sixty-four valid codons, there are also 160 valid anticodons. This significant step toward information expansion during translation is due to a relaxation of the Watson-Crick base pairing rules at the B3 variable standing for the third nucleotide of the codon sequence. Codons map to tRNA and tRNA map to amino acids. Evidence strongly suggests that a change in codons leads to a change in tRNA, and a subsequent change in the final protein, despite the fact that there is no change in the corresponding amino acid. Therefore, tRNA are a legitimate part of the genetic code. The code functions primarily as a whole protein generator, not as an isolated amino acid sequencer. There are vastly more possible protein conformations than there are possible primary sequences.

Figure: Classic paradigm, where fGC(x) stands for ‘function of the genetic code’.

If the genetic code is to be the function responsible for protein synthesis it must have a method of selecting one protein from many possible, not just consecutive amino acids from a limited group of amino acids.

Figure: Expanded paradigm.

There is a fundamental choice to be made that extends well beyond semantics. Is the genetic code charged with sequencing amino acids or is it charged with making whole proteins from whole sequences of nucleotides? Regardless of the philosophical position taken, we can at least now clearly recognize that there is vital information involved in the process of making a protein that is not addressed by the classic paradigm, and the code must somehow go beyond specifying a sequence and into specifying a protein.

Our thinking on this matter has been the victim of reductionistic absurdity. Mountains of evidence that disprove the classic paradigm are blithely ignored, while apologists for the reductionistic failure are many. Common sense says that nature has adapted the genetic code to perform a function of protein synthesis, not amino acid sequencing. There is a real difference, and the function of the code is more accurately viewed under the expanded paradigm.

There are clues in the pattern of codon assignments to affirm this common sense. To help appreciate this evidence, let us consider a hypothetical sequence of 101 nucleotides. Under the classic paradigm this sequence contains enough information to form 33 consecutive codons, but this limited view is inaccurate. The sequence is read in triplets, so there are actually 99 codons present in three sequences shifted by a single nucleotide relative to each other. The genetic code must assign structure to all three simultaneously. It has adapted for this, and it is optimized precisely for this task.

We will refer to the reference frame as F1 and use the middle base of a codon, B2, as the reference base. Relative to F1 there is a frame shifted forward so that B2 becomes B1. This will be F2. There is a third relative frame where B2 shifts backward into B3, and we will label this F3.

There are no absolute distinctions in nature in an entire sequence of nucleotides between any group of three nucleotides and a codon; this is a relative distinction made on the basis of a reference reading frame at the time of translation. A sequence of 101 nucleotides has the information content of three proteins - or more accurately three protein segments - containing 33 amino acids apiece. In most cases there is a different amino acid assigned at every position in each of the three segments. The expanded paradigm illustrates a more accurate notion that every nucleotide in a sequence is simultaneously assigned to three different amino acids, not just one. I am going to reiterate as a point of emphasis:

Every nucleotide in a sequence is part of three codons simultaneously.
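The arithmetic of the 101-nucleotide example can be checked with a short sketch. The sequence here is random and purely illustrative, and the names `frames` and `coverage` are assumptions. The three offsets 0, 1 and 2 are used as stand-ins for the three sense-strand frames; which offset is labelled F2 or F3 depends on the choice of reference frame.

```python
import random


def frames(ns):
    """Codons read at offsets 0, 1 and 2 along one strand."""
    return [[ns[i:i + 3] for i in range(offset, len(ns) - 2, 3)]
            for offset in range(3)]


random.seed(0)
ns = "".join(random.choice("ACGU") for _ in range(101))

per_frame = [len(f) for f in frames(ns)]
total_codons = sum(per_frame)

# For each nucleotide position, count how many of those codons contain it.
coverage = [0] * len(ns)
for offset, codons in enumerate(frames(ns)):
    for k in range(len(codons)):
        start = offset + 3 * k
        for pos in range(start, start + 3):
            coverage[pos] += 1

print(per_frame, total_codons)         # [33, 33, 33] 99
print(coverage[:3], coverage[-3:])     # [1, 2, 3] [3, 2, 1]
```

Every interior nucleotide belongs to three codons. Only the two nucleotides at each end of the sequence fall short, which is an edge effect of a finite sequence rather than an exception to the rule.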

The situation is actually more complex than this, because a single nucleotide sequence is stored in a double helix, so it actually represents two nucleotide sequences - the sense strand and the anti-sense strand. The anti-sense strand is the complement of each nucleotide in the sense strand. There is increasing evidence that anti-sense reading frames of genes show up more than expected – greater than random - in reading frames on the sense strand. This actually is logical from a symmetry perspective, and it means that each nucleotide sequence really represents six potential frames: F1, F2, F3, -F1, -F2, -F3. To make it even more interesting, each sequence can be inverted, and any one of these transformations can be compounded with any other. However, for the purposes of discussion here we will focus only on the three frames of the sense strand. The arguments that apply to the sense strand can then be expanded to include the complement frames on the anti-sense strand as well.

The genetic code must assign an amino acid to every codon, so it must assign each nucleotide in every sequence three times simultaneously.

The superficial task of the genetic code was to distribute twenty amino acids across four nucleotides. However, the actual amino acids assigned to any single nucleotide depend on the context of the four nucleotides that surround it, two on each side. The genetic code should be seen as a network of assignments involving all possible contexts of individual nucleotides.

In a random, or near random sequence of nucleotides there is seemingly no way to account for this amount of context, but the genetic code has managed to do so. It uses the middle base, B2, as the primary axis of assignment, and the first base, B1, is the second most influential.

The classic data is so familiar that it may be difficult to see in a new light, but ask yourself whether this is a case of assigning one amino acid to every codon, or if it is better described as assigning several amino acids to every nucleotide. This looks to me like a pattern made to cover all contexts for each nucleotide. Each of the four nucleotides stands for some general amino acid characteristics. Adenine means hydrophilic and uracil means hydrophobic. Cytosine means tighten the backbone, while Guanine will generally loosen it.
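The hydropathy half of this claim can be checked against the standard table. The sketch below groups codons by their middle base, B2, and scores each assigned amino acid on the Kyte-Doolittle hydropathy scale. That scale is a substitution made here for illustration; it is not the color wheel of relative hydropathies used elsewhere on these pages, and the function name `middle_base_hydropathy` is hypothetical. The backbone claims for cytosine and guanine are not tested.

```python
from itertools import product

# Standard codon table (one-letter amino acid codes, "*" = stop).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Kyte-Doolittle hydropathy index (positive = hydrophobic).
KD = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5,
}


def middle_base_hydropathy(base):
    """Hydropathy of every sense codon whose middle base (B2) is `base`."""
    return [KD[aa] for codon, aa in CODON_TABLE.items()
            if codon[1] == base and aa != "*"]


for b in BASES:
    vals = middle_base_hydropathy(b)
    print(b, "mean", round(sum(vals) / len(vals), 2),
          "min", min(vals), "max", max(vals))
```

On this scale every codon with U in the middle scores as hydrophobic and every sense codon with A in the middle scores as hydrophilic, while the C and G columns are mixed.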

The amazing thing about the code is that the general characteristics of protein segments coming out of F1, F2, and F3 share far more than random elements. This is only possible through assignment symmetry. The genetic code is perfectly assigned to achieve this feat. Coincidence?

One amino acid can be assigned to every codon in an infinite number of ways, but several amino acids assigned to an individual nucleotide in 592 possible contexts is tricky as hell, and the code did a remarkably good job. It is optimized for this task, but the classic paradigm makes an optimized view of the assignment pattern difficult to see.

We are rightly impressed by the knowledge that the genetic code more-or-less achieved the goal of assigning one amino acid to every codon, which then deceives us into assuming that it is the ultimate function of the code. Far more impressive is the fact that the genetic code almost achieved the goal of assigning three – and only three – amino acids to every nucleotide in every sequence context. This is more impressive after we realize that it is numerically impossible to do this, but the code came about as close as is possible. Therefore, we might re-examine some of the basic perceptions about parameters of the genetic code.

1. There are four bases in the genetic code – False. There are five bases in the genetic code because wobble introduces a new base and new pairing rules at B3. There are four bases upstream in mRNA, but there are five bases downstream in tRNA, so information actually expands downstream. The genetic code is an information system, and the fifth base is a vital component of that information.

2. The code operates in nucleotide triplets – True. The behavior of the code is compelling toward the notion that it is all about triplets, but we must take a good hard look at what a triplet is and what a triplet does. Triplets are especially informative in light of the cyclic permutations - codon types - that they generate. However, triplets alone cannot form peptide bonds; it takes two to tango, and they must be combined to do this.

3. Because of a functional imperative, the code cannot change – False. The code is able to change, was able to change, and will always be able to change. We must view the code not as a workable kluge but as the most highly adapted structure on the planet.

4. The code must contain exactly twenty amino acids – False. There is more than one valid anticodon for every valid codon. Therefore, no logical upper or lower boundary exists for the number of amino acids in the genetic code. The actual number might have been anything between one and infinity. The classic reasoning says that it started low and went up, but I think it more logically started high and went down. It is easier to include than restrict, so I think that precision selectivity of the code is a late evolutionary development. The explanation for twenty must be one of numerical optimization. There is a risk and reward associated with every possible number and combination of amino acids, so there must be something really special about the number and combination in the standard set.

This expanded notion is that every nucleotide sequence gives the genetic code three (or six, or twelve) bites at the apple, or chances to make a useful protein. It can theoretically produce multiple unique proteins, or protein segments from any nucleotide sequence, so it would logically try to somehow maximize this opportunity.

Working from this platform, let’s examine the evidence for how the genetic code went about its task of assigning three amino acids to every nucleotide in a sequence. In an ideal world, each nucleotide would be assigned three and only three amino acids for all instances of F1, F2 and F3. This is not mathematically possible, because although there is only one instance of F1, there are four instances of both F2 and F3 that must be assigned. A seemingly possible solution is to assign the same amino acid to all four instances of F2, and another amino acid to all four instances of F3. This solution would produce three amino acids consistently assigned across nine instances, but this too is not mathematically possible. The most viable alternative is to assign a single amino acid to the only instance in F1, a single amino acid to all four instances of either F2 or F3, and create a group of amino acids that somehow approximate each other, and assign them to the remaining four instances. This is the strategy that the code has taken.

The code operates on the input of mRNA nucleotide triplets as the fundamental unit of information. Because there are only four bases available to mRNA, there are only twenty triplets on which to operate. There are sixty-four ways to arrange these triplets, but we are going to focus on the triplets themselves. If the code were able to assign a single amino acid to all instances of a nucleotide in each frame, then we would only need the assignment data from the permutations of one triplet to determine the assignment of nine codons.

AA1 = B1, B2, B3
AA2 = B2, B3, B1
AA3 = B3, B1, B2

Because it is not mathematically possible to create a data structure in this way, this formula cannot be completely accurate for nine codons in any assignment schema where the number of amino acids is greater than one. It can be accurate for six of the nine codons if there are sixteen amino acids in the schema, but there must always be a level of approximation for some property of the amino acids in at least three of the codons. Again, it is numerically impossible to assign three and only three amino acids to all instances of a specific nucleotide for all sequences of nucleotides. The goal now is to assign as many as possible, and somehow approximate the rest - for all circumstances - as well as possible.
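The count of twenty triplets and sixty-four arrangements mentioned above can be verified directly. In this sketch a triplet is an unordered selection of three bases with repetition allowed, and a codon is one ordering of a triplet. The class labels follow the terms primary, secondary and tertiary used on these pages; attaching "secondary" to triplets with exactly two identical bases is an inference made here, and the helper name `arrangements` is hypothetical.

```python
from itertools import combinations_with_replacement, permutations

# Unordered triplets over the four mRNA bases.
triplets = ["".join(t) for t in combinations_with_replacement("ACGU", 3)]


def arrangements(triplet):
    """Distinct codons (orderings) that one triplet can generate."""
    return {"".join(p) for p in permutations(triplet)}


# Classify each triplet by how many distinct codons it generates.
CLASS_BY_COUNT = {1: "primary", 3: "secondary", 6: "tertiary"}
class_counts = {"primary": 0, "secondary": 0, "tertiary": 0}
total_codons = 0
for t in triplets:
    n = len(arrangements(t))
    class_counts[CLASS_BY_COUNT[n]] += 1
    total_codons += n

print(len(triplets))    # 20
print(class_counts)     # {'primary': 4, 'secondary': 12, 'tertiary': 4}
print(total_codons)     # 64
```

The totals work out as 4 × 1 + 12 × 3 + 4 × 6 = 64, so the twenty triplets account for every codon exactly once.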

The assignment of a nucleotide in one frame cannot dictate the assignment of that nucleotide in another frame. The assignment in F1 is dictated by the random context of the two surrounding nucleotides. The same context of three nucleotides in F1 is only partial context in F2 and F3, and the rest of the context in those frames is randomly supplied by the sequence. Therefore, it would seem impossible to use the context of F1 to anticipate the random context of F2 and F3. The fact that the genetic code has managed to do this makes it worth a closer look. Given the assignment data of any permuted triplet we can say the following about the genetic code.

AA1 is always assigned to F1.

AA2 anticipates the properties of F2 for all possible random eventualities (B2, B3, A); (B2, B3, C); (B2, B3, G); and (B2, B3, U).

AA3 anticipates the properties of F3 for all possible random eventualities (A, B1, B2); (C, B1, B2); (G, B1, B2); and (U, B1, B2).

An example is given for the nucleotide triplet UAC. This is a tertiary triplet, so it can generate six distinct permutations, and each of these is assigned in the code to a different amino acid (an anti-Gamow pattern).

UAC = Tyrosine
ACU = Threonine
CUA = Leucine
UCA = Serine
CAU = Histidine
AUC = Isoleucine

F1, Reference: UAC = Tyrosine
F2, anticipated: ACU = Threonine
F3, anticipated: CUA = Leucine

F2, Possible: ACA = Threonine, ACC = Threonine, ACG = Threonine, ACU = Threonine
F3, Possible: AUA = Isoleucine, CUA = Leucine, GUA = Valine, UUA = Leucine

Another example from the same triplet:

F1, Reference: UCA = Serine
F2, anticipated: CAU = Histidine
F3, anticipated: AUC = Isoleucine

F2, Possible: CAA = Glutamine, CAC = Histidine, CAG = Glutamine, CAU = Histidine
F3, Possible: AUC = Isoleucine, CUC = Leucine, GUC = Valine, UUC = Phenylalanine
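The two worked examples can be regenerated from the standard codon table with a short sketch of the anticipation formula. The function name `shift_report` and the dictionary keys are hypothetical, and amino acids are printed as one-letter codes (Y = tyrosine, T = threonine, L = leucine, and so on) rather than full names.

```python
from itertools import product

# Standard codon table (one-letter amino acid codes, "*" = stop).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def shift_report(codon):
    """Apply the permutation formula to one F1 codon (B1, B2, B3)."""
    b1, b2, b3 = codon
    return {
        "F1": CODON_TABLE[codon],
        # AA2 = (B2, B3, B1) and AA3 = (B3, B1, B2)
        "F2_anticipated": CODON_TABLE[b2 + b3 + b1],
        "F3_anticipated": CODON_TABLE[b3 + b1 + b2],
        # What the sequence can actually supply in each shifted frame.
        "F2_possible": {b2 + b3 + n: CODON_TABLE[b2 + b3 + n] for n in "ACGU"},
        "F3_possible": {n + b1 + b2: CODON_TABLE[n + b1 + b2] for n in "ACGU"},
    }


for codon in ("UAC", "UCA"):
    report = shift_report(codon)
    print(codon, report["F1"], report["F2_anticipated"], report["F3_anticipated"])
    print("  F2:", report["F2_possible"])
    print("  F3:", report["F3_possible"])
```

Running the same function over all sixty-four codons is one way to see how often the anticipated amino acid matches, or resembles, the four possibilities in each shifted frame.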

What this demonstrates is that small portions of data contain patterns informative to the global pattern of the entire data set. It is somewhat like a holographic data set. It is an ingenious way for nature to have structured the data so that all three frames are logically related. Although the reference amino acid, AA1, is seemingly unrelated to either AA2 or AA3, the possible assignments of both anticipated sets are closely related, if not identical. The net effect is that triplets are assigned in such a way that merely those assignments anticipate the effect that random nucleotide context will have on information output by the genetic code. This is virtually impossible by chance alone.

Although the outcome of a shift on a single codon in a random sequence should be random, the outcome of the entire nucleotide sequence is pre-determined. In contrast to a point mutation, all of the information regarding any possible frameshift is already contained by the sequence before the shift. By permuting the assignments of any given triplet, one can anticipate some property of the amino acid already assigned to an unknown outcome, despite the fact that eight possible random outcomes must be accounted for. More importantly, this function must hold to some degree across sixty-four distinct codons, and the simple formula is in fact remarkably consistent throughout the entire code.

Even partial effectiveness of this formula is no simple trick to pull off for any data set. It is like trying to fill out a complex three dimensional magic square, in the dark, with virtually no starting knowledge or constraints. In any case, the ability of a codon assignment pattern to achieve any level of frameshift anticipation, let alone a high level, must be the mother of all combinatorial optimizations, and it is stunning when the components required to get to that level are properly understood.

To help visualize the concept we must appreciate the task presented to nature by the sets of numbers involved. Each F1 codon permutation has four possible outcomes in F2, and four more possible outcomes in F3. But each triplet class and codon type has a different fingerprint, or statistical scatter pattern in the three frames, as illustrated in these Rafiki maps.

Primary, secondary and tertiary triplets all have different fingerprints when shifting into either F2 or F3. Homogeneous and heterogeneous multiplets make good reference units toward understanding shift distribution patterns across the entire code. The above diagrams illustrate some of the numeric realities in the structure of the code, but the observed shift patterns are confusing to describe, so they are much easier to spot in the diagrams.

F1 codons will always map into the four codons of a single F2 multiplet. If the original codon is part of a homogeneous multiplet then it will stay in the original major pole. If the original codon is part of a heterogeneous multiplet then it will shift into the adjacent pole. F1 codons will all map into a codon from four different F3 multiplets, but all four codons from an F1 multiplet will map into the same four codons of four different F3 multiplets. The scatter of a single F1 codon is tight into F2, but spreads into F3. Conversely, the scatter from an F1 multiplet is spread into F2, but is unitary into F3. These are the features that the genetic code took advantage of in putting reliability into the simple anticipatory formula. More importantly, these are the universal numeric patterns that primarily selected for the patterns in codon assignments seen in the genetic code.

Note that these diagrams graphically demonstrate that two completely different codon assignment strategies are required in order for the anticipatory formula to work at all. First, the assignments must obey multiplet boundaries to anticipate F2. Second, all of the multiplets must be coordinated in such a way to send four possible codons to four different multiplets in F3. Further note that the reliability of the formula will be degraded by every amino acid that is added to the set, and the degradation should accelerate rapidly above twenty. This is because there are only twenty anticipatory data sets available to the formula, corresponding to the twenty distinct triplets allowed within the structure of the code. From this purely numeric perspective, twenty is an optimized value within the genetic code, and the arrangement of this specific set of twenty is optimized for the mutual anticipation of the properties of three different frames of reference. This might be the basis of the super symmetry in codon assignments that has been identified by several mathematicians and nuclear physicists.

For the code to be considered as an optimization we must consider all amino acid parameters, under all circumstances, and when applied to all possible nucleotide sequences. It is not realistic to illustrate such a broad concept in a single diagram, but we can select one important parameter to move the illustration forward. Water affinity of each amino acid plays a critical role in determining the final conformation of every protein. Therefore, it is logical that the genetic code should take this into account in forming an optimal codon assignment pattern. To help visualize how the code actually did this I will use the color wheel of relative hydropathies. A plot of the actual codon assignment data using these colors shows the unmistakable fingerprint of F2 in the genetic code.

This pattern is not surprising, because it is recognizable in virtually any treatment of the genetic code. However, it means that AA2 will always reliably anticipate F2. This oft recognized multiplet pattern is the consequence of the B3 symmetry in multiplet assignments noted for decades. The pattern was attributed primarily to wobble, and also proposed as a buffer to point mutations. But these are probably not the forces that drove it initially. If wobble were to be the optimizing force, we should see the fruits of wobble, which would show up as an optimum number of reduced tRNA. No organism demonstrates anything near this number, and it is very likely that organisms actually have more than sixty-four distinct tRNA molecules swimming in the soup of their cells. Wobble helps keep tRNA populations down, but we don’t see anything approaching thirty-one tRNA, which is the optimized minimum. Point mutations cannot be the driving force either, because they create no real pattern at all. It is unlikely that this relatively trivial and virtually non-existent pattern drove the pattern of codon assignments.

Remember, the multiplet assignments are merely step one of a two-step process. Point mutations in the third position are in fact covered, but this alone probably isn’t significant enough to drive that pattern. The second step is to interweave these multiplet assignments so that they anticipate F3. This is much tougher to do and much tougher to visualize, but the Rafiki map is unmatched for this purpose. By applying the anticipation formula to all twenty triplets in the Rafiki map we achieve an F3 color rotation on the map in the following way.

This is simply a mathematical manipulation of the data as directed by the F3 anticipation formula described above. I have applied this formula to all 64 codons and used the same color scale for water affinity. When all twenty triplets are rotated by the simple anticipatory formula, an F3 Rafiki map appears as follows.

The multiplets are now assigned in a notably consistent manner with respect to water affinity. The UUU pole is now seen as completely assigned to the most hydrophobic amino acids in anticipation of F3, and the AAA pole anticipates amino acids from the hydrophilic spectrum. This pattern is difficult to detect in classic spreadsheets, but it is undeniable in this plot.

However, this color assignment isolates only a single amino acid parameter, and it is convincing mostly in these two poles. The remaining two poles have a decent but somewhat subtler result for hydropathy in F3, but those two poles are dominated by a completely different peptide parameter. The CCC pole houses proline, and the GGG pole houses glycine. These are the two most significant amino acids with respect to the steric properties of peptide backbones. The axis between these two poles optimizes primarily on steric parameters, and secondarily on hydropathy, whereas the AAA-UUU axis is primarily organized by hydropathy. It’s really hard to know what arginine is all about under any context. It is the rebellious teenager in the code family.

The four primary triplets (AAA, CCC, GGG, UUU) play a unique and significant role in codon assignments. They are mathematically different from the other sixteen triplets based on their pattern from F1 into F2 and F3. Primary triplets have a one in four chance of remaining unchanged in each frame, but all other triplets must always create new permutations.

In the case where a homogeneous multiplet is symmetrically assigned to a single amino acid, that amino acid will remain the same in six of nine possible instances. Therefore, primary triplets in conjunction with homogeneous multiplets can provide backbone benchmarks in correlating the three protein structures, F1, F2 and F3. This correlation is made along two axes: the first is relative hydropathy, and the second is the degree of steric freedom in peptide bonds.
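The six-of-nine figure can be counted directly for the four primary triplets. The nine instances are taken here to be the F1 codon itself, the four possible F2 codons (B2, B3, N) and the four possible F3 codons (N, B1, B2); the function name `retained_instances` is an assumption made for this sketch.

```python
from itertools import product

# Standard codon table (one-letter amino acid codes, "*" = stop).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def retained_instances(codon):
    """Count how many of the nine F1/F2/F3 instances keep the F1 amino acid."""
    b1, b2, b3 = codon
    aa = CODON_TABLE[codon]
    instances = ([codon]
                 + [b2 + b3 + n for n in "ACGU"]    # possible F2 codons
                 + [n + b1 + b2 for n in "ACGU"])   # possible F3 codons
    return sum(CODON_TABLE[c] == aa for c in instances), len(instances)


for primary in ("GGG", "CCC", "AAA", "UUU"):
    kept, total = retained_instances(primary)
    print(primary, CODON_TABLE[primary], kept, "of", total)
# GGG G 6 of 9
# CCC P 6 of 9
# AAA K 4 of 9
# UUU F 4 of 9
```

Glycine and proline, which each occupy a whole multiplet, reach six of nine, while lysine and phenylalanine, which share their multiplets with another amino acid, reach four.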

Proline is incomparable in its degree of bond rigidity, and glycine is unmatched for steric freedom. By having solid homogeneous multiplet assignments, these two amino acids display the strongest possible symmetry within the framework of the code. As a consequence, they are exceptional for their resistance to shift replacement, and therefore they are the primary benchmarks for correlating protein structures across F1, F2, and F3. Hydropathy is not far behind in terms of symmetry of assignments and importance to a protein folding benchmark between frames. As we have seen, hydropathy dominates the other axis of correlation across the three frames.

In this context, the four major poles provide a structural template into which at least three distinct protein descriptions can be predefined for any nucleotide sequence. The genetic code has discovered this numeric curiosity and taken advantage of it. However, the code has one more clever trick up its multi-frame sleeve. Note that F2 can be anticipated by individual codons mapping into a single multiplet, but an entire multiplet can only be anticipated as one of four codons mapped into four different multiplets. These numeric relationships will not allow specific amino acid substitutions to be anticipated. However, when one codon is used preferentially from a multiplet then a specific amino acid substitution becomes guaranteed.

Codon bias is a long recognized pattern in nucleotide sequences, but less familiar are the specific numeric consequences in the assignments given three relative frames in the genetic code. F2 reflects a much higher degree of F1 structure because of codon bias. A history of repeated use of frameshifts to generate novel protein structures may actually explain severe bias in various organisms. A biased genome will produce more highly related structures in F1 and F2 than will an unbiased genome. Therefore, codon bias and frameshifting are inter-related genetic phenomena.

What's symmetry got to do with this?

The pattern of assignments as unique relationships between three reading frames is a product of symmetry. Symmetry is a well-recognized part of codon assignments, but the origin and function of this symmetry has been obscure. Consider that a completely random codon assignment pattern would produce, for any given nucleotide sequence, three randomly spaced, logically unrelated proteins on the landscape of potential proteins. However, because of symmetry the three structures are logically spaced and interrelated on this landscape. Is this just a happy coincidence, or is it an optimized function of the genetic code? The odds that any random code would have this ability are infinitesimally improbable, so we can comfortably wager that the genetic code actually adapted for this as a primary function. Consider the chain of coincidences required to grant the code this rare ability.

1. It selected a set of completely interchangeable geometries in its amino acids. All amino acids are the same type of isomer, the L-type, and spatial sameness is a requirement if assignment symmetry is to be structurally effective.

2. It selected twenty amino acids, which provides the optimized combination of symmetry and diversity in assignment schemas.

3. It assigned amino acids strictly to multiplets.

4. It symmetrically interwove the multiplet assignments.

5. It assigned two important binary axes of peptide properties to the four major poles and their primary triplets.

6. It frequently employs codon bias, guaranteeing a high percentage of directed shift substitution.

All of these observed facts of the code have independent consequences and explanations, but they all must be present to achieve the remarkable relationship of assignments between F1, F2 and F3. It is therefore plausible to suspect that these features have adapted as an optimized set. Codon assignments were not made for absolute meanings; rather each one has a relative meaning to all others. No codon-amino acid correlation can be considered a ‘good’ correlation outside of the context of all sixty-four taken together. It is a network of assignments made in concert to achieve a particular optimized result.

Consider the virtually infinite number of possible genetic codes. Assuming that no code is better than any other would lead to few patterns, and no way to explain the near universality of a changeable code. Since there are remarkable patterns and near universality in the codes we see in diverse organisms, it is valid to infer that this code is exceptionally well suited for its task.

The fact that only twenty amino acids occur in the standard code, and that all of these are the same isomer, is a longstanding riddle. Why not more amino acids, or fewer, and why not a mixture of isomers? These curiosities are required within the network of assignments to ensure that the many simultaneously defined protein structures should maintain a logical distance of shared features on the landscape of all possible proteins. To find the infinitesimally small subset of potential codes that have this bizarre ability by mere coincidence boggles the mind. There must be an advantage to the code by having this exceptionally rare ability.

Note that we have not even mentioned stereochemical genetic information in this scenario. (It is way too heretical at this point, and it would cloud the separate discussion of symmetry and frameshifting.) We are basing the conclusions here only on the notion of amino acid identities, not spatial orientations. The impact of symmetry is strong on identities alone, but it becomes much stronger if stereochemistry is a real part of the information translated by the code. The two novel arguments regarding the nature of the code and its structure are independent but mutually supportive.

The classic paradigm has clouded our view of the genetic code’s achievements. Adopting it early in investigations also had us accept that there may be no rhyme or reason to the particular patterns we might detect in the data. The linear paradigm suggests that one pattern might be every bit as good as another. This is a huge drawback to thinking in terms of 'one-dimension' of information, or one degree of freedom in assignments. We are then apt to fail to appreciate the logic in the assignment patterns toward producing not merely primary sequences of amino acids but entire protein structures. The what, how and why behind the genetic code have all been partial to total enigmas in the classic paradigm. They can be better understood within an expanded paradigm.

What does the genetic code do?

It simultaneously defines multiple overlapping structures for any nucleotide sequence. How does it do this? By employing symmetry in its pattern of assignments to select many logically related structures from all possible structures. Finally, the big picture comes into better focus when we consider why it should do these things.
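The idea of one nucleotide sequence defining several overlapping readings can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names translate and three_frames are hypothetical, the labels F1, F2 and F3 are assumed to mean read offsets of 0, 1 and 2 bases, and the output is a string of amino acid identities (one-letter codes, with * for termination), not a protein structure. The table itself is the standard codon table.

```python
from itertools import product

# Standard codon table, written compactly: codons are enumerated in
# UCAG order (first base varies slowest); '*' marks a termination codon.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq, offset=0):
    """Translate an RNA (or DNA) string, starting `offset` bases in.

    Trailing bases that do not fill a whole codon are ignored.
    """
    seq = seq.upper().replace("T", "U")
    codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


def three_frames(seq):
    """Return the three forward readings of one sequence.

    Assumption: F1, F2, F3 correspond to offsets 0, 1, 2.
    """
    return {f"F{k + 1}": translate(seq, k) for k in range(3)}


if __name__ == "__main__":
    for frame, peptide in three_frames("AUGGCCUAA").items():
        print(frame, peptide)
```

The same nine bases yield three different strings of identities, one per frame. Whether those three readings are logically related is a property of the assignment pattern, which this sketch does not measure.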

To the question of why nature should have a genetic code in the first place the classic paradigm answers: To make sequences of amino acids from the information in sequences of nucleotides. Beyond the distinction between sequence and structure there is another vital division between this and the expanded paradigm. The genetic code is also charged with the task of finding proteins within sequences of nucleotides, because any useful protein must first be found in a nucleotide sequence before it is reliably and repeatedly made from it.

The genetic code is an important part of the search algorithm nature uses to find these structures, and some patterns of codon assignment will be undeniably better at finding useful protein structures than will others. There are many ways to search a random nucleotide sequence, and the genetic code has found the best way to search.

Here's a metaphor that might help. The genetic code is a die around which a string of data is wrapped. The string can wrap in many ways, but the die is cast for all wrappings in anticipation of all possible data. There are good and bad ways for precasting dies, and the genetic code has found the best set of all possible dies. Any string found to be ‘good’ in one wrapping has a much higher than random chance of being ‘good’ in the other wrappings. In a situation where many ‘good’ strings must be found as solutions for diverse and changing problems of the environment, why not use good strings as many ways as possible? If a solution works well forward, why not backward? Why not shifted or complementary? This is the essence of symmetry. Use, re-use, re-process, invert, complement, combine and re-combine - in general transform but leave elements untouched. Symmetry is transformation without change, and the code is founded on symmetry.

Think symmetry, think multi-tasking.

Spacing molecular properties in a random sequence is one thing, but spacing many related sequences is quite another thing. Imagine the task of purchasing three lottery tickets. All three have an equally dismal chance of anticipating a small collection of random events. Is there a logically superior way to buy three lottery tickets? No, unless some aspect of the first ticket is known before purchasing the second, and likewise the third. Imagine you are told that the digits on the first ticket are correct, but in the wrong order. The chances of having subsequent tickets pay off will skyrocket. In essence, this is what the genetic code has done. It knows vital things about one sequence that guide it in trying another, seemingly random sequence.
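The size of the lottery advantage can be put in rough numbers. The sketch below assumes a hypothetical ticket format that the text does not specify: an ordered string of n decimal digits, all of them different. The function names are illustrative only.

```python
from math import factorial


def blind_guesses(n_digits):
    """Number of possible tickets when nothing is known: 10 choices per position."""
    return 10 ** n_digits


def reordering_guesses(n_digits):
    """Number of possible tickets when the digits are known but not their order.

    Assumption: all digits are distinct, so every ordering is a different ticket.
    """
    return factorial(n_digits)


if __name__ == "__main__":
    n = 6
    blind = blind_guesses(n)
    informed = reordering_guesses(n)
    print(f"blind: 1 in {blind}")
    print(f"digits known, order unknown: 1 in {informed}")
    print(f"improvement: about {blind // informed} times")
```

Under these assumptions, knowing the digits of a six-digit ticket shrinks the search from a million possibilities to 720 orderings. The numbers are only as good as the assumed ticket format, but the direction of the effect is the point of the analogy.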

Imagine a computer program written so that it is processed three bits at a time, let's say a check writing program. Now imagine that the program is fed into the computer shifted one bit to the left. Would you expect the program to still write checks? Most probably not. The genetic code is set up in such a way that when this program is shifted forward or backward one bit it produces amazingly 'program-like' qualities, perhaps for unrelated tasks, like an address book or paint program. Also, the program could be fed through backwards, or the XOR equivalent of the program could be fed through the computer, and it would have a much better than random chance of performing some function. This would come in real handy when entirely new programs are required. This property of the computer system would be an efficient way to set up a search for new programs - start with a functioning program and transform it. Each nucleotide sequence is like a program for making a protein. Because of the compatible symmetries of genomes and the genetic code, new proteins are more easily found from seemingly random material.

When looking at the genetic code there has traditionally been a bias toward considering its function toward making proteins. I am saying that there should be more emphasis put on thinking about its role in finding new proteins. This function of the genetic code is probably very important and has probably played a major role in shaping its appearance.

The genetic code not only needs to have a way to make the same protein over and over again, it also needs to constantly find new useful proteins, so it must relentlessly search the landscape of all possible proteins. The genetic code is the vehicle by which life can travel from one place to another upon the protein landscape. The 'junk' DNA in a genome probably represents the most fertile search ground for new structures. They are sequences like a taffy pull of existing sequences with proven utility. By finding a harmonic network of assignments the genetic code has optimized the speed and efficiency of a search for new protein components within an existing genome.

DNA is a complex crystal.

Protein is a more complex crystal.

The genetic code translates the first crystal into the second crystal.

Life emerges as the most complex crystal.

A genome represents a vibrant crystal factory. It devotes a small percentage of resources to production, a larger percentage to management, and the greatest percentage to R&D.

Life is nature’s most aggressive search algorithm. In contrast to life, a salt crystal rarely has a need for solving problems posed by its environment, or for finding a new way to make a novel salt crystal. Life has an insatiable need to find new ways of doing things. Organic environments are constantly changing, so life is constantly finding new ways to solve the problems they pose. Morphologic 'churn' drives the process.

Consider the analogy of a whole human organism in the time of the caveman. Being the best possible hunter solved the problem of nourishment for our caveman friend. Sexual reproduction is nature’s preferred way of finding the best possible hunter from the set of all possible hunters. Rather than finding one really good hunter and then replicating it, nature chooses to combine components from two humans for the chance of creating useful novelty. This is a strategy predicated on an anticipation of extreme and rapid environmental change. The attributes of being the best possible hunter change through time. The solution of using changeless replication will quickly fail in any complex environment. Tinkering with the hunter by making random point mutations will have a low likelihood of success as well. Sexual reproduction is the symmetric solution to finding new types of hunter. Offspring are complete transformations of two parents, yet everything is the same in some sense, because the offspring is still human. This system of sexual reproduction ensures a high chance of functionality with a guarantee of novelty.

Life shuffles and sorts to solve new problems, in this way addressing the pressing demands of a constantly changing environment.

Proteins are found in much the same way as sexual reproduction finds whole organisms. First, proteins are generally made of parts called exons. Second, these exons are free to recombine in novel ways to form entirely new proteins. Third, the genetic code has an effective way to find new exons through frameshifting, complementing and inverting. Once a useful exon is located in F1 it is like knowing the partial numbers on a lottery ticket. If F1 is functional, then keeping the general template while creating an entirely new potential exon in F2, F3, inversions and complements has a significantly higher potential for leading to a successful search than does sampling a random sequence. Useful sequences are cobbled together from the frameworks of known useful sequences rather than conducting random searches from scratch. The genetic code is a required part of the scheme.
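The transformations named above (frameshifting, complementing and inverting) can be listed mechanically for any sequence. The following Python sketch enumerates them under a few assumptions: the function names are hypothetical, complementing uses standard RNA base pairing (A with U, G with C), inverting means reading the string backwards, and F1, F2, F3 are taken to be read offsets of 0, 1 and 2 bases. It only produces the candidate readings. It says nothing about whether any of them is useful.

```python
# Standard RNA base pairing.
PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}


def complement(seq):
    """Swap every base for its pairing partner, keeping the order."""
    return "".join(PAIR[b] for b in seq)


def invert(seq):
    """Read the sequence backwards."""
    return seq[::-1]


def candidate_readings(seq):
    """Return the twelve readings of one sequence: four strands times three frames.

    Each reading is trimmed to a whole number of codons.
    """
    strands = {
        "original": seq,
        "inverted": invert(seq),
        "complemented": complement(seq),
        "inverted complement": invert(complement(seq)),
    }
    readings = {}
    for name, strand in strands.items():
        for offset in range(3):
            usable = len(strand) - offset
            usable -= usable % 3  # drop bases that do not fill a codon
            readings[(name, f"F{offset + 1}")] = strand[offset:offset + usable]
    return readings


if __name__ == "__main__":
    for (strand, frame), reading in candidate_readings("AUGGCCUAA").items():
        print(f"{strand:20s} {frame} {reading}")
```

Every one of the twelve readings reuses the same bases, which is the sense in which a known useful sequence serves as a template for the search rather than a random starting point.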

The more sparse the landscape of useful proteins, the greater the advantage of logically spacing the frames relative to each other, and therefore the greater the advantage of having symmetry in the assignments. A genetic code optimized for diversity would assign sixty-four or more amino acids across all of the available codons. A genetic code optimized for consistency would assign one amino acid to all codons. The standard genetic code is optimized for diversity and consistency by symmetrically assigning a network of twenty amino acids across a network of nucleotides.
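Where the standard code sits between these two extremes can be checked by counting how many codons each amino acid receives. The short Python sketch below does this with the standard codon table written as a 64-letter string (one-letter amino acid codes in UCAG codon order, with * for termination). The function names are illustrative, and the count describes only the sizes of the multiplets, not their symmetry.

```python
from collections import Counter

# Standard codon table in UCAG order; '*' marks the three termination codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def codons_per_amino_acid(table=AMINO):
    """Count how many of the 64 codons are assigned to each amino acid."""
    counts = Counter(table)
    counts.pop("*", None)  # termination is not an amino acid
    return counts


def multiplet_sizes(table=AMINO):
    """Count how many amino acids share each multiplet size."""
    return Counter(codons_per_amino_acid(table).values())


if __name__ == "__main__":
    per_aa = codons_per_amino_acid()
    print("amino acids:", len(per_aa))
    print("sense codons:", sum(per_aa.values()))
    for size, n in sorted(multiplet_sizes().items()):
        print(f"{n} amino acids with {size} codon(s)")
```

A code built purely for diversity would show sixty-four multiplets of size one, and a code built purely for consistency would show a single multiplet of size sixty-four. The standard table shows twenty multiplets of mixed sizes spread over sixty-one sense codons.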

Rather than a blue-collar protein building recipe, the genetic code can more accurately be seen as an elegant method for searching nucleotide sequences to find useful proteins. Life itself is a search algorithm, and the genetic code is the lynchpin in the search.





Material on this Website is copyright Rafiki, Inc. 2003 ©
Last updated September 21, 2005 12:01 PM