Sunday, March 30, 2014

The primordial nature of phage genes

Recent ecological studies have shown that bacteriophage (viruses that attack bacteria) are numerically the most abundant biological entities on the planet. The estimated 10^30 viruses (mostly phage) in the oceans, if stretched end to end, would span farther than the nearest 60 galaxies. These viruses are thought to cause the turnover, by virus-related death, of 20% of the ocean's biomass per day. In shotgun sequencing of marine samples, the majority of phage gene sequences are invariably found to be novel (not corresponding to any other known gene sequences). Hence, the bulk of genetic diversity on the planet may well be tied up in viral/phage "dark matter."
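The end-to-end figure is easy to sanity-check with a little arithmetic. The sketch below assumes an average particle length of about 100 nm, which is my own round number and not a figure from the studies mentioned; only the count of 10^30 comes from the text.

```python
# Back-of-the-envelope check: 10^30 virus particles laid end to end.
# ASSUMPTION (not from the post): each particle is about 100 nm long.
N_VIRUSES = 1e30
PARTICLE_LENGTH_M = 100e-9          # assumed average particle length, in metres
METRES_PER_LIGHT_YEAR = 9.4607e15   # one light-year in metres


def chain_length_light_years(n_particles, particle_length_m):
    """Total length of n particles placed end to end, in light-years."""
    return n_particles * particle_length_m / METRES_PER_LIGHT_YEAR


length_ly = chain_length_light_years(N_VIRUSES, PARTICLE_LENGTH_M)
print(f"{length_ly:.2e} light-years")
```

Under that assumption the chain comes out at roughly ten million light-years, several times the distance to the Andromeda galaxy (about 2.5 million light-years). A different assumed particle size scales the answer proportionally.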

One of the most-studied bacteriophage classes is the so-called "T-even" (T2, T4, T6, etc.) class of phages, of which the poster child, arguably, is T4. These are phages that attack enteric bacteria (E. coli and its relatives), hence are commonly found in sewage.
T4 phage morphology.

T4 is interesting from a number of standpoints, not least of which is its distinctive head/tail morphology (see diagram). In phylogenetic studies, T4 typically shows up in basal positions on trees, meaning it is presumably ancient. Increasingly, viruses and phages are considered to be of primordial origin, possibly predating cellular life. Certainly, any theory on the origin of life has to come to grips with the fact that the major biomolecules (proteins, nucleic acids, lipids) had to exist, in some form, prior to the appearance of the first cell. Some experts suggest nucleic acids and proteins may have interacted with each other in a so-called Virus World scenario (see the excellent paper by Koonin) wherein microscopic hydrothermal pore systems (in mineral formations at the ocean floor) provided for sequestration of prebiotic processes in physical compartments that could be invaded by "selfish replicators."

Primordial interaction of proteins with nucleic acids (and their precursors) presumably gave rise to a number of artifacts that survive today, such as ribosomes (which contain over 50 small proteins in tight association with RNA), tRNA (RNA covalently bound to an amino acid), adenine-containing cofactors (e.g. SAMe, NADPH), and viruses (capsid and other proteins bound to RNA or DNA). Conceptually, one can think of protein/nucleic-acid complexes as having diverged, at the Darwinian Threshold, along two lines: toward ribosomal life, or toward the viral world.
Bacterial cell covered with T4 virions.

The genes for certain viral capsid proteins (with colorful names like Jelly Roll Capsid) are among a number of "viral hallmark genes" that show no homology to any genes from the cellular world. Presumably, some of these genes are of truly primordial ancestry. We have a valuable clue to the origin of at least some of these genes in the case of phage T4. A number of tantalizing reports from the 1970s (see here, here, and here) suggest that the enzymes dihydrofolate reductase and thymidylate synthase (both encoded by T4 DNA) are, in fact, components of the virion baseplate and/or tail structure of T4. Hence, at least in some cases, it's conceivable that virion structural proteins began as enzymes.

What's particularly intriguing about the T4 enzymes is that T4's thymidylate synthase (ThyA), which is phylogenetically ancient, is encoded in the phage DNA immediately downstream of the gene for dihydrofolate reductase, with no intervening "junk DNA." Why is this significant? In many organisms (as I explained in an earlier post), these two enzymes occur in a single large bifunctional enzyme that's proposed to be the result of a gene fusion event. In organisms that have the double enzyme, the reductase occurs at the beginning (the N-proximal end) of the protein.

Just for fun, I took the protein sequence for T4 dihydrofolate reductase and fused it (in Notepad) with the sequence for T4 thymidylate synthase, then did a BLAST search of the fusion sequence against all the protein sequences at UniProt.org. The naturally occurring bifunctional ThyA/dihydrofolate reductase enzymes from peach, balsam, rice, castor bean, and clementine (Citrus clementina) all showed up as hits, with E-values of 10^-67 or better.
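The Notepad step can also be scripted. What follows is a minimal sketch, not the procedure actually used: the function names are made up for illustration, and the two short records are placeholders, not the real T4 sequences. It puts the reductase first, matching the N-proximal position it occupies in the natural bifunctional enzymes.

```python
def read_fasta_sequence(fasta_text):
    """Return the sequence of a single-record FASTA string (header dropped)."""
    lines = [line.strip() for line in fasta_text.strip().splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">"))


def fuse(n_terminal_fasta, c_terminal_fasta, header="fusion"):
    """Concatenate two proteins, N-terminal partner first, as one FASTA record."""
    seq = read_fasta_sequence(n_terminal_fasta) + read_fasta_sequence(c_terminal_fasta)
    # Wrap the sequence at 60 residues per line, a common FASTA convention.
    wrapped = "\n".join(seq[i:i + 60] for i in range(0, len(seq), 60))
    return f">{header}\n{wrapped}"


# Placeholder records: these short strings are NOT the real T4 sequences.
dhfr = ">placeholder_DHFR\nMKLSAI\nVAG"
thya = ">placeholder_ThyA\nMKQYLEL"

fused = fuse(dhfr, thya, header="DHFR_ThyA_fusion")
print(fused)
```

The resulting record can be pasted into any BLAST form that accepts FASTA input. Note that a plain concatenation keeps the second protein's initiator methionine, just as a manual paste would.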

This doesn't prove that the bifunctional enzymes of the peach, etc. came from T4 phage, of course, but it is consistent with the general idea that the bifunctional ThyA/dihydrofolate reductases of algae and protists could (at least in theory) have started out as phage gene fusion products.

Let's put it this way: Weirder things have been known to happen.

Saturday, March 29, 2014

Virus genes don't always come from the host

Usually, when a virus contains a certain kind of gene, and the host contains the same kind of gene, it's assumed the virus got its copy from the host. This is not a terribly safe assumption, however. In some cases it's demonstrably wrong.

For a striking example of how wrong this assumption can be, you need look no further than the tiny Chlorella alga that can be found living symbiotically inside the fresh-water ciliate Paramecium. When it's not living inside Paramecium, Chlorella is subject to infection by PBCV-1 (the Paramecium bursaria Chlorella virus).

Tiny green Chlorella cells can be seen here growing
inside Paramecium. The full Paramecium cell is
shown in the inset at lower left. (Photo by Charles Krebs.)
Both Chlorella and PBCV-1 have a gene for an enzyme called thymidylate synthase, which is the enzyme that produces thymidine monophosphate (dTMP, or just TMP), a precursor molecule for making DNA. Ordinarily, one would assume that the virus picked up the gene for this enzyme from its host at some point in the past. But there's a problem.

The only thymidylate synthase gene in Chlorella's genome codes for a protein with 508 amino acids. The PBCV-1 virus version of this gene codes for a much shorter protein with only 216 amino acids. It turns out there's a perfectly good explanation for the size difference. Like other small algae (such as Micromonas, Ostreococcus, and Bathycoccus) and certain protozoans as well, Chlorella has evolved a bifunctional enzyme. In Chlorella, the same enzyme acts as both a thymidylate synthase and as a dihydrofolate reductase. In most higher organisms, two different enzymes carry out these functions. Organisms that have the dual-function enzyme are presumed to have developed this capability through a gene fusion event sometime in the (most likely distant) past.

It turns out the PBCV-1 virus synthase not only isn't bifunctional, it carries out its thymidylate reaction by an entirely different mechanism than that used in the host enzyme. The host enzyme employs folate (but no flavins) as a cofactor, whereas PBCV-1 is strictly dependent on flavin adenine dinucleotide (FAD), as verified experimentally by Graziani et al. in 2006. We now know that many bacteria use the FAD version of this enzyme (often called ThyX, as distinguished from ThyA, the folate-only enzyme). And the FAD users all have relatively small thymidylate synthases, of about 200 to 300 amino acids.

The above scenario isn't exclusive to Chlorella and PBCV-1. It turns out that certain other small algae (Micromonas, Ostreococcus, and Bathycoccus, all of which happen to be salt-water algae) have a bifunctional thymidylate synthase, yet they are subject to infection by viruses that use the much smaller, mono-functional flavin-binding enzyme.

In all these cases, the virus uses an entirely different style of enzyme than the host to carry out TMP production. There is essentially zero chance that the virus derived its enzyme from the host (or vice versa), because the reaction mechanisms of ThyA and ThyX are radically different. (For more detail on this, see the excellent review article at http://www.ncbi.nlm.nih.gov/books/NBK6401/.) These aren't orthologues; these aren't paralogues; these are entirely different enzymes.

So where did the virus get its thymidylate synthase from, if not the host?

If you take the protein sequence for the PBCV-1 thymidylate synthase and run a BLAST search at UniProt.org, the best non-viral hits (in the range of 58% identities, 77% similarities, E-value 10^-69) are for the thymidylate synthases of Prochlorococcus marinus and other cyanobacteria, with cyanophages also scoring high. This makes a great deal of sense, because the photosynthetic Prochlorococcus and its relatives are thought to be some of the most ancient bacteria on earth (possibly going back 3.8 billion years). They're thought to be the ancestors of chloroplasts. At one point, they were almost certainly the predominant life form in the oceans. Since phycodnaviruses (of which PBCV-1 is a member) are thought to be quite ancient, it's entirely possible they got their thymidylate synthase from cyanobacteria. That's certainly what the protein-sequence evidence suggests.

I'll go with the evidence.

Friday, March 28, 2014

A virus with a metabolic gene

Everybody knows viruses aren't alive; or at least, virions (extracellular viral particles) aren't alive. A virus needs a host in which to multiply. Once inside the host, the virus hijacks host processes to its own ends. So typically, a virus's genome contains genes for capsid proteins, replication enzymes, nucleases for breaking down the host's nucleic acids, proteases for breaking down proteins, and so on.

The last thing in the world you'd expect to find in a viral genome is a bona fide metabolic gene. But guess what? That's exactly what you find in the DNA of certain marine viruses that attack some of the world's smallest algae cells, namely algae of the Ostreococcus and Micromonas varieties.

Electron microscopy of infected Ostreococcus tauri cells. The bar represents 500 nanometers in photos A through D; in E and F, the bar is 50 nm. Virus particles are shown with arrows. Chl–chloroplast; Cyt–cytoplasm, n–nucleus, m–mitochondrion, Sg–starch grain. B & C show viruses accumulating in the cytoplasm before cell lysis occurs. In D, virus particles clump together around a lysed cell. In E, a full virus particle is stuck to the cell. F shows an empty particle left on the cell surface after injection of its contents into the cell. From Derelle et al., "Life-Cycle and Genome of OtV5, a Large DNA Virus of the Pelagic Marine Unicellular Green Alga Ostreococcus tauri," PLoS ONE, 2008.

Ostreococcus is unusual in being a full-blown marine eukaryote that's smaller, physically, than some bacteria. At less than a micron in diameter, Ostreococcus has room for exactly one mitochondrion, one chloroplast, a nucleus containing around 13 million base-pairs of DNA, a starch grain, and an overnight bag containing some cytoplasm. It's crowded in there.

It turns out, Ostreococcus is vulnerable to attack by a number of viruses. The viruses are surprisingly large (with around 200K base-pairs of DNA), but the real surprise is what's in the viral genome: a true metabolic gene, pfkA, which encodes the enzyme phosphofructokinase (PFK).

PFK is a key enzyme of glycolysis, the anaerobic energy pathway that converts glucose to pyruvate and ATP. If you ask a biochemist to name an enzyme that's stereotypically metabolic, chances are pretty good she'll name PFK. It's the poster-child of metabolic enzymes.

If you go to http://www.uniprot.org/uniprot/E4WM35 and click the Blast tab, then click the BLAST button, you'll run a search against millions of protein sequences at UniProt.org (using the O. tauri virus PFK protein sequence as a query). What you'll get back is something like this:


Top Hits against O. tauri virus 6-phosphofructokinase

Organism | Length | %ID | Score | E-value | Gene Identifier
Ostreococcus tauri virus 2 | 282 | 100.0% | 1,458 | 0.0 | OtV2_159
Ostreococcus lucimarinus virus OlV4 | 282 | 91.0% | 1,347 | 0.0 | OLOG_00278
Ostreococcus lucimarinus virus OlV1 | 282 | 91.0% | 1,344 | 0.0 | OlV1_173
Ostreococcus lucimarinus virus OlV3 | 282 | 89.0% | 1,323 | 0.0 | OMVG_00088
Ostreococcus lucimarinus virus OlV6 | 282 | 89.0% | 1,323 | 0.0 | OLVG_00080
Ostreococcus lucimarinus virus OlV5 | 282 | 89.0% | 1,318 | 0.0 | OLNG_00083
Ostreococcus tauri virus 1 | 282 | 84.0% | 1,251 | 6.0×10^-171 | OTV1_172
Ostreococcus virus OsV5 | 282 | 83.0% | 1,234 | 2.0×10^-168 | OsV5_197f
Ostreococcus tauri virus RT-2011 | 286 | 55.0% | 808 | 1.0×10^-103 | OtV6_175
Micromonas sp. RCC1109 virus MpV1 | 287 | 51.0% | 764 | 5.0×10^-97 | MpV1_177
Micromonas pusilla virus SP1 (MpV-SP1) | 269 | 40.0% | 532 | 2.0×10^-62 | MPXG_00096
Micromonas pusilla virus PL1 | 269 | 40.0% | 519 | 2.0×10^-60 | MPWG_00076
Actinoplanes friuliensis DSM 7358 | 435 | 28.0% | 246 | 2.0×10^-20 | AFR_30340
Actinoplanes sp. N902-109 | 442 | 27.0% | 238 | 2.0×10^-19 | L083_5877
Paraprevotella clara CAG:116 | 325 | 30.0% | 229 | 1.0×10^-18 | BN471_01612

All of these hits except the last 3 are viral PFK proteins. The last 3 organisms in the table (representing the best non-viral hits) are bacteria. Notice that the %ID (percentage of identical amino acids in the protein sequence) quickly drops off as you go from Micromonas pusilla virus to bacteria. Also notice, the viral host organisms are nowhere in sight. The viral PFK does not match the host PFK (meaning, perhaps, that one does not derive from the other, or that they do derive from each other but have diverged so far apart, over the millennia, that they're no longer similar).
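For anyone who wants to reproduce the viral/non-viral split without eyeballing the table, here is a rough sketch. The numbers are copied from the table above. The `Hit` type, the function name, and the keyword test on organism names are my own illustrative shortcuts, not something UniProt provides.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    organism: str
    length: int
    pct_id: float
    score: int
    e_value: float


# Values transcribed from the table of top hits.
HITS = [
    Hit("Ostreococcus tauri virus 2", 282, 100.0, 1458, 0.0),
    Hit("Ostreococcus lucimarinus virus OlV4", 282, 91.0, 1347, 0.0),
    Hit("Ostreococcus lucimarinus virus OlV1", 282, 91.0, 1344, 0.0),
    Hit("Ostreococcus lucimarinus virus OlV3", 282, 89.0, 1323, 0.0),
    Hit("Ostreococcus lucimarinus virus OlV6", 282, 89.0, 1323, 0.0),
    Hit("Ostreococcus lucimarinus virus OlV5", 282, 89.0, 1318, 0.0),
    Hit("Ostreococcus tauri virus 1", 282, 84.0, 1251, 6.0e-171),
    Hit("Ostreococcus virus OsV5", 282, 83.0, 1234, 2.0e-168),
    Hit("Ostreococcus tauri virus RT-2011", 286, 55.0, 808, 1.0e-103),
    Hit("Micromonas sp. RCC1109 virus MpV1", 287, 51.0, 764, 5.0e-97),
    Hit("Micromonas pusilla virus SP1 (MpV-SP1)", 269, 40.0, 532, 2.0e-62),
    Hit("Micromonas pusilla virus PL1", 269, 40.0, 519, 2.0e-60),
    Hit("Actinoplanes friuliensis DSM 7358", 435, 28.0, 246, 2.0e-20),
    Hit("Actinoplanes sp. N902-109", 442, 27.0, 238, 2.0e-19),
    Hit("Paraprevotella clara CAG:116", 325, 30.0, 229, 1.0e-18),
]

# ASSUMPTION: a name containing "virus" or "phage" marks a viral entry.
VIRAL_KEYWORDS = ("virus", "phage")


def is_viral(organism):
    """Crude name-based test for viral entries (heuristic, not a taxonomy lookup)."""
    name = organism.lower()
    return any(keyword in name for keyword in VIRAL_KEYWORDS)


viral = [h for h in HITS if is_viral(h.organism)]
non_viral = [h for h in HITS if not is_viral(h.organism)]

weakest_viral = min(viral, key=lambda h: h.score)
best_non_viral = max(non_viral, key=lambda h: h.score)
identity_drop = weakest_viral.pct_id - best_non_viral.pct_id

print(f"{len(viral)} viral hits, {len(non_viral)} non-viral hits")
print(f"Weakest viral hit:  {weakest_viral.organism} ({weakest_viral.pct_id}% ID)")
print(f"Best non-viral hit: {best_non_viral.organism} ({best_non_viral.pct_id}% ID)")
print(f"Drop in %ID across the boundary: {identity_drop:.1f} points")
```

A name-based filter is fragile (a real analysis would use the taxonomy attached to each entry), but it is enough to show how sharply the hits fall off at the viral/bacterial boundary.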

There are no other glycolysis enzymes (as far as I know) in the viral genomes. So what on earth is PFK doing there?

Interesting you should ask.

First, it's been known for some time that fructose-1,6-bisphosphate (the end product of the reaction catalyzed by PFK) has the effect of delaying cell death in animal tissues. In the cell nucleus, fructose-1,6-bisphosphate isn't just a metabolic intermediate, but an important signalling molecule.

When University of Louisville scientists overexpressed PFK in HeLa cells, they observed increased cell proliferation. HeLa cells, like most eukaryotic cells, have several forms of the PFK enzyme, and one is localized to the nucleus. When the nuclear enzyme is overexpressed, it leads to increased expression of several key cell cycle proteins, including cyclin-dependent kinases (proteins that control the mitosis cycle).

When I read about the University of Louisville work, I decided to run a BLAST search against viral genomes using the CDKA1 (cyclin-dependent kinase) gene of Arabidopsis thaliana (a commonly studied plant) as a query, to see if any viruses come with their own CDK enzymes. I got 465 hits (all viral), albeit mostly of low quality (33% identities, best E-value 10^-31), for proteins variously identified as "uncharacterized protein," "putative serine/threonine protein kinase," "cyclin-domain fused to serine-threonine kinase," and so on.

Ordinarily I'd dismiss hits of this low quality level as being spurious. But experience has shown that viral enzymes are pretty much always "weak-signal" hits when probed with a non-viral query. In plain English: Viral proteins rarely show much homology with their supposed host orthologs. In this case, I'm willing to believe that a good many of the Arabidopsis CDKA1 hits do, in fact, represent cyclin-dependent kinases encoded by viruses. It's the kind of dastardly thing large DNA viruses are capable of.

Let's put it this way: If no large DNA virus encodes a cyclin-dependent kinase, I'd be very surprised. Viruses are good at figuring out how to prolong the life of a cell that doesn't even know it's dead yet.

Phosphofructokinase proves it.

Wednesday, March 26, 2014

A virus, a worm, and a louse walk into a bar

Since large DNA viruses are in the business of making large amounts of DNA, it shouldn't come as a surprise that many of them carry a gene for ribonucleoside diphosphate reductase, the enzyme that allows deoxyribonucleotides (dADP, dCDP, etc.) to be created for use in deoxyribonucleic acid (DNA). The host organism, of course, has its own reductases for this purpose. But you have to imagine that when a giant DNA virus comes barging into a host cell and begins its crash program of digesting host nucleic acids into monomers (free nucleotides), the virus has a huge need to convert those monomers, quickly, into the deoxy form.

So I wasn't totally surprised to find that the genome for PBCV-1, the virus that infects Chlorella algae, contains genes for RNDR (ribonucleoside diphosphate reductase). What's surprising is that the virus brings not one, but two such genes. One gene encodes a short protein (about 370 amino acids); another gene encodes a protein with 771 amino acids. In the case of Paramecium bursaria Chlorella virus NY2A (PBCV-NY2A), which is essentially a variant of PBCV-1, there's actually a third gene, for a protein having 1,103 amino acids.

Why so many genes?

It turns out there are three major types of RNDR enzyme in living organisms, and a given organism can have more than one type. There's an aerobic enzyme (class I) that uses a tyrosine oxygen for radical generation. There's a larger (~1200 AA) class II enzyme that requires adenosylcobalamin (B12) as a coenzyme. And there's an anaerobic class III enzyme that relies on S-adenosylmethionine (SAMe) as a cofactor. Based on the relative sizes of these various enzymes, it appears the PBCV-NY2A virus may be harboring all three. However, most phycodnaviruses infecting algae seem to have class I and class III reductases, but not the bigger class II.

Human body louse.
The more-or-less standard assumption, when a virus has an enzyme that the host also has, is that the virus obtained its copy of the gene from the host (at some point in the distant or not-so-distant past). That assumption may have to be revisited for PBCV-1's class III reductase. When you do a protein alignment of the viral reductase against the sequence for the host alga's reductase, you expect to see a lot of sequence similarity. What you find in the case of PBCV-1 vs. Chlorella is that the host enzyme shares only 48% amino-acid identities with the viral enzyme. "Well," you're saying, "but that's pretty good, right?" Not so fast. When you take the virus's enzyme sequence and run a search against the entire UniProt.org database, the most similar non-viral sequence turns out to be the reductase enzyme not of the virus's host (Chlorella) but of Haemonchus contortus, the barber-pole worm, with 53% sequence identities. Also very closely matched: the reductase from Pediculus humanus, the human body louse. Three other organisms also have a closer match of their reductases to the PBCV-1 reductase than Chlorella. (See table further below.)

So did this algal-virus reductase gene actually come from a louse, a worm, or a fungus, rather than from an algal host? Not likely. What's going on here, then? Frankly, it's a mystery. For one thing, we have no way of knowing how ancient the PBCV-1 reductase gene is or how fast it has evolved over the ages, relative to the host gene. Some scientists believe the three classes of ribonucleotide reductase originally stemmed from a common ancestor that was similar to the current class III (anaerobic) enzyme. This makes sense, in that the enzyme probably first came about in a highly anoxic ocean environment, billions of years ago, well before atmospheric oxygen began to accumulate, and maybe before sea water had accumulated much dissolved oxygen gas. The PBCV-1 virus reductase may derive from this ancient design. It's possible that Chlorella and its ancestors evolved extensively over the last few hundred million years, whereas the barber-pole worm and body louse (whose ancestors got the ancient class III proto-enzyme) may not have evolved as rapidly. Therefore, the worm enzyme, the louse enzyme, and the viral enzyme may all still share similarities with the progenitor enzyme that Chlorella no longer shares.

But there are also the forces of selection to consider. Modern ribonucleoside reductases incorporate allosteric control mechanisms that fine-tune the enzyme's capabilities with respect to deoxynucleotide (and small-peptide) concentrations. For example, a 50-amino-acid region at the beginning (N-terminal) end of the enzyme allows the enzyme to be feedback-inhibited by dATP. A virus interested in maximizing the production of deoxy-nucleotides might not want or need this sort of allosteric feedback mechanism. Also, the G+C content of the viral genome is significantly lower than that of the host (40% vs. 60%), meaning that the viral enzyme might very well be optimized to produce deoxy-nucleotides in different ratios than the normal NTP-pool setpoints desired by the host. In short, it's possible to imagine that the virus's nucleotide requirements are, in fact, much more like a barber pole worm's than those of a healthy Chlorella.
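G+C content is simple to compute for any sequence you download. This is a minimal sketch with a function name of my own choosing. The two test strings are toy sequences built to give 40% and 60%, not real viral or Chlorella DNA.

```python
def gc_content(dna):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) in a sequence."""
    dna = dna.upper()
    gc = dna.count("G") + dna.count("C")
    at = dna.count("A") + dna.count("T")
    if gc + at == 0:
        raise ValueError("no A/C/G/T bases found")
    # Ambiguity codes such as N are ignored rather than counted as AT.
    return gc / (gc + at)


# Toy sequences (hypothetical, for illustration only).
low_gc_example = "ATGCATGCTA"    # 4 of 10 bases are G or C
high_gc_example = "GCGCATGCAT"   # 6 of 10 bases are G or C

print(f"{gc_content(low_gc_example):.0%} vs. {gc_content(high_gc_example):.0%}")
```

Run over whole genomes, the same calculation gives the kind of virus-versus-host comparison quoted above.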

Still, you have to admit: Nature comes up with strange bedfellows.

Here are a few protein matches between PBCV-1 (virus) reductase and other reductases:

Organism | Length | %ID | Score | E-value | Gene identifier
Paramecium bursaria Chlorella virus 1 (PBCV-1) | 771 | 100% | 4727 | 0 | A629R
Acanthocystis turfacea Chlorella virus Canal-1 | 763 | 76% | 3746 | 0 | Canal-1_104L ATCVCanal1_104L
Haemonchus contortus (Barber pole worm) | 795 | 53% | 2513 | 0 | HCOI_01437900
Pediculus humanus subsp. corporis (Body louse) | 795 | 53% | 2483 | 0 | Phum_PHUM350970
Salpingoeca rosetta (choanoflagellate) | 779 | 51% | 2479 | 0 | PTSG_01558
Pneumocystis murina (fungus) | 844 | 51% | 2479 | 0 | PNEG_03325
Schizosaccharomyces japonicus (yeast) | 834 | 51% | 2478 | 0 | SJAG_04665
Chlorella variabilis (Green alga) | 810 | 48% | 2276 | 0 | CHLNCDRAFT_32953
Cellulophaga phage phi13:1 | 789 | 47% | 2039 | 0 | Phi13:1_gp061
Cyprinid herpesvirus 3 | 806 | 45% | 2092 | 0 | CyHV3_ORF141 KHVJ151
Acanthamoeba polyphaga moumouvirus | 849 | 43% | 1947 | 0 | Moumou_00516

Length refers to the total protein length in amino acids. Percent ID means the percent of target-protein amino acids that were an exact match against (aligned) query-sequence amino acids. Score is a figure of merit for the total matching; E-value represents the expectation that the matches could have occurred by chance (zero, here, in every case; meaning, these similarities probably could not have happened by chance). Finally, the Gene Identifier will let you look up these sequences at UniProt.org or other sequence database sites.
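To make the %ID column concrete, here is a small sketch that computes percent identity from two sequences that have already been aligned (gaps written as "-"). It divides identical positions by the number of aligned columns. That denominator is an assumption on my part: different tools count it differently, so don't expect the result to match a given BLAST report to the decimal. The function name and the toy alignment are made up for illustration.

```python
def percent_identity(aligned_query, aligned_target, gap="-"):
    """Percent of alignment columns where query and target carry the same residue."""
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned sequences must have equal length")
    # Skip columns that are gaps in both sequences; they carry no information.
    columns = [
        (q, t)
        for q, t in zip(aligned_query, aligned_target)
        if not (q == gap and t == gap)
    ]
    if not columns:
        raise ValueError("alignment has no usable columns")
    matches = sum(1 for q, t in columns if q == t and q != gap)
    return 100.0 * matches / len(columns)


# Toy alignment (hypothetical): one gap and one mismatch in ten columns.
query = "MKT-AYIAKQ"
target = "MKTLAYLAKQ"
print(f"{percent_identity(query, target):.1f}% identity")
```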

For more on the subject of ribonucleotide reductases in viruses, see the review of phage metagenome RNRs at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3653736/.

Monday, March 24, 2014

Viral abduction of proteins

I have to admit, when I first saw the paper by Newcomb and Brown called "Internal Catalase Protects Herpes Simplex Virus from Inactivation by Hydrogen Peroxide" (J. Virology, 2012, 86:21; full article here), I was tempted to dismiss it as a fluke. What Newcomb and Brown found is that herpes virus appears to contain a fully functioning version of the enzyme catalase. This enzyme, which is present in human cells and, in fact, most aerobic life forms (but also some anaerobes), breaks down hydrogen peroxide to molecular oxygen and water. It detoxifies hydrogen peroxide, if you will.
Herpes simplex virus components.

The odd thing about herpes virus containing catalase is that the herpes genome does not contain a gene for catalase (and this is acknowledged by Newcomb and Brown). Thus, any catalase present in the virion has to have been made by the host cell. The enzyme is piggybacking a ride inside the virion.

It appears that, far from being a fluke, the herpes virus tegument proteins (proteins that lie just underneath the capsid proteins that make up the outer shell of the virion) have evolved in such a way as to attract or stick to catalase, sucking it along for the ride.

Having catalase on board brings survival benefit to the virus. According to Newcomb and Brown:
HSV-1 [herpes virus] was found to be more sensitive to killing by hydrogen peroxide in the presence of a catalase inhibitor than in its absence. The results suggest a protective role for catalase during the time HSV-1 spends in the oxidizing environment outside a host cell. 
In what sense would catalase protect the virus? Peroxides are damaging to DNA, and herpes is a DNA virus. Where do peroxides come from? Short answer: phagocytes (think white blood cells). When a phagocyte ingests bacteria (or any material), its oxygen consumption increases. The increase in oxygen consumption, called a respiratory burst, produces reactive oxygen species (nitric oxide, superoxides, hydrogen peroxide), which are toxic to most life forms, unless (of course) detoxifying enzymes come into play. In this case, herpes comes well-prepared for the confrontation. It brings copies of the host's own catalase.

This is an extremely clever adaptation (if that's what it is). If you're a virus, why go to the trouble of adding a dedicated catalase gene to your DNA if you can simply recruit host catalase into the capsid by suitable modification of a tegument protein?

Arenavirus can capture ribosomes in virions.
Selective entrainment of host proteins is not unknown in viruses (it's been well studied in HIV and in vesicular stomatitis virus, for example). Even in the case of catalase, it's been known since 1938 that vaccinia virus, a relative of smallpox, carries with it the host's own catalase.

Perhaps the most extreme (and startling) example of viral recruitment of host proteins into virions is provided by Arenavirus (an agent of aseptic meningitis in humans), which can package up host-cell ribosomes (see photo).

It could very well be that most large viruses, such as NCLDVs (and mid-size viruses as well; herpes is by no means large), routinely package host enzymes in their virions. As modern proteomic techniques are brought to bear on the study of virion-associated host proteins, we can probably expect many additional discoveries of this sort in the near future.

Sunday, March 23, 2014

Nucleus-like viruses and their enzymes

Recent findings in virology have forced biologists to consider many notions that just a few years ago would have seemed heretical and/or science-fiction-like. For example, there is now serious discussion of the possibility that cellular life descended from viruses (the Virus World theory; see also this paper). A growing (but still minority) viewpoint is that viruses should be considered symbionts rather than simply parasites (see the review by Villarreal). Some have dared to propose that the eukaryotic cell nucleus actually stemmed from a virus. Others have speculated the reverse: that the large DNA viruses are actually escaped, spore-like nuclei. Meanwhile, some say that during an earlier RNA World, viruses became the original inventors of DNA.

There's no question that large viruses of the NCLDV class have nucleus-like properties. Within a short time of infection, these viruses set up a complex structure inside the cell known as the virus factory, and the factory looks a lot like a cell nucleus. The authors of a recent paper on Mimivirus (the famously huge virus that infects freshwater amoeba) admitted that in previous work, they did, in fact, mistake the virus factory for the nucleus. (See photo.)

Which is the nucleus and which is the virus factory? In this photo, VP is a virus particle (of the enormous mimivirus) developing inside Acanthamoeba. A smaller virus factory (S) is just beginning to form on the left.

Macroscopic aspects aside, the large "nucleocytoplasmic" viruses (some of which infect animals and marine life, not just amoeba) bring with them many genes for enzymes that are normally found in a cell nucleus. I'm not talking about genes for DNA polymerases, topoisomerases, etc., but genes that act on small molecules. In a previous post, I mentioned the example of PBCV-1 (a virus that infects the alga Chlorella) having its own gene for aspartate transcarbamylase (ATCase), which is an enzyme that catalyzes the first committed step in pyrimidine synthesis. This enzyme (common to most living things) is predominantly found in the cell nucleus of higher organisms.

There are other examples. Many NCLDV-group viruses have a gene for deoxy-UTP pyrophosphatase, an enzyme that breaks the high-energy phosphates off dUTP so that uracil isn't accidentally incorporated into DNA. One can imagine that after a virus invades a cell and unleashes its nucleases on the cell's own RNA, many ribonucleotides (breakdown products of RNA) will be liberated; and many of these will then be reduced to deoxy-nucleotides (by ribonucleoside-diphosphate reductase) in preparation for viral DNA synthesis. As it happens, dUTP is quite easily incorporated into DNA (and is promiscuous in its Watson-Crick pairing with other nucleobases); the resulting malformed DNA can trigger apoptosis in some cells. The virus takes no chances. It brings its own dUTPase to make sure uracil never gets into its DNA by mistake.

Some viruses bring their own gene for thymidylate synthase, to bring about the conversion of dUMP to dTMP (in other words, methylation of uracil, in its deoxy-ribonucleoside-monophosphate form, to give thymidine monophosphate). Some also have a gene for thymidylate kinase, which converts dTMP (often just called TMP) to dTDP (or TDP).

Yet another "small-molecule" enzyme encoded by large DNA viruses is ribonucleoside-diphosphate reductase (RDPR). This enzyme is fundamental to the whole DNA synthesis enterprise. Its job is to convert ordinary ribonucleotides to the deoxy form that DNA needs. Without this enzyme, you can make RNA but not DNA. So it's typically found in the cell nucleus (in higher organisms).

It turns out, a gene for RDPR is contained in a great many viral genomes. When I did a BLAST search of the protein sequence for Chlorella virus ribonucleoside reductase against the UniProt database of virus sequences, the search came back with 863 hits, spanning viruses belonging not only to the NCLDV class (pox, mimivirus, phycodnaviruses, etc.) but also the Herpesviridae, plus many bacteriophage groups as well. In terms of the sheer variety of virus groups involved, it's hard to think of another "small-molecule-processing" enzyme that spans as many viral taxa. We're talking about everything from relatively small bacteriophages to mimivirus, and lots in between.

The reductase gene is so widespread, it made me wonder what its phylogenetic distribution might look like. In other words: Are viral RDPRs related to each other? Are they related to the host's own RDPR? Does the enzyme's evolution follow the viral path, or the host path?

Just for fun, I obtained a number of ribonucleoside reductase (small subunit) protein sequences for viruses, plants, animals, bacteria, fungi, and various eukaryotic parasites (using the tools at UniProt.org), then fed the results to the tree-maker at http://www.phylogeny.fr. What I got was the following "maximum likelihood" phylogenetic tree. (See this paper for details on the tree algorithm. Also, be sure to check out this nifty paper to learn more about how to read this sort of tree.)

For convenience, names of viruses are depicted in blue. Notice how, except for the Vaccinia-Variola group, which is deeply nested, most of the viral nodes are ancestral to most of the higher-organism nodes; you have to go through many levels of viral ancestors to get from the original, universal ancestor (presuming there was one) to the reductase gene of the pig, say. From this diagram, it would appear that the Pox-family reductase gene is derived, in some way, from a highly evolved host. But that's the exception, not the rule. All of the other viral genes are outgroups and/or, more usually, ancestors of one another.

Mimivirus is fairly high up the chain and shows relatedness to two very common freshwater and soil bacteria (Pseudomonas and Burkholderia).

It would be fun to go back and remake the tree, adding more organisms. (If you end up trying this, let me know the results.) For now, I'm comfortable concluding that except for pox-family viruses, the ribonucleoside reductases produced by major DNA viruses and phages are not derived from current-day hosts. A parsimonious (but not necessarily correct!) explanation is that the phage reductases are ancestral to host orthologs; but it is also possible that the phage reductases derive from very ancient hosts (not depicted in the tree), with current-day hosts appearing to derive from phage genes when in fact the similarity is to a long-ago host ortholog. In any case, the tree shows that organismal RDPRs tend to be related to organismal RDPRs and viral versions are related to viral versions. What we don't see anywhere is a viral sub-tree growing out of a host sub-tree (as would be the case if the viral enzymes simply derived from modern host enzymes).

The UniProt identifiers of the protein sequences used in this study are given below in case you want to try to replicate these results (or perhaps extend them). To retrieve the protein sequences in question, go to http://www.uniprot.org/ and click the Retrieve tab, then copy and paste the following identifiers (one to a line) exactly as shown:



O57175
P33799
M1I7H3
E5ERR7
Q6GZQ8
Q77MS0
P28847
M1I8A4
W0TWG5
Q7T6Y9
Q9HMU4
T0MT29
201403222BWOVN08AD
B3ERT4
F2II86
F2L908
U7RFH3
Q4KLN6
I3LUY0
B9RBH6
Q9LSD0
S8GD97
W4I9N3
Q4DFS6
A4HFY2
G3XP91
S8B144
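
Before pasting the list into UniProt, it can be worth checking that every entry at least looks like an accession. The following is a minimal Python sketch of such a check, assuming the accession pattern given in UniProt's help pages is still current; the names ACCESSION_RE, IDS, and split_by_format are my own, made up for illustration.

```python
import re

# Accession pattern as published in UniProt's help pages (assumed current).
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

# The identifiers listed above, in the same order.
IDS = """
O57175 P33799 M1I7H3 E5ERR7 Q6GZQ8 Q77MS0 P28847 M1I8A4 W0TWG5
Q7T6Y9 Q9HMU4 T0MT29 201403222BWOVN08AD B3ERT4 F2II86 F2L908
U7RFH3 Q4KLN6 I3LUY0 B9RBH6 Q9LSD0 S8GD97 W4I9N3 Q4DFS6 A4HFY2
G3XP91 S8B144
""".split()


def split_by_format(ids):
    """Separate entries that match the accession pattern from those that don't."""
    valid = [i for i in ids if ACCESSION_RE.fullmatch(i)]
    invalid = [i for i in ids if not ACCESSION_RE.fullmatch(i)]
    return valid, invalid


valid, invalid = split_by_format(IDS)
print(len(valid), "entries look like accessions")
print("entries needing a second look:", invalid)
```

Run against the list above, the only entry that does not fit the pattern is 201403222BWOVN08AD, which looks more like a retrieval job identifier than a protein accession. The check is purely about format; it says nothing about whether an accession is still live in the database.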

Thursday, March 20, 2014

A surprise in an algal virus

Virology (the study of viruses) is undergoing a quiet revolution. The discovery of the mammoth mimivirus and the NCLDV family of super-large viruses (with genomes equivalent in size and complexity to that of a small bacterium) has forced a reexamination of the nature and role of viruses in the biosphere.

Traditionally, viruses have been seen as stray grab bags of genetic material whose genes are limited to replication functions (plus a few structural genes for capsid proteins), presumably mostly derived from host DNA. This point of view is now officially defunct. Many viral genes have no analog in the host world, and increasingly, large DNA viruses are found to contain genes for enzymes traditionally thought of as metabolic. (See the remarkable paper by Monier et al., "Horizontal gene transfer of an entire metabolic pathway between a eukaryotic alga and its DNA virus," in Genome Research, 2009.)

The freshwater ciliate Paramecium bursaria is familiar to generations of biology students. The many green inclusions are Chlorella algae, living symbiotically inside the Paramecium.

Even knowing this, I was stunned to find, recently, while browsing proteins at UniProt.org (yes, I need to get a life), that a virus of the Chlorella alga contains a gene for ATCase: aspartate transcarbamylase. (Don't worry, I'll explain.) A dozen strains of this virus have been DNA-sequenced, and they all contain a gene for ATCase (and you can see them here).

Just so you know what the heck I'm talking about: The freshwater ciliate Paramecium (see photo) can often be found living in a symbiotic partnership with members of the algal genus Chlorella. The algae cells, living inside the Paramecium, allow the Paramecium to survive in high-sunlight/low-nutrient conditions. It's often said that the Paramecium also provides a means of locomotion for the otherwise non-motile algae. What's ironic is that the Chlorella genome has been found to contain flagellar genes (even though the alga itself doesn't swim), but that's another story.

Most organisms in this world are vulnerable to viral infection, and it turns out Chlorella is no exception. Chlorella can become infected with PBCV-1 (Paramecium bursaria Chlorella virus), which is a DNA virus with a comparatively large 330-kilobase-pair genome. That genome has an amazing 800+ open reading frames, meaning it could (in theory) encode 800 or more genes, which is huge. Most of the gene sequences correspond to "uncharacterized proteins" at this point. We don't know what most of these proteins do.

We do know what ATCase does. Aspartate transcarbamylase (also called aspartate carbamoyltransferase) is one of the best-studied enzymes in the history of enzymology. It catalyzes the first step in the biosynthesis of pyrimidines (e.g. uracil, cytosine, and thymine), which are essential for making RNA and DNA. Hence, virtually all living cells have this enzyme (even genome-reduced organisms like Buchnera aphidicola have it). But no viruses have it, except for the Paramecium bursaria Chlorella virus, that is. (In a quick check of the UniProt database, I was unable to find another virus that has this enzyme, although I found a tantalizing report in the literature from decades ago describing a several hundred percent increase in ATCase activity in virus-infected cowpea and soybean leaves.)

It's interesting that the Chlorella virus isn't happy merely to use the host's existing pyrimidine pool. It brings its own copy of ATCase to speed things along, suggesting (perhaps) cytoplasmic pyrimidine nucleotide levels may be rate limiting (a bottleneck) for this virus's replication and transcription. Other viruses solve this problem by bringing their own nucleases with which to break down host RNA and DNA. The Chlorella virus has plenty of those as well.

Certainly, if the Chlorella virus is actually making 800+ gene products, it's going to need a lot of uracil. But the virus also has genes for polysaccharide production, and uracil nucleotides are needed for those too. Whatever the reason, PBCV has decided it needs to bring its own ATCase gene.

So the $64,000 question is: Where did this gene come from? Is it derived from Chlorella's own ATCase? Is it bacterial or archaeal? Is it uniquely viral?

I ran a quick phylogenetic analysis of ATCase protein sequences from a handful of organisms using the phylogeny tools at http://www.phylogeny.fr. Here's the phylogeny tree I came up with:



Reading from the top down, the first two organisms (Halorubrum and Thermococcus) are archaeons: single-cell extremophiles. The next four organisms, ending with E. coli, are bacteria. Notice that the PBCV virus (in blue-green) comes in a branch containing (underneath it) the host cell, Chlorella variabilis, two land plants (Genlisea, which is a carnivorous plant, and Glycine max, which is the soybean), and two algae (Chlamydomonas and Volvox). The clear implication is that the viral ATCase and the modern-day Chlorella ATCase both came from an ancient ancestor that pre-dates modern plants. (Note: For tips on how to interpret phylo-trees of this sort, be sure to check out the excellent post, How to Read a Phylogenetic Tree.)

Strange and wonderful: that's virology for you.

Sunday, March 16, 2014

How quickly did life arise on earth?

One of the great mysteries of life on earth is how life was able to appear so quickly on a newly formed planet. Earth was recently confirmed to be at least 4.375 billion years old. And yet there is evidence of life existing on Earth 3.8 billion years ago. This means early earth had a chance to cool, and then form life, in a period of "only" 575 million years or so. No one knows how long the cool-down period took, but biogenesis may well have had considerably less than 575 million years in which to take place.
Francis Crick

Francis Crick (one of the discoverers of the structure of DNA) found it hard to believe DNA-based life could get started in the time available. A quip widely attributed to him, echoing Fred Hoyle's well-known junkyard analogy, runs: “You would be more likely to assemble a fully functioning and flying jumbo jet by passing a hurricane through a junk yard than you would be to assemble the DNA molecule by chance in any kind of primeval sea of soup in 500 to 600 million years. It is just not possible.”

There are two aspects to rapid biogenesis that bear emphasis. First, it's possible early earth was bombarded with comets, meteors, and/or asteroids containing significant quantities of ice and organic molecules. Meteorites similar to the Murchison meteorite (which is older than Earth) may have brought amino acids and complex organic compounds to the earth fully formed.

As recently as a billion years ago, there were no life forms bigger than a jellyfish on earth, and no vascular plants.

Also important is the fact that chemical reaction rates increase roughly exponentially with temperature (a relationship made quantitative by Arrhenius in the late 1800s). It is possible that the early chemistry leading to the precursors of life took place under high-temperature conditions, perhaps in deep ocean waters, near thermal vents, where (due to the great ambient pressure) water boils at a much higher temperature than at the surface. (Just ten meters deep, where the pressure is about two atmospheres, water boils at about 120°C, or 248°F.) Chemistry at these high temperatures and pressures would have been very rapid.
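
To get a feel for how strongly temperature matters, here is a small Python sketch based on the Arrhenius equation, k = A * exp(-Ea / (R * T)). The activation energy used (50 kJ/mol) is purely an assumed, illustrative value, not a measured figure for any prebiotic reaction, and the function names are my own.

```python
import math

R = 8.314  # universal gas constant, J/(mol*K)


def celsius_to_fahrenheit(c):
    """Convert a Celsius temperature to Fahrenheit."""
    return c * 9.0 / 5.0 + 32.0


def arrhenius_speedup(ea_j_per_mol, t1_celsius, t2_celsius):
    """Ratio k(T2)/k(T1) predicted by k = A * exp(-Ea / (R*T)).

    The pre-exponential factor A cancels out of the ratio.
    """
    t1 = t1_celsius + 273.15  # kelvin
    t2 = t2_celsius + 273.15
    return math.exp(ea_j_per_mol / R * (1.0 / t1 - 1.0 / t2))


EA = 50_000.0  # hypothetical activation energy, J/mol (an assumption)

print(celsius_to_fahrenheit(120))       # boiling point at ~10 m depth, in F
print(arrhenius_speedup(EA, 25, 120))   # speed-up going from 25 C to 120 C
```

With this assumed activation energy, the same reaction runs on the order of a hundred times faster at 120°C than at room temperature. A larger activation energy makes the effect more dramatic, a smaller one less so.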

It's interesting to note that by some estimates, all of the water in all the oceans on Earth can cycle through all the hydrothermal vents in the sea in only 10 million years. That's a relatively short cycle time compared to the 500+ million year time frame in which life appeared.

If monomeric molecules formed quickly under high-temperature/high-pressure conditions, they would possibly also have degraded quickly. Some researchers say that life may have had only a short window in which to appear before monomers broke down again into simpler solutes. In other words, high-temperature conditions favored a situation of rapid appearance of life, or no appearance.

Getting from simple monomers to stable macromolecules almost certainly required lower temperatures, since high temperatures disrupt the hydrogen bonds on which macromolecular 3D conformations depend. Nevertheless, the boiling point of water at sea level needn't be considered an impediment to the creation of life. We know that some types of microorganisms on earth can survive (and even thrive) at temperatures of 122°C. So it's theoretically possible that early life on Earth could have emerged even in boiling seas.

A harder problem (for astrobiologists) is imagining how life could form in ultra-cold conditions, as in the methane oceans on Titan. Under super-cold conditions, hydrogen bonds might be too sticky to allow conventional life to appear; instead, low-temperature bio-molecules might interact based solely on van der Waals forces. Exactly what form such molecules would take is anyone's guess.

Friday, March 14, 2014

Why do some bacteria have GC-rich DNA?

A longstanding open problem in biology is why the G+C (guanine plus cytosine) content of DNA varies so much across taxonomic groups. In theory, the amounts of the four bases in DNA (adenine, guanine, cytosine, and thymine) should be roughly equal, and regression to the mean should drive all organisms to a genomic G+C content of 50%. That's not what we find. In some organisms, like Mycobacterium tuberculosis, the G+C content is 65%, whereas in others, like Clostridium botulinum (the botulism organism) the G+C content is only 28%.
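
For anyone unsure how G+C content is actually computed, here is a minimal Python sketch. The function name and the toy sequence are made up for illustration; a real genome would be read from a FASTA file rather than typed in.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) in seq.

    Ambiguity codes such as N are ignored rather than counted.
    """
    seq = seq.upper()
    counts = {base: seq.count(base) for base in "ACGT"}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return (counts["G"] + counts["C"]) / total


toy = "ATGCGCGGCCTA"  # made-up 12-base sequence, not from any real genome
print(f"G+C content: {gc_content(toy):.1%}")
```

The same function applied to a whole genome gives the figures quoted above (65% for M. tuberculosis, 28% for C. botulinum), and A+T content is simply one minus the result.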

We know that, in general, G+C content correlates (though not perfectly) with genome size in bacteria: high G+C tends to go with a large genome. Very low G+C content usually means a smaller genome size, and in fact tiny intracellular parasites and symbionts like Buchnera aphidicola (the aphid endosymbiont) have some of the lowest G+C contents of all (at 23%).

It's not hard to understand the presence of low-GC organisms, since it's well known that most transition mutations are GC-to-AT transitions. The high prevalence of mutations in the direction of A+T has often been called "AT drift."

But some organisms go the other way, developing unusually high G+C content in their genomes, indicating that something must be counteracting AT drift in those organisms.

Recently, a group of Chinese scientists (see Wu et al., "On the molecular mechanism of GC content variation among eubacterial genomes," Biology Direct, 2012, 7:2) has advanced the notion that high G+C content is due, specifically, to the presence of the dnaE2 gene, which codes for a low-fidelity DNA repair polymerase. This gene, they say, drives A:T pairs to become G:C pairs during the low-fidelity DNA repair that goes on in certain bacteria in times of stress. Not all bacteria contain the dnaE2 polymerase. Wu et al. discuss their theory in some detail in a January 2014 article in the ISME Journal.

In earlier genomic studies of my own, I curated a list of 1373 eubacterial species (in which no species occurs twice), spanning a wide range of G+C values. When I learned of the dnaE2 hypothesis of Wu et al., I decided to check it against my own curated collection of organisms.

The first thing I did was go to UniProt.org and do a search on dnaE2. Some 1882 hits came back in the search, but many hits were for proteins inferred to be DNA polymerase III alpha subunits, not necessarily of the dnaE2 variety. In order to eliminate false positives, I decided to restrict my search to just bona fide dnaE2 entries that have been reviewed. That immediately cut the number of hits down to 77.

But among the 77 hits, some species were listed more than once (due to entries for multiple strains of the organism). I deduplicated the list at the species level and ended up with 60 unique species.

At this point, I wrote a little JavaScript code to check each of the 1373 organisms in my curated list against the 60 known-dnaE2-containing organisms obtained from UniProt. There were 47 matches. The matches are plotted in red in the graph below.
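
The matching step is simple enough to sketch. What follows is not my original script; it is a minimal Python version of the same idea (Python rather than JavaScript), and the organism strings, strain labels, and function names are placeholders invented for illustration. It assumes that the first two words of an organism name identify the species.

```python
def species_name(organism):
    """Reduce 'Genus species (strain ...)' to 'Genus species'."""
    return " ".join(organism.split()[:2])


def match_species(curated, dnae2_entries):
    """Return curated species that also appear among the dnaE2 entries.

    Strain-level duplicates in dnae2_entries collapse to one species
    because the names are gathered into a set.
    """
    dnae2_species = {species_name(o) for o in dnae2_entries}
    return sorted(s for s in set(curated) if species_name(s) in dnae2_species)


# Placeholder data; the strain labels are made up.
dnae2_hits = [
    "Mycobacterium tuberculosis (strain A)",
    "Mycobacterium tuberculosis (strain B)",
    "Pseudomonas putida (strain C)",
    "Agrobacterium vitis (strain D)",
]
curated_list = [
    "Mycobacterium tuberculosis",
    "Clostridium botulinum",
    "Pseudomonas putida",
    "Buchnera aphidicola",
]

print(match_species(curated_list, dnae2_hits))
```

Matching on the first two words is a blunt instrument: entries such as "Azoarcus sp." will match any unnamed Azoarcus species, so borderline cases are worth checking by eye.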

In this plot, genome A+T content (a taxonomic metric) is on the x-axis and coding-region purine content is on the y-axis. (N=1373) The points in red represent organisms that possess a dnaE2 error-prone polymerase. See text for discussion.

This graph plots A+T content (which of course is just one minus the G+C content) on the horizontal axis, against coding-region purine content (A+G) on the vertical axis. (For more information on the significance of coding-region purine content, see my previous posts here and here. It's not important, though, for the present discussion.) Notice that the red points tend to occur on the left side of the graph, in the area of high G+C (low A+T) content. The red dot furthest to the right represents the genome of Saccharophagus degradans. Only 6 out of 47 dnaE2-positive organisms have G+C content below 50% (A+T above 50%). The rest have genomes rich in G+C.
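
For the curious, the two plotted quantities are easy to compute. The sketch below is a hedged Python illustration, not the code behind the graph: the function name and the toy sequence are invented, and a real calculation would use the full genome for the x value and the concatenated coding regions for the y value.

```python
def base_fraction(seq, bases):
    """Fraction of the unambiguous bases in seq that belong to `bases`."""
    seq = seq.upper()
    total = sum(seq.count(b) for b in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return sum(seq.count(b) for b in bases) / total


toy_cds = "ATGAAAGGCTAA"  # made-up coding sequence, stands in for both inputs

x = base_fraction(toy_cds, "AT")  # A+T content (horizontal axis)
y = base_fraction(toy_cds, "AG")  # purine content, A+G (vertical axis)
print(x, y)

# Share of dnaE2-positive organisms with G+C above 50%, from the counts above.
high_gc_share = (47 - 6) / 47
print(f"{high_gc_share:.0%} of dnaE2-positive species are G+C-rich")
```

In other words, 41 of the 47 red points (about 87%) sit on the G+C-rich side of the graph.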

This is, of course, just a quick, informal test (a "sanity check," if you will) of the Wu hypothesis regarding dnaE2 (which is a repair polymerase not needed for normal DNA replication, nor possessed by all bacteria). Various types of sampling errors could invalidate these results. Also, the Wu hypothesis itself is open to criticism on the grounds that correlation does not prove causation. Nevertheless, it's an interesting hypothesis, and a check of the 47 dnaE2-positive species in my collection of 1373 organisms provides at least anecdotal support for the Wu theory that dnaE2 causes drift toward high G+C content.

Of course, Wu's theory does not explain the wide range of G+C contents observed in organisms other than bacteria. (There is no dnaE2 in eukaryotes, for example.) The general notion, however, that genomic G+C content tends to be a reflection of the components of a cell's "repairosome" (the enzyme systems used in repairing DNA) has substantial merit, I think. On that score, be sure to see my earlier analysis of how the presence or absence of an Ogg1 gene influences coding-region purine content.

Here, by the way, are the 47 dnaE2-containing organisms that show up as red dots in the graph above:

Agrobacterium tumefaciens
Agrobacterium vitis
Alkalilimnicola ehrlichii
Anaeromyxobacter dehalogenans
Anaeromyxobacter sp.
Aromatoleum aromaticum
Azoarcus sp.
Bdellovibrio bacteriovorus
Bordetella bronchiseptica
Bordetella parapertussis
Bradyrhizobium sp.
Brucella abortus
Burkholderia mallei
Burkholderia pseudomallei
Caulobacter crescentus
Corynebacterium diphtheriae
Corynebacterium efficiens
Corynebacterium glutamicum
Corynebacterium jeikeium
Dechloromonas aromatica
Gluconobacter oxydans
Hahella chejuensis
Idiomarina loihiensis
Methylococcus capsulatus
Mycobacterium bovis
Mycobacterium tuberculosis
Nocardia farcinica
Propionibacterium acnes
Pseudomonas fluorescens
Pseudomonas mendocina
Pseudomonas putida
Pseudomonas syringae
Ralstonia pickettii
Ralstonia solanacearum
Rhizobium sp.
Rhodopseudomonas palustris
Ruegeria pomeroyi
Saccharophagus degradans
Sinorhizobium medicae
Symbiobacterium thermophilum
Synechocystis sp.
Teredinibacter turnerae
Vibrio parahaemolyticus
Vibrio vulnificus
Xanthomonas axonopodis
Xanthomonas campestris
Xanthomonas oryzae

Tuesday, March 11, 2014

Bacteria in hail stones

An interesting open problem in biology is how so many signature microbial species (not just in the oceans and soil, but in anoxic lake sediments, hot springs, deep underground rock formations, etc.) got so widely and uniformly distributed. How is it, for example, that if you dig down a foot into the topsoil of any back yard in North America, and do the same in any back yard in Japan, say, you are practically guaranteed to find examples of Pseudomonas aeruginosa and Bacillus mycoides (and hundreds of other characteristic soil species)? Did these species get spread by the wind? By birds? By rain?

Probably all three. We know, for example, that African dust storms can carry particles as far as Houston, Texas. But also, bacteria routinely occur in the atmosphere, in clouds, and even in hail stones.

"Hailstones: A Window into the Microbial and Chemical Inventory of a Storm Cloud," by Temkiv et al. (2013), describes finding examples of γ-Proteobacteria, Sphingobacteriales and Methylobacterium (plus some 3000 organic compounds) in hail stones. A similar finding was reported in a blog post by University of Wisconsin bacteriology professor John Lindquist in 2006. Looking at recently fallen hail stones, Lindquist was able to culture purple photosynthetic bacteria (Rhodopseudomonas species) from his samples.

To my knowledge, no one has yet tried to characterize the viral or bacteriophage content of atmospheric moisture. If you're looking for a thesis project in microbiology, this could be a good one.