Saturday, August 31, 2013

The Geography of Cancer

Yesterday I wrote about the geography of colorectal cancer (CRC) and showed a map of CRC mortality in the U.S. The striking thing about the map was (is) that CRC is much more of a problem in the northern states than in the southern states. It turns out this is not an anomaly and in no way limited to the U.S. Colorectal cancer tracks latitude, worldwide. The further you live from the equator, the greater your chances of dying of colorectal cancer.

The above graph comes from a 2005 paper by Mohr et al. (no longer online; but see the 2013 paper by Cuomo, Mohr, et al.) that correlates cloud cover and distance from the equator with cancer rates in 175 countries. It shows quite clearly that colorectal cancer incidence varies with latitude. The countries with the lowest CRC rates are near zero degrees latitude (the equator).

This effect doesn't just apply to colorectal cancer, though. It also applies to breast cancer:


Breast cancer and colorectal cancer are distinctly different cancers, so in order for these graphs to be as similar as they are, there must be a common denominator of extremely broad applicability underlying the latitude trend. It turns out the common denominator is vitamin D.

I'll spare you the book-length treatment. Suffice it for now to say: More than 2,500 research studies have been published in biomedical journals investigating the inverse association between vitamin D, its metabolites, and cancer, including almost 300 epidemiological studies. For a good overview, I recommend the review article by Garland et al. (2009). You might notice (as I did) a certain amount of hesitancy on the part of big-name researchers to come right out and pronounce vitamin D a bona fide cancer-preventive agent, due to the relative dearth of prospective (intervention-based) randomized controlled trials. (One intervention study worth reading is the 2007 trial by Lappe et al. in Am J Clin Nutr.) After the CARET disaster, no one wants to get caught recommending a vitamin regimen based on epidemiological happy-talk, and I can understand that. Nevertheless, I think the weight of the evidence in favor of vitamin D, at this point, is substantial enough (and any downside negligible enough) that people should start thinking about taking substantial amounts of vitamin D as prophylaxis against cancers of all kinds (not just CRC and breast). My advice is: Read the literature and decide for yourself. Don't wait for FDA, CDC, the National Cancer Institute, or anyone else to give you the green light on this one. They've got their own agendas to worry about.

Friday, August 30, 2013

Cancer Mortality: North versus South

Today's maps are again brought to you courtesy of http://ratecalc.cancer.gov/ratecalc/, where you can easily spend a lunch hour (and then some) becoming engrossed in epidemiological mysteries with no apparent answer.

You don't have to spend much time with cancer maps to convince yourself that cancers occur non-randomly with respect to geography. A good case in point is colorectal cancer (CRC), which appears to be mostly a disease of the northern latitudes, at least in the U.S.


Sure, CRC occurs in the southern states, too. But there's no denying the preferential buildup of mortality in the Northeast. Bear in mind these maps are population-corrected; they reflect death rates per 100,000 people. (In other words, red areas aren't simply high-population areas.) I think it's interesting to note that CRC tends to track the Mississippi River (and perhaps the Hudson River as well). Don't ask me what it means.

When you look at liver cancer in men, you get more or less the inverse picture:


Evidently your chances of dying of liver cancer are highest in the South. Why? Risk factors for liver cancer include gender (male), obesity, alcohol consumption, ethnicity (Asian), smoking, diabetes, use of steroids and certain other drugs, and exposure to aflatoxins (a type of toxin produced by fungus). Some of these factors (alcohol, obesity, diabetes, smoking) are correlated with poverty. Here's a map of poverty in the U.S. (also see the maps in my earlier post on poverty and obesity).


The conclusion isn't that poverty causes cancer. Poverty does, however, correlate with many of the things that lead to liver cancer. Obesity correlates very well with poverty (see my earlier post), and as it happens, obesity increases your odds of death from liver cancer by a factor of 4.5 (see this NEJM paper, p. 1630, for details).

As to why colorectal cancer mostly kills northerners: That's a bit of a mystery. Risk factors for colorectal cancer include alcohol, obesity, diabetes, smoking, animal fat (red meat diet), sugar consumption, inflammatory diseases of the bowel, sedentary lifestyle, and genetics. I don't think any of those things correlate with living in the Northeast. In fact, some of them (through poverty) actually correlate with living in the South. However, it's fairly well known that vitamin D is protective against many types of cancer and lack of sunshine is a risk factor for cancer.

There's actually a microbiological factor associated with CRC that's worth talking about in some detail (in a later post) that could, conceivably, relate to geography. More on that later.


Wednesday, August 28, 2013

Prostate Cancer and Selenium

Today at BigThink (and here) I'm blogging about prostate cancer, which is the second leading cause of cancer death in American men. In particular, I want to get the word out about selenium's well-documented ability to protect against prostate carcinomas. It turns out there are other important health benefits (for people of both sexes) to selenium, some of which I summarize in my BigThink post.

While selenium's exact mode of action is still not wholly known, it appears the mineral induces apoptosis of cancer cells by triggering caspase-3, a cysteine-aspartate protease involved in the "execution phase" of programmed cell death (apoptosis).

After writing the BigThink post, I decided to have a little fun and try for some do-it-yourself epidemiology graphs. I was startled by the results. This is what I came up with.

Prostate cancer in the U.S., 1970-2004 (obtained from http://ratecalc.cancer.gov/ratecalc/).

The above map shows U.S. prostate cancer by county. Rates are population-adjusted (so you're not seeing mere population density effects). It's obvious that the cancer rate is not randomly distributed by geography. But what, I wondered, could account for the uneven distribution?

I searched online for a map of soil selenium distribution, and this is what I found (at http://tin.er.usgs.gov/geochem/doc/averages/se/usa.gif):


Obviously the inverse correlation between selenium and prostate cancer is not perfect. (How could it be? Americans are a mobile lot; and people don't simply eat locally grown vegetables, etc.) But I think the two maps speak pretty clearly to the role of selenium in protecting against prostate cancer. If you want to draw a different conclusion, so be it.

Monday, August 26, 2013

The Geography of Cancer, by Gender

It's a mystery. Why should the geography of cancer be different for men than for women? Maybe you can think of an answer. If so, be sure to leave a comment.

Here's the graph for cancer mortality rates among all U.S. women (1985-2004), by county, courtesy of http://ratecalc.cancer.gov/ratecalc/:


Notably, cancer in women seems to be higher along the Mississippi and Ohio Rivers as well as the Hudson River. There is also a lot of red in coastal New England, the Gulf states (except southern Texas and southern Florida), and northern California.

Compare the above graph to the graph for U.S. men:


The map for men shows an extraordinary concentration of cancer mortality in the South. Note the huge difference between southern Florida and northern Florida, and also between west Texas and east Texas. A thin strip of reduced-cancer geography follows the Appalachian Trail.

It should be noted that the numbers for men are much higher than for women. In the graph for women, the darkest reds start at 182.35, which is only light pink on the men's graph. Nevertheless, the key point is that the geographical distribution of mortality is qualitatively much different for men. The highest death rates for men are clearly in the southern states.

It wasn't always this way. The two graphs above are for the years 1985 to 2004. The graph below shows what cancer mortality for U.S. men looked like using statistics from 1950 to 1984:


This graph shows a pattern much more similar to the women's graph. Somehow, between the 1960s and the 1990s, cancer in men migrated to the southern states. But why?

One clue might be found in the country's changing demographics: The U.S. had a much younger population in 1950 than in 2004. In 1950, barely 8% of the country was age 65 or over. Today about 14% of U.S. residents are 65 or over. The cancers of old age, in men, are lung (No. 1) and prostate (No. 2). And it does appear that cancer mortalities from those two categories map disproportionately to the southern states. (I'll show the prostate map in my next post.) But still, cancer knows nothing about geography. Why should it be more pronounced in the South? Especially since a map of U.S. population by age shows no accumulation of older Americans in the South?

This is, in my opinion, a rather large open question in epidemiology. It would be nice if someone could provide an answer.

Saturday, August 24, 2013

The Adjustment Bureau

I confess, most fiction bores the living spit out of me. The real world is already 15,000 times as perverse as anything I'm going to encounter in a novel, and frankly (let's be blunt), 999 out of 1000 novels are written to such preposterously low literary standards that I, for one, can rarely get past page two of a novel without thinking of Dorothy Parker's famous comment: "This is not a book to be tossed aside lightly. It must be thrown across the room with great force."

The same goes for movies. It seems like I start five movies for every one I finish. Consequently, when that rare movie or book comes along that catches me off guard and forces me to admire it against my will, I always take the time to go back and do a post-mortem. I ask myself: What was it about this movie that reeled me in, so to speak, even though I didn't think there was a chance in hell I was going to like it? At what point did I agree to suspend disbelief? What hooked me? What worked so well that I couldn't stop watching?
Terence Stamp as über-adjuster Thompson.

These are especially pertinent questions for The Adjustment Bureau (2011), which I knew (or so I thought), going in, I couldn't possibly stand to watch, much less enjoy. After all, I'm not religious; I don't believe in angels. The mere mention of a movie about angels makes me want to projectile-vomit. And Matt Damon? Haven't we seen just about enough of that guy on the big screen? (And the not-so-big screen?)

We enter the story with a cloyingly upbeat Congressman David Norris (played by Damon) stumping for a Senate race. He's giving speeches, shaking hands, grinning the win-grin; it looks very much as if the youngest-ever U.S. Representative from New York is destined to become the youngest-ever Senator-from-New-York. But the story immediately pivots. As Norris watches election-night returns on a hotel-room TV, it's increasingly obvious that he's not only going to lose his Senate bid, but lose big. News-channel pundits are already writing his political obituary. He has only a few minutes to pull together a concession speech.

Norris heads downstairs to face a ballroom packed with disappointed supporters, but on the way he darts into a men's room to practice his speech, not realizing that a gorgeous young wedding crasher, Elise Sellas (played by Emily Blunt), is hiding from hotel police in a nearby stall. After several telling moments of solo speech-practicing, Norris decides to use the stall. Out pops an embarrassed Sellas, who explains her predicament. She suddenly recognizes Norris as "the guy running for Senate." Some nervous-clever banter ensues. She critiques the speech. Soon the two are trying to ignore some pretty heavy chemistry.

It's at this critical juncture that writer-director George Nolfi (The Bourne Ultimatum) takes the film's first big risk, having David and Elise ram lips together in the men's room. The "cute meet" takes place only a few minutes into the movie, and the credibility of all that follows rides (arguably) on the believability of this key moment. If it doesn't work, the viewer will disengage; if it works, it better work like hell. Incredibly, the actors manage to pull it off—enough for me to nod approvingly and say "Okay, brave move, I'm buying it so far. Let's see what happens next."

Blunt and Damon pull off an unlikely (but ultimately successful) "cute meet" early in the film.
We soon learn that Norris is being shadowed by a nondescript man in a suit and hat named Mitchell (superbly played by Anthony Mackie), whom we see sitting on a park bench in the morning, receiving instructions from another nondescript "suit" named Richardson (played by John Slattery) to the effect that Norris has to be made to spill his coffee no later than 7:05 a.m. Mitchell tells Richardson not to worry, he'll get Norris as soon as he enters the park.

With this scene, director-writer Nolfi deftly dodges a bullet. Had Nolfi clearly announced Richardson and Mitchell as angels, I would have said "That's enough for me" and stopped watching. Rather than put halos around anyone's head, Nolfi (with the sure knowledge that every viewer has seen The Matrix multiple times) clearly labels the suits in the park as Agents. All we know about them is that they're tailing our man and plan to disrupt his life. To what end, though? "Spilling the coffee" could be code language. Maybe it's a sting operation of some kind. Maybe they're NSA? CIA? ("Good," I'm saying, tossing a chip into the pile. "I'm in for one more round. Deal cards.")

Mitchell, the agent, dozes off while waiting in the park, waking up just in time to see Norris board his bus. The 7:05 window is gone. Mitchell grabs a notebook from his pocket, flips it open, and looks at it in alarm. He jumps up and bolts after the bus, sprinting into and out of traffic, eventually getting struck (though not killed, of course) by a cab.

Bear in mind, at this point in the movie there've already been several "shits" and "damns" and other small comforts to let me know this isn't a Jesus-freak operation. Not by a long shot. The suits seem vaguely menacing (and now inept; they can't even make someone spill coffee on time, or avoid a taxi-cab enema in New York City). God's away on business, apparently.

When our man Norris takes his seat on the bus, the bus lurches and he spills his coffee, not on himself but on the lady sitting next to him, who turns out to be (who else?) Elise Sellas, Miss Kissyface from the men's room. Banter turns to flirting as Norris's shitty little BlackBerry keeps ringing. At one point, Sellas plops it in his coffee. It keeps ringing. "Sturdy little fucker, isn't it?" she says.

Norris, insistent that he pay for the dry-cleaning of Elise's java-stained skirt, gets her phone number (she chides him mercilessly for resorting to such a crude pickup technique). Eventually, the bus stops and she gets off. Norris stands in the doorway of the bus, unsure what to say to this mystery-woman that keeps materializing in his life. "The morning after I lost the election I woke up thinking about you," he blurts out. As the bus starts to go, Elise smiles sadly and gives him the finger.

For me, this was the magic lock-in moment when writer-director Nolfi had me in leg irons; I was "all in" from that point on. I'm still trying to figure out who came up with the idea of Sellas suddenly flipping Norris the bird. It's not in the screenplay (nor in the disappointingly brief Philip K. Dick short story upon which the movie is based). I'm guessing Matt Damon and Emily Blunt worked out the sequence on their own. Whoever thought of it, it was brilliant. The awkwardness, the sexual tension, the prospect of something meaningful suddenly and unexpectedly leapfrogging mere infatuation (with all the emotional panic that that entails—who the hell wants to mess up something sweet and spontaneous with True Love? "Read my finger" indeed), came off as believable and compelling.

Norris arrives at his new job (at an investment banking firm; the consolation prize for losing a Senate race, apparently) and says hi to people as he walks by, unaware that everyone is frozen rock-solid. He arrives in a conference room to find agent (adjuster) Richardson there with some other suits, one of whom is waving an electronic wand over a frozen coworker. There's a moment of confusion as the "adjusters" try to understand how Norris is moving, talking, able to see them. (This is practically the only part of the movie that's true to the Philip K. Dick story.) Richardson orders his men to grab Norris. A Bourne-style chase scene ensues. Director Nolfi is now firmly in his element.

Norris eventually wakes up to find Richardson explaining to him that he (Norris) stumbled into something he wasn't supposed to witness and would therefore have to be "reset." Norris listens in horror as Richardson and his fellow suits talk about the situation, the need to "reset" him. A higher-ranking suit suddenly arrives and calls off the reset; there's a suit bitch-fight. Eventually it's revealed to Norris that he went "off Plan," and that the job of the suits is to make the rare but occasionally necessary random adjustments that keep people "on Plan." Norris will go free, but only on the condition that he not reveal the existence of the agents (or a "Plan") to anyone, lest he be "reset." Before he's allowed to leave, Norris is forced to give the slip of paper with Elise's phone number to Richardson, who (as a horrified Norris looks on) sets it ablaze. "You weren't ever supposed to see her again," Richardson explains. Instead of meeting her on the bus that morning, Norris was to have spilled coffee on his shirt and gone back to his apartment to change.

It's made abundantly clear to Norris that if he ever tries to connect with Elise Sellas again, he'll be reset: his memory erased, his life gone, back to square zero (or wherever you go when you're reset).

At this point in the film, I'm solidly hooked. I'm in it now for the love angle (I want to see how Norris and Sellas reconnect) and I'm in it for the sci-fi angle (I want to see Neo/Norris fight the Matrix/Plan and win out over the agents). Director Nolfi has me watching a goddam angel movie. The little bitch.

Norris will, of course, veer way Off Plan to find the woman of his dreams, and Elise will eventually have to be told about The Plan. Waves of new agents will have to track Norris down like a dog, and they'll be thwarted, by turns, necessitating successive levels of escalation. Eventually, sinister super-adjuster Thompson (expertly played by Terence Stamp) will take over. But Norris has a surprise ally in Mitchell, the rank-and-file case worker who flubbed the original coffee-spill errand. It turns out angels sometimes want to defect.

This is an ingenious inversion of the angel myth, portraying angels as bumbling functionaries with their own personal foibles, answering to a hands-off Higher Authority that stays in the background until direct intervention is unavoidable. It turns out there is, after all, a Plan, but it's not perfect, and humans regularly stray from it, requiring tiny "butterfly effect" perturbations that bring things back into alignment. But wait a minute, if angels can read minds and anticipate the future, why can't they just prevent excursions in the first place? Mitchell explains to Norris matter-of-factly: "We lack the manpower to be all places at all times." Adjustments have to be just-in-time/just-enough. Adjusters operate on a need-to-know basis, in a command structure that resembles that of an intelligence organization—or perhaps a large software system, where encapsulation and delegation are strictly enforced to prevent unwanted intimacy between objects, while exceptions bubble up through a call hierarchy.

This is a superbly crafted drama in which the final act leads right where you know it has to lead—to the Ultimate Authority. I don't want to give away the ending. I'll say this: It obeys every good rule of screenwriting, every ironclad rule of storytelling (e.g., making things go from dire to impossibly bad, even when things can't get any worse), but without resorting to the deus-ex-machina ending you're absolutely convinced has to be coming.

A story that can keep you guessing until the very end is rare. Philip K. Dick would, I think, be proud.

Thursday, August 22, 2013

Leonard's Laws

Legendary novelist Elmore Leonard recently passed away at age 87. He left behind not only an impressive body of work but a philosophy of writing, exemplified in his oft-cited Ten Rules.

Leonard's Rules strike me as being remarkably fresh and relevant, overlapping, as they do, not only common sense but modern reading (and publishing) preferences. No one these days wants to read (or publish) Henry James. Today's readers, editors, and agents look for a more streamlined, get-straight-to-the-point approach to writing that eschews descriptive dead weight.

Let's go through Leonard's ten rules one by one.
Elmore Leonard, 1925-2013.

1. Never open a book with weather. Already, Leonard is communicating subtext to the reader: Don't bore me with useless particulars, on any topic, but especially not on the most useless of topics, meteorology. Presumably we'll know soon enough, by the characters' actions or words, whether it was a dark, stormy night.

2. Avoid prologues. I take this to mean narrative setup of any kind, at any point in the story. The story is its own setup. The old screenwriter's axiom "enter the scene late, leave early" comes to mind. Backstory is still story. Present it as such.

3. Never use a verb other than "said" to carry dialogue. As with any rule containing the word "never," this rule can certainly be broken, but as with all truly important rules, it takes an expert to know when to break it, so be forewarned. Try this simple test: Go through a piece of dialog you've written and replace every dialog specifier (remarked, mumbled, exclaimed, gasped, etc.) with "said" and/or with nothing. Does it read better? Flow better?

4. Never use an adverb to modify the verb "said" (he said menacingly). Why? It's overreach. Plus it's editorializing; you're telling the reader how he or she should feel/interpret what's being said. Let the reader interpret as she wishes. Let timing (adroit placement of beats) determine gravity. Give cues, not directorial advice.

5. Keep your exclamation points under control. This rule could well have been given an exclamation point, of course, but Leonard demonstrates his own advice by not doing that. An exclamation point is directorial advice. Use sparingly.

6. Never use the words "suddenly" or "all hell broke loose." If you're still using trite phrases like "all hell broke loose" in your writing, you're in more trouble than you think. If you've heard of a phrase before, it's trite. Get rid of it. (Except in dialog, and even then, don't overdo it.) Adverbs signal missed opportunities to show rather than tell.

7. Use regional dialect (patois) sparingly. If a character is speaking in a thick accent, give one explicit cue in the narrative, then move on. (Let the reader imagine/construct the patois on his/her own.) You're not Faulkner, you're not Twain.

8. Avoid detailed descriptions of characters. A moviegoer will make up a scarier monster in his head than you can show on the screen. But guess what? The Imagined Monster Principle applies to all characters, not just movie monsters. One or two vivid details about a character (preferably shown rather than told) will usually be enough to start the snowball. Add a speck of detail now and then to keep it going. More is not better.

9. Don't go into great detail describing places and things. Description, in general, is out of fashion these days, so if you grew up reading the classics and you're trying to emulate some of your favorite 19th and 20th Century writers, please reconsider. Write the screenplay version first. Later on, when you're rich and famous, you can pretend you're Margaret Atwood.

10. Try to leave out the parts readers tend to skip. What parts are those? Try this: Take something you wrote and read it aloud, pretending (as you read) that Stephen King is sitting right behind you, arms crossed, saying "Okay, so what?"

And then there's Leonard's Eleventh Rule: "If it sounds like writing, rewrite it."

Best, rule, ever.

Wednesday, August 21, 2013

The Mysterious Death of Michael Hastings


So much has already been written about the death of journalist Michael Hastings, there doesn't seem much point in adding more. I'm writing about it today, nonetheless, for two reasons. First, there are still some people who haven't heard about the story; talking about it here helps get the word out. Second, much of the coverage I've seen omits key facts. I think those facts need to be part of the story.

What do we know for sure?

We know that on June 18, 2013, at about 4:30 in the morning, a 2013 Mercedes C250 coupe driven by 33-year-old Michael Hastings was headed south on the 600 block of North Highland in Los Angeles when it slammed into a palm tree at high speed, killing Hastings. (See video.)

We know from statements given to the press by his widow that Michael Hastings was working on a story for Rolling Stone about CIA Director John Brennan at the time of his death.

We also know that Hastings had been in contact with Wikileaks lawyer Jennifer Robinson just a few hours before his fatal car crash.

And we know that before the crash, Hastings sent an e-mail to friends with a Subject line of "FBI Investigation, re: NSA." (Most news accounts fail to mention this Subject line or the fact that it refers specifically to NSA, a topic Hastings was not known to be working on.)

The last-minute e-mail sent by Hastings warned recipients (which included colleagues at Buzzfeed) that FBI was interviewing his "close friends and associates"; Hastings suggested they may want to seek legal counsel before talking to law enforcement. The e-mail ended with "I’m onto a big story, and need to go off the radat [sic] for a bit."

NSA is not specifically mentioned anywhere in the e-mail other than the Subject line. It seems likely, though, that an NSA connection of some kind is what Hastings was referring to by "I'm onto a big story." Why? Well for one thing, as I said, NSA is not mentioned anywhere else in the e-mail. But also, when you find out NSA might be targeting you, the obvious thing you do in response is "go off the radar"—refrain from using the Internet, stop sending texts, stop making phone calls. Stop being tracked by NSA.

Hastings must certainly have known that his last-minute e-mail would be intercepted by NSA. (Putting 'NSA' in the Subject line may have been his way of acknowledging NSA as a recipient.) The e-mail served a double purpose of telling the Feds: "I know you're spying on me, and since you are, you should also know that my friends will be making appropriate legal preparations." Hastings took his own advice by reaching out to the Wikileaks attorney.

The security-cam video of the fatal car crash (above) shows signs of being a quick hand-held phone-cam "recording of a recording" by the restaurant owner before the original footage was surrendered to police. (You can see the jerky motions of the handheld phone and the out-of-frame items on the left.) The Hastings Mercedes flashes into view at 0:14 and explodes 3 seconds later. The distance from the security cam to the tree-impact point was estimated by someone on the scene to be 255 feet. (That equates to a speed of 85 feet per second, or 58 mph.) The tape seems to show the car's brake lights on as it speeds past, indicating that Hastings may have been trying (in vain) to get the car to slow down.
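
The speed figure above is simple arithmetic, and it can be checked in a few lines. This is only a sketch of the calculation: it takes the bystander's 255-foot estimate and the roughly 3-second interval read off the video at face value, and it yields an average speed over that stretch, not the speed at impact. The function name is my own, made up for illustration.

```python
FEET_PER_MILE = 5280
SECONDS_PER_HOUR = 3600


def average_speed(distance_ft, elapsed_s):
    """Return average speed as (feet per second, miles per hour)."""
    fps = distance_ft / elapsed_s
    mph = fps * SECONDS_PER_HOUR / FEET_PER_MILE
    return fps, mph


# Inputs are the estimates quoted in the post, not measured values.
fps, mph = average_speed(255, 3)
print(f"{fps:.0f} ft/s, about {mph:.0f} mph")  # 85 ft/s, about 58 mph
```

Because both inputs are rough, the result is rough too: if the interval were really 2.5 seconds rather than 3, the same distance would work out to about 70 mph.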

Google street view of the scene (photo is from 2011). Red arrow points to the tree that was hit.
Security camera was located under the scalloped awning of the building on the right.

Two days after the crash, LAPD announced that there appeared to be no evidence of foul play. In reality, an investigation to determine the likelihood of foul play in a case like this takes considerably longer than two days. And Detective Connie White from LAPD’s West Traffic Bureau admitted a week later (to the boyfriend of the owner of the pizza restaurant from which the security-cam footage came) that foul play had not, in fact, been ruled out. She didn't elaborate on why LAPD had felt it necessary to rush to the earlier judgment that foul play was not involved.

Most media sources are still, in fact, reporting that LAPD has ruled out foul play.

Many people have been mystified as to how Hastings could have lost control of what can only be considered an extraordinarily responsive car; how such a car could (barring intoxication of the driver) essentially drive itself, with great precision, straight into a palm tree.

Six days after the crash, former Bush security advisor Richard Clarke told the Huffington Post exactly how that's possible, explaining:
What has been revealed as a result of some research at universities is that it's relatively easy to hack your way into the control system of a car, and to do such things as cause acceleration when the driver doesn't want acceleration, to throw on the brakes when the driver doesn't want the brakes on, to launch an air bag. You can do some really highly destructive things now, through hacking a car, and it's not that hard.
It's not known what kind of GPS the Hastings car had, but if it was equipped with Wide Area Augmentation System sensing (as many Magellan units now come with), it would have had the one-meter accuracy needed to impact a specific tree on a specific highway—assuming a cyber-attack took place and that it was GPS-guided.

What we should all be wondering, right now, is: Where is the car's black box? Was it recovered from the crash? Who is analyzing it?

If Hastings was indeed the victim of a car-electronics cyber-attack, we'll probably never know about it, of course, because it will be covered up as a "national security matter." So we're left to go by what we do know. I leave it to the reader to draw his or her own conclusions.

Please leave a comment below.

Tuesday, August 20, 2013

Memory Leakage in Firefox

The other day, I was grousing in public (on Twitter) about Firefox memory consumption on my machine. I started posting my memory consumption stats to Twitter every few hours, showing Firefox using 500 megs, then 600, then 800, then 1.1 gigabytes, etc., over a total period of about eight hours. Eventually, Mozilla's Ben Kelly reached out via Twitter to offer help.

Kelly suggested I open a new tab and go to a URL of about:memory. By doing this, you can get numerous views into memory usage (including some extremely verbose reports). Turns out this trick works in Chrome, as well.

I sent Kelly a memdump and he then asked if there was any reason I was still running Firefox 15 (on Vista). I sheepishly told him there was no reason other than sheer laziness and sloth on my part. He pointed out that the latest version of Firefox incorporates 140+ memory-related fixes. I knew what I had to do.

Visitors to this blog tend to be users of Firefox or Chrome. Why so many people still use Internet Explorer, I don't know.

That evening, I upgraded to Firefox 23. It was a suitably painless process in that it went quickly, requiring minimal intervention on my part, and resulted in a new version of Firefox (looking much like the old version, thankfully) with all my bookmarks and old settings in place. However, the first time I went to a site that relies on Flash (Google Finance, in this case), I was presented with grey boxes insisting I upgrade Flash. I went ahead and did that, and of course I had to restart Firefox to make the changes take effect.

A similar scenario happened with the Acrobat plug-in; upgrade required. Not a huge deal. Nevertheless it's the kind of small inconvenience that, if you multiply by a dozen or more plugins, acquires a Chinese-water-torture aspect after a while. It gets to be annoying enough to keep you from upgrading Firefox as often as you should.

While I was visiting someone's web site, I wanted to know how they were doing a particular HTML trick, so I typed F12 to pop the Firebug console. Except, nothing happened. "Crap," I muttered. "Firebug isn't compatible with the new version of Firefox."

I went to the Firebug site, figuring if I downloaded the latest version of Firebug it would solve everything. To my horror, I learned that the latest release of Firebug is compatible with Firefox 22 but not 23. Fortunately, I was able to locate a beta version of the next release of Firebug. And it works fine with FF 23 (so far).

So, but. Did the upgrade to Firefox 23 solve my memory-usage issues? Short answer, no. Firefox 23 is certainly less memory-leaky than Firefox 15, but it went from using 177 megs of RAM to 1.1 gigs in 20 hours, then died and popped into Mozilla Crash Reporter shortly thereafter. (A certain plug-in seems to have brought the world to an end.) I had half a dozen tabs open: Twitter, Blogger, Blogger, Gmail, Blockbuster, and BigThink.com.

I'll keep you posted as to what I find out about memory leakage in Firefox. As of now, I consider it to be an ongoing problem; maybe not for everybody, but for me, at least.

Note: I still consider it important to stay with Firefox for most of my browsing needs. Why? Because of the many privacy-oriented plug-ins/extensions available for it. I am troubled by privacy issues around Chrome and IE. Firefox is a clear "least evils" choice—for me. For now. 

Sunday, August 18, 2013

What Came Before 'RNA World'?

I go to bed sometimes wondering what early earth was like. I try to imagine how it's possible that life could have arisen when this planet was perhaps only 1% of its current age, barely cool enough for the oceans not to boil off.

It's generally understood that life originated around 3.8 billion years ago in tide pools, swamps, lakes, or possibly the deep ocean, while organic molecules rained down from lightning-filled skies heavy with pyroclastic gases. This is the so-called Primordial Soup Theory of Haldane and Oparin, given experimental weight by Miller and Urey. It leaves open rather a lot of important details, but clearly implies that biopoiesis arose in an aqueous phase through interaction of co-solutes.

Did life begin in, under, or near hydrothermal vents? Some researchers believe serpentinite rock structures associated with white chimneys could have provided pH gradients suitable for biopoiesis.

From a chemical standpoint, the characteristic defining feature of life is catalysis; in particular, the catalytic formation of catalysts that catalyze their own formation. In the standard Crick dogma of DNA -> RNA -> protein, we leave undrawn the many monomer/protein interactions that lead back to DNA. Nevertheless, it's clear that 85% to 90% of proteins and 10% to 15% of RNA molecules play mainly catalytic roles in cell chemistry.

For precisely this reason, aqueous-phase Soup Theory should probably be reconsidered. Any chemist will tell you that surface catalysis and phase boundary catalysis are orders of magnitude more effective than pure liquid-phase catalysis. This is why catalytic converters on cars are not giant bongs with fluid in them but instead contain a ceramic honeycomb core overlaid with a solid-phase platinum-palladium washcoat. It is also why the largest industrial catalytic operations (including fluid catalytic cracking of petroleum oil, which is fluid only in terms of the flow of ingredients; the catalyst itself is a solid powder) employ surface catalysis. Indeed, catalysts are often used in powdered, sintered, or coated-bead form specifically to maximize surface area. In living cells, enzymes are only partially solvated (interior portions are typically hydrophobic), and most enzymes can in fact be imagined as solid fixtures onto which reactants are adsorbed. (Surely no one thinks of ribosomes as being "in solution" in the way that, say, a sodium ion is in solution.) Surface catalysis characterizes living systems as well as industrial processes.

We also know that crowding effects are important in controlling enzyme shape and activity, and in the absence of crowding, some enzymes tend to partially unfold. Indeed, it seems likely molecular confinement has (to some extent) driven the evolution of protein primary and tertiary structure. Some would argue that biological macromolecules resembling those of today could not reasonably have arisen in a confine-free aqueous phase and that (therefore) the proto-biotic "soup" envisioned by Oparin and Haldane is unlikely to have produced cellular life. Some say it's much more likely that biopoiesis began in an environment of solvated clay particles, serpentine rock near hydrothermal vents, or (perhaps) a feldspar lattice of some kind. A colloid (such as clay) offers many advantages. For a clay to be a clay, particles must be no larger, on average, than 2 microns. This is a perfect substrate size for growth of loosely bound biological macromolecules. Such particles offer a huge amount of surface area per unit volume, much more than could be realized through, say, the attachment of catalytic foci to sheets of silica-laden rock.

Such is the state of our ignorance on biopoiesis that there's still no clear agreement on whether proteins appeared first, or nucleic acids (or perhaps biologically active lipids). The jury is still out. The so-called RNA World theory has gained a tremendous following in the last 30 years, based in part on work by Cech and Altman showing that RNA can act as a catalyst by itself. But a fundamental unanswered problem in RNA World theory is how pyrimidines, purines, or other monomers managed to link up with sugars and then form the first RNA molecules in the absence of a suitable catalyst. (RNA can catalyze the formation of RNA, but how did the first RNA-like oligomer arise, without a catalyst?) Pyrimidines and purines are not known to spontaneously bind to ribose, much less form phosphorylated nucleotides, on their own. By contrast, amino acids can easily condense to form dipeptides, and dipeptides can catalyze the formation of other peptides. (For example, the dipeptide histidyl-histidine has been shown to catalyze the formation of polyglycine in wet-dry cycled clay.) Thus, it's at least plausible that proteins came first.

Ironically, abiotic formation of purines and pyrimidines is not, in itself, an insurmountable problem, provided we accept that hydrogen cyanide and formaldehyde were present in the primordial "soup." (Both HCN and formaldehyde have been produced with good yields in spark-discharge experiments involving diatomic nitrogen, CO2, water, and hydrogen. Even in the absence of molecular hydrogen, the yield of HCN and H2CO can approach 2%.) HCN undergoes a base-catalyzed tetramerization reaction to produce diaminomaleonitrile (DAMN), which, with the aid of u.v. light, can go on to yield a variety of purines. Acid hydrolysis of the HCN oligomers thus produced can lead (somewhat circuitously) to pyrimidines.

Abiotic formation of sugars is also possible if formaldehyde is present. Condensation of formaldehyde in the presence of calcium carbonate or alumina yields glycolaldehyde, which can begin a cascade of aldol condensations and enolizations that produce a formidable array of trioses, tetroses, pentoses, and higher sugars via Butlerow chemistry (also called the formose reaction).

The greatest problem with RNA World theory thus isn't the ab initio creation of bases or sugars, but rather their attachment to one another. In current biologic systems, pyrimidines are attached to sugars by displacement of pyrophosphate at the sugar's C1 position (something that has not succeeded in the lab under prebiotic conditions). In living systems, purine nucleosides are created by piecing together the purine base on a preexisting ribose-5-phosphate. It's hard to see how that could occur abiotically.

It's worth noting, too, that while spontaneous creation of sugars and bases can occur through condensations and other reactions, the result would not simply be just the riboses and purines and pyrimidines seen today; rather, there would arise a zoo of different products, including all the stereoisomers of such products. (There are, among the pentoses alone, twelve different possible stereoisomers.) Somehow, early systems would have to have converged on just the sugars, just the bases, and just the isomers of them needed to promulgate living systems.

Not that an abundance of isomers is a bad thing. Maybe pre-cellular "miasmal" life actually comprised a remarkable zoo of thousands (or hundreds of thousands) of potential biomolecular precursors, of which only the most catalytogenic survived. If muds and clays offered the particle substrates on which these molecules were formed, one can imagine that sticky molecules (those with the power to adhere tenaciously to clay particles, sealing them off from other, competing molecules) would have eventually won control over the means of catalysis. This would have meant micron-sized clay particles covered over with what would today be called nonsense proteins: ad-hoc polypeptides made of whatever amino acids (and other reactive species) might most easily polymerize.

What might these nonsense proteins have been capable of? In a Shakespeare-monkey typing pool world, any kind of protein is possible, subject only to steric hindrance, crowding effects, and the laws of chemistry. It seems likely that a one-micron clay particle coated with Shakespeare-monkey proteins would expose, if only by accident, hundreds of thousands of active sites of various kinds, creating catalytic opportunities of exactly the sort needed to take chemical evolution to the next stage.

Some enterprising 21st-century Urey or Miller needs to affix tens or hundreds of thousands of nonsense proteins to hundreds of thousands (or better, millions) of clay particles, soak it all in monomers of various kinds (amino acids, sugars, bases, lipids), and see what comes out. Experiments need to be done with activated colloids of various kinds, using temperature cycling as an energy source, using (and not using) oxidizing and reducing agents, with and without wet/dry cycling, with and without freezing and thawing, electrical energy, etc. We need to focus our efforts on what came before RNA World, what life was like before there were templates, before there was a genetic code, before Crick dogma. What were proteins like before the invention of the start codon or the stop codon? (Was protein size determined by Brownian dynamics? Reactant exhaustion? Molecular crowding? Intervention by chaperones or proteases?) What kinds of "protein worlds" might have existed under acidic conditions? Basic conditions? High redox-potential conditions? High or low temperature conditions? Phosphate-rich (or -poor) conditions? Repeat all of the above with and without u.v. light. With and without pyroclastic gases. With and without lightning. With and without cosmic rays. With and without adenylated coenzymes.

Experiments are waiting to be done—by the thousands—in vitro, in silico, in lutum.

Friday, August 16, 2013

Bacterial Genes in Rice: A Cautionary Tale

Something very strange happened the other day.

I was fooling around looking for flagellum genes in various organisms, hoping to find homology between bacterial flagellum proteins and eukaryotic cilia proteins. All of a sudden, a search came back positive for a bacterial gene in rice, of all things.

On a lark, I decided to check further. ("If one gene transferred, maybe there are more," I reasoned.) It was late at night. Before going to bed, I downloaded the DNA sequence data for all 3,725 genes of Enterobacter cloacae subsp. cloacae strain NCTC 9394 and set up a brute-force BLAST search of the 3,725 bacterial genes against all 49,710 genes of Oryza sativa L. ssp. indica. I set the E-value threshold to the most stringent value allowed by the CoGeBlast interface, namely 1e-30, meaning: reject anything that has more than a one-in-10^30 chance of having matched by chance. I went to bed expecting the search to turn up nothing more than the one flagellum protein-match I'd found earlier.
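For anyone running a similar search outside a web interface, the same cutoff can be applied as a post-filter on the results. The sketch below assumes hits saved in BLAST's 12-column tab-separated layout, where the 11th column is the E-value. The function name and the sample rows are invented for illustration, and none of this claims to show how CoGeBlast works internally.

```javascript
// Hypothetical helper: keep only hits whose E-value is at or below a cutoff.
// Assumes tab-separated rows in BLAST tabular layout:
// qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
function filterHitsByEvalue(tabularText, maxEvalue) {
    var kept = [];
    var lines = tabularText.split("\n");
    for (var i = 0; i < lines.length; i++) {
        var line = lines[i].trim();
        if (line === "" || line.charAt(0) === "#") continue; // blank or comment
        var fields = line.split("\t");
        if (fields.length < 12) continue;                    // malformed row
        var evalue = Number(fields[10]);
        if (!isNaN(evalue) && evalue <= maxEvalue) {
            kept.push({
                query: fields[0],
                subject: fields[1],
                identity: Number(fields[2]),
                evalue: evalue
            });
        }
    }
    return kept;
}

// Invented sample rows, for illustration only
var sample = [
    "geneA\triceGene1\t96.8\t642\t20\t0\t1\t642\t1\t642\t0.0\t1050",
    "geneB\triceGene2\t81.2\t150\t28\t1\t1\t150\t5\t154\t2e-12\t75.1",
    "geneC\triceGene3\t93.4\t400\t26\t0\t1\t400\t1\t400\t4e-160\t560"
].join("\n");

var strongHits = filterHitsByEvalue(sample, 1e-30);
console.log(strongHits.map(function (h) { return h.query; }));
```

With these made-up rows, geneB is dropped because 2e-12 is larger than the 1e-30 cutoff.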

When I woke up the next morning, I was stupefied to find that my brute force blast-n (DNA sequence) search had brought back more than 150 high-quality hits in the rice genome.

I later found 400 more bacterial genes, from Acidovorax, a common rice pathogen. (Enterobacter is not a known pathogen of rice, although it has been isolated from rice.)

But before you get the impression that this is some kind of major scientific find, let me cut the suspense right now by telling you the bottom line, which is that after many days of checking and rechecking my data, I no longer think there are really hundreds of horizontally transferred bacterial genes lurking in the rice genome. Oh sure, the genes are there, in the data (you can check for yourself), but this is actually just a sad case of garbage in, rubbish out. The Oryza sativa indica genome, I'm now convinced, suffers from sample contamination. That is to say: Bacterial cells were present in the rice sample prior to sequencing. Some of the bacterial genes were amplified and got into the contigs, and the assembly software dutifully spliced the bacterial data in with the rice data.

My first tipoff to the possibility of contamination (aside from finding several hundred bacterial genes where there shouldn't be any bacterial genes) came when I re-ran my BLAST searches using the most up-to-date copy of the indica genome. Suddenly, many of the hits I'd been seeing vanished. The most recent genome consists of 12 chromosome-sized contigs. The earlier genome I had been using had had the 12 chromosomes plus scores of tiny orphan contigs. When the orphan contigs went away, so did most of my hits.

When I looked at NCBI's master record for the Oryza sativa Indica Group, I noticed a footnote near the bottom of the page: "Contig AAAA02029393 was suppressed in Feb. 2011 because it may be a contaminant." (In actuality, a great many other contigs have been removed as well.)

When I ran my tests against the other sequenced rice genome, the Oryza sativa Japonica Group genome, I found no bacterial genes.

Contamination continues to plague the Indica Group genome. The 12 "official" chromosomes of Oryza sativa indica have Acidovorax genes all over the place, to this day. I suppose technically, it is possible those genes represent instances of horizontal gene transfer. But if that's what it is, then it's easily the biggest such transfer across species lines ever recorded. And it happened only in the indica variety of rice, not japonica. (The two varieties diverged 60,000 to 220,000 years ago.)

The following table shows some of the Acidovorax genes that can be found in the Oryza sativa Indica Group genome. This is by no means a complete list. Note that the Identities number in the far-right column pertains to DNA-sequence similarity, not amino-acid-sequence similarity.

Acidovorax Genes Occurring in the Published Oryza sativa indica Genome
| Query gene | Function | Rice gene | Query coverage | E-value | Identities |
|---|---|---|---|---|---|
| Aave_0021 | phospho-2-dehydro-3-deoxyheptonate aldolase | OsI_15236 | 100.0% | 0.0 | 93.6% |
| Aave_0289 | orotate phosphoribosyltransferase | OsI_36535 | 100.0% | 0.0 | 96.8% |
| Aave_0363 | lipoate-protein ligase B | OsI_15083 | 100.0% | 0.0 | 94.6% |
| Aave_0368 | F0F1 ATP synthase subunit B | OsI_15082 | 100.0% | 0.0 | 98.9% |
| Aave_0372 | F0F1 ATP synthase subunit beta | None | 100.1% | 0.0 | 98.2% |
| Aave_0373 | F0F1 ATP synthase subunit epsilon | OsI_15081 | 100.0% | 0.0 | 97.8% |
| Aave_0637 | twitching motility protein | OsI_37113 | 100.1% | 0.0 | 95.5% |
| Aave_0916 | general secretory pathway protein E | OsI_17332 | 86.9% | 0.0 | 96.6% |
| Aave_1272 | NADH-ubiquinone/plastoquinone oxidoreductase, chain 6 | OsI_28652 | 100.0% | 0.0 | 97.3% |
| Aave_1273 | NADH-ubiquinone oxidoreductase, chain 4L | OsI_28651 | 100.0% | 3e-174 | 100% |
| Aave_1301 | DedA protein (DSG-1 protein) | OsI_21534 | 97.3% | 0.0 | 96.8% |
| Aave_1312 | hypothetical protein | OsI_15703 | 99.8% | 0.0 | 93.4% |
| Aave_1948 | histidine kinase internal region | OsI_23297 | 100.0% | 0.0 | 96.3% |
| Aave_1950 | hypothetical protein | OsI_23296 | 100.0% | 0.0 | 96.6% |
| Aave_1957 | penicillin-binding protein 1C | OsI_15534 | 100.1% | 0.0 | 92.8% |
| Aave_1958 | hypothetical protein | OsI_15533 | 99.2% | 0.0 | 92.2% |
| Aave_2274 | major facilitator superfamily transporter | OsI_33140 | 95.1% | 0.0 | 92.5% |
| Aave_2484 | 2,3,4,5-tetrahydropyridine-2-carboxylate N-succinyltransferase | OsI_19753 | 100.0% | 0.0 | 97.3% |
| Aave_3000 | ferrochelatase | OsI_33935 | 100.0% | 0.0 | 96.2% |

So let this be a lesson to DIY genome-hackers everywhere. If you find what you think are dozens of putative horizontally transferred genes in a large genome, stop and consider: Which is more likely to occur, a massive horizontal gene transfer event involving several dozen genes crossing over into another life form, or contamination of a lab sample with bacteria? I think we all know the answer.

Many thanks to professor Jonathan Eisen at U.C. Davis for providing valuable consultation.

Thursday, August 15, 2013

Converting an SVG Graph to Histograms

The graphs you get from ZunZun.com (the free graphing service) are pretty neat, but one shortcoming of ZunZun is that it won't generate histograms. (Google Charts will do histograms, but unlike ZunZun, Google won't give you SVG output.) The answer? Convert a ZunZun graph to histograms yourself. It's only SVG, after all. It's XML; it's text. You just need to edit it.

Of course, nobody wants to hand-edit a zillion <use> elements (to convert data points to histogram rects). It makes more sense to do the job programmatically, with a little JavaScript.

In my case, I had a graph of dinucleotide frequencies for Clostridium botulinum coding regions. What that means is, I tallied the frequency of occurrence (in every protein-coding gene) of 5'-CpG-3', CpC, CpA, CpT, ApG, ApA, ApC, and all other dinucleotide combinations (16 in all). Since I already knew the frequency of G (by itself), A, C, and T, it was an easy matter to calculate the expected frequency of occurrence of each dinucleotide pair. (For example, A occurs with frequency 0.403, whereas G occurs with frequency 0.183. Therefore the expected frequency of occurrence of the sequence AG is 0.403 times 0.183, or about 0.0737.) Bottom line, I had 16 expected frequencies and 16 actual frequencies, for 16 dinucleotide combos. I wanted side-by-side histograms of the frequencies.
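If you want to reproduce this kind of tally yourself, here is a minimal sketch of the calculation. It is not the script I actually used. It assumes dinucleotides are counted in overlapping windows over a single DNA string, and the function name `dinucleotideStats` is made up for this example.

```javascript
// Sketch: expected vs. observed dinucleotide frequencies for one sequence.
// Assumption: pairs are counted in overlapping windows (positions i and i+1).
function dinucleotideStats(seq) {
    seq = seq.toUpperCase();
    var bases = ["A", "G", "T", "C"];
    var baseCount = { A: 0, G: 0, T: 0, C: 0 };
    var pairCount = {};
    var totalBases = 0;
    var totalPairs = 0;

    for (var i = 0; i < seq.length; i++) {
        var b = seq.charAt(i);
        if (!(b in baseCount)) continue;       // ignore N and other symbols
        baseCount[b]++;
        totalBases++;
        var next = seq.charAt(i + 1);          // "" at the end of the string
        if (next in baseCount) {
            var pair = b + next;
            pairCount[pair] = (pairCount[pair] || 0) + 1;
            totalPairs++;
        }
    }

    // expected = f(first base) * f(second base); observed = pair count / total pairs
    // (an empty sequence yields NaN values; callers should check for that)
    var stats = {};
    for (var j = 0; j < bases.length; j++) {
        for (var k = 0; k < bases.length; k++) {
            var key = bases[j] + bases[k];
            stats[key] = {
                expected: (baseCount[bases[j]] / totalBases) *
                          (baseCount[bases[k]] / totalBases),
                observed: (pairCount[key] || 0) / totalPairs
            };
        }
    }
    return stats;
}

var demo = dinucleotideStats("AAGA");
console.log(demo.AG.expected, demo.AG.observed);
```

For a real genome you would feed in the concatenated (or per-gene) coding sequences rather than a four-letter toy string.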

First, I went to ZunZun and entered my raw data in the ZunZun form. Just so you know, this is what the raw data looked like:

0 0.16222793723642806
1 0.11352236777965981
2 0.07364933857345456
3 0.08166221769088752
4 0.123186555838253
5 0.12107590293804558
6 0.043711462078314355
7 0.03558766171971166
8 0.07364933857345456
9 0.07262685957145093
10 0.033435825941632816
11 0.03459042802303202
12 0.055925067612781175
13 0.042792101322514244
14 0.019844425842971265
15 0.02730405457750352
16 0.123186555838253
17 0.12232085101526233
18 0.055925067612781175
19 0.05502001002972254
20 0.09354077847378013
21 0.07321410524577443
22 0.03319196776961071
23 0.028600012050969865
24 0.043711462078314355
25 0.043328337600588136
26 0.019844425842971265
27 0.0062116692282947845
28 0.03319196776961071
29 0.04195172151930211
30 0.011777822917388797
31 0.015269662767317132
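
The rows above alternate between expected values (even index) and actual values (odd index), so that each pair of adjacent points becomes a pair of side-by-side bars. A short sketch of producing rows in that layout from two arrays follows; the function name `toIndexedRows` is hypothetical.

```javascript
// Sketch: interleave expected and observed values into "index value" rows.
// Even indexes hold expected values, odd indexes hold observed values.
function toIndexedRows(expected, observed) {
    if (expected.length !== observed.length) {
        throw new Error("expected and observed must have the same length");
    }
    var rows = [];
    for (var i = 0; i < expected.length; i++) {
        rows.push((2 * i) + " " + expected[i]);
        rows.push((2 * i + 1) + " " + observed[i]);
    }
    return rows.join("\n");
}

console.log(toIndexedRows([0.5, 0.25], [0.4, 0.3]));
```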


I made ZunZun graph the data, and it gave me back a graph that looked like this:



Which is fine except it's not a histogram plot. And it has goofy numbers on the x-axis.

I clicked the SVG link under the graph and saved an SVG copy to my local drive, then opened the file in Wordpad.

The first thing I did was locate my data points. That's easy: ZunZun plots points as a series of <use> elements. The elements are nested under a <g> element that looks like this:

<g clip-path="url(#p0c8061f7fd)">

I hand-edited this element to have an id attribute with value "DATA":

<g id="DATA" clip-path="url(#p0c8061f7fd)">

Next, I scrolled up to the very top of the file and found the first <defs> tag. Under it, I placed the following empty code block:

<script type="text/ecmascript"><![CDATA[
// code goes here

]]></script>

Then I went to work writing code (to go inside the above block) that would find the <use> elements, get their x,y values, and create <rect> elements of a height that would extend to the x-axis line.

The code I came up with looks like this:



// What is the SVG y-value of the x-axis?
// Attempt to discover by introspecting clipPath

function findGraphVerticalExtent( ) {
   var cp = document.getElementsByTagName('clipPath')[0];
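   // Look the <rect> up by tag name rather than by child index,
   // which would depend on whitespace text nodes in the file
   var rect = cp.getElementsByTagName('rect')[0];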
   var top = rect.getAttribute('y') * 1;
   var bottom = rect.getAttribute('height') * 1;
   return top + bottom;
}


// This is for use with SVG graphs produced by ZunZun,
// in which data points are described in a series of
// <use> elements. We need to get the list of <use>
// nodes, convert it to a JS array, sort data points by
// x-value, and replace <use> with <rect> elements.

function changeToHistograms( ) {

   var GRAPH_VERTICAL_EXTENT = findGraphVerticalExtent( );

   // The 'g' element that encloses the 'use' elements
   // needs to have an id of "DATA" for this to work!
   // Manually edit the <g> node's id first!
   var data = document.getElementById( "DATA" );

   // NOTE: The following line gets a NodeList object,
   // which is NOT the same as a JavaScript array!
   var nodes = data.getElementsByTagName( "use" );

   // utility routine (an inner method)
   function nodeListToJavaScriptArray( nodes ) {

       var results = [];

       for (var i = 0; i < nodes.length; i++)
          results.push( nodes[i] );

       return results;
   }

   // utility routine (another inner method)
   function compareX( a,b ) {
       return a.getAttribute("x") * 1 - b.getAttribute("x") * 1;
   }

   var use = nodeListToJavaScriptArray( nodes );

   // We want the nodes in x-sorted order
   use.sort( compareX ); // presto, done

   // Main loop
   for (var i = 0; i < use.length; i++) {

       var rect =
           document.createElementNS("http://www.w3.org/2000/svg", "rect");
       var item = use[i];
       var x = item.getAttribute( "x" ) * 1;
       var y = item.getAttribute( "y" ) * 1;
       var rectWidth = 8;
       var rectHeight = GRAPH_VERTICAL_EXTENT - y;
       rect.setAttribute( "width", rectWidth.toString() );
       rect.setAttribute( "height", rectHeight.toString() );
       rect.setAttribute( "x" , x.toString() );
       rect.setAttribute( "y" , y.toString() );

       // We will alternate colors, pink/purple
       // (hex colors need a leading '#' to be valid)
       rect.setAttribute( "style" ,
           (i%2==0)? "fill:#ce8877;stroke:none" : "fill:#8877dd;stroke:none" );

       data.appendChild( rect ); // add a new rect
       item.remove(); // delete the old <use> element
   }

   return use;
}

As so often happens, I ended up writing more code than I thought it would take. The above code works fine for converting data points to histogram bars (as long as you remember to give that <g> element the id attribute of "DATA" as mentioned earlier). But you need to trigger the code somehow. Answer: insert onload="changeToHistograms( )" in the <svg> element at the very top of the file.
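
The geometry inside the main loop can be tried out without a browser. The sketch below pulls it into a pure function; `barFromPoint` and the sample coordinates are invented for the example. The one thing to keep in mind is that SVG y-values grow downward, so a bar's height is the axis y-value minus the point's y-value.

```javascript
// Hypothetical helper, not part of the script above: given a data point's
// SVG coordinates and the y-value of the x-axis, describe a bar that runs
// from the point down to the axis. Negative heights are clamped to zero,
// since SVG does not allow a negative rect height.
function barFromPoint(x, y, axisY, width) {
    return { x: x, y: y, width: width, height: Math.max(0, axisY - y) };
}

// Made-up points, deliberately out of x order
var points = [{ x: 120, y: 200 }, { x: 40, y: 80 }, { x: 80, y: 310 }];

// Same idea as compareX: sort by x before building bars
points.sort(function (a, b) { return a.x - b.x; });

var bars = points.map(function (p) {
    return barFromPoint(p.x, p.y, 300, 8);
});
console.log(bars);
```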

But I wasn't done, because I also wanted to apply data labels to the histogram bars (labels like "CG," "AG," "CC," etc.) and get rid of the goofy numbers on the x-axis.

This is the function I came up with to apply the labels:


function applyLabels( sortedNodes ) {

   var labels = ["aa", "ag", "at", "ac",
       "ga", "gg", "gt", "gc", "ta", "tg",
       "tt", "tc", "ca", "cg", "ct", "cc"];

   var data = document.getElementById( "DATA" );
   var labelIndex = 0;

   // Step by 2: one label per pair of bars
   for (var i = 0; i < sortedNodes.length; i += 2) {
       var text =
           document.createElementNS("http://www.w3.org/2000/svg", "text");
       var node = sortedNodes[i];
       text.setAttribute( "x", String( node.getAttribute("x") * 1 + 2 ) );
       text.setAttribute( "y", String( node.getAttribute("y") * 1 - 13 ) );
       text.setAttribute( "style", "font-size:9pt" );
       text.textContent = labels[ labelIndex++ ].toUpperCase();
       text.setAttribute( "id", "label_" + labelIndex );
       data.appendChild( text );
   }
}
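
The hard-coded label array follows a regular pattern: every ordered pair of the four bases, taken in the order a, g, t, c. If you'd rather not type the list by hand, something like the following sketch would build it; `dinucleotideLabels` is a made-up name.

```javascript
// Sketch: build all ordered two-letter combinations of the given bases.
function dinucleotideLabels(bases) {
    var labels = [];
    for (var i = 0; i < bases.length; i++) {
        for (var j = 0; j < bases.length; j++) {
            labels.push(bases[i] + bases[j]);
        }
    }
    return labels;
}

var generatedLabels = dinucleotideLabels(["a", "g", "t", "c"]);
console.log(generatedLabels.join(" "));
```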


And here's a utility function that can strip numbers off the x-axis:

// Optional. Call this to remove ZunZun graph labels.
// Pass [1,2,3,4,5,6,7,8,9] to remove x-axis labels.
function removeZunZunLabels( indexes ) {

   for (var i = 0; i < indexes.length; i++) {
       try {
           document.getElementById( "text_" + indexes[i] ).remove();
       }
       catch(e) {
           // getElementById returned null: no element with that id
           console.log( "Index " + indexes[i] + " not found; skipped." );
       }
   }
}
  
BTW, if you're wondering why I multiply so many things by one, it's because the x and y attribute values you get back from getAttribute() in SVG are strings. If you add them, you're concatenating strings, which is not what you want. To convert a number in string form to an actual JavaScript number (so you can add numbers and not concatenate strings), you can either multiply by one or explicitly coerce a string to a number by doing Number( x ).
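
Here is a quick illustration of the difference, using a made-up attribute value. It also shows parseFloat, which behaves differently from Number when the string carries trailing text such as a unit.

```javascript
var xAttr = "100";                        // what getAttribute("x") hands back

var concatenated = xAttr + 2;             // string concatenation
var viaMultiply  = xAttr * 1 + 2;         // numeric addition
var viaNumber    = Number(xAttr) + 2;     // numeric addition
var viaParse     = parseFloat(xAttr) + 2; // numeric addition

console.log(concatenated, viaMultiply, viaNumber, viaParse);

// Number() rejects trailing text; parseFloat() reads the leading number
console.log(Number("12px"), parseFloat("12px"));
```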

The final result of all this looks like:


Final graph after surgery. Expected (pink) and actual (purple) frequencies of occurrence of various dinucleotide sequences in C. botulinum coding-region DNA.

Which is approximately what I wanted to see. The labels could be positioned better, but you get the idea.

What does the graph show? Well first of all, you have to realize that the DNA of C. botulinum is extremely rich in adenine and thymine (A and T): Those two bases constitute 72% of the DNA. Therefore it's absolutely no surprise that the highest bars are those that contain A and/or T. What's perhaps interesting is that the most abundant base (A), which should form 'AA' sequences at a high rate, doesn't. (Compare the first bar on the left to the shorter purple bar beside it.) This is especially surprising when you consider that AAA, GAA, and AAT are by far the most-used codons in C. botulinum. In other words, 'AA' occurs a lot, in codons. But even so, it doesn't occur as much as one would expect.

It's also interesting to compare GC with CG. (Non-biologists, note that these two pairs are not equivalent, because DNA has a built-in reading direction. The notation GC, or equivalently, GpC, means there's a guanine sitting on the 5' side of cytosine. The notation CG means there's a guanine on the 3' side of cytosine. The 5' and 3' numbers refer to deoxyribose carbon positions.) The GC combo occurs more often than predicted by chance whereas the combination CG (or CpG, as it's also written) occurs much less frequently than predicted by chance. The reasons for this are fairly technical. Suffice it to say, it's a good prima facie indicator that C. botulinum DNA is heavily methylated. Which in fact it is.
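
To put rough numbers on these comparisons, you can divide each actual frequency by its expected frequency. The sketch below uses values copied from the raw data listed earlier, assuming that each pair of rows (2k, 2k+1) corresponds to the k-th label in the label array (so AA is rows 0 and 1, GC is rows 14 and 15, and CG is rows 26 and 27).

```javascript
// Values copied from the raw data above; the row-to-label mapping is an
// assumption based on the order of the label array.
var pairs = {
    AA: { expected: 0.16222793723642806,  observed: 0.11352236777965981 },
    GC: { expected: 0.019844425842971265, observed: 0.02730405457750352 },
    CG: { expected: 0.019844425842971265, observed: 0.0062116692282947845 }
};

// Ratio below 1 means under-represented; above 1 means over-represented
function observedToExpected(p) {
    return p.observed / p.expected;
}

for (var name in pairs) {
    console.log(name + ": " + observedToExpected(pairs[name]).toFixed(2));
}
```

The ratios come out to roughly 0.70 for AA, 1.38 for GC, and 0.31 for CG.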