24th May 2016
This is a bit of a rant.
The story, all names, characters, genomes and incidents portrayed in this blog post are fictitious.
No identification with actual persons (living, dead or undead), places, companies, and processes is intended or should be inferred.
No animals were harmed in the making of this blog post.
Let’s try answering a question:
why are there so many incomplete/draft bacterial genomes, and far fewer complete genomes?
The answer is simple: insufficient value/cost ratio.
This can also be summarized as the good enough principle: if something is good enough, it does not get improved.
Sample scenario 1.
Players: Principal Investigator (PI), Bacterial Genome (BG), Biologist (B), Sequencing Company (SC), (optional) Bioinformatician (oBI), Genomes Database (GD).
B is interested in working with BG, and gets PI's approval to sequence it.
Biomaterial is sent to SC, which sequences and even assembles the BG.
BG looks great overall and comes in just a handful of fragments.
oBI is (optionally) involved, to annotate and describe the BG.
B works happily with the BG, describing and characterizing all the interesting biosynthetic features it contains.
An article is prepared, and oBI is (optionally) involved again, to prepare and submit the BG to the GD.
While preparing the BG, oBI has to answer the question of whether this BG contains any plasmids.
Upon closer examination, oBI finds that one of the fragments is actually the complete chromosome, and all others are just unplaced fragments of it.
oBI knows that this genome could probably be merged into a single draft scaffold
using bioinformatics tools and manual examination in maybe a few days (or a week… or two?).
oBI also knows that with a little bit of B's help (a few primer walking experiments) it should be possible to have the complete BG within a month or two.
However, BG stays a draft, and is not going to be complete any time soon.
Let’s look at the motivations of all the players, and see if any of them wants the complete BG:
- PI wants publications; spending extra time/effort to make BG complete does not present any obvious benefits;
- BG wants to be left alone;
- B wants to publish exciting new findings; they are already supported by the draft BG, so there is clearly no need for a complete BG;
- SC was happy to get payment on time; SC is also proud to be able to provide genome assembly as an extra service with its (primary) sequencing offers;
- oBI has an interest in finishing the BG: it will then be complete; however, there are 5 other BGs awaiting processing, and the backlog of semi-written manuscripts only keeps growing… finishing this specific BG will not result in a perceived benefit to oBI;
- GD stores genomes; it doesn’t care much if the genome submitted could have been better.
Looks like none of the players sees benefits in actually finishing the BG,
simply because the effort spent (or time waited) does not bring any perceived benefit to any of them.
Sample scenario 2.
Players: Bacterial Genome (BG), Biologist (B), Sequencing Company (SC), non-optional Bioinformatician (noBI), Genomes Database (GD).
This time, B (who is interested in quickly publishing a short genome announcement) asks for noBI's help from the moment the BG is provided by the SC.
noBI has a cursory look at the BG, and although there is a huge discrepancy between thousands of contigs on the one hand and insanely high coverage on the other,
the BG otherwise appears good enough for further work, especially after scaffolding; after all, this is just a genome announcement, not a full-blown article!
There is also some weirdness about the coverage distribution of the BG, but noBI carelessly ignores that.
The BG is worked on: annotated, examined, described, prepared for submission to the GD.
Meanwhile, the announcement article is also nearly complete.
The genome is submitted, and GD's response comes back: some scaffolds contain orangutan and human DNA, and some scaffolds contain known adapter sequences in the middle…
“Oh crap”, thinks noBI, “I should have checked the raw reads for adapters and contamination, in spite of having the BG assembly already…”
The GD also kindly offers an easy way out: just remove the obviously-orangutan scaffolds, and remove/mask/discard adapter sequences.
This is the easy way, leading to a quicker genome announcement, and a slight bump to the personal publication records of both B and noBI.
The right way is, of course, to clean the raw reads of adapters and contamination, re-assemble, re-scaffold, re-annotate, re-describe the BG,
then prepare again for submission. This can delay the quick genome announcement by about a week,
but will very likely result in a more contiguous and more correct BG – although still not complete.
As we have learned from Scenario 1, perceived benefits of going the right way (as opposed to the easy way) are nearly non-existent…
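For what it is worth, the check noBI skipped is not hard to sketch. Below is a minimal Python illustration, assuming exact matches only and a single adapter string; the `ADAPTER` value and the function names are placeholders made up for this example, so substitute the sequences of your own library kit. It is not a replacement for a dedicated trimming and contamination-screening step on the raw reads (which must also handle mismatches and partial adapters); it only shows how little it takes to notice an adapter sitting in the middle of a scaffold.

```python
# Illustrative adapter sequence (an assumption for this sketch);
# replace it with the adapters of your own library kit.
ADAPTER = "AGATCGGAAGAGC"

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_adapter_hits(seq, adapter=ADAPTER):
    """Return sorted (start, strand) tuples for every exact adapter match.

    Both the adapter and its reverse complement are searched,
    and overlapping matches are reported too.
    """
    seq = seq.upper()
    adapter = adapter.upper()
    hits = []
    for strand, pattern in (("+", adapter), ("-", reverse_complement(adapter))):
        start = seq.find(pattern)
        while start != -1:
            hits.append((start, strand))
            start = seq.find(pattern, start + 1)
    return sorted(hits)


def mask_adapters(seq, adapter=ADAPTER):
    """Replace every exact adapter match (either strand) with Ns."""
    masked = list(seq)
    for start, _strand in find_adapter_hits(seq, adapter):
        masked[start:start + len(adapter)] = "N" * len(adapter)
    return "".join(masked)
```

Running `find_adapter_hits` over every scaffold before submission would flag the "adapter in the middle" cases; `mask_adapters` corresponds to the easy way out, not the right way.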
I manually finalized a genome a few years ago.
I had some good-quality data, obtained an initial assembly of 300-something contigs,
then scaffolded and manually finalized it down to about 10 scaffolds.
There was simply not enough evidence (data) to keep merging scaffolds, so I had to stop.
Nowadays, as bacterial genome sequencing prices are akin to weekend supermarket shopping expenses,
nobody is going the extra mile to produce a better quality, more contiguous, or even a complete genome.
And this feels sad…
On the other hand, consumer markets have functioned like that for decades.
An old water heater with a failed heating element is not repaired: it is replaced by a new water heater,
because the cost of the human time needed to repair the old one is higher than the price of a new one.
Funnily, universal basic income might change that: without the need to spend 40+ hours a week at work
(and thus being unable to repair that water heater on one’s own),
one might just order that heating element and fix it – instead of buying a new heater.
Would universal basic income have the same effect on draft and incomplete bacterial genomes? I have no idea.