Compressing DNA: The future plan

I talked about our work earlier this year on compressing next generation sequencing data in the first part of this series of three blogs; my second post outlined the engineering task ahead of us (mainly Guy, Rasko and Vadim). This third post gets into the fundamental question of whether this can scale “for the foreseeable future”. By “scale” here, I mean a very simple question – can we (the ENA) store the world’s public research DNA output within the fixed disk budget we spend each year? (This applies to the ENA as an archive, and equally to each sequencing centre or lab-with-a-machine with respect to its own output.)

A note of caution – the raw disk cost is just one component of running an archive or a sequencing centre, and actually not the largest one (though its growth is concerning); please don’t feel that if we make the disk part of this problem “solved” then somehow all the other costs of submission management, meta-data management and output presentation – let alone appropriate analysis – are magically solved. There is a lot more to an archive – or a sequencing centre, or a lab – than just storing stuff.

At one level this is hard; the foreseeable future is … unpredictable. The breakneck speed of technology innovation in DNA sequencing might plateau; some people think that other aspects of the DNA sequencing process – perhaps sample acquisition, or the electricity to run the machines, or downstream informatics – will be the choke point in the future, and so the growth of DNA sequencing will slow. Alternatively, storage might get radically cheaper thanks to your choice of technologies – holographic storage, or physical storage, or a new type of optical disk. These “risks” are all risks in our favour as archives and informaticians; we need to worry far more about whether there is a sustained increase in sequencing technology (doubling every 6 months per $ spent) for the next 5 years. As disk technology doubles only every 18 to 24 months, there is a gap of around 3 fold each year that we need to make up – in other words, at constant cost we have to “win” about 3 fold in compression technology each year.
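To make the arithmetic behind that “3 fold” concrete, here is a quick back-of-the-envelope sketch in Python, using the rough doubling times quoted above (assumptions, not measurements):

```python
# Back-of-the-envelope: how big is the annual gap between sequencing growth
# and disk growth? Doubling times are the rough figures from the text.
seq_doubling_months = 6            # sequencing per $ doubles ~every 6 months
disk_doubling_months = (18, 24)    # disk per $ doubles ~every 18-24 months

seq_growth_per_year = 2 ** (12 / seq_doubling_months)   # ~4x per year

for d in disk_doubling_months:
    disk_growth_per_year = 2 ** (12 / d)
    gap = seq_growth_per_year / disk_growth_per_year
    print(f"disk doubling every {d} months: "
          f"disk grows {disk_growth_per_year:.2f}x/year, "
          f"gap to close ~{gap:.1f}x/year")
# -> a gap of roughly 2.5-2.8x per year, i.e. "about 3 fold" of compression
#    improvement needed each year at constant spend.
```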

First off – to dispel some commonly suggested “solutions” to this which we ourselves have seriously considered but then (reluctantly) realised don’t change the problem enough. Firstly, discarding “old data” doesn’t change the problem enough to impact the curves. As the observed growth is exponential, one is always storing more next month than one did last month, and very quickly the old parts of the archive are not a big component of the total storage (see the sketch below). Realistically one is always making a decision about whether one can handle the next 6 months or so of data. Sorting data into “valuable” (eg, one-off cancer samples) and “less valuable” (eg, repeated cell line sequencing) is going to be useful for some decisions, but the naive business of discarding the less valuable sequences only gives you a one-off win proportional to what you discard; as you’ll see below, this doesn’t actually make the problem scalable. And, in any case, much of the data growth is in high value, one-off samples, such as cancer samples, which have significant sample acquisition costs. It would feel very weird to be in the situation where we can sequence large numbers of cancer genomes to help discover new aspects of cancer biology… but can’t actually reanalyse those genomes in 5 years’ time because we can’t store them.
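As a small illustration of the “one-off win” point – if yearly output grows roughly 4 fold (an assumption for illustration; the text only says growth is exponential), everything older than a year is only ever a modest slice of the archive:

```python
# Sketch: under exponential growth, "old" data is never the bulk of the
# archive, so discarding it buys a one-off, bounded saving.
# The 4x/year growth rate is an illustrative assumption.
growth = 4.0
years = 6
yearly = [growth ** y for y in range(years)]    # data produced in each year
total = sum(yearly)

old_fraction = sum(yearly[:-1]) / total         # everything but the latest year
print(f"fraction of archive older than one year: {old_fraction:.0%}")
# -> ~25%: deleting all of it frees a quarter of the disk, once, while
#    next year's intake alone will be ~4x this year's.
```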

A repeated suggestion, though, is to store this as DNA – as the sample itself. At first glance this seems quite intriguing; DNA is a good storage medium (dense, well understood, doesn’t consume much electricity to store), and you have the sample in hand by definition. One major issue is that for some of the most important samples – eg, cancers – one often exhausts the sample itself in sequencing. But there are also some gnarly practical issues to “just store the sample” – it implies that to pool the analysis of a current set of cancer genomes with a set of cancer genomes in 5 years’ time, one would have to order those samples and fully resequence them. That’s quite a bit of logistics and sequencing machine time, and it implies not only a large amount of sample material available to send out to each group that wants it, but also highly time efficient and low cost resequencing in the future (not an impossible suggestion, but by no means certain to happen). Then there are the practicalities of freezer management and shipping logistics (does anyone know the doubling time of freezer capacity in terms of fixed $ spend over time?). Doing this in a centralised way is akin to running biobanks – themselves neither a trivial nor a particularly cheap undertaking – while doing it in a decentralised way means trusting everyone to run a biobank level of service in frozen sample tracking and shipping (or you accept that some proportion of samples will be irretrievable, but if we accept a high level of data loss, then there are quite a few other options open). Finally, this would mean that only places with good, efficient sequencing machines could participate in this level of scientific data reuse – sequencing machines might be democratised, but they are nowhere near the reach of hard drives and CPUs. Although I think this is an interesting idea to think about, I just can’t see it as a practical and responsible way of storing our DNA data over the next 5 years.

So – can we find a better solution to this scaling problem? In particular, will data compression allow us to scale?

You can view this problem in two ways. Either we need technology to provide a sustained 3 fold improvement in performance every year, year on year – and it is going to be hard to have a credible solution for that with an open ended time window, as the solution has to deliver an integer factor improvement each year. Or we can give ourselves a fixed time window – say 5 years, or 10 years – and ask to manage within the current year-on-year budget for that time. We then appeal to our paymasters that a 5 or 10 year horizon is far enough away that many, many things can change in this time, and it’s pointless to try to second guess things beyond it. This latter proposition is more bounded, though still challenging. It means over 5 years we need ~250 fold compression, and over 10 years in theory it’s a whopping ~60,000 fold.
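For the record, those figures are simply the ~3 fold annual shortfall compounded over the window (a sketch, assuming the gap stays at 3x throughout):

```python
# Where ~250-fold (5 years) and ~60,000-fold (10 years) come from:
# the ~3x/year shortfall compounded over the chosen time window.
gap_per_year = 3.0
for horizon_years in (5, 10):
    needed = gap_per_year ** horizon_years
    print(f"{horizon_years}-year window: ~{needed:,.0f}-fold compression needed")
# 5 years  -> ~243-fold    ("~250 fold")
# 10 years -> ~59,049-fold ("~60,000 fold")
```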

(This prediction also implies that in 10 years the cost of a genome is below $1, at which point the “there will be another choke point” argument seems likely to kick in. It is also, I think, pretty safe to say that healthcare has to take over as the major user at some point during the 10 year horizon. That doesn’t change the fundamental mismatch between sequencing and storage technology, but it does mean there is a whole other portion of society which will be grappling with this problem; for example, my colleague Paul Flicek reckons that healthcare will not want to store read level data, due to a combination of data flow issues and liability issues. This again is a “risk in our favour”.)

Setting ourselves realistic goals, my feeling is that we’ve got to be confident we can deliver at least 200 fold compression (worst case, this takes us around 4 years out) and aim for 2,000 fold compression (worst case, somewhere near 6–7 years out, though other factors may well kick in before then). This means a genome dataset will be around a gigabyte at 200 fold compression, or well below that at 2,000 fold, and it’s likely that in the scientific process there will be many other data files (or even just emails!) larger than the genome data.

This isn’t actually too bad. We showed in the paper – and all the current engineering points the same way – that reference based compression gives a one-off 10-fold reduction on real datasets now. So we have a factor of 20 still to find to reach the lower bound. The upper bound, though, is still challenging.

In the original paper we pointed out that the compression scheme gets better with both coverage and read length. The shift from medical style sequencing at around 4x (1,000 Genomes style), to 20/30x personal genome style, to 30–50x cancer genome sequencing means this density increase is occurring per sample at the moment. There is also a steady increase in read length. So both areas of technology growth “positively” impact our compression scheme. The “10 fold” number I quoted above is for a 30x high coverage genome (sadly!), but improvements in read length might give us another 4 fold (this is predicting something like 500bp reads over the time window we are considering). Finally, in the paper we note that both the absolute compression rate and the way the compression responds to read length are altered by the quality budget. So if, over time, we reduce the quality budget from, say, 2% of identical residues to 0.5% of identical residues, and if we argue that the trend on high value genomes (cancer samples) will be to go somewhat higher in density (50 to 60x perhaps), we might get another 4–5 fold win. 10 x 4 x 5 == 200. Lower bound reached. Phew!
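Spelling out how those independent wins multiply up (the individual factors are the rough estimates above, not measurements):

```python
# How the separate wins compound to the ~200-fold lower bound.
factors = {
    "reference based compression on a 30x genome, today": 10,
    "longer reads (~500bp over the window)":               4,
    "tighter quality budget + higher coverage (50-60x)":   5,
}
total = 1
for name, factor in factors.items():
    total *= factor
    print(f"x{factor:<2} {name} (cumulative: {total}x)")
# -> cumulative 200x: the lower bound.
```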

But there is a fly in the ointment here. The “quality budget” concept actually covers both identical and mismatched bases combined, and thus there is a “floor” to the quality budget set by the read error rate. This is below 1% for Illumina, but definitely above 0.1%; for color space reads from ABI SOLiD, the “color error” rate is far higher (the properties of color space mean you don’t see these errors in standard analyses), and PacBio reads currently have a quoted error rate of around 10% or higher. Let’s leave aside questions of the best mix and/or cost effectiveness of different technologies, and accept that the error rate in the reads is quite likely to be higher than our desired quality budget (perhaps dramatically higher). All in all, it looks like we’ve got quite a fundamental problem to crack here.

Problems of course are there to be solved, and I am confident we can do this. In many ways it’s obvious – if we’ve reached the end of where we can go with lossy compression on qualities, we’ve got to be lossy on base information as well. But which bases? And if we’re happy about losing bases, why not chuck all of them out?

Going back to my original post about video compression, lossy compression schemes ask you to think about what data you know you can discard – in particular, a good understanding of the noise processes allows us to rank all the information from “I am certain I don’t need this for future analysis” to “I am certain I need this for future analysis”, and to think far more about the first, uninformative end of that ranking as the part to discard.

Because we understand sequencing quite well, we actually have a lot of apparent “information” in a sequencing output which, considered in aggregate, we can be very confident comes from the sequencing process itself. When we see 30 reads crossing a base pair of a normal genome, and just one read shows a difference (and that difference is low quality…), then the probability that this is due to a sequencing error process is very high; the alternative hypothesis, that it is something biological, is low. Note of course that if I told you this was a cancer genome you would definitely change the probability of it being a biological process. In effect, we can do “partial error correction” on the most confident errors, leading to a more compressible dataset.
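A hedged sketch of that intuition in code – the error rate and the prior on a heterozygous site below are illustrative assumptions (the post later quotes ~10^-4 for heterozygous sites), not measured values:

```python
from math import comb

def binom_pmf(n, k, p):
    """Probability of exactly k 'successes' in n trials with per-trial rate p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

depth, mismatches = 30, 1   # 30 reads cross the site, one disagrees
error_rate = 0.01           # assumed per-base sequencing error rate (illustrative)
het_prior = 1e-4            # assumed prior on a heterozygous site in a normal genome

# Likelihood of seeing exactly one mismatching read under each hypothesis
p_obs_if_error = binom_pmf(depth, mismatches, error_rate)  # sequencing error only
p_obs_if_het   = binom_pmf(depth, mismatches, 0.5)         # true het: ~half the reads differ

posterior_het = (p_obs_if_het * het_prior) / (
    p_obs_if_het * het_prior + p_obs_if_error * (1 - het_prior)
)
print(f"P(real variant | 1 low-support mismatch in {depth} reads) ~ {posterior_het:.1e}")
# -> vanishingly small; with a cancer genome prior (subclonal variants,
#    copy number change) the model, and hence this confidence, changes.
```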

And not only can we change our prior on the “biology” side (for example, normal vs tumor samples), and with it the aggressiveness of this error correction, but we can also borrow from video compression the idea of doing this adaptively across a set of DNA reads; ie, inside one dataset we might identify regions of the genome where we have high confidence that we are looking at a “normal, well behaved” diploid genome, and use the most aggressive error correction in those regions. We would of course keep track of which regions we error corrected (but not the errors themselves; that’s the whole point of making this more compressible) and the parameters used, but this would be a small overhead. So one might imagine that in a “normal” genome one could have three tiers: 1% of the genome (the really nasty bits) with all mismatches and many identical-to-reference qualities stored (quality budget of 2%); a second tier of 10% of the genome where all differences are stored with qualities (quality budget of 0.5%); and then 90% of the genome where the top 95% of confident errors are smoothed out and biological differences (homozygous SNPs) are treated as an edit string on the reference, leaving an apparent error rate of ~0.05% – still above the rate of heterozygous sites, which is about 10^-4. Overall the mean “quality budget” is 0.11% or so. This might get us down to something like 0.04 bits/base, around 500 fold better than the data footprint now. One might be able to do a bit better with a clever way to encode the heterozygous sites, and the homozygous sites taking into account known haplotypes, though we will be doing very well in terms of compression if this is what we’re worrying about. To achieve 2,000 fold compression from now, we’d need to squeeze out another factor of 4, though I am still making the assumption of 500bp reads here, which might well be exceeded by then. In any case, once we’re in the framework of lossy compression on both the bases and the qualities, there is a more straightforward scientific question to pose – to what extent does this loss impact your analysis? This might be easier to ask of (say) medical resequencing of normal diploids, and harder of heterogeneous tumor samples, but at least it’s a bounded question.
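The three-tier arithmetic above, written out (a sketch using the tier sizes and quality budgets straight from the text):

```python
# Weighted mean quality budget for the three-tier scheme described above.
tiers = [
    # (fraction of genome, quality budget in % of bases keeping qualities)
    (0.01, 2.0),    # really nasty bits: all mismatches + many identical-to-reference
    (0.10, 0.5),    # second tier: all differences stored with qualities
    (0.90, 0.05),   # well-behaved diploid regions after aggressive error smoothing
]
mean_budget = sum(frac * budget for frac, budget in tiers)
print(f"mean quality budget ~ {mean_budget:.3f}%")   # ~0.115%, i.e. the "0.11%" quoted above
```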

Indeed, as soon as you start thinking about the quality budget not being uniform – either between samples (due to scientific value) or within a sample – it begs the question of why we don’t look at this earlier, to help mitigate whether we need to (say) discard unalignable and unassemblable reads, something which would otherwise be hard to justify. And, indeed, when you think about doing reference based compression on ABI SOLiD color space reads, you’ve got to think about this sort of scheme from the get go, otherwise the compression just won’t give you the results you want. This is not surprising – after all, discarding “confident errors” is precisely what the color space aligners are doing in their reconciliation of a color space read to a reference.

A critic might argue that all I am stating here is that we should analyse our data and store the analysis. Although there are clear parallels between “optimal analysis” and “appropriate lossy data compression” – at some limit they must meet to describe the entire dataset – it is, I think, a fundamental change in perspective to think about what information one can lose rather than what information one should keep. For example, not only can one rate how likely something is to be an error, but in effect weight this by how confident you are in the model itself. This encourages us to be a bit more generous about the information in regions we don’t have good models for (CNVs, segmental duplications, etc.) at the expense of regions we do have good models for. Nevertheless, good compression is clearly closely tied to good models of the data; the better we understand the data, the better we can compress it. No surprise.

Coming back to the here and now – this more aggressive error smoothing scheme is only a twinkle in my eye at the moment. We (and by that I really mean Guy, Rasko, Vadim and the rest of the ENA crew…) still have a considerable amount of entirely practical work to complete this year on the mundane business of good, solid, working code; all these things need to be critically tested on real life examples and by people both inside and outside the EBI – the near term things which we have a path for in the next year, and the longer term things such as this error smoothing process. And indeed, we might not need such aggressive schemes, as technology might change in our favour – either slower DNA growth or faster disk growth. But I wanted to let people know our current planning. I’ve always loved this quote from General Eisenhower (apparently himself quoting Army teaching at West Point):

“Plans are worthless, but planning is everything”.