Engineering around reference-based compression

(This is my second blog post about DNA compression, focusing on the practicalities of compression.)

The reference-based DNA compression which I talked about in my previous blog post was a mixture of theoretical results and real-life examples taken from submitted datasets. But it was definitely a research paper, and the code that Markus wrote was Python code with some awful optimisations in the coder/decoder streams (the "codec", as it's called by the pros). So it… didn't have a great runtime. This was one of the points we had to negotiate with the reviewers, and there's a whole section in the discussion about how this is going to work in the future with nice, clean engineering.

So – what are the practicalities of reference-based DNA compression? Well, first off, this has passed from my area (more research and development) to my colleague Guy Cochrane's area. Guy runs the European Nucleotide Archive (ENA) – the EBI's integrated resource for DNA archives, which spans "traditional" assemblies and annotation (the EMBL data bank, for the people who remember that) and the more modern "Sequence Read Archive" (SRA) submissions. ENA has a unified namespace for all these objects and a progressively harmonised underlying data model. Guy brought in Rasko Leinonen (ENA's technical lead), and they took on the task of creating appropriate production-level code for this.

Some of the decisions taken by Guy and Rasko are: the first code base is in Java, aiming to be compatible with the existing Picard toolkit from the Broad (Picard reads/writes the successful SAM/BAM formats, alongside the C-based samtools); the format copies much of BAM in terms of its header and meta-data structure; and the format has to consider all the practicalities of both external and internal references (see below). The format has to be written with an eye to indexability (so people can slice into it), and there is some discussion about having hooks for future extensibility, though one has to be careful not to bloat the format – after all, the whole point of the format is that it doesn't take up space. Vadim Zalunin is the lead engineer on this, and there has been an extensive transfer of knowledge from Markus to Vadim from January onwards. Finally we had to hit on a good name, and we've chosen "cram". (We know the name cram has been used in other computing contexts, e.g. cramfs in Linux systems, but there are so many uses of cram in computer science already that we don't think one more name clash is a headache. And it's too good a name…)
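
To make the reference-handling and indexing points concrete, here's a toy sketch – in Python rather than the Java production code, and with names I've invented purely for illustration, so this is emphatically not the actual CRAM layout – of the kind of bookkeeping a reference-based record format needs:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Reference:
    """A reference the records are compressed against.

    External references are identified (e.g. by name plus checksum) and fetched
    at decode time; internal references are embedded in the container itself,
    for the 'pack the reference inside the data package' case.
    """
    name: str
    md5: Optional[str] = None           # identity of an external reference
    embedded_seq: Optional[str] = None  # sequence carried inside the file

@dataclass
class Record:
    """A read stored as its differences against the reference."""
    pos_delta: int                      # offset from the previous record's position
    read_len: int
    substitutions: List[Tuple[int, str]] = field(default_factory=list)   # (offset, base)
    kept_qualities: List[Tuple[int, int]] = field(default_factory=list)  # (offset, phred)

@dataclass
class SliceIndexEntry:
    """Enough information to seek into one slice of the compressed container."""
    ref_name: str
    start: int        # first aligned position covered by the slice
    span: int         # positions covered
    byte_offset: int  # where the compressed slice starts in the file
```

The important bits are that a record only carries its differences from the reference plus whichever qualities survive the quality budget, and that a slice index lets you seek into the compressed file without decoding everything before it.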

The CRAM toolkit is now on its second point release (0.2) and, perhaps more importantly, Vadim, Rasko and Guy feel confident that "in the summer" of 2011 there will be useful end-point code which does the basic operation of compressing a read-paired sequence file with a user-defined "quality budget". This is basically the point at which other groups can try this on realistic scenarios. The quality budget is the idea that we would only save a proportion of the qualities (between 1% and 5%, realistically) from datasets. We had an interesting internal argument about whether the first target should be a quality budget that could be run from the command line with simple parameters (e.g. save the bottom 2% of the qualities, or save all the qualities at places where there are more than 2 bases different from the reference) or a more "complete control" of qualities, where a structured text file provides the precise quality values to save per read – what we are now calling a "quality mask". We've ended up going for the latter for the first release, as we think the first group of people who want to use this will be "power users" in cancer genomics, medical genome sequencing and de novo sequencing, who will want to critically explore the trade-offs between quality budgets, downstream analysis and storage. Clearly, as this matures and is used more broadly, there will have to be good "default choices" of quality budgets.
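
To give a flavour of what a simple command-line quality budget might do (as opposed to the per-read quality mask we're actually shipping first), here's a rough Python sketch; the function names and exact rules are mine, mirroring the two example parameters above, and are not the behaviour of the CRAM toolkit itself:

```python
def mismatch_positions(read: str, reference: str) -> list:
    """Positions where the read disagrees with the reference it is aligned to."""
    return [i for i, (r, g) in enumerate(zip(read, reference)) if r != g]

def budget_by_mismatch(read: str, quals: list, reference: str, max_mismatches: int = 2) -> list:
    """Keep all qualities for reads with more than `max_mismatches` differences
    from the reference; otherwise keep none. Returns (position, quality) pairs."""
    diffs = mismatch_positions(read, reference)
    return list(enumerate(quals)) if len(diffs) > max_mismatches else []

def budget_by_fraction(quals: list, fraction: float = 0.02) -> list:
    """Keep only the lowest `fraction` of quality values in the read,
    i.e. 'save the bottom 2% of the qualities'."""
    n_keep = max(1, int(len(quals) * fraction)) if quals else 0
    ranked = sorted(range(len(quals)), key=lambda i: quals[i])[:n_keep]
    return [(i, quals[i]) for i in sorted(ranked)]

# Example: a 10-base read with one mismatch and two suspiciously low qualities.
read = "ACGTTACGTA"
ref  = "ACGTAACGTA"
quals = [30, 31, 12, 33, 8, 35, 36, 2, 38, 39]
print(budget_by_mismatch(read, quals, ref))  # [] – only 1 mismatch, so nothing kept
print(budget_by_fraction(quals, 0.2))        # keeps the two lowest-quality positions
```

The quality mask approach replaces rules like these with an explicit, per-read statement of which quality values to keep, which is why the power users prefer it.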

This brings me to the next point: although CRAM is being developed as a pragmatic response to the challenge of storing DNA in the archives, we want to push CRAM both upstream and downstream in the toolchains – upstream as a potential target file for sequencers to write, and downstream for analysis groups. We're deliberately trying to mimic the use of SAM/BAM (hence calling it CRAM, and hence targeting Picard as a potential toolkit to use it from). There are risks here – the information that a sequencer wants to write out is not necessarily precisely what the archives want to store, nor is an archival format necessarily ideal for analysis. But the success of the BAM format – which the archives now take as a submission format, and which the ENA is planning to provide access to – suggests that the needs are aligned well enough that this is doable. However, this does give Guy an extra headache – he has to manage not only the engineering for the archive, but also try to keep a rather diverse community happy. We've been working with the samtools/Picard guys for a while now, but it's clear that we'll be getting closer. I know Guy is considering how best to structure the alpha/beta testing cycles and how to work with the community as this is pushed out.

A key question to answer early on is to what extent we can compress "unmapped" reads onto "crash" assemblies – that is, taking all the unmapped reads and, without caring about making a correct assembly, aiming simply for a good compression reference. Disturbingly, in the paper we found that for an early 1,000 Genomes dataset (the NA12878 high-coverage trio genome), of the ~20% of reads which don't map to the current reference, only about 15% can be incorporated into a crash assembly. As there is roughly a 10-fold difference in compressibility between truly unmappable reads and reads compressed against a reference (either the "true" reference or a crash assembly), having around 15% of the reads left as singletons means that the storage for the entire dataset would be dominated by these reads. In the paper we discuss whether we should consider discarding these reads completely – the ultimate loss in lossy compression, and not something you want to do without proper consideration.

Thankfully this is probably an overly pessimistic view, triggered by the fact that we used early datasets in which people were submitting all the reads, even the ones flagged as probable errors. More recent datasets we've looked at – even a metagenomic microbiome dataset, and cancer datasets – show far fewer unmapped reads at the start, and fewer still after crash assembly: between 10% and perhaps 5% truly unmapped reads. This would mean that even in the worst case of storing all of these, their storage would be roughly equal to the main storage. With a bit more critical analysis of these reads I feel we might be able to push this down to below 5% – then we don't have to worry too much until we get into 20-fold or 50-fold compression of the main aligned datasets. But we've clearly got some more work to do here, and I know there are people across the world keen to get involved in this assessment, in particular from the cancer community.
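
To make the arithmetic behind these percentages explicit, here's a back-of-the-envelope sketch, assuming the ~10-fold compressibility gap mentioned above and treating everything else as compressing uniformly – it shows why ~15% singletons dominate the storage, ~10% roughly matches the aligned data, and ~5% becomes tolerable:

```python
def singleton_share(frac_singletons: float, fold_gap: float = 10.0) -> float:
    """Fraction of total compressed storage taken by unmapped singleton reads,
    assuming they compress `fold_gap` times worse than reference-mapped reads."""
    singleton_cost = frac_singletons * fold_gap   # relative storage cost of singletons
    mapped_cost = (1.0 - frac_singletons) * 1.0   # relative storage cost of the rest
    return singleton_cost / (singleton_cost + mapped_cost)

for f in (0.15, 0.10, 0.05):
    print(f"{f:.0%} singletons -> {singleton_share(f):.0%} of compressed storage")

# 15% singletons -> 64% of compressed storage  (dominated by the unmapped reads)
# 10% singletons -> 53% of compressed storage  (roughly equal to the aligned data)
#  5% singletons -> 34% of compressed storage  (starting to be tolerable)
```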

Another concern often expressed is about complex datasets such as RNA-seq – RNA-seq will have variable expression rates across genes, and spliced reads will not necessarily map well, if at all. It seems hard to discard reads from RNA-seq at all, as it's precisely these low-count, novel splice reads that people are most interested in. We need to do some more work here on how many unmapped reads there are in an average RNA-seq experiment, but it is worth noting that RNA-seq experiments are far, far smaller than whole-genome shotgun. As a proportion of the current archival storage, RNA-seq is less than 10% of the data, arguing that one could tolerate perhaps 5-fold worse compression properties in RNA-seq before it starts to dominate the overall storage.
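
The same back-of-the-envelope reasoning applies here. Taking the ~10% archive share from above, one can ask how much worse RNA-seq compression can get before its compressed footprint matches everything else – again a sketch, not a measurement:

```python
def break_even_fold(raw_share: float) -> float:
    """Fold-worse compression at which a data class occupying `raw_share`
    of the raw data takes as much compressed space as everything else."""
    return (1.0 - raw_share) / raw_share

# RNA-seq at ~10% of the raw data:
print(break_even_fold(0.10))  # 9.0 – so 5-fold worse compression still leaves it well short
```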

Finally, we have the practicalities of the "reference" part of reference-based compression. Reference-based compression implies a set of references that can be reused. People often get overly exercised about this – as we show in the paper, even in the worst case, where we pack the reference (itself appropriately compressed) inside each data package, for the real datasets we examined it did not inflate the size too much. Guy and Rasko have made sure that the idea of a mosaic set of references can be handled (for example, the 1,000 Genomes Project uses the human GRCh37 reference with the Epstein-Barr virus – EBV – genome added, quite sensibly, as the 1,000 Genomes Project uses EBV-transformed cell lines).
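
To illustrate what a mosaic reference set might look like in practice, here's a minimal sketch – Python again; the registry structure and checksum-based identification are my own illustration rather than the ENA's actual implementation, though GRCh37-plus-EBV is the real 1,000 Genomes example:

```python
import hashlib

def sequence_md5(seq: str) -> str:
    """Checksum used to identify a reference sequence unambiguously."""
    return hashlib.md5(seq.upper().encode("ascii")).hexdigest()

class ReferenceSet:
    """A mosaic reference: a named collection of sequences, possibly drawn
    from different sources, that reads can be compressed against."""
    def __init__(self, name: str):
        self.name = name
        self.sequences = {}  # sequence name -> (source, md5)

    def add(self, seq_name: str, source: str, seq: str):
        self.sequences[seq_name] = (source, sequence_md5(seq))

# Toy example mirroring the 1,000 Genomes set-up: GRCh37 chromosomes plus EBV.
# (The sequences here are obviously placeholders, not real genome data.)
mosaic = ReferenceSet("1000genomes-grch37-plus-ebv")
mosaic.add("chr20", "GRCh37", "ACGT" * 10)
mosaic.add("EBV", "Epstein-Barr virus genome", "GGCC" * 10)
for name, (source, md5) in mosaic.sequences.items():
    print(name, source, md5[:8])
```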

There is still a complex task ahead of Guy and Rasko, and then the broader community, to make this a reality. That said, gains of even 5-fold compression will be welcomed, both for people's disk budgets and for transfer volumes. And I think there is a lot more headroom in compression, way beyond 5-fold, into 20- and 100-fold compressibility. And that's going to be the topic of my third blog post on this…