Freedom of Expression

Multicellular organisms are beautifully precise in the way their different component cells operate in different ways, despite each cell having the same genome. The main difference between cell types is which genes are expressed, and much of this is due to differential expression of RNA transcripts. The ability to measure all the transcripts simultaneously in a cell population with RNA-seq has been very informative – but often complex. But this complexity is anchored: for most genes, there is one dominant transcript – and within an organism that major transcript is the same between tissues over half the time. This “glass-half-full” view of mRNA complexity was recently described by Alvis Brazma and his colleagues, and integrated into a new value-added resource at the EBI, the Expression Atlas.

The need for an Atlas

Each multicellular organism, be it a fruit fly, a person or a sequoia, depends on a precise configuration of different molecules in each of their cell types, which allows them to form tissues, and then organs (hearts, fins, roots…) and keep them working. Having a good grasp of all these molecules in each cell type is fundamental to developing a clear picture of the whole organism. Basically, you need an “atlas” of where and when each molecule is expressed.

The number of datasets that inform this process has been increasing steadily since the 1990s, when one of the first high-throughput technologies, microarrays, made it possible to assess RNA levels in different samples. This was revolutionary: microarrays gave us the first genome-wide views of gene expression, and capturing these data became the goal of the incipient ArrayExpress archive.

But there were some downsides to microarrays: for instance, most experiments were done on a “differential” basis (i.e. based on the ratio between two samples), both for technical reasons and for interpretation. Also, the measurement had to be deliberately restricted to specific regions (i.e. “probes”) that were designed to look at specific genes. Nevertheless, microarrays were the preferred genome-wide, cellular phenotyping tool used to understand basic biology in many species, from yeast to manatees, and are still used today.

A simpler experiment?

Next-generation sequencing dramatically changed the cost profile of doing a simpler experiment: counting RNA molecules using sequencing (hence the name, RNA-seq), rather than hybridising them to probes. But actually, analysing these data is at least as complex as analysing microarrays. RNA sequencing shows you all sorts of RNA, and exposes the complexity of all the RNA processing steps taking place in a cell. We’ve known since the 1980s that the same locus can make more than one transcript through alternative splicing, in which different sets of exons are spliced together. Measuring all the transcripts at once by RNA-seq has been illuminating, and the transcript complexity it reveals at each locus is a bit daunting.

Another challenge is that RNA-seq is not as simple as counting transcripts: the fragments are shorter than the transcripts – short enough that different aligners show a surprising amount of variance in where they place reads on the genome. (I don’t even want to think about the variance between aligners when reads are placed on transcript representations.) There is also considerable complexity in converting RNA to sequenceable DNA, for example in terms of 3’ and GC bias, or choosing whether to keep the strand information.

One gene, one major transcript (most of the time)

Alvis Brazma and his colleagues have given us new insights into what is going on. They used a number of established methods to look at key RNA-seq datasets and observed that for most genes, there is one dominant transcript.

“For almost 80% of the expressed genes in primary tissues, the major transcript is at least twice as abundant as the next one.” They also showed that 65% of the time, the major transcript is the same between tissues. (Concurrently, the same group is finishing a study comparing aligners and counters, which goes into more detail about the technical variability of these methods.)
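To make the “at least twice as abundant” criterion concrete, here is a minimal Python sketch of how one might call a dominant transcript per gene from a table of transcript abundances. This is only an illustration, not the method used in the paper: the function name `major_transcripts`, the `min_ratio` parameter and the toy gene and transcript identifiers are all made up for the example.

```python
from collections import defaultdict


def major_transcripts(abundances, min_ratio=2.0):
    """Call the major transcript for each expressed gene.

    abundances: dict mapping (gene, transcript) -> abundance (e.g. FPKM).
    Returns a dict gene -> (major transcript id, is_dominant), where
    is_dominant means the major transcript is at least `min_ratio` times
    as abundant as the runner-up (hypothetical threshold parameter).
    """
    by_gene = defaultdict(list)
    for (gene, transcript), value in abundances.items():
        if value > 0:  # ignore transcripts that are not expressed
            by_gene[gene].append((value, transcript))

    calls = {}
    for gene, entries in by_gene.items():
        entries.sort(reverse=True)  # highest abundance first
        top_value, top_transcript = entries[0]
        if len(entries) == 1:
            dominant = True  # only one expressed transcript
        else:
            dominant = top_value >= min_ratio * entries[1][0]
        calls[gene] = (top_transcript, dominant)
    return calls


# Toy data: made-up identifiers and values, not real measurements.
example = {
    ("geneA", "A-201"): 40.0,
    ("geneA", "A-202"): 10.0,
    ("geneB", "B-201"): 12.0,
    ("geneB", "B-202"): 9.0,
}
calls = major_transcripts(example)
dominant_fraction = sum(dominant for _, dominant in calls.values()) / len(calls)
```

In this toy table geneA has a clearly dominant transcript and geneB does not, so `dominant_fraction` is the toy analogue of the “almost 80%” figure quoted above.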

Half a glass

This is a glass-half-full/glass-half-empty view of mRNA processing complexity. Clearly, there is a lot of complexity going on in terms of observable transcripts (e.g. 140,000 different transcripts with experimental evidence reported in the latest GENCODE release for human, which is integrated into Ensembl and other genome browsers). Yet only a minority of this complexity is heavily expressed in the tissues and cell lines surveyed. There are only rare cases of “strong switches”, in which the dominant transcript shifts from one form to another between tissues.
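One way to picture a “strong switch” is as a gene that has a clearly dominant transcript in two tissues, but not the same one. The Python sketch below encodes that reading. The definition, the twofold `min_ratio` threshold, the function names and the toy data are all my assumptions for illustration, not the paper’s exact criteria.

```python
def dominant_transcript(levels, min_ratio=2.0):
    """Return the top transcript id if it is at least `min_ratio` times
    as abundant as the runner-up; otherwise return None.

    levels: dict mapping transcript id -> abundance in one tissue.
    """
    ranked = sorted(levels.items(), key=lambda item: item[1], reverse=True)
    if not ranked or ranked[0][1] <= 0:
        return None  # gene not expressed in this tissue
    if len(ranked) > 1 and ranked[0][1] < min_ratio * ranked[1][1]:
        return None  # no clearly dominant transcript
    return ranked[0][0]


def strong_switches(expression, min_ratio=2.0):
    """List genes whose dominant transcript differs between tissues.

    expression: dict gene -> tissue -> {transcript id: abundance}.
    """
    switching = []
    for gene, tissues in expression.items():
        dominant = {
            dominant_transcript(levels, min_ratio) for levels in tissues.values()
        }
        dominant.discard(None)  # tissues without a dominant form do not count
        if len(dominant) > 1:
            switching.append(gene)
    return switching


# Toy data: made-up identifiers and values, not real measurements.
toy = {
    "geneA": {
        "heart": {"A-201": 30.0, "A-202": 5.0},
        "liver": {"A-201": 4.0, "A-202": 20.0},
    },
    "geneB": {
        "heart": {"B-201": 50.0, "B-202": 10.0},
        "liver": {"B-201": 8.0, "B-202": 2.0},
    },
}
switching = strong_switches(toy)
```

Here geneB changes its overall level between tissues but keeps the same major transcript, which is the common case; only geneA counts as a switch.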

That said, there are well-known and important shifts in transcripts with clear functional effects, such as a number of members of the sex-determination pathway in Drosophila (n.b. beautiful genetics and molecular biology chased this down long before genome-wide screens came on the scene). There are also important switches in human proteins, such as alternative exon use in troponin.

But these papers provide reassurance that for many genes (with some important exceptions), what changes is the overall expression level, rather than the details of splicing. This is useful because we are far better at robustly quantifying “gene expression” than “transcript expression”.

Wealth in the Data Mines

The Expression Atlas resource at EMBL-EBI provides pre-processed information for a number of RNA datasets, both from the ‘old-school’ microarrays and from RNA-seq. I particularly like the forthcoming Absolute Expression Atlas (with levels in the arbitrary but more comparable unit of Fragments Per Kilobase of transcript per Million mapped reads, FPKM), which will be rolled out this year. It lets you select different tissues and set the scale to see sets of genes above a particular FPKM, and has an easy tabular download you can use for your own analysis.
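For readers who have not met FPKM before, the unit scales a raw fragment count by the length of the feature (in kilobases) and by the sequencing depth (in millions of mapped fragments). The Python sketch below shows the standard formula and a simple “genes above a threshold” filter of the kind described above. It is not the Atlas pipeline: the function names, the use of the summed counts as the depth, and the toy numbers are assumptions for the example.

```python
def fpkm(fragments, length_bp, total_fragments):
    """Fragments per kilobase of feature per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_fragments)


def genes_above(counts, lengths, threshold):
    """Return {gene: FPKM} for genes at or above `threshold`.

    counts: dict gene -> fragments mapped to that gene.
    lengths: dict gene -> feature length in base pairs.
    Assumption: sequencing depth is taken as the sum of `counts`.
    """
    total = sum(counts.values())
    values = {gene: fpkm(n, lengths[gene], total) for gene, n in counts.items()}
    return {gene: value for gene, value in values.items() if value >= threshold}


# Toy data: a tiny, unrealistically shallow "library" for illustration.
counts = {"geneA": 200, "geneB": 1800}
lengths = {"geneA": 1000, "geneB": 2000}
selected = genes_above(counts, lengths, threshold=200000.0)
```

Because the toy library has only 2,000 fragments, the FPKM values come out far larger than anything you would see in a real experiment; the point is only how the length and depth normalisation work.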

The Absolute Expression Atlas is an excellent example of a value-added resource. It builds upon the underlying experimental ENA archive (for sequence) and ArrayExpress (for microarrays), and uses the critical sample annotation provided by the BioSamples Database. It’s another case of the EBI’s refinery-like process.

I know that Alvis and his team are not stopping at RNA. They’re working with the PRIDE proteomics resource, which is developed in Henning Hermjakob’s team, and the Protein Atlas, which is developed in Mathias Uhlén’s group in Stockholm (n.b. part of Sweden’s ELIXIR node), to combine RNA-seq, microarrays, quantitative proteomics and antibody-based tissue arrays into an integrated resource about the expression of genes. This integration will do a lot of the “heavy lifting” of consistently running different experiments (of the same assay type) through the same analysis pipeline to produce comparable results, and then coordinating the different assay types together. This means that researchers working on top of this resource will have more time to make discoveries about the biology.

Watch this space…