Advice on Big Data Experiments and Analysis, Part I: Planning

Biology has changed a lot over the past decade, driven by ever-cheaper data-gathering technologies: genomics, transcriptomics, proteomics, metabolomics and imaging of all sorts. After a few years of gleeful abandon in the data generation department, analysis has come to the fore, demanding a whole new outlook and ongoing collaboration between scientists, statisticians, engineers and others who bring to the table a very broad range of skills and experience.

Finding meaning in these beautiful datasets and connecting them up, particularly when they are extremely different from one another, is a detail-riddled journey fraught with perils. Innovation is happening so quickly that trusty guides are rather thin on the ground, so I’ve tried to put down some of my hard-won experience, mistakes and all, to help you plan, manage, analyse and deliver these projects successfully.

Without up-front planning, you won’t really have much of a ‘project’. Throwing yourself into data gathering just because it’s ‘cheap’ or ‘possible’ is really not the best thing to do (I’ve seen this happen a number of times – and embarrassingly I’ve done it myself). ‘Wrong’ experiments are time vampires: they will slurp up a massive amount of your time and energy, potentially exposing you to reputational risk in the event you are tempted to force a result out of a dataset.

This post, the first of three, is about having the strongest possible start for your project via good planning.

1. Buddy up

In the olden days, experimental biologists would generate a bunch of data and then ask a bioinformatician how to deal with it. Well, that didn’t work too well. We have all learned that, from the very outset of a project, you need to ensure there are two PIs: one to focus on the experimental/sample-gathering side, and one to keep the analysis in their sights at all times. These two PIs must have a healthy mutual respect, and be motivated by the same overall goal. There are a few, rare individuals who can honestly be described as being both experimental and computational, but in most cases you’ll need two people to make sure both perspectives are represented in the study’s design.

Now, I’m not saying that experimentalists are strangers to analysis, or that bioinformaticians are strangers to data generation. It’s just important to acknowledge that being able to ‘talk the talk’ of another discipline does not, on its own, qualify you to manage that end of the project, with all its complexities, gotchas and signature fails.

As with anything you set out to do for a couple of years, you’ll need to make sure you are working with someone you get on with. There will be tense moments, and you’ll get past them if your co-PI shares your motivation and goal. Provided you get on and share information as you go, buddying up will save you resources in the long run.

Note to Experimental PIs: never assume you’ll be able to tack on an analytic collaboration at the end, after you’ve gathered the data. You don’t want to be caught out by not having considered some important analysis aspect.

Note to Computational PIs: Never assume you can delegate sample management and experimental details to a third party, such as facility technicians. You know there is a huge difference between experimental data and good experimental data – if your project is going to succeed, you will need a trusted experimental partner who understands all the relevant confounders and lab processes, and who can spot a serendipitous result if one pops out.

2. Outlining

The idea that you can generate datasets first and then watch your results emerge from the depths is simply misguided. It is really quite painful (and wasteful) when a dataset doesn’t have what it needs to support an analysis – it is a set-up for forcing results. Before you do anything, have a brief discussion with your co-PI about the main questions you are looking to answer and make a high-level sketch of the project. I’m not talking about a laboured series of chapter outlines – the main thing is to determine the central question. Large-scale data-gathering projects often focus on basic, descriptive things, like, “How much of phenomenon X do we see under Y or Z conditions?” Sometimes the questions are more directed, for example, “How does mitosis coordinate with chromosomal condensation?”

Outlining your hypotheses need only be as simple as, “At the end, we will have a list of proteins in the Q process.” If you’re hoping to test a hypothesis, aim for something straightforward, like, “I believe the B process is downstream of the Ras process.” Consider your possible hypothesis-testing modes, but avoid trying too hard to imagine where the analysis might take you; your data and analysis might not agree with your preconceptions in the end.

Also, do not commit to a specific follow-up strategy too early! Decide on your follow-up strategy only after you have explored the initial analysis or run a pilot study.

3. Back-of-the-envelope ‘power calculations’

Take some of the anxiety out of the process by doing a rough calculation before getting into things too deeply. If you (or someone else) have done a similar analysis well in the past, simply use that analysis as a basis for your rough estimate. If you are on completely new ground, make sure you factor in false positives (e.g. mutation calls, miscalled allele-specific events, general messiness) and pay careful attention to frequencies (e.g. alleles, rare cell types).

Many a bad project could have been stopped in its tracks by a half hour’s worth of power analysis. Unless you really need to impress reviewers, you probably don’t need to go overboard – just make a quick sketch. But be honest with yourself! It is all too easy to fudge the numbers in a power analysis to get an answer you want. Use it as a tool for looking honestly at what sort of results you could expect.
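To make this concrete, here is a minimal back-of-the-envelope sketch using R’s built-in power.t.test(). It assumes a simple two-group comparison (say, normal versus disease), an effect size of one standard deviation and a crude Bonferroni-style adjustment for roughly 20,000 genes; every number is a placeholder to be swapped for values from your own pilot data or the literature.

```r
# Back-of-the-envelope power calculation for a two-group comparison.
# All numbers are placeholders -- replace them with your own estimates.
power.t.test(delta     = 1,            # expected difference between states, in SD units
             sd        = 1,            # assumed within-group standard deviation
             sig.level = 0.05 / 20000, # crude Bonferroni-style adjustment for ~20,000 genes
             power     = 0.8,          # the power you would like to achieve
             type      = "two.sample")
# With n left unspecified, this solves for the number of biological
# replicates needed per group -- your half hour of honesty.
```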

4. Get logistical

Plan the logistics according to Sod’s Law. Assume everything that can go wrong will go wrong at least once. This is particularly important if you are scaling up, for example moving an assay from single-well/Eppendorf to an array. For assays, give yourself at least a year for scale-up in the lab (better still, do a pilot scale-up with publication before moving on to the real thing). Pad out all sample acquisition with at least three months for general monkeying around.

5. Have a healthy respect for confounders

Think about the major confounders you will encounter downstream, and randomise your experimental flow accordingly. That is, do not just do all of state X first, then progress to state Y, then Z.

Make sure you record all the known confounders (e.g. antibody batch number, day of growth). Try to work off single antibody/oligo batches for key reagents. If you know you will need more than one batch, remember the randomisation! You absolutely do not want the key reagent batch to be confounded with your key experimental question, e.g. normal with batch 1, disease with batch 2. Disaster!
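As an illustration of the randomisation point, here is one way to shuffle a processing order in R so that day and reagent batch are not nested within condition. The sample names, batch sizes and column names are all invented for the sketch.

```r
# A minimal sketch of randomising processing order so that condition is not
# confounded with reagent batch or day. Sample IDs and sizes are made up.
set.seed(42)  # make the randomisation reproducible (and auditable)

samples <- data.frame(
  id        = sprintf("S%02d", 1:24),
  condition = rep(c("normal", "disease"), each = 12)
)

# Shuffle the rows, then assign processing day and antibody batch in the
# shuffled order, so each batch/day contains a mix of both conditions.
samples <- samples[sample(nrow(samples)), ]
samples$day   <- rep(1:4, each = 6)
samples$batch <- rep(c("batch1", "batch2"), each = 12)

# Sanity check: conditions should be spread across batches, not nested.
table(samples$condition, samples$batch)
```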

6. Plan the analysis

If possible, stagger the experimental and analysis work. See if you can have your analysis postdocs come into the project later, ideally with some prior knowledge of the work (the best case is that they are around but on another project early on, and then switch into this project about a year in). Unfortunately, because funding agencies like to have neat and tidy three-year projects, this is often quite difficult to arrange.

Determine when an initial dataset will be available, and time the data coordination accordingly. Budget at least six months (more likely 12–18 months) of pure computational work. Use early data to ‘kick the tyres’ and test different analysis schemes, but plan for a single, full run of the analysis that takes at least 12–18 months.

7. Replication/validation strategy

You know you’re not going to cook up the data and analysis, but how will you convince the sceptical reader? Make sure you have a strategy in place.

I find it helps to think of this as two separate phases: discovery, and validation/replication. In discovery, you have plenty of freedom to try out different methods and normalisation before settling. The validation/replication phase, for a project of any size, features ‘single-shot’ experiments, which offer a minimal amount of flexibility.

Generally speaking, you should not be doing single-sample-per-state experiments; rather, you should be carrying out at least two biological replicates, which is usually enough to show up gross problems. With five or more biological replicates, you can make good estimates of the mean. The one exception to the “no single sample” rule is QTL/GWAS, where it is nearly always better to sample new genotypes each time than to replicate data from the same genotype (i.e. maximise the number of genotypes first, and only then worry about per-genotype variance).
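If you want to see why the replicate numbers matter, a toy simulation like the one below (with an arbitrary noise level, chosen purely for illustration) shows how the standard error of a mean estimate shrinks roughly as 1/sqrt(n) as you add biological replicates.

```r
# Toy illustration: precision of the mean estimate versus replicate count.
# The measurement noise (noise_sd = 1) is an arbitrary choice for the sketch.
sem <- function(n, noise_sd = 1, n_sim = 10000) {
  sd(replicate(n_sim, mean(rnorm(n, mean = 0, sd = noise_sd))))
}
sapply(c(2, 3, 5, 10), sem)
# The empirical standard errors track noise_sd / sqrt(n): roughly 0.71 with
# two replicates, dropping to about 0.45 with five.
```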

8. Confront multiple testing

How many tests are you going to do? If it is a genome-wide project, you will do a lot, so you need to control for multiple testing. This is partly about the power calculation, but it also requires some up-front thinking. Will you do permutations, or trust to the magic of p.adjust() (a wonderful R function that offers a set of false discovery rate approaches)? What will you do if you find nothing? Is finding nothing interesting in itself?
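As a trivial sketch of the p.adjust() route, with simulated p-values standing in for whatever your genome-wide test actually produces:

```r
# The p.adjust() route: take your vector of raw p-values and apply a
# false discovery rate correction. The simulated p-values below are just a
# stand-in for real genome-wide test results.
set.seed(1)
raw_p <- c(runif(9900),                     # mostly null tests
           runif(100, min = 0, max = 1e-4)) # a sprinkling of true signals

adj_p <- p.adjust(raw_p, method = "BH")     # Benjamini-Hochberg FDR
sum(adj_p < 0.05)                           # how many calls survive at 5% FDR
```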

You’ll all have agreed to try and discover something excellent, but make sure you have a serious conversation up front with your co-PI about what you’ll do if you don’t find anything interesting. Is there a fall-back plan? Traditional, outright replication of an entire discovery cohort needs as much logistical planning as the discovery itself – if not more. You might decide to use prior data to show that yours is at least solid and good. Organise this beforehand.

9. Publishing parameters

What would you consider to be the first publishable output from this project? Could you put it into a technical publication (e.g. assay scale-up, bespoke analytical methods)? At the beginning of the project, you and your co-PI should agree on the broad parameters of authorship on papers, and how multiple papers might be coordinated. For example, will you credit two first authors and two last authors, swapping in priority if there is more than one paper?

If you are a more senior partner in a collaboration, be generous with your “last last” position. Your junior PI partners need it more than you do!

Next Up

This is the first of three posts. Next up: Managing your Big Data project.