March 20, 2013

Big Data is not enough

There’s a good piece in one of the New Yorker’s blogs about the Human Connectome Project and the proposed Brain Activity Map. The Connectome project has produced two terabytes of functional and structural data on the brains of 68 volunteers, and the Brain Activity Map is more or less what it says on the tin.

Almost certainly people will be able to do something useful with all this data, but some of the claims for what it means to our understanding of the brain are a bit much. As the RealClearScience blog points out (in a slightly different context), we know the complete nervous system of the nematode C. elegans. We know every cell and all its connections to other cells. We still can’t use this knowledge to reliably predict the nematode’s behaviour even by brute-force simulation, let alone by sophisticated analysis.

Simply using Big Data to work out how a complex system functions requires a lot of simplifying assumptions to be true. This isn’t because we’re not smart enough to build more complex models (though we’re not), and it isn’t because the computation is beyond us (though it is); it’s a fundamental limitation on learning without an underlying theory to help you.

The way Amazon or Netflix does prediction with all its data is to look for a large group of people who are similar to you, in relevant ways, and see what they bought or watched. That sounds easy, but the weasel words are ‘in relevant ways’.  If you have a moderately large number of variables, there are far too many ways in which people, or nerve signals, or protein concentrations could be similar, and you need to decide which ones are relevant.   This is critical, because finding the true relationships in large numbers of associations is only possible if nearly all the associations are zero; in current jargon, the model is ‘sparse’.
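To see why sparseness matters, here is a minimal sketch in Python, using simulated data rather than anything from the projects above. The function names, sample sizes, and seed are all made up for illustration. It generates an outcome and a set of candidate variables that are pure noise, so every true association is zero, and reports the strongest correlation found by searching through them.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length lists of numbers."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def max_spurious_correlation(n_obs, n_vars, seed=0):
    """Largest |correlation| between a noise outcome and n_vars noise variables.

    Everything here is independent random noise (an assumption of this
    illustration), so any correlation that turns up is spurious.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_obs)]
    best = 0.0
    for _ in range(n_vars):
        noise = [rng.gauss(0, 1) for _ in range(n_obs)]
        best = max(best, abs(pearson(noise, outcome)))
    return best


# Same 30 observations, but a much larger pool of candidate variables.
few = max_spurious_correlation(n_obs=30, n_vars=5)
many = max_spurious_correlation(n_obs=30, n_vars=2000)
print(f"strongest of 5 noise variables:    {few:.2f}")
print(f"strongest of 2000 noise variables: {many:.2f}")
```

With thousands of candidates and only 30 observations, the best-looking noise variable will typically look like a respectable predictor. Unless you already know that nearly all the associations are zero, or which ones to look at, there is no way to tell it apart from a real one.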

In order to see sparseness, you need to know how to look. Consider the economy: if you just look at associations between measurements, everything is correlated, because you see inflation, you see population growth, and you see seasonal variation. These patterns need to be removed to get a sparse model where you’ve got some hope of disentangling cause and effect.
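As a rough illustration of the economy example, the Python sketch below simulates two series that have nothing to do with each other except that both grow over time. The slopes, noise level, and seed are invented, and first differencing is just one simple way of removing a trend, chosen here for brevity; it is an assumption of this sketch rather than a recommendation for real economic data.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length lists of numbers."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(n=400, seed=1):
    """Two unrelated series that both trend upward (hypothetical slopes)."""
    rng = random.Random(seed)
    a = [0.05 * t + rng.gauss(0, 1) for t in range(n)]
    b = [0.03 * t + rng.gauss(0, 1) for t in range(n)]
    return a, b


def difference(series):
    """First differences: the change from each time point to the next."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


a, b = simulate()
raw = pearson(a, b)
detrended = pearson(difference(a), difference(b))
print(f"correlation of raw series:         {raw:.2f}")
print(f"correlation of differenced series: {detrended:.2f}")
```

The raw series are strongly correlated purely because of the shared growth. Once the trend is taken out the correlation is close to zero, which is the sparse picture you need before any real relationships could stand out. Knowing that there is a trend to remove is the part that the data alone won’t tell you.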

In really complex fields like brain activity, we don’t know enough about how to pose the problem so that Big Data will have a hope of solving it.

 


Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient.