May 10, 2010

Big Data

This has been posted around a bit, but it doesn't hurt to keep thinking about these issues.
"Why is the Internet blowing up all our methods courses?"

First, is the Internet really blowing up soc sci methods and their courses? Or is it an uncritical adherence to Internet and other online methods that's causing the problems here? That applies to any sort of methods-related issue (technical, ethical, epistemological) that's been debated since research on the Internet, e-science, big data, or whatever the latest trend in e-research currently is, started making huge loads of data available for research consumption. Whenever I see or hear people attribute practices or other behaviors to technology, including the (fair) use of online data for research purposes, I can't help thinking that that's a technologically deterministic, and in some ways uncritical, approach to the issue being debated. Fortunately, though, sometimes that turns out to be just a way of speaking, a way of introducing the issue, and once the discussion is under way it's clear that it's the individuals who are being asked to question their practices, not the technology itself.

One thing I keep noticing is how often discussions about e-science and the ethics of using public online data for soc sci research rest on a premise of 'more is better'.
Do big datasets really make for better soc sci research?

First, there are issues of efficiency. Finding meaningful patterns in large amounts of data is, despite technical advances in visualization and stats software, time-consuming. I understand the allure of "hundreds of thousands of observations," but I still find it hard to believe that the practice will scale well when applied to every research project one is involved in.

Second, once big data research is taken for granted, it becomes a self-reinforcing trend, across researchers and across fields of research. Once there's an expectation that a study must draw on data at that scale to be considered worthwhile in a subfield of HCI (see the trend in published research on online communities), other fields, and newcomers to HCI itself, will have little choice but to accept the trend and go along with it.

The question to ask is: what do hundreds of thousands of observations tell us that we couldn't have deduced from a few hundred? Aside from the fact that in a huge dataset, any and all predictors of social behavior will turn out to be statistically significant. The point of small samples was parsimony: deducing behavior reliably from few observations and with few predictors. That saved time and effort, and, more importantly, challenged people to think really hard about research design before conducting the study. That process is reversed with big data: first the dataset is obtained, then we begin thinking really hard about what we can do with it. Hence the ethical questions regarding the privacy of subjects who might never find out they were subjects in studies conducted on a very large dataset.
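To make the significance point concrete, here's a minimal simulation sketch in Python (the effect size, sample sizes, and variable names are all made up for illustration): a predictor that explains essentially nothing about the outcome still comes out "statistically significant" once the sample gets large enough.

```python
import numpy as np
from scipy import stats

# Hypothetical illustration (all numbers are made up): a predictor with a
# tiny true effect (correlation ~ 0.01) on some behavioral outcome.
rng = np.random.default_rng(42)

def p_value_for_sample(n, true_effect=0.01):
    """Simulate n observations and test the predictor's correlation with the outcome."""
    x = rng.normal(size=n)                    # the predictor
    y = true_effect * x + rng.normal(size=n)  # outcome: almost entirely noise
    _, p = stats.pearsonr(x, y)
    return p

for n in (300, 5_000, 500_000):
    print(f"n = {n:>7,}: p = {p_value_for_sample(n):.4f}")
```

With a few hundred observations the predictor is indistinguishable from noise; with hundreds of thousands it typically clears conventional significance thresholds, even though it explains a vanishing fraction of the variance. That's the standard statistical point that p-values conflate effect size with sample size; it isn't specific to any particular dataset.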

This is where I take issue with the claim that "as a sea of digital data opens up to the horizon, our problems are increasingly about specification error and not sample sizes, just as measures are increasingly unobtrusive and not self-reports" (quoting C. Sandvig). Yes, technically, the problems change as research moves to big data: it's specification, not sampling, that causes headaches. Still, I wouldn't call the measures of any soc sci study unobtrusive. However unreliable self-reports might be, they're made in one's full knowledge that one is a subject in a given study. That's harder to achieve with archival data, and consent can't be obtained retrospectively. Just because the data is archived does not make your observations unobtrusive. It might just as likely make them unethical.