Friday, October 4, 2013

'Big Data' doesn't just mean increasing the font size.

Title taken from today's xkcd.

I don't know about you, but I work in an industry that's super excited about "Big Data" right now. I work in Education, but I get the impression that it's not just Education that's interested in Big Data. So, let's start with defining Big Data.

There's a lot of information out there. Most of it is really hard to collect and understand. That's why we have scientists carefully creating controlled environments and recording data. That's why we have police detectives collecting evidence. That's why we have auditors leafing through your receipts. But for a while now, we've been accumulating data in electronic form. All electronically collected data is more or less structured.

Usually, structured data refers to something like a spreadsheet of numbers, while unstructured data refers to something like tweets. But let's expand that out. Even your tweets are somewhat structured. They are made up of typed words, links, and hashtags that can be categorized, collected, and analyzed by a computer. Let's think of them as structured and instead think of unstructured as referring to a behavioral psychologist's notes on monkey grooming rituals.

A tweet is a direct recording of when and where you tweeted, what you wanted to tweet, how your friends responded to it, etc. Notes on monkey grooming are subjective, possibly missing details and being recorded using inconsistent terminology. The point is, your tweets are ripe for analysis. As is most of your electronic data.

In Education, Big Data includes attendance records, test scores, disciplinary records, socioeconomic status, graduation rates, job placement, etc. In Finance it's everything from stock prices to product reviews to weather reports, to, yes, your tweets. In business it's HR records, budgets, market data, and much much more. I'm hard pressed to think of an industry which isn't collecting structured data.

And that's just the point. We're collecting data. We're accumulating it. It needs to go somewhere and it needs to be stored in a way that makes it easy to access, secure, and safe from degradation. The movers and shakers across most industries have realized this for a while now. That's why we have "data warehouses" and "cloud storage."

Right now is probably a good time to point out that "safe from degradation" is a problem of data collection as much as data storage. Remember when we were discussing how our monkey psychologist might record data using inconsistent terminology? Maybe today our scientist recorded the monkey as having "groomed," but yesterday wrote "scratched," and the day before that wrote "itched." And maybe two months ago, the scientist switched from taking notes in outline form to using a narrative format. These problems are more common with unstructured data recording, but can also occur in more structured contexts.

Every time these changes happen, the data either needs to be "cleaned" or the analysis of the data needs to be sophisticated enough to gloss over these issues, like how you can ask Siri to "wake me up tomorrow at 7" and Siri knows to "set an alarm for 7am on Tuesday." That's not exactly what you said, but Siri recognized the intent of your command.
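To make "cleaning" a little more concrete, here is a minimal Python sketch of the monkey example. The synonym table, the function name, and the sample notes are all invented for illustration, and deciding that "scratched" and "itched" really mean the same thing as "groomed" is an assumption a real researcher would have to sign off on.

```python
# Hypothetical synonym table: maps each term the scientist used
# to one canonical label. These mappings are assumptions, not facts.
CANONICAL_TERMS = {
    "groomed": "groomed",
    "scratched": "groomed",
    "itched": "groomed",
}


def clean_observation(raw):
    """Return the canonical label for a raw note, or None if the term is unknown."""
    # Trim whitespace, then stray quotes and punctuation, then normalize case.
    term = raw.strip().strip('",.').lower()
    return CANONICAL_TERMS.get(term)


# Made-up notes with inconsistent wording, casing, and punctuation.
notes = ["Groomed", " scratched,", '"itched."', "napped"]
cleaned = [clean_observation(n) for n in notes]

# Anything the table doesn't cover is flagged for a human instead of guessed at.
unknown = [n for n, c in zip(notes, cleaned) if c is None]
```

Here the first three notes collapse to the same label, and "napped" ends up in `unknown`. Flagging unknown terms rather than silently dropping them is a deliberate choice: it keeps the cleaning step from quietly degrading the data it's supposed to protect.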

Let's retrace our steps for a moment and identify the data requirements we've encountered so far:
1. Collect the data | Practice good data structuring
2. Store the data | Ensure data is safe, secure, clean

As part of storage, most industries have recognized that you need to access the data once it's stored. Accessing data should be easy and quick. If you need an advanced degree in computer science to pull a report and if it takes two weeks to create the report, it's not ideal. Why isn't it ideal? Because the data isn't useful if you have to call your IT guy every time you want to check on something. And usually the thing you want to check on can't wait two weeks. That brings us to requirement three:
3. Access the data | Ensure it's quick and easy

We've also hinted at requirement four:
4. Make the data actionable | ???

While I'm seeing a lot of collection and storage, I've just started to see my industry realize why they need to access the data. Making data actionable is actually really hard. This is in large part because to really understand data, you need to treat it like a scientist, like a mathematician, like an analyst. It requires identifying variables, setting constants, and running statistical analysis. Sadly, most industries and technologies have barely scratched the surface of this problem. I see a lot of requirements like:
- Automatically show the latest data
- Create bar graphs
- Create pie charts
- Update the graphs and pie charts

And I see very little:
- Run t-test
- Set p value
- Set threshold
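For anyone wondering what those three bullets might look like in practice, here is a rough Python sketch. The scores are invented, the group names are hypothetical, and the 0.05 threshold is only a common convention. One swap to be upfront about: to stay within the standard library, the p-value is estimated with a permutation test on Welch's t statistic rather than looked up from the t distribution, which is what a textbook t-test would do.

```python
import random
from statistics import mean, variance

ALPHA = 0.05  # significance threshold, chosen before looking at the data


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    standard_error = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / standard_error


def permutation_p_value(a, b, rounds=5000, seed=0):
    """Estimate a two-sided p-value by reshuffling the group labels."""
    rng = random.Random(seed)
    observed = abs(welch_t(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        shuffled_t = welch_t(pooled[:len(a)], pooled[len(a):])
        if abs(shuffled_t) >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (rounds + 1)


# Invented test scores for two hypothetical groups of students.
tutored = [78, 85, 92, 88, 81, 90, 86, 84]
control = [72, 80, 75, 79, 70, 83, 77, 74]

t_stat = welch_t(tutored, control)
p_value = permutation_p_value(tutored, control)
significant = p_value < ALPHA
```

The part that matters is the order of operations: the threshold is fixed first, the test is run second, and only then does anyone get to say whether the difference is worth acting on.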

Most problematically, as industry leaders reach towards an understanding of the need for analysis, they jump right to the outcome, skipping how to get there. I see requirements for:
- Early warning indicators
- Graphing of benchmark test scores against high stakes test scores

Do you know what you can tell by graphing two different tests against each other? Very little. It reminds me of this comic. Let's add to number four:
4. Make the data actionable | Ensure analysis is statistically sound

One more time, all four requirements together:

1. Collect the data | Practice good data structuring
2. Store the data | Ensure data is safe, secure, clean
3. Access the data | Ensure it's quick and easy
4. Make the data actionable | Ensure analysis is statistically sound

I think we'll get there, but it might take another decade.