What is Big Data?

We had a discussion in the office today about big data, trying to pin down what it is and where you can use it. Big Data seems like a difficult thing to define, and I find most of the definitions on the internet unsatisfactory, like this and this.

So here’s my definition:

“If you’ve got more data than will fit in a relational database, you’ve got big data.”

In Azure this is easy to quantify: currently you can’t create a database larger than 150GB.

This means that you have to split your data across smaller databases (i.e. sharding or SQL Federations). But to really embrace big data you need to forget SQL, transactions and aggregate queries, and move to NoSQL and eventual consistency.
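
To make the sharding idea concrete, here is a minimal sketch of hash-based routing: each record key is hashed to decide which of the smaller databases holds it. The database names and the `shard_for` and `database_for` helpers are made up for illustration, and this is not how SQL Federations works internally; it just shows the general technique.

```python
import hashlib

# Hypothetical database names, one per shard (assumed for illustration).
SHARDS = ["customers_0", "customers_1", "customers_2", "customers_3"]


def shard_for(key: str, shard_count: int) -> int:
    """Map a record key to a shard index in the range [0, shard_count)."""
    # Use a stable hash so the same key always lands on the same shard,
    # regardless of process or machine.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def database_for(key: str) -> str:
    """Return the name of the database that should hold this key."""
    return SHARDS[shard_for(key, len(SHARDS))]


if __name__ == "__main__":
    for customer_id in ["customer-1", "customer-2", "customer-3"]:
        print(customer_id, "->", database_for(customer_id))
```

The catch is visible in the code: a query that spans customers now has to visit every shard, which is exactly why transactions and aggregate queries get hard once you split the data.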

In Azure you can use Blob or Table Storage to hold up to 100TB per account, and if you design your storage correctly you can get scale without compromising performance. Hadoop is also available to perform map-reduce data transformations, and Cloud Services allow you to scale out a compute farm very quickly to crunch this data.
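
For anyone who hasn't met map-reduce before, here is a toy, single-process sketch of the pattern in plain Python. It is not Hadoop code and uses no Hadoop API; the function names are my own. It only shows the three steps (map, shuffle, reduce) that a real cluster would spread across many machines.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all the values for one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    """Run map, shuffle and reduce in sequence over an iterable of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(map_reduce(["big data", "big compute"]))
```

Because each call to `map_phase` looks at one record and each call to `reduce_phase` looks at one key, both steps can run in parallel on separate machines, which is what makes the pattern a good fit for a compute farm.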