A few days ago I was asked to write down a few ideas about “Big data” for a workshop our team had. I have to confess it was quite difficult to get used to the idea of writing a paper on a concept I always try to avoid mentioning.
I would say it is not the fear of a new concept that forces me to avoid the term “Big data”, but the fact that “Big data” means different things to different people. I am afraid that talking about “Big data” in social science hides the fact that, even with new technologies and data sets, we are still very limited in what we can conclude from them. In this post (which is not complete) I try to present a definition of “Big data” that is more suitable for economics and marketing.
Does Big data really exist?
It is certainly true that with the Internet, new data sets have become available, and with them new computational costs have emerged. We can see these costs as the price of using these new data sets to test social hypotheses that were impossible to test before. Moreover, these data sets provide us with information that was not available before.
According to Wikipedia, “Big data” is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications. This, and any definition that ties “Big data” to the concept of supercomputers, has the big disadvantage of being time- and space-dependent: what is big data on 28-04-2014 in Zurich may not be “Big data” in two months, or at another latitude. Another disadvantage for marketers and economists is that we are not interested in computational problems; we are interested in what the data can tell us. Therefore, we will drop the discussion of “Big data” based on “computability” and focus instead on the viability of “Big data” as a sort of data that allows us to draw new conclusions. This does not mean that we neglect the computational challenges that these data sets pose. However, we think that in order to make the concept of “Big data” meaningful, we have to restrict the definition in a way that is useful for marketers, economists and policy makers.
If Big data exists, what is it?
The first words, or better said letters, that come to my mind when I see the word big are N and ε. The second thought that comes to my mind is “Determine the number of observations, N (or iterations), needed to obtain an approximation with an error term bounded by ε”. A simple example is approximating the “approval rating of Barack Obama”. In this case we can ask a random sample of individuals in the US and use their answers as an approximation of the “approval rating of Barack Obama”. If we could take a sample “large enough”, then we could “conclude” that our “approximation is good”.
Here “large enough” and “good approximation” are observer- and question-dependent, while “conclude” is also observer-dependent. For the sake of simplicity, let us assume that we have agreed on what a good approximation is, i.e. our estimator and the parameter differ by at most ε. In this case, we are only concerned with the meaning of “large enough”.
Clearly, in our example, a large number N has to be defined as one such that, when we collect a sample with N or more observations, our approximation will be no more than ε away from the true value.
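The approval-rating example can be made concrete. A minimal sketch, assuming respondents answer independently with a fixed (hypothetical) true approval probability: Hoeffding's inequality gives one sufficient sample size N for a desired error ε and confidence level, and a small simulation shows the resulting estimate landing near the truth. All numbers below (the 47% true rating, the 2-point error target) are illustrative assumptions, not real polling data.

```python
import math
import random

def hoeffding_sample_size(epsilon, delta):
    """Smallest N such that, by Hoeffding's inequality, a sample
    proportion misses the true proportion by more than epsilon
    with probability at most delta."""
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))

def estimate_approval(true_p, n, rng):
    """Poll n random individuals; each approves with probability true_p."""
    return sum(rng.random() < true_p for _ in range(n)) / n

rng = random.Random(0)
eps, delta = 0.02, 0.05       # want |estimate - truth| <= 2 points, 95% of the time
n = hoeffding_sample_size(eps, delta)
estimate = estimate_approval(0.47, n, rng)   # hypothetical true approval of 47%
print(n, round(estimate, 3))
```

Note that “large enough” here is explicit: N depends on the ε and confidence level the observer chooses, which is exactly the observer-dependence discussed above.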
A problem that emerges in social sciences is that our conclusions are always surrounded by uncertainty.
Therefore, if we agree that “Big data” means “a number of observations large enough to allow us to make new social inferences”, we also have to agree that “Big data” is observer- and question-dependent. Let us illustrate the question dependency with the following question: “What network structures emerge on online gaming platforms?”
In a hypothetical case where we collect the data of an online gaming platform with millions of users and billions of interactions, we could in principle know the entire network structure of this platform (facing only computational limitations). Unfortunately, for many network structures we have only one observation even with the millions of observations we collected, and therefore the “Big data” is reduced to a single observation. For instance, there is only one largest connected component, one largest clique, etc. in a network.
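The single-observation point can be illustrated with code: however many users the platform has, a statistic like “the size of the largest connected component” is computed once per network. A minimal BFS sketch on a toy graph (the node names and edges are made up for illustration):

```python
from collections import deque

def largest_component(adj):
    """Size of the largest connected component, found by BFS.
    adj: dict mapping each node to an iterable of its neighbours."""
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        queue, size = deque([start]), 0
        seen.add(start)
        while queue:
            node = queue.popleft()
            size += 1
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        best = max(best, size)
    return best

# A toy "platform" with two components: {a, b, c} and {d, e}
toy = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": ["e"], "e": ["d"]}
print(largest_component(toy))  # the single number this network yields: 3
```

However many edges we add to `toy`, the function still returns exactly one number: there is no sample of largest components to average over.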
Another path that can be followed with new data sets, such as those collected from online communities like Facebook or Twitter, is the study of asymptotic properties. There has been plenty of research in this direction for random graph models, for instance on the size of the biggest connected component. However, we have to be aware that these models are not a good approximation of many social networks.
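One well-known asymptotic result of this kind is the giant-component phase transition in the Erdős–Rényi model G(n, p): with p = c/n, the largest component occupies a vanishing fraction of nodes for c < 1 and a roughly constant fraction for c > 1. A small union-find simulation sketches this (the sizes and seeds are arbitrary choices for illustration):

```python
import random

def find(parent, x):
    """Root of x with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def giant_fraction(n, c, rng):
    """Fraction of nodes in the largest component of G(n, p) with
    p = c/n, using union-find over independently sampled edges."""
    parent = list(range(n))
    size = [1] * n
    p = c / n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                ri, rj = find(parent, i), find(parent, j)
                if ri != rj:
                    if size[ri] < size[rj]:
                        ri, rj = rj, ri
                    parent[rj] = ri       # union by size
                    size[ri] += size[rj]
    return max(size[find(parent, i)] for i in range(n)) / n

rng = random.Random(1)
# Below the threshold c = 1 the largest component is tiny;
# above it, a giant component of constant fraction emerges.
print(giant_fraction(2000, 0.5, rng), giant_fraction(2000, 1.5, rng))
```

This is precisely the kind of asymptotic statement the model delivers cleanly; the caveat in the text stands, since real social networks are not well approximated by G(n, p).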
What are the “Big data” challenges for the coming years?
It is of course pedantic to try to predict the new challenges for “Big data” in the coming years, but we think that there are a few challenges that have to be addressed if we want to exploit “Big data” as much as possible.
First, with the advent of powerful computer simulations and new data sets, the wording of conclusions has been pushed into the background. There is no clear separation between what is meant by shown, proved, concluded, and inferred. This is understandable, since even in the statistical and logical communities there has been no agreement on how best to define these words.
A large number of physical models have opened a path into social science, but we see little penetration of other important ideas such as impossibility theorems, incompleteness theorems, the possibility of many logical systems, the disagreement in the statistical community about what can be concluded from data, and the newer problem of what can be concluded from a computer simulation.
The amount of research on social networks has grown exponentially in recent years; today we observe not only computer simulations but also empirical studies on social platforms like Twitter or Facebook. Unfortunately, many of those studies create arbitrary rules of inference that allow them to conclude astonishing claims, which are difficult for others to believe.
Thus we believe that in the future, new research on “Big data” will be more careful in “wording” its results.