A few days ago I was asked to write down a few ideas about “Big
data” for a workshop our team had. I have to confess it was quite difficult to
get used to the idea of writing a paper on a concept I always try to avoid
mentioning.
I would say it is not the fear of a new concept that forces
me to avoid mentioning “Big data”, but the fact that “Big data” means
something different to different people. I am afraid that talking about “Big
data” in social science hides the fact that, even with new technologies and data
sets, we are still very limited in what we can conclude from these new
data sets and technologies. In this post (which is not complete) I try to
present a definition of “Big data” that is more suitable for economics and
marketing.
Does Big data really exist?
It is certainly true that with the Internet, new data sets
are now available, and with them new computational costs have emerged. We can
see these costs as the price of using these new data sets to test social
hypotheses that were impossible to test before. Moreover, these data sets
provide us with information that was simply not available before.
According to Wikipedia, “Big data” is the term for a
collection of data sets so large and complex that it becomes difficult to
process them using on-hand database management tools or traditional data processing
applications. This, and any definition that ties “Big data” to the limits of
current computing, has the big disadvantage of being time and space
dependent. What is “Big data” on 28-04-2014 in Zurich may not be “Big data” in
two months, or at another latitude. Another disadvantage for marketers and
economists is that we are not interested in computational problems; we are interested in what the data can tell us.
Therefore, we will set aside the discussion of “Big data” based on
“computability” and focus instead on the
viability of the concept of “Big data”
as a kind of data that allows us to
draw new conclusions. This does not mean that we neglect the
computational challenges that these data sets pose. However, we think that in
order to make the concept of “Big data” meaningful, we have to restrict the
definition in a way that is useful for marketers, economists and policy makers.
If Big data exists, what is it?
The first words, or better said letters, that come to my mind
when I see the word big are N and ε.
The second thought that comes to my mind is “Determine the number of observations, N, (or iterations) needed to have
an approximation with an error term bounded by ε”.
A simple example is approximating the “approval
rating of Barack Obama”. In this case we can ask a random sample of individuals
in the US and use their answers as an approximation of the “approval rating of
Barack Obama”. If we could take a sample “large enough”, then we could “conclude” that our “approximation is good”.
Here “large enough”
and “good approximation” are observer and question dependent,
while “conclude” is also observer
dependent. For the sake of simplicity, let us assume
that we have agreed on what a good approximation is, i.e. our estimator and the
parameter differ by at most ε.
In this case, we are only concerned with the meaning of “large enough”.
Clearly, in our example, a large number N has to be defined as one such that, when we collect a sample with N or
more observations, our approximation will be no more than ε away
from the true value.
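One back-of-the-envelope way to make “large enough” concrete is a concentration bound. The sketch below uses the Hoeffding inequality to pick N for a given error ε and failure probability δ, then simulates the poll; the true approval rate of 47% and the tolerances are hypothetical numbers chosen only for illustration:

```python
import math
import random

def required_sample_size(epsilon, delta):
    """Hoeffding bound: with N >= ln(2/delta) / (2 * epsilon**2) observations,
    the sample mean of a 0/1 variable is within epsilon of the true mean
    with probability at least 1 - delta."""
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))

def estimate_approval(true_rate, n, rng):
    """Draw n yes/no answers and return the sample approval rate."""
    return sum(rng.random() < true_rate for _ in range(n)) / n

rng = random.Random(0)
eps, delta = 0.02, 0.05                     # tolerated error and failure probability
n = required_sample_size(eps, delta)        # "large enough" for this (eps, delta)
estimate = estimate_approval(0.47, n, rng)  # hypothetical true rate of 47%
print(n, round(estimate, 3))
```

Note how N depends on both ε and δ: halving the tolerated error roughly quadruples the sample size, which is exactly the observer and question dependency discussed above.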
A problem that emerges in the social sciences is that our
conclusions are always surrounded by uncertainty.
Therefore, if we
agree that “Big data” is “a number of observations large enough to allow us to make new
social inferences”, we also have to agree that “Big data” is observer and question
dependent. Let us illustrate the question dependency with the following question:
“What network structures emerge
in online gaming platforms?”
In a hypothetical case where we collect the data from an
online gaming platform with millions of users and billions of interactions, we
could hypothetically know the entire network structure of this platform (we only
have computational limitations). Unfortunately, for many network structures we
have only one observation, even with the millions of observations we collected,
and therefore the “Big data” is reduced
to a single observation. For instance, a network has only one largest connected
component, one maximum clique, and so on.
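To make the single-observation point concrete, here is a small Python sketch (the toy interaction graph is invented for illustration): however many interactions we record, the largest connected component we extract from them is one object, not a sample of many.

```python
from collections import deque

def largest_connected_component(adj):
    """Return the vertex set of the largest connected component of an
    undirected graph given as an adjacency dict {node: set(neighbors)}."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        # BFS over the component containing `start`
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    component.add(v)
                    queue.append(v)
        if len(component) > len(best):
            best = component
    return best

# A toy interaction network with two components.
graph = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b"},
    "d": {"e"}, "e": {"d"},
}
print(sorted(largest_connected_component(graph)))
```

Whatever the size of the platform, this function returns exactly one component: there is no way to resample it from the same data set.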
Another path that can be followed with new data sets, such as
those collected in online communities like Facebook or Twitter, is the study of
asymptotic properties. There has been plenty of research in this
direction for random graph models, for instance on the size of the
largest connected component. However, we have to be aware that these models are
not a good approximation for many social networks.
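As an illustration of such asymptotic results, the following Python sketch (with n and c chosen only for illustration) samples an Erdős–Rényi graph G(n, p) with p = c/n and compares the observed fraction of vertices in the largest component with the limiting value, the solution s of s = 1 - exp(-c*s) for c > 1:

```python
import math
import random
from collections import deque

def giant_component_fraction(n, c, rng):
    """Sample an Erdos-Renyi graph G(n, p) with p = c/n and return the
    fraction of vertices in its largest connected component."""
    p = c / n
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].append(j)
                adj[j].append(i)
    seen = [False] * n
    best = 0
    for s in range(n):
        if seen[s]:
            continue
        # BFS to measure the component containing s
        size, queue = 1, deque([s])
        seen[s] = True
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if not seen[v]:
                    seen[v] = True
                    size += 1
                    queue.append(v)
        best = max(best, size)
    return best / n

def asymptotic_fraction(c, iters=100):
    """Fixed-point iteration for s = 1 - exp(-c * s), the limiting
    giant-component fraction for c > 1."""
    s = 0.5
    for _ in range(iters):
        s = 1 - math.exp(-c * s)
    return s

rng = random.Random(1)
c = 2.0
observed = giant_component_fraction(2000, c, rng)
predicted = asymptotic_fraction(c)
print(round(observed, 3), round(predicted, 3))
```

The simulated fraction tracks the asymptotic prediction closely here precisely because the data-generating model is known; for an empirical social network, where the model is at best an approximation, the gap can be much larger.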
What are the “Big data” challenges for
the coming years?
It is of course pedantic to try to predict the new challenges
for “Big data” in the coming years, but we think that there are a few
challenges that have to be addressed if we want to exploit “Big data” as much
as possible.
First, with the advent of powerful computer simulations
and new data sets, the wording of conclusions has been pushed into the background. There
has been no clear separation between what is meant by shown, proved,
concluded, and inferred. This is understandable, since even in the statistical and
logical communities there is no agreement on how best to define those words.
A large number of physical models have made their
way into social science, but we see little penetration of other important
ideas, such as impossibility theorems, incompleteness theorems, the possibility
of many logical systems, the disagreement within the statistical community about what can be concluded from data,
or the newer problem of what can be
concluded from a computer simulation.
The amount of research on social networks has grown
exponentially in recent years; today we observe not only computer simulations
but also empirical studies on social platforms like Twitter or Facebook.
Unfortunately, many of those studies create arbitrary rules of inference that
allow them to reach astonishing claims, which others find difficult to
believe.
Thus we believe that in the future, new research on “Big
data” will be more careful in “wording” its results.