Big Data and New Opportunities

2.1. Big Data and New Opportunities

The tremendous increase in openly available data has created new roles and opportunities in data science. For example, data journalists look for interesting stories in data much like how traditional beat reporters hunt for news stories. The data life cycle for the data journalist begins with the search for existing data that might have an interesting story, rather than beginning with a research question and looking for how to collect new or use existing data to address the question.

Citizen science projects are another example. They engage many people (and instruments) in data collection. Collectively, these data are often made available to researchers who organize the project and often they are made available in repositories for the general public to further investigate.

The availability of administrative/organizational data creates other opportunities. Researchers can link data collected from scientific studies with, say medical data that have been collected for healthcare purposes; in other words, administrative data collected for reasons that don’t directly stem from the question of interest can be useful in other settings. Such linkages can help data scientists expand the possibilities of their analyses and cross-check the quality of their data. In addition, found data can include digital traces, such as your web-browsing activity, posts on social media, and online network of friends and acquaintances, and can be quite complex.

When we have large amounts of administrative data or expansive digital traces, it can be tempting to treat them as more definitive than data collected from traditional smaller research studies. We might even consider these large datasets as a replacement for scientific studies or essentially a census. This over-reach is referred to as the “big data hubris” [Lazer et al., 2014]. Data with a large scope does not mean that we can ignore foundational issues of how representative the data are, nor can we ignore issues with measurement, dependency, and reliability. One well-known example is the Google Flu Trends tracking system.