2.3. Instruments and Protocols

When we consider the scope of the data, we also consider the instrument used to take the measurements and the procedure for taking them, which we call the protocol. For a survey, the instrument is typically the questionnaire that an individual in the sample answers. The protocol for a survey includes how the sample is chosen, how nonrespondents are followed up, how interviewers are trained, and how confidentiality is protected.

Good instruments and protocols are important to all kinds of data collection. If we want to measure a natural phenomenon, such as the speed of light, we need to quantify the accuracy of the instrument. The protocol for calibrating the instrument and taking measurements is also vital to obtaining accurate measurements. Instruments can go out of alignment and measurements can drift over time, leading to highly inaccurate measurements (see Section 2.5.2).
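
To make the idea of drift concrete, the following is a minimal sketch that simulates an instrument repeatedly measuring a reference standard whose true value is known. The linear drift model, the noise level, and the names `simulate_readings` and `bias_by_block` are assumptions made for illustration; they do not describe any particular instrument or calibration protocol.

```python
import random
import statistics


def simulate_readings(true_value, n, noise_sd, drift_per_reading, seed=0):
    """Simulate n readings of a fixed quantity from a hypothetical instrument.

    Each reading is the true value, plus a drift that grows linearly with the
    reading index (an assumed drift model), plus Gaussian measurement noise.
    """
    rng = random.Random(seed)
    return [
        true_value + drift_per_reading * i + rng.gauss(0, noise_sd)
        for i in range(n)
    ]


def bias_by_block(readings, true_value, block_size):
    """Average error (reading minus reference value) in consecutive blocks."""
    return [
        statistics.mean(readings[start:start + block_size]) - true_value
        for start in range(0, len(readings), block_size)
    ]


# Illustrative values only: a reference standard of 100 units, 300 readings.
readings = simulate_readings(
    true_value=100.0, n=300, noise_sd=0.5, drift_per_reading=0.01
)
biases = bias_by_block(readings, true_value=100.0, block_size=100)
print([round(b, 2) for b in biases])
```

In this sketch the average error grows from one block of readings to the next, which is the kind of pattern a recalibration step in the protocol is meant to catch.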

Protocols are also critical in experiments. Ideally, any factor that can influence the outcome of the experiment is controlled. For example, temperature, time of day, confidentiality of a medical record, and even the order of taking measurements need to be kept consistent so that potential effects from these factors can be ruled out.

With digital traces, the algorithms used to support online activity are dynamic and continually re-engineered. For example, Google’s search algorithms are continually tweaked to improve user service and advertising revenue. Changes to the search algorithms can affect the data generated from the searches, which in turn affect systems built from these data, such as the Google Flu Trends tracking system (see Section sec:scope_bigdata). This changing environment can make it infeasible to maintain consistent data collection protocols and difficult to replicate findings.

Many data science projects involve linking data from multiple sources. Each source should be examined through this data-scope construct, and any differences across sources should be considered. Additionally, the matching algorithms used to combine data from multiple sources need to be clearly understood so that the populations and frames of the sources can be compared.
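
As a rough illustration of why the matching step deserves scrutiny, the sketch below links two hypothetical sources on an identifier and reports which records matched and which appear in only one source. The identifiers, the normalization rule in `normalize_id`, and the function `compare_sources` are all assumptions made for this example; real linkage problems often call for more elaborate matching rules.

```python
def normalize_id(raw_id):
    """Assumed matching rule: ignore surrounding whitespace and letter case."""
    return raw_id.strip().lower()


def compare_sources(ids_a, ids_b):
    """Report which normalized identifiers match and which are unmatched."""
    keys_a = {normalize_id(i) for i in ids_a}
    keys_b = {normalize_id(i) for i in ids_b}
    return {
        "matched": sorted(keys_a & keys_b),
        "only_a": sorted(keys_a - keys_b),
        "only_b": sorted(keys_b - keys_a),
    }


# Hypothetical identifiers from two sources that we want to link.
survey_ids = ["P01", "p02 ", "P03"]
clinic_ids = ["p02", "P03", "p04"]

report = compare_sources(survey_ids, clinic_ids)
print(report)
```

Records that fall into `only_a` or `only_b` show where the two access frames differ. Changing the matching rule, for instance by dropping the case normalization, changes which records link, and so changes the population that the combined data represent.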

Measurements from an instrument taken to study a natural phenomenon can also be cast in the Venn diagram of a target, access frame, and sample (see Section sec:scope_construct). This approach is helpful in understanding their accuracy.