Comment by Lynn Werner
The preceding lectures of this summer semester gave us intriguing insights into uncertainties in infrastructure across various research fields. The final lecture of the semester, delivered by Lindsay Poirier, prompted us to consider the uncertainties involved in constructing datasets and how to incorporate them into critical data literacy.
Poirier opened the lecture by introducing her work as a Professor at Smith College, where she teaches interdisciplinary data science courses for both humanities and STEM students. In one of her courses, the students drew up the following dichotomy between “good” and “bad” datasets:
| Good datasets | Bad datasets |
| --- | --- |
| Clean | Messy |
| Unbiased | Biased |
| Complete | Incomplete |
| Neutral | Partial |
| Objective | Subjective |
Taken together, these attributions imply that “good” datasets leave no room for indeterminacies. The toy datasets commonly used in data science courses, such as the Palmer Penguins dataset for R, are treated as “good” datasets of this kind. They are well-structured, contain few or no missing cases, and are designed to promote an understanding of data science formats. For the students, these datasets are presented as impartial and apolitical, with no consideration given to the circumstances of their construction.
In her lecture, Poirier urged us to challenge the notion of “good” and “bad” datasets and explore how advocacy and indeterminacies can be identified in all datasets. This necessitates an examination of datasets as cultural artefacts and an exploration of how the sites of data construction and the indeterminacies in the data production process are manifested in them.
The lecture focused on Poirier’s research in the field of disclosure datasets, which “aggregate information produced and reported by the same institutions they are meant to hold accountable” (Poirier, 2022, p. 1446). This kind of data is particularly vulnerable to institutionalized incentives for data manipulation, commonly known as “Juking the Stats” or “Cooking the Books.” These practices involve not only falsifying data but also employing deceptive accounting (deliberately misleading and vague standards and laws) and phantom reductions. All of these practices are rooted in uncertainties in the data production process.
Considering these various practices of “Juking the Stats,” there is not merely a dichotomy between “good” and “bad” datasets, but rather data distortions that occur at different stages of dataset construction. Poirier argues that addressing these distortions requires a more expansive and nuanced discourse on how to evaluate datasets, as well as more robust ways of teaching critical data literacy. This includes, amongst other things, evaluating how data contribute to particular social structures, and allowing students to work with real, complex datasets that pique their interest.
The discussion following the lecture focused on the concept of “data distortions” and the extent to which these alterations during the data construction process are deliberate or unintentional. Poirier emphasised the social and systemic influences that affect data production processes and clarified that distortions can also be understood as representations in the data that do not align with people’s experiences.
Relating the conclusions of this lecture to our thematic focus for the term, “Infrastructuring Indeterminacies,” we can see that an important aspect of critical data literacy is to acknowledge indeterminacies as a part of the infrastructures of data production.
References:
Poirier, L. (2022). Accountable Data: The Politics and Pragmatics of Disclosure Datasets. In: FAccT ’22: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pp. 1446-1456.