Tidy Data For Humanists

This year, Price Lab’s week-long digital humanities training institute, Dream Lab, was canceled due to safety concerns around COVID-19. We created this series of podcasts not as a replacement, but rather to introduce you to some of the people who make Dream Lab such a great experience!

We’re all familiar with tidying up our living spaces, but how can data be tidied up? Matt Lincoln is a research software engineer and digital humanities developer at Carnegie Mellon University in Pittsburgh. His class, Tidy Data, focuses on combating data anxiety in the humanities by teaching humanists how to handle complex relationships and uncertainty in data, and how to format their information tidily so that it can be reshaped to drive databases, websites, analyses, and visualizations.

Resources:
Tidy Data – Annotated Readings and Lessons

Course Description:
Many tools and tutorials promise to help you clean up your messy data, which is an essential step before doing any kind of network, text, spatial, or quantitative analysis or visualization. But how do we even figure out what “clean” means when it comes to complex humanities knowledge, especially when we may not yet know what kind of analysis we eventually want to do? Participants will come out of this class understanding how to create a data plan to capture the parts of their sources that are going to be important for their research questions, handle complex relationships and uncertainty, and format that information into tidy data that can then be reshaped as needed to drive databases, websites, analyses, and visualizations. We will also cover the practical side of using software such as Google Sheets, OpenRefine, and Palladio to collect, tidy, and make exploratory visualizations of humanistic data. Additionally, we’ll learn about using Linked Open Data to interconnect our research databases with objects, documents, and authority lists maintained by institutions such as archives, libraries, and museums, focusing on pragmatic steps that real-life researchers can take to get the most out of connecting their newly created knowledge (and the data that come with it) back into the larger ecosystem on which we all depend. This course assumes no prior knowledge of databases or coding, and will use freely available open source tools. We will work with some sample data sets over the course of the week, but participants are encouraged to bring their own data, or sources that they are potentially trying to transform into data, for group “data therapy” sessions in order to apply lessons learned each day to their own work and research.

Instructor:
Dr. Matthew Lincoln is a research software engineer at Carnegie Mellon University Libraries, where he focuses on computational approaches to the study of history and culture, and on making library and archives collections tractable for data-driven research. His current book project with Getty Publications, co-authored with Dr. Sandra van Ginhoven, uses data-driven modeling, network analysis, and textual analysis to mine the Getty Provenance Index Databases for insights into the history of collecting and the art market. He earned his PhD in Art History at the University of Maryland, College Park, and has held positions at the Getty Research Institute and the National Gallery of Art. He is an editorial board member of The Programming Historian.