In 2011 Peter Turchin and his collaborators created Seshat, a “global history databank” that

systematically collects what is currently known about the social and political organization of human societies and how civilizations have evolved over time. This massive collection of historical information allows us and others to rigorously test different hypotheses about the rise and fall of large-scale societies across the globe and human history.

The project is named after the ancient Egyptian goddess of wisdom, knowledge, and writing.

Turchin, et al. recently published a brief overview of the project, “An Introduction to Seshat”. The Seshat databank supports the study of cliodynamics, a term coined by Turchin. Per Wikipedia, cliodynamics is a “transdisciplinary area of research that integrates cultural evolution, economic history/cliometrics, macrosociology, the mathematical modeling of historical processes during the longue durée, and the construction and analysis of historical databases.”

The ideas are fascinating, although the implementation appears to suffer from the usual maladies of digital humanities projects: a dearth of hands-on computer science and information technology labor, sustainability challenges and the boom-and-bust of grant funding, etc.

On the implementation side I’m particularly interested in the “Data Collection” section of the “Introduction” paper. Creating and curating data and metadata at the scale of the Seshat databank is expensive in terms of experts’ time and effort. The paper describes a template-based approach to collecting data from primary and secondary sources:

Prior to inputting data on a topic, we develop a conceptual scheme through Seshat workshops. The goal is to create a quantitative variable (e.g., polity population) or multiple proxy variables capturing various aspects of a more complex characteristic (e.g. well-being). Seshat research assistants (RAs) then code several test cases in consultation with experts, continually refining the variables. RAs are trained and supervised by teams of advanced (postdoctoral or professorial) social scientists and historians.

Once a coding scheme is operationalized to test theories, data collection begins. First, RAs search the most up-to-date and relevant scholarship (with expert guidance and direct supervision by Seshat’s more senior researchers), sourcing both primary and secondary material, and enter preliminary data. Second, RAs compile lists of questions on values that cannot be coded unambiguously, or on which information in the published sources is lacking, and seek help from the experts on the polity. Finally, we ask experts to review the data to check coding decisions made by RAs and help us fill gaps. The coding process is never “complete,” as Seshat data are constantly checked by various stakeholders, and new relevant data may appear, or novel insights may alter the understanding of known documents and material.