Model the source first! Towards source modelling and source criticism 2.0

This article presents a proposal for data collection from textual resources in history and the social sciences. The data model and data collection practice we propose are based on detailed yet flexible semantic encoding of the original natural-language syntactic structure and wording, literally translating texts line by line into structured data while preserving all of their vagaries, complexities, conflicting testimonies and the like. Our use case is the study of medieval Christian dissent and inquisition. We propose a thorough way of modelling the sources in order to make them accessible to any conceivable kind of quantitative and computational analysis. We frame our approach as "serial and scalable reading". Representing a new variety of "serial history", it allows us to understand and model texts as never before, and helps bridge the gap between quantitative and qualitative research in non-trivial ways.

For a citable and paginated version, please refer to Zenodo
Zbíral, David, Shaw, Robert L. J., Hampejs, Tomáš, & Mertel, Adam. (2021). Model the source first! Towards source modelling and source criticism 2.0. Zenodo. https://doi.org/10.5281/zenodo.5218926.

15 Aug 2021

David Zbíral • Robert L. J. Shaw • Tomáš Hampejs • Adam Mertel

Since the very beginning of the Dissident Networks Project (DISSINET) in 2019, we have been working on a data model and workflow for the collection of complex structured data from textual sources: principally, in our case, the records of inquisition trials. By doing so, we aim to deploy various quantitative approaches to the identification, understanding, and explanation of the social, spatial, and discursive patterns of medieval religious dissent and inquisition.

Our solution is founded on one key observation: in historical research, we often deal with sources that are fundamentally relational in what they convey; that is to say, sources which contain extensive information concerning relationships and interactions among people, places, physical objects, events, meanings, etc. These connections can be transformed quite naturally into structured data.

On this basis, we have developed in DISSINET an approach that we call source modelling. It represents a maximalist approach to data collection, aiming to render every sentence of the source accurately as relational data. It is therefore more time-intensive than simply extracting the data most obviously pertinent to a particular research problem, but it brings immense rewards. Above all, it allows us to preserve a very high level of nuance and supporting detail: not just the “positive information”, but also the language of our sources (specific terminology in the original languages) and the conditions under which information was produced (the full chain of information flow, inquisitors’ questions, partial or complete denials by suspects, etc.).

Systematically representing such information in the collected data not only allows us to do justice to the intricate interweaving of voices in inquisitorial records, but also paves the way to a new practice of computer-assisted source criticism.

In this piece, we will first discuss in greater depth why we bother with such a maximalist approach to data collection. After that, we will propose how best to model inquisition records, presenting what we have developed over the last two and a half years in DISSINET.

Why structured data?

The main rationale for the use of structured data is the discovery of patterns which could easily escape even very close unassisted reading. This is mostly because our normal practices of reading tend to favour the richest anecdotal narratives, along with the patterns we already expect to find or want to question. Computer assistance in reading, however, can provide generalization, visualization, and summarization, allowing the researcher to “zoom” in and out and discover significant patterns that would otherwise go unanticipated.

For instance, our normal reading practices often uncover apparent differences between the patterns of religious engagement within two dissident groups, or the patterns of investigation of two different inquisitors, etc. But it is quite hard to assess more than two or three sets of phenomena simultaneously. Similarly, we cannot accurately measure the extent of these differences. The collection and analysis of structured data offers precisely such possibilities.

To illustrate this, let us look at the broad picture of the importance of men vs. women in the Kent heresy trials of 1511–12 and in the trial against the Guglielmites in Milan in 1300, as reflected in our data (still quite preliminary in this case). In the following “parallel coordinates plot”, each line represents a person involved in dissidence; the red lines are women, the blue ones are men. Each line shows that individual’s scores on some of the standard measures of importance within a social network.

Fig. 1: Parallel coordinates plot of social network importance in the Kent heresy trials (1511–1512) and the Guglielmite trial (Milan, 1300).

Serial history is “less interested in individual facts than in elements capable of integration into a homogeneous series”.

Pierre Chaunu, ‘L’histoire sérielle’, 297.

By “scalable reading”, we mean a reading which transcends the dichotomy between close and distant reading, neither of which describes our procedure in DISSINET well. Our reading and semantic encoding is definitely “close reading” in the sense of detail; in fact, we could even speak of “(very) slow reading”, since coding sources in the way described below easily takes 3–8 hours per standard page. At the same time, we “close-read” our sources in this way precisely to achieve some of the aims of Franco Moretti’s “distant reading”, while still preserving the quality control of a historian who spends countless hours on every couple of pages. Overall, however, we want to achieve “reading at all distances”, providing new perspectives both for the “parachutists” and the “truffle hunters” (to borrow Emmanuel Le Roy Ladurie’s proverbial image), as well as everyone in between. The ideal balance for today’s historians is indeed to read across scales, to see the paths that lead from small-scale relational patterns all the way up to big questions. With scalable reading, we seek to transcend the notions of “close” and “distant”, “macro” and “micro” by zooming in and out.

From exceptional anecdote to serial complexity

A new kind of serial history is especially relevant in the study of inquisitorial records. In this field, scholars have often equated rich, exceptional narrative with reliability. In fact, however, there is nothing in the formulaic character of our sources which makes them automatically unreliable, and detail and richness do not necessarily equate to truthfulness.

Indeed, sources that appear formulaic in language and topic can actually contain unexpected complexity. Even where they are superficially repetitive, the ways in which their small-scale features combine and relate to one another may be far less regular and may hide unanticipated depths. Therefore, we strive to bring the study of inquisitorial records beyond exceptional anecdotes and instead focus on their serial complexity. Such an approach offers otherwise inaccessible insights and is by no means devoid of the nuance and beauty that historians appreciate.

And where has source criticism gone?

Nevertheless, in historical study, we usually deal with very complex sources full of vagaries, uncertainties, fuzziness, conflicting testimonies and so forth. Do computational approaches not run the risk of blinding us to such complexities?

In our experience, computational/digital history seems strongly biased towards fact-oriented data collection: that is to say, isolating key details from their original context in order to render them as data. And of course, such fact-oriented data collection has problems. Firstly, many important decisions on what represents a ‘fact’ often have to be taken ad hoc at the moment of entering a particular datum. Secondly, the data usually come without any significant indication of the conditions of their production. This is not to say that researchers are unaware of the issues this creates: these conditions are often cited, usually in opening or concluding remarks. Nevertheless, they are never adequately represented in the data, and therefore not in the results themselves.

Take a sentence fairly typical of inquisitorial records, which could read: “Asked by the inquisitor, Peter said that he had seen Bernarda adore heretics in Lanta some twenty years ago.” For the purposes of computational study, it is tempting to transform this directly into “Bernarda adored heretics in Lanta in ca. 1225”. But there is a substantial loss of information here. Who is speaking? To whom and in what context? What is the time span between the testimony and the reported event? All this is lost if we simply construct Bernarda’s adoration of heretics around 1225 as a fact rather than as a claim within a source.

In DISSINET, we think this is a major problem. We deal with sources which offer precious glimpses of the conditions of their production: sometimes we have a trace of the questions (often leading) as well as the answers within an interrogation, sometimes of the coercive procedures (e.g. detention or, more rarely, torture) used to help extract information, and sometimes of conflicting testimonies about the same events. These conditions are in fact often much more visible in trial records than in most medieval sources, and they should thus provide the essential context for our analyses.

Since these conditions (e.g. interactions at trial) can also be neatly expressed as relations, our solution is to model them in the same way as the rest of our data, and indeed as part of the same data. Therefore, we first model the source itself, literally transforming it line by line into structured data, rather than only isolating the details of immediate interest. This then allows us to inscribe source criticism at the heart of the analysis.

Source modelling: a syntactic approach

Having made the case for modelling sources in their entirety as structured, relational data, let us now turn to how we seek to achieve this. Our process of source modelling produces data which remain extremely close to the original but at the same time come enhanced and formalized.

Our approach is strongly syntactic: it is based on the manual formalization of the sentences of our sources into data statements with subject(s), predicate(s) and two objects. We have preferred this basic structure – a quadruple or, more colloquially, a “quad” – because even a very simple and omnipresent type of sentence such as “Peter received a gift from Elisabeth” requires four “slots”: subject, predicate, and two objects rather than one. Such basic data statements are almost always extended through various modifiers (representing adjectives, adverbs, etc.), which we call property statements (or “props”). Those property statements also come in the quadruple structure, and allow the same flexibility and complexity as standard statements.

These statements, with their syntactic structure and rich, explicit semantic relationships, are woven together into a knowledge representation known as a knowledge graph. Our knowledge graph is close to the notion of “Linked Data”, but its structure comes even closer to representing natural language.
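To give a first, minimal illustration of the quadruple idea, the sentence “Peter received a gift from Elisabeth” occupies the four slots roughly as follows. (This is a simplified sketch in Python with ad hoc field names; the assignment of the two object slots here is purely illustrative, and our actual data structures differ in detail.)

# A minimal, illustrative sketch of a quadruple ("quad"); the field names
# and the assignment of the two object slots are ad hoc.
quad = {
    "subject": ["Peter"],        # who acts
    "predicate": "received",     # an Action Type (kept in the source language in practice)
    "object1": ["a gift"],       # what was received
    "object2": ["Elisabeth"],    # from whom
}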

DISSINET data model in a nutshell

While it is not possible to present every detail of the DISSINET data model in just a few paragraphs, its basics can be broken down as below.

Statements and other Entities: the pieces of the semantic jigsaw

As already stated, our source modelling is founded on data statements.

  • In our data model, these are primarily represented by Statements.
  • A Statement is an Entity type, with a unique identifier.

Statements have the purpose of relating other Entities. The other main Entity types are:

  • Action type
  • Concept
  • Person
  • Group
  • (physical) Object
  • Location
  • Event
  • Value

(N. B. “source modelling” is not bound to just one specific ontology, and other projects can easily adopt a different classification of entity types while preserving the overall approach.)

Each of these Entities also has a unique identifier (URI).

Wherever possible, we render Entities in the original source language.

  • If our text says that somebody “adored” (adoravit) the heretics, we do not, upon data collection, transform this into any modern interpretive concept, but rather keep the original Latin verb.
  • Our entity lists thus contain entries in Latin, Middle English, Occitan, etc., with modern English serving only as an analytical metalanguage where necessary.
  • This is especially important for Action Type entities, which define the predicate of Statements, and Concepts, which serve to characterise many other entities (see below).

This allows Statements to relate entities in a way that not only matches the meaning of the original sentence, but closely mirrors its language.
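To make this concrete, the Entity types and the way we keep source-language labels could be sketched as follows. (Again, this is a simplified, purely illustrative Python sketch with ad hoc names and example identifiers; our actual implementation differs in detail.)

from dataclasses import dataclass
from enum import Enum

# Illustrative only: names and attributes are ad hoc, not our production schema.
class EntityType(Enum):
    ACTION_TYPE = "action type"
    CONCEPT = "concept"
    PERSON = "person"
    GROUP = "group"
    OBJECT = "object"        # physical object
    LOCATION = "location"
    EVENT = "event"
    VALUE = "value"
    STATEMENT = "statement"  # Statements are Entities too

@dataclass
class Entity:
    uri: str                 # unique identifier
    entity_type: EntityType
    label: str               # kept in the source language wherever possible
    language: str = "la"     # e.g. Latin, Middle English, Occitan

# The Action Type for "adored" keeps the original Latin verb:
adoravit = Entity("urn:example:action/adoravit", EntityType.ACTION_TYPE, "adoravit")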

Modelling complex semantics through Statements and Entities

At its most basic level, a Statement is built around the quadruple structure mentioned above: a predicate (an instance of an Action Type) connects any number of entities in up to three actant positions – subject, first object and second object. The number of actant positions filled depends on the predicate’s valency. Some predicates have no actants (e.g., “it rains”), some have only a subject (e.g., “John came”), some have a subject and one object (e.g., “John brought some books”), and still others have a subject and two object positions, each with different semantics (e.g., “John brought some books to Adelaide”).
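A simplified sketch of this Statement structure, again purely illustrative (entities are referred to by plain strings, and the assignment of the object positions is ad hoc), might look like this:

from dataclasses import dataclass, field
from typing import List

# Illustrative sketch of the quadruple Statement structure; not our actual schema.
@dataclass
class Statement:
    uri: str
    predicate: str                                     # label of an Action Type
    subjects: List[str] = field(default_factory=list)
    objects_1: List[str] = field(default_factory=list)
    objects_2: List[str] = field(default_factory=list)

# Valency determines which actant positions are filled:
it_rains = Statement("st/1", "rains")                            # no actants
john_came = Statement("st/2", "came", subjects=["John"])         # subject only
john_brought = Statement("st/3", "brought",                      # subject and two objects
                         subjects=["John"],
                         objects_1=["some books"],
                         objects_2=["Adelaide"])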

To take an example in modern English:

David borrowed 10 Czech crowns and 50 hallers from Adam and Tomáš at the end of the year 2019.

At its simplest level:

  • “David” is the subject (a Person);
  • the predicate is “borrowed” (an Action Type);
  • object 1 is “Adam” and “Tomáš” (two Persons); and
  • object 2 is “10 Czech crowns and 50 hallers” (an Object).

But there is much more complexity to model in this example.

  • For instance, how do we make clear that the Object labelled “10 Czech crowns and 50 hallers” is money, and has a financial value of 10.50 in the currency CZK?
  • How do we record the time at which the action took place (at the end of 2019)?

This is where properties (‘props’) come into play: these are effectively data statements of the same quadruple structure that serve to expand the mother Statement in various ways.

  • ‘Props’ apply a “property type” (the kind of property to be attached) and a “property value” (the content of that property) to their subject.
  • The subject of the prop can be any of the entities involved in the mother Statement (typically with adjectives), or the Statement itself (typically with adverbs and adverbial clauses).

The diagram below illustrates how properties are applied to both the actants and the predicate to frame the aforementioned example; a code sketch of the same encoding follows the diagram.

Fig. 2: Data model diagram for the sentence “David borrowed 10 Czech crowns and 50 hallers from Adam and Tomáš at the end of the year 2019”.
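In the same illustrative vein, the whole example, including its props, could be sketched roughly as follows (ad hoc field names; the placement of individual props is simplified):

# Illustrative encoding of "David borrowed 10 Czech crowns and 50 hallers
# from Adam and Tomáš at the end of the year 2019"; not our canonical coding.
statement = {
    "id": "st/borrow",
    "predicate": "borrowed",                         # Action Type
    "subjects": ["David"],
    "objects_1": ["Adam", "Tomáš"],
    "objects_2": ["10 Czech crowns and 50 hallers"],
    "props": [
        # props on an actant (playing the role of adjectives)
        {"subject": "10 Czech crowns and 50 hallers",
         "property_type": "instance of", "property_value": "money"},
        {"subject": "10 Czech crowns and 50 hallers",
         "property_type": "financial value", "property_value": 10.50},
        {"subject": "10 Czech crowns and 50 hallers",
         "property_type": "currency", "property_value": "CZK"},
        # a prop on the Statement itself (playing the role of an adverbial)
        {"subject": "st/borrow",
         "property_type": "time", "property_value": "end of 2019"},
    ],
}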

‘Props’ and their uses

The uses of ‘props’ include:

  • Modelling adjectives concerning actants.
    • These are modelled through Concepts (for property types and qualitative property values) and Values (for instance, for numerical property values).
    • Crucially, the properties remain attached to the textual context from which they derive, rather than the descriptors simply being recorded as facts about the Entities in question.
  • Instantiating Entities to parent types. For example, it is not enough to create an object for an “apple” without also recording that it is an instance of a wider class of things (in this case, the generic Concept of apple). As with adjectives, this is done by relating them to Concepts. These Concepts are themselves often arranged in conceptual hierarchies inside our data model (e.g. “apple” is also a subset of “fruit”, and “fruit” of “food”).
  • Defining time and place of action as far as possible.
    • This can be in absolute or relative terms (e.g. in relation to another Statement).
    • It can utilize fuzzy descriptors as well as more defined values.
  • Recording other adverbials, for example those concerning the manner of action, or the circumstances, causes, or consequences of action. Statements can be related not only to adverbial Concepts and Values, in the same manner as adjectives, but also to other Statements through properties. The latter is often essential in matters of causation: in coding a sentence beginning “Because of this deed...”, it is necessary to create a link back to the “deed” described in a preceding Statement (see the sketch after this list).
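Two of these uses – instantiation to parent Concepts within a conceptual hierarchy, and Statement-to-Statement links such as causation – can be sketched, again purely illustratively and with ad hoc names, as follows:

# 1. Instantiating an Object to a parent Concept, with the Concepts
#    themselves arranged in a hierarchy (apple -> fruit -> food):
concept_hierarchy = {"apple": "fruit", "fruit": "food", "food": None}
instantiation_prop = {"subject": "object/apple-1",
                      "property_type": "instance of",
                      "property_value": "concept/apple"}

# 2. Relating a Statement to another Statement, e.g. for causation
#    ("Because of this deed..."):
causation_prop = {"subject": "st/2",             # the later Statement
                  "property_type": "cause",
                  "property_value": "st/1"}      # the "deed" in a preceding Statement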

Statement perspectives: Modality, Epistemic level, and Certainty

To further characterize the claim denoted by a Statement, we carefully record three different aspects of perspective: modality, epistemic level, and certainty (see the sketch after this list).

  • Modality is intrinsic to the text and describes in what semantic mode the statement is formulated. This allows us, for instance, to differentiate positive from negative assertions (“Peter was there” vs “Peter was not there”), and assertions from questions (“Was Peter there?”), statements expressing desirability (“May Peter go there”), conditionality (“Were Peter to go there…”) and so on.
  • Epistemic level describes the position from which a Statement is formulated. We differentiate three levels: textual (that is, an explicit claim of the source), interpretive (that is, our interpretation but still close to the text), and inferential (that is, our inference external to the source). Importantly, we are able to mark epistemic levels for both whole statements and their individual parts. (This comes in handy for example if the text provides a property value, e.g. “baker”, but the property type, e.g. “profession”, is an editorial classification.)
  • Certainty is, in our understanding, the editorial judgement on how reliable the statement in question is.
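Purely as an illustration (the actual labels and the granularity of the certainty scale in our data model differ), these three perspective attributes can be sketched as small sets of values:

from enum import Enum

# Illustrative value sets only; the labels used in our data model differ.
class Modality(Enum):
    POSITIVE_ASSERTION = "positive assertion"   # "Peter was there"
    NEGATIVE_ASSERTION = "negative assertion"   # "Peter was not there"
    QUESTION = "question"                       # "Was Peter there?"
    DESIRABILITY = "desirability"               # "May Peter go there"
    CONDITIONALITY = "conditionality"           # "Were Peter to go there..."

class EpistemicLevel(Enum):
    TEXTUAL = "explicit claim of the source"
    INTERPRETIVE = "interpretation close to the text"
    INFERENTIAL = "inference external to the source"

class Certainty(Enum):                          # illustrative scale only
    CERTAIN = "certain"
    PROBABLE = "probable"
    DUBIOUS = "dubious"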

Modelling textual order and information flow

Statements can be like a jigsaw puzzle in and of themselves. But, as Entities, they also form part of a bigger picture, one that is informative about the way in which information was created and communicated.

Crucially, our data model preserves the relation of any Statement not only to the entire Text, but also to its specific part (e.g., a specific document within a specific trial, which is in turn part of an inquisitorial register). This relational structure serves to model the embeddedness of documents. We always provide Text parts with metadata concerning the roles of Persons in relation to them. Thus it is immediately clear in whose deposition, in front of which inquisitor, a specific piece of information appears. Within this structure, the exact textual order of the information is also preserved by the order of the Statements. All these elements are connected to Resources, which are representations of Texts, e.g. a specific reproduction of a manuscript or a specific scan of a printed edition.
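A simplified, illustrative sketch of this Text / Text-part hierarchy (ad hoc names and identifiers; Resources reduced to plain references) might look as follows:

from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative sketch only; not our actual schema.
@dataclass
class TextPart:
    uri: str
    label: str
    person_roles: Dict[str, str] = field(default_factory=dict)  # person -> role
    statements: List[str] = field(default_factory=list)         # Statement ids in textual order
    parts: List["TextPart"] = field(default_factory=list)       # nested parts (e.g. documents within a trial)
    resources: List[str] = field(default_factory=list)          # e.g. a manuscript reproduction or a scan

register = TextPart(
    "text/register-1", "inquisitorial register",
    resources=["resource/ms-reproduction-1"],
    parts=[TextPart("text/deposition-1", "deposition of Bernarda",
                    person_roles={"person/bernarda": "deponent",
                                  "person/inquisitor-1": "inquisitor"},
                    statements=["st/1", "st/2", "st/3"])])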

Our line-by-line coding, meanwhile, records the full cascade of information flow.

  • Our sources contain a great deal of semantic material concerning the flow of information: questions (“After being asked whether...”), responses (“He responded...”), admissions, claims, and denunciations (“He said...”, “He confessed...”, “He accused...”).
  • These are critical to situating the origin of the information offered by our sources: they must be recorded in their own time, place, and context.

Our structure allows for chains to be formed between multiple statements, in order to capture such flows.

E.g. Sed in crastinum audivit [Petrus Pictavini] prædictum Raymundum Petri referentem sibi quod illa nocte prædicti hæretici hæreticaverant dictum infirmum [Raynaldetum de Soricino].

But next day he heard the aforesaid Raymond Peter telling him that that night the aforesaid heretics had hereticated the said sick man [Raynaldetus of Soricino].

Fig. 3: Data model diagram for chained statements.
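Sketched in the same illustrative manner (the decomposition and the assignment of object positions are simplified, not our canonical coding of this passage), the chain for this example involves three Statements, two of which appear as actants of others:

# Illustrative decomposition into chained Statements.
st_heretication = {                        # "the heretics had hereticated the sick man"
    "id": "st/heretication",
    "predicate": "hereticaverant",
    "subjects": ["the aforesaid heretics"],
    "objects_1": ["Raynaldetus de Soricino"],
}
st_telling = {                             # "Raymond Peter telling him that ..."
    "id": "st/telling",
    "predicate": "referentem",
    "subjects": ["Raymundus Petri"],
    "objects_1": ["Petrus Pictavini"],
    "objects_2": ["st/heretication"],      # the reported content is itself a Statement
}
st_hearing = {                             # "he heard Raymond Peter telling him ..."
    "id": "st/hearing",
    "predicate": "audivit",
    "subjects": ["Petrus Pictavini"],
    "objects_1": ["st/telling"],           # what was heard: the telling, as a Statement
}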

Carefully nesting every piece of information in the appropriate lowest-level Text part and preserving the information flow gives our data a layering that is still extremely rare in historical datasets. Both these elements help us achieve our goal of producing data that are highly contextualized within the discursive qualities of the evidence. The pitfalls of “fact-oriented” data collection are almost completely avoided. Crucially, such source modelling helps us not only to bring an extremely high level of source-critical nuance to the quantitative study of medieval dissidence, but also to make the information flows, including the process of trial and recording, an object of study in themselves.

Interface

This tour of our data model makes clear our maximalist approach to data collection and its possibilities. We preserve the utmost detail of the original, and at the same time we enhance it with a great deal of additional information. This allows us maximum flexibility in terms of data projections, exploratory data analysis, and asking otherwise unapproachable questions. We can dig out much more than if our data collection were narrowly hypothesis-driven, and therefore highly selective.

It is also, as previously stated, inevitably quite time-consuming and intensive work. Interface development has thus run parallel to our efforts to hone our data model, and has indeed informed it.

In developing an interface, we started simple. We used a set of connected Google Sheets tables: for statements, “coding sheets” where each row represents one statement and the columns serve to place the different entities in the correct “boxes”; for each type of entity such as persons, locations etc., another dedicated table.

Fig. 4: Coding statements in a Google Sheets table.

It turned out that starting simple and building up was almost the only possible way of starting at all. Using this sort of ‘nuts and bolts’ interface to attempt source coding allowed us to refine not only the requirements for a future front-end but, most crucially, the needs of the data model itself. The data model as it stands is a testament to this approach of trying to structure real source data ever more perfectly in an ad hoc interface of columns: if we had started from a pre-existing tool or standard for data collection, inevitably tied to somebody else’s ontology, it would not have attained its present flexibility. Constant feedback between our emergent coding practices and the challenges we discovered in the sources – and, crucially, the constant team discussion of these issues – has shaped our entire approach to source modelling.

We are now transitioning to a more coder-friendly data collection interface of our own creation: InkVisitor. This allows us to perform exactly this kind of rich semantic coding, but in a much more convenient way than Google Sheets. In spite of its name, InkVisitor is a very general data collection interface, with potential application beyond the coding of inquisitorial records: both the interface and the underlying data model are appropriate to other source categories.

Fig. 5: The interface of “InkVisitor”, custom software for source modelling.

Wrapping up

This article can only skim the surface of our source modelling and its rationales. We hope, nevertheless, that the potential of rich, structured data for answering exciting questions about inquisition and dissidence is clear. Our modelling of sources enables “serial and scalable reading”: by seeing many data points together, we are able to appreciate unanticipated complexity even in formulaic texts and to track the connections between small-scale patterns and the bigger picture. We can re-approach not only dissent and inquisitorial activities from these perspectives, but also the sources themselves. If computational history sometimes appears less critical in its approach to sources, our maximalist approach to data collection allows for what might be called “source criticism 2.0”. The classic source-critical focus on the conditions of information production and transmission can now be bolstered by systematic data on those conditions and by analytical techniques which allow us to face the challenges of our sources as never before.

To answer questions on dissidence, inquisition and the sources themselves, DISSINET uses multiple quantitative methodologies. While we began with a focus on social network analysis and geographic information science as extremely promising tools for the exploration of medieval inquisition registers, the evolution of our source modelling approach has helped us to recognise possibilities far beyond these boundaries. This manifests itself both thematically – we explore discursive, as well as social and spatial, patterns – and methodologically: the features captured by our data model can be explored at multiple levels through all manner of computational techniques, ranging from “classical” regression modelling through spatial statistics to the formal modelling of path dependencies in deposition narratives.

The crucial pay-off of source modelling is that it allows us to represent the sources in their entirety and to make them themselves the objects of systematic computational analysis. Instead of falling prey to first impressions or to rich but anecdotal narratives, we are able to defer some decisions on the specific data projection to the moment when we have already closely inspected the source at various scales. We thus make rigorously informed decisions regarding our analyses. They are informed not just by closely reading the source in a classical sense, but by understanding it at a new level, one which is virtually inaccessible to unassisted reading practices.

 

Acknowledgements:

The research presented in this article is a part of the “Dissident Networks Project” (DISSINET, https://dissinet.cz) and received funding from the Czech Science Foundation (project No. GX19-26975X “Dissident Religious Cultures in Medieval Europe from the Perspective of Social Network Analysis and Geographic Information Systems”).

 

Cite as:

Zbíral, David, Shaw, Robert L. J., Hampejs, Tomáš, & Mertel, Adam. (2021). Model the source first! Towards source modelling and source criticism 2.0. Zenodo. https://doi.org/10.5281/zenodo.5218926.

