The EMLO project contains dozens of correspondence collections centred around different historical figures. Each collection is either maintained by a single institution or is a merger of smaller collections maintained across multiple institutions.
The metadata of the correspondences has been mapped to a single schema.
Comparing different sets of correspondences at different scales draws attention to different aspects of comparison. At the same time, it brings to the surface differences in how the digital collections were shaped by selection criteria.
At a small scale, it is easy to see, for instance, that a collection around a historical figure such as Samuel Hartlib or Françoise de Graffigny contains not only letters authored by or addressed to that figure, but also some letters between the correspondents in their network. When working with many correspondence collections containing thousands or tens of thousands of letters, this is a detail that is easily lost in overviews of metadata records and in most summary statistics.
import numpy as np
import pandas as pd
import glob
import matplotlib.pyplot as plt
Let's first load the data into a dataframe and inspect a few rows to get an idea of what is in there.
# read the merged letters file into a Pandas dataframe
merged_letters_file = '../data/emlo_letters.csv'
df = pd.read_csv(merged_letters_file, sep='\t')
df
# show the number of distinct authors and addressees
print('number of distinct authors:', df['author'].nunique())
print('number of distinct addressees:', df['addressee'].nunique())
# The correspondence collections in the dataset with the number of letters they contain.
df.collection.value_counts()
# The authors in the dataset with the number of letters they sent.
df.author.value_counts()
# The addressees in the dataset with the number of letters they received.
df.addressee.value_counts()
# Adjust the default figure size so that two plots placed
# next to each other as subplots are still big enough.
plt.rcParams['figure.figsize'] = [15, 5]
# Distribution of number of letters per author
# create a plot canvas with two adjacent subplots
plt.subplot(1,2,1)
# Sub-plot 1 shows the number of letters per author on linearly scaled axes
df['author'].value_counts().hist(bins=100)
plt.ylabel('Number of authors')
plt.xlabel('Number of letters authored')
plt.subplot(1,2,2)
# Sub-plot 2 shows the same distribution with a log-scaled y-axis
df['author'].value_counts().hist(bins=100)
plt.ylabel('Number of authors')
plt.xlabel('Number of letters authored')
plt.yscale('log')
plt.show()
# Distribution of number of letters per addressee
plt.subplot(1,2,1)
# Sub-plot 1 shows the number of letters per addressee on linearly scaled axes
df['addressee'].value_counts().hist(bins=100)
plt.xlabel('Number of letters received')
plt.ylabel('Number of addressees')
# Sub-plot 2 shows the same distribution with a log-scaled y-axis
plt.subplot(1,2,2)
df['addressee'].value_counts().hist(bins=100)
plt.ylabel('Number of addressees')
plt.xlabel('Number of letters received')
plt.yscale('log')
plt.show()
# Both distributions side by side with log-scaled y-axes
plt.subplot(1,2,1)
# Sub-plot 1: number of letters per author
df['author'].value_counts().hist(bins=100)
plt.ylabel('Number of authors')
plt.xlabel('Number of letters authored')
plt.yscale('log')
# Sub-plot 2: number of letters per addressee
plt.subplot(1,2,2)
df['addressee'].value_counts().hist(bins=100)
plt.ylabel('Number of addressees')
plt.xlabel('Number of letters received')
plt.yscale('log')
plt.show()
from collections import Counter
# distribution of author letter counts: how many authors wrote exactly n letters
author_dist = Counter(df['author'].value_counts())
x_author, y_author = zip(*author_dist.items())
plt.subplot(1,2,1)
plt.scatter(x_author, y_author)
plt.xscale('log')
plt.yscale('log')
plt.xlabel('Number of letters authored')
plt.ylabel('Number of authors')
plt.subplot(1,2,2)
# distribution of addressee letter counts: how many addressees received exactly n letters
addressee_dist = Counter(df['addressee'].value_counts())
x_addressee, y_addressee = zip(*addressee_dist.items())
plt.scatter(x_addressee, y_addressee)
plt.xscale('log')
plt.yscale('log')
plt.xlabel('Number of letters received')
plt.ylabel('Number of addressees')
plt.show()
The plots show typically skewed distributions. The vast majority of correspondents author and/or receive only one or a few letters (the tall bar on the left of each figure represents all authors or addressees with a single letter). Only a handful of people author or receive more than a thousand letters.
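To put a number on this skew, a quick check of the share of correspondents with exactly one letter, using the same value counts as above:
# share of authors and addressees with exactly one letter
author_counts = df['author'].value_counts()
addressee_counts = df['addressee'].value_counts()
print(f"authors with a single letter: {(author_counts == 1).mean():.1%}")
print(f"addressees with a single letter: {(addressee_counts == 1).mean():.1%}")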
Who are the most prolific authors?
df['author'].value_counts().head(20)
In the list above, most of the authors are the central figure or eponym of one of the EMLO collections.
The exceptions, among them August II of Braunschweig-Wolfenbüttel and John Dury, are prolific authors in collections centred on someone else.
We first look at the letters of August II. Which collections are they part of?
print("Collections with August II of Braunschweig-Wolfenbüttel's letters:")
df[df['author'] == 'August II of Braunschweig-Wolfenbüttel, 1579-1666']['collection'].value_counts()
Next, we look at who these letters are addressed to:
print("Addressees of August II of Braunschweig-Wolfenbüttel's letters:")
df[df['author'] == 'August II of Braunschweig-Wolfenbüttel, 1579-1666']['addressee'].value_counts()
These two queries reveal a typical pattern in these collections. August II has 582 letters in the collection of Johann Valentin Andreae, of which 581 are also addressed to Andreae. Letters in a collection around a certain person tend to be either authored by or addressed to that person, which makes sense from a record-keeping perspective. But there is one letter addressed to someone else, namely Eberhard III von Württemberg.
Now, let us look at the same queries for John Dury's letters:
print("Collections with John Dury's letters:")
df[df['author'] == 'Dury, John, 1596-1680']['collection'].value_counts()
John Dury has letters in eight different collections, but seven of those contain only a handful of his letters. We can also see who he addressed those letters to:
print("Addressees of John Dury's letters:")
df[df['author'] == 'Dury, John, 1596-1680']['addressee'].value_counts()
Now we see a different pattern. Samuel Hartlib is by far the most frequent addressee of John Dury's letters in these collections. But looking at the two sets of counts above, we note that John Dury authored 837 letters in the Samuel Hartlib collection, of which only 528 are addressed to Samuel Hartlib himself. Who are the other 309 letters addressed to?
print("Addressees of John Dury's letters in the Samuel Hartlib:")
df[(df['author'] == 'Dury, John, 1596-1680') & (df['collection'] == 'Hartlib, Samuel')]['addressee'].value_counts()
Apparently, some collections also contain hundreds of letters that are not authored by or addressed to the collection eponym.
df['addressee'].value_counts().head(20)
In the list above, most of the addressees are the central figure or eponym of one of the EMLO collections.
The exceptions, among them Nicolaas Reigersberch and Willem de Groot, are prolific addressees in collections centred on someone else.
print("Collections with letters to Nicolaas Reigersberch:")
df[df['addressee'] == 'Reigersberch, Nicolaas, 1584-1654']['collection'].value_counts()
print("Authors of letters to Nicolaas Reigersberch:")
df[df['addressee'] == 'Reigersberch, Nicolaas, 1584-1654']['author'].value_counts()
print("Collections with letters to Willem de Groot:")
df[df['addressee'] == 'Groot, Willem de, 1597-1662']['collection'].value_counts()
print("Authors of letters to Willem de Groot:")
df[df['addressee'] == 'Groot, Willem de, 1597-1662']['author'].value_counts()
Again, we see some letters between persons who are not the central figure in any of the EMLO collections.
How many letters in each collection do not involve the eponym as either author or addressee?
First, we map the name of each collection to the corresponding name used in the author and addressee fields:
eponyms = list(df['collection'].unique())
authors = list(df['author'].unique())
author_counts = df['author'].value_counts()
authors
best_map = {}
eponym_map = {}
for eponym in eponyms:
    for author in authors:
        if not isinstance(author, str) or ';' in author:
            continue
        # a few eponyms need a manual mapping because the author name differs
        if eponym == 'Fermat, Pierre de' and author == 'Fermat, Pierre, 1601-1665':
            eponym_map[eponym] = author
        if eponym == 'Comenius, Jan Amos' and author == 'Komenský, Jan Amos, 1592-1670':
            eponym_map[eponym] = author
        # otherwise, pick the most frequent author whose name starts with the collection name
        if author.startswith(eponym):
            if eponym not in best_map or author_counts[author] > best_map[eponym]:
                best_map[eponym] = author_counts[author]
                eponym_map[eponym] = author
    # report collections for which no matching author was found
    if eponym not in eponym_map:
        print(eponym)
print("Collection:\t\t\t\t\t\tAll letters\tNon-eponym letters")
print("----------------------------------------------------------------------------------------")
for eponym in eponym_map:
    epo_df = df[df['collection'] == eponym]
    non_epo_df = df[(df['collection'] == eponym) & (df['author'] != eponym_map[eponym]) & (df['addressee'] != eponym_map[eponym])]
    perc = non_epo_df.shape[0] / epo_df.shape[0]
    print(f"{eponym: <50}\t{epo_df.shape[0]}\t\t{non_epo_df.shape[0]}\t({perc:.2f})")
Most collections contain almost exclusively letters involving the eponym, but some are very different. In the Peter Paul Rubens collection, the majority of letters (59%) are between people other than Rubens.
df[df['collection'] == 'Rubens, Peter Paul'][['collection','author','addressee']].head(10)
The metadata is fairly minimal if we consider just the fields in the dataset, but more can be done with them.
The names of senders and recipients include birth and death years (in most cases), so we could use these to group persons by age at death or by birth decade, as sketched below.
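A minimal sketch of that idea, assuming the life years always appear at the end of the name string, as in 'Hartlib, Samuel, 1600-1662':
# extract birth and death years from the end of the author name strings;
# names without life years simply yield NaN
life_years = df['author'].str.extract(r'(\d{4})-(\d{4})$').astype(float)
df['author_birth'] = life_years[0]
df['author_death'] = life_years[1]
# derive birth decade and age at death from the extracted years
df['author_birth_decade'] = (df['author_birth'] // 10) * 10
df['author_age_at_death'] = df['author_death'] - df['author_birth']
df['author_birth_decade'].value_counts().sort_index()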
The dates on which the letters were sent are often exact down to the specific day, but sometimes only a month was known, or only an earliest and latest probable date. We can normalise those dates to gain insight into when letters were sent, per year or per month.
At a small scale, there is no need to normalise the data, as the researcher can do that mentally while working with the materials.
At an intermediate scale of hundreds or thousands of documents, variations in the names of persons and places, and in the ways dates are recorded, become a hurdle to analysis. This is also an issue for topical analysis, as many connections between documents are hard to bring to the surface because of morphological and spelling variations.
At a large scale of hundreds of thousands or millions of documents, the textual variations become less of a hurdle, as there is enough data to identify and map variants.
At a very large scale of tens or hundreds of millions of documents, the textual variations themselves become meaningful, allowing us to measure the contextual nuances that different word variants convey.
df[df.collection == 'Groot, Hugo de'].date.value_counts()
There are 4,067 different values for the dates, with the most common date being 9 January 1638. There are also 14 unknown dates.
import re
def is_day_month_year(sent_date):
    return re.match(r'^\d+ (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\w* \d{4}$', sent_date) is not None

def is_month_year(sent_date):
    return re.match(r'^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\w* \d{4}$', sent_date) is not None

def is_year(sent_date):
    return re.match(r'^\d{4}$', sent_date) is not None

def is_day_month(sent_date):
    return re.match(r'^\d+ (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\w*$', sent_date) is not None

def is_month(sent_date):
    return re.match(r'^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\w*$', sent_date) is not None

def is_day_year(sent_date):
    return re.match(r'^\d+ \d{4}$', sent_date) is not None

def get_year(sent_date):
    # when a year is present, it is always the last four characters
    if is_day_month_year(sent_date) or is_year(sent_date) or is_day_year(sent_date) or is_month_year(sent_date):
        return int(sent_date[-4:])
    else:
        return None

def get_month(sent_date):
    if is_day_month_year(sent_date) or is_month(sent_date) or is_day_month(sent_date) or is_month_year(sent_date):
        match = re.match(r'.*((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\w*).*', sent_date)
        return match.group(1)
    else:
        return None

def get_date_type(sent_date):
    if is_day_month_year(sent_date):
        return 'day_month_year'
    if is_month_year(sent_date):
        return 'month_year'
    if is_year(sent_date):
        return 'year'
    if is_day_month(sent_date):
        return 'day_month'
    if is_month(sent_date):
        return 'month'
    if is_day_year(sent_date):
        return 'day_year'
    if 'Between' in sent_date:
        return 'range_between'
    if 'On or before' in sent_date:
        return 'range_before'
    if 'On or after' in sent_date:
        return 'range_after'
    if 'Unknown date' in sent_date:
        return 'unknown'
    return 'invalid format'
# classify each date string and derive year and month columns
df['date_type'] = df.date.apply(get_date_type)
df['date_year'] = df.date.apply(get_year)
df['date_month'] = df.date.apply(get_month)
df.date_type.value_counts()
# the number of years spanned by the dataset
df.date_year.max() - df.date_year.min() + 1
# histogram of letters per year, one bin per year in the span computed above
df.date_year.hist(bins=322)
df.date_month.value_counts()
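Note that value_counts orders the months by frequency; to look for a seasonal pattern, the counts can be reindexed in calendar order first (a small sketch, assuming the dataset writes out full month names, as in '9 January 1638'):
# put the month counts in calendar order before plotting
months = ['January', 'February', 'March', 'April', 'May', 'June',
          'July', 'August', 'September', 'October', 'November', 'December']
df.date_month.value_counts().reindex(months).plot(kind='bar')
plt.ylabel('Number of letters')
plt.show()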
df_rubens = df[df.collection == 'Rubens, Peter Paul']
df_rubens[['collection','author','addressee']].head(10)
# count letters per author-addressee pair in the Rubens collection
g = df_rubens.groupby(['author', 'addressee']).size()
# pivot the counts into a matrix with authors as columns for a heatmap view
u = g.unstack('author')
plt.imshow(u, cmap='hot', interpolation='nearest')
g.sort_values()
How many connections are there between collections? This is easy with two collections, but becomes more difficult when there are many collections.
Which persons appear in multiple collections?
# count the number of distinct collections each author appears in
df[~df[['collection', 'author']].duplicated()]['author'].value_counts()
df[df['author'] == 'Oldenburg, Henry, 1619-1677']['collection'].value_counts()
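To address the first question more directly, a sketch that counts, for every pair of collections, how many authors they share; the same could be done for addressees:
# pairwise collection overlap: the number of authors two collections share
coll_authors = df.dropna(subset=['author']).groupby('collection')['author'].apply(set)
overlap = pd.DataFrame(
    [[len(a & b) for b in coll_authors] for a in coll_authors],
    index=coll_authors.index, columns=coll_authors.index)
overlap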
Samuel Hartlib is in the top 20 of addressees but not in the top 20 of authors:
print('Samuel Hartlib\n')
print('\tnumber of letters sent:', df[df['author'] == 'Hartlib, Samuel, 1600-1662'].shape[0])
print('\tnumber of letters received:', df[df['addressee'] == 'Hartlib, Samuel, 1600-1662'].shape[0])
# Number of letters authored by Hugo de Groot per year
hugo = 'Groot, Hugo de, 1583-1645'
df['year'] = df['date'].str.extract(r'(\d{4})', expand=False)
df_hugo = df[df['author'] == hugo]
df_hugo['year'].value_counts().sort_index().plot()
plt.show()
df_hugo['addressee'].value_counts()
Some collections include letters between correspondents of the central figure, while others contain only letters in which the central figure is the author or addressee.
For example, the correspondence collection of Hugo de Groot includes letters between his brother and his brother-in-law, as the quick check below shows.
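A quick check of that claim, reusing the name strings for Willem de Groot (the brother) and Nicolaas Reigersberch (the brother-in-law) seen earlier:
# letters in the Hugo de Groot collection between his brother and brother-in-law
willem = 'Groot, Willem de, 1597-1662'
nicolaas = 'Reigersberch, Nicolaas, 1584-1654'
relatives = df[(df['collection'] == 'Groot, Hugo de')
               & df['author'].isin([willem, nicolaas])
               & df['addressee'].isin([willem, nicolaas])]
relatives[['collection', 'author', 'addressee', 'date']].head()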
# select the letters in the Christiaan Huygens collection
# (an alternative is to select all letters he sent or received:
# df[(df['author'] == 'Huygens, Christiaan, 1629-1695') | (df['addressee'] == 'Huygens, Christiaan, 1629-1695')])
df_christiaan = df[df['collection'] == 'Huygens, Christiaan']
df_christiaan['author'].value_counts()
df_constantijn = df[df['collection'] == 'Huygens, Constantijn']
df_constantijn['addressee'].value_counts()
# per-letter frequencies: how often the author and addressee of each letter occur in the whole dataset
df['author_freq'] = df.groupby(['author'])['id'].transform('count')
df['addressee_freq'] = df.groupby(['addressee'])['id'].transform('count')
df['correspondents_freq'] = df.author_freq + df.addressee_freq
df.groupby(['id', 'author', 'addressee']).size()
# sort once by correspondent frequency and extract the columns we need
df_sorted = df.sort_values('correspondents_freq')
ids = list(df_sorted.id)
auths = list(df_sorted.author)
addrs = list(df_sorted.addressee)
auth_freqs = list(df_sorted.author_freq)
addr_freqs = list(df_sorted.addressee_freq)
auths = [auth if isinstance(auth, str) else None for auth in auths]
addrs = [addr if isinstance(addr, str) else None for addr in addrs]
corrs = [{'id': id, 'auth': auth, 'addr': addr, 'author_freq': auth_freq, 'addr_freq': addr_freq}
         for id, auth, addr, auth_freq, addr_freq in zip(ids, auths, addrs, auth_freqs, addr_freqs)]
corrs[0]
from collections import OrderedDict
queued = {}
fetch = OrderedDict()
seen = {}
# queue each letter whose author and addressee are both still unseen
for corr in corrs:
    if corr['auth'] not in queued and corr['addr'] not in queued:
        queued[corr['auth']] = corr['id']
        queued[corr['addr']] = corr['id']
        fetch[corr['id']] = corr
print(len(fetch.keys()))
print(len(queued.keys()))
# add letters for any author or addressee not yet covered
for corr in corrs:
    if corr['auth'] not in queued:
        queued[corr['auth']] = corr['id']
        fetch[corr['id']] = corr
    elif corr['addr'] not in queued:
        queued[corr['addr']] = corr['id']
        fetch[corr['id']] = corr
print(len(fetch.keys()))
print(len(queued.keys()))
# show an example profile URL for the first queued correspondence
for corr_id in fetch:
    url = f'http://emlo.bodleian.ox.ac.uk/profile/work/{corr_id}'
    print(url)
    break
import requests
from bs4 import BeautifulSoup as bsoup
df[df.id == 'aabeebf7-4c5b-4bc2-a2ec-8ace326cfa7a']
#response = requests.get(url)
def get_relation_info(rel_type, detail_soup):
    # extract the relation block of the given type from the details section
    rel_type_soup = detail_soup.find_all(class_=rel_type)
    if len(rel_type_soup) == 0:
        return None
    relation_soup = rel_type_soup[0].find_all(class_='relations')[0]
    return {
        'relation_type': rel_type.split(' '),
        'relation_text': [string for string in relation_soup.stripped_strings]
    }
def get_provenance(page_soup):
    # the provenance element states the source of the metadata
    prov_soup = page_soup.find_all(class_='provenance')[0]
    prov = prov_soup.text
    return prov.replace('Source of data: ', '')
def get_page_details(corr_id, page_soup):
    page_details = {
        'correspondence_id': corr_id,
        'relations': [],
        'provenance': get_provenance(page_soup)
    }
    detail_soup = page_soup.find(id='details')
    if detail_soup:
        rel_types = ['people authors', 'people recipients', 'locations origin', 'locations destination']
        relation_info = [get_relation_info(rel_type, detail_soup) for rel_type in rel_types]
        page_details['relations'] = [relation for relation in relation_info if relation is not None]
    return page_details
def get_correspondence_page(corr_id):
    url = f'http://emlo.bodleian.ox.ac.uk/profile/work/{corr_id}'
    # use the politeness headers defined below (available by the time this runs)
    response = requests.get(url, headers=headers)
    page_soup = bsoup(response.content, 'html.parser')
    return get_page_details(corr_id, page_soup)
corr_id = 'aabeebf7-4c5b-4bc2-a2ec-8ace326cfa7a'
#detail_doc = get_correspondence_page(corr_id)
detail_index = 'emlo_page_details'
from elasticsearch import Elasticsearch
es = Elasticsearch()
#es.index(index=detail_index, doc_type='page_detail', id=detail_doc['correspondence_id'], body=detail_doc)
import time
# identify the scraper politely to the EMLO server
headers = {
    'user-agent': 'DataScopesAnalyzer (https://marijnkoolen.github.io/Data-Scopes-Developers-2018/)',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-gb',
}
fetch[corr_id]
#time.sleep(10)
from elasticsearch import exceptions
skip = 0
for ci, corr_id in enumerate(fetch):
    # skip pages that have already been indexed
    if es.exists(index=detail_index, id=corr_id):
        skip += 1
        if skip % 1000 == 0:
            print('skipped', skip)
        continue
    detail_doc = get_correspondence_page(corr_id)
    try:
        detail_doc['author'] = fetch[corr_id]['auth']
        detail_doc['addressee'] = fetch[corr_id]['addr']
    except TypeError:
        print(fetch[corr_id])
        raise
    try:
        es.index(index=detail_index, doc_type='page_detail', id=detail_doc['correspondence_id'], body=detail_doc)
    except exceptions.RequestError:
        print(detail_doc)
        raise
    # wait between requests to avoid overloading the server
    time.sleep(10)
    if (ci + 1) % 100 == 0:
        print(ci + 1, 'correspondence pages fetched')
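Once pages are indexed, a document can be fetched back from Elasticsearch by its correspondence id, for example to check what was stored:
# retrieve one indexed page-detail document and show its provenance
detail_doc = es.get(index=detail_index, doc_type='page_detail', id='aabeebf7-4c5b-4bc2-a2ec-8ace326cfa7a')
detail_doc['_source']['provenance']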