Analyzing and Comparing EMLO Collections

The EMLO project contains dozens of correspondence collections centered around different historical figures. Each collection is maintained either by a single institute, or is a merging of smaller collections maintained across multiple institutions.

The metadata of the correspondences has been mapped to a single schema.

Making a comparison of different sets of correspondences, at different scales, draws the focus on different aspects of comparison. At the same time, it brings to the surface some differences in how the digital collections were shaped by selection criteria.

At a small scale, it is easy to see for instance that a collection around a historical figure, e.g. Samuel Hartlib or Françoise de Graffigny has not only letters authored by or addressed to that figure, but also some letters between the correspondents in their networks. When working with many correspondence collections with thousands or tens of thousands of letters, this is a detail that is easily lost in overviews of metadata records and most summary statistics.

In [1]:
import numpy as np
import pandas as pd
import glob
import matplotlib.pyplot as plt

Let's first load the data into a dataframe and inspect a number of rows so we get an idea of what is in there.

In [19]:
# read the merged letters file into a Pandas dataframe
merged_letters_file = '../data/emlo_letters.csv'
df = pd.read_csv(merged_letters_file, sep='\t')

df
Out[19]:
Unnamed: 0 id type collection date author addressee origin destination repository
0 0 577c226e-adfa-43ed-8db2-95792b73f3c3 Letter Bayle, Pierre 21 February 1662 Bayle, Jacob, 1644-1685 Bayle, Jean, 1609-1685 Puylaurens, Occitanie, France Carla-Bayle, Occitanie, France Collection d'E. Labrousse\n \n \n ...
1 1 0ccd18c5-de76-4c66-a7b5-fbc344a9dcde Letter Bayle, Pierre 17 June 1662 Bayle, Jacob, 1644-1685 Bayle, Jean, 1609-1685 Puylaurens, Occitanie, France Carla-Bayle, Occitanie, France 2 printed editions
2 2 2e189892-8105-42e9-876c-02aaa97fb877 Letter Bayle, Pierre 25 August 1662 Bayle, Jacob, 1644-1685 Bayle, Jean, 1609-1685 Puylaurens, Occitanie, France Carla-Bayle, Occitanie, France Collection d'E. Labrousse\n \n \n ...
3 3 da35bb30-cfe1-4b4a-957d-42742088e693 Letter Bayle, Pierre 5 March 1663 Bayle, Jacob, 1644-1685 Bayle, Pierre, 1647-1706 Puylaurens, Occitanie, France Carla-Bayle, Occitanie, France 1 printed edition
4 4 c5fefa45-f197-4bc3-bf52-1272f7f1d573 Letter Bayle, Pierre 7 April 1665 Bayle, Jacob, 1644-1685 Bayle, Jean, 1609-1685 Puylaurens, Occitanie, France Carla-Bayle, Occitanie, France Bibliothèque de la Société de l'Histoire du P...
... ... ... ... ... ... ... ... ... ... ...
132205 23 a9088a3f-3eb2-4039-aeaa-8ad98b9e6f35 Letter Beeckman, Isaac 12 April 1632 Croix, Jacques, 1579-1655 Beeckman, Isaac, 1588-1637 Delft, South Holland, (United Provinces) Nethe... Dordrecht, South Holland, (United Provinces) N... NaN
132206 24 1759be67-e3d3-4e30-a987-b0f84b212f03 Letter Beeckman, Isaac 17 May 1632 Beeckman, Isaac, 1588-1637 Rivet, André, 1572-1651 Dordrecht, South Holland, (United Provinces) N... The Hague, South Holland, Netherlands NaN
132207 25 c1c3e71d-9ca1-4953-968a-b5a8e5d3c00b Letter Beeckman, Isaac 30 May 1633 Beeckman, Isaac, 1588-1637 Mersenne, Marin, 1588-1648 Dordrecht, South Holland, (United Provinces) N... Paris, Île-de-France, France NaN
132208 26 1b82d6d1-8b0c-4bc2-aa57-c856c73c2e66 Letter Beeckman, Isaac 22 August 1634 Descartes, René, 1596-1650 Beeckman, Isaac, 1588-1637 Amsterdam, North Holland, (United Provinces) N... Dordrecht, South Holland, (United Provinces) N... NaN
132209 27 06dc5c13-cec1-4fcd-972e-0d6052350645 Letter Beeckman, Isaac 13 February 1635 Beeckman, Isaac, 1588-1637 Beeckman, Abraham, fl. 1635 Dordrecht, South Holland, (United Provinces) N... Amsterdam, North Holland, (United Provinces) N... NaN

132210 rows × 10 columns

In [20]:
# show the nnnumber of authors and addressees
print('number of distinct authors:', df['author'].nunique())
print('number of distinct addressees:', df['addressee'].nunique())
number of distinct authors: 14619
number of distinct addressees: 7649
In [24]:
# The correspondence collections in the dataset with the number of letters they contain.
df.collection.value_counts()
Out[24]:
Bodleian card catalogue     48668
Groot, Hugo de               8034
Huygens, Constantijn         7120
Hartlib, Samuel              4719
Andreae, Johann Valentin     3696
                            ...  
Beeckman, Isaac                28
Dudley, Anne                   27
Vernon, Margaret               21
Baxter, Richard                 8
Culpeper, Cheney                3
Name: collection, Length: 93, dtype: int64
In [21]:
# The authors in the dataset with the number of letters they sent.
df.author.value_counts()
Out[21]:
Groot, Hugo de, 1583-1645                                     4912
Huygens, Constantijn, 1596-1687                               3951
Plantin, Christophe, 1520-1589                                2331
Vossius, Gerardus Joannes, 1577-1649                          2292
Peiresc, Nicolas-Claude Fabri de, 1580-1637                   2111
                                                              ... 
Farley (Mr), fl. 1775; Rose (Mr), fl. 1775                       1
Newton, James, fl. 1710-1713                                     1
Martinus, William, fl. 1621                                      1
Council of State of the Republic of the United Netherlands       1
Montagu, Edward, 1562-1644                                       1
Name: author, Length: 14619, dtype: int64
In [22]:
# The addressee in the dataset with the number of letters they received.
df.addressee.value_counts()
Out[22]:
Huygens, Constantijn, 1596-1687                                     4737
Hearne, Thomas, 1678-1735                                           3795
Vossius, Gerardus Joannes, 1577-1649                                3498
Hartlib, Samuel, 1600-1662                                          3388
Groot, Hugo de, 1583-1645                                           3233
                                                                    ... 
Blount, Edward, fl. 1724                                               1
Baert, Pieter J., fl. 1676-1691                                        1
Werndeley, (Reverend Mr), fl. 1711; Werndeley, sons of, fl. 1711       1
Vossius, Gerardus, 1619-1640                                           1
Heidelberg, ministers in, fl. 1655                                     1
Name: addressee, Length: 7649, dtype: int64
In [ ]:
 
In [2]:
# Adjust the default size for figures so that placing two plots 
# next to each other in a sub plot are still big enough.
plt.rcParams['figure.figsize'] = [15, 5]

# create a plot canvas with two adjacent subplots
plt.subplot(1,2,1)
# Distribution of number of letters per author
# Sub-plot 1 shows the number of letters by each letter author on normal scaled axes
df['author'].value_counts().hist(bins=100)
plt.ylabel('Number of authors')
plt.xlabel('Number of letters authored')

plt.subplot(1,2,2)
# Sub-plot 1 shows the number of letters by each letter author on a log scaled y-axis
df['author'].value_counts().hist(bins=100)
plt.ylabel('Number of authors')
plt.xlabel('Number of letters authored')
plt.yscale('log')

plt.show()
number of distinct authors: 14619
number of distinct addressees: 7649
In [ ]:
 
In [4]:
plt.subplot(1,2,1)
# Number of letters by each letter addressee
df['addressee'].value_counts().hist(bins=100)
plt.xlabel('Number of letters received')
plt.ylabel('Number of addressees')


# Distribution of number of letters per addressee
plt.subplot(1,2,2)
df['addressee'].value_counts().hist(bins=100)
plt.ylabel('Number of addressees')
plt.xlabel('Number of letters received')
plt.yscale('log')
plt.show()
In [5]:
plt.subplot(1,2,1)
# Number of letters by each letter addressee
df['author'].value_counts().hist(bins=100)
plt.ylabel('Number of authors')
plt.xlabel('Number of letters authored')
plt.yscale('log')


# Distribution of number of letters per addressee
plt.subplot(1,2,2)
df['addressee'].value_counts().hist(bins=100)
plt.ylabel('Number of addressees')
plt.xlabel('Number of letters received')
plt.yscale('log')
plt.show()
In [19]:
from collections import Counter

author_dist = Counter([count for count in df['author'].value_counts()])
x_author, y_author = zip(*author_dist.items())
plt.subplot(1,2,1)
plt.scatter(x_author, y_author)
plt.xscale('log')
plt.yscale('log')
plt.xlabel('Number of letters authored')
plt.ylabel('Number of authors')

plt.subplot(1,2,2)
addressee_dist = Counter([count for count in df['addressee'].value_counts()])
x_addressee, y_addressee = zip(*addressee_dist.items())
plt.scatter(x_addressee, y_addressee)
plt.xscale('log')
plt.yscale('log')
plt.xlabel('Number of letters received')
plt.ylabel('Number of addressees')
plt.show()

The plots show typical skewed distributions. The vast majority of correspondents author and/or receive only one or a few letters (the hight bar on the left of each figure represents all authors/addressees authoring or receiving only one letter). Only a handful of people author or receive more than a thousand letters.

Who are the most prolific authors?

In [6]:
df['author'].value_counts().head(20)
Out[6]:
Groot, Hugo de, 1583-1645                      4912
Huygens, Constantijn, 1596-1687                3951
Plantin, Christophe, 1520-1589                 2331
Vossius, Gerardus Joannes, 1577-1649           2292
Peiresc, Nicolas-Claude Fabri de, 1580-1637    2111
Oldenburg, Henry, 1619-1677                    2092
Scaliger, Joseph Justus, 1540-1609             1847
Huygens, Christiaan, 1629-1695                 1423
Hearne, Thomas, 1678-1735                      1382
Wallis, John (Dr), 1616-1703                   1290
Smith, Thomas (Dr), 1638-1710                  1114
Dury, John, 1596-1680                          1003
Bayle, Pierre, 1647-1706                        938
Bourignon, Antoinette, 1616-1680                887
Montagu, Mary Wortley (Lady), 1689-1762         860
Aubrey, John, 1626-1697                         854
Heinsius, Nicolaas, 1620-1681                   833
Descartes, René, 1596-1650                      796
Vossius, Isaac (Dr), 1618-1689                  793
Brett, Thomas, 1667-1744                        742
Name: author, dtype: int64

In the list above, most of the authors are the central figure or eponym of one of the EMLO collections.

Exceptions are:

  • John Dury (1596-1650): Preacher and ecumenist
  • August II of Braunschweig-Wolfenbüttel (1579-1666): Duke (Herzog) of Braunschweig-Wolfenbüttel

These are prolific authors in collections centred on someone else.

We first look at the letters of August II. Which collections are they part of?

In [7]:
print("Collections with August II of Braunschweig-Wolfenbüttel's letters:")
df[df['author'] == 'August II of Braunschweig-Wolfenbüttel, 1579-1666']['collection'].value_counts()
Collections with August II of Braunschweig-Wolfenbüttel's letters:
Out[7]:
Andreae, Johann Valentin                        582
Kircher, Athanasius                              21
Bodleian card catalogue                           1
Braunschweig-Wolfenbüttel, Sophia Hedwig von      1
Name: collection, dtype: int64

Next, we look at who these letters are addressed to:

In [8]:
print("Addressees of August II of Braunschweig-Wolfenbüttel's letters:")
df[df['author'] == 'August II of Braunschweig-Wolfenbüttel, 1579-1666']['addressee'].value_counts()
Addressees of August II of Braunschweig-Wolfenbüttel's letters:
Out[8]:
Andreae, Johann Valentin, 1586-1654                        581
Kircher, Athanasius, 1601-1680                              21
Württemberg, Eberhard III von, 1614-1674                     1
Braunschweig-Lüneburg, Georg, 1582-1641                     1
Braunschweig-Wolfenbüttel, Sophia Hedwig von, 1592-1642      1
Name: addressee, dtype: int64

These two queries reveal a typical pattern in these collections. August II has 582 letters in the collection of Johann Valentin Andreae, of which 581 are also addressed to Andreae. Letters in a collection around a certain person tend be either authored or addressed to this person, which makes sense from a recordkeeping perspective. But there is one letter addressed to someone else, i.e. Eberhard III von Württemberg.

Now, let us look at the same queries for John Dury's letters:

In [9]:
print("Collections with John Dury's letters:")
df[df['author'] == 'Dury, John, 1596-1680']['collection'].value_counts()
Collections with John Dury's letters:
Out[9]:
Hartlib, Samuel                837
Bodleian card catalogue        149
Ussher, James                    6
Mede, Joseph                     3
Huygens, Constantijn             2
Culpeper, Cheney                 2
Boyle, Robert                    2
Vossius, Gerardus Joannes        1
Bisterfeld, Johann Heinrich      1
Name: collection, dtype: int64

John Dury has letters in eight different collections, but in seven of those, it is only a handful of letters. We can also see who he addressed those letters to:

In [10]:
print("Addressees of John Dury's letters:")
df[df['author'] == 'Dury, John, 1596-1680']['addressee'].value_counts()
Addressees of John Dury's letters:
Out[10]:
Hartlib, Samuel, 1600-1662                      528
Roe, Thomas (Sir), 1581-1644                     31
Culpeper, Cheney, 1601-1663                      11
Borthwick, Eleazar, fl. 1633-1642                 7
St Amand, Joseph, fl. 1636-1643                   7
                                               ... 
House of Commons (1641-1712)                      1
St Gallen and Appenzell, Clergy in, fl. 1654      1
Cecil, Elizabeth, fl. 1640                        1
Coysh, Joseph, fl. 1652                           1
Rusdorf, Johann Joachim von, 1589-1640            1
Name: addressee, Length: 135, dtype: int64

Now we see a differennt pattern. Samuel Hartlib is by far the most frequent addressee of John Dury's letters in these collections. But looking at the two sets of counts above, we note that John Dury authored 837 letters in the Samuel Hartlib collections, of which only 528 are addressed to Samuel Hartlib. Who are the other 309 letters in the Samuel Hartlib collection addressed to?

In [11]:
print("Addressees of John Dury's letters in the Samuel Hartlib:")
df[(df['author'] == 'Dury, John, 1596-1680') & (df['collection'] == 'Hartlib, Samuel')]['addressee'].value_counts()
Addressees of John Dury's letters in the Samuel Hartlib:
Out[11]:
Hartlib, Samuel, 1600-1662           528
Roe, Thomas (Sir), 1581-1644          30
Culpeper, Cheney, 1601-1663            9
Waller, William (Sir), 1598-1668       7
Borthwick, Eleazar, fl. 1633-1642      7
                                    ... 
Ames, William, 1576-1633               1
Ancelin, fl. 1660                      1
Bedell, William, 1572-1642             1
Figulus, Petr, 1619-1670               1
Palmer, Herbert, 1601-1647             1
Name: addressee, Length: 127, dtype: int64

Apparently, some collections also contains hundreds of letters that are not authored by or addressed to the collection eponym.

Analyzing the Addressees

In [12]:
df['addressee'].value_counts().head(20)
Out[12]:
Huygens, Constantijn, 1596-1687           4737
Hearne, Thomas, 1678-1735                 3795
Vossius, Gerardus Joannes, 1577-1649      3498
Hartlib, Samuel, 1600-1662                3388
Groot, Hugo de, 1583-1645                 3233
Lhwyd, Edward, 1659-1709                  3226
Charlett, Arthur (Reverend), 1655-1722    3066
Andreae, Johann Valentin, 1586-1654       2953
Noble, Mark (Reverend), 1754-1827         2809
Sancroft, William, 1617-1693              2797
Kircher, Athanasius, 1601-1680            2209
Oldenburg, Henry, 1619-1677               2127
D'Orville, Jacques Philippe, 1696-1751    2066
Brett, Thomas, 1667-1744                  1767
Lister, Martin, 1639-1712                 1704
Vossius, Isaac (Dr), 1618-1689            1690
Solms-Braunfels, Amalia von, 1602-1675    1660
Smith, Thomas (Dr), 1638-1710             1637
Wood, Anthony, 1632-1695                  1547
Scaliger, Joseph Justus, 1540-1609        1512
Name: addressee, dtype: int64

In the list above, most of the addressees are the central figure or eponym of one of the EMLO collections.

Exceptions are:

  • Nicolaas Reigersberch (1584-1654): brother-in-law of Hugo de Groot; Jurist
  • Willem de Groot (1597-1662): brother of Hugo de Groot (1583-1645); Dutch jurist

These are prolific authors in collections centred on someone else.

In [13]:
print("Collections with letters to Nicolaas Reigersberch:")
df[df['addressee'] == 'Reigersberch, Nicolaas, 1584-1654']['collection'].value_counts()
Collections with letters to Nicolaas Reigersberch:
Out[13]:
Groot, Hugo de               881
Vossius, Gerardus Joannes      8
Bodleian card catalogue        6
Name: collection, dtype: int64
In [14]:
print("Authors of letters to Nicolaas Reigersberch:")
df[df['addressee'] == 'Reigersberch, Nicolaas, 1584-1654']['author'].value_counts()
Authors of letters to Nicolaas Reigersberch:
Out[14]:
Groot, Hugo de, 1583-1645               862
Reigersberch, Maria, 1589-1653           18
Vossius, Gerardus Joannes, 1577-1649     14
Groot, Willem de, 1597-1662               1
Name: author, dtype: int64
In [15]:
print("Collections with letters to Willem de Groot:")
df[df['addressee'] == 'Groot, Willem de, 1597-1662']['collection'].value_counts()
Collections with letters to Willem de Groot:
Out[15]:
Groot, Hugo de               732
Vossius, Gerardus Joannes      3
Bodleian card catalogue        1
Name: collection, dtype: int64
In [16]:
print("Authors of letters to Willem de Groot:")
df[df['addressee'] == 'Groot, Willem de, 1597-1662']['author'].value_counts()
Authors of letters to Willem de Groot:
Out[16]:
Groot, Hugo de, 1583-1645                     726
Vossius, Gerardus Joannes, 1577-1649            4
Groot, Johan Hugo de, 1554-1640                 4
Groot van Kraayenburg, Dirck de, 1618-1661      2
Name: author, dtype: int64

Again, we see some letters between persons who are not the central figure in any of the EMLO collections.

How many letters in each collection do not involve the eponym as either author or addressee?

First, we map the name of the collection to the name as used as author or addressee:

In [17]:
eponyms = list(df['collection'].unique())
authors = list(df['author'].unique())
author_counts = df['author'].value_counts()
authors

best_map = {}
eponym_map = {}
for eponym in eponyms:
    #print(eponym)
    for author in authors:
        if not isinstance(author, str) or ';' in author:
            continue
        if eponym == 'Fermat, Pierre de' and author == 'Fermat, Pierre, 1601-1665':
            eponym_map[eponym] = author
        if eponym == 'Comenius, Jan Amos' and author == 'Komenský, Jan Amos, 1592-1670':
            eponym_map[eponym] = author
        if eponym in author[:len(eponym)]:
            if eponym not in best_map or author_counts[author] > best_map[eponym]:
                best_map[eponym] = author_counts[author]
                eponym_map[eponym] = author
    if eponym not in eponym_map:
        print(eponym)
Bodleian card catalogue
In [18]:
print("Collection:\t\t\t\t\t\tAll letters\tNon-eponym letters")
print("----------------------------------------------------------------------------------------")
for eponym in eponym_map:
    epo_df = df[df['collection'] == eponym]
    #print(eponym, '\t', eponym_map[eponym])
    non_epo_df = df[(df['collection'] == eponym) & (df['author'] != eponym_map[eponym]) & (df['addressee'] != eponym_map[eponym])]
    perc = non_epo_df.shape[0] / epo_df.shape[0]
    print(f"{eponym: <50}\t{epo_df.shape[0]}\t\t{non_epo_df.shape[0]}\t({perc:.2f})")
Collection:						All letters	Non-eponym letters
----------------------------------------------------------------------------------------
Bayle, Pierre                                     	1791		133	(0.07)
Sirleto, Guglielmo                                	1438		15	(0.01)
Seidenbecher, Georg Lorenz                        	47		0	(0.00)
Swammerdam, Jan                                   	172		4	(0.02)
Fermat, Pierre de                                 	121		6	(0.05)
Ortelius, Abraham                                 	467		0	(0.00)
Reneri, Henricus                                  	61		0	(0.00)
Spinoza, Baruch                                   	58		1	(0.02)
Lister, Martin                                    	1212		2	(0.00)
Wallis, John                                      	1998		232	(0.12)
Ussher, James                                     	681		17	(0.02)
Groot, Hugo de                                    	8034		280	(0.03)
Franckenberg, Abraham von                         	85		1	(0.01)
Bourignon, Antoinette                             	940		0	(0.00)
Peiresc, Nicolas-Claude Fabri de                  	1939		11	(0.01)
Hartlib, Samuel                                   	4719		1221	(0.26)
Braunschweig-Wolfenbüttel, Sophia Hedwig von      	169		1	(0.01)
Solms-Braunfels, Amalia von                       	1184		3	(0.00)
Comenius, Jan Amos                                	571		41	(0.07)
Selden, John                                      	355		10	(0.03)
Culpeper, Cheney                                  	3		0	(0.00)
Montagu, Mary Wortley                             	963		0	(0.00)
Nierop, Dirck Rembrantsz van                      	80		4	(0.05)
Plantin, Christophe                               	3030		324	(0.11)
Oldenburg, Henry                                  	3176		34	(0.01)
Pontanus, Johannes Isacius                        	321		2	(0.01)
Dodington, John                                   	571		8	(0.01)
Stuart, Arbella                                   	118		2	(0.02)
Vives, Juan Luis                                  	195		19	(0.10)
Jungius, Joachim                                  	506		18	(0.04)
Coccejus, Johannes                                	515		2	(0.00)
Opitz, Martin                                     	110		0	(0.00)
Agustín, Antonio                                  	579		5	(0.01)
Andreae, Johann Valentin                          	3696		81	(0.02)
Anhalt-Dessau, Henriette Amalia von               	1352		1	(0.00)
Thomson, Richard                                  	78		2	(0.03)
Schott, Caspar                                    	180		0	(0.00)
Permeier, Johann                                  	89		3	(0.03)
Vossius, Isaac                                    	1703		2	(0.00)
Pascal, Blaise                                    	49		6	(0.12)
Magini, Giovanni Antonio                          	100		1	(0.01)
Vossius, Gerardus Joannes                         	3430		14	(0.00)
Bernegger, Matthias                               	435		0	(0.00)
Scaliger, Joseph Justus                           	3338		8	(0.00)
Mengoli, Pietro                                   	40		0	(0.00)
Plot, Robert                                      	108		6	(0.06)
Hilchen, David                                    	98		22	(0.22)
Sidney, Philip                                    	380		5	(0.01)
Boyle, Robert                                     	1759		9	(0.01)
Jurin, James                                      	701		0	(0.00)
Sachs von Löwenheim, Philipp Jakob                	143		0	(0.00)
Kepler, Johannes                                  	883		57	(0.06)
Beverland, Hadriaan                               	305		0	(0.00)
Reland, Adriaan                                   	211		0	(0.00)
Bisterfeld, Johann Heinrich                       	121		3	(0.02)
Baxter, Richard                                   	8		0	(0.00)
Lhwyd, Edward                                     	2128		74	(0.03)
Euler, Leonhard                                   	811		35	(0.04)
Halley, Edmond                                    	245		0	(0.00)
Aubrey, John                                      	1073		10	(0.01)
Conway, Anne                                      	296		78	(0.26)
Mersenne, Marin                                   	1904		784	(0.41)
Milton, John                                      	66		0	(0.00)
Rabus, Pieter                                     	30		0	(0.00)
Mede, Joseph                                      	441		9	(0.02)
Pennant, Thomas                                   	508		1	(0.00)
Hobbes, Thomas                                    	223		10	(0.04)
Beale, Robert                                     	101		27	(0.27)
Huygens, Christiaan                               	3080		393	(0.13)
Claude, Jean                                      	117		4	(0.03)
Ruar, Martin                                      	100		17	(0.17)
Gray, Thomas                                      	651		7	(0.01)
Oranje-Nassau, Albertine Agnes van                	782		48	(0.06)
Schurman, Anna Maria van                          	244		1	(0.00)
Rich, Penelope                                    	42		2	(0.05)
Newton, Isaac                                     	1140		135	(0.12)
Clifford, Margaret                                	131		0	(0.00)
Kircher, Athanasius                               	2693		3	(0.00)
Dudley, Anne                                      	27		1	(0.04)
Ashmole, Elias                                    	764		27	(0.04)
Vernon, Francis                                   	275		0	(0.00)
Collins, John                                     	273		88	(0.32)
Descartes, René                                   	727		7	(0.01)
Huygens, Constantijn                              	7120		6	(0.00)
Bacon, Anne                                       	197		2	(0.01)
Gruter, Jan                                       	136		0	(0.00)
Worthington, John                                 	174		8	(0.05)
Brahe, Tycho                                      	505		28	(0.06)
Rubens, Peter Paul                                	940		558	(0.59)
Hutton, Charles                                   	133		6	(0.05)
Vernon, Margaret                                  	21		0	(0.00)
Beeckman, Isaac                                   	28		1	(0.04)

Most collection have almost exclusively letters involving the eponym, but some collections are very different. In the Peter Paul Rubens collection, the majority (59%) of letters are between other people than Rubens.

In [19]:
df[df['collection'] == 'Rubens, Peter Paul'][['collection','author','addressee']].head(10)
Out[19]:
collection author addressee
131088 Rubens, Peter Paul Moretus, Balthasar, 1574-1641 Rubens, Philip, 1574-1611
131089 Rubens, Peter Paul Rubens, Philip, 1574-1611 Rubens, Peter Paul, 1577-1640
131090 Rubens, Peter Paul Albert VII, Archduke of Austria, 1559-1621 Richardot, Jean, 1570-1614
131091 Rubens, Peter Paul Gonzaga, Vincenzo I, 1562-1612 Damasceni Peretti, Alessandro, 1571-1623
131092 Rubens, Peter Paul Damasceni Peretti, Alessandro, 1571-1623 Gonzaga, Vincenzo I, 1562-1612
131093 Rubens, Peter Paul Arrigoni, Lelio, b.1541 Chieppio, Annibal, 1563-1623
131094 Rubens, Peter Paul Rubens, Philip, 1574-1611 Rubens, Peter Paul, 1577-1640
131095 Rubens, Peter Paul Arrigoni, Lelio, b.1541 Chieppio, Annibal, 1563-1623
131096 Rubens, Peter Paul Richardot, Jean, 1570-1614 Gonzaga, Vincenzo I, 1562-1612
131097 Rubens, Peter Paul Arrigoni, Lelio, b.1541 Chieppio, Annibal, 1563-1623

Normalization and Classification for Creating and Comparing Groups

The metadata is fairly minimal when considering just the fields that are in the dataset. But there are more things that can be done.

  • The names of senders and recipients have the birth and death years (in most cases), so we could use these to group persons by age at death, or birth decade.

  • The dates that the letters were sent are often exact down to the specific day, but sometimes only a month was known or an earliest and latest probable dates. We can normalise those dates to get an insight in when letters were sent, in which year or month.

Normalizing and scale

At a small scale, there is no need to normalize data, as the researcher can do that mentally while working with the materials.

At an intermediate scale of hundreds or thousands of documents, the variations in names of persons and places, ways in which dates are recorded are becoming a hurdle to analysis. For topical analysis, this is also an issue, as many connections between documents are hard to bring to the surface because of morphological and spelling variations.

At a large scale with hundreds of thousands or millions of documents, the textual variations become less of a hurdle, as there is enough data to identify and map variants.

At a very large scale with tens or hundreds of millions of documents, the textual variations become meaningful and allow measuring contextual nuance in how word variants are used to convey different aspects.

In [27]:
df[df.collection == 'Groot, Hugo de'].date.value_counts()
Out[27]:
9 January 1638      14
Unknown date        14
12 December 1643    12
20 June 1643        11
6 February 1644     11
                    ..
4 April 1634         1
3 November 1628      1
9 June 1628          1
20 August 1627       1
11 January 1609      1
Name: date, Length: 4067, dtype: int64

The are 4067 different values for the dates, with the most common date being 9 January 1638. There are also 14 unknown dates.

In [76]:
import re

def is_day_month_year(sent_date):
    return re.match(r'^\d+ (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\w* \d{4}$', sent_date) != None

def is_month_year(sent_date):
    return re.match(r'^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\w* \d{4}$', sent_date) != None

def is_year(sent_date):
    return re.match(r'^\d{4}$', sent_date) != None

def is_day_month(sent_date):
    return re.match(r'^\d+ (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\w*$', sent_date) != None

def is_month(sent_date):
    return re.match(r'^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\w*$', sent_date) != None

def is_day_year(sent_date):
    return re.match(r'^\d+ \d{4}$', sent_date) != None

def get_year(sent_date):
    if is_day_month_year(sent_date) or is_year(sent_date) or is_day_year(sent_date) or is_month_year(sent_date):
        return int(sent_date[-4:])
    else:
        return None
    
def get_month(sent_date):
    if is_day_month_year(sent_date) or is_month(sent_date) or is_day_month(sent_date) or is_month_year(sent_date):
        match = re.match(r'.*((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\w*).*', sent_date)
        return match.group(1)
    else:
        return None
    
    
def get_date_type(sent_date):
    if is_day_month_year(sent_date):
        return 'day_month_year'
    if is_month_year(sent_date):
        return 'month_year'
    if is_year(sent_date):
        return 'year'
    if is_day_month(sent_date):
        return 'day_month'
    if is_month(sent_date):
        return 'day_month'
    if is_day_year(sent_date):
        return 'day_year'
    if 'Between' in sent_date:
        return 'range_between'
    if 'On or before' in sent_date:
        return 'range_before'
    if 'On or after' in sent_date:
        return 'range_after'
    if 'Unknown date' in sent_date:
        return 'unknown'
    else:
        return 'invalid format'

#df['date_type'] = df.date.apply(get_date_type)
df['date_year'] = df.date.apply(get_year)
df['date_month'] = df.date.apply(get_month)
In [77]:
df.date_type.value_counts()
Out[77]:
day_month_year    114636
unknown             5748
month_year          3778
year                3103
day_month           3053
range_between       1700
range_before         104
range_after           44
day_year              37
invalid format         7
Name: date_type, dtype: int64
In [90]:
df.date_year.max() - df.date_year.min() + 1
Out[90]:
322.0
In [89]:
df.date_year.hist(bins=322)#.value_counts().sort_index()
Out[89]:
<AxesSubplot:>
In [91]:
df.date_month.value_counts()
Out[91]:
March        11031
August       10507
July         10364
May          10335
April        10239
January      10139
September     9957
June          9951
October       9884
February      9855
November      9640
December      9565
Name: date_month, dtype: int64
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [24]:
df_rubens = df[df.collection == 'Rubens, Peter Paul']

df_rubens[['collection','author','addressee']].head(10)

g = df_rubens.groupby(['author', 'addressee']).size()

u = g.unstack('author')

plt.imshow(u, cmap='hot', interpolation='nearest')

g.sort_values()
Out[24]:
author                                       addressee                                      
Albert VII, Archduke of Austria, 1559-1621   Gonzaga, Vincenzo I, 1562-1612                      1
Rubens, Peter Paul, 1577-1640                Vosbergen, Josias van, 1593-1628                    1
Mello, Francisco Manuel de, 1608-1666        Chambre des Comptes (Spanish Netherlands)           1
Mennes, John (Sir), 1599-1671                Admiralty Court, England                            1
Moncada, Francisco, 1586-1635                Philip IV, King of Spain, 1605-1665                 1
                                                                                                ..
Rubens, Peter Paul, 1577-1640                Fabri, Palamède, 1582-1645                         18
                                             Olivares, Gaspar de Guzmán, Conde de, 1587-1645    25
Ferdinand, archiduc d'Autriche, 1609-1641    Philip IV, King of Spain, 1605-1665                36
Rubens, Peter Paul, 1577-1640                Dupuy, Pierre, 1582-1651                           71
Peiresc, Nicolas-Claude Fabri de, 1580-1637  Rubens, Peter Paul, 1577-1640                      85
Length: 301, dtype: int64

Connections between collections

How many connections are there between collections? This is easy with two collections, but becomes more difficult when there are many collections.

Which persons appear in multiple collections?

In [54]:
# which authors occur in multiple collections
df[(df[['collection', 'author']].duplicated(keep='first') == False)]['author'].value_counts()
Out[54]:
Unknown                                                                                                                                                     37
Oldenburg, Henry, 1619-1677                                                                                                                                 16
Huygens, Constantijn, 1596-1687                                                                                                                             14
Mersenne, Marin, 1588-1648                                                                                                                                  13
Gronovius, Johann Frederick, 1611-1671                                                                                                                      12
Groot, Hugo de, 1583-1645                                                                                                                                   11
Leibniz, Gottfried Wilhelm, 1646-1716                                                                                                                       11
Vossius, Gerardus Joannes, 1577-1649                                                                                                                        11
Descartes, René, 1596-1650                                                                                                                                  10
Huygens, Christiaan, 1629-1695                                                                                                                              10
Rivet, André, 1572-1651                                                                                                                                     10
Saumaise, Claude de, 1588-1653                                                                                                                              10
Hevelius, Johannes, 1611-1687                                                                                                                                9
Boyle, Robert, 1627-1691                                                                                                                                     9
Hartlib, Samuel, 1600-1662                                                                                                                                   9
Aubrey, John, 1626-1697                                                                                                                                      9
Digby, Kenelm (Sir), 1603-1665                                                                                                                               9
Gassendi, Pierre, 1592-1655                                                                                                                                  9
Wallis, John (Dr), 1616-1703                                                                                                                                 9
Dury, John, 1596-1680                                                                                                                                        8
Lipsius, Justus, 1547-1606                                                                                                                                   8
Komenský, Jan Amos, 1592-1670                                                                                                                                8
Bernegger, Matthias, 1582-1640                                                                                                                               8
Newton, Isaac (Sir), 1642-1727                                                                                                                               8
Peiresc, Nicolas-Claude Fabri de, 1580-1637                                                                                                                  8
Kircher, Athanasius, 1601-1680                                                                                                                               8
Boulliau, Ismaël, 1605-1694                                                                                                                                  8
Heinsius, Daniel, 1580-1655                                                                                                                                  8
Sorbière, Samuel, 1615-1670                                                                                                                                 8
Christina, Queen of Sweden, 1626-1689                                                                                                                        7
                                                                                                                                                            ..
Manwaring, Robert, fl. 1637                                                                                                                                  1
Smith, John, 1630-1679                                                                                                                                       1
Cort, Christiaan de, 1611-1669                                                                                                                               1
Aerssen, Cornelis van, 1600-1662                                                                                                                             1
Cottereau, N., 1641-1706                                                                                                                                     1
Conway, Anne, 1631-1679                                                                                                                                      1
Chieppio, Annibal, 1563-1623                                                                                                                                 1
Honywood, Robert (Sir), 1601-1686                                                                                                                            1
Doublet, Philips, d.1647                                                                                                                                     1
Beeck, Anna, fl. 1649                                                                                                                                        1
Montagu, Edward Wortley, 1678-1761                                                                                                                           1
Martinengo da Barco, Ascanio, 1539-1600                                                                                                                      1
Ghilde, Johan Flud van, fl. 1685                                                                                                                             1
Thomas, Robert, b.1681                                                                                                                                       1
Brassard, Marie, fl. 1685                                                                                                                                    1
Dalrymple, James, 1619-1695                                                                                                                                  1
Leslie, John (Sir), 1766-1832                                                                                                                                1
Bacon, Arthur, fl. 1652                                                                                                                                      1
Standfast, William (Reverend), 1683-1754                                                                                                                     1
Hoorn, magistrate of, fl. 1616-1617                                                                                                                          1
Clenche, Andrew, d.1692                                                                                                                                      1
Bachcroft, Thomas, 1571-1662; Bainbridge, Thomas, 1574-1646; Brownrigg, Ralph (Dr), 1592-1659; Collins, Samuel (Dr), 1576-1651; Love, Richard, 1596-1661     1
Higgins, Obadiah, 1663-1741                                                                                                                                  1
Techmannus, Arnoldus, 1594-1666                                                                                                                              1
Klage, Thomas, 1598-1664                                                                                                                                     1
Bodecher, J. W., fl. 1643                                                                                                                                    1
Bachacius, Martinus, 1539-1612                                                                                                                               1
Morgan, Anthony, fl. 1654                                                                                                                                    1
Finch, John, fl. 1653                                                                                                                                        1
Ellenmeier, Johann, fl. 1635                                                                                                                                 1
Name: author, dtype: int64
In [56]:
df[df['author'] == 'Oldenburg, Henry, 1619-1677']['collection'].value_counts()
Out[56]:
Oldenburg, Henry                      1524
Wallis, John                           142
Boyle, Robert                          107
Huygens, Christiaan                    105
Lister, Martin                          61
Hartlib, Samuel                         46
Newton, Isaac                           16
Swammerdam, Jan                          8
Milton, John                             7
Sachs von Löwenheim, Philipp Jakob       5
Vossius, Isaac                           4
Hobbes, Thomas                           2
Ashmole, Elias                           1
Comenius, Jan Amos                       1
Vossius, Gerardus Joannes                1
Coccejus, Johannes                       1
Name: collection, dtype: int64

Samuel Hartlib is in the top 20 of addressees but not in the top 20 of authors:

In [105]:
print('Samuel Hartlib\n')
print(f'\tnumber of letters sent:', df[df['author'] == 'Hartlib, Samuel, 1600-1662'].shape[0])
print(f'\tnumber of letters received:', (df[df['addressee'] == 'Hartlib, Samuel, 1600-1662'].shape[0]))
Samuel Hartlib

	number of letters sent: 401
	number of letters received: 3388
In [108]:
# Number of letters authored by Hugo de Groot per year
hugo = 'Groot, Hugo de, 1583-1645'

df['year'] = df['date'].str.extract('(\d\d\d\d)', expand=False)

df_hugo = df[df['author'] == hugo]

df_hugo['year'].value_counts().sort_index().plot()

plt.show()
In [110]:
df_hugo['addressee'].value_counts()
Out[110]:
Reigersberch, Nicolaas, 1584-1654                                862
Groot, Willem de, 1597-1662                                      726
Oxenstierna, Axel (Count), 1583-1654                             587
Camerarius, Ludwig, 1573-1651                                    347
Vossius, Gerardus Joannes, 1577-1649                             339
Marin, Charles, d.1651                                           122
Oxenstierna, Johan Axelsson, 1611-1657                           109
Salvius, Johan Adler, 1590-1652                                  106
Wicquefort, Joachim van, 1600-1670                                99
Heinsius, Daniel, 1580-1655                                       89
Aubery du Maurier, Benjamin, 1566-1636                            85
Appelboom, Harald Andersson, 1612-1674                            83
Christina, Queen of Sweden, 1626-1689                             77
Schmalz, Peter Abel, fl. 1635-1638                                61
Uyttenbogaert, Johannes (Dr), 1557-1644                           51
Jaski, Israel, 1573-1642                                          39
Bielke, Sten Svantesson, 1598-1638                                39
Spiring Silvercrona, Petter, 1600-1652                            39
Groot, Johan Hugo de, 1554-1640                                   37
Lingelsheim, Georg Michael, 1556-1636                             36
Unknown                                                           33
Bernegger, Matthias, 1582-1640                                    29
Müller, Georg, d.1639                                             29
Sprecher von Bernegg, Fortunatus, 1585-1647                       24
Casaubon, Isaac, 1559-1614                                        23
Grubbe, Lars, 1601-1642                                           22
Camerarius, Joachim, 1603-1687                                    22
Meursius, Johannes, 1579-1639                                     22
Vossius, Isaac (Dr), 1618-1689                                    20
Otto II, 1578-1637                                                19
                                                                ... 
Skytte, Bengt, 1614-1683                                           1
Gomaer, François, 1563-1641                                        1
Barclay-Debonnaire, Louise, 1585-1652                              1
Emporagrius, Erik Gabrielsson, 1606-1674                           1
Cappel, Louis, 1585-1658                                           1
Jack, Gilbert, 1577-1628                                           1
Vossius, Gerardus, 1619-1640                                       1
Aligre, Étienne, 1550-1635                                         1
Höpfner, Heinrich, 1583-1642                                       1
Aa, Anthony Willemsz., 1582-1638                                   1
Rigault, Nicolas, 1577-1654                                        1
Brederode, Reinoud, 1567-1633                                      1
Jungermann, Gottfried, 1577-1610                                   1
Wertheim de Rochefort, Johann Dietrich Löwenstein, 1585-1657       1
L'Empereur, Constantine, 1591-1648                                 1
Stringe, Johan, fl. 1622                                           1
Forstner, Christoph von, 1598-1667                                 1
Voisin, Joseph, 1610-1685                                          1
Orléans, administration of the German Nation at                    1
Sweerts, Pierre François, 1567-1629                               1
Pas, Isaac Manassès de, 1590-1640                                  1
Mesmes, Henri, d.1650                                              1
Gardie, Magnus Gabriel de la, 1622-1686                            1
Wertheim de Virneburg, Friedrich Ludwig Löwenstein, 1598-1657      1
Bogislaw, Ernst, 1620-1684                                         1
Hohenlohe-Langenburg, Philipp Ernst von, 1584-1628                 1
Oldenbarnevelt, Willem van, 1590-1638                              1
Beauharnais, François, d.1651                                      1
Vair, Guillaume, 1566-1621                                         1
Menasseh, Ben Israel, 1604-1657                                    1
Name: addressee, dtype: int64

Some collections include letters between correspondents of the collection creator, while others only contains letters where the collection creator is the author or addressee of the letter.

E.g. the collection of correspondence of Hugo de Groot includes letters between his brother and his brother-in-law.

In [122]:
df_christiaan = df[(df['author'] == 'Huygens, Christiaan, 1629-1695') | (df['addressee'] == 'Huygens, Christiaan, 1629-1695')]

df_christiaan = df[df['collection'] == 'Huygens, Christiaan']
df_christiaan['author'].value_counts()
Out[122]:
Huygens, Christiaan, 1629-1695                         1345
Huygens, Constantijn, 1628-1697                         175
Oldenburg, Henry, 1619-1677                             105
Huygens, Constantijn, 1596-1687                          75
Sluse, René François de, 1622-1685                       72
Chapelain, Jean, 1595-1674                               70
Moray, Robert (Sir), 1608-1673                           68
Schooten, Frans van, 1615-1660                           58
Boulliau, Ismaël, 1605-1694                              54
Leibniz, Gottfried Wilhelm, 1646-1716                    42
Bruno, Henrick, 1617-1664                                40
Huygens, Susanna, 1637-1725                              32
Heinsius, Nicolaas, 1620-1681                            28
Petit, Pierre, 1598 or before-1677                       24
Medici, Leopoldo de', 1617-1675                          24
Doublet, Philips, 1633-1707                              23
Hudde, Johannes, 1628-1704                               21
L'Hôpital, Guillaume François Antoine de, 1661-1704      21
Wallis, John (Dr), 1616-1703                             20
Mersenne, Marin, 1588-1648                               20
Fatio de Duillier, Nicolas, 1664-1753                    20
Saint-Vincent, Grégoire de, 1584-1667                    19
Kinner von Löwenthurn, Gottfried Alois, b.1610           18
Mylon, Claude, 1618-1660                                 18
Fermat, Pierre, 1601-1665                                17
Hire, Philippe de la, 1640-1718                          16
Gent, Pieter, b.1640                                     16
Graaf, Jan, 1673-1697                                    16
Hevelius, Johannes, 1611-1687                            15
Thévenot, Melchisédech, 1620-1692                        13
                                                       ... 
Hobbes, Thomas, 1588-1679                                 1
Mathion, Oded Louis, 1620-1700                            1
Louise Hollandine, Countess Palatine, 1622-1709           1
Regnauld, André, d.1702                                   1
Benoit, Antoine, 1632-1717                                1
Limojon, Alexandre-Toussaint de, 1630-1689                1
Leeuwenhoek, Antoni van, 1632-1723                        1
Gillet, Pierre François, 1648-1720                        1
Nassau-Siegen, Hendrik of, 1611-1652                      1
Boecler, Johann Heinrich, 1611-1672                       1
Lely, Peter, 1618-1680                                    1
Christina, Queen of Sweden, 1626-1689                     1
Kircher, Athanasius, 1601-1680                            1
Vossius, Isaac (Dr), 1618-1689                            1
Molyneux, Thomas (Sir), 1661-1733                         1
Briou, fl. 1675                                           1
Douw, Simon, 1620-1663                                    1
Court of Holland, Zeeland and West Friesland,             1
Alberghetti, Sigismondo, d.1702                           1
Placentius, Johann, d.1683                                1
Wijk, Johan van der, 1625-1679                            1
Magalotti, Lorenzo, 1637-1712                             1
Varignon, Pierre, 1654-1722                               1
Cock, Christopher, fl. 1684                               1
Dodart, Denis, 1634-1707                                  1
Holmes, Robert (Sir), 1622-1692                           1
Vallot, Antoine, 1594-1671                                1
Bilberg, Johann, 1650-1717                                1
Loménie, Henri-Auguste, 1595-1666                         1
Smethwick, Francis, d.1682                                1
Name: author, dtype: int64
In [124]:
df_constantijn = df[df['collection'] == 'Huygens, Constantijn']
df_constantijn['addressee'].value_counts()
Out[124]:
Huygens, Constantijn, 1596-1687                                          4252
Solms-Braunfels, Amalia von, 1602-1675                                    768
Huygens, Christiaan, 1551-1624                                             94
Barlaeus, Caspar, 1584-1648                                                81
Sauzin, Jean                                                               68
Heinsius, Daniel, 1580-1655                                                63
Lionne, Hugues de, 1611-1671                                               44
Rivet, André, 1572-1651                                                    44
Hooft, Pieter Cornelius, 1581-1647                                         40
William III and II, King of England, Scotland, and Ireland, 1650-1702      39
Beringhen, Henri, 1603-1692                                                33
Chièze, Sebastien, 1625-1679                                               29
Unknown                                                                    28
Langes de Montmirail, Frédéric de, 1630-1697                               26
Leu de Wilhem, David le, 1588-1658                                         25
Chambrun, Jacques Pineton, 1635-1689                                       23
Dohna, Frederick von, 1621-1688                                            21
Council of the Prince, fl. 1656-1664                                       20
Loménie, Henri Louis, 1635-1698                                            20
Swann-Ogle, Utricia, 1611-1674                                             19
Ban, Jan Albert, 1598-1644                                                 19
Jermyn, Henry, 1605-1684                                                   18
Puteanus, Erycius, 1574-1646                                               18
Nassau-Siegen, Hendrik of, 1611-1652                                       18
Westerbaen, Jacob, 1599-1670                                               17
Cusance, Béatrix, 1614-1663                                               17
Schurman, Anna Maria van, 1607-1678                                        17
Boxhorn, Marcus Zuerius van, 1612-1653                                     15
Bennet, Henry, 1618-1685                                                   15
Le Tellier, François Michel, 1641-1691                                     14
                                                                         ... 
Wierts, Joan, fl. 1650-1692                                                 1
Muelen, Andries, 1591-1654                                                  1
Molino, Domenico, 1573-1635                                                 1
Burg, fl. 1662                                                              1
Frédéric-Armand, comte de Schomberg, 1615-1690                              1
Fürstenberg, Ferdinand von, 1626-1683                                       1
Colvius, Andreas, 1594-1671                                                 1
Sprecher von Bernegg, Fortunatus, 1585-1647                                 1
Sipenesse, Cornelis, d.1635                                                 1
Santen, Jan, fl. 1635-1649                                                  1
Caron, Suzette, fl. 1669-1689                                               1
Amat, Angélique, fl. 1666                                                   1
Brancas, Marie, fl. 1613-1662                                               1
Nicholas, Edward (Sir), 1593-1669                                           1
Beauvais, Charles, b.1590                                                   1
Sohier, Nicolaas, fl. 1638                                                  1
Zuerius (Miss), fl. 1667                                                    1
Sylvius, Jean, fl. 1662-1666                                                1
Does, Jacob, 1641-1680                                                      1
Huygen, Johan, fl. 1632-1640                                                1
Coesmans, Jan, fl. 1643-1656                                                1
Brederode, Juliana, 1622-1678                                               1
Zuylen van Nyevelt, Mechtelt, fl. 1666                                      1
Berckel, Clemens, fl. 1613                                                  1
Cotton, John (Sir), 1621-1702                                               1
Schagen, Diederik, fl. 1660                                                 1
Magerus, Petrus, 1609-1653                                                  1
Petit, Pierre, 1598 or before-1677                                          1
Nanteuil, Robert, 1623-1678                                                 1
Enclos, Anne de l', 1620-1705                                               1
Name: addressee, dtype: int64
In [3]:
df['author_freq'] = df.groupby(['author'])['id'].transform('count')
df['addressee_freq'] = df.groupby(['addressee'])['id'].transform('count')
df['correspondents_freq'] = df.author_freq + df.addressee_freq
In [4]:
df.groupby(['id', 'author', 'addressee']).size()
Out[4]:
id                                    author                                             addressee                              
0000ab7c-f54a-493a-a066-ee929eedd1e3  Johnson, John, 1662-1725                           Charlett, Arthur (Reverend), 1655-1722     1
0000bd85-9139-4fec-b362-78b3f4a6b9c2  Boywer, F., fl. 1737                               Rawlinson, Richard (Dr), 1690-1755         1
0000d067-74f1-46ed-b4bc-e9d2765091e6  Alciatus, Francesco (Cardinal), 1522-1580          Aytta, Viglius Zuichemius ab, 1507-1577    1
0002ce4c-db77-4863-a1ec-e8be9b9d121b  Buffon, George Louis Leclerc de, 1707-1788         Jurin, James, 1684-1750                    1
0002dbb4-2785-4b05-aa59-2bbb3f654802  Sandford, Daniel (Reverend), 1729-1770             Ballard, George, 1705-1755                 1
                                                                                                                                   ..
fffdf05b-cba6-442e-a563-18598314a021  Howel, John, fl. 1760-1781                         Gough, Richard, 1735-1809                  1
fffe4dd7-c074-4de0-b422-dc1c635eca0f  Villiers, Christophe de, fl. 1633-1639             Mersenne, Marin, 1588-1648                 1
fffe9c50-5132-476e-9173-9b680d0f916e  August II of Braunschweig-Wolfenbüttel, 1579-1666  Andreae, Johann Valentin, 1586-1654        1
fffefd34-9784-4673-920e-1696883ef2a2  Zapata, Rodrigo, fl. 1574                          Agustín, Antonio, 1517-1586                1
ffffcc85-6dbc-4a57-8942-40323c4d18fc  Willis, Browne, 1682-1760                          Rawlinson, Richard (Dr), 1690-1755         1
Length: 127418, dtype: int64
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [5]:
ids = list(df.sort_values('correspondents_freq').id)
auths = list(df.sort_values('correspondents_freq').author)
addrs = list(df.sort_values('correspondents_freq').addressee)
auth_freqs = list(df.sort_values('correspondents_freq').author_freq)
addr_freqs = list(df.sort_values('correspondents_freq').addressee_freq)

auths = [auth if isinstance(auth, str) else None for auth in auths]
addrs = [addr if isinstance(addr, str) else None for addr in addrs]
In [6]:
corrs = [{'id': id, 'auth': auth, 'addr': addr, 'author_freq': auth_freq, 'addr_freq': addr_freq} for id, auth, addr, auth_freq, addr_freq in zip(ids, auths, addrs, auth_freqs, addr_freqs)]

corrs[0]
Out[6]:
{'addr': 'Jones, Robert (Reverend), fl. 1698',
 'addr_freq': 1.0,
 'auth': 'Meare, John, 1649-1710',
 'author_freq': 1.0,
 'id': 'aabeebf7-4c5b-4bc2-a2ec-8ace326cfa7a'}
In [7]:
from collections import OrderedDict
queued = {}
fetch = OrderedDict()
seen = {}
for corr in corrs:
    if corr['auth'] not in queued and corr['addr'] not in queued:
        queued[corr['auth']] = corr['id']
        queued[corr['addr']] = corr['id']
        fetch[corr['id']] = corr

print(len(fetch.keys()))
print(len(queued.keys()))
1942
3873
In [8]:
for corr in corrs:
    if corr['auth'] not in queued:
        queued[corr['auth']] = corr['id']
        fetch[corr['id']] = corr
    elif corr['auth'] not in queued:
        queued[corr['auth']] = corr['id']
        fetch[corr['id']] = corr

print(len(fetch.keys()))
print(len(queued.keys()))
13827
15758
In [9]:
for corr_id in fetch:
    url = f'http://emlo.bodleian.ox.ac.uk/profile/work/{corr_id}'
    print(url)
    break
http://emlo.bodleian.ox.ac.uk/profile/work/aabeebf7-4c5b-4bc2-a2ec-8ace326cfa7a
In [10]:
import requests
from bs4 import BeautifulSoup as bsoup

df[df.id == 'aabeebf7-4c5b-4bc2-a2ec-8ace326cfa7a']

#response = requests.get(url)
Out[10]:
Unnamed: 0 id type collection date author addressee origin destination repository author_freq addressee_freq correspondents_freq
73349 17320 aabeebf7-4c5b-4bc2-a2ec-8ace326cfa7a Letter Bodleian card catalogue 30 August 1698 Meare, John, 1649-1710 Jones, Robert (Reverend), fl. 1698 Oxfordshire, England NaN Bodleian Library, University of Oxford: MS Bal... 1.0 1.0 2.0
In [93]:
def get_relation_info(rel_type, detail_soup):
    rel_type_soup = detail_soup.find_all(class_=rel_type)
    if len(rel_type_soup) == 0:
        return None
    relation_soup = rel_type_soup[0].find_all(class_='relations')[0]
    return {
        'relation_type': rel_type.split(' '),
        'relation_text': [string for string in relation_soup.stripped_strings]
    }

def get_provenance(page_soup):
    prov_soup = page_soup.find_all(class_='provenance')[0]
    prov = prov_soup.text
    return prov.replace('Source of data: ','')

def get_page_details(corr_id, page_soup):
    page_details = {
        'correspondence_id': corr_id,
        'relations': [],
        'provenance': get_provenance(page_soup)
    }
    detail_soup = page_soup.find(id='details')
    if detail_soup:
        rel_types = ['people authors', 'people recipients', 'locations origin', 'locations destination']
        relation_info = [get_relation_info(rel_type, detail_soup) for rel_type in rel_types]
        page_details['relations'] = [relation for relation in relation_info if relation != None]
    return page_details

def get_correspondence_page(corr_id):
    url = f'http://emlo.bodleian.ox.ac.uk/profile/work/{corr_id}'
    response = requests.get(url)
    page_soup = bsoup(response.content)
    return get_page_details(corr_id, page_soup)

corr_id = 'aabeebf7-4c5b-4bc2-a2ec-8ace326cfa7a'
#detail_doc = get_page_details(corr_id, page_soup)
detail_index = 'emlo_page_details'
from elasticsearch import Elasticsearch

es = Elasticsearch()

#es.index(index=detail_index, doc_type='page_detail', id=detail_doc['correspondence_id'], body=detail_doc)
In [15]:
import time

headers = {
    'user-agent': 'DataScopesAnalyzer (https://marijnkoolen.github.io/Data-Scopes-Developers-2018/)',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-gb',
}

fetch[corr_id]
#time.sleep(10)
Out[15]:
{'addr': 'Jones, Robert (Reverend), fl. 1698',
 'addr_freq': 1.0,
 'auth': 'Meare, John, 1649-1710',
 'author_freq': 1.0,
 'id': 'aabeebf7-4c5b-4bc2-a2ec-8ace326cfa7a'}
In [94]:
from elasticsearch import exceptions

skip = 0

for ci, corr_id in enumerate(fetch):
    if es.exists(index=detail_index, id=corr_id):
        #print('skip', corr_id)
        skip += 1
        if skip % 1000 == 0:
            print('skipped', skip)
        continue
    #print('fetching page for', corr_id)
    detail_doc = get_correspondence_page(corr_id)
    try:
        detail_doc['author'] = fetch[corr_id]['auth']
        detail_doc['addressee'] = fetch[corr_id]['addr']
    except TypeError:
        print(fetch[corr_id])
        raise
    try:
        es.index(index=detail_index, doc_type='page_detail', id=detail_doc['correspondence_id'], body=detail_doc)
    except exceptions.RequestError:
        print(detail_doc)
        raise
    time.sleep(10)
    if (ci+1) % 100 == 0:
        print(ci+1, 'correspondence pages fetched')
skipped 1000
skipped 2000
skipped 3000
skipped 4000
skipped 5000
skipped 6000
skipped 7000
skipped 8000
skipped 9000
skipped 10000
skipped 11000
skipped 12000
12100 correspondence pages fetched
12200 correspondence pages fetched
12300 correspondence pages fetched
12400 correspondence pages fetched
12500 correspondence pages fetched
12600 correspondence pages fetched
12700 correspondence pages fetched
12800 correspondence pages fetched
12900 correspondence pages fetched
13000 correspondence pages fetched
13100 correspondence pages fetched
13200 correspondence pages fetched
13300 correspondence pages fetched
13400 correspondence pages fetched
13500 correspondence pages fetched
13600 correspondence pages fetched
13700 correspondence pages fetched
13800 correspondence pages fetched
In [ ]: