scale

A set of notebooks providing insights into the scale of data and how this affects data scopes

Data Scopes and Scale

This site contains a set of Jupyter notebooks that investigate the relationship between the scale of a dataset and the resulting data scope for different research questions.

Datasets

The notebooks use two datasets: one of correspondence between historical figures, and one of online book reviews.

The correspondence dataset contains metadata for around 110,000 letters from the Early Modern Letters Online (EMLO) project. The metadata consists of sending date, sender, receiver, location of sender and location of receiver.

The online review dataset contains review text and metadata for 15 million book reviews from Goodreads and 51 million book reviews from Amazon. The metadata consists of review author, review date, rating, the reviewed book and its author, and the platform on which the review was published (i.e. Amazon or Goodreads).

Notebooks:

Distributions

Datasets with multiple records have elements that can be analysed across all or a subset of records (dates, senders and receivers, authors, book titles, ratings). Values in certain fields or columns can occur multiple times, resulting in a distribution. Analysing these distributions and understanding their shapes can tell us a lot about the underlying processes by which the data was generated.
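As a minimal sketch of what such a distribution looks like, the snippet below counts how often each value occurs in one field, and then how many values share each frequency. The sample values are invented placeholders, not taken from either dataset, and the variable names are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical sample: the sender field of seven letter records.
# These placeholder values are invented for illustration only.
senders = [
    "sender_a", "sender_a", "sender_b", "sender_a",
    "sender_b", "sender_c", "sender_d",
]

# How often does each value occur in the field?
value_counts = Counter(senders)

# How many values occur once, twice, three times, ...?
# This "frequency of frequencies" describes the shape of the distribution.
frequency_distribution = Counter(value_counts.values())

print(value_counts.most_common(2))
print(sorted(frequency_distribution.items()))
```

On real data, a distribution in which a few values account for most records and many values occur only once would show up as a large count at frequency 1 and a long tail of high frequencies.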

Notebooks:

Getting an Overview at Different Scales

One of the biggest challenges as datasets grow larger in scale is getting an overview of what is in the data. What are the most salient characteristics? What are the data axes or dimensions along which the datasets can be split into meaningful subsets?

For correspondence, one dimension is the date of correspondence, which can be sorted into periods. Another is the type of person sending and receiving letters, which can be sorted by job type or gender.

For reviews, there are dimensions such as book, author, genre, rating, review date and publication platform.
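Splitting a dataset along one of these dimensions can be sketched as follows. The example groups letter records into 50-year periods. The records, the field names `sender` and `year`, and the helper functions `period_of` and `split_by` are hypothetical and chosen for illustration; they are not the structure used in the notebooks.

```python
from collections import defaultdict

# Hypothetical letter records; field names and values are assumptions.
letters = [
    {"sender": "sender_a", "year": 1612},
    {"sender": "sender_b", "year": 1648},
    {"sender": "sender_a", "year": 1655},
    {"sender": "sender_c", "year": 1699},
]


def period_of(year, span=50):
    """Return the (start, end) years of the span-year period containing year."""
    start = year - year % span
    return (start, start + span - 1)


def split_by(records, key_func):
    """Group records into subsets keyed by the value key_func returns."""
    subsets = defaultdict(list)
    for record in records:
        subsets[key_func(record)].append(record)
    return dict(subsets)


by_period = split_by(letters, lambda record: period_of(record["year"]))
for period, subset in sorted(by_period.items()):
    print(period, len(subset))
```

The same `split_by` helper works for any other dimension by passing a different key function, for example one that returns the rating or the platform of a review.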

Notebooks:

Extracting and Structuring Information at Different Scales

Extracting information such as topics and social networks is affected by the amount of available data.
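To illustrate how the amount of data shapes an extracted social network, the sketch below builds a directed sender-to-receiver network from growing samples of letters and reports its size. The letter pairs are invented placeholders and `network_size` is a hypothetical helper, so this is only an illustration of the idea under those assumptions.

```python
# Hypothetical (sender, receiver) pairs, invented for illustration.
letters = [
    ("a", "b"),
    ("a", "b"),
    ("b", "c"),
    ("a", "c"),
    ("d", "a"),
    ("b", "c"),
]


def network_size(pairs):
    """Return (number of distinct people, number of distinct directed links)."""
    nodes = set()
    edges = set()
    for sender, receiver in pairs:
        nodes.update((sender, receiver))
        edges.add((sender, receiver))  # direction matters: sender -> receiver
    return len(nodes), len(edges)


# Extract the network from the first 2, 4 and 6 letters.
for sample_size in (2, 4, 6):
    print(sample_size, network_size(letters[:sample_size]))
```

With few letters the network contains only a fraction of the people and links that appear once more letters are included, so conclusions about who is central can change with scale.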

References

The Amazon review data was originally used in:

The Goodreads review data was originally used in:

The EMLO dataset is described in:

Tool Criticism Workshops

An overview of related Tool Criticism workshops and materials.

About Data Scopes

Data Scopes is a project by Marijn Koolen and Rik Hoekstra, both at the Humanities Cluster of the Royal Netherlands Academy of Arts and Sciences.

Data Scopes workshop websites: