Quantitative Humanities *NEW*

"Applying data science methods in humanities research"

Convenor: Professor David De Roure (University of Oxford's e-Research Centre and Alan Turing Institute)

Hashtags: #quanthum and #DHOxSS

Computers: please bring your own laptop (no tablets please)

Abstract

Data Science brings an exciting new range of methods to digital humanities research, using statistical and computational techniques. Through a series of introductory sessions and expert talks, this workshop will demonstrate the use of data science methods in humanities scholarship and equip participants to apply these methods in their own work. The techniques address data preparation, analysis and presentation, using tools for statistical analysis and social network analysis, and an introduction to computational techniques including simulation and machine learning to gain insights into historical data. We will use datasets provided by the Bodleian Library and other partners, with experts on the datasets also visiting the workshop. This workshop has been designed in collaboration with colleagues at The Alan Turing Institute, the “From Text to Tech” workshop and the Software Sustainability Institute.

Please install the following software on your own laptops prior to the start of the Summer School:

Gephi - https://gephi.org/
NetLogo - https://ccl.northwestern.edu/netlogo/

We will be using the following data:

Oxford English Dictionary (OED) http://www.oed.com/

Oxford Dictionary of National Biography (ODNB) http://www.oxforddnb.com/

Bodleian First Folio of Shakespeare's plays http://firstfolio.bodleian.ox.ac.uk/

Early Modern Letters Online (EMLO) http://www.culturesofknowledge.org/

Early English Books Online Text Creation Partnership (EEBO-TCP) http://blogs.bodleian.ox.ac.uk/eebotcp/

William Godwin Diaries http://godwindiary.bodleian.ox.ac.uk/index2.html

British Book Trade Index http://bbti.bodleian.ox.ac.uk/

Background Reading

Agent-Based Modeling and Historical Simulation by Michael Gavin
http://www.digitalhumanities.org/dhq/vol/8/4/000195/000195.html

Data used in case studies: William Godwin Diaries - http://godwindiary.bodleian.ox.ac.uk/index2.html

Convenor

David De Roure is Professor of e-Research at the University of Oxford's e-Research Centre. Focused on advancing digital scholarship, David works closely with multiple disciplines including social sciences (studying social machines), humanities (computational musicology and experimental humanities), engineering (Internet of Things), and computer science (large scale distributed systems and social computing). He has extensive experience in hypertext, Web Science, Linked Data, and Internet of Things. Drawing on this broad interdisciplinary background he is a frequent speaker and writer on the future of digital scholarship and scholarly communications.

Professor De Roure is also a Visiting Researcher at the Alan Turing Institute, working at the intersection of data science with libraries and GLAM (Gardens, Libraries and Museums at the University of Oxford), and a Visiting Professor at Goldsmiths, University of London.

SOLD OUT

TIMETABLE

Programme

Monday 2nd July

Introduction, with examples of numerical and computational techniques in the humanities, including working with large-scale data. For example, we will explore and discuss the Google Ngram Viewer, and look at a simple algorithm with applications from intertextuality to musical analysis.
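As a rough illustration of the kind of counting that sits behind an n-gram trend chart, the following Python sketch computes the relative frequency of a phrase in a tiny corpus keyed by year. The corpus, the function names and the whitespace tokenisation are assumptions made for this example only; it is not the Google tool's implementation, nor necessarily the algorithm presented in the session.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequency(text, phrase):
    """Share of the n-grams in `text` that equal `phrase` (case-insensitive)."""
    target = tuple(phrase.lower().split())
    grams = ngrams(text.lower().split(), len(target))
    if not grams:
        return 0.0
    return Counter(grams)[target] / len(grams)


# Invented toy corpus: one short "document" per year.
corpus = {
    1800: "the steam engine and the steam loom",
    1900: "the motor car and the electric tram",
}

# One value per year, ready to plot as a trend line.
trend = {
    year: relative_frequency(text, "steam engine")
    for year, text in corpus.items()
}
```

Dividing by the total number of n-grams, rather than reporting raw counts, keeps years with more surviving text from dominating the trend.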

08.15-09.15

Registration (Sloane Robinson building)

Tea and coffee (ARCO building)

09.30-10.30

Opening Keynote (Sloane Robinson lecture theatre)

10.30-11.00

Refreshment break (ARCO building)

11.00-12.30

Introductions and joint keynote with Introduction to Digital Humanities strand

Professor David De Roure

12.30-14.00

LUNCH (Dining Hall)

14:00-16:00

Introduction to quantitative humanities

Professor David De Roure

16:00-16:30

Refreshment break (ARCO Building)

16:30-17:30

Hands-on session: digging into a dataset

Tuesday 3rd July

The day takes us through data cleaning, analysis, and presentation. This will primarily use spreadsheets. Several datasets will be provided, with experts on data analysis and visualisation on hand to assist. Participants can work individually or in small groups focused around the datasets. (Assisted by the Software Sustainability Institute)

09.00-10.30

Introduction and data preparation

Professor David De Roure, Dr Alfie Abdul-Rahman and Iain Emsley

10.30-11.00

Refreshment break (ARCO building)

11.00-13.00

Analysis and Visualization

Professor David De Roure, Dr Alfie Abdul-Rahman and Iain Emsley

13:00-14:30

LUNCH (Dining Hall)

14:30-15:30

Invited speaker: Dr Alessandro Vatri: "From (raw) text to annotated corpus: an Ancient Greek example" (shared talk with From Text to Tech workshop)

Abstract

This talk will show how Word files containing Ancient Greek texts can be transformed into annotated XML through scripting and Python coding. Methods and concepts demonstrated in the talk include regular expressions, Python string manipulation, XPath, XML manipulation, Python dictionaries and data structures, error handling and logging, and output generation.
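The pipeline described in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's code: it assumes the text has already been extracted from Word into plain text, that each line starts with a "book.line" reference, and that `l` and `w` are suitable element names. All of these are choices made for the example.

```python
import logging
import re
import unicodedata
import xml.etree.ElementTree as ET

# Assumed input convention: "<book>.<line> <text>", e.g. "1.1 ...".
LINE_PATTERN = re.compile(r"^(?P<ref>\d+\.\d+)\s+(?P<text>.+)$")
PUNCTUATION = ".,;·"


def text_to_xml(raw):
    """Turn reference-prefixed lines into <l n="..."><w n="...">..</w></l>."""
    root = ET.Element("text")
    # Normalise to NFC so accented Greek letters are single code points.
    raw = unicodedata.normalize("NFC", raw)
    for number, line in enumerate(raw.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        match = LINE_PATTERN.match(line)
        if match is None:
            # Error handling: log and skip lines we cannot interpret.
            logging.warning("Skipping line %d: no reference found", number)
            continue
        line_el = ET.SubElement(root, "l", {"n": match.group("ref")})
        position = 0
        for token in match.group("text").split():
            word = token.strip(PUNCTUATION)
            if not word:
                continue
            position += 1
            word_el = ET.SubElement(line_el, "w", {"n": str(position)})
            word_el.text = word
    return root


sample = """1.1 μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος
Homer, Iliad (sample)
1.2 οὐλομένην, ἣ μυρί' Ἀχαιοῖς ἄλγε' ἔθηκε,
"""

root = text_to_xml(sample)
xml_output = ET.tostring(root, encoding="unicode")

# XPath-style query: all words of the line with reference 1.2.
words_in_1_2 = root.findall(".//l[@n='1.2']/w")
```

The unnumbered middle line is logged and skipped rather than raising an exception, so one malformed line does not abort the conversion of a long text.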

15:30-16:00

Refreshment break (ARCO Building)

16:00-17:00

Lectures (various venues)

Wednesday 4th July

Social network analysis is a technique with many applications, from correspondence networks to social media analytics. We will use the Gephi tool (which is freely available for Windows and Mac) to explore sample datasets, including the William Godwin Diaries (1788-1836).
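Before a network can be explored in Gephi, the source material usually has to be turned into an edge list. The Python sketch below shows one possible way to do that for diary-style records: two people are linked whenever they are named in the same entry, and the edge weight counts how often that happens. The entries and names are invented placeholders, not data from the Godwin Diaries, and the helper names are hypothetical. The `Source`, `Target` and `Weight` column headers follow the convention that Gephi's spreadsheet importer recognises.

```python
import csv
import io
from collections import Counter
from itertools import combinations

# Invented placeholder entries: (date, people named in that entry).
entries = [
    ("1790-01-01", ["Alice", "Bob", "Carol"]),
    ("1790-01-02", ["Alice", "Bob"]),
    ("1790-01-03", ["Carol", "Dave"]),
]


def co_occurrence_edges(entries):
    """Count how often each pair of people appears in the same entry."""
    weights = Counter()
    for _date, people in entries:
        # sorted(set(...)) removes repeats and fixes the order of each pair,
        # so ("Alice", "Bob") and ("Bob", "Alice") count as the same edge.
        for a, b in combinations(sorted(set(people)), 2):
            weights[(a, b)] += 1
    return weights


def edges_to_csv(weights):
    """Serialise the edges as CSV text with Source, Target, Weight columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (a, b), weight in sorted(weights.items()):
        writer.writerow([a, b, weight])
    return buffer.getvalue()


edges = co_occurrence_edges(entries)
edge_csv = edges_to_csv(edges)
```

Writing `edge_csv` to a file gives an undirected, weighted edge table that can be loaded through Gephi's import dialog.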

09.00-10.30

Introduction to social network analysis and the Gephi tool

Professor David De Roure

10.30-11.00

Refreshment break (ARCO building)

11.00-13.00

Hands-on session with data

Professor David De Roure

13:00-14:30

LUNCH (Dining Hall)

14:30-15:30

Invited speaker: Giovanni Colavizza: “Reference parsing from scholarly literature in the arts and humanities” (shared talk with From Text to Tech workshop)

Abstract

The task of reference parsing entails the detection, extraction and annotation of references found in scholarly literature. It is part of the broader category of sequence labelling tasks in NLP, which includes named entity recognition and part-of-speech tagging. Reference parsing is helpful for building citation indexes such as Google Scholar. This task has proven to be particularly challenging in the arts and humanities, as shown by the very partial coverage of these disciplines in commercial citation indexes. We will introduce and motivate the task, particularly with respect to the possibility of interlinking disparate collections (libraries, archives, museums) via citations, explore its challenges and discuss ongoing technical work to overcome them. We will pay particular attention to the NLP dimensions of this task, such as the importance of capturing features at multiple linguistic levels (morphology, syntax, semantics) in order to automatically learn a variety of referencing styles.

15:30-16:00

Refreshment break (ARCO Building)

16:00-17:00

Lectures (various venues)

Thursday 5th July

An emerging technique in the humanities is to simulate historical systems in order to better understand the evidence we consider today: close reading by digital prototyping. We will use a software tool (NetLogo) to explore some simple simulations, focusing on early kinds of communications, and look at what-if scenarios. For those who wish, this tool can be used to develop new simulations.
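To give a flavour of what such a simulation involves, here is a minimal agent-based sketch written in Python rather than in the workshop's own tool. It models news spreading through letters: each agent has a fixed set of correspondents, and at every tick each informed agent writes to one of them with some probability. The model, its parameters and all names are assumptions made for this illustration, not one of the workshop's simulations.

```python
import random


def simulate_news(n_agents=20, contacts_per_agent=3, p_write=0.5,
                  ticks=15, seed=1):
    """Return the number of informed agents after each tick.

    All parameters are hypothetical knobs for what-if experiments.
    """
    rng = random.Random(seed)
    # Each agent gets a fixed, randomly chosen set of correspondents.
    contacts = {
        agent: rng.sample(
            [other for other in range(n_agents) if other != agent],
            contacts_per_agent,
        )
        for agent in range(n_agents)
    }
    informed = {0}  # agent 0 starts out knowing the news
    history = [len(informed)]
    for _ in range(ticks):
        newly_informed = set()
        # sorted() keeps the run reproducible for a given seed.
        for agent in sorted(informed):
            if rng.random() < p_write:
                newly_informed.add(rng.choice(contacts[agent]))
        informed |= newly_informed
        history.append(len(informed))
    return history


# What-if scenario: infrequent versus frequent letter writers.
slow = simulate_news(p_write=0.2)
fast = simulate_news(p_write=0.8)
```

Comparing `slow` and `fast` tick by tick is a small instance of a what-if scenario: the same network and starting point, with one assumption about behaviour changed.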

09.00-10.30

Introduction to NetLogo and example simulations

Professor David De Roure

10.30-11.00

Refreshment break (ARCO building)

11.00-13.00

Hands-on session with simulations, and discussion

Professor David De Roure

13:00-14:30

LUNCH (Dining Hall)

14:30-15:30

Invited speaker: Thomas Wood: 'Forensic stylometry' (shared talk with From Text to Tech workshop)

Abstract

This talk introduces an approach to "forensic stylometry", that is, identifying the author of a text, based on a corpus of documents. This field made headlines in 2013 when two professors of computational linguistics proved that JK Rowling was the author of a detective series which she had written under a pseudonym. Traditionally this would have been done with a hand-engineered sequence of components for removing stopwords, lemmatising words, and constructing a bag-of-words model. However, recent advances in deep learning software have made it simple to build text classifiers with almost no feature engineering. In a few hours we can build a classifier to identify authorship. It can be trained in a few minutes and will run on a regular laptop.
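The traditional pipeline mentioned in this abstract can be sketched with the Python standard library alone. The example below removes stopwords, builds a bag-of-words model for each known author, and attributes an unknown text to the author whose bag is closest by cosine similarity. It is a toy illustration under stated assumptions: the stopword list is a tiny hand-picked one, lemmatisation is omitted, the texts are invented, and nearest-neighbour cosine matching stands in for whatever classifier a real study would use. It is not the deep learning approach presented in the talk.

```python
import math
import re
from collections import Counter

# Tiny hand-picked stopword list, for illustration only.
STOPWORDS = {"the", "a", "and", "of", "in", "to", "was", "on"}


def bag_of_words(text):
    """Lower-case, tokenise, drop stopwords and count what remains."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def attribute(unknown_text, known_texts):
    """Return the author label whose known text is most similar."""
    unknown_bag = bag_of_words(unknown_text)
    return max(
        known_texts,
        key=lambda author: cosine(unknown_bag, bag_of_words(known_texts[author])),
    )


# Invented sample texts with hypothetical author labels.
known = {
    "author_a": "the wizard raised a wand and the castle trembled in the storm",
    "author_b": "the detective lit a cigarette and studied the rain on the window",
}

guess = attribute("a wand lay in the ruined castle", known)
```

With texts this short the result rests on two shared words, so the sketch shows the mechanics only; real attribution needs far more text per author.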

15:30-16:00

Refreshment break (ARCO Building)

16:00-17:00

Lectures (various venues)

Friday 6th July

A look at emerging techniques and to the future. We particularly focus on machine learning (which is commonly described today as AI) for image classification and search, with humanities examples. We will use online tools developed at the University of Oxford.

09.00-10.30

Image classification and search

Professor David De Roure, Giles Bergel

10.30-11.00

Refreshment break (ARCO building)

11.00-13.00

Looking to the future: discussion with experts

Professor David De Roure

13:00-14:00

LUNCH (Dining Hall)

14:00-15:00

Report back and wrap up

Professor David De Roure

15:00-16:00

Closing keynote (O'Reilly lecture theatre)

Speaker biographies

Alfie Abdul-Rahman completed her PhD in Computer Science at Swansea University, focusing on the physically-based rendering and algebraic manipulation of volume models. She was then a Research Associate at the University of Oxford e-Research Centre and joined King's College London in March 2018 as a Lecturer. Her projects include Quill, ViTA: Visualization for Text Alignment, and Poem Viewer. Before joining Oxford, she worked as a Research Engineer in HP Labs Bristol on document engineering, and then as a software developer in London, working on multi-format publishing. Her research interests include visualization, computer graphics, and human-computer interaction.

Giles Bergel is Digital Humanities Research Ambassador in the Visual Geometry Group at the University of Oxford. As well as computer vision, his interests include text encoding, linked data and the study of early printed books.

Giovanni Colavizza is a Data Scientist at The Alan Turing Institute. He did his PhD in Technology Management at the Digital Humanities Laboratory of the EPFL in Lausanne, working on methods for text mining and citation analysis of scholarly publications. He was for two years the operations manager of the Venice Time Machine, a large-scale digitisation and indexation project based at the Archives of Venice, and is cofounder of Odoma, a start-up offering customised machine learning techniques in the cultural heritage domain.

Giovanni is interested in how data science can contribute to the scientific endeavour, and to society at large. He is particularly passionate about the humanities and the study of the past through an interplay of quantitative and qualitative methods.

Prior to joining the Turing, Giovanni was a researcher at the University of Leiden (Centre for Science and Technology Studies), the Leibniz Institute of European History in Mainz, and the University of Oxford. He studied computer science (BSc) and history (BA, MA) in Udine, Milan, Padua and Venice in Italy.

Iain Emsley is a PhD student in Digital Media at the University of Sussex. He worked for the Oxford e-Research Centre on various Digital Humanities projects, such as Fusing Audio and Semantic Technologies (FAST) and Workset Creation for Scholarly Analysis (WCSA), and the Square Kilometre Array. His research interests include sustainability and sonification.

Alessandro Vatri (DPhil Oxon) is Junior Research Fellow of Wolfson College, University of Oxford. He is mainly interested in communication in the ancient Greek world and, in particular, in the connection between the linguistic form of texts and the original socio-anthropological circumstances of their production and reception. His recent and forthcoming publications focus on ancient textual practices, ancient rhetoric, Greek oratory, and the reconstruction of native language comprehension. His areas of interest also include ancient Greek synchronic and historical linguistics, ancient literary criticism, corpus linguistics, and the digital humanities. His first monograph, Orality and Performance in Classical Attic Prose: A Linguistic Approach (Oxford University Press), was published in 2017.

Tom Wood studied physics as his first degree and then became interested in natural language processing. He did a Masters at Cambridge University in Computer Speech, Text and Internet Technology, and since then he has worked in machine learning and AI in various companies, including computer vision and designing dialogue systems (think of Siri), in the UK, Spain and Germany. Most recently he has been working as a data scientist at CV-Library, one of the UK's largest job boards. This involves looking at the many years of job hunting data and CV documents that CV-Library has collected, and trying to find patterns and smart ways to use the information, for example recommending a job to a candidate based on past behaviour, or categorising candidates based on CVs.