
From Text to Tech

“Corpus and computational linguistics for powerful text processing in the Humanities”

​

Convenors: Dr Barbara McGillivray (Alan Turing Institute and University of Cambridge) and Dr Gard Jenset

Hashtags: #text2tech and #DHOxSS

Computers: Participants are not required to bring their own laptops for this workshop; laptop computers will be provided by DHOxSS.

 

Abstract

​

Digitization efforts provide increasingly large amounts of text for answering old and new research questions in the Humanities, and the technical skills needed to process and analyse texts automatically at such a large scale are in high demand. The workshop addresses this need by imparting the basic skills required to automatically process and mine textual data; no prior knowledge of programming is assumed.

 

The workshop builds on the experience the organizers gathered in previous successful and fully booked editions of the "From Text to Tech" workshop at DHOxSS 2015–2017. It will ensure that beginners, as well as people with some previous programming experience, can take full advantage of the content presented.

​

We will use Python, a very flexible programming language widely used in Humanities research. The workshop will take a hands-on, stepwise approach. It will offer a very basic introduction to Python programming, corpus linguistics, and Natural Language Processing, and will cover the process of cleaning texts and adding automatic linguistic annotation to them (lemmatization, part-of-speech tagging, syntactic parsing). It will also teach the basics of semantic analysis and topic modelling. 

 

The workshop will also include research talks by invited speakers, who will demonstrate concretely how the skills acquired in the lectures can be applied to answer questions in a range of humanistic disciplines.

​


​

Convenors

​

Barbara McGillivray is a computational linguist and works as a research fellow at The Alan Turing Institute and the University of Cambridge. She holds a PhD in computational linguistics from the University of Pisa. Her research interests include Language Technology for Cultural Heritage, Latin computational linguistics, quantitative historical linguistics, and computational lexicography.

​

Gard B. Jenset has a PhD in English linguistics from the University of Bergen. He currently works with deep learning and language technology in industry. Among his research interests are corpus linguistics and quantitative methods in historical linguistics.

​

"Every single discussion was relevant to me, so I am very pleased I was able to attend"

DHOxSS 2017 participant


SOLD OUT

Timetable
Overview of the week's timetable, including evening events.
​
Monday 2nd July
​
08:15-09:30

​

Registration (Sloane Robinson building)
Tea and coffee (ARCO building)
​
09:30-10:30

​

Opening Keynote (Sloane Robinson O'Reilly lecture theatre)

​

10:30-11:00

​

Refreshment break (ARCO building)
​

11:00-12:30

​

Introduction to programming in Python

​

The session provides a basic introduction to programming for digital humanities using the Python language. Among the topics covered are assignments and variables, data types, conditional statements, and reading/writing data. (Gard Jenset)
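As a taste of the topics listed above (variables, data types, a conditional statement, and reading/writing data), here is a minimal sketch; the values and the filename sample.txt are invented for illustration:

```python
# Variables and basic data types
title = "Example text"        # a string
n_tokens = 120                # an integer (illustrative value)
is_long = n_tokens > 1000     # a boolean

# A conditional statement
if is_long:
    label = "long text"
else:
    label = "short text"

# Writing data to a file, then reading it back
with open("sample.txt", "w", encoding="utf-8") as f:
    f.write(title + " is a " + label)

with open("sample.txt", encoding="utf-8") as f:
    line = f.read()

print(line)  # Example text is a short text
```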

 

12:30-14:00

​

Lunch (Dining Hall)
​
14:00-16:00

​

Introduction to programming in Python (continued)

 

The session continues the morning's introduction to programming in Python. (Barbara McGillivray)

 

16:00-16:30

 

Refreshment break (ARCO building)
​
16:30-17:30

​

Introduction to Corpora

​

The session will give an introduction to the main concepts of corpus linguistics, including corpus creation and corpus processing for research in Digital Humanities. (Barbara McGillivray)
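One of the most basic corpus-processing steps is building a word-frequency list. A minimal sketch with Python's standard library (the two-line "corpus" is invented; real corpora would be read from files):

```python
from collections import Counter

# A tiny "corpus" of two documents
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Tokenise naively on whitespace and count word frequencies
tokens = [word for doc in corpus for word in doc.lower().split()]
freq = Counter(tokens)

print(freq.most_common(3))  # [('the', 4), ('sat', 2), ('on', 2)]
```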

​

​

Tuesday 3rd July
​
09:00-10:30
 
Basic text processing with Python

​

The session gives an introduction to working with linguistic data in Python. Topics include simple regular expressions and other methods for handling text data. (Gard Jenset)
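As an illustration of this kind of text handling, a small sketch using Python's built-in re module (the example sentence is invented):

```python
import re

text = "DHOxSS 2018 runs from 2 July to 6 July in Oxford."

# Find all four-digit numbers (e.g. years)
years = re.findall(r"\b\d{4}\b", text)
print(years)  # ['2018']

# Lowercase the text and strip some punctuation before tokenising
cleaned = re.sub(r"[.,;:]", "", text.lower())
print(cleaned.split()[:4])  # ['dhoxss', '2018', 'runs', 'from']
```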

​

10:30-11:00
 
Refreshment break (ARCO building)

 

11:00-13:00
​
Data structures in Python
​

This session will cover basic data structures like lists and dictionaries in Python, with practical examples. (Gard Jenset)
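A minimal sketch of the two data structures named above (the session titles and instructor pairings below are illustrative data, not the actual course records):

```python
# A list holds ordered items; a dictionary maps keys to values.
sessions = ["Python basics", "Text processing", "Data structures"]

# Map each session title to an instructor (illustrative pairing)
instructors = {
    "Python basics": "Gard Jenset",
    "Text processing": "Gard Jenset",
    "Data structures": "Gard Jenset",
}

# List operations: appending, indexing, length
sessions.append("Corpora")
print(sessions[0], len(sessions))             # Python basics 4

# Dictionary lookup, with a default for missing keys
print(instructors.get("Corpora", "unknown"))  # unknown
```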

​

13:00-14:30
​
Lunch (Dining Hall)
​
14:30-15:30
​
Invited speaker: Dr Alessandro Vatri: “From (raw) text to annotated corpus: an Ancient Greek example”

 

Abstract

​

This talk will show how Word files containing Ancient Greek texts can be transformed into annotated XML through scripting and Python coding. Methods and concepts demonstrated in the talk include regular expressions, Python string manipulation, XPath, XML manipulation, Python dictionaries and data structures, error handling and logging, and output generation.
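As a much-reduced illustration of such a pipeline, here is a sketch that uses a regular expression to segment a marked-up string and emits XML with the standard-library xml.etree.ElementTree module. The input format and element names are invented; the talk's actual workflow is more elaborate.

```python
import re
import xml.etree.ElementTree as ET

# Plain text with simple section markers, to be turned into annotated XML
raw = "[1] menin aeide thea [2] Peleiadeo Achileos"

root = ET.Element("text")
# Split on markers like "[1]" using a regular expression with two groups
for num, content in re.findall(r"\[(\d+)\]\s*([^\[]+)", raw):
    seg = ET.SubElement(root, "seg", n=num)
    seg.text = content.strip()

print(ET.tostring(root, encoding="unicode"))
```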

​
15:30-16:00
​
Refreshment break (ARCO building)

 

16:00-17:30
 
Lectures (various venues)
​
Wednesday 4th July

​

09:00-10:30  
 
Introduction to Natural Language Processing (NLP) in Python

 

This session introduces the NLTK library and shows how it can be used for tasks such as stemming, part-of-speech tagging, and lemmatization with Python. (Barbara McGillivray)
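NLTK supplies ready-made tools for these tasks (for instance its Porter stemmer). Purely to illustrate what stemming means, here is a deliberately naive suffix-stripping sketch in plain Python; it is not NLTK's algorithm:

```python
def naive_stem(word):
    """Strip a few common English suffixes. A toy illustration,
    far cruder than NLTK's PorterStemmer: note "tagged" -> "tagg"."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

words = ["processing", "tagged", "corpora", "texts"]
print([naive_stem(w) for w in words])  # ['process', 'tagg', 'corpora', 'text']
```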

​

10:30-11:00
 
Refreshment break (ARCO building)
​
11:00-13:00

​

Going further with NLTK

 

Practical exercises on using NLTK to perform stemming, part-of-speech tagging, and lemmatization in Python. (Barbara McGillivray)

​
13:00-14:30
​
Lunch (Dining Hall)
​
14:30-15:30

 

Invited speaker: Giovanni Colavizza: “Reference parsing from scholarly literature in the arts and humanities”

​

Abstract:

​

The task of reference parsing entails the detection, extraction and annotation of references found in scholarly literature. It belongs to the broader category of sequence labelling tasks in NLP, which includes named entity recognition and part-of-speech tagging. Reference parsing is helpful for building citation indexes such as Google Scholar. The task has proven particularly challenging in the arts and humanities, as witnessed by the very partial coverage of these disciplines in commercial citation indexes. We will introduce and motivate the task, particularly with respect to the possibility of interlinking disparate collections (libraries, archives, museums) via citations, explore its challenges, and discuss ongoing technical work to overcome them. We will pay particular attention to the NLP dimensions of the task, such as the importance of capturing features at multiple linguistic levels (morphology, syntax, semantics) in order to automatically learn a variety of referencing styles.

​

15:30-16:00

​

Refreshment break (ARCO building)
​
16:00-17:00

 

Lectures (various venues)

​

​

Thursday 5th July

 

09:00-10:30
​
Parsing HTML and XML documents
 

Python can be used to extract text from structured documents like HTML and XML. This session gives an introduction to how this is done. (Barbara McGillivray/Gard Jenset)
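A minimal sketch of XML text extraction with Python's standard-library xml.etree.ElementTree module; the document and element names are invented, and HTML can be handled analogously with the html.parser module:

```python
import xml.etree.ElementTree as ET

# A small XML document of the kind the session works with (invented example)
xml_doc = """<corpus>
  <doc id="d1"><title>On Rhetoric</title><body>First text here.</body></doc>
  <doc id="d2"><title>Poetics</title><body>Second text here.</body></doc>
</corpus>"""

root = ET.fromstring(xml_doc)

# Extract the id, title, and body text of every <doc> element
for doc in root.findall("doc"):
    print(doc.get("id"), doc.find("title").text, "->", doc.find("body").text)
```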

 

10:30-11:00
 
Refreshment break (ARCO building)
​

11:00-13:00

​

Extracting semantic information from text

​

The session gives an introduction to how Python and the NLTK library can be used to extract semantic information from unstructured text and measure similarity between documents. (Barbara McGillivray)
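One simple way to measure similarity between documents, offered here as an illustration rather than as the method taught in the session, is cosine similarity over bag-of-words vectors, which can be sketched in plain Python:

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two texts, using raw word counts
    as vectors. A bag-of-words sketch; NLTK and other libraries
    offer more sophisticated measures."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

print(round(cosine_similarity("the cat sat", "the cat slept"), 2))  # 0.67
```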

​

13:00-14:30

​

Lunch (Dining Hall)
​
14:30-15:30
​
Invited speaker: Thomas Wood: “Forensic stylometry”
​

Abstract

​

This talk introduces an approach to "forensic stylometry", that is, identifying the author of a text based on a corpus of documents. The field made headlines in 2013 when two professors of computational linguistics proved that JK Rowling was the author of a detective series which she had written under a pseudonym. Traditionally this would have been done with a hand-engineered sequence of components for removing stopwords, lemmatising words, and constructing a bag-of-words model. However, recent advances in deep learning software have made it simple to build text classifiers with almost no feature engineering. In a few hours we can build a classifier to identify authorship; it can be trained in a few minutes and will run on a regular laptop.
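A toy version of the traditional approach described above, comparing frequency profiles of common function words, can be sketched in a few lines. The texts and "authors" here are invented, and real stylometry needs far larger corpora:

```python
from collections import Counter

# Function words vary by author more than by topic, which makes their
# relative frequencies a classic stylometric signal.
FUNCTION_WORDS = ["the", "of", "and", "to", "a"]

def profile(text):
    """Relative frequency of each function word in the text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return [counts[w] / total for w in FUNCTION_WORDS]

def distance(p, q):
    """Manhattan distance between two frequency profiles."""
    return sum(abs(x - y) for x, y in zip(p, q))

known = {
    "author_a": "the house of the king and the garden of the queen",
    "author_b": "a walk to a town to see a friend and a river",
}
unknown = "the sound of the sea and the cry of the gulls"

# Attribute the unknown text to the nearest known profile
guess = min(known, key=lambda a: distance(profile(known[a]), profile(unknown)))
print(guess)  # author_a
```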

​

15:30-16:00
​
Refreshment break (ARCO building)

 

16:00-17:00
 
Lectures (various venues)
​
​

Friday 6th July

​

09:00-10:30
​
Topic Modelling
​

This session gives a non-technical introduction to topic modelling along with examples of Python code. (Gard Jenset)
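Topic models such as LDA take a document-term matrix as input, and building that matrix is a useful warm-up even before touching a modelling library. A sketch in plain Python (the three "documents" are invented; fitting the model itself would normally use a library such as gensim or scikit-learn):

```python
from collections import Counter

# Three tiny "documents"; a real collection would be much larger
docs = [
    "parliament passed the trade bill",
    "the poet published a sonnet",
    "the trade deal helped the economy",
]

# The vocabulary defines the columns of the document-term matrix
vocab = sorted({w for d in docs for w in d.split()})

# One row per document, one count per vocabulary word
matrix = [[Counter(d.split())[w] for w in vocab] for d in docs]

print(vocab.index("trade"))                           # column for "trade"
print([row[vocab.index("trade")] for row in matrix])  # [1, 0, 1]
```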

​

10:30-11:00
 
Refreshment break (ARCO building)
​
11:00-13:00
​
Problem solving session
​

The session will provide an opportunity to apply the skills taught during the week, with instructors present to provide guidance. (Gard Jenset / Barbara McGillivray)

 
13:00-14:00

​

Lunch (Dining Hall)
​
14:00-15:00

​

Looking after your Python code and sharing it
​

The session will introduce best practices in programming, including data sharing and code reproducibility. (Gard Jenset / Barbara McGillivray)
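By way of illustration (not the convenors' own material), a few such practices, namely docstrings, a fixed random seed for reproducibility, and a main guard, in one small file:

```python
import random

def sample_lines(lines, k, seed=42):
    """Return k lines sampled reproducibly: the same seed always
    yields the same sample, so others can rerun the analysis."""
    rng = random.Random(seed)   # local RNG: no hidden global state
    return rng.sample(lines, k)

if __name__ == "__main__":
    # The main guard lets this file be imported as a module or run directly.
    data = ["line %d" % i for i in range(10)]
    print(sample_lines(data, 3))
```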

​

15:00-16:00
 
Closing plenary (O'Reilly lecture theatre)
​
​
Speaker biographies
​

​​Giovanni Colavizza is a Data Scientist at The Alan Turing Institute. He did his PhD in Technology Management at the Digital Humanities Laboratory of the EPFL in Lausanne, working on methods for text mining and citation analysis of scholarly publications. He was for two years the operations manager of the Venice Time Machine, a large-scale digitisation and indexation project based at the Archives of Venice, and is cofounder of Odoma, a start-up offering customised machine learning techniques in the cultural heritage domain.

Giovanni is interested in how data science can contribute to the scientific endeavour, and to society at large. He is particularly passionate about the humanities and the study of the past through an interplay of quantitative and qualitative methods.

Prior to joining the Turing, Giovanni was a researcher at the University of Leiden (Centre for Science and Technology Studies), the Leibniz Institute of European History in Mainz, and the University of Oxford. He studied computer science (BSc) and history (BA, MA) in Udine, Milan, Padua and Venice in Italy.

​

Alessandro Vatri (DPhil Oxon) is Junior Research Fellow of Wolfson College, University of Oxford. He is mainly interested in communication in the ancient Greek world and, in particular, in the connection between the linguistic form of texts and the original socio-anthropological circumstances of their production and reception. His recent and forthcoming publications focus on ancient textual practices, ancient rhetoric, Greek oratory, and the reconstruction of native language comprehension. His areas of interest also include ancient Greek synchronic and historical linguistics, ancient literary criticism, corpus linguistics, and the digital humanities. His first monograph Orality and Performance in Classical Attic Prose. A Linguistic Approach (Oxford University Press) was published in 2017.

 

Tom Wood studied physics as his first degree and then became interested in natural language processing. He did a Master's at Cambridge University in Computer Speech, Text and Internet Technology, and since then he has worked in machine learning and AI at various companies in the UK, Spain and Germany, including work on computer vision and dialogue systems (think of Siri). Most recently he has been working as a data scientist at CV-Library, one of the UK's largest job boards. This involves looking at the many years of job-hunting data and CV documents that CV-Library has collected and trying to find patterns and smart ways to use the information: for example, recommending a job to a candidate based on past behaviour, or categorising candidates based on their CVs.
