“Shakespeare in the Vectorian Age”: An evaluation of different word embeddings and NLP parameters for the detection of Shakespeare quotes
In this paper we describe an approach for the computer-aided identification of Shakespearean intertextuality in a corpus of contemporary fiction. We present the Vectorian, which is a framework that implements different word embeddings and various NLP...
mehr
Volltext:
|
|
Zitierfähiger Link:
|
|
In this paper we describe an approach for the computer-aided identification of Shakespearean intertextuality in a corpus of contemporary fiction. We present the Vectorian, which is a framework that implements different word embeddings and various NLP parameters. The Vectorian works like a search engine, i.e. a Shakespeare phrase can be entered as a query, the underlying collection of fiction books is then searched for the phrase and the passages that are likely to contain the phrase, either verbatim or as a paraphrase, are presented in a ranked results list. While the Vectorian can be used via a GUI, in which many different parameters can be set and combined manually, in this paper we present an ablation study that automatically evaluates different embedding and NLP parameter combinations against a ground truth. We investigate the behavior of different parameters during the evaluation and discuss how our results may be used for future studies on the detection of Shakespearean intertextuality.
|
Export in Literaturverwaltung |
|
From Historical Newspapers to Machine-Readable Data: The Origami OCR Pipeline
While historical newspapers recently have gained a lot of attention in the digital humanities, transforming them into machine-readable data by means of OCR poses some major challenges. In order to address these challenges, we have developed an...
mehr
Volltext:
|
|
Zitierfähiger Link:
|
|
While historical newspapers recently have gained a lot of attention in the digital humanities, transforming them into machine-readable data by means of OCR poses some major challenges. In order to address these challenges, we have developed an end-to-end OCR pipeline named Origami. This pipeline is part of a current project on the digitization and quantitative analysis of the German newspaper “Berliner Börsen-Zeitung” (BBZ), from 1872 to 1931. The Origami pipeline reuses existing open source OCR components and on top offers a new configurable architecture for layout detection, a simple table recognition, a two-stage X-Y cut for reading order detection, and a new robust implementation for document dewarping. In this paper we describe the different stages of the workflow and discuss how they meet the above-mentioned challenges posed by historical newspapers.
|
Export in Literaturverwaltung |
|