The University of Innsbruck hosted the technical meeting of the Europeana Newspapers project on 17 and 18 September. The Europeana Newspapers project, funded under the EC's CIP 2007–2013 programme, aims at the aggregation and refinement of 10 million digitised newspaper pages for The European Library and Europeana. In addition, the project addresses challenges specific to digitised newspapers, with the goal of a unique end-user experience: searching and retrieving from a vast amount of digitised newspapers from all over Europe, covering an extensive period in history. The meeting in Innsbruck brought together the libraries that provide the content and the technical partners to discuss the current status of the project, the next steps to be taken and the corresponding workflows.
Thorsten Siegmann (Berlin State Library) introduced the meeting agenda.
The project uses refinement methods to enhance search and presentation functionalities for Europeana users. The refinement methods are Optical Character Recognition (OCR), article segmentation and Named Entity Recognition (NER). These methods will allow users to find individual articles, perform keyword searches and display the full text online. Clemens Neudecker from the KB, leader of the refinement work package, says:
“Workshop participants were presented with a first analysis of the newspapers selected by the libraries for refinement during the project with regard to characteristics such as font style, language, time period covered etc. and the overall volume of data to be processed. This again highlighted the ambitious goals of the project – since the 10M pages of newspapers to be refined in the course of the project correspond to a data set in the three-digit Terabyte range. A live demonstration of several tools that have been developed by the University of Innsbruck to facilitate the delivery of data from the libraries to the technical partners followed. These tools are going to be distributed to the libraries participating in the project who will use them for some essential in-house pre-processing tasks, such as validating metadata and file delivery structure – both aspects are key to coordinating such large-scale processing tasks. The session closed with a lively discussion of technical questions related to binarisation, the requirements for creating viewing copies and named entity recognition.”
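The in-house pre-processing checks mentioned above can be sketched in a few lines. The directory layout and file extensions below are assumptions for illustration, not the project's actual delivery specification:

```python
from pathlib import Path

def find_unmatched_images(batch_dir):
    """Return names of image files in a delivery batch that lack a
    sibling .xml metadata file (hypothetical layout: one .tif and one
    .xml per page, side by side in one directory)."""
    batch = Path(batch_dir)
    return [img.name for img in sorted(batch.glob("*.tif"))
            if not img.with_suffix(".xml").exists()]
```

A check like this, run by each library before delivery, catches missing metadata early rather than mid-way through a large-scale processing run.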
To facilitate the aggregation of 10 million refined digitised newspaper pages from different European libraries, their metadata must be aligned with the Europeana Data Model (EDM). The presentation of the digitised newspapers concerns the user experience: the content browser will improve the way users can search the digitised newspapers and retrieve results from the Europeana portal. Alastair Dunning from The European Library (TEL), leader of the work package concerned with the aggregation and presentation of digitised newspapers, says:
“The meeting contained various in-depth technical discussions relating to metadata and OCR, but the most involved discussions were on how the newspapers will be presented to users within Europeana. How will the interface look? What will happen to our library’s newspapers? How will users be led to each library’s own newspaper site?”
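The EDM alignment mentioned above is, at its core, a mapping from each library's local field names to shared EDM properties. A toy sketch, where the source field names are invented for illustration (the EDM property names are real, but a production mapping would emit RDF, not a flat dictionary):

```python
# Hypothetical source record from a library catalogue (field names invented).
source_record = {
    "titel": "Innsbrucker Nachrichten",
    "datum": "1909-03-14",
    "sprache": "de",
}

# Source field -> EDM property (dc:title etc. are genuine EDM/Dublin Core terms).
FIELD_MAP = {
    "titel": "dc:title",
    "datum": "dcterms:issued",
    "sprache": "dc:language",
}

edm_record = {"rdf:type": "edm:ProvidedCHO"}
edm_record.update({FIELD_MAP[k]: v for k, v in source_record.items()})
print(edm_record["dc:title"])  # -> Innsbrucker Nachrichten
```

Once every provider's records are expressed in the same properties, a single aggregation pipeline can serve all of them.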
The University of Innsbruck presented a first draft of a slim METS/ALTO profile that will be used in the project for exchanging metadata together with content files, and for delivering these information packages to The European Library.
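As a rough illustration only (not the project's actual profile), a slim METS document typically links page images and ALTO full-text files through file groups and a structural map, along these lines:

```xml
<mets xmlns="http://www.loc.gov/METS/"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <fileSec>
    <fileGrp USE="IMAGE">
      <file ID="IMG_0001"><FLocat LOCTYPE="URL" xlink:href="images/0001.tif"/></file>
    </fileGrp>
    <fileGrp USE="ALTO">
      <file ID="ALTO_0001"><FLocat LOCTYPE="URL" xlink:href="alto/0001.xml"/></file>
    </fileGrp>
  </fileSec>
  <structMap>
    <div TYPE="NEWSPAPER_PAGE">
      <fptr FILEID="IMG_0001"/>
      <fptr FILEID="ALTO_0001"/>
    </div>
  </structMap>
</mets>
```

The structural map is what lets a receiving system know which OCR text belongs to which scanned page.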
With the content refinement activities making good progress, it was also time to kick off a number of tasks related to evaluation and performance analysis, which will later provide objective results on the quality and usefulness of the enriched material. Being a Best Practice Network project, evaluation and quality assessment constitutes an important aspect within the scope of Europeana Newspapers. Stefan Pletschacher from the University of Salford presented a number of tools and systems which in part build on resources created in the very successful EU-funded project IMPACT (Improving Access to Text) and which are currently being developed further to accommodate the specific requirements of digitised newspapers. A web-based image and ground truth repository was introduced and will serve as the central point of reference in the future. Ground truth, in this context, is the ideal result which recognition and/or refinement methods should produce. With regard to OCR, for instance, this would be the 100% correct text as it is found on the scanned page. Related to this, all library partners were asked to commence a selection process in order to create a truly representative evaluation dataset.

Besides tools and datasets, there was a third major topic at the heart of the discussions: use scenarios. Knowledge of how the enriched material is going to be used and presented to end users plays an important role when evaluating the success of automated and/or semi-automated text recognition systems. To gather the views of different stakeholders, a questionnaire on which use scenarios and corresponding technical features are considered important was prepared and is currently being circulated for feedback. The project intends to continuously follow new developments in this direction and to include the views of institutions outside the consortium in the future.
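For OCR, comparison against ground truth usually boils down to an edit-distance measure between the recognised text and the manually verified text. A minimal character error rate (CER) sketch, not one of the project's actual evaluation tools:

```python
def levenshtein(a, b):
    """Minimum number of single-character edits (insertions, deletions,
    substitutions) needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def char_error_rate(ocr_text, ground_truth):
    """Edit distance normalised by the length of the ground truth."""
    return levenshtein(ocr_text, ground_truth) / len(ground_truth)

# One substitution ('e' for 'c') in a 9-character ground truth.
print(char_error_rate("Innsbruek", "Innsbruck"))
```

Aggregated over a representative evaluation dataset, a measure like this gives the objective quality figures the evaluation work aims at.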
The Europeana Newspapers project will build on some of the successful evaluation tools implemented in the IMPACT project and develop them further into an evaluation and quality assessment infrastructure. Other IMPACT tools used by Europeana Newspapers are the Named Entity Recognition tools, ABBYY FineReader OCR, which is capable of recognising historical fonts and layouts, and the Functional Extension Parser (FEP) from the University of Innsbruck (UIBK), which can automatically detect and tag the structural metadata of scanned newspaper pages. At the meeting, it was agreed to start the quality assessment of the use scenarios through a survey of all content providers. Furthermore, all content providers will supply the University of Salford with images and metadata for the creation of the evaluation dataset.
From a digitised newspaper in a library's local repository to its appearance in a user's search results is a long and complex chain of processes. The technical meeting provided the partners with a forum to explain the different elements of this chain, discuss best practices and take decisions as a consortium.
Once the production process is up and running, the project also intends to involve libraries from all over Europe that may be interested in the refinement of their already digitised newspapers, as well as in a standardised aggregation process towards Europeana.