Techniques for refining and assessing the quality of digitised historical newspapers were the main topic of discussion at a recent Europeana Newspapers workshop at the University Library “Svetozar Markovic” in Belgrade, Serbia.
Workshop participants were given a warm and personal welcome by Dr. Alexandar Jerkov, director of the host library.
After this official welcome, the workshop began with an introduction from Marieke Willems of LIBER. She gave a brief overview of the project, and explained the structure of the two-day workshop.
Hans-Jörg Lieder from the Berlin State Library (coordinator of the Europeana Newspapers Project) spoke about the challenges involved in digitising and refining historical newspapers. The relatively poor quality of the paper and ink used to print newspapers, for example, means that they often deteriorate faster than other materials. He also explained how the project would add value to historical newspapers by making it easier to search through the content and pinpoint material of interest. Lieder closed his presentation by outlining ways in which the project might aggregate and refine more content in future, and extend its network.
Clemens Neudecker from the National Library of The Netherlands explained the complex process of newspaper refinement. He noted that the Europeana Newspapers Project would refine some 10 million pages from libraries across Europe. This work would be done by two project partners: the University of Innsbruck and CCS.
On video: Refinement – Clemens Neudecker
Stefan Pletschacher and Christian Clausner from PrimaResearch at the University of Salford then explained how to assess the quality of newspaper refinement. Quality assessment is important because the poor state of historical newspapers, and the use of fonts that are difficult to read, often lead to errors in the digital version. These errors can be identified by using the printed newspaper as the “ground truth” for its digital equivalent. Comparing the two versions shows how accurate the OCR process is, which in turn helps when setting requirements for the outsourcing of OCR work.
On video: Aletheia Demo – Christian Clausner (PrimaResearch)
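The ground-truth comparison described above can be illustrated with a minimal sketch. It assumes that both the ground truth and the OCR output are available as plain strings, and it uses plain character-level edit distance (Levenshtein) as the measure. This is a common general approach, not necessarily the evaluation method used by PrimaResearch or the project; the function names and the sample strings are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ground_truth: str, ocr_text: str) -> float:
    """Share of ground-truth characters the OCR got right (0.0 to 1.0)."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = levenshtein(ground_truth, ocr_text)
    return max(0.0, 1.0 - errors / len(ground_truth))


if __name__ == "__main__":
    # Hypothetical sample: the OCR confused a few similar-looking letters.
    truth = "The workshop was held in Belgrade."
    ocr = "Tbe workshop was beld in Be1grade."
    print(f"Character accuracy: {character_accuracy(truth, ocr):.1%}")
```

A figure like this, averaged over a sample of ground-truthed pages, is the kind of number that could be written into the requirements for an OCR supplier.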
Meet & Greet
After these technical discussions, a ‘Meet & Greet’ session allowed participants to speak with one another in more depth. In small groups, they started off by introducing themselves personally. Then they spoke about the organisations they represent, their experience with historical newspapers and their role in the Europeana Newspapers Project. The session welcomed the participants in the Best Practice Network and helped them get to know one another by giving each a peek into the others' organisations.
With their heads full of new insights and technical knowledge, participants enjoyed a sunny walk to the Nikola Tesla Museum, where they learned about Tesla's famous inventions such as the induction motor and the remote-controlled boat. There was also a connection to the Europeana Newspapers Project because Nikola Tesla gathered newspaper clippings that are now digitised by the museum to make them fully searchable. Read the joint paper from the Nikola Tesla Museum and the University Library “Svetozar Markovic” here: http://www.inforum.cz/pdf/2013/filipi-matutinovic-stela.pdf
The second day of the workshop was dedicated to refinement. Lotte Wilms from the National Library of The Netherlands kicked off the morning by explaining named entities. She demonstrated the rules that the project had decided to use for tagging the names of geographical places, people and organisations. She also noted that name tagging should be done consistently and that common sense should be used so that the tags serve the majority of search queries.
Following this explanation of named entities, Lotte Wilms and Clemens Neudecker invited participants to tag named entities themselves in a hands-on session.
Claus Gravenhorst from CCS, the only private partner in the Europeana Newspapers Project, showed how the OCR program “DocWorks” can bring structure to previously unstructured text.
On video: DocWorks – Claus Gravenhorst
Gunter Mühlberger explained how the project needed to align various types of metadata, in order to make aggregation via The European Library possible. The Europeana Newspapers Project uses METS/ALTO metadata, and Mühlberger explained its features to the participants. He closed his session with food for thought about structural metadata: what is a headline or an advertisement? What can be categorised as “opinion”?
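For readers unfamiliar with ALTO, the sketch below shows roughly how the recognised text is stored and how it could be read back out. The XML fragment is a small hand-written example, not a file from the project: real METS/ALTO packages also carry coordinates, confidence values and a METS wrapper describing the issue structure, none of which is shown here. The function names are made up for illustration, and tags are matched by local name so that the ALTO schema version does not matter.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified ALTO fragment (assumed v2 namespace).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Europeana"/><SP/><String CONTENT="Newspapers"/>
          </TextLine>
          <TextLine>
            <String CONTENT="Workshop"/><SP/><String CONTENT="in"/><SP/><String CONTENT="Belgrade"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_text_lines(xml_text: str) -> list[str]:
    """Return the text of every TextLine, one string per line."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            # Each recognised word is a String element with a CONTENT attribute.
            words = [
                child.get("CONTENT", "")
                for child in element
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return lines


if __name__ == "__main__":
    for line in alto_text_lines(SAMPLE_ALTO):
        print(line)
```

Structural questions such as "is this block a headline or an advertisement?" are answered at a higher level than these word and line elements, which is why they need agreed definitions across partners.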
On video: Doc-Works hands-on – CCS and UIBK
The presentations and social events of the workshop certainly helped the network partners get to know one another, and encouraged people to share their best practices and views on future project work. We look forward to seeing everyone again, as well as new participants, at the next workshop. It will focus on Aggregation and Presentation and will be held at the “Promoting innovation in Europe” conference, organised by The European Library on the 16th of September 2013 in Amsterdam.
Register here: http://www.eventbrite.nl/org/3891830439?s=14727265
Want to know more?
Lotte Wilms wrote a blog post about this workshop for KBNLResearch: http://researchkb.wordpress.com/2013/06/20/europeana-newspapers-refinement-aggregation-workshop/
Pictures of the workshop can be seen on Flickr: http://www.flickr.com/photos/enewspapers/with/9024170781/