The Europeana Newspapers project has developed a number of free and open source software tools. These tools can help libraries and other institutions to refine their digitised historical newspapers and to evaluate the quality of their refinement work.
The PAGE format and related tools were developed by PRImA research over several years, with partial support from externally funded projects such as IMPACT and Succeed as well as from Work Package 3 “Quality Assessment” of the Europeana Newspapers project.
- The PAGE Viewer is a stand-alone application for viewing the page layout and text content of segmentation ground truth and of the results of page recognition/OCR systems. The natively supported file format is PAGE XML, but ALTO XML, FineReader XML and hOCR files can be opened as well. It runs on Windows, Linux and OS X and is available as open source on GitHub.
- The PAGE Metadata Scanner is a Java command line tool that scans a single PAGE XML file (document page layout and text content) and outputs its properties and statistics as comma-separated values. It is also available as open source on GitHub.
- The Java PAGE Libraries are a comprehensive set of Java classes that can easily be integrated into your own tool to produce valid PAGE instances. They have been released as open source on GitHub.
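To illustrate the kind of output the PAGE Metadata Scanner is described as producing, here is a minimal Python sketch. It is not the Java tool itself and does not reproduce its actual column set: the function names `page_statistics` and `to_csv`, the chosen statistics and the sample document are assumptions made for this example. The sketch reads the `Page` attributes and counts `TextRegion` and `TextLine` elements, ignoring the XML namespace so that it is not tied to one PAGE schema version.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Small sample document, invented for this example.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page_0001.tif" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1"><TextLine id="l1"/><TextLine id="l2"/></TextRegion>
    <TextRegion id="r2"><TextLine id="l3"/></TextRegion>
  </Page>
</PcGts>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def page_statistics(xml_text):
    """Collect a few properties and counts from one PAGE XML document."""
    stats = {
        "imageFilename": "",
        "imageWidth": "",
        "imageHeight": "",
        "textRegions": 0,
        "textLines": 0,
    }
    for elem in ET.fromstring(xml_text).iter():
        name = local_name(elem.tag)
        if name == "Page":
            for key in ("imageFilename", "imageWidth", "imageHeight"):
                stats[key] = elem.get(key, "")
        elif name == "TextRegion":
            stats["textRegions"] += 1
        elif name == "TextLine":
            stats["textLines"] += 1
    return stats


def to_csv(stats):
    """Render the statistics as a header row plus one data row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(stats), lineterminator="\n")
    writer.writeheader()
    writer.writerow(stats)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(page_statistics(SAMPLE)), end="")
```

Writing one row per page keeps the output easy to concatenate when many pages are scanned one after another.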
The following tools were developed in Work Package 2 “Refinement” by the National Library of the Netherlands.
- The Named Entities Recognition Tool takes container documents (MPEG21-DIDL, METS), parses all references to ALTO files and tries to find named entities in the pages (most models recognise the types Location, Person and Organisation). The aim is to keep the coordinates of each entity on the page available throughout the whole process so that the results can be highlighted in a viewer. Read more about the tool on the KBNLresearch blog and get the source code on GitHub.
- The Entity Disambiguation Tool is a simple Python library and web service for disambiguating named entities against a label database. It uses a Solr query to filter possible candidates and then a more detailed analysis of string similarity, number of inlinks and entity type to select the “best” candidate. It contains code to handle (multilingual) DBpedia dumps and load them into a Solr backend, as well as helper code for annotating the ALTO 2.1 files used in the Europeana Newspapers project. Get the source code on GitHub.
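The candidate selection step described for the Entity Disambiguation Tool can be sketched as follows. This is an illustration under stated assumptions, not the tool's actual code: the candidates are given as a plain list standing in for the result of the Solr query, and the function names, dictionary keys and weights (0.5, 0.3, 0.2) are hypothetical. The three signals are the ones named above: string similarity, number of inlinks and entity type.

```python
import math
from difflib import SequenceMatcher


def score(mention, mention_type, candidate, max_inlinks):
    """Combine string similarity, inlink count and type match into one score.

    The weights are arbitrary values chosen for this sketch.
    """
    similarity = SequenceMatcher(
        None, mention.lower(), candidate["label"].lower()
    ).ratio()
    # Log-scale the inlinks so very popular entries do not dominate completely.
    if max_inlinks > 0:
        popularity = math.log1p(candidate["inlinks"]) / math.log1p(max_inlinks)
    else:
        popularity = 0.0
    type_match = 1.0 if candidate["type"] == mention_type else 0.0
    return 0.5 * similarity + 0.3 * popularity + 0.2 * type_match


def best_candidate(mention, mention_type, candidates):
    """Return the highest-scoring candidate, or None if there are none."""
    if not candidates:
        return None
    max_inlinks = max(c["inlinks"] for c in candidates)
    return max(
        candidates,
        key=lambda c: score(mention, mention_type, c, max_inlinks),
    )


# Invented candidates, standing in for what a label database might return.
CANDIDATES = [
    {"label": "Paris", "type": "Location", "inlinks": 50000},
    {"label": "Paris Hilton", "type": "Person", "inlinks": 3000},
    {"label": "Paris (mythology)", "type": "Person", "inlinks": 800},
]

if __name__ == "__main__":
    print(best_candidate("Paris", "Location", CANDIDATES))
```

Filtering first and scoring second keeps the expensive comparison limited to a short candidate list, which is the division of labour the description above suggests.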
The following tools were developed in Work Package 2 “Refinement” by the University of Innsbruck.
- The Structify tool is a graphical tool for viewing and editing METS files and creating or correcting structural metadata. It requires Java and is available for Windows, Linux and OSX.
You can download the most recent version here.
- The File Rename Tool (FRT): Newspapers should be delivered organised by day of issue so that a date search is possible later on. If a newspaper is not available in day folders, the File Rename Tool can help to bring it into the right structure and supports libraries in renaming and reordering their images according to the Europeana Newspapers project specifications. The main idea of FRT is that images, which may be stored at year level, can be quickly ordered by issue and publishing date. Source code available on GitHub.
- The Binarization and Conversion Tool (BCT): BCT can be used to produce JPEG2000 or JPEG files from newspaper master images for presentation on the web. JPEG2000 is the preferred format in the Europeana Newspapers project for displaying images in the TEL browser, as it allows for zooming and various other display modifications. BCT calls two other tools: a binarization method by Basilis Gatos that is optimised for OCR, and Kakadu, a software development kit for creating JPEG2000 images. To use all features of BCT, both tools must therefore be installed and licensed. However, feel free to call your own tools from BCT. Source code available on GitHub.
- The File Analyzer Tool (FAT): The main purpose of this tool is to ensure that the data delivered by the libraries to the technical partners for further processing meets the specifications (i.e. the files are valid and readable). This check prevents delays during the recognition process and is therefore very important for refining massive amounts of newspapers efficiently. Moreover, the tool collects image metadata that is used in later stages to produce the final METS/ALTO (ENMAP). FAT checks the files, reads their metadata (width, height, resolution, etc.), calculates a checksum for each file and allows the languages and text type of the newspaper images to be entered. At the end, it produces a FAT-XML file containing all the information needed for further refinement of the images. Source code available on GitHub.
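The file check described for FAT can be illustrated with a short Python sketch. It is not the FAT source code and does not reproduce the FAT-XML format: the element and attribute names, the function names and the choice of SHA-256 as checksum algorithm are assumptions made for this example. Reading image width, height and resolution is left out because it would need an imaging library.

```python
import hashlib
import xml.etree.ElementTree as ET
from pathlib import Path


def file_checksum(path, chunk_size=65536):
    """Compute a SHA-256 checksum in chunks, so large images fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def analyse_files(folder, language, text_type):
    """Build a small XML report with one entry per file in the folder.

    The report layout is hypothetical and only stands in for FAT-XML.
    """
    report = ET.Element("report", language=language, textType=text_type)
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        entry = ET.SubElement(report, "file", name=path.name)
        try:
            entry.set("bytes", str(path.stat().st_size))
            entry.set("sha256", file_checksum(path))
            entry.set("readable", "true")
        except OSError:
            # Unreadable files are flagged instead of aborting the whole run.
            entry.set("readable", "false")
    return report


if __name__ == "__main__":
    result = analyse_files(".", language="de", text_type="Gothic")
    print(ET.tostring(result, encoding="unicode"))
```

Flagging unreadable files in the report, rather than stopping at the first error, lets a library see every problem in a delivery in one pass.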