Every month we highlight one partner of the Europeana Newspapers Project. These articles give you the ‘inside story’ about our partners, their specific role within the project and the various challenges that arise with the refinement and aggregation of historical newspapers. This month we feature the Bibliothèque nationale de France. The French version of this article is available here.
The Bibliothèque nationale de France (BnF) collects and preserves the documentary heritage of France and makes it available to a large audience, both on site and online. The BnF’s collections are unique in the world: 14 million books and printed documents, manuscripts, prints, photographs, maps and plans, scores, coins, medals, sound documents, video and multimedia documents, etc., covering all disciplines. The BnF’s digital library, Gallica, provides access to over two and a half million documents.
Because of the wide range of activities linked to the Europeana Newspapers Project, a tailor-made team of colleagues from no fewer than five departments of the library is currently working together.
Which content does the BnF provide to the Europeana Newspapers Project?
With more than 5 million digitised newspaper pages (of which nearly 4 million are OCR-ed), the BnF offers one of the richest digitised newspaper collections in Europe and provides the project with the highest number of digitised issues (1,385,727 pages for optical character recognition, OCR, and 1,002,761 pages for optical layout recognition, OLR). This is the only corpus of the project in French.
The newspapers concerned are out of copyright and cover a period ranging from the beginning of the 19th century to the middle of the 20th century. They were selected with the idea of offering the newspapers that researchers and the general public request most often in the digital library Gallica.
A famous example is “J’accuse”. On January 13th, 1898, Émile Zola published on the front page of “L’Aurore” an open letter to President Félix Faure in defence of Captain Dreyfus, who had been accused of high treason. In his plea, Zola presents the case as a national and public matter, putting the Republic and its values at the centre of his arguments. This article shook the regime to its foundations and has become one of the most famous newspaper articles ever.
What impact does and will the project have on your library and its users?
Researching specific topics and browsing through the text is more difficult with newspapers than with books or academic journals. The frequent lack of tables of contents or indexes in French newspapers makes research even more difficult. With the Europeana Newspapers Project, access to the newspaper collections and their usability for internet users will improve considerably thanks to better OCR, refinement at article level and the indexing of each title. Moreover, we expect that enabling search engines to identify article titles will bring new audiences to Gallica and the BnF catalogue.
The Europeana Newspapers Project represents a formidable step forward for the BnF. To name a few developments: the metadata should be opened up more widely, the update of the METS/ALTO format is ongoing, and the IT systems will have to develop functionalities to allow OLR files to be implemented and accessed in Gallica.
Thanks to this project, the BnF is also developing new tools to improve searches within its newspaper collections. Work is being done on named entities and on refinement by article. Even though these new functionalities are of particular interest to researchers (e.g. linguists, political scientists, literary historians), they will also benefit a larger audience: they will, for instance, allow genealogists to access personal data more easily.
With the project entering its third and final year, checking the files once they have been OCR-ed and OLR-ed is keeping us busy, even though quality control (level of text recognition, accurate refinement of articles, etc.) is done by sampling. Questions are also arising, for example about the comparison with previous OCR-ed versions: which version should we offer? Can we (or should we) keep a version of the previous OCR files? What tools do we have at our disposal for control and comparison? What criteria and methodology should be applied? How should the named entities be implemented? We hope that these questions will create new synergies between the various departments of the library.
And what about having a concerted selection policy or even working together on thematic newspaper projects at a European level? Well, that would be the beginning of another story, one that would go far beyond the scope of this project…