Within the Europeana Newspapers project, we often speak of the value of historic newspapers for the academic community, but how exactly might a researcher use the material that we’re gathering? This month we’re interviewing Toine Pieters.
“European libraries must secure free access for researchers”
You can find Professor Toine Pieters on the fertile grounds where the life sciences meet the humanities and social sciences – with a background in pharmacology and a PhD in Social Studies of Science he teaches history of the life sciences at Utrecht University. Pieters: ‘Medicines have always been at the forefront of the public debate – both as focal points of cultural enthusiasm and as the locus of public controversy. The utopia of the wonder drug and the dystopia of poisoned minds or bodies represent the opposite ends of the range of attributed meanings.’
‘One of my current projects is Translantis, research into so-called “reference cultures”. How do we perceive foreign cultures which are dominant in a certain field or age? The Americans have long dominated the pharmaceutical debate. Before them it was the Germans. It is interesting to study tell-tale signs of reference cultures in our public sphere. And see how the shift occurred from the Germans to the Americans.’
“Newspapers resonate with public sentiment”
‘Newspapers do not only reflect the news of the day, but also the way contemporaries perceived those events, in background articles, editorials and letters to the editor. Plus: they constitute a very large corpus. This makes newspapers a true treasure trove for studying public discourse, the interaction between science, culture and the economy.’
“Digitisation has opened up fabulous new research opportunities”
‘In the analogue days, any number of researchers could only analyse a fraction of the information that was out there. Also, the research question very much determined the selection. Now software allows us to work with millions of pages. By combining words and expressions, machines uncover patterns that we never even suspected were there.’
‘For example, I researched the duplicitous attitude of the Dutch towards drugs before World War II. At home their religion forbade the use of drugs, but in the Dutch colonies they were actively engaged in the production and trade of cocaine and opium. Somehow the Chinese living in the Netherlands were also involved, but we did not quite know how. By analysing so-called “hidden debates” in the newspapers, we were able to uncover a pattern revealing that the Chinese were identified as “the other” to whom the Dutch laws did not apply as much. Their involvement in the drug trade as users and as dealers was in line with the economic interests of the Dutch Government and turned out to be the key to the double standard.’
“We are still in the pre-history of working with big data”
‘But there are pitfalls as well, of course. With big data at our disposal and prototypes of software to work with them, we have yet to develop corresponding methods of source criticism and heuristics. First of all, the available collections are not representative. Libraries have made the selection for us – sometimes consciously, sometimes because they had no choice. Many publishers expect to make money from their digital archives and therefore will not provide free access, even if it is in the public interest. For example, the Dutch post-World War II newspaper collection has a strong bias towards smaller newspapers. All but one of the national papers are missing from the collection. That greatly affects any research outcome.’
“OCR quality varies”
‘Another obstacle is the varying quality of the OCR. There are quite a few issues with the earliest scans. Such are the dialectics of lead. Our technical means have greatly improved in recent years, but even present-day OCR is far from perfect. Historical texts continue to confound OCR equipment.’
‘Improving low-quality OCR is a tremendous challenge. Post-digitisation OCR correction tools are being developed, but their usefulness is limited. Sometimes rekeying the content is the only option. As resources are limited, I think we will need a massive crowd-sourcing effort to get that job done. In the past the KB was not particularly receptive to the idea of crowd sourcing, but hopefully that is changing now. Crowd sourcing is the way to go; it is absolutely necessary.’
“Do not throw away the paper originals!”
‘The issues we now have with OCR from the past only go to show that it is very, very unwise to do away with paper originals once they are digitised. Governments may think it is a way to save money, but for research it is a downright disaster. In view of the quality issues with OCR we must always be able to go back to the originals to verify whether the digital copy has, in fact, the whole story. For academic research we need the facts, nothing but the facts.’
“Multilingualism at the European level”
‘A European aggregation such as Europeana is, of course, a great idea. For the first time in history we have the opportunity to do transnational comparative research. Multilingualism is an obstacle we can overcome, I think, although there is lots of work to be done. I myself am involved in the HERA project which involves newspaper collections from the Dutch KB, the National Library of Luxembourg, some German libraries and the British Library. We have developed a demonstrator for Dutch-German bilingual text mining called BILAND and are now proceeding towards trilingual text mining. Also, I am co-organising a two-day HERA-sponsored seminar on Mining Digital Repositories in a European context in April at the Dutch KB. At the seminar, we will discuss all angles of transnational and multilingual text mining with a select group of researchers [link to on-line report to follow]. Again, this is entirely new ground we are breaking; it is very exciting, but we must develop new methods and rules for such research.’
“Libraries should secure free access for researchers at the European level”
‘Looking at the future of newspaper collections and of Europeana, I of course hope that they will keep building and improving the collections. But I see another important task for the libraries of Europe: to negotiate free access to the collections for research purposes.’
‘I think the researchers themselves will develop most of the tools they need, as they know exactly what their research questions are. But securing access must be done at the aggregation level. I understand perfectly well that publishers need to make money by selling access to their archives to the general public. But there must be some way to negotiate unrestricted access for research purposes, in the public interest. There is a crucial role for libraries and Europeana to play here.’
Text and photograph Inge Angevaare / March 2014