Within the Europeana Newspapers Project, we often speak of the value of historic newspapers for the academic community, but how exactly might a researcher use the material that we’re gathering? This month we’re interviewing Timo Honkela, professor at the University of Helsinki, Department of Modern Languages, and the National Library of Finland, Centre for Preservation and Digitisation.
Can you briefly describe yourself: your background and the research you’ve done using historic newspapers?
My name is Timo Honkela, and since the beginning of 2014 I have served as a professor at the University of Helsinki, Department of Modern Languages, and the National Library of Finland, Centre for Preservation and Digitisation. In the 1980s, I worked as a researcher in a project that was developing a natural language database interface for Finnish. Since then, my research has centered on natural language processing, machine learning, text mining, cognitive modeling and related areas.
Your work highlights the importance of newspapers as an information source. How did you discover the link between your research topic and the information recorded by the press?
Newspapers are a natural information source in our research. First of all, the Centre for Preservation and Digitisation in Mikkeli, Finland, is responsible for storing historical and modern newspapers in digital form. At the same time, it is a major source on the Finnish language, history and culture.
Was this the first time that you had used newspapers as a source of information for research, and did it change the way that you perceive newspapers? In other words, were you surprised by what you could research using newspapers?
In earlier research, newspapers or news feeds were used to test the Websom system. This system, which automatically arranges text documents into an organized map, has been a forerunner in the area of information visualization and is widely cited in the scientific literature. For that purpose, newspapers were dealt with rather superficially. In our current research, we consider the contents in a more detailed manner.
How would you compare newspapers to other sources of information such as books and journals? Are there certain aspects of newspapers that just can’t be replicated anywhere else?
Newspapers provide an in-depth view into the functioning of a whole society. The details regarding people, places and events are often of a kind that cannot be found elsewhere.
In terms of your work process, did you use digital or paper copies of newspapers, and what kind of techniques did you use (e.g. simple keyword search, text mining)?
Our work is essentially text mining. Therefore, it is important that the newspapers are available in digital form. Regarding the historical newspapers, one problem is that the error rate in the output of Optical Character Recognition (OCR) is often rather high, and the output is sometimes not usable at all. Therefore, one of the first tasks is to devise automatic post-processing that corrects some proportion of the errors. Various approaches can be used, including language technology and statistical machine learning. The newspaper corpus contains mainly texts in Swedish and Finnish. As there are millions of pages, manual revision work is not possible. Finnish introduces challenges due to its complex morphology. Every Finnish noun has about 2,000 inflectional word forms and every verb more than 10,000. Therefore, one cannot simply list Finnish words and compare the OCR output with this kind of list. Regarding the complexities of Finnish, we collaborate with Dr. Krister Lindén, who directs the FIN-CLARIN consortium.
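To make the idea of automatic post-processing concrete, here is a minimal Python sketch of one possible statistical approach. It is an illustration only, not the method used in the project. Instead of looking tokens up in a word list, it proposes corrections from a small table of character confusions and keeps the candidate that a character bigram model finds most plausible. The confusion table, the toy lexicon and all function names are assumptions made up for this example.

```python
import math
from collections import Counter

# Hypothetical OCR confusions (misread -> intended). A real table would be
# estimated from aligned OCR output and ground-truth text.
CONFUSIONS = {"1": "l", "0": "o", "rn": "m"}

# Toy training words (inflected forms of Finnish "talo", house). A real model
# would be trained on a large clean corpus.
TOY_WORDS = ["talo", "talossa", "talosta", "taloon", "talot"]


def train_bigram_counts(words):
    """Count character bigrams; '^' marks word start and '$' word end."""
    pairs, singles = Counter(), Counter()
    for word in words:
        padded = "^" + word + "$"
        for a, b in zip(padded, padded[1:]):
            pairs[a + b] += 1
            singles[a] += 1
    return pairs, singles


def avg_log_prob(word, pairs, singles, alphabet_size=40):
    """Average add-one-smoothed log-probability per character transition.

    Averaging keeps shorter candidates from winning merely by having fewer terms.
    """
    padded = "^" + word + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        total += math.log((pairs[a + b] + 1) / (singles[a] + alphabet_size))
    return total / (len(padded) - 1)


def candidates(token):
    """Return the token plus every variant with exactly one confusion undone."""
    out = {token}
    for wrong, right in CONFUSIONS.items():
        start = token.find(wrong)
        while start != -1:
            out.add(token[:start] + right + token[start + len(wrong):])
            start = token.find(wrong, start + 1)
    return out


def correct(token, pairs, singles):
    """Pick the candidate the character model scores highest."""
    return max(candidates(token), key=lambda c: avg_log_prob(c, pairs, singles))


pairs, singles = train_bigram_counts(TOY_WORDS)
print(correct("ta1ossa", pairs, singles))
```

Because the model scores character sequences rather than whole words, it can prefer a plausible inflected form even when that exact form was never listed, which is the property that matters for a morphologically rich language such as Finnish.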
How do you use machine learning?
We apply machine learning methods and techniques in various ways. One basic approach is to build a language model. This model can predict the next letter in a word or the next word in a sentence based on a probabilistic model. We have also conducted conceptual modeling based on unsupervised learning algorithms such as the self-organizing map and independent component analysis. Machine learning can also be applied to various other tasks such as term extraction and sentiment analysis.
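As a rough illustration of a language model that predicts the next letter, the Python sketch below estimates, from a handful of words, how likely each letter is to follow a given letter. It is a deliberately tiny bigram model under assumed toy data; the word list and function names are hypothetical, and a model used in practice would draw on far more context and training text.

```python
from collections import Counter, defaultdict

# Toy training data, assumed for illustration only.
TOY_WORDS = ["talo", "talossa", "talosta", "taloon", "talot"]


def train_char_bigrams(words):
    """Count which character follows which; '^' is word start, '$' is word end."""
    counts = defaultdict(Counter)
    for word in words:
        padded = "^" + word.lower() + "$"
        for prev, nxt in zip(padded, padded[1:]):
            counts[prev][nxt] += 1
    return counts


def next_letter_probs(counts, prev):
    """Relative frequency of each character observed after `prev`."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    return {ch: n / total for ch, n in followers.items()} if total else {}


def predict_next_letter(counts, prev):
    """Most probable next character after `prev`, or None if `prev` is unseen."""
    probs = next_letter_probs(counts, prev)
    return max(probs, key=probs.get) if probs else None


model = train_char_bigrams(TOY_WORDS)
print(predict_next_letter(model, "l"))
print(next_letter_probs(model, "t"))
```

The same counting scheme carries over to words in a sentence by replacing characters with word tokens.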
Looking forward, how would you improve access to historic newspapers? Are there specific tools that need to be provided, or needs that should be met by libraries and digital archives?
Actually, this question relates nicely to the earlier discussion on improving the quality of scanned texts and finding means to text mine the contents of the articles. One could, in addition, build new kinds of views into the collections using visual text mining. The Websom project developed suitable techniques for this already in the 1990s. On the other hand, these kinds of sophisticated techniques have become feasible only recently thanks to better computational resources. Another interesting area is the use of multilingual language technologies. In the META-NET Network of Excellence, we were involved in taking this area further. It seems evident that the use of machine translation and cross-lingual information retrieval will increase.
What potential do you see for a pan-European archive such as the one being built by Europeana Newspapers? Could you, for example, extend your thesis by having access to newspapers from across Europe via a single website?
A pan-European archive and European collaboration in this area will be very important. In our institution, we have been discussing a platform for historical and societal research based on European newspaper collections. New text mining and machine translation technologies would enable research in which important historical developments and societal questions could be analyzed in an unforeseen manner. More specifically, we have developed a method called Grounded Intersubjective Concept Analysis (GICA) that can be used to analyze similarities and differences between concepts and conceptual systems among people and groups of people. This and related methods can be used to analyze corpora of hundreds of millions of European newspaper pages to find out how different conceptions of and perspectives on important historical events and societal themes have developed over time.
Another interview with Timo Honkela on “Making the most of digital materials” is available here on the website of the National Library of Finland.