Within the Europeana Newspapers Project, we often speak of the value of historic newspapers for the academic community, but how exactly might a researcher use the material that we’re gathering? This month we’re interviewing Antal van den Bosch, full professor at the Centre for Language Studies and Communication and Information Sciences at Radboud University Nijmegen in the Netherlands.
Could you briefly describe your background and your main fields of research?
My background is in computational linguistics. I received a Master’s degree in computational linguistics in 1992 from Tilburg University at the faculty of Arts. Afterwards I went to Maastricht University also in the Netherlands where I did my PhD research at the computer science department. I continued working on computational linguistics especially on the application of machine learning, e.g. text to speech synthesis and text to text processing.
Over the years I developed an interest in hard challenges in the area of machine learning of natural language, e.g. in word sense disambiguation, machine translation, and language modeling. It is a busy field, but there is still a lot to be done.
Historical texts and archived materials have always been interesting for linguists. Data gathering has also been going on for quite some time, e.g. collections of newspaper archives, literary works, or folk tales. We, as computational linguists, generally analyze those materials on specific linguistic levels and generate automatic classifications for complete texts or cluster texts within the archive.
You used historic newspapers as a source in a project to find strikes that actually never happened – Could you please expand a bit more on the project and how it was related to digitized historic newspapers?
Actually, there were two projects. The first started five years ago, finished last year and was called HiTiME, which is an abbreviation of “Historical Timeline Mining and Extraction”. The project was carried out at Tilburg University and was about mining historical texts and text analysis. Our partner institute, the International Institute for Social History in Amsterdam, has a lot of databases and also texts covering all matters relating to the history of the social movement. One database was about strikes in the Netherlands; another contained a list of all labor unions that ever existed in the Netherlands. They also had terminology lists and authority files, basically structured pieces of information. They needed us to link them up and semantically connect them. Our question to them was: What do you want to do with that information? We asked several historians which kind of research they do, which kind of research they would like to do and so on. And they gave us a lot of hints. One of the things they often said was that they were interested in the dynamics of things. How do strikes evolve over time, what are the points of no return that lead to either nothing or to a strike? It is like at a crossroads. You rarely see an accident but often situations occur that almost go wrong. For researchers, strikes that did not happen are as interesting as strikes that did happen.
We linked this question to the digital collection team of the National Library of the Netherlands in The Hague, who were close to finishing their big newspaper digitization project.
We took the database that Sjaak van der Velden has been working on at the International Institute of Social History. It is a database of all the strikes that ever occurred in the Netherlands with information on where each strike took place, when it started, when and how it ended and so on. We took every item in the database and turned it into a query that said: find me articles that contain words like staken which means to strike in Dutch or staakten – the past tense – plus words from the database entry indicating location or factory name. We also restricted the dates in the query to a week before and three days after the date from the database. For each query we got a lot of articles. So we roughly linked articles to strikes. We focused on a couple of periods that were kind of hot according to historians. From these sample periods we got a lot of articles that were talking about strikes. If this produces fairly reliable data, which it did, you can build a classifier similar to a spam filter. You can then filter articles for strike threats for example by feeding the classifier with positive and negative examples, where positive examples are articles that are written before a strike. That filter can be applied to the complete archive. The result is a selection of articles that point at a particular moment of social unrest where a strike either was going to happen or not. This is where we stopped because it then becomes the task of the historian to work with the data, classify it, and look for patterns.
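The query construction described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `StrikeRecord` fields, the `build_query` and `matches` helpers, and the example record are all hypothetical, and the rule that any one location or factory term is enough is an assumption. Only the strike words and the date window (a week before, three days after) come from the interview.

```python
import re
from dataclasses import dataclass
from datetime import date, timedelta

# Dutch strike words named in the interview: "staken" and its past tense.
STRIKE_WORDS = ("staken", "staakten")


@dataclass
class StrikeRecord:
    """Hypothetical shape of one entry in the strike database."""
    start: date
    location: str
    company: str


def build_query(record):
    """Turn one database entry into a simple query description."""
    return {
        "strike_words": STRIKE_WORDS,
        # Location and factory name from the entry; empty fields are skipped.
        "context_terms": [t for t in (record.location, record.company) if t],
        # Window from the interview: a week before to three days after.
        "date_from": record.start - timedelta(days=7),
        "date_to": record.start + timedelta(days=3),
    }


def matches(query, article_text, article_date):
    """Check one article against the query (assumed matching rule)."""
    if not (query["date_from"] <= article_date <= query["date_to"]):
        return False
    text = article_text.lower()
    tokens = set(re.findall(r"\w+", text))
    has_strike_word = any(w in tokens for w in query["strike_words"])
    has_context = any(t.lower() in text for t in query["context_terms"])
    return has_strike_word and has_context


# Invented record, for illustration only.
record = StrikeRecord(date(1920, 5, 10), "Rotterdam", "Voorbeeldfabriek")
query = build_query(record)
```

Articles that pass `matches` would then serve as the roughly linked examples from which positive and negative training cases for the classifier are drawn.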
The gain is that we are saving the historian a lot of time. We prioritized the search. The historians that followed us said that there is no manual alternative to this type of very fast method.
Which were the main challenges you faced when working on these projects?
There were a few challenges. One was the question of how to scale up. At the beginning the National Library sent us hard disk drives with their files from the archive, because it isn’t possible to apply the filter to the whole archive over the internet. The network can’t handle it. Then there is the issue of OCR errors. The National Library did a great job, but OCR hasn’t yet reached a sufficient quality level to guarantee precise search and retrieval results, especially with older newspapers. Because of their paper quality, for example, OCR can be really bad. Fuzzy search is available for the archive, and one person within our group, Martin Reynaert, developed new post-correction tools for OCR. So work is being carried out to solve this problem, but it isn’t solved at the moment.
Was this the first time that you had used historical newspapers as a source and did the experiences you gathered from working with that source influence your later work?
We worked with digitally-born newspaper archives in the 1990s and 2000s by asking newspaper companies. There is a lot of applied work on current news, a lot on commercial and social media. The work with the National Library was our first encounter with historical newspapers; they had just released their newspaper archive. Now it has a lot of users but we were one of the first.
If one thinks of researchers working with historical newspapers one frequently thinks of historians or literary scholars. Your background is quite different. Can other disciplines use digitized newspapers for their research as well?
Speaking for my own field, we regard texts rather agnostically. Text is data for us. But if I may generalize a little, linguists have a big interest in such corpora. They study variation and change in language over time and geographical spaces. Newspapers give a fairly good idea about what was common to talk about in a certain region or time. I know some studies based on newspapers from Dutch colonies such as Surinam or Indonesia. The language of these newspapers shows interesting differences to standard Dutch newspapers. Some words came into use in the colonies and diffused into general Dutch in periods that are interesting both from a linguistic and from a historical point of view.
Is the gathered data collection interesting for completely different areas of research as well, e.g. as testing grounds for software tools, which are then utilized in another area?
You can find examples of that. Scaling is always an issue in technology but there are other archives available as well. The internet itself is a much bigger archive. Social media alone has produced a lot of data. In the Netherlands alone about 7 million tweets are posted every day and this since about 2010, when it started to be a mainstream type of social media. This is creating an amazing amount of data.
But for the issue of searching and indexing, having those archives available I think is really great as a scientific challenge, because of its noisiness. How do you reliably link canonical, present-day keywords to words with historical spellings, with and without OCR errors? How do you access this enormous amount of data which is noisier than average internet data?
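One common way to approach the linking question raised here is approximate string matching. The sketch below uses plain Levenshtein edit distance to find tokens within a small distance of a present-day keyword. It is only an illustration of the general idea and not the post-correction tooling mentioned earlier in the interview. The function names, the distance threshold, and the sample tokens are assumptions, and the noisy variants are invented for the example rather than taken from any archive.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]


def fuzzy_find(keyword, tokens, max_distance=1):
    """Return the tokens within max_distance edits of the keyword."""
    return [t for t in tokens
            if edit_distance(keyword, t.lower()) <= max_distance]


# Invented noisy tokens: one OCR-style confusion, one spelling variant,
# and one unrelated word.
tokens = ["stakcn", "staaken", "werken"]
hits = fuzzy_find("staken", tokens)
```

A low threshold keeps false matches down but misses heavily damaged words, while a higher one does the opposite. That trade-off is part of what makes noisy archives a hard search problem.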
You mentioned data mining. As a rather new field of research this area becomes more and more an issue. Could you maybe expand a bit more on the concepts and research interests behind that term?
Data mining is obviously more general than text mining. It is about exploring and discovering patterns that haven’t been linked so far in any kind of data. For example, newspapers reflect the economy or economic events and provide economic data in great detail. Often in retrospective research, e.g. socio-economic historical research, official publications have been used (e.g. company press releases and official publications of central statistics bureaus) whereas a newspaper has more to offer. In the official documents you might have four data points a year per company, while newspapers talk maybe a hundred times a year about the same company. For economic historians this is an interesting extra layer not present in databases so far.
You are also using Twitter as a source for research. What are the similarities and differences between historical newspapers and Twitter – a sort of contemporary counterpart?
I think that historical newspapers are more subjective than modern newspapers. In the Netherlands each had a very clear profile as e.g. social, liberal, catholic. They were rather antagonistic towards one or two of the other groups. You can see that very clearly. And they were written by journalists who were not trained as they are nowadays, where they are academics or have had professional training. This profession has definitely grown. What you see in Twitter, in contrast, is that obviously most users are not professional writers. You see subjectivity in Twitter to a higher degree than in newspapers. Another difference in Twitter is that tweets come rather close to speech. This is different from newspapers of any period, which have always been a written genre and intentionally not too colloquial – which Twitter is to a large extent.
What potential do you see for a pan-European digital collection such as the one being built by Europeana Newspapers?
The big picture should of course be on a European and global scale. Social processes such as how and why labor unions organize cannot be fully understood unless you go beyond the borders of a particular country. Social movements in the Netherlands can’t be seen without European movements. The Netherlands were in some cases trendsetting but usually followed other examples.
Studies that have been done on a smaller, monolingual and single-country scale should be lifted to a multilingual level. We did a second project after HiTiME with partners from the UK and US. It was funded by Digging into Data. We basically extended the study to English and got access to the archive of the New York Times. In essence we did the same analysis of strikes and news reports on strikes and reproduced the study. This is an example of what you should do: get multiple perspectives. The New York Times did not only cover major strikes in the U.S., especially on the East Coast and New York, but also strikes in Europe. The American perspective on a European strike is obviously framed within the international relations of the countries. For example, when there was a strike of coal miners, the Americans reported on the consequences for import and export.
Looking into the future, how would you improve access to historic newspapers? Are there specific tools that need to be provided, or needs that should be met by libraries and digital archives?
A good question and also not easy to answer. It is great that they are available. One issue that the National Library in The Hague pointed out to me is that they are rescanning things. It is great if the quality of the data improves. But they may not be providing access to all versions of newspapers they are scanning. The reproducibility of our experiments therefore becomes a problem and version control an issue. We did our study, but if the data is not held by us, and it shouldn’t be because of copyright issues, we can only refer to the data at the National Library. If they change even small things, the results of applying our methods to the Library’s current data will be different. This goes against a basic scientific principle that it should be possible to replicate results. It is not enough to publish our software; the same needs to hold for data. Libraries currently have trouble guaranteeing that because they may have trouble providing enough storage and retrieval or backup space.
Besides more storage and good backup schemes, there is a need for version control. We develop software that is dynamic but you need to know which software is used and it must be possible to roll back to older versions; the same goes for data. This is where all this v. 1.0, v. 2.0 and so on is coming from. So versioning in computer science is kind of an art but also an obligation. The same should go for digital collections as well.
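One lightweight way to make a data version citable is to publish a content fingerprint alongside each release of a collection, so that a study can state exactly which snapshot it used. The sketch below is a hedged illustration of that idea, not a description of any library's practice: the `fingerprint` helper, the document identifiers, and the sample texts are all hypothetical.

```python
import hashlib


def fingerprint(documents):
    """Return a SHA-256 hex digest over a snapshot of documents.

    `documents` maps a document identifier to its text. Identifiers
    are processed in sorted order so that the digest does not depend
    on the order in which the documents were loaded.
    """
    digest = hashlib.sha256()
    for doc_id in sorted(documents):
        digest.update(doc_id.encode("utf-8"))
        digest.update(b"\0")  # separator between identifier and text
        digest.update(documents[doc_id].encode("utf-8"))
        digest.update(b"\0")  # separator between documents
    return digest.hexdigest()


# Invented snapshots: the second differs by one corrected OCR character.
scan_v1 = {"page-001": "De arbeiders stakcn.", "page-002": "Tweede pagina."}
scan_v2 = {"page-001": "De arbeiders staken.", "page-002": "Tweede pagina."}

version_1 = fingerprint(scan_v1)
version_2 = fingerprint(scan_v2)
```

Because even a single corrected character changes the digest, a mismatch signals that results were computed on different data. It does not help retrieve the older version, which still requires the storage and version control discussed above.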
Mr. Van den Bosch, many thanks for this insightful interview.