Research Projects

Finnish Internet Parsebank

PIs: Filip Ginter (TY), Veronika Laippala (TY)

2014-2017, funded by the Kone Foundation (Koneen Säätiö)

The Finnish Internet Parsebank is a joint project of the Department of Information Technology and the School of Languages and Translation Studies. It aims to produce a mass-scale corpus of Internet Finnish using automatic syntactic analysis and document classification.

Universal Dependency Parser

PI: Filip Ginter (TY)

2016-2020, funded by the Academy of Finland

Automated syntactic analysis is one of the fundamental tasks of language technology, used in many applications such as search engines and machine translation. For a small number of well-resourced and well-researched languages, like English, the accuracy of automatic parsers approaches human level. For many others, however, accuracy is significantly worse, and the vast majority of languages cannot be parsed automatically at all because the data needed to train the parsers has not been created. In this project, we are developing ways to pool the parser training data of many languages using vector space representations of words. This will result in a “Universal Parser” which operates in a language-independent manner and can, for instance, use Czech training data to improve its performance on Finnish.
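The idea of pooling training data through a shared vector space can be illustrated with a deliberately tiny sketch. It is not the project's parser: it predicts part-of-speech tags rather than syntactic trees, the two-dimensional vectors are hand-made stand-ins for real cross-lingual word embeddings, and all names (`EMBEDDINGS`, `train_centroids`, `predict`) are hypothetical. It only shows how a model trained on Czech examples can be applied to Finnish words once both languages share one vector space.

```python
import math

# ASSUMPTION: toy, hand-made 2-d vectors standing in for a shared
# cross-lingual embedding space. Real embeddings are learned from data.
EMBEDDINGS = {
    "pes": (0.90, 0.10),      # Czech "dog"
    "dům": (0.80, 0.20),      # Czech "house"
    "běží": (0.10, 0.90),     # Czech "runs"
    "spí": (0.20, 0.80),      # Czech "sleeps"
    "koira": (0.85, 0.15),    # Finnish "dog"
    "juoksee": (0.15, 0.85),  # Finnish "runs"
}

# Labelled training data exists only for Czech in this toy setup.
CZECH_TRAINING = [
    ("pes", "NOUN"),
    ("dům", "NOUN"),
    ("běží", "VERB"),
    ("spí", "VERB"),
]


def train_centroids(examples):
    """Average the vectors of all training words that share a tag."""
    sums, counts = {}, {}
    for word, tag in examples:
        x, y = EMBEDDINGS[word]
        sx, sy = sums.get(tag, (0.0, 0.0))
        sums[tag] = (sx + x, sy + y)
        counts[tag] = counts.get(tag, 0) + 1
    return {tag: (sx / counts[tag], sy / counts[tag])
            for tag, (sx, sy) in sums.items()}


def predict(word, centroids):
    """Return the tag whose centroid is closest to the word's vector."""
    vec = EMBEDDINGS[word]
    return min(centroids, key=lambda tag: math.dist(vec, centroids[tag]))


centroids = train_centroids(CZECH_TRAINING)
for finnish_word in ("koira", "juoksee"):
    print(finnish_word, predict(finnish_word, centroids))
```

Because the model only ever sees vectors, never the word forms themselves, it has no notion of which language a word comes from. That is the property that would let training data from several languages be pooled.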

U-bot: News Generation Using Advanced Language Technology Methods

PI: STT, Partners: Filip Ginter (TY), Namia Oy

2018-2019, funded by the Google Digital News Initiative (DNI)

U-bot uses advanced language technology to automatically generate news text based on facts from a data source and previous news stories in the same area. Most organisations producing automatically generated text use simple template slot filling. These templates usually have to be written by journalists, which takes time and effort. With U-bot, text generation is driven by advanced language technology methods, removing the need for time-consuming templates and letting the machine do the work.
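For contrast, the template slot filling that U-bot aims to move beyond can be sketched in a few lines. This is a minimal illustration, not code from any newsroom system: the template, the fact record, and the function name `fill_template` are all invented for the example.

```python
# ASSUMPTION: hypothetical template and fact record, invented for illustration.
# A journalist would have to write one such template per kind of story.
TEMPLATE = "{home} beat {away} {home_goals}-{away_goals} in {city} on {weekday}."


def fill_template(template: str, facts: dict) -> str:
    """Fill each named slot in the template with the matching fact."""
    return template.format(**facts)


facts = {
    "home": "Team A",
    "away": "Team B",
    "home_goals": 3,
    "away_goals": 1,
    "city": "Turku",
    "weekday": "Saturday",
}

print(fill_template(TEMPLATE, facts))
```

Every sentence produced this way has exactly the shape of its template. Covering draws, cancelled matches or a new topic means writing further templates by hand, and that manual effort is the cost described above.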

Structuring language use across multilingual web corpora

Partners: Veronika Laippala (TY), Jesse Egbert and Douglas Biber, Northern Arizona University, USA

2018-2019, funded by the Cultural Foundation of Finland and Fulbright

The project combines corpus linguistics and machine learning to study registers, i.e. language varieties used on the Internet, such as user manuals, news articles or film reviews. The objective is to define the linguistic characteristics of these registers and to develop methods for detecting them automatically in very large, web-crawled corpora, such as the Finnish Internet Parsebank and similar collections in other languages. This will improve the usability of such collections, because users can then focus on particular registers. In the long term, the project will also enhance the availability of information on the Internet, because the results can be used to detect the origins of web documents. As a consequence, a Google search could, for instance, be asked to focus on specific registers, such as news or product reviews.

Computational History and the Transformation of Public Discourse in Finland, 1640–1910

PI: Hannu Salmi (TY), Partners: Kimmo Kettunen (HY), Tapio Salakoski (TY), Mikko Tolonen (HY)

2016-2019, funded by the Academy of Finland

The consortium Computational History and the Transformation of Public Discourse in Finland, 1640–1910 is based on the shared expertise of the Faculty of Humanities at the University of Helsinki, the Departments of Cultural History and Information Technology at the University of Turku, and the Centre for Preservation and Digitisation of the National Library of Finland. Its objective is to reassess the scope, nature and transnational connections of public discourse in Finland, 1640–1910. Two complementary approaches will be used, one based on library catalogue metadata and the other on full-text mining of all digitized Finnish newspapers and journals published until 1910. The consortium will analyze how language barriers, elite culture and popular debate, text reuse, and different publication channels interacted. As a key methodological innovation, the consortium introduces the concept of open data analytical ecosystems.

Citizen Mindscapes – Detecting Social, Emotional and National Dynamics in Social Media

PI: Jussi Pakkasvirta (HY), Partners: Juha Alho (HY), Filip Ginter (TY), Juho Saari (TAY), Jaakko Suominen (TY)

2016-2018, funded by the Academy of Finland

Mindscapes24 builds a research frontier for social media analysis by focusing on Suomi24, Finland’s largest topic-centric social media platform and one of the world’s largest non-English online discussion fora. We bring together researchers from social sciences, digital culture, welfare sociology, language technology, and statistical data analysis to develop new ways of exploring social and political interaction. We approach Suomi24 from three perspectives: (1) the digital culture that produces social media, (2) novel visual tools and analysis methods for studying the digital content, and (3) a small number of spearhead research questions, such as characterizing the types of micro-interaction, understanding how heated debates might turn into political movements, and detecting emotional waves. In addition to an open data set made available through the Language Bank, the results will include a book on digital culture, visual tools for social scientists, and an international conference.

Profiling Premodern Authors

PI: Marjo Kaartinen (TY), Partners: Sampo Pyysalo (TY)

2016-2019, funded by the Academy of Finland

The consortium Profiling Premodern Authors (PROPREAU) applies and develops new computational methods based on machine learning to explore several fundamental and unresolved questions of authorship in classical and medieval Latin texts, ranging from Roman grammarians to the papal court and the works of the inquisitors. Despite the unsurpassed cultural importance of Latin, many essential texts remain anonymous. Identifying their authors requires the analysis and comparison of large quantities of text, often characterized by imitation of earlier sources. Computational models and machine learning have the potential to significantly alter our view of premodern authorship by allowing a much wider look at textual material than is attainable with conventional methods and a single human reader. The consortium expects to offer new, well-grounded answers to questions of authorship that were previously considered unsolvable, as well as guidelines for future efforts to identify the authors of anonymous premodern texts.