VepKar is a unique digital platform created by linguists and mathematicians at the Karelian Research Centre RAS. Firstly, it is the world’s only corpus of the Veps and Karelian languages. A corpus is an information and reference system based on a collection of electronic texts of various genres. VepKar contains over nine thousand texts in 58 dialects, totalling nearly three million words. Almost all of them have been parsed – linguistically or metatextually – allowing users to recognize lexical, grammatical, and other characteristics of text elements. This is an invaluable source of information for researchers and language learners. Secondly, VepKar has evolved from a collection of texts into a multifunctional linguistic platform, which, in addition to the main library, hosts several major resources: the Audio Map of Balto-Finnic Languages of Karelia, the Multimedia Dictionary of the Karelian Language LiPaS – Livvin paginan sanat, and the Ludic Dialect Lexicon.

VepKar creators and specialists (left to right): Ekaterina Zakharova, Senior Researcher at the ILLH KarRC RAS Linguistics Section; Natalya Krizhanovskaya, Leading Research Engineer at the Laboratory of Information Computer Technologies, Institute of Applied Mathematical Research (IAMR) KarRC RAS; Irina Novak, ILLH KarRC RAS Director; Natalia Pellinen, Researcher at the ILLH KarRC RAS Linguistics Section; Alexandra Rodionova, Senior Researcher at the ILLH KarRC RAS Linguistics Section; Anastasia Runtova, Section Head at the Tver State University’s Scientific Library; Tatyana Boiko, Researcher at the ILLH KarRC RAS Linguistics Section; Andrey Krizhanovsky, Head of the Laboratory of Information Computer Technologies at IAMR KarRC RAS; and Nina Shibanova, Chief IT Specialist at ILLH KarRC RAS. Photo: I. Georgievsky / KarRC RAS
– The project was initiated by Dr. Nina Zaitseva. Under her leadership, work began in 2009 on the Veps Language Corpus, the precursor of the modern VepKar. In 2016, the resource was significantly expanded to include texts in Karelian, – recalled Alexandra Rodionova, recounting the platform's history. This year, VepKar celebrates its tenth anniversary.
Parallel texts, i.e. those supplied with Russian translations, accumulated within VepKar serve two purposes: developing the corpus as a research resource and building a technological foundation for the Karelian and Veps languages to be present on the web. In particular, VepKar is a platform where data is prepared for online Karelian and Veps translation tools being created in collaboration with the Federal Agency for Ethnic Affairs of Russia, the Ministry of Ethnic and Regional Policies of the Republic of Karelia, and Yandex.
– Training a machine translator requires a database of 100,000 sentences with Russian translations. To compile such a database, a new function for verifying the alignment of parallel texts at the sentence level was added to the corpus. To date, VepKar contains over 1,500 texts in Livvi Karelian with Russian translations. In total, more than 50,000 sentences have already been submitted to programmers at Yandex. At the same time, a similar database is being prepared for the Veps language, – said Alexandra Rodionova.

Senior Researcher at ILLH KarRC RAS Alexandra Rodionova at the conference’s plenary session
As for the VepKar subcorpora, the developers continue adding to the existing collections and also create new ones, strengthening the platform's role as an electronic library. As part of the work commemorating the 800th anniversary of the Christianization of the Karelians, the subcorpus of biblical texts was expanded and two new subcorpora were launched: "Written Language Heritage" and "Ethnographic Texts." The new subcorpora help trace the changes in Karelian folk traditions and rituals associated with the adoption of Christianity. They are currently being used as the basis for producing an interactive map, "The Fests Culture of South Karelia." The "Written Language Heritage" subcorpus incorporates digitized early handwritten and early printed texts in Karelian.
In recent years, the capabilities of VepKar for linguistic research have also been significantly expanded. This is primarily due to the word-form generators developed by specialists from the Institute of Linguistics, Literature and History and the Institute of Applied Mathematical Research KarRC RAS. Firstly, they have helped increase the share of automatic text mark-up across subcorpora to an average of 81.5%. Secondly, application of the word-form generators has resulted in the identification of some linguistic patterns, which have supplemented the new grammars of the Karelian and Veps languages. Finally, the new generators have significantly accelerated the work of editors and eliminated manual input errors.

Fragment of the VepKar homepage
– This is a vivid example of how the development of a tool promotes linguistics. A morphologically parsed corpus is a necessary foundation for creating a morphological analyzer and, subsequently, spelling checkers and machine translation systems for the Karelian and Veps languages, – explained Alexandra Rodionova.
Speaking about future development, the researcher mentioned improving the "predictor" (the module that suggests the most likely link between a word form and its dictionary meaning), creating an epistolary subcorpus, widening the opportunities for interdisciplinary research, and developing applied products based on corpus data, such as games and educational materials.
– The experience of VepKar shows that a corpus of an endangered language can simultaneously serve the tasks of language preservation, its scientific study, and digital development. This is especially critical for the minority languages of Russia – the corpus infrastructure provides the foundation without which neither full-fledged linguistics nor modern language technologies can exist, – the researcher concluded.
Photos: Mari State University Press Office





