KarRC RAS. News

April 28, 2026

Linguists handed 50,000 Karelian sentences over to Yandex for creating an online translation tool

Specialists of the VepKar language platform continue preparing data for Yandex-based online Karelian and Veps translators. More than half of the required 100,000 sentences (texts in various genres) in the Livvi dialect of Karelian have already been submitted to the company. This and other prospects for the platform’s development were presented by Alexandra Rodionova, Senior Researcher at the Institute of Linguistics, Literature and History KarRC RAS, at the 7th International Conference "Digitizing the Languages of the Peoples of Russia: Experience Upscaling and Prospects" in Yoshkar-Ola.

The 7th International Conference "Digitizing the Languages of the Peoples of Russia: Experience Upscaling and Prospects" took place at Mari State University on April 16–17. Alexandra Rodionova, Senior Researcher at the Institute of Linguistics, Literature and History (ILLH) KarRC RAS, gave a talk at the plenary session. The linguist summed up ten years of operation of the open Karelian and Veps corpus VepKar and reflected on new directions and prospects for its development.

VepKar is a unique digital platform created by linguists and mathematicians at the Karelian Research Centre RAS. Firstly, it is the world’s only corpus of the Veps and Karelian languages. A corpus is an information and reference system based on a collection of electronic texts of various genres. VepKar contains over nine thousand texts in 58 dialects and nearly three million words. Almost all of them have been parsed – linguistically or metatextually – allowing users to recognize lexical, grammatical, and other characteristics of text elements. This is an invaluable source of information for researchers and language learners. Secondly, VepKar has evolved from a collection of texts into a multifunctional linguistic platform, which, in addition to the main library, hosts several major resources: Audio Map of Balto-Finnic Languages of Karelia, Multimedia Dictionary of the Karelian Language LiPaS – Livvin paginan sanat and Ludic Dialect Lexicon.

VepKar creators and specialists (left to right): Ekaterina Zakharova, Senior Researcher at the ILLH KarRC RAS Linguistics Section; Natalya Krizhanovskaya, Leading Research Engineer at the Laboratory of Information Computer Technologies, Institute of Applied Mathematical Research (IAMR) KarRC RAS; Irina Novak, ILLH KarRC RAS Director; Natalia Pellinen, Researcher at the ILLH KarRC RAS Linguistics Section; Alexandra Rodionova, Senior Researcher at the ILLH KarRC RAS Linguistics Section; Anastasia Runtova, Section Head at the Tver State University’s Scientific Library; Tatyana Boiko, Researcher at the ILLH KarRC RAS Linguistics Section; Andrey Krizhanovsky, Head of the Laboratory of Information Computer Technologies at IAMR KarRC RAS; and Nina Shibanova, Chief IT Specialist at ILLH KarRC RAS. Photo: I. Georgievsky / KarRC RAS

– The person who was at the origins of the project is Dr. Nina Zaitseva. Under her leadership, work began in 2009 on creating the Veps Language Corpus – the precursor of the modern VepKar. In 2016, the resource was significantly expanded to include texts in Karelian, – recalled Alexandra Rodionova, relating the platform's history. This year, VepKar celebrates its tenth anniversary.

Parallel texts, i.e. those supplied with Russian translations, accumulated within VepKar serve two purposes: developing the corpus as a research resource and building a technological foundation for the Karelian and Veps languages to be present on the web. In particular, VepKar is a platform where data is prepared for online Karelian and Veps translation tools being created in collaboration with the Federal Agency for Ethnic Affairs of Russia, the Ministry of Ethnic and Regional Policies of the Republic of Karelia, and Yandex.

– Training a machine translator requires a database of 100,000 sentences with Russian translations. While this database was being compiled in the corpus, a new function for verifying the alignment of parallel texts at the sentence level was implemented. To date, VepKar contains over 1,500 texts in Livvi Karelian with Russian translations. In total, more than 50,000 sentences have already been submitted to programmers at Yandex. At the same time, a similar database is being prepared for the Veps language, – said Alexandra Rodionova.

Senior Researcher at ILLH KarRC RAS Alexandra Rodionova at the conference’s plenary session

As to VepKar subcorpora, the developers not only continue adding to the existing collections but also create new ones, thus strengthening the platform's role as an electronic library. As part of the work commemorating the 800th anniversary of the Christianization of Karelians, the subcorpus of biblical texts was expanded and two new subcorpora were launched: "Written Language Heritage" and "Ethnographic Texts." The new subcorpora help trace the changes in Karelian folk traditions and rituals associated with the adoption of Christianity. They are currently being used as the basis for producing an interactive map "The Fests Culture of South Karelia." The "Written Language Heritage" subcorpus incorporates digitized early handwritten and early printed texts in Karelian.

In recent years, the capabilities of VepKar for linguistic research have also been significantly expanded. This is primarily due to the word-form generators developed by specialists from the Institute of Linguistics, Literature and History and the Institute of Applied Mathematical Research KarRC RAS. Firstly, they have helped increase the share of automatic text mark-up across subcorpora to an average of 81.5%. Secondly, application of the word-form generators has resulted in the identification of some linguistic patterns, which have supplemented the new grammars of the Karelian and Veps languages. Finally, the new generators have significantly accelerated the work of editors and eliminated manual input errors.

Fragment of the VepKar homepage

– This is a vivid example of how the development of a tool promotes linguistics. A morphologically parsed corpus is a necessary foundation for creating a morphological analyzer and, subsequently, spelling checkers and machine translation systems for the Karelian and Veps languages, – explained Alexandra Rodionova.

Speaking about potential advancements, the researcher mentioned improving the "predictor" (the module that suggests the most likely link between a word form and its dictionary meaning), creating an epistolary subcorpus, widening opportunities for interdisciplinary research, and developing applied products based on corpus data, such as games and educational materials.

– The experience of VepKar shows that a corpus of an endangered language can simultaneously serve the tasks of language preservation, its scientific study, and digital development. This is especially critical for the minority languages of Russia – the corpus infrastructure provides the foundation without which neither full-fledged linguistics nor modern language technologies can exist, – the researcher concluded.

Photos: Mari State University Press Office
