How Does the LIEPA-3 Project Empower Technology to Speak Lithuanian?

Created: 19 December 2024

The Lithuanian language has long been missing from many of the technologies we use every day. Why can't we talk to our smart devices in Lithuanian? Why doesn't a robot vacuum respond to commands given in Lithuanian? Why are voice assistants in our native language so limited? Researchers at Vilnius University (VU) have worked for over ten years to make technology more accessible to Lithuanian speakers.

The "Creation of the Large Lithuanian Speech Corpus" (LIEPA-3) project represents a significant milestone in Lithuanian language technology. LIEPA-3 seeks to create new opportunities for the language to be used in modern intelligent systems, so that Lithuanian plays an integral role in technology, on par with major languages like English and German.

"We live in a world where language technology is becoming increasingly essential in daily life. If Lithuanian does not flourish in this domain, we risk falling behind. LIEPA-3 gives us the opportunity to preserve and promote the Lithuanian language within modern technology," remarks Gediminas Navickas, a researcher at VU's Faculty of Mathematics and Informatics and one of the project's initiators.

According to Navickas, the project not only fosters the development of new technologies but also preserves the distinctive sound and character of the Lithuanian language for future generations. It therefore matters both for advancing language technologies and for the study of the language itself.

The Future of the Lithuanian Language in Technology

LIEPA-3 is an ongoing initiative to ensure the survival and adaptation of the Lithuanian language in today's digital landscape by enhancing the country's digitisation capabilities. It builds upon the achievements of the previous LIEPA and LIEPA-2 projects, broadening the scope of Lithuanian language technology.

Researchers from the Faculty of Mathematics and Informatics and the Faculty of Philology at Vilnius University also carried out the LIEPA and LIEPA-2 projects. Both centred on two primary objectives: developing information technology solutions that provide innovative services to the public, and building infrastructure for spoken Lithuanian, including speech corpora, speech synthesisers, and speech recognition systems.

A large team of scientists is currently working on the new LIEPA-3 project. Unlike its predecessors, this project focuses on a single primary objective: creating a large, annotated corpus of spoken Lithuanian. An annotated speech corpus is a structured collection of sound recordings that provide examples of Lithuanian speech, accompanied by corresponding time-stamped texts. The new corpus will comprise 10,000 hours of recordings, ten times the size of the largest existing Lithuanian speech corpus.
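As a rough illustration of what "sound recordings accompanied by time-stamped texts" could look like as data, here is a minimal Python sketch. The class names, field names, file path, and sample phrases are all assumptions made for illustration; the article does not describe the project's actual data format.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One time-stamped stretch of transcribed speech (hypothetical schema)."""
    start_s: float  # segment start, in seconds from the start of the recording
    end_s: float    # segment end, in seconds
    text: str       # transcription of what was said in this interval


@dataclass
class Recording:
    """An audio file plus its annotations (field names are illustrative)."""
    audio_path: str
    segments: list[Segment] = field(default_factory=list)

    def annotated_seconds(self) -> float:
        # Total duration covered by time-stamped transcriptions.
        return sum(s.end_s - s.start_s for s in self.segments)


def corpus_hours(recordings: list[Recording]) -> float:
    # Size of a corpus expressed in hours of annotated speech.
    return sum(r.annotated_seconds() for r in recordings) / 3600.0


# A made-up recording with two annotated segments.
example = Recording(
    audio_path="recordings/speaker_001.wav",  # hypothetical path
    segments=[
        Segment(0.0, 1.5, "Labas rytas."),
        Segment(1.5, 4.0, "Kaip sekasi?"),
    ],
)
```

Measured this way, a corpus is the sum of many such recordings, which is how a figure like 10,000 hours would be counted.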

With a corpus of this size, it becomes possible to develop Lithuanian speech recognition systems that allow computers and other devices to understand spoken Lithuanian accurately, meeting modern quality standards.

When asked about the three most important factors that determine the value of a property, real estate professionals often answer: location, location, and location. To rephrase that question: what are the three most crucial elements that contribute to the value of language technology? The answer is the same each time: a comprehensive and extensive corpus of spoken Lithuanian.

The 1,000-hour corpus produced by the LIEPA-2 project is relatively small compared to the corpora available for more technologically advanced languages. Beyond its size, the new corpus is significant because building it brings computer scientists together with philologists and linguists working in speech technology.

"This is a beautiful and meaningful example of collaboration and interdisciplinarity that has been present since the beginning of the first LIEPA project. The large-scale speech corpus is significant for speech technology and is an excellent foundation for a wide range of linguistic research. Notably, half of the corpus will consist of spontaneous speech, providing valuable insights into the state of contemporary spoken Lithuanian. Unfortunately, research in this area has been somewhat fragmented for lack of comprehensive and extensive data. The corpus is not only vital for speech technology and linguistic research; it may not be overly bold to compare it to the great Dictionary of the Lithuanian Language, which preserves not only the words of our language but also the essence of our identity. The corpus does so not in written form, but through the living word," said Vytautas Kardelis, a Professor in the Faculty of Philology at Vilnius University.

A large consortium is carrying out the project. It is led by the Faculty of Mathematics and Informatics at Vilnius University, which is collaborating with the Faculty of Philology and with partners Vytautas Magnus University and the Lithuanian Language Institute. Dr Gražina Korvel, the leader of the LIEPA-3 project and a researcher at VU's Faculty of Mathematics and Informatics, notes that "the timeframe for the project is very short, just over a year and a half, whereas usually such work would take at least three years. A strong consortium of experts in the field, consisting of experienced organisations, is carrying out this project, and we are confident they will complete it successfully."

What results can we expect?

The main goal of the LIEPA-3 project is to create an annotated corpus of 10,000 hours of spoken Lithuanian. Dr Korvel emphasises the significance of this work for both science and society, explaining that the team will compile the corpus according to the age, gender, and dialect region of the speakers. The corpus is meant to reflect the phonetic, morphological, syntactic, stylistic, and dialectal diversity of spoken Lithuanian, while also capturing variations in the acoustic background caused by the recording equipment and environment.

Creating the corpus involves a considerable amount of work, including collecting, processing, and assessing the accuracy of sound data. Once the team completes this, they will upload the data to open-access platforms, making the project results available to all interested parties. According to the project leader, having a publicly accessible and comprehensive speech corpus will enable researchers to develop advanced speech recognition, synthesis, and natural language processing techniques. This project will create opportunities for the advancement of artificial intelligence in Lithuania. Furthermore, this resource will be extremely valuable for research focused on social inclusion, helping us to better respond to the needs of individuals with disabilities and to develop technologies that are intuitive and accessible to everyone.

The project will create numerous opportunities for the practical application of its results. Specifically, Lithuanian researchers and technology developers will have the chance to advance language technologies and innovative e-services in Lithuania. Additionally, the publicly accessible Lithuanian speech corpus will encourage researchers in other countries to include the Lithuanian language in their language technology studies. The results will enhance the visibility of the Lithuanian language in the digital space and facilitate international collaboration.

According to Navickas, the project aims to support the implementation of the Ministry of Economy and Innovation's State Digitisation Development Programme. Its goals include increasing the accessibility of language technologies in the Lithuanian language and helping to modernise digital skills within society. "Over the next couple of years, the research team will create the corpus and make it publicly available for research and the development of digital solutions. This resource will support the development of higher-quality e-services and advance the overall digitisation process in Lithuania."