Accuracy of Slovak Language Lemmatization and MSD Tagging – MorphoDiTa and SpaCy

Garabík, Radovan; Mitana, Denis

Use this URL to cite the publication: https://cris.mruni.eu/cris/handle/007/18680

Type of publication

Theses in other peer-reviewed publication (T1e)

Type of publication (old)

T2

Author(s)

Author
Garabík, Radovan

Title

Accuracy of Slovak Language Lemmatization and MSD Tagging – MorphoDiTa and SpaCy

Date Issued

Date
2022

Is part of

LLOD Approaches for Language Data Research and Management LLODREAM2022 : International Scientific Interdisciplinary Conference, September 21-22, 2022 : Abstract Book. ISBN 9786094880414

Abstract (en)

Slovak, as a “typical” Slavic language, belongs to the group of moderately inflected languages, with three or four genders and two grammatical numbers, all interacting with inflection in complicated and often unpredictable ways. Inflection is realized primarily by suffixes, but with many irregularities: a single suffix encodes several grammatical categories at once, and the same suffix often marks unrelated features in other words. This is typical of an inflectional language and makes Slovak poorly suited to heuristic analysis. As a consequence, lemmatization is often an indispensable step in all kinds of text processing (starting with full-text search), and full morphosyntactic analysis or description (MSD) is the core of corpus linguistic research. Given the central role of lemmatization and MSD tagging in Slovak corpus linguistics, it is important to understand their limitations and to know what accuracy is achievable. Since modern approaches aim to use deep learning and large language models, we evaluate the accuracy of lemmatization and MSD tagging in several common usage scenarios by comparing two tools: MorphoDiTa, a state-of-the-art “classical” lemmatizer and MSD tagger based on the perceptron, and spaCy, using a multilingual BERT language model.
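The abstract does not state how accuracy is scored, so the following is only a minimal sketch under an assumed definition: token-level accuracy over gold-standard and predicted annotations that are already aligned one to one. The `Token` type, the `accuracy` function, and the tag strings (`TAG_A` and so on) are hypothetical placeholders, not the authors' evaluation code and not a real tagset; the output of either tool would first have to be converted into this shape.

```python
from typing import Dict, NamedTuple, Sequence


class Token(NamedTuple):
    """One annotated token: surface form, lemma, and MSD tag (hypothetical layout)."""
    form: str
    lemma: str
    msd: str


def accuracy(gold: Sequence[Token], pred: Sequence[Token]) -> Dict[str, float]:
    """Token-level accuracy for lemmas, MSD tags, and both together.

    Assumes gold and pred are tokenized identically and aligned by position.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must be aligned token by token")
    if not gold:
        raise ValueError("cannot score an empty sample")

    n = len(gold)
    pairs = list(zip(gold, pred))
    lemma_ok = sum(g.lemma == p.lemma for g, p in pairs)
    msd_ok = sum(g.msd == p.msd for g, p in pairs)
    # A token counts as jointly correct only if lemma and tag both match.
    joint_ok = sum(g.lemma == p.lemma and g.msd == p.msd for g, p in pairs)
    return {"lemma": lemma_ok / n, "msd": msd_ok / n, "joint": joint_ok / n}


# Toy data; the tags are placeholders, not a real Slovak tagset.
gold = [
    Token("mačky", "mačka", "TAG_A"),
    Token("spia", "spať", "TAG_B"),
    Token(".", ".", "TAG_C"),
    Token("dnes", "dnes", "TAG_D"),
]
pred = [
    Token("mačky", "mačka", "TAG_X"),  # correct lemma, wrong tag
    Token("spia", "spia", "TAG_B"),    # wrong lemma, correct tag
    Token(".", ".", "TAG_C"),
    Token("dnes", "dnes", "TAG_D"),
]

scores = accuracy(gold, pred)
print(scores)
```

Reporting the joint score alongside the two separate scores is a design choice: a tool can do well on lemmas and on tags individually while making its errors on different tokens, which lowers the share of fully correct tokens.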

Type of document

type::text::journal::journal article::research article

Language

English (en)

URI

URI
https://cris.mruni.eu/cris/handle/007/18680

Access Rights

Open Access

File(s)

LLOD_2022 Book of Abstracts-94-96.pdf (61.78 KB)

Owning collection

2. Other Research Publications