UZBEK TAGSET: CREATING A LIST OF MORPHOLOGICAL AND SYNTACTIC TAGS FOR BUILDING MACHINE LEARNING MODELS FOR THE UZBEK LANGUAGE
Keywords:
Uzbek language, syntactic tags, morphological tags, natural language processing, part of speech, HMM model
Abstract
The aim of this study is to develop a comprehensive list of syntactic and morphological tags for Uzbek, to be used in creating datasets for natural language processing (NLP) tasks in the language. Drawing on existing tagset models for other languages and taking into account the specific features of Uzbek, we propose a hierarchical tagset structure that covers word classes, morphological features, and syntactic functions. Using the resulting tagset, we built a Hidden Markov Model (HMM) for the part-of-speech (POS) tagging task.
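To make the HMM approach concrete, the following is a minimal sketch of a bigram HMM tagger with add-one smoothing and Viterbi decoding. It is not the model described in the study: the abstract gives no implementation details, so the function names (`train_hmm`, `log_prob`, `viterbi`), the smoothing scheme, the three toy training sentences, and the flat tags `PRON`, `NOUN`, `VERB` are all assumptions chosen for illustration and do not reproduce the proposed hierarchical tagset.

```python
import math
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical sentence-boundary markers


def train_hmm(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            word = word.lower()
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
        trans[prev][END] += 1
    return trans, emit, vocab


def log_prob(counts, key, n_outcomes):
    """Add-one smoothed log probability of `key` given a count table."""
    total = sum(counts.values())
    return math.log((counts.get(key, 0) + 1) / (total + n_outcomes))


def viterbi(words, trans, emit, vocab):
    """Return the most probable tag sequence for `words`."""
    if not words:
        return []
    tags = list(emit.keys())
    n_tags = len(tags) + 1    # all tags plus the end marker
    n_words = len(vocab) + 1  # known words plus one slot for unknown words
    # best[i][tag] = (log score of best path ending in tag, previous tag)
    best = [{}]
    for t in tags:
        score = (log_prob(trans[START], t, n_tags)
                 + log_prob(emit[t], words[0].lower(), n_words))
        best[0][t] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            e = log_prob(emit[t], words[i].lower(), n_words)
            best[i][t] = max(
                (best[i - 1][p][0] + log_prob(trans[p], t, n_tags) + e, p)
                for p in tags
            )
    # Pick the final tag, including the transition to the end marker.
    last = max(tags, key=lambda t: best[-1][t][0] + log_prob(trans[t], END, n_tags))
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]


# Toy training data with placeholder tags (not the tagset proposed in the study).
corpus = [
    [("Men", "PRON"), ("kitob", "NOUN"), ("o'qidim", "VERB")],
    [("U", "PRON"), ("maktabga", "NOUN"), ("bordi", "VERB")],
]
trans, emit, vocab = train_hmm(corpus)
print(viterbi(["Men", "maktabga", "bordi"], trans, emit, vocab))
```

A hierarchical tagset could be plugged into the same sketch by using composite tag strings as the HMM states, at the cost of sparser transition and emission counts; whether the study takes that route is not stated in the abstract.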
License
Copyright (c) 2025 Maqsud Sharipov, Hakimjon Zaynidinov

This work is licensed under a Creative Commons Attribution 4.0 International License.