Construction and evaluation of sentiment datasets for low-resource languages: the case of Uzbek

Kuriyozov, Elmurod; Matlatipov, Sanatbek; Alonso, Miguel A.; Gómez-Rodríguez, Carlos

Use this link to cite:

http://hdl.handle.net/2183/39913

Files

Kuriyozov_Elmurod_2022_Construction_and_evaluation_of_sentiment_Datasets_for_Low_Resource-Languages.pdf (543.03 KB)

Identifiers

URI: http://hdl.handle.net/2183/39913

DOI: 10.1007/978-3-031-05328-3_15

Publication date

2022-06

Authors

Kuriyozov, Elmurod

Matlatipov, Sanatbek

Alonso, Miguel A.

Gómez-Rodríguez, Carlos

Bibliographic citation

Kuriyozov, E., Matlatipov, S., Alonso, M.A., Gómez-Rodríguez, C. (2022). Construction and Evaluation of Sentiment Datasets for Low-Resource Languages: The Case of Uzbek. In: Vetulani, Z., Paroubek, P., Kubis, M. (eds) Human Language Technology. Challenges for Computer Science and Linguistics. LTC 2019. Lecture Notes in Computer Science, vol 13212. Springer, Cham. https://doi.org/10.1007/978-3-031-05328-3_15

Abstract

To our knowledge, most low-resource languages lack well-established linguistic resources for developing sentiment analysis applications. Such tools and resources are badly needed to overcome NLP barriers, so that these languages can benefit more from language technologies. In this paper, we fill that gap for Uzbek by providing the first annotated corpora for Uzbek polarity classification. Our methodology involves collecting a medium-size manually annotated dataset and a larger dataset automatically translated from existing resources. We then use these datasets to train what are, to our knowledge, the first sentiment analysis models for the Uzbek language, using both traditional machine learning techniques and recent deep learning models. Both sets of techniques achieve similar accuracy: the best model on the manually annotated test set is a convolutional neural network with 88.89% accuracy, and the best model on the translated set is a logistic regression with 89.56% accuracy. The accuracy of the deep learning models is limited by the quality of the available pre-trained word embeddings.

Description

This is the Author Accepted Manuscript. This version of the conference paper has been accepted for publication after peer review and is subject to Springer Nature’s AM terms of use (https://www.springernature.com/gp/open-science/policies/accepted-manuscript-terms), but it is not the Version of Record and does not reflect post-acceptance improvements or any corrections. The Version of Record is available online at: https://doi.org/10.1007/978-3-031-05328-3_15.
Conference paper presented at: 9th Language and Technology Conference, LTC 2019, Poznan, Poland, May 17–19, 2019.

Keywords

Sentiment analysis; Low-resource languages; Uzbek language

Editor version

https://doi.org/10.1007/978-3-031-05328-3_15

Collections

Investigación (FFIL)
