Show simple item record

dc.contributor.author	Kuriyozov, Elmurod
dc.contributor.author	Vilares, David
dc.contributor.author	Gómez-Rodríguez, Carlos
dc.date.accessioned	2024-09-05T10:13:24Z
dc.date.available	2024-09-05T10:13:24Z
dc.date.issued	2024-05
dc.identifier.citation	Elmurod Kuriyozov, David Vilares, and Carlos Gómez-Rodríguez. 2024. BERTbek: A Pretrained Language Model for Uzbek. In Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024, pages 33–44, Torino, Italia. ELRA and ICCL. https://aclanthology.org/2024.sigul-1.5	es_ES
dc.identifier.uri	http://hdl.handle.net/2183/38881
dc.description	All the code used in this work is openly available at https://github.com/elmurod1202/BERTbek. The BERTbek models have also been uploaded to the HuggingFace Models Hub at https://huggingface.co/elmurod1202/bertbek-news-big-cased.	es_ES
dc.description.abstract	[Abstract]: Recent advances in neural network-based language representation have made it possible for pretrained language models to outperform previous models on many downstream natural language processing (NLP) tasks. These pretrained language models have also shown that, if large enough, they exhibit good few-shot abilities, which is especially beneficial for low-resource scenarios. In this respect, although some large-scale multilingual pretrained language models are available, language-specific pretrained models have been shown to be more accurate in monolingual evaluation setups. In this work, we present BERTbek, pretrained language models based on the BERT (Bidirectional Encoder Representations from Transformers) architecture for the low-resource Uzbek language. We also provide a comprehensive evaluation of the models on a number of NLP tasks: sentiment analysis, multi-label topic classification, and named entity recognition, comparing the models with various machine learning methods as well as multilingual BERT (mBERT). Experimental results indicate that our models outperform mBERT and other task-specific baseline models in all three tasks. Additionally, we show the impact of training data size and quality on the downstream performance of BERT models by training three different models with different text sources and corpus sizes.	es_ES
dc.language.iso	eng	es_ES
dc.publisher	European Language Resources Association (ELRA)	es_ES
dc.relation.uri	https://aclanthology.org/2024.sigul-1.5	es_ES
dc.rights	Atribución-NoComercial 3.0 España	es_ES
dc.rights.uri	http://creativecommons.org/licenses/by-nc/3.0/es/	*
dc.subject	BERT	es_ES
dc.subject	language modeling	es_ES
dc.subject	low-resource languages	es_ES
dc.subject	natural language processing	es_ES
dc.subject	Uzbek language	es_ES
dc.title	BERTbek: A Pretrained Language Model for Uzbek	es_ES
dc.type	info:eu-repo/semantics/conferenceObject	es_ES
dc.rights.access	info:eu-repo/semantics/openAccess	es_ES
UDC.startPage	33	es_ES
UDC.endPage	44	es_ES
UDC.conferenceTitle	SIGUL 2024	es_ES
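
The description field above points to the released BERTbek checkpoints on the HuggingFace Models Hub. As a minimal usage sketch (not part of this record; it assumes the Hub repository elmurod1202/bertbek-news-big-cased exposes standard BERT weights and tokenizer files compatible with the transformers AutoModel interface), the model could be loaded and queried for masked-token predictions roughly as follows; the Uzbek example sentence is illustrative only.

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Model ID taken from the dc.description field above; loading it this way
# assumes standard BERT-style weights and tokenizer files on the Hub.
MODEL_ID = "elmurod1202/bertbek-news-big-cased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)
model.eval()

# Illustrative Uzbek sentence with one masked token (hypothetical example).
text = f"Toshkent O'zbekistonning {tokenizer.mask_token} shahridir."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and print the five most likely fillers.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_pos[0]].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))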


Files in this item


This item appears in the following collection(s)
