Bertinho: Galician BERT Representations

Vilares, David; García, Marcos; Gómez-Rodríguez, Carlos

dc.contributor.author	Vilares, David
dc.contributor.author	García, Marcos
dc.contributor.author	Gómez-Rodríguez, Carlos
dc.date.accessioned	2024-07-16T11:09:36Z
dc.date.available	2024-07-16T11:09:36Z
dc.date.issued	2021-03
dc.identifier.citation	D. Vilares, M. García, and C. Gómez-Rodríguez, "Bertinho: Galician BERT Representations", Procesamiento del Lenguaje Natural, Revista nº 66, marzo de 2021, pp. 13-26, doi: 10.26342/2021-66-1	es_ES
dc.identifier.other	http://hdl.handle.net/10045/114222
dc.identifier.uri	http://hdl.handle.net/2183/38058
dc.description.abstract	[Abstract]: This paper presents a monolingual BERT model for Galician. We follow the recent trend that shows that it is feasible to build robust monolingual BERT models even for relatively low-resource languages, while performing better than the well-known official multilingual BERT (mBERT). More particularly, we release two monolingual Galician BERT models, built using 6 and 12 transformer layers, respectively; trained with limited resources (∼45 million tokens on a single GPU of 24GB). We then provide an exhaustive evaluation on a number of tasks such as POS-tagging, dependency parsing and named entity recognition. For this purpose, all these tasks are cast in a pure sequence labeling setup in order to run BERT without the need to include any additional layers on top of it (we only use an output classification layer to map the contextualized representations into the predicted label). The experiments show that our models, especially the 12-layer one, outperform the results of mBERT in most tasks.	es_ES
dc.description.abstract	[Resumen]: Este artículo presenta un modelo BERT monolingüe para el gallego. Nos basamos en la tendencia actual que ha demostrado que es posible crear modelos BERT monolingües robustos incluso para aquellos idiomas para los que hay una relativa escasez de recursos, funcionando éstos mejor que el modelo BERT multilingüe oficial (mBERT). Concretamente, liberamos dos modelos monolingües para el gallego, creados con 6 y 12 capas de transformers, respectivamente, y entrenados con una limitada cantidad de recursos (~45 millones de palabras sobre una única GPU de 24GB). Para evaluarlos realizamos un conjunto exhaustivo de experimentos en tareas como análisis morfosintáctico, análisis sintáctico de dependencias o reconocimiento de entidades. Para ello, abordamos estas tareas como etiquetado de secuencias, con el objetivo de ejecutar los modelos BERT sin la necesidad de incluir ninguna capa adicional (únicamente se añade la capa de salida encargada de transformar las representaciones contextualizadas en la etiqueta predicha). Los experimentos muestran que nuestros modelos, especialmente el de 12 capas, mejoran los resultados de mBERT en la mayor parte de las tareas.	es_ES
dc.description.sponsorship	This work has received funding from the European Research Council (ERC), which has funded this research under the European Union's Horizon 2020 research and innovation programme (FASTPARSE, grant agreement No 714150), from MINECO (ANSWER-ASAP, TIN2017-85160-C2-1-R), from Xunta de Galicia (ED431C 2020/11), from Centro de Investigación de Galicia 'CITIC', funded by Xunta de Galicia and the European Union (European Regional Development Fund - Galicia 2014-2020 Program), by grant ED431G 2019/01, and by Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), ERDF 2014-2020: Call ED431G 2019/04. DV is supported by a 2020 Leonardo Grant for Researchers and Cultural Creators from the BBVA Foundation. MG is supported by a Ramón y Cajal grant (RYC2019-028473-I).	es_ES
dc.description.sponsorship	Xunta de Galicia; ED431C 2020/11	es_ES
dc.description.sponsorship	Xunta de Galicia; ED431G 2019/01	es_ES
dc.description.sponsorship	Xunta de Galicia; ED431G 2019/04	es_ES
dc.language.iso	eng	es_ES
dc.publisher	Sociedad Española para el Procesamiento del Lenguaje Natural	es_ES
dc.relation	info:eu-repo/grantAgreement/EC/H2020/714150	es_ES
dc.relation	info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2013-2016/TIN2017-85160-C2-1-R/ES/AVANCES EN NUEVOS SISTEMAS DE EXTRACCION DE RESPUESTAS CON ANALISIS SEMANTICO Y APRENDIZAJE PROFUNDO	es_ES
dc.relation	info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/RYC2019-028473-I/ES/	es_ES
dc.relation.uri	https://doi.org/10.26342/2021-66-1	es_ES
dc.rights	Atribución-NoComercial-SinDerivadas 3.0 España	es_ES
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/es/	*
dc.subject	BERT	es_ES
dc.subject	Galician	es_ES
dc.subject	Embeddings	es_ES
dc.subject	Language modeling	es_ES
dc.subject	Gallego	es_ES
dc.subject	Modelado del lenguaje	es_ES
dc.title	Bertinho: Galician BERT Representations	es_ES
dc.title.alternative	Bertinho: Representaciones BERT para el gallego	es_ES
dc.type	info:eu-repo/semantics/article	es_ES
dc.rights.access	info:eu-repo/semantics/openAccess	es_ES
UDC.journalTitle	Procesamiento del Lenguaje Natural	es_ES
UDC.volume	66	es_ES
UDC.startPage	13	es_ES
UDC.endPage	26	es_ES
dc.identifier.doi	10.26342/2021-66-1

Files in this item

Name: Vilares_David_2021_Bertinho_Ga ...
Size: 655.0Kb
Format: PDF

Name: license_rdf
Size: 1.203Kb
Format: application/rdf+xml

This item appears in the following collection(s)

OpenAIRE [336]
GI-LYS - Artigos [49]
