New Treebank or Repurposed? On the Feasibility of Cross-Lingual Parsing of Romance Languages with Universal Dependencies

García, Marcos; Gómez-Rodríguez, Carlos; Alonso, Miguel A.

dc.contributor.author	García, Marcos
dc.contributor.author	Gómez-Rodríguez, Carlos
dc.contributor.author	Alonso, Miguel A.
dc.date.accessioned	2017-12-13T15:13:16Z
dc.date.issued	2018-01
dc.identifier.citation	GARCIA, M., GÓMEZ-RODRÍGUEZ, C., & ALONSO, M. (2018). New treebank or repurposed? On the feasibility of cross-lingual parsing of Romance languages with Universal Dependencies. Natural Language Engineering, 24(1), 91-122. doi:10.1017/S1351324917000377	es_ES
dc.identifier.issn	1351-3249
dc.identifier.uri	http://hdl.handle.net/2183/19896
dc.description	This is the final peer-reviewed manuscript that was accepted for publication in Natural Language Engineering. Changes resulting from the publishing process, such as editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document.	es_ES
dc.description.abstract	[Abstract] This paper addresses the feasibility of cross-lingual parsing with Universal Dependencies (UD) between Romance languages, analyzing its performance when compared to the use of manually annotated resources of the target languages. Several experiments take into account factors such as the lexical distance between the source and target varieties, the impact of delexicalization, the combination of different source treebanks or the adaptation of resources to the target language, among others. The results of these evaluations show that the direct application of a parser from one Romance language to another reaches labeled attachment score (LAS) values similar to those obtained with a manual annotation of about 3,000 tokens in the target language, and unlabeled attachment score (UAS) results equivalent to the use of around 7,000 tokens, depending on the case. These numbers can noticeably increase by performing a focused selection of the source treebanks. Furthermore, the removal of the words in the training corpus (delexicalization) is not useful in most cases of cross-lingual parsing of Romance languages. The lessons learned from these experiments were used to build a new UD treebank for Galician, with 1,000 sentences manually corrected after an automatic cross-lingual annotation. Several evaluations on this new resource show that a cross-lingual parser built with the best combination and adaptation of the source treebanks performs better (77 percent LAS and 82 percent UAS) than using more than 16,000 (for LAS results) and more than 20,000 (UAS) manually labeled tokens of Galician.	es_ES
dc.description.sponsorship	Ministerio de Economía y Competitividad; FJCI-2014-22853	es_ES
dc.description.sponsorship	Ministerio de Economía y Competitividad; FFI2014-51978-C2-1-R	es_ES
dc.description.sponsorship	Ministerio de Economía y Competitividad; FFI2014-51978-C2-2-R	es_ES
dc.language.iso	eng	es_ES
dc.publisher	Cambridge University Press	es_ES
dc.relation	info:eu-repo/grantAgreement/EC/H2020/714150
dc.relation.uri	https://doi.org/10.1017/S1351324917000377	es_ES
dc.rights	This article has been published in a revised form in Natural Language Engineering (https://doi.org/10.1017/S1351324917000377). This version is free to view and download for private research and study only. Not for re-distribution, re-sale or use in derivative works. © Cambridge University Press.
dc.subject	Universal dependencies	es_ES
dc.subject	Parsing	es_ES
dc.subject	Cross-lingual	es_ES
dc.subject	Treebank	es_ES
dc.subject	Linguistic resources	es_ES
dc.title	New Treebank or Repurposed? On the Feasibility of Cross-Lingual Parsing of Romance Languages with Universal Dependencies	es_ES
dc.type	info:eu-repo/semantics/article	es_ES
dc.rights.access	info:eu-repo/semantics/embargoedAccess	es_ES
dc.date.embargoEndDate	2018-07-01	es_ES
dc.date.embargoLift	2018-07-01
UDC.journalTitle	Natural Language Engineering	es_ES
UDC.volume	24	es_ES
UDC.issue	1	es_ES
UDC.startPage	91	es_ES
UDC.endPage	122	es_ES
dc.identifier.doi	10.1017/S1351324917000377

Files in this item

Name: Garcia_Marcos_Gomez_Rodriguez_ ...
Size: 878.0Kb
Format: PDF

This item appears in the following collection(s)

GI-LYS - Artigos [43]
OpenAIRE [287]