Assessment of Pre-Trained Models Across Languages and Grammars

Muñoz-Ortiz, Alberto; Vilares, David; Gómez-Rodríguez, Carlos

dc.contributor.author	Muñoz-Ortiz, Alberto
dc.contributor.author	Vilares, David
dc.contributor.author	Gómez-Rodríguez, Carlos
dc.date.accessioned	2024-05-22T11:58:07Z
dc.date.available	2024-05-22T11:58:07Z
dc.date.issued	2023-11
dc.identifier.citation	Alberto Muñoz-Ortiz, David Vilares, and Carlos Gómez-Rodríguez. 2023. Assessment of Pre-Trained Models Across Languages and Grammars. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 359–373, Nusa Dua, Bali. Association for Computational Linguistics.	es_ES
dc.identifier.uri	http://hdl.handle.net/2183/36572
dc.description	Bali, Indonesia. November 1-4, 2023.	es_ES
dc.description.abstract	[Abstract]: We present an approach for assessing how multilingual large language models (LLMs) learn syntax in terms of multi-formalism syntactic structures. We aim to recover constituent and dependency structures by casting parsing as sequence labeling. To do so, we select a few LLMs and study them on 13 diverse UD treebanks for dependency parsing and 10 treebanks for constituent parsing. Our results show that: (i) the framework is consistent across encodings, (ii) pre-trained word vectors do not favor constituency representations of syntax over dependencies, (iii) sub-word tokenization is needed to represent syntax, in contrast to character-based models, and (iv) occurrence of a language in the pretraining data is more important than the amount of task data when recovering syntax from the word vectors.	es_ES
dc.description.sponsorship	We acknowledge the European Research Council (ERC), which has funded this research under the Horizon Europe research and innovation programme (SALSA, grant agreement No 101100615), ERDF/MICINN-AEI (SCANNER-UDC, PID2020-113230RB-C21), Xunta de Galicia (ED431C 2020/11), grant FPI 2021 (PID2020-113230RB-C21) funded by MCIN/AEI/10.13039/501100011033, and Centro de Investigación de Galicia “CITIC”, funded by the Xunta de Galicia through the collaboration agreement between the Consellería de Cultura, Educación, Formación Profesional e Universidades and the Galician universities for the reinforcement of the research centres of the Galician University System (CIGUS).	es_ES
dc.description.sponsorship	Xunta de Galicia; ED431C2020/11	es_ES
dc.language.iso	eng	es_ES
dc.publisher	Association for Computational Linguistics	es_ES
dc.relation	info:eu-repo/grantAgreement/EC/HE/101100615	es_ES
dc.relation	info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2020-113230RB-C21/ES/MODELOS MULTITAREA DE ETIQUETADO SECUENCIAL PARA EL RECONOCIMIENTO DE ENTIDADES ENRIQUECIDO CON INFORMACIÓN LINGÜÍSTICA: SINTAXIS E INTEGRACIÓN MULTITAREA (SCANNER-UDC)	es_ES
dc.relation	info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2021-2023/PID2022-1393080A-100/ES/REPRESENTACIONES ESTRUCTURADAS VERDES Y ENCHUFABLES	es_ES
dc.relation.uri	https://aclanthology.org/2023.ijcnlp-main.23/	es_ES
dc.rights	Attribution 3.0 Spain	es_ES
dc.rights.uri	http://creativecommons.org/licenses/by/3.0/es/	*
dc.subject	Syntax learning	es_ES
dc.subject	Sequence labeling	es_ES
dc.subject	Subword tokenization	es_ES
dc.subject	Pre-trained word vectors	es_ES
dc.subject	Language occurrence in pretraining data	es_ES
dc.title	Assessment of Pre-Trained Models Across Languages and Grammars	es_ES
dc.type	info:eu-repo/semantics/conferenceObject	es_ES
dc.rights.access	info:eu-repo/semantics/openAccess	es_ES
UDC.journalTitle	Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)	es_ES
UDC.startPage	359	es_ES
UDC.endPage	373	es_ES
UDC.conferenceTitle	13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (IJCNLP 2023)	es_ES

Files in this item

Name: Muñoz_Ortiz_2023_Assessment_pr ...
Size: 1.925Mb
Format: PDF

Name: license_rdf
Size: 1.337Kb
Format: application/rdf+xml

This item appears in the following collection(s)

OpenAIRE [357]
GI-LYS - Congresos, conferencias, etc. [71]