Assessment of Pre-Trained Models Across Languages and Grammars
| UDC.coleccion | Research | es_ES |
| UDC.conferenceTitle | 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (IJCNLP 2023) | es_ES |
| UDC.departamento | Letras | es_ES |
| UDC.endPage | 373 | es_ES |
| UDC.grupoInv | Lingua e Sociedade da Información (LYS) | es_ES |
| UDC.journalTitle | Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) | es_ES |
| UDC.startPage | 359 | es_ES |
| dc.contributor.author | Muñoz-Ortiz, Alberto | |
| dc.contributor.author | Vilares, David | |
| dc.contributor.author | Gómez-Rodríguez, Carlos | |
| dc.date.accessioned | 2024-05-22T11:58:07Z | |
| dc.date.available | 2024-05-22T11:58:07Z | |
| dc.date.issued | 2023-11 | |
| dc.description | Bali, Indonesia, November 1-4, 2023. | es_ES |
| dc.description.abstract | [Abstract]: We present an approach for assessing how multilingual large language models (LLMs) learn syntax in terms of multi-formalism syntactic structures. We aim to recover constituent and dependency structures by casting parsing as sequence labeling. To do so, we select a few LLMs and study them on 13 diverse UD treebanks for dependency parsing and 10 treebanks for constituent parsing. Our results show that: (i) the framework is consistent across encodings, (ii) pre-trained word vectors do not favor constituency representations of syntax over dependencies, (iii) sub-word tokenization is needed to represent syntax, in contrast to character-based models, and (iv) occurrence of a language in the pretraining data is more important than the amount of task data when recovering syntax from the word vectors. | es_ES |
| dc.description.sponsorship | We acknowledge the European Research Council (ERC), which has funded this research under the Horizon Europe research and innovation programme (SALSA, grant agreement No 101100615), ERDF/MICINN-AEI (SCANNER-UDC, PID2020-113230RB-C21), Xunta de Galicia (ED431C 2020/11), grant FPI 2021 (PID2020-113230RB-C21) funded by MCIN/AEI/10.13039/501100011033, and Centro de Investigación de Galicia “CITIC”, funded by the Xunta de Galicia through the collaboration agreement between the Consellería de Cultura, Educación, Formación Profesional e Universidades and the Galician universities for the reinforcement of the research centres of the Galician University System (CIGUS). | es_ES |
| dc.description.sponsorship | Xunta de Galicia; ED431C2020/11 | es_ES |
| dc.identifier.citation | Alberto Muñoz-Ortiz, David Vilares, and Carlos Gómez-Rodríguez. 2023. Assessment of Pre-Trained Models Across Languages and Grammars. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 359–373, Nusa Dua, Bali. Association for Computational Linguistics. | es_ES |
| dc.identifier.uri | http://hdl.handle.net/2183/36572 | |
| dc.language.iso | eng | es_ES |
| dc.publisher | Association for Computational Linguistics | es_ES |
| dc.relation.projectID | info:eu-repo/grantAgreement/EC/HE/101100615 | es_ES |
| dc.relation.projectID | info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2020-113230RB-C21/ES/MODELOS MULTITAREA DE ETIQUETADO SECUENCIAL PARA EL RECONOCIMIENTO DE ENTIDADES ENRIQUECIDO CON INFORMACIÓN LINGÜÍSTICA: SINTAXIS E INTEGRACIÓN MULTITAREA (SCANNER-UDC) | es_ES |
| dc.relation.projectID | info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2021-2023/PID2022-1393080A-100/ES/REPRESENTACIONES ESTRUCTURADAS VERDES Y ENCHUFABLES | es_ES |
| dc.relation.uri | https://aclanthology.org/2023.ijcnlp-main.23/ | es_ES |
| dc.rights | Attribution 3.0 Spain | es_ES |
| dc.rights.accessRights | open access | es_ES |
| dc.rights.uri | http://creativecommons.org/licenses/by/3.0/es/ | * |
| dc.subject | Syntax learning | es_ES |
| dc.subject | Sequence labeling | es_ES |
| dc.subject | Subword tokenization | es_ES |
| dc.subject | Pre-trained word vectors | es_ES |
| dc.subject | Language occurrence in pretraining data | es_ES |
| dc.title | Assessment of Pre-Trained Models Across Languages and Grammars | es_ES |
| dc.type | conference output | es_ES |
| dspace.entity.type | Publication | |
| relation.isAuthorOfPublication | edf1cde8-d272-4a73-bdd3-9be2361b7651 | |
| relation.isAuthorOfPublication | 37dabbe9-f54f-43bb-960e-0bf3ac7e54eb | |
| relation.isAuthorOfPublication | e70a3969-39f6-4458-9339-3b71756fa56e | |
| relation.isAuthorOfPublication.latestForDiscovery | edf1cde8-d272-4a73-bdd3-9be2361b7651 |
Files
Original bundle
- Name:
- Muñoz_Ortiz_2023_Assessment_pre-trained_models_across_lang_gram.pdf
- Size:
- 1.93 MB
- Format:
- Adobe Portable Document Format

