Assessment of Pre-Trained Models Across Languages and Grammars

Muñoz-Ortiz, Alberto
Vilares, David
Gómez-Rodríguez, Carlos

2024-05-22
2024-05-22
2023-11

Alberto Muñoz-Ortiz, David Vilares, and Carlos Gómez-Rodríguez. 2023. Assessment of Pre-Trained Models Across Languages and Grammars. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 359–373, Nusa Dua, Bali. Association for Computational Linguistics.

http://hdl.handle.net/2183/36572

Bali, Indonesia. November 1-4, 2023.

[Abstract]: We present an approach for assessing how multilingual large language models (LLMs) learn syntax in terms of multi-formalism syntactic structures. We aim to recover constituent and dependency structures by casting parsing as sequence labeling. To do so, we select a few LLMs and study them on 13 diverse UD treebanks for dependency parsing and 10 treebanks for constituent parsing. Our results show that: (i) the framework is consistent across encodings, (ii) pre-trained word vectors do not favor constituency representations of syntax over dependencies, (iii) sub-word tokenization is needed to represent syntax, in contrast to character-based models, and (iv) occurrence of a language in the pretraining data is more important than the amount of task data when recovering syntax from the word vectors.

eng

Attribution 3.0 Spain
http://creativecommons.org/licenses/by/3.0/es/

Syntax learning; Sequence labeling; Subword tokenization; Pre-trained word vectors; Language occurrence in pretraining data

conference output
open access