Assessment of Pre-Trained Models Across Languages and Grammars

Muñoz-Ortiz, Alberto; Vilares, David; Gómez-Rodríguez, Carlos

Use this link to cite:

http://hdl.handle.net/2183/36572

Files

Muñoz_Ortiz_2023_Assessment_pre-trained_models_across_lang_gram.pdf (1.93 MB)

Identifiers

URI: http://hdl.handle.net/2183/36572

Publication date

2023-11

Authors

Muñoz-Ortiz, Alberto

Vilares, David

Gómez-Rodríguez, Carlos

Bibliographic citation

Alberto Muñoz-Ortiz, David Vilares, and Carlos Gómez-Rodríguez. 2023. Assessment of Pre-Trained Models Across Languages and Grammars. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 359–373, Nusa Dua, Bali. Association for Computational Linguistics.

Abstract

[Abstract]: We present an approach for assessing how multilingual large language models (LLMs) learn syntax in terms of multi-formalism syntactic structures. We aim to recover constituent and dependency structures by casting parsing as sequence labeling. To do so, we select a few LLMs and study them on 13 diverse UD treebanks for dependency parsing and 10 treebanks for constituent parsing. Our results show that: (i) the framework is consistent across encodings, (ii) pre-trained word vectors do not favor constituency representations of syntax over dependencies, (iii) sub-word tokenization is needed to represent syntax, in contrast to character-based models, and (iv) occurrence of a language in the pretraining data is more important than the amount of task data when recovering syntax from the word vectors.

Description

Bali, Indonesia. November 1-4, 2023.

Keywords

Syntax learning

Sequence labeling

Subword tokenization

Pre-trained word vectors

Language occurrence in pretraining data

Editor version

https://aclanthology.org/2023.ijcnlp-main.23/

Rights

Attribution 3.0 Spain

Collections

Investigación (FFIL)

Except where otherwise noted, this item's license is described as Attribution 3.0 Spain