Assessment of Pre-Trained Models Across Languages and Grammars

View/Open
Muñoz_Ortiz_2023_Assessment_pre-trained_models_across_lang_gram.pdf (1.925Mb)
Use this link to cite
http://hdl.handle.net/2183/36572
Atribución 3.0 España (Attribution 3.0 Spain)
Except where otherwise noted, this item's license is described as Atribución 3.0 España (Attribution 3.0 Spain)
Collections
  • Investigación (FFIL) [877]
Metadata
Title
Assessment of Pre-Trained Models Across Languages and Grammars
Author(s)
Muñoz-Ortiz, Alberto
Vilares, David
Gómez-Rodríguez, Carlos
Date
2023-11
Citation
Alberto Muñoz-Ortiz, David Vilares, and Carlos Gómez-Rodríguez. 2023. Assessment of Pre-Trained Models Across Languages and Grammars. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 359–373, Nusa Dua, Bali. Association for Computational Linguistics.
Abstract
[Abstract]: We present an approach for assessing how multilingual large language models (LLMs) learn syntax in terms of multi-formalism syntactic structures. We aim to recover constituent and dependency structures by casting parsing as sequence labeling. To do so, we select a few LLMs and study them on 13 diverse UD treebanks for dependency parsing and 10 treebanks for constituent parsing. Our results show that: (i) the framework is consistent across encodings, (ii) pre-trained word vectors do not favor constituency representations of syntax over dependencies, (iii) sub-word tokenization is needed to represent syntax, in contrast to character-based models, and (iv) occurrence of a language in the pretraining data is more important than the amount of task data when recovering syntax from the word vectors.
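As an illustration of the "parsing as sequence labeling" idea mentioned in the abstract, the sketch below encodes a dependency tree as one label per word, combining a relative head offset with the dependency relation. This is a minimal, generic encoding written for illustration; the exact encodings evaluated in the paper may differ, and the function and label format here are assumptions, not the authors' implementation.

# Minimal sketch (assumed encoding): dependency parsing as sequence labeling,
# one label per word of the form "<relative head offset>@<relation>".

def encode(heads, deprels):
    """heads[i] is the 1-based head index of word i+1 (0 = root);
    deprels[i] is its dependency relation. Returns one label per word."""
    labels = []
    for i, (head, rel) in enumerate(zip(heads, deprels), start=1):
        offset = head - i  # position of the head relative to the current word
        labels.append(f"{offset}@{rel}")
    return labels

def decode(labels):
    """Invert encode(): recover (heads, deprels) from the label sequence."""
    heads, deprels = [], []
    for i, label in enumerate(labels, start=1):
        offset, rel = label.split("@", 1)
        heads.append(i + int(offset))
        deprels.append(rel)
    return heads, deprels

# Example: "She reads books", heads [2, 0, 2] (word 2 "reads" is the root).
labels = encode([2, 0, 2], ["nsubj", "root", "obj"])
print(labels)          # ['1@nsubj', '-2@root', '-1@obj']
print(decode(labels))  # ([2, 0, 2], ['nsubj', 'root', 'obj'])

Under such an encoding, a standard token-classification model over the LLM's word vectors can predict the labels, and the tree is recovered by decoding them, which is what allows the paper to probe how much syntax the pre-trained representations capture.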
Keywords
Syntax learning
Sequence labeling
Subword tokenization
Pre-trained word vectors
Language occurrence in pretraining data
 
Description
Bali, Indonesia. November 1-4, 2023.
Editor version
https://aclanthology.org/2023.ijcnlp-main.23/
Rights
Atribución 3.0 España (Attribution 3.0 Spain)
