Parsing the Switch: LLM-Based UD Annotation for Complex Code-Switched and Low-Resource Languages

Kellert, Olga; Tyagi, Nemika; Imran, Muhammad; Licona-Guevara, Nelvin; Gómez-Rodríguez, Carlos

Use this link to cite:

https://hdl.handle.net/2183/47335

Files

Imran_Muhammad_2025_Parsing_the_Switch.pdf (1.14 MB)

Identifiers

URI: https://hdl.handle.net/2183/47335

DOI: 10.18653/v1/2025.findings-emnlp.863

Publication date

2025-11

Authors

Kellert, Olga

Tyagi, Nemika

Imran, Muhammad

Licona-Guevara, Nelvin

Gómez-Rodríguez, Carlos

Bibliographic citation

Olga Kellert, Nemika Tyagi, Muhammad Imran, Nelvin Licona-Guevara, and Carlos Gómez-Rodríguez. 2025. Parsing the Switch: LLM-Based UD Annotation for Complex Code-Switched and Low-Resource Languages. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 15934–15949, Suzhou, China. Association for Computational Linguistics. DOI: 10.18653/v1/2025.findings-emnlp.863

Abstract

Code-switching presents a complex challenge for syntactic analysis, especially in low-resource language settings where annotated data is scarce. While recent work has explored the use of large language models (LLMs) for sequence-level tagging, few approaches systematically investigate how well these models capture syntactic structure in code-switched contexts. Moreover, existing parsers trained on monolingual treebanks often fail to generalize to multilingual and mixed-language input. To address this gap, we introduce the BiLingua Pipeline, an LLM-based annotation pipeline designed to produce Universal Dependencies (UD) annotations for code-switched text. First, we develop a prompt-based framework for Spanish-English and Spanish-Guaraní data, combining few-shot LLM prompting with expert review. Second, we release two annotated datasets, including the first Spanish-Guaraní UD-parsed corpus. Third, we conduct a detailed syntactic analysis of switch points across language pairs and communicative contexts. Experimental results show that BiLingua Pipeline achieves up to 95.29% LAS after expert revision, significantly outperforming prior baselines and multilingual parsers. These results show that LLMs, when carefully guided, can serve as practical tools for bootstrapping syntactic resources in under-resourced, code-switched environments.

Description

Presented at: Conference on Empirical Methods in Natural Language Processing (EMNLP 2025), Suzhou, China, November 4th to November 9th, 2025.

Keywords

Code-switching

Large language models (LLMs)

Universal Dependencies

Prompt-based framework

Editor version

https://doi.org/10.18653/v1/2025.findings-emnlp.863

Rights

Attribution 4.0 International

Collections

Investigación (FIC)

Except where otherwise noted, this item's license is described as Attribution 4.0 International