Parsing the Switch: LLM-Based UD Annotation for Complex Code-Switched and Low-Resource Languages

Kellert, Olga; Tyagi, Nemika; Imran, Muhammad; Licona-Guevara, Nelvin; Gómez-Rodríguez, Carlos

UDC.coleccion	Investigación
UDC.conferenceTitle	Conference on Empirical Methods in Natural Language Processing (EMNLP 2025)
UDC.departamento	Ciencias da Computación e Tecnoloxías da Información
UDC.endPage	15949
UDC.grupoInv	Lingua e Sociedade da Información (LYS)
UDC.startPage	15934
dc.contributor.author	Kellert, Olga
dc.contributor.author	Tyagi, Nemika
dc.contributor.author	Imran, Muhammad
dc.contributor.author	Licona-Guevara, Nelvin
dc.contributor.author	Gómez-Rodríguez, Carlos
dc.date.accessioned	2026-02-10T18:56:02Z
dc.date.available	2026-02-10T18:56:02Z
dc.date.issued	2025-11
dc.description	Presented at: Conference on Empirical Methods in Natural Language Processing (EMNLP 2025), Suzhou, China, November 4th to November 9th, 2025.
dc.description.abstract	[Abstract]: Code-switching presents a complex challenge for syntactic analysis, especially in low-resource language settings where annotated data is scarce. While recent work has explored the use of large language models (LLMs) for sequence-level tagging, few approaches systematically investigate how well these models capture syntactic structure in code-switched contexts. Moreover, existing parsers trained on monolingual treebanks often fail to generalize to multilingual and mixed-language input. To address this gap, we introduce the BiLingua Pipeline, an LLM-based annotation pipeline designed to produce Universal Dependencies (UD) annotations for code-switched text. First, we develop a prompt-based framework for Spanish-English and Spanish-Guaraní data, combining few-shot LLM prompting with expert review. Second, we release two annotated datasets, including the first Spanish-Guaraní UD-parsed corpus. Third, we conduct a detailed syntactic analysis of switch points across language pairs and communicative contexts. Experimental results show that BiLingua Pipeline achieves up to 95.29% LAS after expert revision, significantly outperforming prior baselines and multilingual parsers. These results show that LLMs, when carefully guided, can serve as practical tools for bootstrapping syntactic resources in under-resourced, code-switched environments.
dc.description.sponsorship	We thank the anonymous annotators and reviewers for their constructive suggestions and help. We extend our gratitude to the Research Computing (RC) and Enterprise Technology at ASU for providing computing resources and access to the ChatGPT enterprise version for experiments. We acknowledge grants GAP (PID2022-139308OA-I00) funded by MICIU/AEI/10.13039/501100011033 and ERDF, EU; LATCHING (PID2023-147129OB-C21) funded by MICIU/AEI/10.13039/501100011033 and ERDF, EU. CITIC, as a center accredited for excellence within the Galician University System and a member of the CIGUS Network, receives subsidies from the Department of Education, Science, Universities, and Vocational Training of the Xunta de Galicia. Additionally, it is co-financed by the EU through the FEDER Galicia 2021-27 operational program (Ref. ED431G 2023/01). Furthermore, this research was supported by the International, Interdisciplinary and Intersectoral Information and Communications Technology PhD programme (3-i ICT) granted to CITIC and supported by the European Union through the Horizon 2020 research and innovation programme under a Marie Skłodowska-Curie agreement (H2020-MSCA-COFUND), GA 101034261.
dc.description.sponsorship	Xunta de Galicia; ED431G 2023/01
dc.identifier.citation	Olga Kellert, Nemika Tyagi, Muhammad Imran, Nelvin Licona-Guevara, and Carlos Gómez-Rodríguez. 2025. Parsing the Switch: LLM-Based UD Annotation for Complex Code-Switched and Low-Resource Languages. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 15934–15949, Suzhou, China. Association for Computational Linguistics. DOI: 10.18653/v1/2025.findings-emnlp.863
dc.identifier.doi	10.18653/v1/2025.findings-emnlp.863
dc.identifier.isbn	979-8-89176-335-7
dc.identifier.uri	https://hdl.handle.net/2183/47335
dc.language.iso	eng
dc.publisher	Association for Computational Linguistics
dc.relation.projectID	info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2021-2023/PID2022-139308OA-I00/ES/REPRESENTACIONES ESTRUCTURADAS VERDES Y ENCHUFABLES
dc.relation.projectID	info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2021-2023/PID2023-147129OB-C21/ES/TECNOLOGÍAS DEL LENGUAJE DESDE UNA PERSPECTIVA VERDE (LATCHING): DOMINIOS CON ESCASOS RECURSOS
dc.relation.projectID	info:eu-repo/grantAgreement/EC/H2020/101034261/EU
dc.relation.uri	https://doi.org/10.18653/v1/2025.findings-emnlp.863
dc.rights	Attribution 4.0 International	en
dc.rights.accessRights	open access
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/
dc.subject	Code-switching
dc.subject	Large language models (LLMs)
dc.subject	Universal Dependencies
dc.subject	Prompt-based framework
dc.title	Parsing the Switch: LLM-Based UD Annotation for Complex Code-Switched and Low-Resource Languages
dc.type	conference output
dspace.entity.type	Publication
relation.isAuthorOfPublication	da9e8872-ab78-4a1c-8212-1121388beb43
relation.isAuthorOfPublication	6779b734-3d4b-4242-9bde-78e83eea84db
relation.isAuthorOfPublication	e70a3969-39f6-4458-9339-3b71756fa56e
relation.isAuthorOfPublication.latestForDiscovery	da9e8872-ab78-4a1c-8212-1121388beb43

Files

Original bundle

Name: Imran_Muhammad_2025_Parsing_the_Switch.pdf
Size: 1.14 MB
Format: Adobe Portable Document Format

Collections

Investigación (FIC)