Parsing the Switch: LLM-Based UD Annotation for Complex Code-Switched and Low-Resource Languages

UDC.coleccion: Investigación
UDC.conferenceTitle: Conference on Empirical Methods in Natural Language Processing (EMNLP 2025)
UDC.departamento: Ciencias da Computación e Tecnoloxías da Información
UDC.endPage: 15949
UDC.grupoInv: Lingua e Sociedade da Información (LYS)
UDC.startPage: 15934
dc.contributor.author: Kellert, Olga
dc.contributor.author: Tyagi, Nemika
dc.contributor.author: Imran, Muhammad
dc.contributor.author: Licona-Guevara, Nelvin
dc.contributor.author: Gómez-Rodríguez, Carlos
dc.date.accessioned: 2026-02-10T18:56:02Z
dc.date.available: 2026-02-10T18:56:02Z
dc.date.issued: 2025-11
dc.description: Presented at: Conference on Empirical Methods in Natural Language Processing (EMNLP 2025), Suzhou, China, November 4th to November 9th, 2025.
dc.description.abstract: Code-switching presents a complex challenge for syntactic analysis, especially in low-resource language settings where annotated data is scarce. While recent work has explored the use of large language models (LLMs) for sequence-level tagging, few approaches systematically investigate how well these models capture syntactic structure in code-switched contexts. Moreover, existing parsers trained on monolingual treebanks often fail to generalize to multilingual and mixed-language input. To address this gap, we introduce the BiLingua Pipeline, an LLM-based annotation pipeline designed to produce Universal Dependencies (UD) annotations for code-switched text. First, we develop a prompt-based framework for Spanish-English and Spanish-Guaraní data, combining few-shot LLM prompting with expert review. Second, we release two annotated datasets, including the first Spanish-Guaraní UD-parsed corpus. Third, we conduct a detailed syntactic analysis of switch points across language pairs and communicative contexts. Experimental results show that the BiLingua Pipeline achieves up to 95.29% LAS after expert revision, significantly outperforming prior baselines and multilingual parsers. These results show that LLMs, when carefully guided, can serve as practical tools for bootstrapping syntactic resources in under-resourced, code-switched environments.
dc.description.sponsorship: We thank the anonymous annotators and reviewers for their constructive suggestions and help. We extend our gratitude to the Research Computing (RC) and Enterprise Technology at ASU for providing computing resources and access to the ChatGPT enterprise version for experiments. We acknowledge grants GAP (PID2022-139308OA-I00) funded by MICIU/AEI/10.13039/501100011033 and ERDF, EU; LATCHING (PID2023-147129OB-C21) funded by MICIU/AEI/10.13039/501100011033 and ERDF, EU. CITIC, as a center accredited for excellence within the Galician University System and a member of the CIGUS Network, receives subsidies from the Department of Education, Science, Universities, and Vocational Training of the Xunta de Galicia. Additionally, it is co-financed by the EU through the FEDER Galicia 2021-27 operational program (Ref. ED431G 2023/01). Furthermore, this research was supported by the International, Interdisciplinary and Intersectoral Information and Communications Technology PhD programme (3-i ICT) granted to CITIC and supported by the European Union through the Horizon 2020 research and innovation programme under a Marie Skłodowska-Curie agreement (H2020-MSCA-COFUND), GA 101034261.
dc.description.sponsorship: Xunta de Galicia; ED431G 2023/01
dc.identifier.citation: Olga Kellert, Nemika Tyagi, Muhammad Imran, Nelvin Licona-Guevara, and Carlos Gómez-Rodríguez. 2025. Parsing the Switch: LLM-Based UD Annotation for Complex Code-Switched and Low-Resource Languages. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 15934–15949, Suzhou, China. Association for Computational Linguistics. DOI: 10.18653/v1/2025.findings-emnlp.863
dc.identifier.doi: 10.18653/v1/2025.findings-emnlp.863
dc.identifier.isbn: 979-8-89176-335-7
dc.identifier.uri: https://hdl.handle.net/2183/47335
dc.language.iso: eng
dc.publisher: Association for Computational Linguistics
dc.relation.projectID: info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2021-2023/PID2022-139308OA-100/ES/REPRESENTACIONES ESTRUCTURADAS VERDES Y ENCHUFABLES
dc.relation.projectID: info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2021-2023/PID2023-147129OB-C21/ES/TECNOLOGÍAS DEL LENGUAJE DESDE UNA PERSPECTIVA VERDE (LATCHING): DOMINIOS CON ESCASOS RECURSOS
dc.relation.projectID: info:eu-repo/grantAgreement/EC/H2020/101034261/EU
dc.relation.uri: https://doi.org/10.18653/v1/2025.findings-emnlp.863
dc.rights: Attribution 4.0 International
dc.rights.accessRights: open access
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject: Code-switching
dc.subject: Large language models (LLMs)
dc.subject: Universal Dependencies
dc.subject: Prompt-based framework
dc.title: Parsing the Switch: LLM-Based UD Annotation for Complex Code-Switched and Low-Resource Languages
dc.type: conference output
dspace.entity.type: Publication
relation.isAuthorOfPublication: da9e8872-ab78-4a1c-8212-1121388beb43
relation.isAuthorOfPublication: 6779b734-3d4b-4242-9bde-78e83eea84db
relation.isAuthorOfPublication: e70a3969-39f6-4458-9339-3b71756fa56e
relation.isAuthorOfPublication.latestForDiscovery: da9e8872-ab78-4a1c-8212-1121388beb43

Files

Original bundle

Name: Imran_Muhammad_2025_Parsing_the_Switch.pdf
Size: 1.14 MB
Format: Adobe Portable Document Format