Parsing the Switch: LLM-Based UD Annotation for Complex Code-Switched and Low-Resource Languages
| UDC.coleccion | Investigación | |
| UDC.conferenceTitle | Conference on Empirical Methods in Natural Language Processing (EMNLP 2025) | |
| UDC.departamento | Ciencias da Computación e Tecnoloxías da Información | |
| UDC.endPage | 15949 | |
| UDC.grupoInv | Lingua e Sociedade da Información (LYS) | |
| UDC.startPage | 15934 | |
| dc.contributor.author | Kellert, Olga | |
| dc.contributor.author | Tyagi, Nemika | |
| dc.contributor.author | Imran, Muhammad | |
| dc.contributor.author | Licona-Guevara, Nelvin | |
| dc.contributor.author | Gómez-Rodríguez, Carlos | |
| dc.date.accessioned | 2026-02-10T18:56:02Z | |
| dc.date.available | 2026-02-10T18:56:02Z | |
| dc.date.issued | 2025-11 | |
| dc.description | Presented at: Conference on Empirical Methods in Natural Language Processing (EMNLP 2025), Suzhou, China, November 4th to November 9th, 2025. | |
| dc.description.abstract | [Abstract]: Code-switching presents a complex challenge for syntactic analysis, especially in low-resource language settings where annotated data is scarce. While recent work has explored the use of large language models (LLMs) for sequence-level tagging, few approaches systematically investigate how well these models capture syntactic structure in code-switched contexts. Moreover, existing parsers trained on monolingual treebanks often fail to generalize to multilingual and mixed-language input. To address this gap, we introduce the BiLingua Pipeline, an LLM-based annotation pipeline designed to produce Universal Dependencies (UD) annotations for code-switched text. First, we develop a prompt-based framework for Spanish-English and Spanish-Guaraní data, combining few-shot LLM prompting with expert review. Second, we release two annotated datasets, including the first Spanish-Guaraní UD-parsed corpus. Third, we conduct a detailed syntactic analysis of switch points across language pairs and communicative contexts. Experimental results show that BiLingua Pipeline achieves up to 95.29% LAS after expert revision, significantly outperforming prior baselines and multilingual parsers. These results show that LLMs, when carefully guided, can serve as practical tools for bootstrapping syntactic resources in under-resourced, code-switched environments. | |
| dc.description.sponsorship | We thank the anonymous annotators and reviewers for their constructive suggestions and help. We extend our gratitude to the Research Computing (RC) and Enterprise Technology at ASU for providing computing resources and access to the ChatGPT enterprise version for experiments. We acknowledge grants GAP (PID2022-139308OA-I00) funded by MICIU/AEI/10.13039/501100011033/ and ERDF, EU; LATCHING (PID2023-147129OB-C21) funded by MICIU/AEI/10.13039/501100011033 and ERDF, EU. CITIC, as a center accredited for excellence within the Galician University System and a member of the CIGUS Network, receives subsidies from the Department of Education, Science, Universities, and Vocational Training of the Xunta de Galicia. Additionally, it is co-financed by the EU through the FEDER Galicia 2021-27 operational program (Ref. ED431G 2023/01). Furthermore, this research was supported by the International, Interdisciplinary and Intersectoral Information and Communications Technology PhD programme (3-i ICT) granted to CITIC and supported by the European Union through the Horizon 2020 research and innovation programme under a Marie Skłodowska-Curie agreement (H2020-MSCA-COFUND), GA 101034261. | |
| dc.description.sponsorship | Xunta de Galicia; ED431G 2023/01 | |
| dc.identifier.citation | Olga Kellert, Nemika Tyagi, Muhammad Imran, Nelvin Licona-Guevara, and Carlos Gómez-Rodríguez. 2025. Parsing the Switch: LLM-Based UD Annotation for Complex Code-Switched and Low-Resource Languages. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 15934–15949, Suzhou, China. Association for Computational Linguistics. DOI: 10.18653/v1/2025.findings-emnlp.863 | |
| dc.identifier.doi | 10.18653/v1/2025.findings-emnlp.863 | |
| dc.identifier.isbn | 979-8-89176-335-7 | |
| dc.identifier.uri | https://hdl.handle.net/2183/47335 | |
| dc.language.iso | eng | |
| dc.publisher | Association for Computational Linguistics | |
| dc.relation.projectID | info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2021-2023/PID2022-139308OA-I00/ES/REPRESENTACIONES ESTRUCTURADAS VERDES Y ENCHUFABLES | |
| dc.relation.projectID | info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2021-2023/PID2023-147129OB-C21/ES/TECNOLOGÍAS DEL LENGUAJE DESDE UNA PERSPECTIVA VERDE (LATCHING): DOMINIOS CON ESCASOS RECURSOS | |
| dc.relation.projectID | info:eu-repo/grantAgreement/EC/H2020/101034261/EU | |
| dc.relation.uri | https://doi.org/10.18653/v1/2025.findings-emnlp.863 | |
| dc.rights | Attribution 4.0 International | en |
| dc.rights.accessRights | open access | |
| dc.rights.uri | http://creativecommons.org/licenses/by/4.0/ | |
| dc.subject | Code-switching | |
| dc.subject | Large language models (LLMs) | |
| dc.subject | Universal Dependencies | |
| dc.subject | Prompt-based framework | |
| dc.title | Parsing the Switch: LLM-Based UD Annotation for Complex Code-Switched and Low-Resource Languages | |
| dc.type | conference output | |
| dspace.entity.type | Publication | |
| relation.isAuthorOfPublication | da9e8872-ab78-4a1c-8212-1121388beb43 | |
| relation.isAuthorOfPublication | 6779b734-3d4b-4242-9bde-78e83eea84db | |
| relation.isAuthorOfPublication | e70a3969-39f6-4458-9339-3b71756fa56e | |
| relation.isAuthorOfPublication.latestForDiscovery | da9e8872-ab78-4a1c-8212-1121388beb43 |
Files
Original bundle
- Name: Imran_Muhammad_2025_Parsing_the_Switch.pdf
- Size: 1.14 MB
- Format: Adobe Portable Document Format

