Adapting Large Language Models for Underrepresented Languages

UDC.coleccion: Publicacións UDC [es_ES]
UDC.endPage: 32 [es_ES]
UDC.startPage: 25 [es_ES]
dc.contributor.author: Bao, Eliseo
dc.contributor.author: Pérez, Anxo
dc.contributor.author: Parapar, Javier
dc.date.accessioned: 2025-01-13T18:55:08Z
dc.date.available: 2025-01-13T18:55:08Z
dc.date.issued: 2024
dc.description.abstract: [Abstract] The popularization of Large Language Models (LLMs), especially through the development of conversational systems, makes it essential to consider how to make artificial intelligence (AI) accessible to everyone. Most models neglect minority languages and prioritize widely spoken ones. This exacerbates the underrepresentation of minority languages in the digital world and negatively affects their speakers. We present two resources aimed at improving natural language processing (NLP) for Galician: (i) a Llama 3.1 instruct model adapted through continuous pre-training on the CorpusNós dataset; and (ii) a Galician version of the Alpaca dataset, used to assess the improvement over the base model. In this evaluation, our model outperformed both the base model and another Galician model in quantitative and qualitative terms. [es_ES]
dc.identifier.uri: http://hdl.handle.net/2183/40687
dc.language.iso: eng [es_ES]
dc.relation.uri: https://doi.org/10.17979/spudc.9788497498913.4
dc.rights: Attribution 4.0 [es_ES]
dc.rights.accessRights: open access [es_ES]
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/Internacional*
dc.subject: Large language models (LLMs) [es_ES]
dc.subject: Natural language processing (NLP) [es_ES]
dc.title: Adapting Large Language Models for Underrepresented Languages [es_ES]
dc.type: conference output [es_ES]
dspace.entity.type: Publication
relation.isAuthorOfPublication: 99ed6581-6dee-442a-9b37-c35da63bef8a
relation.isAuthorOfPublication: c673c8b1-1afc-48f6-85e9-8f29f9cffb91
relation.isAuthorOfPublication: fef1a9cb-e346-4e53-9811-192e144f09d0
relation.isAuthorOfPublication.latestForDiscovery: 99ed6581-6dee-442a-9b37-c35da63bef8a

Files

Original bundle

Name: XoveTIC_2024_proceedings_Parte04.pdf
Size: 453.1 KB
Format: Adobe Portable Document Format