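Toxicity analysis of Galician Instagram accounts: development of a detection system (original title: Análise de toxicidade en contas galegas de Instagram: desenvolvemento dun sistema de detección)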

Castro, Laura M.
Álvarez Crespo, Lucía María
Molina Muñiz, Beatriz
Universidade da Coruña. Facultade de Informática
2025-07-24
2025-07-24
2025-06
https://hdl.handle.net/2183/45567

[Abstract]: This Bachelor's Thesis is a research project whose objective is to develop a system for detecting toxicity in comments written in Galician on the social network Instagram. The initiative arises from the need to provide minoritized languages with technological tools for identifying hate speech, especially of a misogynistic nature, in a digital context dominated by solutions built for majority languages. The research is structured in several phases. It begins with the identification and selection of active Galician accounts on Instagram, followed by the automated collection of public comments with scraping tools. Once the data has been compiled, a detailed preprocessing stage is carried out, which includes language detection, text normalization, and the protection of user privacy. In addition to evaluating the feasibility of using models previously trained on data from Twitter/X and Mastodon, the project includes the design and training of its own machine learning models for classifying toxic comments. These models were validated with standard metrics and tested on real data in Galician, highlighting their potential for the early detection of harmful digital behavior. This work is a relevant contribution both to the field of language technology and to the fight against digital violence in Galician.
The compiled corpus, together with the developed system, provides a solid foundation for future research that aims to examine toxicity on social networks from a sociolinguistic and gender perspective.

glg
Attribution 4.0 International
http://creativecommons.org/licenses/by/4.0/
Natural language processing; Machine learning; Instagram; Galician; Corpus; Toxicity; Misogynistic comments
bachelor thesis
open access
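
The abstract mentions a preprocessing stage covering text normalization and the protection of user privacy, but this record does not describe how it was implemented. As a purely illustrative sketch, and not the thesis's actual code, such a step might look like the following in Python. The function name `preprocess_comment`, the placeholder tokens `<USER>` and `<URL>`, and the specific normalization rules (lowercasing, collapsing repeated characters) are all assumptions made for this example. Language detection is left out because it would require a third-party library.

```python
import re
import unicodedata

# Hypothetical patterns: none of these rules are taken from the thesis.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# An @-mention that may contain dots or underscores but does not end in a dot.
MENTION_RE = re.compile(r"@[a-z0-9_](?:[a-z0-9._]*[a-z0-9_])?")
# Any character repeated three or more times in a row.
REPEAT_RE = re.compile(r"(.)\1{2,}")
WHITESPACE_RE = re.compile(r"\s+")


def preprocess_comment(text: str) -> str:
    """Normalize a comment and mask user handles and links (illustrative only)."""
    # Compose accented characters so that "o" + combining accent becomes "ó".
    text = unicodedata.normalize("NFC", text)
    # Lowercase before masking so the placeholder tokens keep their casing.
    text = text.lower()
    # Mask links and user mentions to avoid keeping identifying data.
    text = URL_RE.sub("<URL>", text)
    text = MENTION_RE.sub("<USER>", text)
    # Shorten character floods ("booooa" -> "booa", "!!!!" -> "!!").
    text = REPEAT_RE.sub(r"\1\1", text)
    # Collapse runs of whitespace and trim the ends.
    return WHITESPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(preprocess_comment("@Maria.Lopez  Que BOOOOA foto!!!! https://example.com/x"))
```

Masking handles before any further processing is one common way to keep usernames out of a corpus. Whether the thesis masks, hashes, or removes them is not stated in this record.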