Decoding Hate: Exploring Language Models' Reactions to Hate Speech

UDC.coleccion: Investigación
UDC.conferenceTitle: NAACL-HLT 2025
UDC.departamento: Ciencias da Computación e Tecnoloxías da Información
UDC.endPage: 990
UDC.grupoInv: Information Retrieval Lab (IRlab)
UDC.institutoCentro: CITIC - Centro de Investigación de Tecnoloxías da Información e da Comunicación
UDC.startPage: 973
dc.contributor.author: Piot, Paloma
dc.contributor.author: Parapar, Javier
dc.date.accessioned: 2026-02-05T08:40:07Z
dc.date.available: 2026-02-05T08:40:07Z
dc.date.issued: 2025
dc.description.abstract: [Abstract]: Hate speech is a harmful form of online expression, often manifesting as derogatory posts, and poses a significant risk in digital environments. With the rise of Large Language Models (LLMs), there is concern about their potential to replicate hate speech patterns, given their training on vast amounts of unmoderated internet data. Understanding how LLMs respond to hate speech is crucial for their responsible deployment. However, research on the behaviour of LLMs towards hate speech has so far been limited. This paper investigates the reactions of seven state-of-the-art LLMs (LLaMA 2, Vicuna, LLaMA 3, Mistral, GPT-3.5, GPT-4, and Gemini Pro) to hate speech. Through qualitative analysis, we aim to reveal the spectrum of responses these models produce, highlighting their capacity to handle hate speech inputs. We also discuss strategies to mitigate hate speech generation by LLMs, particularly through fine-tuning and guideline guardrailing. Finally, we explore the models' responses to hate speech framed in politically correct language.
dc.description.sponsorship: The authors thank the funding from the Horizon Europe research and innovation programme under the Marie Skłodowska-Curie Grant Agreement No. 101073351. The authors also thank the financial support provided by the Consellería de Cultura, Educación, Formación Profesional e Universidades (accreditation 2019-2022 ED431G/01, ED431B 2022/33) and the European Regional Development Fund, which recognises the CITIC Research Center in ICT as a Research Center of the Galician University System, as well as the project PID2022-137061OB-C21 (Ministerio de Ciencia e Innovación, supported by the European Regional Development Fund). The authors also thank the funding of project PLEC2021-007662 (MCIN/AEI/10.13039/501100011033, Ministerio de Ciencia e Innovación, Agencia Estatal de Investigación, Plan de Recuperación, Transformación y Resiliencia, Unión Europea-Next Generation EU).
dc.description.sponsorship: Xunta de Galicia; 2019-2022 ED431G/01
dc.description.sponsorship: Xunta de Galicia; ED431B 2022/33
dc.identifier.citation: Paloma Piot and Javier Parapar. 2025. Decoding Hate: Exploring Language Models' Reactions to Hate Speech. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL-HLT 2025, pp. 973–990, Albuquerque, New Mexico, 29 April – 4 May 2025. Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.naacl-long.45
dc.identifier.doi: 10.18653/v1/2025.naacl-long.45
dc.identifier.isbn: 9798891761896
dc.identifier.uri: https://hdl.handle.net/2183/47249
dc.language.iso: eng
dc.publisher: Association for Computational Linguistics (ACL)
dc.relation.projectID: info:eu-repo/grantAgreement/EC/HE/101073351
dc.relation.projectID: info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica, Técnica y de Innovación 2021-2023/PID2022-137061OB-C21/ES/BUSQUEDA, SELECCION Y ORGANIZACION DE CONTENIDOS PARA NECESIDADES DE INFORMACION RELACIONADAS CON LA SALUD - CONSTRUCCION DE RECURSOS Y PERSONALIZACION
dc.relation.projectID: info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2021-2024/PLEC2021-007662/ES/BIG-eRISK: PREDICCIÓN TEMPRANA DE RIESGOS PERSONALES EN CONJUNTOS DE DATOS MASIVOS
dc.relation.uri: https://doi.org/10.18653/v1/2025.naacl-long.45
dc.rights: Attribution 4.0 International
dc.rights.accessRights: open access
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject: Digital environment
dc.subject: Internet data
dc.subject: Language model
dc.subject: Model reactions
dc.subject: Qualitative analysis
dc.subject: Speech generation
dc.subject: Speech input
dc.subject: Speech patterns
dc.title: Decoding Hate: Exploring Language Models' Reactions to Hate Speech
dc.type: conference output
dspace.entity.type: Publication
relation.isAuthorOfPublication: 0563c6c3-cd50-4d7d-b11f-127ee297dd6b
relation.isAuthorOfPublication: fef1a9cb-e346-4e53-9811-192e144f09d0
relation.isAuthorOfPublication.latestForDiscovery: 0563c6c3-cd50-4d7d-b11f-127ee297dd6b

Files

Original bundle

Name: Parapar_Javier_2025_Decoding_Hate.pdf
Size: 540.64 KB
Format: Adobe Portable Document Format