Decoding Hate: Exploring Language Models' Reactions to Hate Speech

UDC.coleccion: Investigación
UDC.conferenceTitle: NAACL-HLT 2025
UDC.departamento: Ciencias da Computación e Tecnoloxías da Información
UDC.endPage: 990
UDC.grupoInv: Information Retrieval Lab (IRlab)
UDC.institutoCentro: CITIC - Centro de Investigación de Tecnoloxías da Información e da Comunicación
UDC.startPage: 973
dc.contributor.author: Piot, Paloma
dc.contributor.author: Parapar, Javier
dc.date.accessioned: 2026-02-05T08:40:07Z
dc.date.available: 2026-02-05T08:40:07Z
dc.date.issued: 2025
dc.description.abstract: [Abstract]: Hate speech is a harmful form of online expression, often manifesting as derogatory posts, and poses a significant risk in digital environments. With the rise of Large Language Models (LLMs), there is concern about their potential to replicate hate speech patterns, given their training on vast amounts of unmoderated internet data. Understanding how LLMs respond to hate speech is crucial for their responsible deployment. However, research on the behaviour of LLMs towards hate speech has so far been limited. This paper investigates the reactions of seven state-of-the-art LLMs (LLaMA 2, Vicuna, LLaMA 3, Mistral, GPT-3.5, GPT-4, and Gemini Pro) to hate speech. Through qualitative analysis, we aim to reveal the spectrum of responses these models produce, highlighting their capacity to handle hate speech inputs. We also discuss strategies to mitigate hate speech generation by LLMs, particularly through fine-tuning and guideline guardrailing. Finally, we explore the models' responses to hate speech framed in politically correct language.
dc.description.sponsorship: The authors thank the funding from the Horizon Europe research and innovation programme under the Marie Skłodowska-Curie Grant Agreement No. 101073351. The authors also thank the financial support provided by the Consellería de Cultura, Educación, Formación Profesional e Universidades (accreditation 2019-2022 ED431G/01, ED431B 2022/33) and the European Regional Development Fund, which recognises the CITIC Research Center in ICT as a Research Center of the Galician University System, as well as the project PID2022-137061OB-C21 (Ministerio de Ciencia e Innovación, supported by the European Regional Development Fund). The authors also thank the funding of project PLEC2021-007662 (MCIN/AEI/10.13039/501100011033, Ministerio de Ciencia e Innovación, Agencia Estatal de Investigación, Plan de Recuperación, Transformación y Resiliencia, Unión Europea-Next Generation EU).
dc.description.sponsorship: Xunta de Galicia; 2019-2022 ED431G/01
dc.description.sponsorship: Xunta de Galicia; ED431B 2022/33
dc.identifier.citation: Paloma Piot and Javier Parapar. 2025. Decoding Hate: Exploring Language Models' Reactions to Hate Speech. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL-HLT 2025, pp. 973–990, Albuquerque, New Mexico, 29 April – 4 May 2025. Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.naacl-long.45
dc.identifier.doi: 10.18653/v1/2025.naacl-long.45
dc.identifier.isbn: 9798891761896
dc.identifier.uri: https://hdl.handle.net/2183/47249
dc.language.iso: eng
dc.publisher: Association for Computational Linguistics (ACL)
dc.relation.projectID: info:eu-repo/grantAgreement/EC/HE/101073351
dc.relation.projectID: info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica, Técnica y de Innovación 2021-2023/PID2022-137061OB-C21/ES/BUSQUEDA, SELECCION Y ORGANIZACION DE CONTENIDOS PARA NECESIDADES DE INFORMACION RELACIONADAS CON LA SALUD - CONSTRUCCION DE RECURSOS Y PERSONALIZACION
dc.relation.projectID: info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2021-2024/PLEC2021-007662/ES/BIG-eRISK: PREDICCIÓN TEMPRANA DE RIESGOS PERSONALES EN CONJUNTOS DE DATOS MASIVOS
dc.relation.uri: https://doi.org/10.18653/v1/2025.naacl-long.45
dc.rights: Attribution 4.0 International
dc.rights.accessRights: open access
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject: Digital environment
dc.subject: Internet data
dc.subject: Language model
dc.subject: Model reactions
dc.subject: Qualitative analysis
dc.subject: Speech generation
dc.subject: Speech input
dc.subject: Speech patterns
dc.title: Decoding Hate: Exploring Language Models' Reactions to Hate Speech
dc.type: conference output
dspace.entity.type: Publication
relation.isAuthorOfPublication: 0563c6c3-cd50-4d7d-b11f-127ee297dd6b
relation.isAuthorOfPublication: fef1a9cb-e346-4e53-9811-192e144f09d0
relation.isAuthorOfPublication.latestForDiscovery: 0563c6c3-cd50-4d7d-b11f-127ee297dd6b

Files

Original bundle

Name: Parapar_Javier_2025_Decoding_Hate.pdf
Size: 540.64 KB
Format: Adobe Portable Document Format