Decoding Hate: Exploring Language Models' Reactions to Hate Speech
| UDC.coleccion | Investigación | |
| UDC.conferenceTitle | NAACL-HLT 2025 | |
| UDC.departamento | Ciencias da Computación e Tecnoloxías da Información | |
| UDC.endPage | 990 | |
| UDC.grupoInv | Information Retrieval Lab (IRlab) | |
| UDC.institutoCentro | CITIC - Centro de Investigación de Tecnoloxías da Información e da Comunicación | |
| UDC.startPage | 973 | |
| dc.contributor.author | Piot, Paloma | |
| dc.contributor.author | Parapar, Javier | |
| dc.date.accessioned | 2026-02-05T08:40:07Z | |
| dc.date.available | 2026-02-05T08:40:07Z | |
| dc.date.issued | 2025 | |
| dc.description.abstract | [Abstract]: Hate speech is a harmful form of online expression, often manifesting as derogatory posts. It is a significant risk in digital environments. With the rise of Large Language Models (LLMs), there is concern about their potential to replicate hate speech patterns, given their training on vast amounts of unmoderated internet data. Understanding how LLMs respond to hate speech is crucial for their responsible deployment. However, research on the behaviour of LLMs towards hate speech has been limited. This paper investigates the reactions of seven state-of-the-art LLMs (LLaMA 2, Vicuna, LLaMA 3, Mistral, GPT-3.5, GPT-4, and Gemini Pro) to hate speech. Through qualitative analysis, we aim to reveal the spectrum of responses these models produce, highlighting their capacity to handle hate speech inputs. We also discuss strategies to mitigate hate speech generation by LLMs, particularly through fine-tuning and guideline guardrailing. Finally, we explore the models’ responses to hate speech framed in politically correct language. | |
| dc.description.sponsorship | The authors thank the funding from the Horizon Europe research and innovation programme under the Marie Skłodowska-Curie Grant Agreement No. 101073351. The authors also thank the financial support supplied by the Consellería de Cultura, Educación, Formación Profesional e Universidades (accreditation 2019-2022 ED431G/01, ED431B 2022/33) and the European Regional Development Fund, which acknowledges the CITIC Research Center in ICT as a Research Center of the Galician University System and the project PID2022-137061OB-C21 (Ministerio de Ciencia e Innovación supported by the European Regional Development Fund). The authors also thank the funding of project PLEC2021-007662 (MCIN/AEI/10.13039/501100011033, Ministerio de Ciencia e Innovación, Agencia Estatal de Investigación, Plan de Recuperación, Transformación y Resiliencia, Unión Europea-Next Generation EU). | |
| dc.description.sponsorship | Xunta de Galicia; 2019-2022 ED431G/01 | |
| dc.description.sponsorship | Xunta de Galicia; ED431B 2022/33 | |
| dc.identifier.citation | Paloma Piot and Javier Parapar. 2025. Decoding Hate: Exploring Language Models’ Reactions to Hate Speech. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL-HLT 2025, pp. 973–990, Albuquerque, New Mexico, 29 April–4 May 2025. Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.naacl-long.45 | |
| dc.identifier.doi | 10.18653/v1/2025.naacl-long.45 | |
| dc.identifier.isbn | 9798891761896 | |
| dc.identifier.uri | https://hdl.handle.net/2183/47249 | |
| dc.language.iso | eng | |
| dc.publisher | Association for Computational Linguistics (ACL) | |
| dc.relation.projectID | info:eu-repo/grantAgreement/EC/HE/101073351 | |
| dc.relation.projectID | info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica, Técnica y de Innovación 2021-2023/PID2022-137061OB-C21/ES/BUSQUEDA, SELECCION Y ORGANIZACION DE CONTENIDOS PARA NECESIDADES DE INFORMACION RELACIONADAS CON LA SALUD - CONSTRUCCION DE RECURSOS Y PERSONALIZACION | |
| dc.relation.projectID | info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2021-2024/PLEC2021-007662/ES/BIG-eRISK: PREDICCIÓN TEMPRANA DE RIESGOS PERSONALES EN CONJUNTOS DE DATOS MASIVOS | |
| dc.relation.uri | https://doi.org/10.18653/v1/2025.naacl-long.45 | |
| dc.rights | Attribution 4.0 International | en |
| dc.rights.accessRights | open access | |
| dc.rights.uri | http://creativecommons.org/licenses/by/4.0/ | |
| dc.subject | Digital environment | |
| dc.subject | Internet data | |
| dc.subject | Language model | |
| dc.subject | Model reactions | |
| dc.subject | Qualitative analysis | |
| dc.subject | Speech generation | |
| dc.subject | Speech input | |
| dc.subject | Speech patterns | |
| dc.title | Decoding Hate: Exploring Language Models' Reactions to Hate Speech | |
| dc.type | conference output | |
| dspace.entity.type | Publication | |
| relation.isAuthorOfPublication | 0563c6c3-cd50-4d7d-b11f-127ee297dd6b | |
| relation.isAuthorOfPublication | fef1a9cb-e346-4e53-9811-192e144f09d0 | |
| relation.isAuthorOfPublication.latestForDiscovery | 0563c6c3-cd50-4d7d-b11f-127ee297dd6b |
Files
Original bundle
- Name: Parapar_Javier_2025_Decoding_Hate.pdf
- Size: 540.64 KB
- Format: Adobe Portable Document Format