Exploiting Topic Analysis Models to Explore Psychological Dimensions in Social Media Data

Couto Pintos, Manuel; Parapar, Javier; Losada, David E.

UDC.coleccion	Investigación
UDC.departamento	Ciencias da Computación e Tecnoloxías da Información
UDC.endPage	21
UDC.grupoInv	Information Retrieval Lab (IRlab)
UDC.institutoCentro	CITIC - Centro de Investigación de Tecnoloxías da Información e da Comunicación
UDC.issue	6047
UDC.journalTitle	Scientific Reports
UDC.startPage	1
UDC.volume	16
dc.contributor.author	Couto Pintos, Manuel
dc.contributor.author	Parapar, Javier
dc.contributor.author	Losada, David E.
dc.date.accessioned	2026-02-23T16:40:10Z
dc.date.available	2026-02-23T16:40:10Z
dc.date.issued	2026
dc.description	Data availability: The dataset consisting of extracted topics and human ratings is publicly available at Zenodo: https://doi.org/10.5281/zenodo.15081947. The system built for topic assessment (a web application developed specifically for storing topics and assessments in a database with a customized design) is available at https://github.com/manucouto1/Topic-Quality-Assessment-Tool. The eRisk datasets used in this study are publicly available (eRisk website). Specifically, the eRisk 2017 and 2018 datasets can be obtained at https://tec.citius.usc.es/ir/code/eRisk.html (note that the 2017 data served as the training split for 2018), while the eRisk 2019 dataset is available at https://erisk.irlab.org/2019/eRisk2019.html.
dc.description.abstract	[Abstract]: Automatic topic generation is a fundamental tool in unstructured text analysis, yet its application to noisy web-based collections for extracting psychological patterns remains underexplored. This work compares three representative topic models from different families: Latent Dirichlet Allocation (classical probabilistic), BERTopic (embedding-based), and TopClus (deep neural network), evaluating their performance on mental health data from the eRisk initiative. Using posts from individuals with depressive disorders and control groups, we assess topic quality through both automatic coherence metrics and rigorous human evaluation by expert reviewers. This dual approach addresses the limitations of purely automatic evaluation in complex social media datasets where thematic content does not always reveal psychological cues. Our results demonstrate that BERTopic significantly outperforms other models in perceived coherence, identifying clearer and more specific themes, including depression-related topics such as mental health struggles and self-harm. Thematic analysis across user groups revealed that certain topics contained higher proportions of posts from individuals with depression, providing actionable insights for psychological screening. This work underscores the potential of advanced topic models for mental health analysis in noisy social media data and highlights the importance of human evaluation in validating topic quality for sensitive applications.
dc.description.sponsorship	The first and third authors acknowledge the financial support provided by the Agencia Estatal de Investigación (Spain) (PID2022-137061OB-C22; MCIN/AEI/10.13039/501100011033, Plan de Recuperación, Transformación y Resiliencia, Unión Europea-Next Generation EU), Consellería de Cultura, Educación, Formación Profesional e Universidades (Centro de investigación de Galicia accreditation 2024-2027 ED431G-2023/04 and Reference Competitive Group accreditation 2022-2025, ED431C 2022/19) and the European Union (European Regional Development Fund - ERDF). These authors also acknowledge the project “Cátedra de IA aplicada a la Medicina Personalizada de Precisión” (Cátedras ENIA, TSI-100932-2023-3); Cátedras ENIA is funded by the Ministerio de Transformación Digital y Función Pública (Secretaría de Estado de Digitalización e Inteligencia Artificial) and by the NextGenerationEU fund. The second author acknowledges the financial support provided by the projects PID2022-137061OB-C21 (MCIN/AEI/10.13039/501100011033, Ministerio de Ciencia e Innovación, “ERDF A way of making Europe”, by the European Union); Consellería de Educación, Universidade e Formación Profesional, Spain (accreditations 2019–2022 ED431G/01 and GPC ED431B 2022/33) and the European Regional Development Fund, which acknowledges the CITIC Research Center.
dc.description.sponsorship	Xunta de Galicia; ED431G/01
dc.description.sponsorship	Xunta de Galicia; ED431B 2022/33
dc.description.sponsorship	Xunta de Galicia; ED431G-2023/04
dc.description.sponsorship	Xunta de Galicia; ED431C 2022/19
dc.description.uri	https://doi.org/10.5281/zenodo.15081947
dc.description.uri	https://github.com/manucouto1/Topic-Quality-Assessment-Tool
dc.description.uri	https://tec.citius.usc.es/ir/code/eRisk.html
dc.description.uri	https://erisk.irlab.org/2019/eRisk2019.html
dc.identifier.citation	Couto, M., Parapar, J. & Losada, D.E. Exploiting topic analysis models to explore psychological dimensions in social media data. Sci Rep 16, 6047 (2026). https://doi.org/10.1038/s41598-026-36339-y
dc.identifier.doi	10.1038/s41598-026-36339-y
dc.identifier.issn	2045-2322
dc.identifier.uri	https://hdl.handle.net/2183/47483
dc.language.iso	eng
dc.publisher	Nature Research
dc.relation.projectID	info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica, Técnica y de Innovación 2021-2023/PID2022-137061OB-C22/ES/BUSQUEDA, SELECCION Y ORGANIZACION DE CONTENIDOS PARA NECESIDADES DE INFORMACION RELACIONADAS CON LA SALUD: BUSQUEDA Y DETECCION DE DESINFORMACION
dc.relation.projectID	info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica, Técnica y de Innovación 2021-2023/PID2022-137061OB-C21/ES/BUSQUEDA, SELECCION Y ORGANIZACION DE CONTENIDOS PARA NECESIDADES DE INFORMACION RELACIONADAS CON LA SALUD - CONSTRUCCION DE RECURSOS Y PERSONALIZACION
dc.relation.projectID	info:eu-repo/grantAgreement/MTDPF//TSI-100932-2023-3/ES/CÁTEDRA DE INTELIGENCIA ARTIFICIAL APLICADA A LA MEDICINA PERSONALIZADA DE PRECISIÓN
dc.relation.uri	https://doi.org/10.1038/s41598-026-36339-y
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International	en
dc.rights.accessRights	open access
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subject	Social media
dc.subject	Mental health
dc.subject	Psychological analysis
dc.subject	Topic analysis
dc.subject	Large language models
dc.title	Exploiting Topic Analysis Models to Explore Psychological Dimensions in Social Media Data
dc.type	journal article
dc.type.hasVersion	VoR
dspace.entity.type	Publication
relation.isAuthorOfPublication	fef1a9cb-e346-4e53-9811-192e144f09d0
relation.isAuthorOfPublication.latestForDiscovery	fef1a9cb-e346-4e53-9811-192e144f09d0

Files

Original bundle

Name: Parapar_Javier_2026_Exploiting_topic_analysis_models.pdf
Size: 5.28 MB
Format: Adobe Portable Document Format

Collections

Investigación (FIC)