Use this link to cite:
https://hdl.handle.net/2183/46383
Desarrollo y evaluación de sistemas basados en modelos de lenguaje para respuesta automática a preguntas en ciencias de la salud (Development and evaluation of language-model-based systems for automatic question answering in the health sciences)
Authors
Correa Guillen, Alexis
Other responsibilities
Universidade da Coruña. Facultade de Informática
Abstract
[Abstract]: Evaluating language models in domains requiring complex reasoning is a key challenge in natural language processing. While they have demonstrated strong capabilities across multiple tasks, their performance in specialized contexts remains an open question. We present HEAD-QA V2, an expanded version of a dataset based on the entrance exams for Spain's Specialized Health Training system. It contains more than 12 000 multiple-choice questions across six biomedical disciplines, including multimodal questions with images, and has been automatically translated from Spanish into English, Italian, Russian, and Galician. To evaluate language models on this benchmark, we conducted experiments with LLMs of various sizes and multiple inference strategies. We analyzed three approaches: (i) prompting, which guides generation through instructions; (ii) Retrieval-Augmented Generation (RAG), which provides additional context using book excerpts; and (iii) probability-based selection, which avoids text generation and selects the answer based on the model's assigned scores. Results indicate that model choice is the primary performance factor, while advanced inference strategies not only fail to provide significant improvements but sometimes degrade performance. These findings establish HEAD-QA V2 as a key resource for natural language processing research in specialized domains, providing a challenging environment for model evaluation.
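Of the three strategies named in the abstract, probability-based selection is the least self-explanatory, so here is a minimal sketch of the general idea: score each multiple-choice option by the length-normalized log-probability a causal language model assigns to it given the question, then pick the highest-scoring option without generating any text. The model name (gpt2), the prompt format, and the sample question are illustrative assumptions, not the setup actually used in the thesis.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative assumption: a small, openly available causal LM. The thesis
# evaluates LLMs of various sizes; gpt2 just keeps this sketch runnable.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def option_logprob(prompt: str, option: str) -> float:
    """Length-normalized log-probability of `option` conditioned on `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    option_ids = tokenizer(" " + option, return_tensors="pt",
                           add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, option_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # logits at position i predict token i+1, so drop the last position.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    start = prompt_ids.shape[1] - 1          # predictor of the first option token
    targets = input_ids[0, prompt_ids.shape[1]:]
    scores = log_probs[start:start + targets.shape[0]].gather(1, targets.unsqueeze(1))
    # Averaging over tokens keeps longer options from being unfairly penalized.
    return scores.mean().item()

# Hypothetical HEAD-QA-style question; the real dataset format may differ.
question = "Which vitamin deficiency causes scurvy?\nAnswer:"
options = ["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"]
best = max(options, key=lambda o: option_logprob(question, o))
print(best)  # a stronger model should select "Vitamin C"
```

Because this strategy reads off scores the model computes anyway, it sidesteps the parsing problems of free-text generation, which may be one reason the abstract contrasts it with prompting and RAG.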
Rights
Attribution 4.0 International