Use this link to cite:
https://hdl.handle.net/2183/46437 Representaciones latentes perceptuales: mejorando la fidelidad visual y el control autónomo en modelos generativos de mundos
Identifiers
Publication date
Authors
Corcuera Sánchez, Bruno
Other responsibilities
Universidade da Coruña. Facultade de Informática
Journal Title
Bibliographic citation
Type of academic work
Academic degree
Abstract
[Resumen]: World Models represent a significant advance in artificial intelligence, enabling systems to build internal simulations of the environment in which agents can interact and learn. This ability to simulate and explore potential scenarios is a fundamental step towards intelligent systems with a deeper understanding of the dynamics and physics of the environment.
However, these models still have critical limitations, particularly in the visual quality of the generated images, which directly affects the performance of agents trained with reinforcement learning. The low fidelity of the visual representations hinders the transfer of learned policies to the real world and limits the effectiveness of these systems.
This work proposes a novel image compression and generation architecture focused on optimising latent representations through autoencoders and unsupervised dynamic feature selection techniques, without increasing the size of the underlying model. This yields representations that better reflect the structural and spatial perception of the environment, improving both the quality of the generated images and the performance of agents in autonomous control tasks.
The experimental methodology replicates standard environments and models used in reference scientific publications. To evaluate the results objectively, reconstruction metrics based on divergences between distributions are used for image quality, and cumulative reward is used for agent performance. The experiments were designed under experimentally validated settings and conditions to ensure reliable results. They show substantial improvements: a 53% reduction in FID (Fréchet Inception Distance) and a 26% reduction in FVD (Fréchet Video Distance), reflecting a marked improvement in the visual quality of the generated images, together with a 12% increase in agent performance.
This study contributes directly to current research lines promoted by leading artificial intelligence laboratories, particularly in applications of World Models as real-time generative graphics engines for video games and for the efficient training of robots and autonomous agents in internal simulated worlds. These contributions have materialised in a scientific paper submitted for review to the NeurIPS 2025 International Conference. The source code of this work is publicly available for consultation and reproduction at: https://github.com/BrunooCS/Perceptual-Latent-Representations-World-Model
[Abstract]: World Models represent a significant advance in artificial intelligence, allowing systems to build internal simulations of the environment where agents can interact and learn. This ability to simulate and explore potential scenarios is a fundamental step towards intelligent systems with a deeper understanding of the dynamics and physics of the environment. However, these models still have critical limitations, particularly in the visual quality of the generated images, which directly impacts the performance of agents trained by reinforcement learning. The low fidelity of the visual representations makes it difficult to transfer learned policies to the real world and limits the effectiveness of these systems. The present work proposes a novel image compression and generation architecture focused on optimising latent representations by means of autoencoders and unsupervised dynamic feature selection techniques, without increasing the size of the underlying model. This allows the development of representations that better reflect the structural and spatial perception of the environment, improving both the quality of the generated images and the performance of the agents in autonomous control tasks. The experimental methodology replicates standard environments and models used in peer-reviewed scientific publications. Reconstruction metrics based on divergences between distributions, together with cumulative reward for agent performance, are used to objectively evaluate the results. Experiments were designed under experimentally validated settings and conditions to ensure reliable results. They show significant improvements: a 53% reduction in FID (Fréchet Inception Distance) and a 26% reduction in FVD (Fréchet Video Distance), reflecting a substantial improvement in the visual quality of the generated images. Furthermore, a 12% increase in agent performance is observed.
This study contributes directly to current research lines promoted by leading artificial intelligence laboratories, particularly in applications of World Models as real-time generative graphics engines for video games and for the efficient training of robots and autonomous agents in internal simulated worlds. These contributions have materialised in a scientific paper submitted for review to the NeurIPS 2025 International Conference. The source code of this work is publicly available for consultation and reproduction at: https://github.com/BrunooCS/Perceptual-Latent-Representations-World-Model
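The distribution-divergence metrics cited above (FID and FVD) both rest on the same quantity: the Fréchet distance between two Gaussians fitted to embedding features of real and generated samples. As a minimal sketch of that computation, the snippet below implements the closed-form formula on plain feature arrays; the shapes, sample data, and function name are illustrative assumptions, not taken from the thesis code.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets.

    FID/FVD apply this formula to Inception/I3D embeddings of real and
    generated images or videos; here the inputs are plain arrays of
    shape (n_samples, dim) purely for illustration.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical noise
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

# Hypothetical example: a mean shift between "real" and "generated" features
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 8))
fake = rng.normal(0.5, 1.0, size=(500, 8))
print(frechet_distance(real, real))  # near zero for identical sets
print(frechet_distance(real, fake))  # grows with the distribution gap
```

A lower value means the generated distribution is closer to the real one, which is why the reported 53% FID and 26% FVD reductions indicate higher visual fidelity.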
Description
Keywords
World Models; Deep Learning; Feature Selection; Computer Vision; Latent Representations; Reinforcement Learning; Generative Artificial Intelligence
Editor version
Rights
Attribution-NonCommercial-ShareAlike 4.0 International