Data Drift Analysis In NLP Models

Pérez Longa, Iván

Use this link to cite

http://hdl.handle.net/2183/34026

Unless otherwise indicated, the item's license is described as Attribution 3.0 Spain

Collections

Bachelor's Degree in Computer Engineering

Advisor(s)

Correia, Joao
Vilares, David

Date

2023

Center/Dept./Entity

Universidade da Coruña. Facultade de Informática

Description

Bachelor's thesis (UDC.FIC). Computer Engineering. Academic year 2022/2023

Abstract

[Abstract]: Natural Language Processing (NLP) is becoming increasingly popular, and with the current growth and popularization of AI it is unsurprising that these fields are on everyone's lips. Accordingly, more and more studies focus on finding, refining, or developing techniques to improve the results of these systems. One problem that still lacks a well-defined and effective solution is how to deal with changes in data, which can degrade a model: the data on which it is trained and the data it faces in the real world may differ, complicating generalization. These changes, related to the evolution of features over time, are known as data drift. Drift may arise for various reasons, such as the natural evolution of vocabulary, topics, or styles. New terms are also expected to appear and others to disappear (for example, COVID-19, pandemics, war, or inflation). Currently, there are no well-established methods for handling this phenomenon, so new alternatives should be explored to assess how well they detect variations. In this setting, data drift detection methods should capture whether two articles about the same topic differ in language, and measure the similarity between data to understand whether the model will fail its prediction. In this thesis we also define three scenarios based on three binary classification datasets (Disaster tweets, Fake news, and Depression detection), and we transform these data into drifted versions in different ways, such as adding synonyms, adding noise, or changing the style. To quantify whether the transformed text retains the meaning of the original, we apply metrics such as BLEU and edit distance to the text itself.
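As an illustration of the text-level comparison mentioned above, the following is a minimal sketch of edit distance (here the character-level Levenshtein variant) between an original sentence and a noisy version of it. The function names, the normalization into a similarity score, and the example sentences are assumptions made for illustration; the abstract does not specify which edit distance variant or implementation the thesis uses.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions that turn `a` into `b`."""
    # prev[j] is the distance between the already processed prefix of `a`
    # and the prefix b[:j]; it starts as the distance from the empty string.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or match
            ))
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Hypothetical helper: maps the distance into [0, 1], 1 = identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


# Illustrative sentences (not taken from the thesis datasets).
original = "a wildfire is spreading near the town"
drifted = "a wildfire is spreadin near teh town"  # simulated typing noise
print(edit_distance(original, drifted), edit_similarity(original, drifted))
```

A small distance relative to the sentence length suggests that the transformation preserved most of the surface form, although it says nothing by itself about whether the meaning was preserved.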
In addition, we transform the original and drifted sentences into numerical vector representations, using encoding strategies such as Bag of Words or embedding strategies based on Large Language Models. With these vectors, we measure similarity again using other metrics, such as cosine similarity or the dot product. Because the main idea of this project is to explore methods for detecting data drift in NLP models, we also study the results of the implemented models (Multinomial Naive Bayes, Logistic Regression, Support Vector Classifier, Long Short-Term Memory, and Bidirectional Encoder Representations from Transformers) on each original dataset and on the transformed data, comparing their performance to see whether they fail more predictions and whether they would lose performance over time.
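The vector-level comparison described above can be sketched as follows, assuming a very simple Bag of Words encoding (lowercased, whitespace-tokenized word counts). The tokenization, the function names, and the example sentences are assumptions for illustration only; the thesis also uses embeddings from Large Language Models, which this sketch does not cover.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Sparse Bag of Words vector: token -> count.
    Lowercasing and whitespace splitting are simplifying assumptions."""
    return Counter(text.lower().split())


def dot_product(u: Counter, v: Counter) -> int:
    """Sum of count products over tokens; absent tokens count as 0."""
    return sum(count * v[token] for token, count in u.items())


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Dot product divided by the product of the vector norms."""
    norm_u = math.sqrt(dot_product(u, u))
    norm_v = math.sqrt(dot_product(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty texts
    return dot_product(u, v) / (norm_u * norm_v)


# Illustrative sentences (not from the thesis): a synonym-style drift.
original = bag_of_words("flood warning issued for the coast")
drifted = bag_of_words("flood alert issued for the shore")

print(dot_product(original, drifted))        # 4 shared tokens
print(cosine_similarity(original, drifted))  # 4 / 6, about 0.667
```

With raw counts, synonyms share no dimension, so replacing words lowers the score even when the meaning is unchanged; this is one reason dense embeddings are also worth comparing.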

[Abstract, translated from Galician]: Natural Language Processing is increasingly popular, and with the current growth and popularization of AI it is natural that these fields of study are on everyone's lips. It is therefore natural that more and more studies focus on finding, refining, or developing techniques to improve the results of these systems. One problem that still lacks a well-defined and effective solution is how to cope with changes in data, which can confuse the model, since the data on which it is trained and the data it will face in the real world may differ, complicating generalization. These changes in the data, related to the evolution of features over time, are known as data drift. This can happen for several reasons, such as the transformation of vocabulary or the evolution of topics or styles. New terms are also expected to appear and others to disappear, such as COVID-19, pandemics, war, or inflation. At present there are no well-established methods for dealing with this phenomenon, so new alternatives should be explored to assess how well they detect variations. In this case, data drift detection methods should capture whether two articles on the same topic differ in language, and measure the similarity between data to understand whether the model will fail its prediction. In this thesis we define three scenarios based on three binary classification datasets (Tweets about natural disasters, Fake news, and Depression detection), and we transform these data into drifted versions in various ways, such as adding synonyms, adding noise, or changing the style. To measure whether the transformed text has the same meaning as the original, we apply metrics such as BLEU and edit distance directly to the text itself.
In addition, we transform the original and drifted sentences into numerical vector representations, using encoding strategies such as Bag of Words or embedding strategies such as Large Language Models. With these vectors, we measure similarity again with other metrics, such as cosine similarity or the dot product. Because the main idea of this project is to explore methods for detecting data drift in NLP models, we also study the results of the implemented models (Multinomial Naive Bayes, Logistic Regression, Support Vector Classifier, Long Short-Term Memory, and Bidirectional Encoder Representations from Transformers) on the original and transformed data of each dataset, comparing their performance and checking whether prediction accuracy declines and whether the models lose performance over time.

Keywords

Data drift
Data shift
Language models
Text classification
NLP

Rights

Attribution 3.0 Spain