Uso de algoritmos de aprendizaje máquina para la clasificación de tráfico de red

Costa Garrido, Anxo

Title

Author(s)

Costa Garrido, Anxo

Directors

Fernández, Diego
Novoa, Francisco

Date

2022

Center/Dept./Entity

Universidade da Coruña. Facultade de Informática

Description

Bachelor's thesis (UDC.FIC). Computer Engineering. Academic year 2021/2022

Abstract

[Resumen] Los avances tecnológicos han permitido un acceso a internet asequible, rápido y fiable aumentando el número de usuarios y los servicios demandados. Han surgido nuevos paradigmas de diseño para simplificar la administración de unas redes cada vez más complejas, añadiendo nuevas superficies de ataque. Las restricciones debidas al COVID-19 obligaron a crear unas infraestructuras de teletrabajo de la noche a la mañana, dándole un mayor papel al usuario y la seguridad de su entorno. Cada vez hay más información sensible en contacto con internet, y por eso la ciberseguridad es más importante que nunca. La clasificación del tráfico de red es una herramienta útil para tareas de seguridad, pero analizar cada paquete conlleva un elevado coste computacional. De ahí que a menudo este análisis se realice a nivel de flujo de red. En este trabajo hemos aplicado diversos modelos de Machine Learning y Deep Learning a un conjunto de datos etiquetado (InSDN) que contiene las características del tráfico (en flujos) de una red definida por software, con el objetivo de realizar una tarea de clasificación supervisada, siguiendo la metodología CRISP-DM, nuestra guía durante todo el proceso. A lo largo de este trabajo se ha realizado una intensiva labor de ingeniería de datos. Se ha analizado de manera exhaustiva el conjunto de datos inicial, se ha hecho una limpieza de los datos, se les ha dado el formato adecuado, se han construido nuevas características derivadas de las ya existentes, y se han seleccionado las que aportaban más información. Sobre el conjunto obtenido se han aplicado algoritmos de diferente naturaleza, tras realizar un proceso de hiperparametrización. Para su implementación se han usado principalmente las herramientas scikit-learn y Keras. Para finalizar, los modelos resultantes han sido evaluados empleando métricas de clasificación tradicionales. Los resultados muestran que los modelos de Machine Learning y Deep Learning resultan de utilidad en problemas de clasificación de tráfico de red, destacando los modelos Random Forest y LinearSVC por su exactitud y rapidez.

[Abstract] Technological advances have led to affordable, fast and reliable Internet access, increasing both the number of users and the services they demand. New design paradigms have emerged to simplify the management of increasingly complex networks, but they also add new attack surfaces. COVID-19 restrictions forced the creation of remote work infrastructures overnight, giving greater weight to the user and the security of the user's environment. With more and more sensitive information coming into contact with the Internet, cybersecurity is more important than ever. Network traffic classification is a useful tool for security tasks, but analyzing every packet is computationally expensive, so the analysis is often performed at the network flow level. In this work we have applied several Machine Learning and Deep Learning models to a labeled dataset (InSDN) containing the traffic characteristics (as flows) of a software-defined network, with the aim of performing a supervised classification task. The CRISP-DM methodology guided the whole process. Throughout this work we carried out intensive data engineering: we exhaustively analyzed the initial dataset, cleaned and formatted the data, constructed new features derived from the existing ones, and selected those that provided the most information. After a hyperparameter tuning process, we applied algorithms of different kinds to the resulting set, implemented mainly with scikit-learn and Keras. Finally, we evaluated the resulting models using traditional classification metrics. The results show that Machine Learning and Deep Learning models are useful for network traffic classification problems, with Random Forest and LinearSVC standing out for their accuracy and speed.
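
The workflow summarized in the abstract (feature selection, hyperparameter tuning, and evaluation of Random Forest and LinearSVC with traditional classification metrics) could look roughly like the following scikit-learn sketch. It is only an illustration under stated assumptions: the data is synthetic (generated with make_classification, not the InSDN dataset), and the feature-selection method, the parameter grids, the number of selected features and the names `candidates` and `results` are hypothetical choices, not the settings used in the thesis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for labeled flow features (assumption: NOT the InSDN data).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical candidate models and small illustrative parameter grids.
candidates = {
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"clf__n_estimators": [50, 100]},
    ),
    "linear_svc": (
        LinearSVC(max_iter=5000, random_state=0),
        {"clf__C": [0.1, 1.0]},
    ),
}

results = {}
for name, (clf, grid) in candidates.items():
    # Scale, keep the 10 highest-scoring features (ANOVA F-test), then classify.
    pipe = Pipeline(
        [
            ("scale", StandardScaler()),
            ("select", SelectKBest(f_classif, k=10)),
            ("clf", clf),
        ]
    )
    # Cross-validated grid search over the model's hyperparameters.
    search = GridSearchCV(pipe, grid, cv=3, scoring="f1_macro")
    search.fit(X_train, y_train)

    # Evaluate the best configuration on the held-out test split.
    y_pred = search.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1_macro": f1_score(y_test, y_pred, average="macro"),
        "best_params": search.best_params_,
    }

for name, scores in results.items():
    print(name, scores)
```

Placing scaling and feature selection inside the Pipeline means both are refit on each cross-validation fold, so the held-out fold does not leak into the selection step.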

Keywords

Machine learning
Deep learning
Detección de intrusiones
Redes definidas por software
Clasificación
Random forest
AdaBoost
LinearSVC
Autoencoder
RNN-LSTM
CNN
Intrusion detection
Software-defined networks
Classification
Random forest