pRIblast: a high efficient, parallel application for RNA-RNA interaction prediction
Use this link to cite
http://hdl.handle.net/2183/31769
Except where otherwise noted, this item's license is described as Atribución-No Comercial-No Derivadas 3.0 España
Metadata
Title
pRIblast: a high efficient, parallel application for RNA-RNA interaction prediction
Author(s)
Amatria Barral, Iñaki
Directors
González-Domínguez, Jorge
Touriño, Juan
Date
2021
Center/Dept./Entity
Degree in Computer Engineering
Description
Bachelor's thesis (UDC.FIC). Computer Engineering. Academic year 2021/2022
Abstract
[Abstract]: For a long time, it was a well-established belief that RNA's only role was to act as an
intermediary between DNA and protein. However, during the last three decades, this long-held
belief has been overturned. With the development of next-generation sequencing
technologies, it has been found that most RNA in the human genome is not translated
into protein. This is the so-called long noncoding RNA (lncRNA), whose discovery has drastically
changed the way biologists approach genetics. Furthermore, studies show that, besides
playing important roles in many biological processes, the dysfunction of many lncRNA sequences
is associated with serious diseases, such as cancer or diabetes.
Consequently, noncoding RNA biology is an active research topic, and biologists are constantly
developing new strategies to elucidate lncRNA functions. One of them is the
computational prediction of interacting RNA and lncRNA pairs, since lncRNAs function by
assembling with proteins or other RNAs. Many application-specific
bioinformatics tools have been developed for this purpose, such as RIsearch2, ASSA and RIblast,
the last of which is one of the fastest tools currently available while remaining accurate.
However, even though it is up to 64 times faster than other predictors, RIblast still falls
short when it is supplied with large lncRNA datasets, and, therefore, further progress in the
field remains limited.
To address this problem, this thesis presents pRIblast: a highly efficient, parallel
application for extensive and comprehensive RNA-RNA interaction analysis. Programmed
with industry-standard parallel technologies (MPI and OpenMP), pRIblast brings the RIblast
algorithm to high performance computing facilities (i.e. clusters of multicore systems
joined together by an interconnection network). Moreover, pRIblast has been optimized to
minimize memory usage and input and output latencies and, therefore, the
new application can take on workloads that were out of reach for
the original RIblast tool (e.g. the human genome).
To ensure that pRIblast meets the quality criteria to be considered production ready, this thesis
also presents comprehensive benchmarks run on a 16-node computer cluster (64 GiB of
main memory and 16 CPU cores per node, which amount to a total of 256 CPU cores). The
results are outstanding. They not only show that the parallelization of RIblast is successful
(101 days' worth of work were reduced to just 21 hours), but they also confirm the importance
of the optimizations applied to the tool (it was possible to analyze two datasets that RIblast
cannot process because of its memory requirements, and I/O times were reduced from 4000 to just 90 seconds with
a dataset that produced 407 GiB of output data).
[Resumo]: For many years, RNA was thought to be a simple intermediary between DNA and
proteins. However, the advent of next-generation sequencing technologies revealed
that most of the human genome consists of long noncoding RNA chains
(lncRNA), that is, a type of RNA that does not synthesize
proteins. Moreover, recent studies have shown that the dysfunction of a large share of these
RNA chains is related to diseases as serious as cancer or diabetes.
To elucidate the function of lncRNAs, numerous software tools have emerged that
try to predict interactions between RNA and lncRNA, since the latter function by assembling
with proteins or other RNA chains. Some of these tools are RIsearch2,
ASSA and, most notably, RIblast, which obtains results up to 64 times faster than other
available applications without compromising prediction quality. Despite this,
RIblast is still too slow and cannot work with very large sets of lncRNAs
without prediction times growing exponentially.
This bachelor's thesis developed pRIblast, an improvement on the RIblast
algorithm that allows it to run in high performance computing environments. To
this end, standard parallel programming technologies (MPI and OpenMP) were used, which let
pRIblast efficiently exploit any multi-node computing system with multicore
nodes. The new tool was also optimized to minimize the latency of input and output
operations and memory usage. As a result, the computation time of the RIblast algorithm
was reduced by several orders of magnitude, and it became possible to run
large datasets that the original tool could never analyze
(e.g. the human genome).
To ensure that the parallelization of the tool was effective, long and extensive
performance tests were carried out on a cluster with 16 compute nodes, each with 64 GiB of memory and 16
cores (256 cores in total). The results were very satisfactory:
large speedups were achieved, allowing a large genome that would take 101
days to process to be run in just 21 hours. In addition, the optimizations developed
on the parallel algorithm proved very effective. For example, write times were reduced
from 4000 to 90 seconds on a dataset that produces 407 GiB of results,
and two datasets could be analyzed that the original algorithm could not process
because of its intensive memory use.
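
The benchmark figures quoted in the abstract can be turned into speedup and efficiency numbers with a few lines of arithmetic. The Python sketch below is only illustrative: it assumes that the 101-day figure refers to a single-core RIblast run and that the 21-hour run used all 256 cores, neither of which the abstract states explicitly. The function names are hypothetical and are not part of pRIblast.

```python
def speedup(baseline_time, improved_time):
    """Ratio of baseline time to improved time (both in the same unit)."""
    return baseline_time / improved_time


def parallel_efficiency(speedup_value, cores):
    """Fraction of the ideal linear speedup achieved on the given core count."""
    return speedup_value / cores


HOURS_PER_DAY = 24

# Figures taken from the abstract.
baseline_hours = 101 * HOURS_PER_DAY   # 101 days of work
parallel_hours = 21                    # reduced to 21 hours
total_cores = 16 * 16                  # 16 nodes x 16 cores per node

# Assumption: single-core baseline, all 256 cores used in the parallel run.
runtime_speedup = speedup(baseline_hours, parallel_hours)
efficiency = parallel_efficiency(runtime_speedup, total_cores)

# I/O times reported for the dataset that produced 407 GiB of output.
io_speedup = speedup(4000, 90)

print(f"Runtime speedup: {runtime_speedup:.1f}x")
print(f"Parallel efficiency on {total_cores} cores: {efficiency:.1%}")
print(f"I/O speedup: {io_speedup:.1f}x")
```

Under these assumptions the runtime speedup is roughly 115x, which corresponds to a parallel efficiency of about 45% on 256 cores, and the I/O optimization amounts to a speedup of roughly 44x.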
Rights
Atribución-No Comercial-No Derivadas 3.0 España