Machine learning analysis of TCGA cancer data

Liñares Blanco, Jose; Pazos, A.; Fernández-Lozano, Carlos

dc.contributor.author	Liñares Blanco, Jose
dc.contributor.author	Pazos, A.
dc.contributor.author	Fernández-Lozano, Carlos
dc.date.accessioned	2021-10-14T14:10:04Z
dc.date.available	2021-10-14T14:10:04Z
dc.date.issued	2021
dc.identifier.citation	Liñares-Blanco J, Pazos A, Fernandez-Lozano C. 2021. Machine learning analysis of TCGA cancer data. PeerJ Computer Science 7:e584 https://doi.org/10.7717/peerj-cs.584	es_ES
dc.identifier.uri	http://hdl.handle.net/2183/28634
dc.description.abstract	[Abstract] In recent years, machine learning (ML) researchers have changed their focus towards biological problems that are difficult to analyse with standard approaches. Large initiatives such as The Cancer Genome Atlas (TCGA) have allowed the use of omic data for the training of these algorithms. In order to study the state of the art, this review covers the main works that have used ML with TCGA data. Firstly, the principal discoveries made by the TCGA consortium are presented. Once these bases have been established, we begin with the main objective of this study, the identification and discussion of those works that have used the TCGA data for the training of different ML approaches. After a review of more than 100 different papers, it has been possible to make a classification according to the following three pillars: the type of tumour, the type of algorithm and the predicted biological problem. One of the conclusions drawn in this work shows a high density of studies based on two major algorithms: Random Forest and Support Vector Machines. We also observe the rise in the use of deep artificial neural networks. It is worth emphasizing the increase in integrative models of multi-omic data analysis. The different biological conditions are a consequence of molecular homeostasis, driven by protein-coding regions, regulatory elements and the surrounding environment. It is notable that a large number of works make use of gene expression data, which has been found to be the preferred data type among researchers when training the different models. The biological problems addressed have been classified into five types: prognosis prediction, tumour subtypes, microsatellite instability (MSI), immunological aspects and certain pathways of interest. A clear trend was detected in the prediction of these conditions according to the type of tumour. For this reason, a greater number of works have focused on the BRCA cohort, while specific works for survival, for example, were centred on the GBM cohort, due to its large number of events. Throughout this review, it will be possible to go in depth into the works and the methodologies used to study TCGA cancer data. Finally, it is intended that this work will serve as a basis for future research in this field of study.	es_ES
dc.description.sponsorship	This work was supported by the “Collaborative Project in Genomic Data Integration (CICLOGEN)” PI17/01826 funded by the Carlos III Health Institute from the Spanish National plan for Scientific and Technical Research and Innovation 2013–2016 and the European Regional Development Funds (FEDER)—“A way to build Europe.” and the General Directorate of Culture, Education and University Management of Xunta de Galicia (Ref. ED431D 2017/16), the “Galician Network for Colorectal Cancer Research” (Ref. ED431D 2017/23) and Competitive Reference Groups (Ref. ED431C 2018/49). CITIC, as a Research Center accredited by the Galician University System, is funded by “Consellería de Cultura, Educación e Universidades from Xunta de Galicia”, 80% supported through ERDF Funds, ERDF Operational Programme Galicia 2014–2020, and the remaining 20% by “Secretaría Xeral de Universidades” (Grant ED431G 2019/01). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
dc.description.sponsorship	Xunta de Galicia; ED431D 2017/16
dc.description.sponsorship	Xunta de Galicia; ED431D 2017/23
dc.description.sponsorship	Xunta de Galicia; ED431C 2018/49
dc.description.sponsorship	Xunta de Galicia; ED431G 2019/01
dc.language.iso	eng	es_ES
dc.publisher	PeerJ Inc.	es_ES
dc.relation	info:eu-repo/grantAgreement/ISCIII/Plan Estatal de Investigación Científica y Técnica y de Innovación 2013–2016/PI17%2F01826/ES/PROYECTO COLABORATIVO DE INTEGRACION DE DATOS GENOMICOS (CICLOGEN). TECNICAS DE DATA MINING Y DOCKING MOLECULAR PARA ANALISIS DE DATOS INTEGRATIVOS EN CANCER DE COLON/
dc.relation.uri	https://doi.org/10.7717/peerj-cs.584	es_ES
dc.rights	Attribution 4.0 International	es_ES
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/	*
dc.subject	Cancer	es_ES
dc.subject	Machine learning	es_ES
dc.subject	TCGA	es_ES
dc.subject	Multi-omics	es_ES
dc.subject	Data integration	es_ES
dc.subject	BRCA	es_ES
dc.subject	Random forest	es_ES
dc.subject	Support vector machines	es_ES
dc.title	Machine learning analysis of TCGA cancer data	es_ES
dc.type	info:eu-repo/semantics/article	es_ES
dc.rights.access	info:eu-repo/semantics/openAccess	es_ES
UDC.journalTitle	PeerJ Computer Science	es_ES
UDC.volume	7	es_ES
UDC.startPage	e584	es_ES
dc.identifier.doi	10.7717/peerj-cs.584

Files in this item

Name: license_rdf
Size: 1.337Kb
Format: application/rdf+xml

Name: Linares_Blanco_J_2021_Machine_ ...
Size: 14.88Mb
Format: PDF

This item appears in the following collection(s)

GI-RNASA - Artigos [195]
