Soft-404 Pages, A Crawling Problem

UDC.coleccion: Investigación
UDC.departamento: Ciencias da Computación e Tecnoloxías da Información
UDC.endPage: 92
UDC.grupoInv: Telemática
UDC.issue: 2
UDC.journalTitle: Journal of Digital Information Management
UDC.startPage: 73
UDC.volume: 12
dc.contributor.author: Prieto Álvarez, Víctor Manuel
dc.contributor.author: Álvarez Díaz, Manuel
dc.contributor.author: Cacheda, Fidel
dc.date.accessioned: 2024-01-22T13:39:26Z
dc.date.available: 2024-01-22T13:39:26Z
dc.date.issued: 2014-04
dc.description.abstract: [Abstract] During their traversal of the Web, crawler systems have to deal with multiple challenges. Some of these relate to detecting garbage content, so that resources are not wasted processing it. Soft-404 pages are a type of garbage content generated when web servers do not use the appropriate HTTP response code for dead links, causing them to be incorrectly identified. Our analysis of the Web has revealed that 7.35% of web servers send a 200 HTTP code, instead of a 404 code (which indicates that the document was not found), when they receive a request for an unknown document. This paper presents a system called Soft404Detector, which uses web content analysis to identify Soft-404 pages. Our system uses a set of content-based heuristics and combines them with a C4.5 classifier. For testing purposes, we built a dataset of Soft-404 pages. Our experiments indicate that the system is very effective, achieving a precision of 0.992 and a recall of 0.980 on Soft-404 pages.
dc.description.sponsorship: This research was supported by Xunta de Galicia CN2012/211, the Ministry of Education and Science of Spain and FEDER funds of the European Union (Project TIN2009-14203).
dc.description.sponsorship: Xunta de Galicia; CN2012/211
dc.identifier.citation: V. M. Prieto, M. Álvarez, and F. Cacheda, “Soft-404 Pages, A Crawling Problem,” J. Digit. Inf. Manag., vol. 12, no. 2, pp. 73–92, 2014. Accessed: Jan. 22, 2024. [Online]. Available: http://www.dline.info/fpaper/jdim/v12i2/2.pdf
dc.identifier.issn: 0972-7272
dc.identifier.uri: http://hdl.handle.net/2183/35046
dc.language.iso: eng
dc.publisher: Society for Information Organization in India
dc.relation.projectID: info:eu-repo/grantAgreement/MICINN/Plan Nacional de I+D+i 2008-2011/TIN2009-14203/ES/MODELOS Y TECNICAS PARA LA CONSTRUCCION DE APLICACIONES ¿MASHUP BASADAS EN INTELIGENCIA COLECTIVA
dc.relation.uri: http://www.dline.info/fpaper/jdim/v12i2/2.pdf
dc.rights.accessRights: open access
dc.subject: Soft-404 Error
dc.subject: Web Spam
dc.subject: Web Decay
dc.subject: Link Analysis
dc.subject: Data Mining
dc.subject: Statistical Properties of the Web
dc.title: Soft-404 Pages, A Crawling Problem
dc.type: journal article

Files

Original bundle

Name: PrietoVictor_2014_Soft_404_pages_crawling_problem.pdf
Size: 600.55 KB
Format: Adobe Portable Document Format