Soft-404 Pages, A Crawling Problem
Use this link to cite
http://hdl.handle.net/2183/35046
Metadata
Title
Soft-404 Pages, A Crawling Problem
Date
2014-04
Bibliographic citation
V. M. Prieto, M. Álvarez, and F. Cacheda, “Soft-404 Pages, A Crawling Problem.,” J. Digit. Inf. Manag., vol. 12, no. 2, pp. 73–92, 2014, Accessed: Jan. 22, 2024. [Online]. Available: http://www.dline.info/fpaper/jdim/v12i2/2.pdf
Abstract
During their traversal of the Web, crawler systems have to deal
with multiple challenges. Some of them are related to detecting
garbage content to avoid wasting resources processing it.
Soft-404 pages are a type of garbage content generated when some
web servers do not use the appropriate HTTP response code for
dead links, causing them to be incorrectly identified. Our
analysis of the Web has revealed that 7.35% of web servers send
a 200 HTTP code when a request for an unknown document is
received, instead of a 404 code, which indicates that the
document is not found. This paper presents a system called
Soft404Detector, based on web content analysis, to identify web
pages that are Soft-404 pages. Our system uses a set of
content-based heuristics and combines them with a C4.5
classifier. For testing purposes, we built a Soft-404 pages
dataset. Our experiments indicate that our system is very
effective, achieving a precision of 0.992 and a recall of 0.980
on Soft-404 pages.
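The server behaviour described in the abstract (a 200 code returned for an unknown document) can be illustrated with a small probe. The following is a minimal sketch under stated assumptions, not the paper's Soft404Detector, which relies on content-based heuristics and a C4.5 classifier. The function names `random_probe_url`, `http_status`, and `answers_200_for_unknown` are hypothetical, and the idea of requesting a random path is an assumption made here for illustration.

```python
import uuid
from typing import Callable
from urllib.error import HTTPError
from urllib.parse import urljoin
from urllib.request import urlopen


def random_probe_url(base_url: str) -> str:
    """Build a URL on the same host that almost certainly does not exist."""
    # A random 32-character hex name is very unlikely to match a real document.
    return urljoin(base_url, "/" + uuid.uuid4().hex + ".html")


def http_status(url: str) -> int:
    """Return the final HTTP status code of a GET request (redirects are followed)."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.status
    except HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses; the code is still the answer.
        return err.code
    # Connection-level failures (URLError, timeouts) are deliberately not caught here.


def answers_200_for_unknown(
    base_url: str,
    fetch_status: Callable[[str], int] = http_status,
) -> bool:
    """Return True if the server replies 200 to a request for an unknown document."""
    return fetch_status(random_probe_url(base_url)) == 200
```

A server for which `answers_200_for_unknown` returns True is only a candidate for Soft-404 behaviour. Deciding whether an individual page is a Soft-404 page still requires looking at its content, which is the problem the paper addresses.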
Keywords
Soft-404 Error
Web Spam
Web Decay
Link Analysis
Data Mining
Statistical Properties of the Web
Publisher version
ISSN
0972-7272