Soft-404 Pages, A Crawling Problem
Use this link to cite: http://hdl.handle.net/2183/35046
Bibliographic citation
V. M. Prieto, M. Álvarez, and F. Cacheda, “Soft-404 Pages, A Crawling Problem,” J. Digit. Inf. Manag., vol. 12, no. 2, pp. 73–92, 2014. Accessed: Jan. 22, 2024. [Online]. Available: http://www.dline.info/fpaper/jdim/v12i2/2.pdf
Abstract
While traversing the Web, crawler systems have to deal
with multiple challenges. Some of them are related to
detecting garbage content, so that resources are not
wasted processing it. Soft-404 pages are a type of garbage
content generated when a web server does not return the
appropriate HTTP response code for a dead link, so the
missing page is incorrectly identified as a valid one. Our
analysis of the Web has revealed that 7.35% of web servers
send a 200 HTTP code when a request for an unknown
document is received, instead of a 404 code, which
indicates that the document was not found. This paper
presents a system called Soft404Detector, which is based
on web content analysis and identifies web pages that are
Soft-404 pages. Our system uses a set of content-based
heuristics and combines them with a C4.5 classifier. For
testing purposes, we built a dataset of Soft-404 pages.
Our experiments indicate that our system is very
effective, achieving a precision of 0.992 and a recall of
0.980 on Soft-404 pages.
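
The server behaviour described in the abstract (a 200 code returned for an unknown document) can be illustrated with a small probe. The sketch below is not the paper's Soft404Detector, which relies on content-based heuristics and a C4.5 classifier. It only shows one simple way to check whether a server answers 200 for a document that almost certainly does not exist. The function names (`random_probe_url`, `classify_probe`, `probe_server`, `http_status`) and the label strings are assumptions made for this illustration.

```python
import secrets
import urllib.error
import urllib.request
from typing import Callable
from urllib.parse import urljoin


def random_probe_url(base_url: str, length: int = 24) -> str:
    """Build a URL on the same host that almost certainly does not exist."""
    token = secrets.token_hex(length // 2)  # `length` random hex characters
    return urljoin(base_url, f"/{token}.html")


def classify_probe(status_code: int) -> str:
    """Label the status code returned for a request to an unknown document."""
    if status_code in (404, 410):
        return "hard-404"          # the server reports the missing page correctly
    if 200 <= status_code < 300:
        return "soft-404-suspect"  # success code for a page that should not exist
    return "other"                 # e.g. 5xx errors; not decided by this sketch


def probe_server(base_url: str, fetch_status: Callable[[str], int]) -> str:
    """Request a random, unknown document and classify the server's answer.

    `fetch_status` is injected so that the logic can be exercised offline.
    """
    return classify_probe(fetch_status(random_probe_url(base_url)))


def http_status(url: str) -> int:
    """One possible `fetch_status`: final status code using only the stdlib."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # urlopen raises on 4xx/5xx; the code is on the error


if __name__ == "__main__":
    # Offline demonstration with a stub that always answers 200.
    print(probe_server("http://example.com/", lambda url: 200))
```

A crawler could call `probe_server(site, http_status)` once per host. Because `urlopen` follows redirects, a server that redirects unknown URLs to its home page would also be reported as a suspect here. A probe of this kind only characterises the server, whereas deciding whether an individual page is a Soft-404 page requires looking at its content, which is the problem the paper addresses.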






