Soft-404 Pages, A Crawling Problem

Authors

Prieto Álvarez, Víctor Manuel

Bibliographic citation

V. M. Prieto, M. Álvarez, and F. Cacheda, “Soft-404 Pages, A Crawling Problem,” J. Digit. Inf. Manag., vol. 12, no. 2, pp. 73–92, 2014, Accessed: Jan. 22, 2024. [Online]. Available: http://www.dline.info/fpaper/jdim/v12i2/2.pdf

Abstract

[Abstract]: While traversing the Web, crawler systems have to deal with multiple challenges. Some of these involve detecting garbage content so that resources are not wasted processing it. Soft-404 pages are a type of garbage content generated when a web server does not use the appropriate HTTP response code for dead links, causing them to be incorrectly identified as valid pages. Our analysis of the Web has revealed that 7.35% of web servers send a 200 HTTP code when a request for an unknown document is received, instead of a 404 code, which indicates that the document was not found. This paper presents a system called Soft404Detector, which uses web content analysis to identify Soft-404 pages. Our system uses a set of content-based heuristics and combines them with a C4.5 classifier. For testing purposes, we built a dataset of Soft-404 pages. Our experiments indicate that our system is very effective, achieving a precision of 0.992 and a recall of 0.980 on Soft-404 pages.
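
The server behaviour measured in the abstract (a 200 code returned for an unknown document) can be illustrated with a small probe. This is a minimal sketch and not the Soft404Detector system, which relies on content-based heuristics and a C4.5 classifier. The names `random_probe_url`, `server_masks_missing_pages`, and `fetch_status` are hypothetical and chosen here for illustration. The fetcher is passed in as a parameter so the logic can be exercised without network access.

```python
import uuid
from typing import Callable
from urllib.parse import urljoin

# Hypothetical fetcher type (an assumption of this sketch):
# takes a URL and returns the HTTP status code the server sent.
Fetcher = Callable[[str], int]


def random_probe_url(base_url: str) -> str:
    """Build a URL on the same host that almost certainly does not exist."""
    # A random 32-character hex name is very unlikely to match a real document.
    return urljoin(base_url, "/" + uuid.uuid4().hex + ".html")


def server_masks_missing_pages(base_url: str, fetch_status: Fetcher) -> bool:
    """Return True if the server answers 200 for a document that should not exist."""
    return fetch_status(random_probe_url(base_url)) == 200
```

A server for which this check returns True is a candidate source of Soft-404 pages. The check alone cannot classify an individual page, which is the problem the content-based approach described in the abstract addresses.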
