Soft-404 Pages, A Crawling Problem
| UDC.coleccion | Investigación | es_ES |
| UDC.departamento | Ciencias da Computación e Tecnoloxías da Información | es_ES |
| UDC.endPage | 92 | es_ES |
| UDC.grupoInv | Telemática | es_ES |
| UDC.issue | 2 | es_ES |
| UDC.journalTitle | Journal of Digital Information Management | es_ES |
| UDC.startPage | 73 | es_ES |
| UDC.volume | 12 | es_ES |
| dc.contributor.author | Prieto Álvarez, Víctor Manuel | |
| dc.contributor.author | Álvarez Díaz, Manuel | |
| dc.contributor.author | Cacheda, Fidel | |
| dc.date.accessioned | 2024-01-22T13:39:26Z | |
| dc.date.available | 2024-01-22T13:39:26Z | |
| dc.date.issued | 2014-04 | |
| dc.description.abstract | [Abstract]: During their traversal of the Web, crawler systems have to deal with multiple challenges. Some of them are related to detecting garbage content to avoid wasting resources processing it. Soft-404 pages are a type of garbage content generated when some web servers do not use the appropriate HTTP response code for dead links, causing them to be incorrectly identified. Our analysis of the Web has revealed that 7.35% of web servers send a 200 HTTP code when a request for an unknown document is received, instead of a 404 code, which indicates that the document is not found. This paper presents a system called Soft404Detector, based on web content analysis, to identify web pages that are Soft-404 pages. Our system uses a set of content-based heuristics and combines them with a C4.5 classifier. For testing purposes, we built a Soft-404 pages dataset. Our experiments indicate that our system is very effective, achieving a precision of 0.992 and a recall of 0.980 on Soft-404 pages. | es_ES |
| dc.description.sponsorship | This research was supported by Xunta de Galicia CN2012/211, the Ministry of Education and Science of Spain and FEDER funds of the European Union (Project TIN2009-14203). | es_ES |
| dc.description.sponsorship | Xunta de Galicia; CN2012/211 | es_ES |
| dc.identifier.citation | V. M. Prieto, M. Álvarez, and F. Cacheda, “Soft-404 Pages, A Crawling Problem,” J. Digit. Inf. Manag., vol. 12, no. 2, pp. 73–92, 2014, Accessed: Jan. 22, 2024. [Online]. Available: http://www.dline.info/fpaper/jdim/v12i2/2.pdf | es_ES |
| dc.identifier.issn | 0972-7272 | |
| dc.identifier.uri | http://hdl.handle.net/2183/35046 | |
| dc.language.iso | eng | es_ES |
| dc.publisher | Society for Information Organization in India | es_ES |
| dc.relation.projectID | info:eu-repo/grantAgreement/MICINN/Plan Nacional de I+D+i 2008-2011/TIN2009-14203/ES/MODELOS Y TECNICAS PARA LA CONSTRUCCION DE APLICACIONES "MASHUP" BASADAS EN INTELIGENCIA COLECTIVA | es_ES |
| dc.relation.uri | http://www.dline.info/fpaper/jdim/v12i2/2.pdf | es_ES |
| dc.rights.accessRights | open access | es_ES |
| dc.subject | Soft-404 Error | es_ES |
| dc.subject | Web Spam | es_ES |
| dc.subject | Web Decay | es_ES |
| dc.subject | Link Analysis | es_ES |
| dc.subject | Data Mining | es_ES |
| dc.subject | Statistical Properties of the Web | es_ES |
| dc.title | Soft-404 Pages, A Crawling Problem | es_ES |
| dc.type | journal article | es_ES |
| dspace.entity.type | Publication | |
| relation.isAuthorOfPublication | 8fb413a7-b40a-48ad-861f-985d0492628e | |
| relation.isAuthorOfPublication | 63253cd0-b4ea-402a-b158-84417c75846a | |
| relation.isAuthorOfPublication.latestForDiscovery | 8fb413a7-b40a-48ad-861f-985d0492628e |
Files
Original bundle
- Name:
- PrietoVictor_2014_Soft_404_pages_crawling_problem.pdf
- Size:
- 600.55 KB
- Format:
- Adobe Portable Document Format