Universal indexes for highly repetitive document collections

Claude, Francisco; Fariña, Antonio; Martínez Prieto, Miguel A.; Navarro, Gonzalo

doi:10.1016/j.is.2016.04.002

dc.contributor.author	Claude, Francisco
dc.contributor.author	Fariña, Antonio
dc.contributor.author	Martínez Prieto, Miguel A.
dc.contributor.author	Navarro, Gonzalo
dc.date.accessioned	2017-02-22T15:43:18Z
dc.date.issued	2016-11
dc.identifier.citation	Francisco Claude, Antonio Fariña, Miguel A. Martínez-Prieto, Gonzalo Navarro, Universal indexes for highly repetitive document collections, Information Systems, Volume 61, October–November 2016, Pages 1-23, ISSN 0306-4379, http://dx.doi.org/10.1016/j.is.2016.04.002.	es_ES
dc.identifier.issn	0306-4379
dc.identifier.issn	1873-6076
dc.identifier.uri	http://hdl.handle.net/2183/18163
dc.description	The final publication is available via http://dx.doi.org/10.1016/j.is.2016.04.002	es_ES
dc.description.abstract	[Abstract] Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel–Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor that a clear versioning structure even exists. We also introduce compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet, they are orders of magnitude slower.	es_ES
dc.description.sponsorship	Chile. Fondo Nacional de Desarrollo Científico y Tecnológico; 1-140796	es_ES
dc.description.sponsorship	Ministerio de Economía y Competitividad; TIN2013-47090-C3-3-P	es_ES
dc.description.sponsorship	Ministerio de Economía y Competitividad; TIN2015-69951-R	es_ES
dc.description.sponsorship	Ministerio de Economía y Competitividad; TIN2013-46238-C4-3-R	es_ES
dc.description.sponsorship	Ministerio de Economía y Competitividad; IC1302	es_ES
dc.description.sponsorship	Xunta de Galicia; GRC2013/053	es_ES
dc.language.iso	eng	es_ES
dc.publisher	Elsevier Ltd	es_ES
dc.relation.uri	http://www.sciencedirect.com/science/article/pii/S0306437916301132	es_ES
dc.rights	Attribution-NonCommercial-NoDerivs 3.0 Spain	es_ES
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/es/	*
dc.subject	Repetitive collections	es_ES
dc.subject	Inverted index	es_ES
dc.subject	Self-index	es_ES
dc.title	Universal indexes for highly repetitive document collections	es_ES
dc.type	journal article	es_ES
dc.rights.accessRights	embargoed access	es_ES
dc.date.embargoEndDate	2018-11-30	es_ES
dc.date.embargoLift	2018-11-30
UDC.journalTitle	Information Systems	es_ES
UDC.volume	61	es_ES
UDC.startPage	1	es_ES
UDC.endPage	23	es_ES
dc.identifier.doi	10.1016/j.is.2016.04.002
UDC.coleccion	Investigación	es_ES
UDC.departamento	Ciencias da Computación e Tecnoloxías da Información	es_ES
UDC.grupoInv	Laboratorio de Bases de Datos (LBD)	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/EC/H2020/690941	es_ES
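
The abstract above contrasts plain gap-encoding of inverted lists with run-length compression of the differential lists. The following is only a minimal sketch of that general idea, not the authors' implementation: the function names (to_gaps, run_length_encode, decode) and the sample posting list are hypothetical, and document identifiers are assumed to be positive and strictly increasing. In a versioned collection, a word that occurs in many consecutive versions produces long runs of gap 1, which collapse into a single (gap, count) pair.

```python
from typing import List, Tuple


def to_gaps(postings: List[int]) -> List[int]:
    """Differential (gap) encoding: store each id as the difference from the previous one."""
    gaps = []
    prev = 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def run_length_encode(gaps: List[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive equal gaps into (gap, count) pairs."""
    runs: List[Tuple[int, int]] = []
    for gap in gaps:
        if runs and runs[-1][0] == gap:
            runs[-1] = (gap, runs[-1][1] + 1)
        else:
            runs.append((gap, 1))
    return runs


def decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Rebuild the original posting list from the (gap, count) pairs."""
    postings = []
    current = 0
    for gap, count in runs:
        for _ in range(count):
            current += gap
            postings.append(current)
    return postings


# Hypothetical posting list: the word occurs in versions 3..7 and 20..22.
postings = [3, 4, 5, 6, 7, 20, 21, 22]
gaps = to_gaps(postings)
runs = run_length_encode(gaps)
print(gaps)
print(runs)
```

Here eight gaps shrink to four pairs. The Lempel-Ziv and grammar-based variants mentioned in the abstract target repetitions that are not simple runs, and are not covered by this sketch.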

Files in this item

Name: license_rdf
Size: 1.203Kb
Format: application/rdf+xml

Name: 2016_Universal_Indexes_for_Hig ...
Size: 597.1Kb
Format: PDF

This item appears in the following collection(s)

Investigación (FIC) [1654]
