Beyond questions: Leveraging ColBERT for keyphrase search

Authors

Gabín, Jorge
Parapar, Javier
Macdonald, Craig

Bibliographic citation

J. Gabín, J. Parapar, and C. Macdonald, "Beyond questions: Leveraging ColBERT for keyphrase search", Information Processing & Management, Vol. 63, Issue 2, Part B, March 2026, 104480, https://doi.org/10.1016/j.ipm.2025.104480

Abstract

While question-like queries are gaining popularity, keyphrase search remains the cornerstone of web search and of specialised domains such as academic and professional search. However, current dense retrieval models often fail with keyphrase-like queries, primarily because they are mostly trained on question-like ones. This paper introduces a novel model that employs the ColBERT architecture to enhance document ranking for keyphrase queries. To that end, given the lack of large keyphrase-based retrieval datasets, we first explore how Large Language Models can convert question-like queries into keyphrase format. Then, using those keyphrases, we train a keyphrase-based ColBERT ranker (ColBERTKP_QD) to improve performance on keyphrase queries. Furthermore, to make the model more flexible, allowing the use of either the question or the keyphrase encoder depending on the query type, we investigate the feasibility of training only a keyphrase query encoder while keeping the document encoder weights static (ColBERTKP_Q). We assess our proposals' ranking performance using both automatically generated and manually annotated keyphrases. Our results reveal the potential of the late interaction architecture in the keyphrase search scenario.
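
For readers unfamiliar with the late interaction architecture mentioned in the abstract, the following is a minimal sketch of ColBERT-style MaxSim scoring: each query token embedding is matched against its most similar document token embedding, and the per-token maxima are summed. It is an illustration under stated assumptions, not the authors' implementation. The function names `normalize` and `maxsim_score` and the toy two-dimensional vectors are hypothetical; a real system would obtain the embeddings from trained query and document encoders.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def maxsim_score(query_vecs, doc_vecs):
    """Late interaction score: sum over query tokens of the best cosine match in the document.

    query_vecs and doc_vecs are lists of token embeddings (lists of floats).
    Assumption: every embedding is non-zero and all share the same dimension.
    """
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    score = 0.0
    for qv in q:
        # Best-matching document token for this query token.
        score += max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
    return score


# Toy embeddings (hypothetical): a two-token keyphrase query and two documents.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # covers both query tokens
doc_b = [[1.0, 0.0]]                          # covers only the first query token

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

In this toy case `doc_a` scores higher than `doc_b` because it contains a close match for every query token. Since the score depends on the document only through its token embeddings, a document index built once can in principle be reused with a different query encoder, which is the kind of setup the abstract describes when the document encoder weights are kept static.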

Description

This study’s code and generated resources are available at https://github.com/JorgeGabin/ColBERTKP.

Rights

Attribution-NonCommercial 4.0 International

Except where otherwise noted, this item's license is described as Attribution-NonCommercial 4.0 International