Comparing neural- and N-gram-based language models for word segmentation

Doval, Yerai; Gómez-Rodríguez, Carlos

dc.contributor.author	Doval, Yerai
dc.contributor.author	Gómez-Rodríguez, Carlos
dc.date.accessioned	2024-06-19T08:58:45Z
dc.date.available	2024-06-19T08:58:45Z
dc.date.issued	2019-02
dc.identifier.citation	Y. Doval and C. Gómez-Rodríguez, "Comparing neural- and N-gram-based language models for word segmentation", Journal of the Association for Information Science and Technology, Vol. 70, Issue 2, pp. 187-197, Feb. 2019, doi: 10.1002/asi.24082	es_ES
dc.identifier.uri	http://hdl.handle.net/2183/37129
dc.description.abstract	[Abstract]: Word segmentation is the task of inserting or deleting word boundary characters in order to separate character sequences that correspond to words in some language. In this article we propose an approach based on a beam search algorithm and a language model working at the byte/character level, the latter component implemented either as an n-gram model or a recurrent neural network. The resulting system analyzes input text that has no word boundaries one token (a character or a byte) at a time, and uses the information gathered by the language model to determine whether a boundary must be placed at the current position. Our aim is to use this system in a preprocessing step for a microtext normalization system. This means that it needs to effectively cope with the data sparsity present in this kind of text. We also strove to surpass the performance of two readily available word segmentation systems: the well-known and accessible Word Breaker by Microsoft, and the Python module WordSegment by Grant Jenks. The results show that we have met our objectives, and we hope to continue to improve both the precision and the efficiency of our system in the future. © 2018 The Authors. Journal of the Association for Information Science and Technology published by Wiley Periodicals, Inc. on behalf of ASIS&T.	es_ES
dc.description.sponsorship	This research received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement no. 714150-FASTPARSE). It was partially funded by the Spanish Ministry of Economy and Competitiveness (MINECO) through projects FFI2014-51978-C2-1-R and FFI2014-51978-C2-2-R, and by the Autonomous Government of Galicia through both the Galician Network for Lexicography-RELEX (ED431D R2016/046) and Grant ED431B-2017/01. Moreover, Yerai Doval is funded by the Spanish State Secretariat for Research, Development and Innovation (which belongs to MINECO) and by the European Social Fund (ESF) under an FPI fellowship (BES-2015-073768) associated with project FFI2014-51978-C2-1-R. We gratefully acknowledge NVIDIA Corporation for the donation of a GTX Titan X GPU used for this research.	es_ES
dc.description.sponsorship	Xunta de Galicia; ED431D R2016/046	es_ES
dc.description.sponsorship	Xunta de Galicia; ED431B-2017/01	es_ES
dc.language.iso	eng	es_ES
dc.publisher	John Wiley and Sons Inc.	es_ES
dc.relation	info:eu-repo/grantAgreement/EC/H2020/714150	es_ES
dc.relation	info:eu-repo/grantAgreement/MINECO/Plan Estatal de Investigación Científica y Técnica y de Innovación 2013-2016/FFI2014-51978-C2-1-R/ES/TECNOLOGIAS DE LA LENGUA PARA ANALISIS DE OPINIONES EN REDES SOCIALES	es_ES
dc.relation	info:eu-repo/grantAgreement/MINECO/Plan Estatal de Investigación Científica y Técnica y de Innovación 2013-2016/FFI2014-51978-C2-2-R/ES/TECNOLOGIAS DE LA LENGUA PARA ANALISIS DE OPINIONES EN REDES SOCIALES: DEL TEXTO AL MICROTEXTO	es_ES
dc.relation	info:eu-repo/grantAgreement/MINECO/Plan Estatal de Investigación Científica y Técnica y de Innovación 2013-2016/BES-2015-073768/ES/	es_ES
dc.relation.uri	https://doi.org/10.1002/asi.24082	es_ES
dc.rights	Attribution 4.0 International (CC BY)	es_ES
dc.rights.uri	http://creativecommons.org/licenses/by/3.0/es/	*
dc.subject	Recurrent neural networks	es_ES
dc.subject	Beam search algorithms	es_ES
dc.subject	Data sparsity	es_ES
dc.subject	Language model	es_ES
dc.subject	Word segmentation	es_ES
dc.title	Comparing neural- and N-gram-based language models for word segmentation	es_ES
dc.type	info:eu-repo/semantics/article	es_ES
dc.rights.access	info:eu-repo/semantics/openAccess	es_ES
UDC.journalTitle	Journal of the Association for Information Science and Technology	es_ES
UDC.volume	70	es_ES
UDC.issue	2	es_ES
UDC.startPage	187	es_ES
UDC.endPage	197	es_ES
dc.identifier.doi	10.1002/asi.24082
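
The approach summarized in the abstract (a beam search that decides, one character at a time, whether to place a word boundary, scored by a character-level language model) can be illustrated with a minimal sketch. It assumes a toy character bigram model with add-one smoothing in place of the n-gram and recurrent neural network models used in the article. The function names (`train_char_bigram`, `log_prob`, `segment`), the smoothing scheme, and the beam width are hypothetical choices made for illustration, not the authors' implementation.

```python
import math
from collections import defaultdict


def train_char_bigram(corpus):
    """Count character bigrams (spaces included) over a list of sentences."""
    counts = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for sentence in corpus:
        padded = " " + sentence + " "  # treat sentence edges as boundaries
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            counts[prev][cur] += 1
    return counts, vocab


def log_prob(counts, vocab, prev, cur):
    """Add-one smoothed log P(cur | prev); an assumption, not the paper's model."""
    total = sum(counts[prev].values())
    return math.log((counts[prev][cur] + 1) / (total + len(vocab)))


def segment(text, counts, vocab, beam_width=8):
    """Insert spaces into `text` (which has none) using beam search."""
    beams = [(0.0, " ")]  # (log score, output so far); starts after a boundary
    for ch in text:
        candidates = []
        for score, out in beams:
            prev = out[-1]
            # Hypothesis 1: no boundary before this character.
            candidates.append((score + log_prob(counts, vocab, prev, ch), out + ch))
            # Hypothesis 2: boundary before this character (never two in a row).
            if prev != " ":
                s = (score
                     + log_prob(counts, vocab, prev, " ")
                     + log_prob(counts, vocab, " ", ch))
                candidates.append((s, out + " " + ch))
        # Keep only the best-scoring partial segmentations.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    # Score the closing boundary of the last word before picking a winner.
    best = max(beams, key=lambda b: b[0] + log_prob(counts, vocab, b[1][-1], " "))
    return best[1].strip()


if __name__ == "__main__":
    counts, vocab = train_char_bigram(["the cat sat", "the dog ran"])
    print(segment("thecatsat", counts, vocab))
```

With so little training data the output is not expected to be a good segmentation; the point is only the control flow, in which each input character spawns a "boundary" and a "no boundary" hypothesis and the language model score decides which hypotheses survive in the beam.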

Files in this item

Name: GomezRodriguez_Carlos_2019_Com ...
Size: 961.7Kb
Format: PDF

Name: license_rdf
Size: 1.337Kb
Format: application/rdf+xml

This item appears in the following collection(s)

OpenAIRE [357]
GI-LYS - Artigos [51]