Optimization of Regular Expression Evaluation within the Manatee Corpus Management System

Authors

JAKUBÍČEK Miloš, RYCHLÝ Pavel

Year of publication: 2014
Type: Article in proceedings
Conference: Eighth Workshop on Recent Advances in Slavonic Natural Language Processing
MU Faculty or unit

Faculty of Informatics

Field: Informatics
Keywords: text corpus; regular expression; Manatee
Description: This paper is concerned with searching large text corpora, that is, electronic collections of texts. These are often queried by means of regular expressions. Such queries go beyond a simple keyword search, which can be evaluated quickly using an inverted index; they are usually processed by third-party regular expression libraries and take significantly longer to evaluate. In this paper we present an index-based approach to optimizing regular expression evaluation that we call n-gram prefetching. It is based on the assumption that most regular expression queries on text corpora contain at least some fixed string portions, which serve as clues for heuristics that prune the set of potentially matching strings. The presented work has been designed and implemented within the Manatee corpus management system. An evaluation on a test set of queries executed on a number of billion-word text corpora shows that the proposed approach can significantly speed up regular expression processing.
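The general idea of pruning candidates with an n-gram index can be illustrated with a minimal Python sketch. It is not the Manatee implementation: the trigram length, the names `build_index`, `fixed_fragments`, `candidate_ids` and `search`, and the restriction to patterns made only of literal characters, `.` and `.*` are all assumptions made for this illustration. Any pattern outside that restricted form simply falls back to a full scan.

```python
import re
from collections import defaultdict

N = 3  # n-gram length; an assumption of this sketch, not taken from the paper
META = set("\\^$*+?{}[]()|")  # metacharacters this sketch does not analyse


def ngrams(text, n=N):
    """Return the set of character n-grams of text (empty if text is too short)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(lexicon, n=N):
    """Map every n-gram to the ids of the lexicon strings that contain it."""
    index = defaultdict(set)
    for string_id, string in enumerate(lexicon):
        for gram in ngrams(string, n):
            index[gram].add(string_id)
    return index


def fixed_fragments(pattern):
    """Return literal fragments every match must contain, or None if the
    pattern uses constructs beyond literals, '.' and '.*'."""
    parts = pattern.replace(".*", ".").split(".")
    if any(ch in META for part in parts for ch in part):
        return None  # e.g. 'b*' or '(us|ora)': too complex for this sketch
    return [part for part in parts if part]


def candidate_ids(pattern, index, n=N):
    """Intersect the posting sets of all n-grams found in the fixed fragments.
    Returns None when the pattern offers no usable clue."""
    fragments = fixed_fragments(pattern)
    if fragments is None:
        return None
    candidates = None
    for fragment in fragments:
        for gram in ngrams(fragment, n):
            postings = index.get(gram, set())
            candidates = postings if candidates is None else candidates & postings
    return candidates


def search(pattern, lexicon, index, n=N):
    """Run the regular expression only on the prefetched candidates."""
    regex = re.compile(pattern)
    candidates = candidate_ids(pattern, index, n)
    if candidates is None:
        candidates = range(len(lexicon))  # no clue: fall back to a full scan
    return [lexicon[i] for i in sorted(candidates) if regex.fullmatch(lexicon[i])]


lexicon = ["preparation", "separation", "operation", "operator",
           "station", "rational", "corpus", "corpora"]
index = build_index(lexicon)
print(search(".*ration", lexicon, index))
```

In this toy lexicon the query `.*ration` yields four candidates, because "rational" also contains every trigram of "ration"; the regular expression is then evaluated on those four strings instead of all eight. The index therefore only narrows the search, and the final regular expression check remains necessary for correctness.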