Segmentation from 97% to 100%: Is It Time for Some Linguistics?

Warning

This publication belongs to the Faculty of Informatics, not the Faculty of Education. The official publication page is on the muni.cz website.
Title in Czech Segmentace z 97% na 100%: není čas pro trochu lingvistiky?
Authors

SOJKA Petr

Year of publication 2012
Type Article in proceedings
Conference Sixth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2012
MU Faculty or unit

Faculty of Informatics

Field Informatics
Keywords competing patterns;segmentation;hyphenation;NP problems;pattern generation;patgen;context-sensitive patterns;machine learning;natural language engineering;EuDML
Description Many tasks in natural language processing (NLP) require segmentation algorithms: splitting paragraphs into sentences, splitting sentences into words (needed in languages such as Chinese or Thai), and splitting words into syllables (hyphenation) or into morphological parts (e.g. extracting a word stem for indexing). Many other tasks (e.g. tagging) can also be formulated as segmentation problems. We evaluate the methodology of using competing patterns for these tasks and determine the complexity of creating space-optimal (minimal) patterns that implement the segmentation task completely (100%). We formally define this task and prove that it is in the class of non-polynomial optimization problems. However, finding space-efficient competing patterns for real NLP tasks is feasible and gives efficient, scalable solutions to the segmentation task: segmentation is done in constant time with respect to the size of the segmented dictionary. Constant-time access to segmentations makes competing patterns an attractive data structure for many NLP tasks.
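
To make the idea of competing patterns more concrete, the following is a minimal Python sketch of how such patterns might be applied to a word, in the style of Liang's hyphenation patterns used by TeX. It is not taken from the paper: the function names (`parse_pattern`, `build_table`, `segment`) are hypothetical, and the pattern list is a tiny hand-picked set for one word, not the output of patgen for any real dictionary. Digits in a pattern are levels between letters; where several patterns overlap, the highest level wins, and an odd final level marks a segmentation point.

```python
import re

# Illustrative patterns only (assumption): enough to segment one example word.
PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]


def parse_pattern(pattern):
    """Split a pattern such as 'hy3ph' into its letters and inter-letter levels."""
    letters = re.sub(r"\d", "", pattern)
    levels = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            levels[pos] = int(ch)  # level sits before the letter at index pos
        else:
            pos += 1
    return letters, levels


def build_table(patterns):
    """Map each pattern's letters to its list of levels for hash lookup."""
    return dict(parse_pattern(p) for p in patterns)


def segment(word, table, sep="-"):
    """Apply all matching patterns; the maximum level wins, odd levels split."""
    padded = "." + word.lower() + "."  # dots mark the word boundaries
    levels = [0] * (len(padded) + 1)
    max_len = max(map(len, table), default=0)
    for start in range(len(padded)):
        # Only substrings up to the longest pattern need to be looked up.
        for end in range(start + 1, min(start + max_len, len(padded)) + 1):
            hit = table.get(padded[start:end])
            if hit:
                for offset, level in enumerate(hit):
                    levels[start + offset] = max(levels[start + offset], level)
    pieces, last = [], 0
    # Position k in the padded word lies before word[k - 1]; skip the word edges.
    for k in range(2, len(word) + 1):
        if levels[k] % 2 == 1:
            pieces.append(word[last:k - 1])
            last = k - 1
    pieces.append(word[last:])
    return sep.join(pieces)


TABLE = build_table(PATTERNS)
print(segment("hyphenation", TABLE))  # hy-phen-ation
```

In this sketch the number of table lookups depends on the word length and the longest pattern, not on how many patterns or dictionary words there are, which is one way to read the constant-time claim in the description.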