Segmentation from 97% to 100%: Is It Time for Some Linguistics?

Warning

This publication doesn't include Faculty of Education. It includes Faculty of Informatics. Official publication website can be found on muni.cz.
Authors

SOJKA Petr

Year of publication 2012
Type Article in Proceedings
Conference Sixth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2012
MU Faculty or unit

Faculty of Informatics

Citation
Web
Field Informatics
Keywords competing patterns;segmentation;hyphenation;NP problems;pattern generation;patgen;context-sensitive patterns;machine learning;natural language engineering;EuDML
Description Many tasks in natural language processing (NLP) require \emph{segmentation} algorithms: segmentation of paragraph into sentences, segmentation of sentences into words is needed in languages like Chinese or Thai, segmentation of words into syllables (\emph{hyphenation}) or into morphological parts (e.g.\ getting word stem for indexing), and many other tasks (e.g.\ tagging) could be formulated as segmentation problems. We evaluate methodology of using \emph{competing patterns} for these tasks and decide on the complexity of creation of space-optimal (minimal) patterns that completely (100\,\%) implement the segmentation task. We formally define this task and prove that it is in the class of \emph{non-polynomial} optimization problems. However, finding space-efficient competing patterns for real NLP tasks is feasible and gives efficient scalable solutions of segmentation task: segmentation is done in \emph{constant} time with respect to the size of segmented dictionary. Constant time of access to segmentations makes competing patterns attractive data structure for many NLP tasks.
Related projects:

You are running an old browser version. We recommend updating your browser to its latest version.