Over the last decade, several approaches have been developed in Information Extraction (IE) research that automatically construct (learn) extraction procedures, so-called wrappers. Wrappers allow documents to be interpreted and accessed like relational databases. They form a core component of future Intelligent Information Systems, since they allow the user to query, compare and combine information from various textual information sources. This thesis presents a Logic Programming and Inductive Logic Programming (ILP) framework for supervised learning of wrappers from positive examples only. In contrast to existing systems, which adapt some methods from the Artificial Intelligence subfield of Inductive Logic Programming, the machine learning method presented here follows a purely logical bottom-up learning approach under a new IE-ILP semantics. The learning approach for multi-slot extraction programs is independent of the chosen wrapper model and document view.
Three classes of Inductive Logic Programming algorithms are presented: two one-step learning algorithms, a set of iterative learning algorithms, and one algorithm combining clustering techniques with an iterative ILP algorithm.
Several extraction tasks are investigated and a formal definition of wrapper classes is given. Based on these wrapper classes, three wrapper models are presented using two different document representations: a sequential token representation and a DOM-related representation.
The introduced learning algorithms and wrapper models are evaluated on standard test cases and compared with related methods and machine-learning-based information extraction systems. For some of the single-slot extraction tasks, the implemented methods yield better results than the best state-of-the-art systems. Learned wrappers for multi-slot extraction tasks achieve promising quality scores that are competitive with the leading extraction systems.