Publication
SIGMOD/PODS 2009
Conference paper

Uncertainty management in rule-based information extraction systems

Abstract

Rule-based information extraction is a process by which structured objects are extracted from text based on user-defined rules. The compositional nature of rule-based information extraction also allows rules to be expressed over previously extracted objects. Such extraction is inherently uncertain, due to the varying precision associated with the rules used in a specific extraction task. Quantifying this uncertainty is crucial for querying the extracted objects in probabilistic databases, and for improving the recall of extraction tasks that use compositional rules. In this paper, we provide a probabilistic framework for handling the uncertainty in rule-based information extraction. Specifically, for each extraction task, we build a parametric exponential model of uncertainty that captures the interaction between the different rules, as well as the compositional nature of the rules; the exponential form of our model follows from maximum-entropy considerations. We also give model-decomposition techniques that make the learning algorithms scalable to large numbers of rules and constraints. Experiments over multiple real-world extraction tasks confirm that our approach yields accurate probability estimates with only a small performance overhead. Moreover, our framework supports incremental pay-as-you-go improvements in the accuracy of probability estimates as new rules, data, or constraints are added. © 2009 ACM.
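The link between maximum entropy and the exponential form can be sketched in general terms. The notation below is an illustrative assumption, not the paper's own: x ranges over possible outcomes, f_i are feature functions (for instance, indicators tied to individual rules), and c_i are target values for their expectations. Among all distributions that satisfy the constraints, the one with maximum entropy takes an exponential form with one parameter θ_i per constraint:

```latex
% Generic maximum-entropy problem (symbols are illustrative assumptions)
\max_{p} \; -\sum_{x} p(x) \log p(x)
\quad \text{subject to} \quad
\sum_{x} p(x)\, f_i(x) = c_i \;\; (i = 1, \dots, k),
\qquad \sum_{x} p(x) = 1 .

% Its solution has exponential form, with one parameter per constraint
p_{\theta}(x) = \frac{1}{Z(\theta)} \exp\!\Big( \sum_{i=1}^{k} \theta_i f_i(x) \Big),
\qquad
Z(\theta) = \sum_{x} \exp\!\Big( \sum_{i=1}^{k} \theta_i f_i(x) \Big).
```

Under this reading, learning amounts to choosing θ so that the model's expected feature values match the targets c_i. How the paper defines its features and constraints is not stated in the abstract.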
