Learning Explainable Entity Resolution Algorithms for Small Business Data using SystemER
Abstract
The 2019 FEIII CALI data challenge aims to link different representations of the same real-world entities across multiple public datasets that collect identification and activity data about small to medium enterprises (SMEs) in California. We formalize this challenge as a learning-based entity resolution (ER) task, the goal of which is to learn a high-precision and high-recall pairwise ER model that classifies small business entity pairs into matches and non-matches. Realistic ER tasks usually involve a pipeline of labor-intensive and error-prone steps, such as data preprocessing, gathering of training data, feature engineering, and model tuning. For this task, we apply an advanced human-in-the-loop system, named SystemER, to learn ER algorithms for SME entities. Powered by active learning and a carefully designed user interface, SystemER can learn high-quality explainable ER algorithms with low human effort, while achieving high accuracy on the datasets provided by the FEIII CALI data challenge.