Identifying high value opportunities for human in the loop lexicon expansion
Abstract
Many real-world analytics problems examine multiple entities or classes that may appear in a corpus. For example, in a customer satisfaction survey analysis there are over 60 categories of (somewhat overlapping) concerns. Each of these is backed by a lexicon of terminology associated with the concern (e.g., "Easy, user friendly process" or "Process confusing, too many handoffs"). These lexicons need to be expanded by a subject matter expert, as the terminology is not always straightforward (e.g., "handoffs" may also include "ping-pong" and "hot potato" as relevant terms). But given that subject matter expert time is costly, which of the 60+ lexicons should we expand first? We propose a metric for evaluating an existing set of lexicons and providing guidance on which are likely to benefit most from human-in-the-loop expansion. Using our ranking results, we achieved a 4× improvement in impact when expanding the first few lexicons from our suggested list, as compared to a random selection.