Industry-specific word embedding and its application in log classification
Abstract
Word, sentence, and document embeddings have become the cornerstone of most natural language processing-based solutions. Training an effective embedding depends on a large corpus of relevant documents. However, such a corpus is not always available, especially for specialized heavy industries such as oil, mining, or steel. To address this problem, this paper proposes a semi-supervised learning framework that creates a document corpus and embedding starting from an industry taxonomy, along with a very limited set of relevant positive and negative documents. Our solution organizes candidate documents into a graph and adopts different explore-and-exploit strategies to iteratively create the corpus and its embedding. At each iteration, two metrics, called Coverage and Context Similarity, are used as proxies to measure the quality of the results. Our experiments demonstrate that an embedding created by our solution is more effective than one created by processing thousands of industry-specific document pages. We also explore using our embedding in downstream tasks, such as building an industry-specific classification model given labeled training data, and classifying unlabeled documents according to industry taxonomy terms.