Exploiting latent information in relational databases via word embedding and application to degrees of disclosure
Abstract
Cognitive Databases are a new approach for enabling Artificial Intelligence (AI) capabilities as standard features within relational database systems. Relations are textified, and the resulting text is used to build a Word Embedding (WE) model that captures the latent relationships between database tokens of various data types. For each database token, the model includes a low-dimensional vector (say, of dimension 200) that encodes the token’s relationships with other tokens. The vectors are used within the existing SQL query infrastructure via user-defined functions (UDFs). Queries use the model vectors to express semantic similarity/dissimilarity, inductive reasoning, and analogies, and to seamlessly utilize knowledge from external sources such as Wikipedia and PubMed. WE enables novel capabilities such as the controlled disclosure of database information in a variety of ways. The degree of disclosure may depend on the sensitivity of the information and the recipient’s need to know; e.g., test results may be considered sensitive and should be openly disclosed only to the divisions concerned with them. Disclosure may be viewed as a new kind of controlled sharing of information for cooperation and integration purposes. Integrating WE methods into the database engine poses challenges that necessitate new techniques, and it also raises interesting theoretical problems concerning the coding power of WE.
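As a minimal sketch of the UDF idea described above, the following Python snippet registers a cosine-similarity function over token vectors in SQLite and uses it inside an ordinary SQL query. The table, tokens, and hand-made 3-dimensional vectors are all hypothetical stand-ins for a trained WE model; they are not taken from the paper.

```python
import math
import sqlite3

# Hypothetical toy vectors standing in for a trained WE model
# (a real system would learn ~200-dimensional vectors from the
# textified relations).
VECTORS = {
    "diabetes": [0.9, 0.1, 0.2],
    "insulin":  [0.8, 0.2, 0.3],
    "fracture": [0.1, 0.9, 0.1],
}

def cosine(a, b):
    # Standard cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def sim_udf(t1, t2):
    # UDF: semantic similarity of two database tokens
    # via their embedding vectors.
    return cosine(VECTORS[t1], VECTORS[t2])

conn = sqlite3.connect(":memory:")
conn.create_function("SIM", 2, sim_udf)

conn.execute("CREATE TABLE diag(pid INTEGER, condition TEXT)")
conn.executemany("INSERT INTO diag VALUES (?, ?)",
                 [(1, "diabetes"), (2, "fracture"), (3, "insulin")])

# Semantic-similarity query: conditions related to 'diabetes',
# even when the strings do not match textually.
rows = conn.execute(
    "SELECT pid, condition FROM diag "
    "WHERE SIM(condition, 'diabetes') > 0.8 ORDER BY pid"
).fetchall()
print(rows)  # insulin is retrieved by vector similarity, fracture is not
```

The same mechanism extends to dissimilarity, analogy, and disclosure-control predicates: each is just another UDF over the stored vectors, so the existing SQL engine needs no changes.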