On implementing a text-database-as-a-service
Abstract
The emergence of heterogeneous big data in the last decade calls for a hybrid data service that can manage different kinds of data, including relational, JSON, and text data, in a unified way. Among them, text data play an important role in many fields such as the Internet of Things, biology, and social networks. For example, a smart meter application that detects anomalies in electricity use might want to link each anomaly in a certain area to a meaningful social event mined from plain-text news. As a result, text data services have attracted increasing attention from the research community. Most such services are built on a content management system such as ElasticSearch or Solr. However, we found that content management capabilities alone are not enough. On the one hand, text data queries often require join operations with relational or JSON data held in an existing DBMS. On the other hand, users often have to export large volumes of text data to an independent system or service for further text analytics. In this paper, we present our Text-DataBase-as-a-Service (TDBaaS), which is built on top of the Hybrid Data Service (HDS) from IBM Research. The TDBaaS is designed to manage text data together with relational and JSON data in a single service. Basic text analytics can be conducted directly inside the database in the form of general SQL statements. Moreover, an extensible framework allows the service to offer rich text analytic capabilities with high performance. As a case study, we investigate the implementation of the top-k word algorithm and show how common computations are shared across different tenants in the TDBaaS. The experimental results demonstrate the high performance of the TDBaaS on both text data management and text data analytics.
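To make the idea of sharing common computations across tenants concrete, the following is a minimal sketch and not the TDBaaS implementation. It assumes that the shared computation is the per-document word count, that tenants may query overlapping sets of documents, and that ties are broken alphabetically. The names `SharedTermCounts` and `top_k_words` are hypothetical and introduced only for illustration.

```python
import re
from collections import Counter


class SharedTermCounts:
    """Hypothetical cache of per-document word counts, shared by all tenants."""

    def __init__(self):
        self._cache = {}
        self.computed = 0  # how many documents were actually tokenized

    def counts_for(self, doc_id, text):
        # Tokenize and count a document only the first time any tenant needs it.
        if doc_id not in self._cache:
            words = re.findall(r"[a-z0-9']+", text.lower())
            self._cache[doc_id] = Counter(words)
            self.computed += 1
        return self._cache[doc_id]


def top_k_words(shared, docs, k):
    """Return the k most frequent words over docs as (word, count) pairs.

    docs maps a document id to its text. Ties are broken alphabetically
    (an assumption made here so that the result is deterministic).
    """
    total = Counter()
    for doc_id, text in docs.items():
        total.update(shared.counts_for(doc_id, text))
    return sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

If two tenants issue top-k queries over overlapping document sets through the same `SharedTermCounts` instance, each overlapping document is tokenized once and its counts are reused, which is the kind of sharing the case study refers to; how the actual service identifies and schedules shared work is not specified here.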