Clinical and pharmacogenomic data mining: 1. Generalized theory of expected information and application to the development of tools
Abstract
New scientific problems arising from the human genome project are challenging the classical means of using statistics. Yet quantified knowledge in the form of rules and rule strengths based on real relationships in data, as opposed to expert opinion, is urgently required for researcher and physician decision support. The problem is that with many parameters, the space to be analyzed is high-dimensional. That is, the combinations of data to examine are subject to a combinatorial explosion as the number of possible events (entries, items, sub-records) (a), (b), (c), ... per record (a,b,c,...) increases, and hence much of the space is sparsely populated. These combinatorial considerations are particularly problematic for identifying those associations called "Unicorn Events", which occur so much less often than expected that they are never seen and hence never counted. To cope with the combinatorial explosion, a novel numerical "bookkeeping" approach is taken to generate information terms relating to the combinatorial subsets of events (a,b,c,...), and, most importantly, the ζ (Zeta) function is employed. The incomplete Zeta function ζ(s,n) with s = 1, in which frequencies of occurrence such as n = n(a,b,c,...) determine the range of summation, is argued to be the natural choice of information function. It emerges from Bayesian integration, taken over the distribution of possible values of information measures for sparse and ample data alike. Expected mutual information I(a;b;c) in nats (i.e., natural units analogous to bits but based on the natural logarithm), such as is available to the observer, is measured as, e.g., the difference ζ(s,o(a,b,c,...)) - ζ(s,e(a,b,c,...)), where o(a,b,c,...) and e(a,b,c,...) are, or relate to, the observed and expected frequencies of occurrence, respectively. For real values of s > 1, the qualitative impact of strongly (positively or negatively) ranked data is preserved despite several numerical approximations.
As real s increases, the output of the information functions converges to three values, +1, 0, and -1 nats, representing a trinary logic system. For quantitative data, a useful ad hoc method for reporting σ-normalized covariations in a manner analogous to mutual information, for significance comparison purposes, is demonstrated. Finally, the potential to make use of mutual information in a complex biomedical study, and to include Bayesian prior information derived from statistical, tabular, anecdotal, and expert opinion, is briefly illustrated.
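As a purely illustrative aid, the zeta-difference measure described in the abstract can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the incomplete Zeta function is taken to be the partial sum 1 + 2^-s + ... + n^-s (zero for n = 0), the frequencies are assumed to be non-negative integers used directly as the summation limits, and the function names `incomplete_zeta` and `expected_information` are hypothetical. The abstract says only that o and e "are, or relate to" the observed and expected frequencies, so any offsets or interpolation for non-integer expected frequencies are deliberately left out.

```python
import math


def incomplete_zeta(s: float, n: int) -> float:
    """Partial sum 1 + 2**-s + ... + n**-s (assumed form of zeta(s, n)).

    Returns 0.0 for n == 0, i.e. an event that was never counted.
    """
    if n < 0:
        raise ValueError("frequency n must be non-negative")
    return sum(k ** -s for k in range(1, n + 1))


def expected_information(observed: int, expected: int, s: float = 1.0) -> float:
    """Zeta difference zeta(s, observed) - zeta(s, expected), in nats for s = 1.

    Assumption: both frequencies are integers used directly as summation limits.
    """
    return incomplete_zeta(s, observed) - incomplete_zeta(s, expected)


if __name__ == "__main__":
    # Ample data, s = 1: the difference approaches log(observed / expected).
    print(expected_information(1000, 100), math.log(1000 / 100))

    # Sparse data, s = 1: small counts give a bounded, finite measure,
    # including an event expected but never observed.
    print(expected_information(3, 1))
    print(expected_information(0, 2))

    # Large s: outputs collapse toward the three values +1, 0, -1.
    for o, e in [(5, 0), (5, 7), (0, 5)]:
        print(o, e, round(expected_information(o, e, s=50.0), 6))
```

With s = 1 the partial sum is the harmonic number, which differs from the natural logarithm of n by a nearly constant offset when n is large, so the difference behaves like log(o/e) for ample data while remaining finite when either count is zero. With large s every term beyond the first is negligible, so each sum is effectively 1 when its count is at least one and 0 otherwise, which gives the trinary behaviour described above.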