High-dimensional multivariate additive regression for uncovering contributing factors to healthcare expenditure
Abstract
Many studies in health services research rely on regression models with a large number of covariates or predictors. In this article, we introduce a novel methodology for estimation and model selection in high-dimensional non-parametric multivariate regression problems, applicable to many healthcare studies. We focus in particular on multi-response (multi-task) regression models. Because the dependence between the predictors and the multiple responses is complex, we exploit model selection approaches that consider various levels of grouping between and within responses. The novelty of the method lies in its ability to account simultaneously for between- and within-group sparsity in the presence of non-linear effects. We also propose a new set of algorithms that identify active and inactive predictors that are common to all responses or to a subset of responses. We apply our modeling approach to uncover factors that affect healthcare expenditure for children insured through the Medicaid benefits program. We report findings on the association between healthcare expenditure and a large number of well-cited factors for two neighboring states, Georgia and North Carolina, which have similar demographics but different Medicaid systems. We also validate our methods on a benchmark cancer data set and on simulated data examples.
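To make the notion of simultaneous between- and within-group sparsity concrete, the following is a minimal illustrative sketch rather than the algorithm proposed in this article. It assumes a sparse-group-lasso-style penalty, which the abstract does not specify, and shows the corresponding proximal step on one group of coefficients (for example, the basis coefficients of a single predictor across responses). The names `soft_threshold`, `sparse_group_prox`, `lam_within`, and `lam_between` are hypothetical.

```python
import numpy as np


def soft_threshold(v, t):
    """Element-wise soft-thresholding: shrinks each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def sparse_group_prox(v, lam_within, lam_between):
    """Proximal operator of lam_within * ||v||_1 + lam_between * ||v||_2
    for a single group of coefficients (an assumed penalty, for illustration).

    Step 1 zeroes individual weak coefficients (within-group sparsity).
    Step 2 zeroes the whole group if what remains is weak (between-group sparsity).
    """
    u = soft_threshold(np.asarray(v, dtype=float), lam_within)
    norm = np.linalg.norm(u)
    if norm <= lam_between:
        # The entire group is inactive.
        return np.zeros_like(u)
    # The group stays active; shrink its surviving entries jointly.
    return (1.0 - lam_between / norm) * u


# A strong group keeps its large coefficients but loses the small one.
strong = sparse_group_prox([3.0, 0.5, -2.0], lam_within=1.0, lam_between=1.0)

# A weak group is removed entirely.
weak = sparse_group_prox([0.5, -0.4, 0.3], lam_within=1.0, lam_between=1.0)
```

In this sketch, `strong` retains non-zero first and third entries while its second entry is exactly zero, and `weak` is identically zero. These two outcomes correspond to a predictor that is active for only some components and a predictor that is inactive altogether.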