MiniSymposium-7: Recent Advances in Large Scale Estimation and Testing

Over the last decade, driven by applications to a wide range of scientific problems in fields such as sports, entertainment and health, the traditional roles of statistical and data science algorithms have rapidly evolved as new perspectives have been introduced to address and exploit the complex, latent structural properties of modern big datasets. These applications, along with the enormity of the available data, often involve non-standard inferential attributes such as asymmetric loss functions, information pooling across disparate data sources, and intricate modeling caveats exacerbated by unobserved heterogeneity. They pose challenges not only in developing flexible algorithms but also in optimally tuning them to obtain sound theoretical properties.

The main focus of this session is on showcasing novel and rigorous data science methodologies that have recently been developed to address large-scale estimation, testing and prediction. The session will include research talks by four speakers, who will introduce and discuss new techniques that seek to answer a variety of interesting and practical questions, ranging from assessing partial association between ordinal variables and the impact of parking on routing last-mile deliveries to reproducibility in large-scale inference using nonparametric empirical Bayes methods.

Businesses are undergoing a significant transformation in the amount, variety and quality of data that they collect, and there is an urgent need to design novel data science methods that can handle the complexity and enormity of such modern datasets. This session provides a timely opportunity to get a glimpse into some of the state-of-the-art methodological research being conducted in the broad area of large-scale estimation and testing. The proposed session will be of great interest to the statistical and applied mathematics communities for its new methodological research in this area. The session will also appeal to practitioners in these areas and should promote natural collaborations between academia and industry.

Organizer:

Trambak Banerjee, University of Kansas

Saturday, October 2, 2021 at 2:40 – 4:00 pm (CST)

Ben Sherwood

University of Kansas

2:40 pm

Penalized Quantile Regression

Quantile regression directly models a conditional quantile. Penalized quantile regression constrains the regression coefficients, similar to penalized mean regression. Quantile regression with a lasso penalty can be reframed as a quantile regression problem with augmented data and can therefore be formulated as a linear programming problem. If a group lasso penalty is used, it becomes a second-order cone programming problem. These approaches become computationally burdensome for large values of n or p. Using a Huber approximation to the quantile loss function allows the use of computationally efficient algorithms that require a differentiable loss function and can be implemented for both penalties. These algorithms can then be used as the backbone for implementing penalized quantile regression with other penalties such as adaptive lasso, SCAD, MCP and group versions of these penalties.
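
As a rough illustration of the idea in this abstract, the sketch below smooths the quantile (check) loss with a Huber approximation and fits a lasso-penalized model by proximal gradient descent. It is not the speaker's implementation: the function names, the smoothing parameter gamma, the step-size rule and the choice of proximal gradient are all assumptions made for this example. It relies on writing the check loss as 0.5|r| + (tau - 0.5)r and replacing |r| with the Huber function, which is quadratic on [-gamma, gamma].

```python
import numpy as np


def huber_quantile_grad(r, tau, gamma):
    """Derivative in r of the smoothed check loss 0.5*huber_gamma(r) + (tau - 0.5)*r."""
    return 0.5 * np.clip(r / gamma, -1.0, 1.0) + (tau - 0.5)


def soft_threshold(z, t):
    """Proximal operator of t * |z| (elementwise), the lasso shrinkage step."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def huber_lasso_qr(X, y, tau=0.5, lam=0.1, gamma=0.1, n_iter=2000):
    """Hypothetical helper: lasso-penalized, Huber-smoothed quantile regression.

    Minimizes mean(smoothed_loss(y - b0 - X @ beta)) + lam * ||beta||_1
    with proximal gradient descent; the intercept b0 is not penalized.
    """
    n, p = X.shape
    Z = np.column_stack([np.ones(n), X])  # design matrix with intercept column
    # The smoothed loss has second derivative at most 0.5 / gamma, so this
    # bounds the Lipschitz constant of the gradient of the average loss.
    lipschitz = 0.5 / gamma * np.linalg.norm(Z, 2) ** 2 / n
    step = 1.0 / lipschitz
    coef = np.zeros(p + 1)
    for _ in range(n_iter):
        r = y - Z @ coef
        grad = -Z.T @ huber_quantile_grad(r, tau, gamma) / n
        coef = coef - step * grad
        coef[1:] = soft_threshold(coef[1:], step * lam)  # shrink slopes only
    return coef[0], coef[1:]
```

Calling huber_lasso_qr(X, y, tau=0.9) on an n-by-p array X and a length-n vector y would return an intercept and a slope vector for the 0.9 conditional quantile. Smaller values of gamma track the exact check loss more closely but force a smaller step size.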

Sara Reed

University of Kansas

3:00 pm

Does Parking Matter in Routing Last-mile Deliveries?

Parking the delivery vehicle is a necessary component of traditional last-mile delivery practices but finding parking is often difficult. We explore the impact of the search time for parking on optimal routing decisions for last-mile delivery. The Capacitated Delivery Problem with Parking (CDPP) is the problem of a delivery person needing to park the vehicle in order to service customers on foot. We compare the CDPP to industry practices as well as other models in the literature to understand how including the search time for parking impacts the completion time of the delivery tour.

Shaobo Li

University of Kansas

3:20 pm

Assessing Partial Association Between Ordinal Variables: Quantification, Visualization and Hypothesis Testing

Partial association refers to the relationship between variables Y_1,…,Y_K while adjusting for a set of covariates X. To assess such an association when Y_1,…,Y_K are recorded on ordinal scales, a classical approach is to use partial correlation between the latent continuous variables. This so-called polychoric correlation is inadequate, as it requires multivariate normality and reflects only a linear association. We propose a new framework for studying ordinal-ordinal partial association by using surrogate residuals introduced by Liu and Zhang (2018). We show that, conditional on X, Y_1,…,Y_K are independent of each other if and only if their corresponding surrogate residual variables are independent. Based on this result, we develop a general measure ϕ to quantify the strength of partial association between ordinal variables. As opposed to polychoric correlation, ϕ does not require normality or models with a probit link; instead, it applies broadly to models with any link function. It can also capture a non-linear or even non-monotonic association. Moreover, the measure ϕ gives rise to a general procedure for testing the hypothesis of partial independence. Our framework also permits visualization tools, such as partial regression plots and 3-D P-P plots, to examine the association structure, which is otherwise infeasible for ordinal data. We stress that the whole set of tools (measures, p-values and graphics) is developed within a single unified framework, which allows coherent inference. The analyses of the National Election Study and Big Five Personality Traits demonstrate that our framework leads to a much fuller assessment of partial association and yields deeper insights for domain researchers.

Trambak Banerjee

University of Kansas

3:40 pm

Nonparametric Empirical Bayes Estimation on Heterogeneous Data

The simultaneous estimation of many parameters based on data collected from corresponding studies is a key research problem that has received renewed attention in the high-dimensional setting. Many practical situations involve heterogeneous data where heterogeneity is captured by a nuisance parameter. Effectively pooling information across samples while correctly accounting for heterogeneity presents a significant challenge in large-scale estimation problems. We address this issue by introducing the “Nonparametric Empirical–Bayes Structural Tweedie” (NEST) estimator, which efficiently estimates the unknown effect sizes and properly adjusts for heterogeneity via a generalized version of Tweedie’s formula. For the normal means problem, NEST simultaneously handles the two main selection biases introduced by heterogeneity: the selection bias in the mean and the selection bias in the variance, where the former cannot be effectively corrected without also correcting the latter. Our theoretical results show that NEST has strong asymptotic properties without requiring explicit assumptions about the prior. Simulation studies show that NEST outperforms competing methods, with substantial efficiency gains in many settings. The proposed method is demonstrated on estimating the batting averages of baseball players and Sharpe ratios of mutual fund returns.
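
For background, the classical Tweedie's formula that the abstract generalizes can be sketched as follows. When x given θ is normal with mean θ and a known, common variance σ², the posterior mean is E[θ | x] = x + σ² · d/dx log f(x), where f is the marginal density of x. The example below is only this homoscedastic textbook version, not the NEST estimator: the function name, the Gaussian kernel density estimate of f and the bandwidth argument are assumptions made for illustration.

```python
import numpy as np


def tweedie_estimate(x, sigma, bandwidth):
    """Classical Tweedie correction with a Gaussian kernel estimate of the marginal.

    Returns x + sigma^2 * f'(x) / f(x), where f is a kernel density estimate
    built from the observations themselves (an illustrative choice).
    """
    diff = x[:, None] - x[None, :]               # pairwise differences x_i - x_j
    w = np.exp(-0.5 * (diff / bandwidth) ** 2)   # unnormalized Gaussian kernel weights
    f = w.sum(axis=1)                            # proportional to the density at x_i
    # Derivative of the kernel sum; shares the same normalizing constant as f,
    # so the constant cancels in the ratio f_prime / f.
    f_prime = (-(diff / bandwidth**2) * w).sum(axis=1)
    return x + sigma**2 * f_prime / f
```

With observations x drawn as θ plus standard normal noise, tweedie_estimate(x, sigma=1.0, bandwidth=0.4) shrinks each x toward regions where the data are dense. The pairwise difference matrix makes this sketch quadratic in the number of observations, so it suits small examples only.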