Statistics
Showing new listings for Friday, 22 November 2024
- [1] arXiv:2411.13567 [pdf, html, other]
Title: Why the p-norms $p{=}1$, $p{=}2$ and $p{=}\infty$ are so special? An answer based on spatial uniformity
Subjects: Statistics Theory (math.ST); Cryptography and Security (cs.CR); Numerical Analysis (math.NA)
Among all metrics based on p-norms, the Manhattan (p=1), Euclidean (p=2) and Chebyshev (p=infinity) distances are the most widely used for their interpretability, simplicity and technical convenience. But these are not the only arguments for the ubiquity of these three p-norms. This article proves that there is a volume-surface correspondence property that is unique to them. More precisely, it is shown that sampling uniformly from the volume of an n-dimensional p-ball and projecting to its surface is equivalent to directly sampling uniformly from its surface if and only if p is 1, 2 or infinity. Sampling algorithms and their implementations in Python are also provided.
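The construction described above (sample the volume, then project radially) is easy to prototype. The sketch below is only an illustration of that operation for finite p, not the paper's released Python code; it uses the standard generalized-Gaussian representation of the uniform distribution on the unit $\ell_p$ ball, and all function names are ours.

```python
import numpy as np

def sample_pball_volume(n, p, size, rng):
    """Uniform samples inside the unit l_p ball in R^n (finite p only).

    Classical construction: draw coordinates with density proportional to
    exp(-|t|^p), an independent Exp(1) variable z, and rescale by
    (sum |g_i|^p + z)^(1/p).
    """
    # |G| = W^(1/p) with W ~ Gamma(1/p, 1) has density proportional to exp(-t^p)
    g = rng.gamma(1.0 / p, 1.0, size=(size, n)) ** (1.0 / p)
    g *= rng.choice([-1.0, 1.0], size=(size, n))          # random signs
    z = rng.exponential(1.0, size=(size, 1))
    denom = ((np.abs(g) ** p).sum(axis=1, keepdims=True) + z) ** (1.0 / p)
    return g / denom

def project_to_surface(x, p):
    """Radial projection onto the boundary of the unit l_p ball."""
    if np.isinf(p):
        norms = np.abs(x).max(axis=1, keepdims=True)
    else:
        norms = (np.abs(x) ** p).sum(axis=1, keepdims=True) ** (1.0 / p)
    return x / norms

rng = np.random.default_rng(0)
x = sample_pball_volume(n=3, p=2, size=10_000, rng=rng)
y = project_to_surface(x, p=2)
print(np.allclose(np.linalg.norm(y, axis=1), 1.0))        # all points lie on the sphere
```

For p = 2 the projected points are uniform on the sphere; the paper's result is that this volume-to-surface equivalence holds only for p = 1, 2 and infinity.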
- [2] arXiv:2411.13568 [pdf, html, other]
Title: Causal wavelet analysis of ozone pollution contingencies in the Mexico City Metropolitan Area
Comments: 11 pages, 7 figures
Subjects: Applications (stat.AP)
Over the last two decades, the Mexico City Metropolitan Area (MCMA) has been plagued by high concentrations of air pollutants, putting the health of its inhabitants at risk. Although some policies have been implemented, they have been insufficient to bring down high air pollutant levels. Environmental contingencies are commonly imposed when the ozone concentration exceeds a certain threshold, which is well above the maximum recommended by the WHO. This work used a causal version of a generalized Morlet wavelet to characterize the dynamics of daily ozone concentration in the MCMA. The results indicated that the formation of dangerous ozone concentration levels is a consequence of accumulation and incomplete dissipation effects acting over a wide range of time scales. Ozone contingencies occurred when the wavelet coefficient power was increasing, which was linked to anti-persistent behavior. It was proposed that the wavelet methodology could be used as a further tool for signaling the potential formation of adverse ozone pollution scenarios.
- [3] arXiv:2411.13570 [pdf, html, other]
Title: Inconsistency and Acausality in Bayesian Inference for Physical Problems
Comments: 50 pages, 9 figures
Subjects: Methodology (stat.ME)
Bayesian inference is used to estimate continuous parameter values given measured data in many fields of science. The method relies on conditional probability densities to describe information about both data and parameters, yet the notion of conditional densities is inadmissible: probabilities of the same physical event, computed from conditional densities under different parameterizations, may be inconsistent. We show that this inconsistency, together with acausality in hierarchical methods, invalidates a variety of commonly applied Bayesian methods when applied to problems in the physical world, including trans-dimensional inference, general Bayesian dimensionality reduction methods, and hierarchical and empirical Bayes. Models in parameter spaces of different dimensionalities cannot be compared, invalidating the concept of natural parsimony, the probabilistic counterpart to Occam's Razor. Bayes' theorem itself is inadmissible, and Bayesian inference applied to parameters that characterize physical properties requires reformulation.
- [4] arXiv:2411.13601 [pdf, other]
Title: Error Analysis of Sum-Product Algorithms under Stochastic Rounding
Authors: Pablo de Oliveira Castro (LI-PaRAD, UVSQ), El-Mehdi El Arar (IRISA, UR), Eric Petit, Devan Sohier (LI-PaRAD, UVSQ)
Subjects: Computation (stat.CO); Data Structures and Algorithms (cs.DS)
The quality of numerical computations can be measured through their forward error, for which finding good error bounds is challenging in general. For several algorithms and using stochastic rounding (SR), probabilistic analysis has been shown to be an effective alternative for obtaining tight error bounds. This analysis considers the distribution of errors and evaluates the algorithm's performance on average. Using martingales and the Azuma-Hoeffding inequality, it provides error bounds that are valid with a certain probability and in $\mathcal{O}(\sqrt{n}u)$ instead of deterministic worst-case bounds in $\mathcal{O}(nu)$, where $n$ is the number of operations and $u$ is the unit roundoff. In this paper, we present a general method that automatically constructs a martingale for any computation scheme with multi-linear errors based on additions, subtractions, and multiplications. We apply this generalization to algorithms previously studied with SR, such as pairwise summation and the Horner algorithm, and prove equivalent results. We also analyze a previously unstudied algorithm, Karatsuba polynomial multiplication, which illustrates that the method can handle reused intermediate computations.
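As a concrete illustration of the stochastic rounding model discussed above, the sketch below applies SR to recursive pairwise summation on a deliberately coarse rounding grid. It is a toy simulation added for context, not the authors' error-analysis code, and the helper names and grid spacing are ours.

```python
import numpy as np

rng = np.random.default_rng(1)

def sr_round(x, ulp=2.0**-10):
    """Stochastically round scalar x to a grid of spacing ulp:
    round up with probability equal to the distance to the lower grid point."""
    lo = np.floor(x / ulp) * ulp
    p_up = (x - lo) / ulp
    return (lo + ulp) if rng.random() < p_up else lo

def pairwise_sum_sr(x, ulp=2.0**-10):
    """Recursive pairwise summation with every addition stochastically rounded."""
    x = np.asarray(x, dtype=float)
    if x.size == 1:
        return x[0]
    mid = x.size // 2
    return sr_round(pairwise_sum_sr(x[:mid], ulp) + pairwise_sum_sr(x[mid:], ulp), ulp)

x = rng.random(2**12)
exact = x.sum()
errors = [abs(pairwise_sum_sr(x) - exact) for _ in range(100)]
# Under SR the rounding errors partially cancel, so typical errors grow
# like sqrt(n)*ulp rather than the deterministic worst case n*ulp.
print("mean |error|:", np.mean(errors))
```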
- [5] arXiv:2411.13608 [pdf, html, other]
Title: Integrating Dynamic Correlation Shifts and Weighted Benchmarking in Extreme Value Analysis
Comments: 33 pages, 8 figures
Subjects: Applications (stat.AP); Artificial Intelligence (cs.AI)
This paper presents an innovative approach to Extreme Value Analysis (EVA) by introducing the Extreme Value Dynamic Benchmarking Method (EVDBM). EVDBM integrates extreme value theory to detect extreme events and is coupled with the novel Dynamic Identification of Significant Correlation (DISC)-Thresholding algorithm, which enhances the analysis of key variables under extreme conditions. By integrating return values predicted through EVA into the benchmarking scores, we are able to transform these scores to reflect anticipated conditions more accurately. This provides a more precise picture of how each case is projected to unfold under extreme conditions. As a result, the adjusted scores offer a forward-looking perspective, highlighting potential vulnerabilities and resilience factors for each case in a way that static historical data alone cannot capture. By incorporating both historical and probabilistic elements, the EVDBM algorithm provides a comprehensive benchmarking framework that is adaptable to a range of scenarios and contexts. The methodology is applied to real PV data, revealing critical low-production scenarios and significant correlations between variables, which aid in risk management, infrastructure design, and long-term planning, while also allowing for the comparison of different production plants. The flexibility of EVDBM suggests its potential for broader applications in other sectors where decision-making sensitivity is crucial, offering valuable insights to improve outcomes.
- [6] arXiv:2411.13663 [pdf, html, other]
Title: Accounting carbon emissions from electricity generation: a review and comparison of emission factor-based methods
Comments: Main text: 29 pages, 7 figures, 5 tables
Subjects: Applications (stat.AP)
Accurate estimation of greenhouse gas (GHG) emissions is essential to meet carbon neutrality targets, particularly through the calculation of direct CO2 emissions from electricity generation. This work reviews and compares emission factor-based methods for accounting direct carbon emissions from electricity generation. The emission factor approach is the most commonly used worldwide. Empirical comparisons are based on emission factors computed using data from the Italian electricity market. The analyses reveal significant differences in the CO2 estimates across the different methods. This, in turn, highlights the need to select an appropriate method to obtain reliable emission estimates, which could support effective regulatory compliance and informed policy-making. Regarding, in particular, the market zones of the Italian electricity market, the results underscore the importance of tailoring emission factors to accurately capture regional fuel variations.
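For readers unfamiliar with it, the emission factor approach amounts to a generation-weighted sum of fuel-specific factors, E = sum over fuels of generation_f x EF_f. The sketch below uses made-up generation and factor values purely to show the arithmetic; it does not reproduce the Italian market data or any of the specific factor-estimation methods compared in the paper.

```python
# Direct CO2 emissions via the emission-factor approach:
#   E = sum_f generation_f (MWh) * EF_f (tCO2 / MWh)
# The generation mix and factors below are illustrative placeholders only.
generation_mwh = {"gas": 120_000.0, "coal": 30_000.0, "hydro": 50_000.0, "solar": 20_000.0}
emission_factor = {"gas": 0.37, "coal": 0.90, "hydro": 0.0, "solar": 0.0}  # tCO2 per MWh

total_tco2 = sum(generation_mwh[f] * emission_factor[f] for f in generation_mwh)
grid_average_factor = total_tco2 / sum(generation_mwh.values())  # tCO2 per MWh generated

print(f"total direct emissions: {total_tco2:,.0f} tCO2")
print(f"grid-average emission factor: {grid_average_factor:.3f} tCO2/MWh")
```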
- [7] arXiv:2411.13692 [pdf, html, other]
Title: Randomized Basket Trial with an Interim Analysis (RaBIt) and Applications in Mental Health
Comments: 23 pages, 3 figures
Subjects: Methodology (stat.ME)
Basket trials can efficiently evaluate a single treatment across multiple diseases with a common shared target. Prior methods for randomized basket trials required baskets to have the same sample and effect sizes. To address this limitation, we developed a general randomized basket trial with an interim analysis (RaBIt) that allows for unequal sample sizes and effect sizes per basket. RaBIt is characterized by pruning at an interim stage and then analyzing a pooling of the remaining baskets. We derived the analytical power and type I error for the design. We first show that our results are consistent with the prior methods when the sample and effect sizes are the same across baskets. As we adjust the sample allocation between baskets, our threshold for the final test statistic becomes more stringent in order to maintain the same overall type I error. Finally, we observe that if we fix a sample size for the baskets proportional to their accrual rate, then at the cost of an almost negligible amount of power, the trial overall is expected to take substantially less time than the non-generalized version.
- [8] arXiv:2411.13696 [pdf, html, other]
Title: The Fast and the Furious: Tracking the Effect of the Tomoa Skip on Speed Climbing
Comments: 15 pages, 6 figures. Accepted at Chance Magazine
Subjects: Applications (stat.AP)
Sport climbing is an athletic discipline comprised of three sub-disciplines -- lead climbing, bouldering, and speed climbing. These three sub-disciplines have distinct goals, resulting in specialization of athletes into one of the three events. The year 2020 marked the first inclusion of sport climbing in the Olympic Games. While this decision was met with excitement from the climbing community, it was not without controversy. The International Olympic Committee had allocated one set of medals for the entire sport, necessitating the combination of sub-disciplines into one competition. As a result, athletes who specialized in lead and bouldering were forced to train and compete in speed for the first time in their careers. One such athlete was Tomoa Narasaki, a World Champion boulderer, who introduced a new method of approaching the speed event. This approach, deemed the Tomoa Skip (TS), was subsequently adopted by many of the top speed climbers. Concurrently, speed records fell rapidly (from 5.48s in 2017 to 4.90s in 2023). Speed climbing involves ascending a 15m wall containing the same pattern of obstacles. Thus, records can be compared across time. In this paper we investigate the effect of the TS on speed climbing by answering two questions: (1) Did the TS result in a decrease in speed times? and (2) Do climbers who utilize the TS show less consistency? The success of the TS highlights the potential of collaboration between different disciplines of sport, showing athletes of diverse backgrounds may contribute to the evolution of competition.
- [9] arXiv:2411.13748 [pdf, html, other]
Title: An Economical Approach to Design Posterior Analyses
Comments: arXiv admin note: text overlap with arXiv:2312.10814
Subjects: Methodology (stat.ME)
To design Bayesian studies, criteria for the operating characteristics of posterior analyses - such as power and the type I error rate - are often assessed by estimating sampling distributions of posterior probabilities via simulation. In this paper, we propose an economical method to determine optimal sample sizes and decision criteria for such studies. Using our theoretical results that model posterior probabilities as a function of the sample size, we assess operating characteristics throughout the sample size space given simulations conducted at only two sample sizes. These theoretical results are used to construct bootstrap confidence intervals for the optimal sample sizes and decision criteria that reflect the stochastic nature of simulation-based design. We also repurpose the simulations conducted in our approach to efficiently investigate various sample sizes and decision criteria using contour plots. The broad applicability and wide impact of our methodology is illustrated using two clinical examples.
- [10] arXiv:2411.13763 [pdf, html, other]
Title: Active Subsampling for Measurement-Constrained M-Estimation of Individualized Thresholds with High-Dimensional Data
Subjects: Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
In measurement-constrained problems, despite the availability of a large dataset, we may only be able to afford to observe labels for a small portion of it. This raises a critical question: which data points are most beneficial to label given a budget constraint? In this paper, we focus on estimating the optimal individualized threshold in a measurement-constrained M-estimation framework. Our goal is to estimate a high-dimensional parameter $\theta$ in a linear threshold $\theta^T Z$ for a continuous variable $X$ such that the discrepancy between whether $X$ exceeds the threshold $\theta^T Z$ and a binary outcome $Y$ is minimized. We propose a novel $K$-step active subsampling algorithm to estimate $\theta$, which iteratively samples the most informative observations and solves a regularized M-estimator. The theoretical properties of our estimator demonstrate a phase transition phenomenon with respect to $\beta\geq 1$, the smoothness of the conditional density of $X$ given $Y$ and $Z$. For $\beta>(1+\sqrt{3})/2$, we show that the two-step algorithm yields an estimator with the parametric convergence rate $O_p((s \log d /N)^{1/2})$ in $l_2$ norm. The rate of our estimator is strictly faster than the minimax optimal rate with $N$ i.i.d. samples drawn from the population. For the other two scenarios, $1<\beta\leq (1+\sqrt{3})/2$ and $\beta=1$, the estimator from the two-step algorithm is sub-optimal. The former requires running $K>2$ steps to attain the same parametric rate, whereas in the latter case only a near-parametric rate can be obtained. Furthermore, we formulate a minimax framework for the measurement-constrained M-estimation problem and prove that our estimator is minimax rate optimal up to a logarithmic factor. Finally, we demonstrate the performance of our method in simulation studies and apply the method to analyze a large diabetes dataset.
- [11] arXiv:2411.13764 [pdf, html, other]
Title: Selective inference is easier with p-values
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
Selective inference is a subfield of statistics that enables valid inference after selection of a data-dependent question. In this paper, we introduce selectively dominant p-values, a class of p-values that allow practitioners to easily perform inference after arbitrary selection procedures. Unlike a traditional p-value, whose distribution must stochastically dominate the uniform distribution under the null, a selectively dominant p-value must have a post-selection distribution that stochastically dominates that of a uniform having undergone the same selection process; moreover, this property must hold simultaneously for all possible selection processes. Despite the strength of this condition, we show that all commonly used p-values (e.g., p-values from two-sided testing in parametric families, one-sided testing in monotone likelihood ratio and exponential families, $F$-tests for linear regression, and permutation tests) are selectively dominant. By recasting two canonical selective inference problems (inference on winners and rank verification) in our selective dominance framework, we provide simpler derivations, a deeper conceptual understanding, and new generalizations and variations of these methods. Additionally, we use our insights to introduce selective variants of methods that combine p-values, such as Fisher's combination test.
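To see why selection must be accounted for in the "inference on winners" problem mentioned above, the following toy simulation (our addition, not the authors' code) selects the largest of K sample means under a global null and shows that the unadjusted one-sided p-value is far from valid.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
K, n, reps = 10, 25, 20_000

naive_p = np.empty(reps)
for r in range(reps):
    means = rng.normal(0.0, 1.0, size=(K, n)).mean(axis=1)  # all true means are 0
    winner = means.max()                                     # data-dependent selection
    naive_p[r] = 1.0 - norm.cdf(winner * np.sqrt(n))         # unadjusted one-sided p-value

# A valid p-value satisfies P(p <= 0.05) <= 0.05 under the null;
# selecting the winner inflates this probability dramatically (to about 0.4 here).
print("P(naive p <= 0.05) after selecting the winner:", (naive_p <= 0.05).mean())
```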
- [12] arXiv:2411.13822 [pdf, html, other]
Title: High-Dimensional Extreme Quantile Regression
Subjects: Methodology (stat.ME)
The estimation of conditional quantiles at extreme tails is of great interest in numerous applications. Various methods that integrate regression analysis with an extrapolation strategy derived from extreme value theory have been proposed to estimate extreme conditional quantiles in scenarios with a fixed number of covariates. However, these methods prove ineffective in high-dimensional settings, where the number of covariates increases with the sample size. In this article, we develop new estimation methods tailored for extreme conditional quantiles with high-dimensional covariates. We establish the asymptotic properties of the proposed estimators and demonstrate their superior performance through simulation studies, particularly in scenarios of growing dimension and high dimension where existing methods may fail. Furthermore, the analysis of auto insurance data validates the efficacy of our methods in estimating extreme conditional insurance claims and selecting important variables.
- [13] arXiv:2411.13829 [pdf, html, other]
Title: Sensitivity analysis methods for outcome missingness using substantive-model-compatible multiple imputation and their application in causal inference
Authors: Jiaxin Zhang, S. Ghazaleh Dashti, John B. Carlin, Katherine J. Lee, Jonathan W. Bartlett, Margarita Moreno-Betancur
Subjects: Methodology (stat.ME)
When using multiple imputation (MI) for missing data, maintaining compatibility between the imputation model and substantive analysis is important for avoiding bias. For example, some causal inference methods incorporate an outcome model with exposure-confounder interactions that must be reflected in the imputation model. Two approaches for compatible imputation with multivariable missingness have been proposed: Substantive-Model-Compatible Fully Conditional Specification (SMCFCS) and a stacked-imputation-based approach (SMC-stack). If the imputation model is correctly specified, both approaches are guaranteed to be unbiased under the "missing at random" assumption. However, this assumption is violated when the outcome causes its own missingness, which is common in practice. In such settings, sensitivity analyses are needed to assess the impact of alternative assumptions on results. An appealing solution for sensitivity analysis is delta-adjustment using MI, specifically "not-at-random" fully conditional specification (NARFCS). However, the issue of imputation model compatibility has not been considered in sensitivity analysis, with a naive implementation of NARFCS being susceptible to bias. To address this gap, we propose two approaches for compatible sensitivity analysis when the outcome causes its own missingness. The proposed approaches, NAR-SMCFCS and NAR-SMC-stack, extend SMCFCS and SMC-stack, respectively, with delta-adjustment for the outcome. We evaluate these approaches using a simulation study that is motivated by a case study, to which the methods were also applied. The simulation results confirmed that a naive implementation of NARFCS produced bias in effect estimates, while NAR-SMCFCS and NAR-SMC-stack were approximately unbiased. The proposed compatible approaches provide promising avenues for conducting sensitivity analysis to missingness assumptions in causal inference.
- [14] arXiv:2411.13922 [pdf, html, other]
Title: Exponentially Consistent Nonparametric Clustering of Data Streams
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
In this paper, we consider nonparametric clustering of $M$ independent and identically distributed (i.i.d.) data streams generated from unknown distributions. The distributions of the $M$ data streams belong to $K$ underlying distribution clusters. Existing results on exponentially consistent nonparametric clustering algorithms, like single linkage-based (SLINK) clustering and $k$-medoids distribution clustering, assume that the maximum intra-cluster distance ($d_L$) is smaller than the minimum inter-cluster distance ($d_H$). First, in the fixed sample size (FSS) setting, we show that exponential consistency can be achieved for SLINK clustering under a less strict assumption, $d_I < d_H$, where $d_I$ is the maximum distance between any two sub-clusters of a cluster that partition the cluster. Note that $d_I < d_L$ in general. Our results show that SLINK is exponentially consistent for a larger class of problems than $k$-medoids distribution clustering. We also identify examples where $k$-medoids clustering is unable to find the true clusters, but SLINK is exponentially consistent. Then, we propose a sequential clustering algorithm, named SLINK-SEQ, based on SLINK and prove that it is also exponentially consistent. Simulation results show that the SLINK-SEQ algorithm requires a smaller expected number of samples than the FSS SLINK algorithm for the same probability of error.
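A minimal fixed-sample-size illustration of the setting: reduce each stream to its empirical distribution, compare streams with a distributional distance (here the Kolmogorov-Smirnov statistic, chosen by us as a stand-in), and run single-linkage (SLINK-style) clustering with SciPy. This is not the authors' SLINK-SEQ implementation, just a sketch of the FSS baseline idea.

```python
import numpy as np
from scipy.stats import ks_2samp
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)

# M = 6 i.i.d. data streams drawn from K = 2 underlying distributions
streams = [rng.normal(0.0, 1.0, 500) for _ in range(3)] + \
          [rng.normal(2.0, 1.0, 500) for _ in range(3)]

M = len(streams)
D = np.zeros((M, M))
for i in range(M):
    for j in range(i + 1, M):
        # KS statistic as a distance between empirical distributions
        D[i, j] = D[j, i] = ks_2samp(streams[i], streams[j]).statistic

Z = linkage(squareform(D), method="single")      # single-linkage (SLINK) clustering
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into K = 2 clusters
print(labels)                                    # streams from the same distribution share a label
```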
- [15] arXiv:2411.13939 [pdf, html, other]
Title: Filtering and Statistical Properties of Unimodal Maps Perturbed by Heteroscedastic Noises
Subjects: Statistics Theory (math.ST); Dynamical Systems (math.DS); Probability (math.PR)
We propose a theory of unimodal maps perturbed by a heteroscedastic Markov chain noise and experiencing another heteroscedastic noise due to uncertain observation. We address the filtering problem, showing that by collecting more and more observations, one would predict the same distribution for the state of the underlying Markov chain regardless of one's initial guess. Moreover, we give other limit theorems, emphasizing in particular concentration inequalities and extreme value and Poisson distributions. Our results apply to a family of maps arising from a model of systemic risk in finance.
- [16] arXiv:2411.13974 [pdf, other]
Title: Distributional regression: CRPS-error bounds for model fitting, model selection and convex aggregation
Authors: Clément Dombry (LMB), Ahmed Zaoui (LMB)
Journal-ref: 38th Conference on Neural Information Processing Systems (NeurIPS 2024), Dec 2024, Vancouver, Canada
Subjects: Statistics Theory (math.ST)
Distributional regression aims at estimating the conditional distribution of a target variable given explanatory co-variates. It is a crucial tool for forecasting when a precise uncertainty quantification is required. A popular methodology consists in fitting a parametric model via empirical risk minimization where the risk is measured by the Continuous Rank Probability Score (CRPS). For independent and identically distributed observations, we provide a concentration result for the estimation error and an upper bound for its expectation. Furthermore, we consider model selection performed by minimization of the validation error and provide a concentration bound for the regret. A similar result is proved for convex aggregation of models. Finally, we show that our results may be applied to various models such as Ensemble Model Output Statistics (EMOS), distributional regression networks, distributional nearest neighbors or distributional random forests and we illustrate our findings on two data sets (QSAR aquatic toxicity and Airfoil self-noise).
- [17] arXiv:2411.13995 [pdf, html, other]
Title: Modelling and prediction of the wildfire data using fractional Poisson process
Comments: 17 pages, 7 figures
Subjects: Applications (stat.AP)
Modelling wildfire events has been studied in the literature using the Poisson process, which essentially assumes the independence of wildfire events. In this paper, we use the fractional Poisson process to model the wildfire occurrences in California between June 2019 and April 2023 and to predict wildfire events in a way that accounts for the underlying memory between these events. We introduce method-of-moments and maximum likelihood estimation approaches for the parameters of the fractional Poisson process, as an alternative to the method proposed by Cahoy (2010). We obtain an estimate of 0.8 for the fractional parameter, indicating that the wildfire events are dependent. The proposed model reduced the prediction error by 90\% compared to the classical Poisson process model.
- [18] arXiv:2411.14005 [pdf, html, other]
Title: The Double Emulator
Comments: 32 pages, 12 figures
Subjects: Computation (stat.CO)
Computer models (simulators) are vital tools for investigating physical processes. Despite their utility, the prohibitive run-time of simulators hinders their direct application for uncertainty quantification. Gaussian process emulators (GPEs) have been used extensively to circumvent the cost of the simulator and are known to perform well on simulators with smooth, stationary output. In reality, many simulators violate these assumptions. Motivated by a finite element simulator which models early stage corrosion of uranium in water vapor, we propose an adaptation of the GPE, called the double emulator, specifically for simulators which 'ground' in a considerable volume of their input space. Grounding is the process by which a simulator attains its minimum and can result in violation of the stationarity and smoothness assumptions used in the conventional GPE. We perform numerical experiments comparing the performance of the GPE and double emulator on both the corrosion simulator and synthetic examples.
- [19] arXiv:2411.14010 [pdf, html, other]
Title: Spectral domain likelihoods for Bayesian inference in time-varying parameter models
Subjects: Methodology (stat.ME)
Inference for locally stationary processes is often based on some local Whittle-type approximation of the likelihood function defined in the frequency domain. The main reasons for using such a likelihood approximation are that i) it has substantially lower computational cost and better scalability to long time series compared to the time domain likelihood, particularly when used for Bayesian inference via Markov Chain Monte Carlo (MCMC), ii) it is convenient when the model itself is specified in the frequency domain, and iii) it provides access to bootstrap and subsampling MCMC, which exploit the asymptotic independence of Fourier transformed data. Most of the existing literature compares the asymptotic performance of the maximum likelihood estimator (MLE) from such frequency domain likelihood approximation with the exact time domain MLE. Our article uses three simulation studies to assess the finite-sample accuracy of several frequency domain likelihood functions when used to approximate the posterior distribution in time-varying parameter models. The methods are illustrated on an application to egg price data.
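For reference, the basic Whittle approximation underlying these spectral-domain likelihoods can be written down directly from the periodogram. The sketch below does this for a stationary AR(1) model; the locally stationary, time-varying-parameter versions studied in the paper generalize this form, and the example is ours rather than the authors'.

```python
import numpy as np

def whittle_loglik(x, phi, sigma2):
    """Whittle log-likelihood of an AR(1) model with coefficient phi and
    innovation variance sigma2, evaluated at the positive Fourier frequencies."""
    n = len(x)
    x = np.asarray(x, dtype=float) - np.mean(x)
    k = np.arange(1, (n - 1) // 2 + 1)
    omega = 2.0 * np.pi * k / n
    periodogram = np.abs(np.fft.fft(x)[k]) ** 2 / (2.0 * np.pi * n)
    # AR(1) spectral density: sigma2 / (2*pi*|1 - phi*exp(-i*omega)|^2)
    spec = sigma2 / (2.0 * np.pi * np.abs(1.0 - phi * np.exp(-1j * omega)) ** 2)
    return -np.sum(np.log(spec) + periodogram / spec)

# quick check on simulated AR(1) data with phi = 0.6, unit innovation variance
rng = np.random.default_rng(4)
n, phi_true = 2000, 0.6
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi_true * x[t - 1] + rng.normal()

grid = np.linspace(-0.9, 0.9, 181)
phi_hat = grid[np.argmax([whittle_loglik(x, p, 1.0) for p in grid])]
print("Whittle estimate of phi:", round(phi_hat, 2))
```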
- [20] arXiv:2411.14016 [pdf, html, other]
Title: Robust Mutual Fund Selection with False Discovery Rate Control
Subjects: Methodology (stat.ME)
In this article, we address the challenge of identifying skilled mutual funds among a large pool of candidates, utilizing the linear factor pricing model. Assuming observable factors with a weak correlation structure for the idiosyncratic error, we propose a spatial-sign based multiple testing procedure (SS-BH). When latent factors are present, we first extract them using the elliptical principal component method (He et al. 2022) and then propose a factor-adjusted spatial-sign based multiple testing procedure (FSS-BH). Simulation studies demonstrate that our proposed FSS-BH procedure performs exceptionally well across various applications and exhibits robustness to variations in the covariance structure and the distribution of the error term. Additionally, real data application further highlights the superiority of the FSS-BH procedure.
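For context, the Benjamini-Hochberg (BH) step-up rule on which the SS-BH and FSS-BH procedures are built can be coded in a few lines. The sketch applies plain BH to a generic vector of fund-level p-values on synthetic data; the spatial-sign statistics and factor adjustment that are the paper's contribution are not reproduced here.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.10):
    """Return a boolean mask of rejections for the BH step-up procedure at FDR level q."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    thresholds = q * np.arange(1, m + 1) / m         # BH critical values q*k/m
    below = pvals[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k_max = np.max(np.where(below)[0])           # largest k with p_(k) <= q*k/m
        reject[order[: k_max + 1]] = True            # reject all hypotheses up to that rank
    return reject

# toy example: 95 null funds and 5 skilled funds with very small p-values
rng = np.random.default_rng(5)
pvals = np.concatenate([rng.uniform(size=95), rng.uniform(0, 1e-4, size=5)])
print("funds flagged as skilled:", np.where(benjamini_hochberg(pvals, q=0.10))[0])
```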
- [21] arXiv:2411.14080 [pdf, html, other]
Title: Tensors in algebraic statistics
Comments: Overview paper
Subjects: Statistics Theory (math.ST); Algebraic Geometry (math.AG); History and Overview (math.HO)
Tensors are ubiquitous in statistics and data analysis. The central object that links data science to tensor theory and algebra is that of a model with latent variables. We provide an overview of tensor theory, with a particular emphasis on its applications in algebraic statistics. This high-level treatment is supported by numerous examples to illustrate key concepts. Additionally, an extensive literature review is included to guide readers toward more detailed studies on the subject.
- [22] arXiv:2411.14185 [pdf, html, other]
Title: A note on numerical evaluation of conditional Akaike information for nonlinear mixed-effects models
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
We propose two methods to evaluate the conditional Akaike information (cAI) for nonlinear mixed-effects models with no restriction on cluster size. Method 1 is designed for continuous data and includes formulae for the derivatives of fixed and random effects estimators with respect to observations. Method 2, compatible with any type of observation, requires modeling the marginal (or prior) distribution of random effects as a multivariate normal distribution. Simulations show that Method 1 performs well with Gaussian data but struggles with skewed continuous distributions, whereas Method 2 consistently performs well across various distributions, including normal, gamma, negative binomial, and Tweedie, with flexible link functions. Based on our findings, we recommend Method 2 as a distributionally robust cAI criterion for model selection in nonlinear mixed-effects models.
- [23] arXiv:2411.14265 [pdf, html, other]
Title: A Bayesian mixture model for Poisson network autoregression
Subjects: Methodology (stat.ME)
In this paper, we propose a new Bayesian Poisson network autoregression mixture model (PNARM). Our model combines ideas from the models of Dahl 2008, Ren et al. 2024 and Armillotta and Fokianos 2024, as it is motivated by the following aims. We consider the problem of modelling multivariate count time series, which arise in many real-world data sets but have been studied less than their Gaussian-distributed counterparts (Fokianos 2024). Additionally, we assume that the time series occur on the nodes of a known underlying network where the edges dictate the form of the structural vector autoregression model, as a means of imposing sparsity. A further aim is to accommodate heterogeneous node dynamics, and to develop a probabilistic model for clustering nodes that exhibit similar behaviour. We develop an MCMC algorithm for sampling from the model's posterior distribution. The model is applied to a data set of COVID-19 cases in the counties of the Republic of Ireland.
- [24] arXiv:2411.14285 [pdf, html, other]
Title: Stochastic interventions, sensitivity analysis, and optimal transport
Comments: 37 pages, 1 figure
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
Recent methodological research in causal inference has focused on effects of stochastic interventions, which assign treatment randomly, often according to subject-specific covariates. In this work, we demonstrate that the usual notion of stochastic interventions has a surprising property: when there is unmeasured confounding, bounds on their effects do not collapse when the policy approaches the observational regime. As an alternative, we propose to study generalized policies, treatment rules that can depend on covariates, the natural value of treatment, and auxiliary randomness. We show that certain generalized policy formulations can resolve the "non-collapsing" bound issue: bounds narrow to a point when the target treatment distribution approaches that in the observed data. Moreover, drawing connections to the theory of optimal transport, we characterize generalized policies that minimize worst-case bound width in various sensitivity analysis models, as well as corresponding sharp bounds on their causal effects. These optimal policies are new, and can have a more parsimonious interpretation compared to their usual stochastic policy analogues. Finally, we develop flexible, efficient, and robust estimators for the sharp nonparametric bounds that emerge from the framework.
- [25] arXiv:2411.14323 [pdf, other]
Title: Estimands and Their Implications for Evidence Synthesis for Oncology: A Simulation Study of Treatment Switching in Meta-Analysis
Authors: Quang Vuong, Rebecca K. Metcalfe, Antonio Remiro-Azócar, Anders Gorst-Rasmussen, Oliver Keene, Jay J. H. Park
Subjects: Methodology (stat.ME)
The ICH E9(R1) addendum provides guidelines on accounting for intercurrent events in clinical trials using the estimands framework. However, there has been limited attention on the estimands framework for meta-analysis. Using treatment switching, a well-known intercurrent event that occurs frequently in oncology, we conducted a simulation study to explore the bias introduced by pooling together estimates targeting different estimands in a meta-analysis of randomized clinical trials (RCTs) that allowed for treatment switching. We simulated overall survival data of a collection of RCTs that allowed patients in the control group to switch to the intervention treatment after disease progression under fixed-effects and random-effects models. For each RCT, we calculated effect estimates for a treatment policy estimand that ignored treatment switching, and a hypothetical estimand that accounted for treatment switching by censoring switchers at the time of switching. Then, we performed random-effects and fixed-effects meta-analyses to pool together RCT effect estimates while varying the proportions of treatment policy and hypothetical effect estimates. We compared the results of meta-analyses that pooled different types of effect estimates with those that pooled only treatment policy or hypothetical estimates. We found that pooling estimates targeting different estimands results in pooled estimators that reflect neither the treatment policy estimand nor the hypothetical estimand. This finding shows that pooling estimates of varying target estimands can generate misleading results, even under a random-effects model. Adopting the estimands framework for meta-analysis may improve alignment between meta-analytic results and the clinical research question of interest.
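The pooling step referred to above can be made concrete with standard inverse-variance meta-analysis. The sketch below implements fixed-effect and DerSimonian-Laird random-effects pooling of per-trial log hazard ratios; it is generic meta-analysis machinery with placeholder inputs, not the authors' simulation code.

```python
import numpy as np

def pool_fixed(effects, variances):
    """Inverse-variance fixed-effect pooled estimate and its variance."""
    w = 1.0 / np.asarray(variances)
    est = np.sum(w * effects) / np.sum(w)
    return est, 1.0 / np.sum(w)

def pool_random(effects, variances):
    """DerSimonian-Laird random-effects pooled estimate and its variance."""
    effects, variances = np.asarray(effects), np.asarray(variances)
    w = 1.0 / variances
    fixed, _ = pool_fixed(effects, variances)
    q = np.sum(w * (effects - fixed) ** 2)                           # Cochran's Q
    k = len(effects)
    tau2 = max(0.0, (q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    w_star = 1.0 / (variances + tau2)
    return np.sum(w_star * effects) / np.sum(w_star), 1.0 / np.sum(w_star)

# placeholder per-trial log hazard ratios and variances; a mix of estimand types
# would be pooled the same way, which is exactly the source of the bias studied above
log_hr = [-0.25, -0.10, -0.32, -0.05]
var = [0.02, 0.03, 0.025, 0.04]
print("fixed-effect:", pool_fixed(log_hr, var))
print("random-effects:", pool_random(log_hr, var))
```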
- [26] arXiv:2411.14341 [pdf, other]
Title: Logarithmic Neyman Regret for Adaptive Estimation of the Average Treatment Effect
Comments: 12 pages, 2 figures. Submitted to AISTATS 2025
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Estimation of the Average Treatment Effect (ATE) is a core problem in causal inference with strong connections to Off-Policy Evaluation in Reinforcement Learning. This paper considers the problem of adaptively selecting the treatment allocation probability in order to improve estimation of the ATE. The majority of prior work on adaptive ATE estimation focuses on asymptotic guarantees and in turn overlooks important practical considerations such as the difficulty of learning the optimal treatment allocation as well as hyper-parameter selection. Existing non-asymptotic methods are limited by poor empirical performance and exponential scaling of the Neyman regret with respect to problem parameters. In order to address these gaps, we propose and analyze the Clipped Second Moment Tracking (ClipSMT) algorithm, a variant of an existing algorithm with strong asymptotic optimality guarantees, and provide finite sample bounds on its Neyman regret. Our analysis shows that ClipSMT achieves exponential improvements in Neyman regret on two fronts: improving the dependence on $T$ from $O(\sqrt{T})$ to $O(\log T)$, as well as reducing the exponential dependence on problem parameters to a polynomial dependence. Finally, we conclude with simulations which show the marked improvement of ClipSMT over existing approaches.
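The core idea of tracking a clipped Neyman (second-moment) allocation can be sketched as follows. This is a simplified illustration under an assumed outcome model, an assumed shrinking clip schedule, and an IPW-style ATE estimate; the exact ClipSMT clipping schedule, estimator, and regret analysis are in the paper and are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(6)
T = 10_000

def potential_outcomes(size):
    """Hypothetical outcome model: Y(1) and Y(0) with different second moments."""
    return rng.normal(1.0, 2.0, size), rng.normal(0.5, 0.5, size)

s1, s0 = 1.0, 1.0      # running sums of squared observed outcomes per arm
n1, n0 = 1, 1          # initialized at 1 to avoid division by zero
ipw_terms = []

for t in range(1, T + 1):
    clip = 0.5 / np.sqrt(t)                                  # assumed shrinking clip level
    # plug-in Neyman allocation based on second-moment estimates, then clip
    pi_t = np.sqrt(s1 / n1) / (np.sqrt(s1 / n1) + np.sqrt(s0 / n0))
    pi_t = np.clip(pi_t, clip, 1.0 - clip)

    y1, y0 = potential_outcomes(1)
    if rng.random() < pi_t:                                  # randomize treatment with prob. pi_t
        s1, n1 = s1 + y1[0] ** 2, n1 + 1
        ipw_terms.append(y1[0] / pi_t)                       # Horvitz-Thompson contribution
    else:
        s0, n0 = s0 + y0[0] ** 2, n0 + 1
        ipw_terms.append(-y0[0] / (1.0 - pi_t))

print("adaptive IPW estimate of the ATE:", np.mean(ipw_terms))   # true ATE is 0.5 in this toy model
```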
- [27] arXiv:2411.14351 [pdf, html, other]
Title: Indiscriminate Disruption of Conditional Inference on Multivariate Gaussians
Comments: 30 pages, 6 figures; 4 tables
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
The multivariate Gaussian distribution underpins myriad operations-research, decision-analytic, and machine-learning models (e.g., Bayesian optimization, Gaussian influence diagrams, and variational autoencoders). However, despite recent advances in adversarial machine learning (AML), inference for Gaussian models in the presence of an adversary is notably understudied. Therefore, we consider a self-interested attacker who wishes to disrupt a decision-maker's conditional inference and subsequent actions by corrupting a set of evidentiary variables. To avoid detection, the attacker also desires the attack to appear plausible, wherein plausibility is determined by the density of the corrupted evidence. We consider white- and grey-box settings such that the attacker has complete and incomplete knowledge about the decision-maker's underlying multivariate Gaussian distribution, respectively. Select instances are shown to reduce to quadratic and stochastic quadratic programs, and structural properties are derived to inform solution methods. We assess the impact and efficacy of these attacks in three examples, including real estate evaluation, interest rate estimation and signal processing. Each example leverages an alternative underlying model, thereby highlighting the attacks' broad applicability. Through these applications, we also juxtapose the behavior of the white- and grey-box attacks to understand how uncertainty and structure affect attacker behavior.
New submissions (showing 27 of 27 entries)
- [28] arXiv:2411.13606 (cross-list from cond-mat.stat-mech) [pdf, html, other]
Title: The stabilizing role of multiplicative noise in non-confining potentials
Comments: 14 pages, 8 figures
Subjects: Statistical Mechanics (cond-mat.stat-mech); Mathematical Physics (math-ph); Statistics Theory (math.ST); Adaptation and Self-Organizing Systems (nlin.AO)
We provide a simple framework for the study of parametric (multiplicative) noise, making use of scale parameters. We show that for a large class of stochastic differential equations, increasing the multiplicative noise intensity surprisingly causes the mass of the stationary probability distribution to become increasingly concentrated around the minima of the multiplicative noise term, whilst, under quite general conditions, exhibiting intermittent burst-like jumps between these minima. If the multiplicative noise term has one zero, this causes on-off intermittency. Our framework relies on first term expansions, which become more accurate for larger noise intensities. In this work we show that the full width at half maximum, in addition to the maximum, is appropriate for quantifying the stationary probability distribution (instead of the mean and variance, which are often undefined). We define a corresponding new kind of weak-sense stationarity. We consider a double well potential as an example of application, demonstrating relevance to tipping points in noisy systems.
- [29] arXiv:2411.13653 (cross-list from cs.AI) [pdf, other]
Title: No Free Delivery Service: Epistemic limits of passive data collection in complex social systems
Comments: To appear in NeurIPS'24
Subjects: Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Rapid model validation via the train-test paradigm has been a key driver for the breathtaking progress in machine learning and AI. However, modern AI systems often depend on a combination of tasks and data collection practices that violate all assumptions ensuring test validity. Yet, without rigorous model validation we cannot ensure the intended outcomes of deployed AI systems, including positive social impact, nor continue to advance AI research in a scientifically sound way. In this paper, I show that for widely considered inference settings in complex social systems the train-test paradigm not only lacks a justification but is indeed invalid for any risk estimator, including counterfactual and causal estimators, with high probability. These formal impossibility results highlight a fundamental epistemic issue, i.e., that for key tasks in modern AI we cannot know whether models are valid under current data collection practices. Importantly, this includes variants of both recommender systems and reasoning via large language models, and neither naïve scaling nor limited benchmarks are suited to address this issue. I illustrate these results via the widely used MovieLens benchmark and conclude by discussing the implications of these results for AI in social systems, including possible remedies such as participatory data curation and open science.
- [30] arXiv:2411.13688 (cross-list from cs.LG) [pdf, other]
Title: Investigating Graph Neural Networks and Classical Feature-Extraction Techniques in Activity-Cliff and Molecular Property Prediction
Comments: Doctoral Thesis (Mathematical Institute, University of Oxford)
Subjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM); Machine Learning (stat.ML)
Molecular featurisation refers to the transformation of molecular data into numerical feature vectors. It is one of the key research areas in molecular machine learning and computational drug discovery. Recently, message-passing graph neural networks (GNNs) have emerged as a novel method to learn differentiable features directly from molecular graphs. While such techniques hold great promise, further investigations are needed to clarify if and when they indeed manage to definitively outcompete classical molecular featurisations such as extended-connectivity fingerprints (ECFPs) and physicochemical-descriptor vectors (PDVs). We systematically explore and further develop classical and graph-based molecular featurisation methods for two important tasks: molecular property prediction, in particular, quantitative structure-activity relationship (QSAR) prediction, and the largely unexplored challenge of activity-cliff (AC) prediction. We first give a technical description and critical analysis of PDVs, ECFPs and message-passing GNNs, with a focus on graph isomorphism networks (GINs). We then conduct a rigorous computational study to compare the performance of PDVs, ECFPs and GINs for QSAR and AC-prediction. Following this, we mathematically describe and computationally evaluate a novel twin neural network model for AC-prediction. We further introduce an operation called substructure pooling for the vectorisation of structural fingerprints as a natural counterpart to graph pooling in GNN architectures. We go on to propose Sort & Slice, a simple substructure-pooling technique for ECFPs that robustly outperforms hash-based folding at molecular property prediction. Finally, we outline two ideas for future research: (i) a graph-based self-supervised learning strategy to make classical molecular featurisations trainable, and (ii) trainable substructure-pooling via differentiable self-attention.
- [31] arXiv:2411.13711 (cross-list from cs.LG) [pdf, html, other]
Title: Almost Sure Convergence Rates and Concentration of Stochastic Approximation and Reinforcement Learning with Markovian Noise
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
This paper establishes the first almost sure convergence rate and the first maximal concentration bound with exponential tails for general contractive stochastic approximation algorithms with Markovian noise. As a corollary, we also obtain convergence rates in $L^p$. Key to our successes is a novel discretization of the mean ODE of stochastic approximation algorithms using intervals with diminishing (instead of constant) length. As applications, we provide the first almost sure convergence rate for $Q$-learning with Markovian samples without count-based learning rates. We also provide the first concentration bound for off-policy temporal difference learning with Markovian samples.
- [32] arXiv:2411.13733 (cross-list from cs.LG) [pdf, html, other]
Title: On Generalization Bounds for Neural Networks with Low Rank Layers
Comments: Published in the MIT DSpace repository: this https URL
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
While previous optimization results have suggested that deep neural networks tend to favour low-rank weight matrices, the implications of this inductive bias on generalization bounds remain underexplored. In this paper, we apply Maurer's chain rule for Gaussian complexity to analyze how low-rank layers in deep networks can prevent the accumulation of rank and dimensionality factors that typically multiply across layers. This approach yields generalization bounds for rank and spectral norm constrained networks. We compare our results to prior generalization bounds for deep networks, highlighting how deep networks with low-rank layers can achieve better generalization than those with full-rank layers. Additionally, we discuss how this framework provides new perspectives on the generalization capabilities of deep networks exhibiting neural collapse.
- [33] arXiv:2411.13821 (cross-list from cs.LG) [pdf, html, other]
Title: Heterophilic Graph Neural Networks Optimization with Causal Message-passing
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
In this work, we discover that causal inference provides a promising approach to capture heterophilic message-passing in Graph Neural Network (GNN). By leveraging cause-effect analysis, we can discern heterophilic edges based on asymmetric node dependency. The learned causal structure offers more accurate relationships among nodes. To reduce the computational complexity, we introduce intervention-based causal inference in graph learning. We first simplify causal analysis on graphs by formulating it as a structural learning model and define the optimization problem within the Bayesian scheme. We then present an analysis of decomposing the optimization target into a consistency penalty and a structure modification based on cause-effect relations. We then estimate this target by conditional entropy and present insights into how conditional entropy quantifies the heterophily. Accordingly, we propose CausalMP, a causal message-passing discovery network for heterophilic graph learning, that iteratively learns the explicit causal structure of input graphs. We conduct extensive experiments in both heterophilic and homophilic graph settings. The results demonstrate that our model achieves superior link prediction performance. Training on causal structure can also enhance node representation in classification tasks across different base models.
- [34] arXiv:2411.13868 (cross-list from cs.LG) [pdf, html, other]
Title: Robust Detection of Watermarks for Large Language Models Under Human Edits
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
Watermarking has offered an effective approach to distinguishing text generated by large language models (LLMs) from human-written text. However, the pervasive presence of human edits on LLM-generated text dilutes watermark signals, thereby significantly degrading detection performance of existing methods. In this paper, by modeling human edits through mixture model detection, we introduce a new method in the form of a truncated goodness-of-fit test for detecting watermarked text under human edits, which we refer to as Tr-GoF. We prove that the Tr-GoF test achieves optimality in robust detection of the Gumbel-max watermark in a certain asymptotic regime of substantial text modifications and vanishing watermark signals. Importantly, Tr-GoF achieves this optimality \textit{adaptively} as it does not require precise knowledge of human edit levels or probabilistic specifications of the LLMs, in contrast to the optimal but impractical (Neyman--Pearson) likelihood ratio test. Moreover, we establish that the Tr-GoF test attains the highest detection efficiency rate in a certain regime of moderate text modifications. In stark contrast, we show that sum-based detection rules, as employed by existing methods, fail to achieve optimal robustness in both regimes because the additive nature of their statistics is less resilient to edit-induced noise. Finally, we demonstrate the competitive and sometimes superior empirical performance of the Tr-GoF test on both synthetic data and open-source LLMs in the OPT and LLaMA families.
- [35] arXiv:2411.13887 (cross-list from math.AT) [pdf, html, other]
Title: A cohomology-based Gromov-Hausdorff metric approach for quantifying molecular similarity
Comments: 14 pages, 3 figures
Subjects: Algebraic Topology (math.AT); Materials Science (cond-mat.mtrl-sci); Computational Geometry (cs.CG); Metric Geometry (math.MG); Machine Learning (stat.ML)
We introduce, for the first time, a cohomology-based Gromov-Hausdorff ultrametric method to analyze 1-dimensional and higher-dimensional (co)homology groups, focusing on loops, voids, and higher-dimensional cavity structures in simplicial complexes, to address typical clustering questions arising in molecular data analysis. The Gromov-Hausdorff distance quantifies the dissimilarity between two metric spaces. In this framework, molecules are represented as simplicial complexes, and their cohomology vector spaces are computed to capture intrinsic topological invariants encoding loop and cavity structures. These vector spaces are equipped with a suitable distance measure, enabling the computation of the Gromov-Hausdorff ultrametric to evaluate structural dissimilarities. We demonstrate the methodology using organic-inorganic halide perovskite (OIHP) structures. The results highlight the effectiveness of this approach in clustering various molecular structures. By incorporating geometric information, our method provides deeper insights compared to traditional persistent homology techniques.
- [36] arXiv:2411.14003 (cross-list from cs.LG) [pdf, html, other]
Title: Generative Intervention Models for Causal Perturbation Modeling
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We consider the problem of predicting perturbation effects via causal models. In many applications, it is a priori unknown which mechanisms of a system are modified by an external perturbation, even though the features of the perturbation are available. For example, in genomics, some properties of a drug may be known, but not their causal effects on the regulatory pathways of cells. We propose a generative intervention model (GIM) that learns to map these perturbation features to distributions over atomic interventions in a jointly-estimated causal model. Contrary to prior approaches, this enables us to predict the distribution shifts of unseen perturbation features while gaining insights about their mechanistic effects in the underlying data-generating process. On synthetic data and scRNA-seq drug perturbation data, GIMs achieve robust out-of-distribution predictions on par with unstructured approaches, while effectively inferring the underlying perturbation mechanisms, often better than other causal inference methods.
- [37] arXiv:2411.14166 (cross-list from math.OC) [pdf, other]
Title: SPARKLE: A Unified Single-Loop Primal-Dual Framework for Decentralized Bilevel Optimization
Comments: 73 pages, the Thirty-Eighth Annual Conference on Neural Information Processing Systems (2024)
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
This paper studies decentralized bilevel optimization, in which multiple agents collaborate to solve problems involving nested optimization structures with neighborhood communications. Most existing literature primarily utilizes gradient tracking to mitigate the influence of data heterogeneity, without exploring other well-known heterogeneity-correction techniques such as EXTRA or Exact Diffusion. Additionally, these studies often employ identical decentralized strategies for both upper- and lower-level problems, neglecting to leverage distinct mechanisms across different levels. To address these limitations, this paper proposes SPARKLE, a unified Single-loop Primal-dual AlgoRithm frameworK for decentraLized bilEvel optimization. SPARKLE offers the flexibility to incorporate various heterogeneity-correction strategies into the algorithm. Moreover, SPARKLE allows for different strategies to solve upper- and lower-level problems. We present a unified convergence analysis for SPARKLE, applicable to all its variants, with state-of-the-art convergence rates compared to existing decentralized bilevel algorithms. Our results further reveal that EXTRA and Exact Diffusion are more suitable for decentralized bilevel optimization, and using mixed strategies in bilevel algorithms brings more benefits than relying solely on gradient tracking.
- [38] arXiv:2411.14196 (cross-list from physics.bio-ph) [pdf, html, other]
Title: Uncertainty Quantification in Working Memory via Moment Neural Networks
Comments: Code released: this https URL
Subjects: Biological Physics (physics.bio-ph); Neural and Evolutionary Computing (cs.NE); Applications (stat.AP)
Humans possess a finely tuned sense of uncertainty that helps anticipate potential errors, vital for adaptive behavior and survival. However, the underlying neural mechanisms remain unclear. This study applies moment neural networks (MNNs) to explore the neural mechanism of uncertainty quantification in working memory (WM). The MNN captures nonlinear coupling of the first two moments in spiking neural networks (SNNs), identifying firing covariance as a key indicator of uncertainty in encoded information. Trained on a WM task, the model demonstrates coding precision and uncertainty quantification comparable to human performance. Analysis reveals a link between the probabilistic and sampling-based coding for uncertainty representation. Transferring the MNN's weights to an SNN replicates these results. Furthermore, the study provides testable predictions demonstrating how noise and heterogeneity enhance WM performance, highlighting their beneficial role rather than being mere biological byproducts. These findings offer insights into how the brain effectively manages uncertainty with exceptional accuracy.
- [39] arXiv:2411.14288 (cross-list from cs.LG) [pdf, html, other]
Title: On the Sample Complexity of One Hidden Layer Networks with Equivariance, Locality and Weight Sharing
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Weight sharing, equivariance, and local filters, as in convolutional neural networks, are believed to contribute to the sample efficiency of neural networks. However, it is not clear how each one of these design choices contributes to the generalization error. Through the lens of statistical learning theory, we aim to provide an insight into this question by characterizing the relative impact of each choice on the sample complexity. We obtain lower and upper sample complexity bounds for a class of single hidden layer networks. It is shown that the gain of equivariance is directly manifested in the bound, while getting a similar increase for weight sharing depends on the sharing mechanism. Among our results, we obtain a completely dimension-free bound for equivariant networks for a class of pooling operations. We show that the bound depends merely on the norm of filters, which is tighter than using the spectral norm of the respective matrix. We also characterize the trade-off in sample complexity between the parametrization of filters in spatial and frequency domains, particularly when spatial filters are localized as in vanilla convolutional neural networks.
- [40] arXiv:2411.14305 (cross-list from cs.DS) [pdf, html, other]
Title: Outlier-robust Mean Estimation near the Breakdown Point via Sum-of-Squares
Comments: Accepted at SODA 2025, 47 pages
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
We revisit the problem of estimating the mean of a high-dimensional distribution in the presence of an $\varepsilon$-fraction of adversarial outliers.
When $\varepsilon$ is at most some sufficiently small constant, previous works can achieve optimal error rate efficiently \cite{diakonikolas2018robustly, kothari2018robust}. As $\varepsilon$ approaches the breakdown point $\frac{1}{2}$, all previous algorithms incur either sub-optimal error rates or exponential running time.
In this paper we give a new analysis of the canonical sum-of-squares program introduced in \cite{kothari2018robust} and show that this program efficiently achieves optimal error rate for all $\varepsilon \in[0,\frac{1}{2})$. The key ingredient for our results is a new identifiability proof for robust mean estimation that focuses on the overlap between the distributions instead of their statistical distance as in previous works. We capture this proof within the sum-of-squares proof system, thus obtaining efficient algorithms using the sum-of-squares proofs to algorithms paradigm \cite{raghavendra2018high}.
Cross submissions (showing 13 of 13 entries)
- [41] arXiv:2102.08591 (replaced) [pdf, html, other]
-
Title: Data-Driven Logistic Regression Ensembles With Applications in GenomicsSubjects: Methodology (stat.ME); Machine Learning (stat.ML)
Advances in data collection technologies in genomics have significantly increased the need for tools designed to study the genetic basis of many diseases. Statistical tools used to discover patterns between the expression of certain genes and the presence of diseases should ideally perform well in terms of both prediction accuracy and identification of key biomarkers. We propose a new approach for dealing with high-dimensional binary classification problems that combines ideas from regularization and ensembling. The ensembles comprise a relatively small number of highly accurate and interpretable models that are learned directly from the data by minimizing a global objective function. We derive the asymptotic properties of our method and develop an efficient algorithm to compute the ensembles. We demonstrate the good performance of our method in terms of prediction accuracy and identification of key biomarkers using several medical genomics datasets involving common diseases such as cancer, multiple sclerosis and psoriasis. In several applications our method could identify key biomarkers that were absent in state-of-the-art competitor methods. We develop a variable importance ranking tool that may guide the focus of researchers on the most promising genes. Based on numerical experiments we provide guidelines for the choice of the number of models in our ensembles.
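To fix the general idea of combining regularization with ensembling for high-dimensional classification, here is a minimal sketch: an ensemble of L1-penalized logistic regressions fit on bootstrap resamples, averaged for prediction and inspected for recurrently selected genes. This is an illustration only, not the paper's global-objective procedure; the synthetic data and all settings are assumptions.

```python
# Illustration only (not the paper's method): sparse logistic regressions on
# bootstrap resamples, averaged for prediction; selection frequency serves as
# a crude variable-importance ranking. Data and settings are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 200, 1000                        # samples x genes (synthetic stand-in)
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:5] = 1.5      # five "true" biomarkers
y = (X @ beta + rng.normal(size=n) > 0).astype(int)

models, selected = [], np.zeros(p)
for b in range(25):                     # small ensemble of sparse models
    idx = rng.integers(0, n, n)         # bootstrap resample
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    clf.fit(X[idx], y[idx])
    models.append(clf)
    selected += (clf.coef_.ravel() != 0)

prob = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
importance = selected / len(models)     # selection frequency per gene
print("in-sample accuracy:", ((prob > 0.5) == y).mean())
print("top genes by selection frequency:", np.argsort(importance)[::-1][:10])
```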
- [42] arXiv:2312.06437 (replaced) [pdf, html, other]
-
Title: Posterior Ramifications of Prior Dependence StructuresSubjects: Methodology (stat.ME)
Prior elicitation methods for Bayesian analyses transfigure prior information into quantifiable prior distributions. Recently, methods that leverage copulas have been proposed to accommodate more flexible dependence structures when eliciting multivariate priors. We show that the posterior cannot retain many of these flexible prior dependence structures in large-sample settings, and we emphasize that it is our responsibility as statisticians to communicate this to practitioners. We therefore provide an overview of objectives for prior specification that guide conversations between statisticians and practitioners to promote alignment between the flexibility in the prior dependence structure and the objectives for posterior analysis. Because correctly specifying the dependence structure a priori can be difficult, we consider how the choice of prior copula impacts the posterior distribution in terms of asymptotic convergence of the posterior mode. Our resulting recommendations clarify when it is useful to elicit intricate prior dependence structures and when it is not.
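For readers unfamiliar with copula-based priors, the sketch below shows the generic construction the paper builds on: a Gaussian copula coupling two fixed marginal priors. The correlation value and marginals are illustrative assumptions; the paper's elicitation workflow and asymptotic analysis are not reproduced.

```python
# Generic copula-prior construction (illustrative; not the paper's procedure):
# sample a bivariate prior whose dependence comes from a Gaussian copula and
# whose marginals are Gamma and Beta.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
rho = 0.7                                    # elicited copula correlation (assumed)
cov = np.array([[1.0, rho], [rho, 1.0]])
z = rng.multivariate_normal([0.0, 0.0], cov, size=10_000)
u = stats.norm.cdf(z)                        # uniform margins, dependence from the copula
theta1 = stats.gamma(a=2.0, scale=1.0).ppf(u[:, 0])   # marginal prior for theta1
theta2 = stats.beta(a=2.0, b=5.0).ppf(u[:, 1])        # marginal prior for theta2
print("prior Spearman correlation:", stats.spearmanr(theta1, theta2)[0])
```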
- [43] arXiv:2401.12598 (replaced) [pdf, other]
-
Title: Asymptotic confidence interval for R2 in multiple linear regressionSubjects: Statistics Theory (math.ST)
Following White's approach to robust multiple linear regression, we give asymptotic confidence intervals for the multiple correlation coefficient R2 under minimal moment conditions. We also give the asymptotic joint distribution of the empirical estimators of the individual R2's. Through different sets of simulations, we show that the procedure is indeed robust (contrary to the procedure based on the near-exact distribution of the empirical estimator of R2 in the multivariate Gaussian case) and can also be applied to count linear regression.
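The paper derives intervals from the asymptotic distribution under White's robust framework; that variance formula is not reproduced here. As a generic stand-in, the sketch below computes R2 and a nonparametric bootstrap interval on synthetic, heteroskedastic data, which mimics the setting where robustness matters.

```python
# Generic stand-in (not the paper's asymptotic interval): nonparametric
# bootstrap CI for R^2 on synthetic heteroskedastic regression data.
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n) * (1 + np.abs(X[:, 0]))

def r_squared(X, y):
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return 1.0 - resid.var() / y.var()

boot = []
for _ in range(2000):
    idx = rng.integers(0, n, n)           # resample rows with replacement
    boot.append(r_squared(X[idx], y[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"R2 = {r_squared(X, y):.3f}, 95% bootstrap CI = ({lo:.3f}, {hi:.3f})")
```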
- [44] arXiv:2401.17249 (replaced) [pdf, html, other]
-
Title: Joint model with latent disease age: overcoming the need for reference timeSubjects: Methodology (stat.ME)
Introduction: Heterogeneity of the progression of neurodegenerative diseases is one of the main challenges faced in developing effective therapies. With the increasing number of large clinical databases, disease progression models have led to a better understanding of this heterogeneity. Nevertheless, these diseases may have no clear onset, and the underlying biological processes may start before the first symptoms. Such an ill-defined disease reference time is an issue for current joint models, which have proven their effectiveness by combining longitudinal and survival data.
Objective: In this work, we propose a joint non-linear mixed-effect model with a latent disease age to overcome this need for a precise reference time.
Method: To do so, we utilized an existing longitudinal model with a latent disease age as a longitudinal sub-model and associated it with a survival sub-model that estimates a Weibull distribution from the latent disease age. We then validated our model on different simulated scenarios. Finally, we benchmarked our model with a state-of-the-art joint model and reference survival and longitudinal models on simulated and real data in the context of Amyotrophic Lateral Sclerosis (ALS).
Results: On real data, our model achieved significantly better results than the state-of-the-art joint model for absolute bias (4.21 (4.41) versus 4.24 (4.14), p-value = 1.4e-17) and for mean cumulative AUC for right-censored events (0.67 (0.07) versus 0.61 (0.09), p-value = 1.7e-03).
Conclusion: We showed that our approach is better suited than the state-of-the-art in contexts where the reference time is not reliable. This work opens up the prospect of designing predictive and personalized therapeutic strategies.
- [45] arXiv:2402.10126 (replaced) [pdf, html, other]
-
Title: Exchangeability, prediction and predictive modeling in Bayesian statisticsSubjects: Statistics Theory (math.ST)
There is currently a renewed interest in the Bayesian predictive approach to statistics. This paper offers a review of foundational concepts and focuses on predictive modeling, which, by reasoning directly about prediction, bypasses inferential models or may characterize them. We detail predictive characterizations in exchangeable and partially exchangeable settings, for a large variety of data structures, and hint at new directions. The underlying concept is that Bayesian predictive rules are probabilistic learning rules, formalizing through conditional probability how we learn about future events given the available information. This concept has implications in any statistical problem and in inference, from classic contexts to less explored challenges, such as providing Bayesian uncertainty quantification to predictive algorithms in data science, as we show in the last part of the paper. The paper gives a historical overview, but also includes a few new results, presents some recent developments and poses some open questions.
- [46] arXiv:2403.03299 (replaced) [pdf, html, other]
-
Title: Demystifying and avoiding the OLS "weighting problem": Unmodeled heterogeneity and straightforward solutionsSubjects: Methodology (stat.ME); Econometrics (econ.EM); Applications (stat.AP)
Researchers have long run regressions of an outcome variable (Y) on a treatment (D) and covariates (X) to estimate treatment effects. Even absent unobserved confounding, the regression coefficient on D in this setup reports a conditional-variance-weighted average of strata-wise average effects, not generally equal to the average treatment effect (ATE). Numerous proposals have been offered to cope with this "weighting problem", including interpretational tools to help characterize the weights and diagnostic aids to help researchers assess the potential severity of this problem. We make two contributions that together suggest an alternative direction for researchers and this literature. Our first contribution is conceptual, demystifying these weights. Simply put, under heterogeneous treatment effects (and varying probability of treatment), the linear regression of Y on D and X will be misspecified. The "weights" of regression offer one characterization of the resulting coefficient that helps to clarify how it departs from the ATE. We also derive a more general expression for the weights than what is usually referenced. Our second contribution is practical: as these weights simply characterize misspecification bias, we suggest simply avoiding them through an approach that tolerates heterogeneous effects. A wide range of longstanding alternatives (regression-imputation/g-computation, interacted regression, and balancing weights) relax specification assumptions to allow heterogeneous effects. We make explicit the assumption of "separate linearity", under which each potential outcome is separately linear in X. This relaxation of conventional linearity offers a common justification for all of these methods and avoids the weighting problem, at an efficiency cost that will be small when there are few covariates relative to sample size.
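A tiny simulation makes the weighting problem concrete: with heterogeneous effects and a treatment probability that varies with X, the additive OLS coefficient on D is a variance-weighted average that differs from the ATE, while an interacted regression (one instance of the "separately linear" approach) recovers it. The data-generating process below is a hypothetical illustration, not an example from the paper.

```python
# Illustrative DGP (assumed, not from the paper): one binary stratum, effect 2
# in one stratum and 0 in the other, treatment probabilities 0.5 vs 0.1.
# Additive OLS overweights the high-variance stratum; interacted OLS with a
# centered covariate recovers the ATE.
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
x = rng.integers(0, 2, n)                  # binary covariate / stratum
p_treat = np.where(x == 1, 0.5, 0.1)       # propensity varies with X
d = rng.binomial(1, p_treat)
tau = np.where(x == 1, 2.0, 0.0)           # heterogeneous treatment effect
y = 1.0 + x + tau * d + rng.normal(size=n)

ate = tau.mean()                           # true ATE = 1 with balanced strata

Z_add = np.column_stack([np.ones(n), d, x])            # y ~ 1 + d + x
b_add = np.linalg.lstsq(Z_add, y, rcond=None)[0][1]

xc = x - x.mean()                                      # center before interacting
Z_int = np.column_stack([np.ones(n), d, x, d * xc])    # y ~ 1 + d + x + d*(x - xbar)
b_int = np.linalg.lstsq(Z_int, y, rcond=None)[0][1]

print(f"ATE={ate:.2f}  additive OLS={b_add:.2f}  interacted OLS={b_int:.2f}")
```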
- [47] arXiv:2403.03802 (replaced) [pdf, html, other]
-
Title: Inequalities and bounds for expected order statistics from transform-ordered familiesSubjects: Methodology (stat.ME)
We introduce a comprehensive method for establishing stochastic orders among order statistics in the i.i.d. case. This approach relies on the assumption that the underlying distribution is linked to a reference distribution through a transform order. Notably, this method exhibits broad applicability, particularly since several well-known nonparametric distribution families can be defined using relevant transform orders, including the convex and the star transform orders. In the context of convex-ordered families, we demonstrate that applying Jensen's inequality enables the derivation of bounds for the probability that a random variable exceeds the expected value of its corresponding order statistic.
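To see the quantity being bounded, the short Monte Carlo below estimates P(X > E[X_(k:n)]) for an Exponential(1) parent, where E[X_(k:n)] is available in closed form for comparison. The paper's bound via Jensen's inequality under the convex transform order is not reproduced here.

```python
# Monte Carlo illustration of the bounded quantity P(X > E[X_(k:n)]) for an
# Exponential(1) parent; the paper's transform-order bound is not reproduced.
import numpy as np

rng = np.random.default_rng(4)
n, k, reps = 10, 8, 200_000
samples = rng.exponential(size=(reps, n))
x_kn = np.sort(samples, axis=1)[:, k - 1]        # k-th order statistic
e_mc = x_kn.mean()                               # E[X_(k:n)] by Monte Carlo
e_exact = sum(1.0 / i for i in range(n - k + 1, n + 1))   # exact for Exp(1)
p_exceed = np.exp(-e_exact)                      # P(X > t) = exp(-t) for Exp(1)
print(f"MC E[X_(k:n)]={e_mc:.3f}, exact={e_exact:.3f}, "
      f"P(X > E[X_(k:n)])={p_exceed:.3f}")
```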
- [48] arXiv:2404.10351 (replaced) [pdf, other]
-
Title: On the Use of Relative Validity Indices for Comparing Clustering ApproachesSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Relative Validity Indices (RVIs) such as the Silhouette Width Criterion and Davies-Bouldin indices are the most widely used tools for evaluating and optimising clustering outcomes. Traditionally, their ability to rank collections of candidate dataset partitions has been used to guide the selection of the number of clusters, and to compare partitions from different clustering algorithms. However, there is a growing trend in the literature to use RVIs when selecting a Similarity Paradigm (SP) for clustering - the combination of normalisation procedure, representation method, and distance measure which affects the computation of the object dissimilarities used in clustering. Despite the growing prevalence of this practice, there has been no empirical or theoretical investigation into the suitability of RVIs for this purpose. Moreover, since RVIs are computed using object dissimilarities, it remains unclear how they would need to be implemented for fair comparisons of different SPs. This study presents the first comprehensive investigation into the reliability of RVIs for SP selection. We conducted extensive experiments with seven popular RVIs on over 2.7 million clustering partitions of synthetic and real-world datasets, encompassing feature-vector and time-series data. We identified fundamental conceptual limitations undermining the use of RVIs for SP selection, and our empirical findings confirmed this predicted unsuitability. Among our recommendations, we suggest that practitioners instead select SPs using external validation on high-quality labelled datasets or carefully designed outcome-oriented objective criteria, both of which should be informed by careful consideration of dataset characteristics and domain requirements. Our findings have important implications for clustering methodology and evaluation, suggesting the need for more rigorous approaches to SP selection.
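The sketch below shows the mechanics of the practice under scrutiny: scoring the same clustering pipeline under different similarity paradigms (normalisation x distance) with a silhouette score. Note that, as the paper argues, such scores are computed under different dissimilarities and are not directly comparable; the toy data and the scikit-learn parameter names (which assume version 1.2 or later) are illustrative assumptions.

```python
# Mechanics of RVI-based SP selection on a toy dataset (the practice the paper
# finds unreliable): vary normalisation and distance, score with silhouette.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for scaler_name, scaler in [("z-score", StandardScaler()), ("min-max", MinMaxScaler())]:
    Xs = scaler.fit_transform(X)
    for metric in ("euclidean", "manhattan"):
        labels = AgglomerativeClustering(
            n_clusters=4, metric=metric, linkage="average"
        ).fit_predict(Xs)
        score = silhouette_score(Xs, labels, metric=metric)
        print(f"{scaler_name:8s} {metric:10s} silhouette={score:.3f}")
```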
- [49] arXiv:2405.09510 (replaced) [pdf, html, other]
-
Title: The Instrumental Variable Model with Categorical Instrument, Treatment and Outcome: Characterization, Partial Identification, and Statistical InferenceComments: Added corollaries, statistical inference, and a real example compared to the previous versionSubjects: Statistics Theory (math.ST)
Instrumental variable (IV) analysis is a crucial tool in estimating causal relationships by addressing the issue of confounding variables that may bias the results. Among other work on IV models with binary exposure and outcomes, Richardson and Robins (2014) studied the instrumental variable model with binary exposure (X) and binary outcome (Y) with an instrument (Z) that takes Q states where Q>=2. However, IV models beyond binary X and Y have been less explored. In this work, we consider the instrumental variable model with categorical X, Y, Z taking values in {1, ..., K}, {1, ..., M}, and {1, ..., Q} respectively. We first give a simple closed-form characterization of the set of joint distributions of the potential outcomes P(Y(x=1), ..., Y(x=K)) compatible with a given observed probability distribution P(X, Y | Z). We further show that the bounds we derive are necessary, sufficient, and non-redundant, and that they hold under various versions of the independence assumptions that have been discussed in the literature. We also show how a confidence region of any convex function of the joint counterfactual probability, including the average causal effect (ATE), can be computed using an algorithm proposed by Guo and Richardson (2021), which is based on a new tail bound for the KL-divergence. We implement our bounds and provide practical recommendations through a real data example of a cash-incentive smoking cessation program.
- [50] arXiv:2405.19058 (replaced) [pdf, html, other]
-
Title: Participation bias in the estimation of heritability and genetic correlationSubjects: Methodology (stat.ME)
It is increasingly recognized that participation bias can pose problems for genetic studies. Recently, to overcome the challenge that genetic information on non-participants is unavailable, it has been shown that, by comparing IBD (identity-by-descent) segments shared and not shared among participants, one can estimate the genetic component underlying participation. That, however, does not directly address how to adjust estimates of heritability and genetic correlation for phenotypes correlated with participation. Here, for phenotypes whose mean differences between population and sample are known, we demonstrate a way to do so by adopting a statistical framework that separates out the genetic and non-genetic correlations between participation and these phenotypes. Crucially, our method avoids making the assumption that the effect of the genetic component underlying participation is manifested entirely through these other phenotypes. Applying the method to 12 UK Biobank phenotypes, we found that 8 have significant genetic correlations with participation, including body mass index, educational attainment, and smoking status. For most of these phenotypes, estimates of heritability and of the absolute value of the genetic correlation would be underestimated without adjustment.
- [51] arXiv:2407.19433 (replaced) [pdf, html, other]
-
Title: How Books Tell a History of Statistics in Portugal: Works of Foreigners, Estrangeirados, and OthersComments: 67 pages, 34 figuresJournal-ref: Communications in Mathematics 32 (2024), no. 3, 485-551Subjects: Other Statistics (stat.OT)
Foreigners and "estrangeirados", an expression meaning people who went to a foreign country ["estrangeiro"] to further their education there, had a leading role in the development of Mathematical Statistics in Portugal. Where Statistics is concerned, the "estrangeirados" of the nineteenth century were mainly liberal intellectuals exiled for political reasons. From 1930 onwards, the research funding authority sent university professors abroad and hired foreign researchers to stay in Portuguese institutions, some of whom were instrumental in importing new concepts and methods of inferential statistics. After 1970, there was a large program of sending young researchers abroad for doctoral studies. At the same time, many new universities and polytechnic institutes were created in Portugal. After that, aside from the foreigners who chose a research career in those institutions and the "estrangeirados" who had returned and created doctoral programs, others, who had not had the opportunity to study abroad, began to play a decisive role in the development of Statistics in Portugal. The publication of handbooks on Probability and Statistics, of theses and core papers in Portuguese scientific journals, and of works for the layman reveals how Statistics progressed from a descriptive to a mathematical discipline used for inference in all fields of knowledge, from the natural sciences to the methodology of scientific research.
- [52] arXiv:2410.02727 (replaced) [pdf, other]
-
Title: Regression Discontinuity Designs Under InterferenceSubjects: Methodology (stat.ME)
We extend the continuity-based framework to Regression Discontinuity Designs (RDDs) to identify and estimate causal effects in the presence of interference when units are connected through a network. In this setting, assignment to an "effective treatment," which comprises the individual treatment and a summary of the treatment of interfering units (e.g., friends, classmates), is determined by the unit's score and the scores of other interfering units, leading to a multiscore RDD with potentially complex, multidimensional boundaries. We characterize these boundaries and derive generalized continuity assumptions to identify the proposed causal estimands, i.e., point and boundary causal effects. Additionally, we develop a distance-based nonparametric estimator, derive its asymptotic properties under restrictions on the network degree distribution, and introduce a novel variance estimator that accounts for network correlation. Finally, we apply our methodology to the PROGRESA/Oportunidades dataset to estimate the direct and indirect effects of receiving cash transfers on children's school attendance.
- [53] arXiv:2411.12036 (replaced) [pdf, html, other]
-
Title: Prediction-Guided Active ExperimentsComments: 25 pages, 11 figuresSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM)
In this work, we introduce a new framework for active experimentation, the Prediction-Guided Active Experiment (PGAE), which leverages predictions from an existing machine learning model to guide sampling and experimentation. Specifically, at each time step, an experimental unit is sampled according to a designated sampling distribution, and the actual outcome is observed with a certain experimental probability; otherwise, only a prediction of the outcome is available. We begin by analyzing the non-adaptive case, where full information on the joint distribution of the predictor and the actual outcome is assumed. For this scenario, we derive an optimal experimentation strategy by minimizing the semi-parametric efficiency bound for the class of regular estimators. We then introduce an estimator that meets this efficiency bound, achieving asymptotic optimality. Next, we move to the adaptive case, where the predictor is continuously updated with newly sampled data. We show that the adaptive version of the estimator remains efficient and attains the same semi-parametric bound under certain regularity assumptions. Finally, we validate PGAE's performance through simulations and a semi-synthetic experiment using data from the US Census Bureau. The results underscore the PGAE framework's effectiveness and superiority compared to other existing methods.
- [54] arXiv:2411.13199 (replaced) [pdf, html, other]
-
Title: Sharp Bounds for Multiple Models in Matrix CompletionComments: 35 pages. Several typos have been corrected. All comments are warmly welcomedSubjects: Statistics Theory (math.ST)
In this paper, we demonstrate how a class of advanced matrix concentration inequalities, introduced in \cite{brailovskaya2024universality}, can be used to eliminate the dimensional factor in the convergence rate of matrix completion. This dimensional factor represents a significant gap between the upper bound and the minimax lower bound, especially in high dimension. Through a more precise spectral norm analysis, we remove the dimensional factors for five different estimators in various settings, thereby establishing their minimax rate optimality.
- [55] arXiv:2202.01694 (replaced) [pdf, html, other]
-
Title: Variational Nearest Neighbor Gaussian ProcessSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Variational approximations to Gaussian processes (GPs) typically use a small set of inducing points to form a low-rank approximation to the covariance matrix. In this work, we instead exploit a sparse approximation of the precision matrix. We propose variational nearest neighbor Gaussian process (VNNGP), which introduces a prior that only retains correlations within $K$ nearest-neighboring observations, thereby inducing sparse precision structure. Using the variational framework, VNNGP's objective can be factorized over both observations and inducing points, enabling stochastic optimization with a time complexity of $O(K^3)$. Hence, we can arbitrarily scale the inducing point size, even to the point of putting inducing points at every observed location. We compare VNNGP to other scalable GPs through various experiments, and demonstrate that VNNGP (1) can dramatically outperform low-rank methods, and (2) is less prone to overfitting than other nearest neighbor methods.
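The sparsity pattern that VNNGP exploits can be seen in a simpler, non-variational construction: a Vecchia-style factorization in which each point conditions only on its K nearest previously ordered neighbours, yielding a precision matrix with O(nK^2) nonzeros. The sketch below illustrates that structure only; it is not VNNGP's variational objective or its inducing-point treatment, and the kernel and sizes are assumptions.

```python
# Vecchia-style nearest-neighbour factorisation (illustrates the sparse
# precision structure only; not the VNNGP variational objective).
import numpy as np

def rbf(a, b, ls=0.3):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

rng = np.random.default_rng(5)
X = np.sort(rng.uniform(0, 1, size=(200, 1)), axis=0)
n, K = len(X), 5

# x_i | x_{nb(i)} ~ N(B_i x_{nb(i)}, F_i), nb(i) = K nearest previous points
B, F = np.zeros((n, n)), np.zeros(n)
F[0] = rbf(X[:1], X[:1])[0, 0]
for i in range(1, n):
    nb = np.argsort(np.abs(X[:i, 0] - X[i, 0]))[:K]
    Knb = rbf(X[nb], X[nb]) + 1e-8 * np.eye(len(nb))
    kib = rbf(X[i:i+1], X[nb]).ravel()
    w = np.linalg.solve(Knb, kib)
    B[i, nb] = w
    F[i] = rbf(X[i:i+1], X[i:i+1])[0, 0] - kib @ w

# Implied precision (I - B)^T diag(1/F) (I - B) is sparse by construction
IB = np.eye(n) - B
precision = IB.T @ np.diag(1.0 / F) @ IB
print("average nonzeros per row:", (np.abs(precision) > 1e-10).sum() / n)
```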
- [56] arXiv:2311.13159 (replaced) [pdf, html, other]
-
Title: Multi-Objective Optimization via Wasserstein-Fisher-Rao Gradient FlowYinuo Ren, Tesi Xiao, Tanmay Gangwani, Anshuka Rangi, Holakou Rahmanian, Lexing Ying, Subhajit SanyalSubjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Multi-objective optimization (MOO) aims to optimize multiple, possibly conflicting objectives, a problem with widespread applications. We introduce a novel interacting particle method for MOO inspired by molecular dynamics simulations. Our approach combines overdamped Langevin and birth-death dynamics, incorporating a "dominance potential" to steer particles toward global Pareto optimality. In contrast to previous methods, our method is able to relocate dominated particles, making it particularly adept at managing Pareto fronts with complicated geometries. Our method is also theoretically grounded as a Wasserstein-Fisher-Rao gradient flow with convergence guarantees. Extensive experiments confirm that our approach outperforms state-of-the-art methods on challenging synthetic and real-world datasets.
- [57] arXiv:2404.16583 (replaced) [pdf, html, other]
-
Title: Fast Machine-Precision Spectral Likelihoods for Stationary Time SeriesSubjects: Numerical Analysis (math.NA); Computation (stat.CO); Methodology (stat.ME)
We provide in this work an algorithm for approximating a very broad class of symmetric Toeplitz matrices to machine precision in $\mathcal{O}(n \log n)$ time with applications to fitting time series models. In particular, for a symmetric Toeplitz matrix $\mathbf{\Sigma}$ with values $\mathbf{\Sigma}_{j,k} = h_{|j-k|} = \int_{-1/2}^{1/2} e^{2 \pi i |j-k| \omega} S(\omega) \mathrm{d} \omega$ where $S(\omega)$ is piecewise smooth, we give an approximation $\mathbf{\mathcal{F}} \mathbf{\Sigma} \mathbf{\mathcal{F}}^H \approx \mathbf{D} + \mathbf{U} \mathbf{V}^H$, where $\mathbf{\mathcal{F}}$ is the DFT matrix, $\mathbf{D}$ is diagonal, and the matrices $\mathbf{U}$ and $\mathbf{V}$ are in $\mathbb{C}^{n \times r}$ with $r \ll n$. Studying these matrices in the context of time series, we offer a theoretical explanation of this structure and connect it to existing spectral-domain approximation frameworks. We then give a complete discussion of the numerical method for assembling the approximation and demonstrate its efficiency for improving Whittle-type likelihood approximations, including dramatic examples where a correction of rank $r = 2$ to the standard Whittle approximation increases the accuracy of the log-likelihood approximation from $3$ to $14$ digits for a matrix $\mathbf{\Sigma} \in \mathbb{R}^{10^5 \times 10^5}$. The method and analysis of this work applies well beyond time series analysis, providing an algorithm for extremely accurate solutions to linear systems with a wide variety of symmetric Toeplitz matrices whose entries are generated by a piecewise smooth $S(\omega)$. The analysis employed here largely depends on asymptotic expansions of oscillatory integrals, and also provides a new perspective on when existing spectral-domain approximation methods for Gaussian log-likelihoods can be particularly problematic.
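A quick numerical check of the structure in the displayed approximation: for a smooth S(ω), conjugating the Toeplitz covariance by the DFT gives a matrix that is close to diagonal, with diagonal close to S at the Fourier frequencies (the Whittle picture). The paper's rank-r correction UV^H and its O(n log n) assembly are not reproduced; the AR(1)-type spectral density below is an assumed example.

```python
# Numerical check: for a smooth S, F Sigma F^H is nearly diagonal with
# diagonal approx S(w_j) (Whittle); the paper's low-rank correction is omitted.
import numpy as np
from scipy.linalg import toeplitz
from scipy.integrate import quad

def S(w):                                    # smooth example spectral density (AR(1)-like)
    phi = 0.6
    return 1.0 / (1.0 - 2.0 * phi * np.cos(2.0 * np.pi * w) + phi**2)

n = 256
h = np.array([quad(lambda w: np.cos(2 * np.pi * k * w) * S(w), -0.5, 0.5)[0]
              for k in range(n)])            # autocovariances h_k
Sigma = toeplitz(h)

Fmat = np.fft.fft(np.eye(n)) / np.sqrt(n)    # unitary DFT matrix
A = Fmat @ Sigma @ Fmat.conj().T
freqs = np.fft.fftfreq(n)                    # Fourier frequencies in [-1/2, 1/2)
print("max |off-diagonal entry| :", np.max(np.abs(A - np.diag(np.diag(A)))))
print("max |diag(A) - S(w_j)|   :", np.max(np.abs(np.real(np.diag(A)) - S(freqs))))
```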
- [58] arXiv:2405.01425 (replaced) [pdf, html, other]
-
Title: In-and-Out: Algorithmic Diffusion for Sampling Convex BodiesComments: 33 pages. To appear in NeurIPS 2024 (spotlight). Improve Lemma 22 and 26Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
We present a new random walk for uniformly sampling high-dimensional convex bodies. It achieves state-of-the-art runtime complexity with stronger guarantees on the output than previously known, namely in Rényi divergence (which implies TV, $\mathcal{W}_2$, KL, $\chi^2$). The proof departs from known approaches for polytime algorithms for the problem -- we utilize a stochastic diffusion perspective to show contraction to the target distribution with the rate of convergence determined by functional isoperimetric constants of the stationary density.
- [59] arXiv:2405.01715 (replaced) [pdf, html, other]
-
Title: GRAMEP: an alignment-free method based on the Maximum Entropy Principle for identifying SNPsMatheus Henrique Pimenta-Zanon, André Yoshiaki Kashiwabara, André Luís Laforga Vanzela, Fabricio Martins LopesSubjects: Genomics (q-bio.GN); Information Theory (cs.IT); Applications (stat.AP)
Background: Advances in high throughput sequencing technologies provide a huge number of genomes to be analyzed. Thus, computational methods play a crucial role in analyzing and extracting knowledge from the data generated. Investigating genomic mutations is critical because of their impact on chromosomal evolution, genetic disorders, and diseases. It is common to adopt aligning sequences for analyzing genomic variations. However, this approach can be computationally expensive and restrictive in scenarios with large datasets. Results: We present a novel method for identifying single nucleotide polymorphisms (SNPs) in DNA sequences from assembled genomes. This study proposes GRAMEP, an alignment-free approach that adopts the principle of maximum entropy to discover the most informative k-mers specific to a genome or set of sequences under investigation. The informative k-mers enable the detection of variant-specific mutations in comparison to a reference genome or other set of sequences. In addition, our method offers the possibility of classifying novel sequences with no need for organism-specific information. GRAMEP demonstrated high accuracy in both in silico simulations and analyses of viral genomes, including Dengue, HIV, and SARS-CoV-2. Our approach maintained accurate SARS-CoV-2 variant identification while demonstrating a lower computational cost compared to methods with the same purpose. Conclusions: GRAMEP is an open and user-friendly software based on maximum entropy that provides an efficient alignment-free approach to identifying and classifying unique genomic subsequences and SNPs with high accuracy, offering advantages over comparative methods. The instructions for use, applicability, and usability of GRAMEP are open access at this https URL
- [60] arXiv:2406.11919 (replaced) [pdf, html, other]
-
Title: Graph Knowledge Distillation to Mixture of ExpertsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
In terms of accuracy, Graph Neural Networks (GNNs) are the best architectural choice for the node classification task. Their drawback in real-world deployment is the latency that emerges from the neighbourhood processing operation. One solution to the latency issue is to perform knowledge distillation from a trained GNN to a Multi-Layer Perceptron (MLP), where the MLP processes only the features of the node being classified (and possibly some pre-computed structural information). However, the performance of such MLPs in both transductive and inductive settings remains inconsistent for existing knowledge distillation techniques. We propose to address the performance concerns by using a specially-designed student model instead of an MLP. Our model, named Routing-by-Memory (RbM), is a form of Mixture-of-Experts (MoE), with a design that enforces expert specialization. By encouraging each expert to specialize on a certain region on the hidden representation space, we demonstrate experimentally that it is possible to derive considerably more consistent performance across multiple datasets. Code available at this https URL.
- [61] arXiv:2407.13979 (replaced) [pdf, html, other]
-
Title: Truthfulness of Calibration MeasuresComments: To appear at NeurIPS 2024Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
We initiate the study of the truthfulness of calibration measures in sequential prediction. A calibration measure is said to be truthful if the forecaster (approximately) minimizes the expected penalty by predicting the conditional expectation of the next outcome, given the prior distribution of outcomes. Truthfulness is an important property of calibration measures, ensuring that the forecaster is not incentivized to exploit the system with deliberate poor forecasts. This makes it an essential desideratum for calibration measures, alongside typical requirements, such as soundness and completeness.
We conduct a taxonomy of existing calibration measures and their truthfulness. Perhaps surprisingly, we find that all of them are far from being truthful. That is, under existing calibration measures, there are simple distributions on which a polylogarithmic (or even zero) penalty is achievable, while truthful prediction leads to a polynomial penalty. Our main contribution is the introduction of a new calibration measure termed the Subsampled Smooth Calibration Error (SSCE) under which truthful prediction is optimal up to a constant multiplicative factor. - [62] arXiv:2408.03769 (replaced) [pdf, html, other]
We conduct a taxonomy of existing calibration measures and their truthfulness. Perhaps surprisingly, we find that all of them are far from being truthful. That is, under existing calibration measures, there are simple distributions on which a polylogarithmic (or even zero) penalty is achievable, while truthful prediction leads to a polynomial penalty. Our main contribution is the introduction of a new calibration measure termed the Subsampled Smooth Calibration Error (SSCE) under which truthful prediction is optimal up to a constant multiplicative factor.
- [62] arXiv:2408.03769 (replaced) [pdf, html, other]
-
Title: Nadaraya-Watson kernel smoothing as a random energy modelComments: 10 pages, 3 figuresSubjects: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (stat.ML)
Precise asymptotics have revealed many surprises in high-dimensional regression. These advances, however, have not extended to perhaps the simplest estimator: direct Nadaraya-Watson (NW) kernel smoothing. Here, we describe how one can use ideas from the analysis of the random energy model (REM) in statistical physics to compute sharp asymptotics for the NW estimator when the sample size is exponential in the dimension. As a simple starting point for investigation, we focus on the case in which one aims to estimate a single-index target function using a radial basis function kernel on the sphere. Our main result is a pointwise asymptotic for the NW predictor, showing that it re-scales the argument of the true link function. Our work provides a first step towards a detailed understanding of kernel smoothing in high dimensions.
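The estimator being analyzed is simple enough to state in a few lines: Nadaraya-Watson smoothing with an RBF kernel, applied to a single-index target on the sphere. The sketch below is only the baseline estimator in a small, computationally feasible setting; the REM-style asymptotics (sample size exponential in the dimension) are not reproduced, and all sizes and the bandwidth are assumptions.

```python
# Baseline Nadaraya-Watson estimator with an RBF kernel on the sphere, for a
# single-index target; illustrative sizes, not the paper's asymptotic regime.
import numpy as np

rng = np.random.default_rng(6)
d, n_train, n_test, bw = 10, 5000, 5, 0.5

def sphere(n, d):
    z = rng.normal(size=(n, d))
    return z / np.linalg.norm(z, axis=1, keepdims=True)

w = sphere(1, d)[0]                           # index direction
link = np.tanh                                # single-index link function
X, Xq = sphere(n_train, d), sphere(n_test, d)
y = link(X @ w)

def nw_predict(Xq, X, y, bw):
    d2 = ((Xq[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * bw**2))             # RBF kernel weights
    return (K @ y) / K.sum(axis=1)

print("NW prediction  :", np.round(nw_predict(Xq, X, y, bw), 3))
print("true link value:", np.round(link(Xq @ w), 3))
```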
- [63] arXiv:2409.06091 (replaced) [pdf, html, other]
-
Title: Scalable Multitask Learning Using Gradient-based Estimation of Task AffinityComments: 16 pages. Appeared in KDD 2024Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
Multitask learning is a widely used paradigm for training models on diverse tasks, with applications ranging from graph neural networks to language model fine-tuning. Since tasks may interfere with each other, a key notion for modeling their relationships is task affinity. This includes pairwise task affinity, computed among pairs of tasks, and higher-order affinity, computed among subsets of tasks. Naively computing either of them requires repeatedly training on data from various task combinations, which is computationally intensive. We present a new algorithm Grad-TAG that can estimate task affinities without this repeated training.
The key idea of Grad-TAG is to train a "base" model for all tasks and then use a linearization technique to estimate the loss of the model for a specific task combination. The linearization works by computing a gradient-based approximation of the loss, using low-dimensional projections of gradients as features in a logistic regression to predict labels for the task combination. We show that the linearized model can provably approximate the loss when the gradient-based approximation is accurate, and we also verify this empirically on several large models. Then, given the estimated task affinity, we design a semi-definite program for clustering similar tasks by maximizing the average density of clusters.
We evaluate Grad-TAG's performance across seven datasets, including multi-label classification on graphs, and instruction fine-tuning of language models. Our task affinity estimates are within 2.7% distance to the true affinities while needing only 3% of FLOPs in full training. On our largest graph with 21M edges and 500 labeling tasks, our algorithm delivers estimates within 5% distance to the true affinities, using only 112 GPU hours. Our results show that Grad-TAG achieves excellent performance and runtime tradeoffs compared to existing approaches.
- [64] arXiv:2410.02068 (replaced) [pdf, html, other]
-
Title: Fast and Sample Efficient Multi-Task Representation Learning in Stochastic Contextual BanditsSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We study how representation learning can improve the learning efficiency of contextual bandit problems. We study the setting where we play T contextual linear bandits with dimension d simultaneously, and these T bandit tasks collectively share a common linear representation with a dimensionality of r much smaller than d. We present a new algorithm based on alternating projected gradient descent (GD) and a minimization estimator to recover a low-rank feature matrix. Using the proposed estimator, we present a multi-task learning algorithm for linear contextual bandits and prove the regret bound of our algorithm. We also present experiments comparing the performance of our algorithm against benchmark algorithms.
- [65] arXiv:2410.13714 (replaced) [pdf, html, other]
-
Title: Generation through the lens of learning theoryComments: 28 pages, 2 figuresSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We study generation through the lens of statistical learning theory. First, we abstract and formalize the results of Gold [1967], Angluin [1979], Angluin [1980] and Kleinberg and Mullainathan [2024] in terms of a binary hypothesis class defined over an abstract example space. Then, we extend the notion of "generation" from Kleinberg and Mullainathan [2024] to two new settings, which we call "uniform" and "non-uniform" generation, and provide a characterization of which hypothesis classes are uniformly and non-uniformly generatable. As is standard in learning theory, our characterizations are in terms of the finiteness of a new combinatorial dimension termed the Closure dimension. By doing so, we are able to compare generatability with predictability (captured via PAC and online learnability) and show that these two properties of hypothesis classes are incompatible -- there are classes that are generatable but not predictable and vice versa. Finally, we extend our results to capture prompted generation and give a complete characterization of which classes are prompt generatable, generalizing some of the work by Kleinberg and Mullainathan [2024].
- [66] arXiv:2410.24079 (replaced) [pdf, html, other]
-
Title: Hamiltonian Monte Carlo Inference of Marginalized Linear Mixed-Effects ModelsComments: 38th Conference on Neural Information Processing Systems (NeurIPS 2024)Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Bayesian reasoning in linear mixed-effects models (LMMs) is challenging and often requires advanced sampling techniques like Markov chain Monte Carlo (MCMC). A common approach is to write the model in a probabilistic programming language and then sample via Hamiltonian Monte Carlo (HMC). However, there are many ways a user can transform a model that make inference more or less efficient. In particular, marginalizing some variables can greatly improve inference but is difficult for users to do manually. We develop an algorithm to easily marginalize random effects in LMMs. A naive approach introduces cubic time operations within an inference algorithm like HMC, but we reduce the running time to linear using fast linear algebra techniques. We show that marginalization is always beneficial when applicable and highlight improvements in various models, especially ones from cognitive sciences.
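The linear-algebra trick behind avoiding cubic cost when marginalizing random effects is standard: evaluate the marginal Gaussian likelihood y ~ N(Xβ, ZGZᵀ + σ²I) with the Woodbury identity and the matrix determinant lemma, so the cost is linear in n for a fixed number q of random effects. The sketch below shows that computation under assumed toy dimensions; it is not the paper's implementation or its integration with HMC.

```python
# Marginal LMM log-likelihood via Woodbury + matrix determinant lemma, so the
# cost is O(n q^2) rather than O(n^3). Not the paper's implementation/HMC wiring.
import numpy as np

def marginal_loglik(y, X, Z, beta, G, s2):
    n, q = Z.shape
    r = y - X @ beta
    M = np.linalg.inv(G) + (Z.T @ Z) / s2        # q x q "capacitance" matrix
    Minv = np.linalg.inv(M)
    Ztr = Z.T @ r
    # Woodbury: r^T (s2 I + Z G Z^T)^{-1} r = r^T r / s2 - (Z^T r)^T M^{-1} (Z^T r) / s2^2
    quad = (r @ r) / s2 - Ztr @ Minv @ Ztr / s2**2
    # Determinant lemma: log|s2 I + Z G Z^T| = n log s2 + log|G| + log|M|
    logdet = n * np.log(s2) + np.linalg.slogdet(G)[1] + np.linalg.slogdet(M)[1]
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

rng = np.random.default_rng(7)
n, p, q = 2000, 3, 8
X, Z = rng.normal(size=(n, p)), rng.normal(size=(n, q))
beta, G, s2 = np.array([1.0, -0.5, 0.2]), 0.5 * np.eye(q), 1.0
y = X @ beta + Z @ rng.multivariate_normal(np.zeros(q), G) + np.sqrt(s2) * rng.normal(size=n)
print("marginal log-likelihood:", marginal_loglik(y, X, Z, beta, G, s2))
```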
- [67] arXiv:2411.02853 (replaced) [pdf, html, other]
-
Title: ADOPT: Modified Adam Can Converge with Any $\beta_2$ with the Optimal RateShohei Taniguchi, Keno Harada, Gouki Minegishi, Yuta Oshima, Seong Cheol Jeong, Go Nagahara, Tomoshi Iiyama, Masahiro Suzuki, Yusuke Iwasawa, Yutaka MatsuoComments: Accepted at Neural Information Processing Systems (NeurIPS 2024)Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Adam is one of the most popular optimization algorithms in deep learning. However, it is known that Adam does not converge in theory unless a hyperparameter, i.e., $\beta_2$, is chosen in a problem-dependent manner. There have been many attempts to fix the non-convergence (e.g., AMSGrad), but they require the impractical assumption that the gradient noise is uniformly bounded. In this paper, we propose a new adaptive gradient method named ADOPT, which achieves the optimal convergence rate of $\mathcal{O} ( 1 / \sqrt{T} )$ with any choice of $\beta_2$ without depending on the bounded noise assumption. ADOPT addresses the non-convergence issue of Adam by removing the current gradient from the second moment estimate and changing the order of the momentum update and the normalization by the second moment estimate. We also conduct intensive numerical experiments and verify that ADOPT achieves superior results compared to Adam and its variants across a wide range of tasks, including image classification, generative modeling, natural language processing, and deep reinforcement learning. The implementation is available at this https URL.
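A schematic update sketch, based only on the two changes described in the abstract: the second-moment estimate excludes the current gradient (so the current gradient is normalized by the previous estimate), and normalization is applied before the momentum update. Details such as epsilon handling, clipping, or initialization may differ from the authors' released implementation at the linked URL; the toy usage is an assumption.

```python
# ADOPT-style step as described in the abstract (schematic; the released code
# may differ in details such as clipping and epsilon): normalise the current
# gradient by the *previous* second-moment estimate, apply momentum to the
# normalised gradient, then update the second moment with the current gradient.
import numpy as np

def adopt_like_step(theta, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.9999, eps=1e-6):
    normalized = grad / np.maximum(np.sqrt(v), eps)   # uses v from previous steps only
    m = beta1 * m + (1 - beta1) * normalized          # momentum on the normalised grad
    theta = theta - lr * m
    v = beta2 * v + (1 - beta2) * grad**2             # second moment updated last
    return theta, m, v

# Toy usage on f(theta) = ||theta||^2 / 2, whose gradient is theta itself
theta, m, v = np.ones(3), np.zeros(3), np.ones(3)     # v initialisation is an assumption
for _ in range(1000):
    theta, m, v = adopt_like_step(theta, theta, m, v, lr=1e-2)
print("theta after 1000 steps:", theta)
```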
- [68] arXiv:2411.12700 (replaced) [pdf, other]
-
Title: Learning multivariate Gaussians with imperfect adviceSubjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Information Theory (cs.IT); Machine Learning (stat.ML)
We revisit the problem of distribution learning within the framework of learning-augmented algorithms. In this setting, we explore the scenario where a probability distribution is provided as potentially inaccurate advice on the true, unknown distribution. Our objective is to develop learning algorithms whose sample complexity decreases as the quality of the advice improves, thereby surpassing standard learning lower bounds when the advice is sufficiently accurate.
Specifically, we demonstrate that this outcome is achievable for the problem of learning a multivariate Gaussian distribution $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ in the PAC learning setting. Classically, in the advice-free setting, $\tilde{\Theta}(d^2/\varepsilon^2)$ samples are sufficient and worst-case necessary to learn $d$-dimensional Gaussians up to TV distance $\varepsilon$ with constant probability. When we are additionally given a parameter $\tilde{\boldsymbol{\Sigma}}$ as advice, we show that $\tilde{O}(d^{2-\beta}/\varepsilon^2)$ samples suffice whenever $\| \tilde{\boldsymbol{\Sigma}}^{-1/2} \boldsymbol{\Sigma} \tilde{\boldsymbol{\Sigma}}^{-1/2} - \boldsymbol{I_d} \|_1 \leq \varepsilon d^{1-\beta}$ (where $\|\cdot\|_1$ denotes the entrywise $\ell_1$ norm) for any $\beta > 0$, yielding a polynomial improvement over the advice-free setting.
- [69] arXiv:2411.13361 (replaced) [pdf, html, other]
-
Title: Integration of Active Learning and MCMC Sampling for Efficient Bayesian Calibration of Mechanical PropertiesComments: 28 pages, 14 figuresSubjects: Computational Physics (physics.comp-ph); Applications (stat.AP); Machine Learning (stat.ML)
Recent advancements in Markov chain Monte Carlo (MCMC) sampling and surrogate modelling have significantly enhanced the feasibility of Bayesian analysis across engineering fields. However, the selection and integration of surrogate models and cutting-edge MCMC algorithms often depend on ad-hoc decisions. A systematic assessment of their combined influence on analytical accuracy and efficiency is notably lacking. The present work offers a comprehensive comparative study, employing a scalable case study in computational mechanics focused on the inference of spatially varying material parameters, that sheds light on the impact of methodological choices for surrogate modelling and sampling. We show that a priori training of the surrogate model introduces large errors in the posterior estimation even in low to moderate dimensions. We introduce a simple active learning strategy based on the path of the MCMC algorithm that is superior to all a priori trained models, and determine its training data requirements. We demonstrate that the choice of the MCMC algorithm has only a small influence on the amount of training data but no significant influence on the accuracy of the resulting surrogate model. Further, we show that the accuracy of the posterior estimation largely depends on the surrogate model, but not even a tailored surrogate guarantees convergence of the estimate. Finally, we identify the forward model as the bottleneck in the inference process, not the MCMC algorithm. While related works focus on employing advanced MCMC algorithms, we demonstrate that the training data requirements render the surrogate modelling approach infeasible before the benefits of these gradient-based MCMC algorithms on cheap models can be reaped.
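A minimal sketch of the path-based active-learning idea: random-walk Metropolis on a Gaussian-process surrogate of the log-likelihood, where the expensive forward model is evaluated and the surrogate refit whenever the surrogate's predictive uncertainty at a proposal exceeds a threshold. The toy forward model, kernel, threshold, and proposal scale are assumptions, not the paper's computational-mechanics setup.

```python
# Sketch: active learning along the MCMC path. The forward model, kernel and
# threshold are toy assumptions; this is not the paper's case study.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(8)

def expensive_loglik(theta):                  # stand-in for a costly forward model
    return -0.5 * np.sum((theta - 1.0) ** 2)

train_X = rng.uniform(-3, 3, size=(5, 2))     # small initial design
train_y = np.array([expensive_loglik(t) for t in train_X])
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)
gp.fit(train_X, train_y)

theta = np.zeros(2)
cur_ll, sd_tol, n_calls = expensive_loglik(theta), 0.1, len(train_y)
for step in range(2000):
    prop = theta + 0.5 * rng.normal(size=2)                 # random-walk proposal
    mean, sd = gp.predict(prop[None, :], return_std=True)
    if sd[0] > sd_tol:                                      # surrogate too uncertain:
        ll = expensive_loglik(prop)                         #   query the forward model
        train_X = np.vstack([train_X, prop])
        train_y = np.append(train_y, ll)
        gp.fit(train_X, train_y)                            #   and refit the surrogate
        n_calls += 1
    else:
        ll = mean[0]                                        # trust the surrogate
    if np.log(rng.uniform()) < ll - cur_ll:                 # Metropolis accept/reject
        theta, cur_ll = prop, ll
print(f"forward-model calls: {n_calls} of 2000 proposals; final theta: {theta}")
```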