Efficient estimation of partially linear additive Cox models and variance estimation under shape restrictions
Abstract
Shape-restricted inferences have exhibited empirical success in various applications with survival data. However, certain works fall short in providing a rigorous theoretical justification and an easy-to-use variance estimator with theoretical guarantee. Motivated by Deng et al. (2023), this paper delves into an additive and shape-restricted partially linear Cox model for right-censored data, where each additive component satisfies a specific shape restriction, encompassing monotonic increasing/decreasing and convexity/concavity. We systematically investigate the consistencies and convergence rates of the shape-restricted maximum partial likelihood estimator (SMPLE) of all the underlying parameters. We further establish the aymptotic normality and semiparametric effiency of the SMPLE for the linear covariate shift. To estimate the asymptotic variance, we propose an innovative data-splitting variance estimation method that boasts exceptional versatility and broad applicability. Our simulation results and an analysis of the Rotterdam Breast Cancer dataset demonstrate that the SMPLE has comparable performance with the maximum likelihood estimator under the Cox model when the Cox model is correct, and outperforms the latter and Huang (1999)’s method when the Cox model is violated or the hazard is nonsmooth. Meanwhile, the proposed variance estimation method usually leads to reliable interval estimates based on the SMPLE and its competitors.
Keywords: Shape restriction; Righter-censored data; Additive model; Semiparametric efficiency; Variance estimation
1 Introduction
Shape restrictions (such as monotonicity and convexity) arise naturally in numerous practical scenarios. For instance, the growth curves of animals and plants in ecology and the dose-response in medicine must inherently exhibit non-decreasing characteristics (Chang et al., 2007; Wang and Ghosh, 2012). In the realm of economics, utility and production functions are often concave in income and prices (Matzkin, 1991; Varian, 1984), cost functions are monotone increasing, concave in input prices, and may exhibit non-increasing or non-decreasing returns to scale (Horowitz and Lee, 2017). In genetic epidemiology studies, the cumulative risk of a disease for individuals possess monotonicity (Qin et al., 2014). While in reliability analysis, the bathtub curve describing the failure rate typically displays convexity.
Incorporating shape restrictions into statistical analysis, apart from its exceptional interpretability and ability to enforce domain-specific constraints, often results in an estimation procedure that is devoid of tuning parameters, enhancing its efficiency and robustness. Therefore shape-restricted techniques has become an increasing popular tool for statistical inference or learning in various settings over the past decades. A comprehensive review on shape-restricted nonparametric inferences can be found in Groeneboom and Jongbloed (2014) and references therein. Recently, Chen and Samworth (2016) developed an algorithm for the estimation of the generalized additive model in which each of the additive components is linear or subject to a shape restriction. Balabdaoui et al. (2019) considered the estimation of the index parameter in a single-index model with a monotonically increasing link function. Deng and Zhang (2020) studied minimax and adaptation rates in general multiple isotonic regression. Feng et al. (2022) systematically investigate the theoretical properties of the least squared estimator of a S-shaped regression function.
This paper focus on the statistical inference for righter-censored survival data. Let denote the survival time and denote a vector of covariates. We consider the partially linear Cox model of (Sasieni, 1992, PLCM) for modelling the conditional hazard function, i.e.
(1) |
where is the unspecified baseline hazard function, is unspecified and is an unknown function. This model reduces to the renowned Cox proportional hazards model (Cox, 1972, 1975) when the covariate disappears, and it becomes the nonparametric Cox model (Sleeper and Harrington, 1990; O’Sullivan, 1993) in the absence of .
Many nonparametric techniques have been developed for the estimation of the PLCM, in particular for the linear covariate effect. Examples include profile partial likelihood together with a kernel technique (Heller, 2001), maximum likelihood estimation with a deep neural network (Zhong et al., 2022), and a kernel machine representation method (Rong et al., 2024), etc. However, these methods are either hampered by the curse of dimensionality or lack interpretability for , or suffer from tuning parameters, whose selection is not always straightforward. Alternatively, Huang (1999) proposed to model by a generalized additive model (Hastie and Tibshirani, 1986, 1990), which effectively avoids the curse of dimensionality and enforces an additive effect for the covariate . Specifically,
(2) |
where for , is the -th component of and is an unknown function. Huang (1999) proposed the use of polynomial splines to fit the unknown additive components. This method entails a number of tuning parameters, also yields convergence rates that lack conciseness and elegance. Furthermore, the spline method does not provide good interpretability for the additive covariate .
Our paper is motivated by the work of Deng et al. (2023) which studied a shape-restricted and additive PLCM. Specifically, under models (1) and 2, they assume that each is monotonic increasing/decreasing or convex/concave. An active-set optimization algorithm was provided to calculate the shape-restricted maximum likelihood estimator. The shape-restriction strategy facilitates the utilization of prior knowledge regarding the effect of the log conditional hazard function on each covariate and leads to a tuning-parameter-free estimation procedure. However, they proved only a consistency result, and did not provide any asymptotic normality results. Qin et al. (2021) studied a PLCM with a single additive component subject to shape restrictions, but they did not establish any -consistency result. In addition, in general shape-restriction inferences, even if asymptotic normality results can be established, it is generally challenging to construct reasonable estimators for the asymptotic variances with theoretical guarantees (Groeneboom and Hendrickx, 2017).
This paper makes two main contributions to the literature of additive and shape-restricted PLCMs for survival data. The first contribution is to provide powerful statistical guarantees for the shape-restricted maximum partial likelihood estimator (SMPLE) and the induced Breslow-type estimator for the baseline cumulative hazard function under the model assumption of Deng et al. (2023). This includes a thorough convergence rate analysis for the estimators of the infinitely dimensional parameters, as well as establishing asymptotic normality and semiparametric efficiency for the estimator of the linear covariate effect. Our second contribution is to offer an easy-to-use estimator for the asymptotic variance of the linear covariate effect estimator. We show that this variance estimation method always provide consistent estimators once the corresponding asymptotic normality result holds. This method is very flexible and is applicable for general purpose especially in shape-restricted inferences, where theoretical guarantee of a bootstrap variance estimator is generally rather challenging (Groeneboom and Hendrickx, 2017). Our simulation results and an analysis of the Rotterdam Breast Cancer dataset demonstrate that the SMPLE has comparable performance with the maximum likelihood estimator under the Cox model when the Cox model is correct, and outperforms the latter and Huang (1999)’s method when the Cox model is violated and the hazard is nonsmooth. Meanwhile, the proposed variance estimation method usually leads to reliable interval estimates for the SMPLE and its competitors.
The rest of this paper is organized as follows. Section 2 introduces notations, data, and the shape-restricted maximum partial likelihood estimators (SMPLE). Section 3 investigates the convergence rates of the SMPLEs for all the unknown parameters, including , the unknown additive components, and the baseline cumulative hazard function. Section 4 establishes the asymptotic normality and semiparametric efficiency of the SMPLE for . A novel estimation method is also provided to estimate the asymptotic variance of the SMPLE of . A simulation study and real data analysis are presented in Section 5 and 6, respectively. Section 7 contains concluding remarks. For clarity, all technical proofs are postponed to the supplementary material.
2 Methodology
2.1 Data and model assumptions
Let and be the survival time and the vector of covariates, respectively, in the introduction. Suppose that given , the conditional hazard function of satisfies model (1) with satisfying (2). The survival time may be right censored by a censoring time and we only observe . Throughout this paper, we use to denote the indicator function of the set and use a subscript 0 to highlight the true counterpart of a parameter. Let be the non-censoring indicator. Given independent and identically distributed (iid) observations , , from , we wish to infer and the baseline cumulative hazard function .
The identifiability issue of models (1) and (2) was investigated by Deng et al. (2023), following which we assume , , for identifiability. Furthermore, we assume that for , satisfies one of the four shape restrictions: monotone increasing, monotone decreasing, convex and concave, which are encoded as shape types 1, 2, 3 and 4, respectively. For any additive function , we define , where denotes the shape type of a univariate function . We always denote . Let be the support of and for simplicity, we assume that the support of is for .
2.2 SMPLE
For any , denote and , where . times the usual partial log likelihood is
We propose to estimate by the shape-restricted maximum partial likelihood estimator (SMPLE),
(3) |
where
is the parameter space of . With the SMPLE in (3), we estimate by the Breslow-type estimator
(4) |
where
The SMPLE defined in (3) can be calculated with the active-set algorithm introduced in Deng et al. (2023). Let be functions satisfying for all . The function is unique only at the observed and is therefore non-unique typically for other than ’s, which is akin to general shape-restricted regression estimators (Chen and Samworth, 2016). This implies that is usually non-unique for , and the solution set of always contains a piece-wise linear function (Deng et al., 2023). See Figure 3 for an illustration.
3 Rate of convergence
The consistency property of the SMPLEs , , and was established by Deng et al. (2023). In this section, we establish their convergence rates. We make the following assumptions.
Assumption 1.
(i) The observed are in the interval , for some . (ii) Given , and are mutually independent of each other. (iii) and almost surely for some constant . (iv) and .
Let denote the usual Euclidean norm and the supreme norm of a real-valued function . For any constant , define
Assumption 2.
The support of is a bounded subset of and there exists a positive constant such that .
Assumption 3.
There exists a small positive constant such that almost surely with respect to the probability measure of .
Assumption 4.
The joint density of satisfies
Assumption 5.
When , the density function with respect to the Lebesgue measure has uniformly upper and lower bounds on .
Assumptions 1–2 are standard in the theoretical analysis of traditional Cox model and its variants (Huang, 1999; Zhong et al., 2022). Assumption 3 ensures that the probability of being uncensored is positive regardless of the covariate values, and it is used to establish the convergence rate results in Theorem 1. Assumption 4 is used in the calculation of the semiparametric efficiency lower bound (Huang, 1999). In Assumption 5, the upper bound requirement is used to calculate some entropy results needed in the proof of Proposition 1, and the lower bound requirement guarantees that the approximation errors of piecewise linear approximations of the convex/concave additive components to themselves are small enough in the proof of Theorem 3.
The Proposition below establishes the consistency of , as an estimator of , which roughly implies the consistency of . We would have proved the consistencies of and each separately. However, the latter results are not needed in the proofs of the subsequent convergence rate results given the consistency of .
Proposition 1.
Define where denotes the expectation with respect to . Let denote the norm and . One of our main results is to establish the convergence rate of the SMPLE .
Theorem 1.
According to Theorem 1, the rates of convergence of all are if none of the additive components of is monotonic. Conversely, if one additive component of is monotonic, then their convergence rates all slow down to . An explanation for this finding is that the complexity of the class of bounded and monotonic functions is much larger than that of the class of bounded and convex (or concave) functions. These convergence rate results are free from the covariate dimensionality and exhibit a much more elegant form than those in Huang (1999) and Zhong et al. (2022). Theorem 1 also establishes the convergence rate of , although it is sub-optimal.
With Theorem 1, we are able to establish the uniformly rate of convergence for the SMPLE in (4) of the baseline cumulative hazard function . It turns out that has the same convergence rate as , although their convergence rates are quantified by different distances.
Theorem 2.
Assume the same conditions as in Proposition 1. As , it holds that
Remark 1.
In practice, one may impose a combination of monotonicity and convexity/concavity constraints on the additive components according to domain knowledge. See (Chen and Samworth, 2016; Kuchibhotla et al., 2023; Deng et al., 2023) for further motivation on additional shape constraints. Proposition 1 and Theorems 1–4 still hold when model (2) incorporates additive components that satisfy both monotonicity and convexity/concavity restrictions. An intuitive explanation for this result is that the parameter space is reduced by additional constraints on the additive components and this can lead to better convergence rates of the SMPLE (if not the same).
4 Asymptotic normality and efficiency
Based on the convergence rate results in the previous section, in this section, we further show that our SMPLE in (3) for the linear covariate effect is asymptotically normal and semiparametric efficient, in the sense that its asymptotic variance achieves the semiparametric efficiency lower bound (Bickel et al., 1993) or the information bound of estimating under models (1) and (2).
We begin with presenting the information bound for . Recall that , and define
which is a counting process martingale associated with the Cox model. The log-likelihood of model (1) based on one observation is (up to constant)
(6) |
Conisder a parametric smooth sub-model and , , with and . Define to be the set of satisfying and for some submodel. Similarly, for , define to be the set of satisfying , and for some submodel. The following lemma, which is Theorem 3.1 of Huang (1999), gives the information bound of .
Lemma 1 (Theorem 3.1 of Huang (1999)).
Additional assumptions are needed to obtain the asymptotic normality and efficiency of . Denote for .
Assumption 6.
When , there exist constant and such that and for all and .
Assumption 7.
When , the function is -Hlder continuous for all and some .
Assumption 8.
When , the function is convex for some constant ; when , the function is concave for some constant .
Assumption 6 is used to control the fluctuation of the score function corresponding to the monotone components in the direction of the projection defined in Lemma 1. (Huang, 2002; Cheng, 2009) adopted a similar assumption. Assumptions 7–8, which are analogues of Assumption B1 of Kuchibhotla et al. (2023), are used for approximations of the convex (concave) components.
Theorem 3.
By Theorem 3, has an asymptotically normal distribution with asymptotic variance . When making inference about based on this theorem, we need to construct a consistent estimator for . However, or equivalently has a rather complicated form, making its plug-in estimator not easy to use. To crack this nut, we propose a novel data-splitting estimation method to estimate .
4.1 Data-splitting variance estimation and inference on
We introduce the proposed data-splitting variance estimation method under a general setting as it is applicable generally. Let be a functional of a statistical population and be an estimator of based on i.i.d samples from . Suppose that , where is a semi-positive matrix. Let and . We partition the sample into subsamples, each of which has observations, and let denote the estimator of based on the -th subsample, . Our splitting-data estimator for the asymptotic variance is defined as
(8) |
where is the sample mean of . For better stableness, we may repeat the above splitting and estimating procedure for many times and take the average of the resulting variance estimates as a final variance estimate.
Theorem 4.
Let be an estimator of based on i.i.d samples and . Let for some , and be the variance estimator in (8). Then as .
Theorem 4 guarantees the validity of the data-splitting variance estimator. This method is very easy to use and is flexible enough for general purpose. Alternatively, we may construct a variance estimator by bootstrap. However, the consistency of a bootstrap variance estimator often requires stronger conditions (Groeneboom and Hendrickx, 2017) and is often very difficult to prove, especially under shape restrictions.
As a specific application, we apply the data-splitting estimation method to construct an estimator for the information bound or the asymptotic variance of . Denote the resulting estimator by , which is consistent to by Theorem 4. Therefore follows asymptotically , a chisquare distribution of degrees of freedom. For , let be the quantile of . A -level confidence region for can be constructed as
(9) |
And for the hypothesis , we propose to reject at the significance level if
(10) |
By Theorems 3 and 4, the confidence region (9) has an asymptotically correct coverage probability, and the test defined by the rejection region (10) has an asymptotically correct type I eror .
5 Simulations
In this section, we conduct simulations to assess the finite-sample performance of the proposed SMPLE and the proposed confidence region (9) for the linear covariate effect . To generate data, we take and to be two scalar random variables, which are iid from the standard normal distribution, and take the conditional distribution of given to be an exponential distribution with mean . Therefore the conditional hazard function of given is . We set the censoring time to follow a uniform distribution on . We consider three scenarios for : (I) , (II) and (III) . We set and consider two choices for : and , and two sample sizes: and . A larger results in a smaller censoring proportion.
When implementing our SMPLE, we set to be in Scenarios I–III, respectively. For comparison, we also consider the traditional Cox regression estimator (TCR) of and the partial likelihood estimator of with the -order polynomial splines under the partially linear additive model (Huang, 1999, SPLA-), where may be or . We generate 1000 random samples to evaluate the performance of the above five estimation methods.
5.1 Point estimation
Table 1 presents times the simulated root mean square errors (RMSEs) and the absolute biases (BIASs) of these estimators. The model assumptions of SMPLE and the SPLA- are correct in all the three scenarios, whereas the standard Cox model is correctly specified only in Scenario I. As expected, in Scenario I, TCR has uniformly the best performance among the five estimators under comparison in terms of RMSE and BIAS. Nevertheless, the SMPLE and the SPLA- estimators have almost the same RMSEs and BIASs. When the standard Cox model is misspecified in Scenarios II and III, TCR has much larger RMSEs and BIASs than SMPLE and the SPLA-, or equivalently, SMPLE and the SPLA- have clear priority over TCR. Compared with SPLA-, SMPLE is comparable and slightly inferior in Scenarios I and II, but has uniformly much smaller RMSEs and BIASs in Scenario III. A possible explanation for this phenomenon is that although continuous in all three scenarios, the hazard function is smooth in Scenarios I and II but nonsmooth in Scenario III. As the sample size or the constant increases, we have more completely observed data, consequently all estimators have improved performance when the underlying model assumption is correct. A counterexample is the performance of TCR in Scenarios II and III, where TCR has larger RMSEs and BIASs as or increases.
Figure 1 displays the boxplots of the SMPLE and SPLA- (r=2, 3, 4) estimators (minus ) of under study when the sample size is . TCR is excluded here as it has extremely large RMSEs and BIASs in Scenarios II and III. SMPLE and the three SPLA- exhibit almost the same performance in Scenario I. In Scenario II, where the true hazard function is smooth, the four methods have close variances, but from SMPLE, to SPLA-, SPLA-, and SPLA-, their BIASs become smaller and smaller. In Scenario III, where the true hazard function is nonsmooth, the four methods have close variances again, however the three SPLA- estimators have much larger BIASs than SMPLE, whose BIASs are negligible.
Scenario | SMPLE | TCR | SPLA- | SPLA- | SPLA- | ||
---|---|---|---|---|---|---|---|
I | 5 | 600 | |||||
800 | |||||||
10 | 600 | ||||||
800 | |||||||
II | 5 | 600 | |||||
800 | |||||||
10 | 600 | ||||||
800 | |||||||
III | 5 | 600 | |||||
800 | |||||||
10 | 600 | ||||||
800 |
![Refer to caption](/html/2407.06532v1/x1.png)
Scenario | SMPLE | TCR | SPLA- | SPLA- | SPLA- | ||
---|---|---|---|---|---|---|---|
I | 5 | 600 | |||||
800 | |||||||
10 | 600 | ||||||
800 |