Generalized Estimation and Information

Paul Vos (East Carolina University, vosp@ecu.edu) and Qiang Wu (East Carolina University, wuq@ecu.edu)

Abstract

This paper extends the idea of a generalized estimator for a scalar parameter (Vos, 2022) to multi-dimensional parameters both with and without nuisance parameters. The title reflects the fact that generalized estimators provide more than simply another method to find point estimators, and that the methods to assess generalized estimators differ from those for point estimators. By generalized estimation we mean the use of generalized estimators together with an extended definition of information to assess their inferential properties. We show that Fisher information provides an upper bound for the information utilized by an estimator and that the score attains this bound.

Key words: Cramér-Rao bound, Fisher information, geometry, score, slope

1 Introduction

The maximum likelihood estimator need not be efficient and, among the class of biased estimators, it need not be admissible. These issues with maximum likelihood estimation and the parameter dependence of other point estimators are addressed using generalized estimators. Generalized estimators are described by information rather than variance, and the Fisher information provides an upper bound for the information of an estimator. This bound applies to all generalized estimators; it does not require estimators to be unbiased. The score is a generalized estimator and its information equals the Fisher information.

A point estimator assigns to each value $y$ in the sample space a point in the parameter space $\Theta$. A generalized estimator $g$ assigns to each $y$ a function $g_{y}$ on $\Theta$, where $g_{y}(\theta)$ indicates the consistency of $\theta$ with $y$. The function $g_{y}$ can be thought of as a continuum of test statistics evaluated at $y$. The information of $g$ describes the average rate at which these test statistics change with $\theta$. Section 2 presents the scalar parameter case in a manner that extends naturally to multi-dimensional parameters in Section 3. Section 4 presents two examples: one to illustrate the role of information in assessing estimators and the other to illustrate how confidence intervals can be obtained from a generalized estimate.

2 One Parameter Families

As we want inferences to be unaffected by the choice of parameter, we describe the basics of inference without reference to a parameterization. Parameterizations will be introduced to describe the smooth structures of estimators.

Let $M_{\mathcal{X}}$ be a family of probability measures having common support $\mathcal{X}$. While $\mathcal{X}$ can be an abstract space, for most applications $\mathcal{X}\subset\mathbb{R}^{d}$. Points in $M_{\mathcal{X}}$ serve as models for a population whose individuals take values in $\mathcal{X}$. We consider inference for models from $M_{\mathcal{X}}$ based on a sample that is denoted by $y$ and let $\mathcal{Y}$ be the corresponding sample space. The relationship between $\mathcal{X}$ and $\mathcal{Y}$ will depend on the sampling plan, conditioning, and dimension reduction using sufficient statistics. For a simple random sample of size $n$ without conditioning and no dimension reduction, $\mathcal{Y}=\mathcal{X}^{n}$.

Let $M=M_{\mathcal{Y}}$ be the family of probability measures obtained from $M_{\mathcal{X}}$ using a sampling plan whose sample space is $\mathcal{Y}$ . For $\mathcal{Y}=\mathcal{X}^{n}$

M=\left\{m:m(y)=\prod m_{\mathcal{X}}(x_{i}),\ m_{\mathcal{X}}\in M_{\mathcal{X}}\right\}.

For the Bernoulli family of distributions, $\mathcal{X}=\left\{0,1\right\}$ ,

M_{\mathcal{X}}=\left\{m:0<m(1)<1,m(0)+m(1)=1\right\}.

For a sample of size $n$ we use the sufficient statistic $y=\sum x_{i}$ so that $\mathcal{Y}=\left\{0,1,2,\ldots,n\right\}$ and

M=\left\{m:m(y)={n\choose y}m_{\mathcal{X}}(1)^{y}m_{\mathcal{X}}(0)^{n-y},\ m_{\mathcal{X}}\in M_{\mathcal{X}}\right\}.

(1)

When $\mathcal{Y}$ is open it will be convenient to let $m$ be a probability density with respect to a dominating measure $\mu$. For $\mathcal{X}=\mathbb{R}$ and function $\phi>0$ such that $\int\phi(x)d\mu=1$ there is a location family

M_{\mathcal{X}}=\left\{m:m(x)=\phi(x-a),\ a\in\mathbb{R}\right\}.

For a simple random sample with $y=\left(x_{1},x_{2},\ldots,x_{n}\right)^{{\tt t}}$

M=\left\{m:m(y)=\prod\phi(x_{i}-a),\ a\in\mathbb{R}\right\}.

If $\phi(x)=\left(2\pi\right)^{-1/2}\exp\left(-\frac{1}{2}x^{2}\right)$ then $M_{\mathcal{X}}$ is the normal location family with unit variance. Using the sufficient statistic $y=\bar{x}=(\sum x_{i})/n\in\mathcal{Y}=\mathbb{R}$,

M=\left\{m:m\left(y\right)=\sqrt{n}\phi\left(\sqrt{n}\left(y-a\right)\right),\ a\in\mathbb{R}\right\}.

(2)

If $\phi(x)=\pi^{-1}\left(1+x^{2}\right)^{-1}$ then $M_{\mathcal{X}}$ is the Cauchy location family with unit scale factor. There is no sufficient statistic of dimension less than $n$ so we use $y=\left(x_{1},x_{2},\ldots,x_{n}\right)^{{\tt t}}$,

M=\left\{m:m(y)=\pi^{-n}\prod\left(1+\left(x_{i}-a\right)^{2}\right)^{-1},\ a\in\mathbb{R}\right\}.

(3)

For real-valued measurable function $h$ we define the expected value of $h$ at $m$ ,

E_{m}h=\int_{\mathcal{Y}}h(y)m(y)\,d\mu

when $\mathcal{Y}$ is open and $E_{m}h=\sum_{y\in\mathcal{Y}}h(y)m(y)$ when $\mathcal{Y}$ is discrete. We use the following Hilbert space

H_{M}=\left\{h:E_{m}h^{2}<\infty,\ \forall\ m\in M\right\}

which has a family of inner products indexed by $M$ ,

\langle h,h^{\prime}\rangle_{m}=E_{m}\left(hh^{\prime}\right)\ \mbox{for all }h,h^{\prime}\in H_{M}.
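For a discrete sample space the expectation and the inner products are finite sums, so they can be evaluated directly. The following is a minimal sketch for the binomial family (1); the function names `binomial_pmf`, `expect`, and `inner` and the values $n=20$, $p=0.3$ are illustrative choices, not part of the paper.

```python
from math import comb

def binomial_pmf(n, p):
    """Probabilities m(y) for y = 0, ..., n, as in the binomial family (1)."""
    return [comb(n, y) * p**y * (1 - p)**(n - y) for y in range(n + 1)]

def expect(h, pmf):
    """E_m h = sum over y of h(y) m(y) for a discrete sample space."""
    return sum(h(y) * m_y for y, m_y in enumerate(pmf))

def inner(h1, h2, pmf):
    """<h1, h2>_m = E_m(h1 h2)."""
    return expect(lambda y: h1(y) * h2(y), pmf)

n, p = 20, 0.3                      # illustrative values
m = binomial_pmf(n, p)
one = lambda y: 1.0                 # the constant function 1
centered = lambda y: y - n * p      # y minus its mean under m

print(expect(one, m))               # total mass of m
print(inner(centered, one, m))      # inner product with the constant function
print(inner(centered, centered, m)) # squared length of the centered statistic
```

The second printed value is zero up to rounding, so the centered statistic is $m$-orthogonal to the constant functions, and the third is its variance $np(1-p)$.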

When $E_{m}(hh^{\prime})=0$ the vectors $h$ and $h^{\prime}$ are $m$ -orthogonal and we write $h\perp_{m}h^{\prime}$ . At each $m\in M$ there is a copy of $H_{M}$ and this collection we denote by

H\!M=M\times H_{M}.

The copy of $H_{M}$ at $m$ with inner product $\langle\cdot,\cdot\rangle_{m}$ is $H_{m}$ which we also write as $H_{m}M$ to indicate its relationship to $H\!M$ . For inference, $H_{m}$ will be restricted to the orthogonal complement of the constant functions, $H_{m}^{\perp}=\{h\in H_{m}:E_{m}h=0\}$ , so that

H_{m}=H_{m}^{\perp}\oplus H_{m}^{0}\ \mbox{and}\ H_{m}^{\perp}\perp_{m}H_{m}^{0}.

(4)

Here $H_{m}^{0}$ is the space of constant functions. Note that $E_{m}h=\langle h,1\rangle_{m}$ and that $H_{m}^{0}$ does not depend on $m$. Since (4) holds for each $m$ we write

H\!M=H^{\perp}\!M\oplus H^{0}M\ \mbox{and}\ H^{\perp}M\perp H^{0}M

(5)

where $\perp$ indicates $\perp_{m}$ holds for $H_{m}^{\perp}M=H_{m}^{\perp}$ .

As the notation suggests, $H\negmedspace M$ is a vector bundle on $M$ with vector space $H_{M}$ . It extends the tangent bundle $T\!M$ since $T\!M\subset H^{\perp}\!M$ .

For inference regarding models in $M$ , we consider functions $g_{M}:\mathcal{Y}\times M\rightarrow\mathbb{R}$ such that

g_{M}\left(\cdot,m\right)\in H_{m}^{\perp}\mbox{ for all }m\in M.

(6)

We also want $g_{M}$ to be continuous on $M$,

g_{M}\left(y,\cdot\right)\in C(M)\mbox{ for a.e. }y\in\mathcal{Y},

(7)

so that the expectation of $g_{M}$ is a continuous function. For point estimators of a parameter, say $\theta$, the expectation of the estimator $\hat{\theta}$ is a real number. To emphasize this distinction we use the sans serif font to indicate the expectation of $g_{M}$

\mathsf{E}g_{M}\in C(M)\mbox{ while }E\hat{\theta}\in\mathbb{R}.

Expectation $\mathsf{E}$ operates on $C(M)$ -valued distributions, whereas $E$ operates on $\mathbb{R}$ -valued distributions. To be a generalized estimator, $g_{M}(y,\cdot)$ will be required to have continuous derivatives on $M$ and these will be described using parameterizations that are diffeomorphisms.

We assume $M$ is a 1-dimensional smooth manifold. While more general manifolds can be considered (e.g., Fisher’s circle model), we will only consider families that have a global parameterization

\theta:M\rightarrow\Theta\subset\mathbb{R}

(8)

and are connected so that $\Theta$ is an open interval. For $g_{M}:\mathcal{Y}\times M\rightarrow\mathbb{R}$ we define $g_{\Theta}=g_{M}\circ\theta^{-1}:\mathcal{Y}\times\Theta\rightarrow\mathbb{R}$. Unless more than one parameterization is being used, we drop the subscript and write $g$ for $g_{\Theta}$. The log likelihood function on $\Theta$ for $y$ is the function defined by

\ell=\ell(y,\cdot)=\ell_{M}(y,\cdot)\circ\theta^{-1}

where $\ell_{M}\left(y,m\right)=\log m\left(y\right)$. The score function on $\Theta$ for $y$ is

s=\nabla\ell=\partial\ell/\partial\theta.

We only consider $M$ such that $s(\cdot,\theta)\in H_{M}$ for all $\theta\in\Theta$. Because $M$ is a smooth manifold, $s\left(y,\cdot\right)\in C^{1}(\Theta)\ a.e.\ y$, and since $\mathsf{E}s=0$,

s(\cdot,\theta)\in H_{\theta}^{\perp}.

(9)

These properties of $s$ are used to define generalized estimators.

Definition 1.

A generalized estimator for scalar parameter $\theta$ is a function

g:\mathcal{Y}\times\Theta\longrightarrow\mathbb{R}

and $g_{y}=g(y,\cdot)$ is the corresponding generalized estimate at $y$ if

	(i)	$\displaystyle\ \ g\left(y,\cdot\right)\in C^{1}(\Theta)\mbox{\ a.e.}\ y$
	(ii)	$\displaystyle\ \ g\left(\cdot,\theta\right)\in H_{\theta}^{\perp}\mbox{ for all }\theta$
	(iii)	$\displaystyle\ \ \mathsf{V}\left(g\right)>0$

where $\mathsf{V}\left(g\right)=\mathsf{E}\left(g^{2}\right)\in C^{1}(\Theta)$.

f^{\perp}=f-f^{\top}\in\mathcal{G}.

(10)

where $f^{\top}=\mathsf{E}f$.

Godambe (1960) has similar criteria but allows $\mathsf{V}(g)=0$ for some $\theta\in\Theta$ and adds that $\mathsf{E}\left(\nabla g\right)^{2}>0$ so that $\mathsf{E}\left(\nabla g\right)$ can never be zero on $\Theta$. We do not need this restriction since we describe estimators in terms of information rather than variance. Allowing $\mathsf{E}\left(\nabla g\right)$ to be zero will be useful for nuisance parameters in the multi-dimension setting. Because $\mathsf{V}(g)>0$ we can define the standardization of $g$ as

\bar{g}=\frac{g}{\sqrt{\mathsf{V}(g)}}.

Since $\bar{g}(\cdot,\theta)\in H_{\theta}^{\perp}$ is a vector of unit length, $\bar{g}$ is also called the direction of $g$. Standardized estimators are the same in every parameterization. That is, for any $m^{\prime}\in M$, $\bar{g}_{\Theta}(\cdot,\theta^{\prime})=\bar{g}_{\Xi}(\cdot,\xi^{\prime})$ where $\theta^{\prime}=\theta(m^{\prime})$ and $\xi^{\prime}=\xi(m^{\prime})$.
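As a worked illustration of this invariance, consider the score of the binomial family (1) under two parameterizations: $p=m(1)$ and the log odds. The symbols $\eta$, $s_{P}$, and $s_{H}$ are notation introduced only for this sketch.

```latex
% Binomial family (1): score in the p and log-odds parameterizations.
% \eta, s_{P}, s_{H} are notation introduced for this sketch only.
\[
\begin{aligned}
\ell &= y\log p+(n-y)\log(1-p)+\mbox{const}, &\qquad \eta &= \log\frac{p}{1-p},\\
s_{P} &= \frac{\partial\ell}{\partial p}=\frac{y-np}{p(1-p)}, &\qquad \mathsf{V}(s_{P}) &= \frac{n}{p(1-p)},\\
s_{H} &= \frac{\partial\ell}{\partial\eta}=y-np, &\qquad \mathsf{V}(s_{H}) &= np(1-p),\\
\bar{s}_{P} &= \bar{s}_{H}=\frac{y-np}{\sqrt{np(1-p)}}.
\end{aligned}
\]
```

The two scores differ by the factor $p(1-p)$, which is the derivative of $p$ with respect to $\eta$, and this factor cancels on standardization.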

A non-degenerate point estimator $\hat{\theta}$ whose first two moments are smooth functions on $\Theta$ is a pre-estimator so that

\hat{\theta}-\mathsf{E}\hat{\theta}\in\mathcal{G}.

We use the sans serif notation because as a pre-estimator $\hat{\theta}(y,\cdot)$ is a function on the parameter space, the constant function taking the value of the point estimate at $y$. The estimator need not be unbiased, so that generalized estimation can be used to compare biased and unbiased point estimators as well as estimators not constrained to be constant on $\Theta$. Generalized estimators are compared in terms of their information.

Definition 2.

The information for scalar parameter $\theta$ utilized by $g$ is

\Lambda(g)=\left(\mathsf{E}\nabla\bar{g}\right)^{2}=\frac{(\mathsf{E}\nabla g)^{2}}{\mathsf{E}(g^{2})}.

(11)

where the second equality follows from the definition of $\bar{g}$ and $\mathsf{E}g=0$ .
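Definition 2 can be evaluated numerically when the sample space is finite. The sketch below does this for the binomial family (1), approximating $\nabla g$ by a central difference. The function names, the step size `eps`, the values $n=20$, $p=0.3$, and the second estimator `g_sq` (the statistic $y^{2}$ centered at its mean) are assumptions made for illustration and do not come from the paper.

```python
from math import comb

def pmf(n, p):
    """Binomial probabilities m(y), y = 0, ..., n, as in (1)."""
    return [comb(n, y) * p**y * (1 - p)**(n - y) for y in range(n + 1)]

def information(g, n, p, eps=1e-5):
    """Lambda(g) = (E grad g)^2 / E(g^2) at parameter value p."""
    m = pmf(n, p)
    # E(grad g): differentiate g(y, .) at p by central difference,
    # then average over y under the model at p.
    e_grad = sum((g(y, p + eps) - g(y, p - eps)) / (2 * eps) * m[y]
                 for y in range(n + 1))
    e_sq = sum(g(y, p) ** 2 * m[y] for y in range(n + 1))
    return e_grad**2 / e_sq

n = 20
# Score of the binomial family in the parameter p.
score = lambda y, p: (y - n * p) / (p * (1 - p))
# Hypothetical estimator: y^2 centered at E(y^2) = n p (1 - p) + (n p)^2.
g_sq = lambda y, p: y**2 - (n * p * (1 - p) + (n * p) ** 2)

p = 0.3
fisher = n / (p * (1 - p))                   # Fisher information for p
print(information(score, n, p), fisher)      # these two should agree
print(information(g_sq, n, p))               # information utilized by g_sq
```

Under these assumptions the information of the score matches the Fisher information, while the centered $y^{2}$ utilizes strictly less.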

The Fisher information for a sample of size $n$, $I_{(n)}$, and the Fisher information in a single observation, $I_{(1)}$, satisfy $I_{(n)}=nI_{(1)}$. This relationship also holds for the information utilized by an estimator

\Lambda(g_{(n)})=n\Lambda(g_{(1)}).

(12)

When considering only samples of size $n$ we use $I=I_{(n)}$ and $\Lambda(g)=\Lambda(g_{(n)})$.

Just as the score is the archetype for a generalized estimator $g$, the log likelihood function is the archetype for the scalar potential.

Definition 3.

A scalar potential of $g$ is any function $G:\mathcal{Y}\times\Theta\longrightarrow\mathbb{R}$ such that $\nabla G=g$.

While $G\not\in\mathcal{G}$ we define the information utilized by $G$ to be the information of its derivative: $\Lambda(G)=\Lambda(g)$. Information is a local property and so does not distinguish between a generalized estimator and its scalar potential. The scalar potential is useful for finding confidence regions especially when the parameterization is multidimensional.
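The archetypal pair, log likelihood and score, can be checked numerically. This is a small sketch for the binomial family (1) with the illustrative values $n=20$, $y=6$; the function names and the finite-difference step are assumptions of the sketch.

```python
from math import comb, log

n, y = 20, 6   # illustrative sample size and observation

def loglik(p):
    """Log likelihood of the binomial family (1) at p: a scalar potential G."""
    return log(comb(n, y)) + y * log(p) + (n - y) * log(1 - p)

def score(p):
    """Score in the parameter p: the generalized estimate g_y."""
    return (y - n * p) / (p * (1 - p))

eps = 1e-6
for p in (0.2, 0.3, 0.5):
    # Central-difference derivative of the potential, compared with the score.
    numeric = (loglik(p + eps) - loglik(p - eps)) / (2 * eps)
    print(p, numeric, score(p))
```

Adding any function of $y$ alone to `loglik` leaves the derivative unchanged, which is why the definition speaks of a scalar potential rather than the scalar potential.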

We assume differentiation commutes with the integral sign so for any pre-estimator $f$

\nabla\left(\mathsf{E}f\right)=\mathsf{E}\left(\nabla f\right)+\left(\nabla\mathsf{E}\right)\left(f\right)

(13)

where $\left(\nabla\mathsf{E}\right)$ is the linear operator on $H_{M}$ defined by

\left(\nabla\mathsf{E}\right)(h)=\mathsf{E}\left(\left(\nabla\ell\right)h\right).

Note that we use $f$ and $g$ for functions on $\mathcal{Y}\times\Theta$ while $h\in H_{M}$ is a function on $\mathcal{Y}$. For generalized estimator $g$, $\mathsf{E}g$ vanishes so (13) becomes, after switching left- and right-hand sides, the score equation

\mathsf{E}\left(\nabla g\right)+\mathsf{E}\left(sg\right)=0.

(14)

When $g=s$, the score equation gives the equivalent definitions of the Fisher information for $\theta$

I=-\mathsf{E}(\nabla s)=\mathsf{E}(s^{2}).
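As a worked check of these two expressions, the binomial family (1) with parameter $p=m(1)$ gives the same value either way. This is an illustrative calculation added here, using only the binomial mean $np$ and variance $np(1-p)$.

```latex
% Binomial family (1), parameter p: both forms of the Fisher information.
\[
\begin{aligned}
s &= \frac{y-np}{p(1-p)},\\
\nabla s &= -\frac{n}{p(1-p)}-\frac{(y-np)(1-2p)}{p^{2}(1-p)^{2}},\\
-\mathsf{E}(\nabla s) &= \frac{n}{p(1-p)}\quad\mbox{since }\mathsf{E}(y-np)=0,\\
\mathsf{E}(s^{2}) &= \frac{np(1-p)}{p^{2}(1-p)^{2}}=\frac{n}{p(1-p)}.
\end{aligned}
\]
```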

The information upper bound follows from the score equation (14).

Theorem 1.

The information for $\theta$ utilized by $g$ is bounded by the Fisher information

\Lambda\left(g\right)\leq I.

Furthermore, the score $s$ attains this bound and for any $g\in\mathcal{G}$

\Lambda(g)=\mathsf{V}(\mathsf{P}_{g}s)=\mathsf{R}^{2}I

where $\mathsf{P}_{g}s$ is the projection of $s$ onto the space spanned by $g$ and $\mathsf{R}=\mathsf{E}(\bar{s}\bar{g})$ is the correlation between $s$ and $g$ .

Proof.

From the score equation

\Lambda(g)=\mathsf{E}^{2}(s\bar{g}).

(15)

The second displayed equality follows upon noting that $\mathsf{E}^{2}(s\bar{g})=\mathsf{E}^{2}(\bar{s}\bar{g})I=\mathsf{R}^{2}I$ . The first equality follows by expressing the projection using basis vector $\bar{g}$

\mathsf{V}(\mathsf{P}_{g}s)=\mathsf{V}\left(\mathsf{E}(s\bar{g})\bar{g}\right)=\mathsf{E}^{2}(s\bar{g}).

∎
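The identity $\Lambda(g)=\mathsf{R}^{2}I$ and the bound $\Lambda(g)\leq I$ can be checked numerically in a finite sample space. The sketch below uses the binomial family (1) with $n=20$, $p=0.3$ and, as a hypothetical non-score estimator, the statistic $y^{2}$ centered at its mean; these choices and all variable names are assumptions made for illustration.

```python
from math import comb, sqrt

n, p = 20, 0.3                                   # illustrative values
ys = range(n + 1)
m = [comb(n, y) * p**y * (1 - p)**(n - y) for y in ys]
E = lambda h: sum(h(y) * m[y] for y in ys)       # expectation under the model at p

s = lambda y: (y - n * p) / (p * (1 - p))        # score at p
mean_y2 = n * p * (1 - p) + (n * p) ** 2         # E(y^2)
g = lambda y: y**2 - mean_y2                     # hypothetical estimator at p

I = E(lambda y: s(y) ** 2)                       # Fisher information E(s^2)
V_g = E(lambda y: g(y) ** 2)                     # V(g) = E(g^2)
R = E(lambda y: s(y) * g(y)) / sqrt(I * V_g)     # correlation between s and g

# E(grad g): g depends on p only through its centering term,
# so grad g is minus the derivative of E(y^2) with respect to p.
E_grad_g = -(n * (1 - 2 * p) + 2 * n**2 * p)
Lam = E_grad_g**2 / V_g                          # information from Definition 2

print(Lam, R**2 * I, I)
```

The first two printed values agree up to rounding and both fall below the third, as Theorem 1 states.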

Efficiency of a point estimator is defined using the ratio of its variance to the variance bound. Efficiency of a generalized estimator is defined as the ratio of its information to the information bound, $I$ .

Definition 4.

The $\Lambda$-efficiency of $g$ is

\mbox{Eff}^{\Lambda}(g)=I^{-1}\Lambda(g).

An immediate corollary of Theorem 1 is that the $\Lambda$-efficiency is the square of the correlation between the estimator and the score.

Corollary 1.

\mbox{Eff}^{\Lambda}(g)=\mathsf{V}\left(\mathsf{P}_{g}\bar{s}\right)=\mathsf{R}^{2}.

The $\Lambda$-efficiency of a point estimator $\hat{\theta}$ is the $\Lambda$-efficiency of its generalized estimator $g_{\hat{\theta}}=\hat{\theta}-\mathsf{E}\hat{\theta}$. When $\hat{\theta}$ is unbiased, $\Lambda(g_{\hat{\theta}})=\mathsf{V}^{-1}(\hat{\theta})$ so that $\Lambda$-efficiency is identical to efficiency based on variance.
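The step from Definition 2 to the reciprocal variance can be written out as follows. The bias function $b$ is notation introduced only for this sketch; setting $b\equiv 0$ gives the unbiased case.

```latex
% Information utilized by a point estimator with bias b(\theta).
% b is notation introduced for this sketch; \hat{\theta} does not depend on \theta.
\[
\begin{aligned}
\mathsf{E}\hat{\theta} &= \theta+b(\theta), \qquad
g_{\hat{\theta}} = \hat{\theta}-\theta-b(\theta),\\
\nabla g_{\hat{\theta}} &= -\left(1+b^{\prime}(\theta)\right), \qquad
\mathsf{E}\left(g_{\hat{\theta}}^{2}\right) = \mathsf{V}(\hat{\theta}),\\
\Lambda(g_{\hat{\theta}}) &= \frac{\left(1+b^{\prime}(\theta)\right)^{2}}{\mathsf{V}(\hat{\theta})},
\qquad\mbox{which equals }\mathsf{V}^{-1}(\hat{\theta})\mbox{ when }b\equiv 0.
\end{aligned}
\]
```

Combined with $\Lambda(g)\leq I$ from Theorem 1, this reads $\mathsf{V}(\hat{\theta})\geq(1+b^{\prime}(\theta))^{2}/I$, the familiar form of the Cramér-Rao bound for a biased estimator.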

Even though these efficiencies can take the same numerical value, it is incorrect to characterize the information as the reciprocal of the variance. The information at $\theta^{\prime}=\theta(m^{\prime})$, $\Lambda(g)|_{\theta=\theta^{\prime}}$, is a measure of how $g$ changes in a neighborhood of $m^{\prime}\in M$; that is, information depends on $M$. The variance at $\theta^{\prime}$, $\mathsf{V}(g)|_{\theta=\theta^{\prime}}$, depends only on $m^{\prime}$; it is the same for the countless manifolds we could choose that contain $m^{\prime}$. Another difference is that variance is defined on horizontal distributions while information is defined on vertical distributions. Horizontal and vertical distributions are described in Example 1.

Example 1.

We consider inference for the proportion of a population having a genetic variation or other specified characteristic. We let $1$ (0) indicate the characteristic is present (absent) so $\mathcal{X}=\left\{0,1\right\}$ and for a sample of size $n$ , $M$ is given by (1). Figure 1 shows the standardized score

\bar{s}=\frac{y-np}{\sqrt{np(1-p)}}

where $n=20$ and $p$ is the parameter defined by $p(m)=m(1)$ with parameter space $P=(0,1)$ . The graph of the estimate $\bar{s}_{y}$ when $y=6$ is the black curve. The estimator $\bar{s}$ is represented by the family of 21 curves, one for each $y$ in the sample space (unrealized estimates are shown in white).

Figure 1: The standardized score estimate $\bar{s}_{6}$ obtained from the sample with $y=6$ and $n=20$ for the Bernoulli manifold with the parameter $p=m(1)$ is shown by the black curve. The standardized score estimator $\bar{s}$ is represented by the family of 21 curves, one for each $y$ in the sample space (unrealized estimates are shown in white). Of the continuum of vertical slices two are shown at $p=.50$ and $p=.55$. The distribution of the point estimate $\hat{p}$ is shown by the intersection of these 21 curves with the horizontal axis. Note that for two of these curves the intersection occurs for a value outside of the parameter space.

Of the continuum of vertical slices two are shown, one at $p=.50$ and another at $p=.55$. Every vertical slice for $0<p<1$ intersects all 21 curves and, while the ordinates of these points of intersection depend on $p$, the resulting distributions all have mean zero and variance one. These vertical distributions are the same in every parameterization. For any parameter $\theta$, $\bar{s}(y,p(m^{\prime}))=\bar{s}_{\Theta}(y,\theta(m^{\prime}))$ for all $y$ and all $m^{\prime}$. In contrast, the abscissa values obtained from the intersection of these curves with the parameter axis are the same for all $p$ but the mean and variance of these horizontal distributions depend on the value of the parameter and on the choice of parameterization. The horizontal distributions describe the inferential properties in terms of the mean and variance of the roots of $s$ while the vertical distributions describe how each estimate $\bar{s}_{y}$ changes with the parameter.
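The statement that every vertical slice has mean zero and variance one can be verified directly for the two slices discussed here. This is a minimal sketch; the function name `vertical_slice` is an illustrative choice.

```python
from math import comb, sqrt

n = 20

def vertical_slice(p):
    """Ordinates of the 21 standardized score curves at p, with their probabilities under p."""
    values = [(y - n * p) / sqrt(n * p * (1 - p)) for y in range(n + 1)]
    probs = [comb(n, y) * p**y * (1 - p)**(n - y) for y in range(n + 1)]
    return values, probs

for p in (0.50, 0.55):
    v, w = vertical_slice(p)
    mean = sum(a * b for a, b in zip(v, w))
    var = sum(a * a * b for a, b in zip(v, w)) - mean**2
    print(p, mean, var)   # mean near 0 and variance near 1 for each slice
```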

When the maximum likelihood estimator exists and is unique it is, by definition, the parameter-intercept of the score, $\hat{p}=s^{-1}(0)$ . For $y=0$ and for $y=20$ , the maximum likelihood estimate does not exist since $s_{y}$ does not cross the parameter axis. Even when the point estimate does not exist, confidence regions can be constructed from the standardized score $\bar{s}$ . All 21 estimates $\bar{s}_{y}$ provide $z$ -standard deviation intervals

\mbox{CI}_{-z}(y)=\left\{p:\bar{s}_{y}(p)\geq-z\right\},\quad\mbox{CI}_{+z}(y)=\left\{p:\bar{s}_{y}(p)\leq z\right\}.

The intersections of the curve $\bar{s}_{6}$ with the white lines at $\bar{s}_{y}=\pm 2$ in Figure 1 show the endpoints of $\mbox{CI}_{-2}(6)$ and $\mbox{CI}_{+2}(6)$. Since generalized estimators are parameter invariant, these intervals correspond to subsets of the space of models $M$. The interpretation of these intervals can be stated in terms of their complement: if the true model is not in $\mbox{CI}_{-2}(y)$ or $\mbox{CI}_{+2}(y)$ then the score test statistic for the observed data $y$ is at least two standard deviations from zero. That is, for models outside these intervals the observed data $y$ would be improbable since the score is at least two standard deviations from zero. Intervals based on tail probabilities can be obtained by allowing $z$ to be a function of the parameter; for $\mbox{CI}_{+z}(6)$ the value for $z$ would be obtained using the mass assigned to the values $\{0,1,\ldots,5,6\}$.
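The finite endpoints of $\mbox{CI}_{-2}(6)$ and $\mbox{CI}_{+2}(6)$ can be located by solving $\bar{s}_{6}(p)=\mp 2$. The sketch below uses bisection, which assumes that $\bar{s}_{y}$ is decreasing in $p$ (this holds for $0<y<n$); the function names and the bracketing values are illustrative choices.

```python
from math import sqrt

def std_score(y, n, p):
    """Standardized score of the binomial family (1) in the parameter p."""
    return (y - n * p) / sqrt(n * p * (1 - p))

def solve(y, n, target, lo=1e-9, hi=1 - 1e-9, iters=100):
    """Find p with std_score(y, n, p) = target by bisection.

    Assumes the standardized score is decreasing in p, true for 0 < y < n.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2
        if std_score(y, n, mid) > target:
            lo = mid      # curve still above the target: move right
        else:
            hi = mid      # curve at or below the target: move left
    return (lo + hi) / 2

y, n, z = 6, 20, 2.0
p_upper = solve(y, n, -z)   # CI_{-z}(y) = {p : std_score >= -z} = (0, p_upper]
p_lower = solve(y, n, +z)   # CI_{+z}(y) = {p : std_score <= +z} = [p_lower, 1)
print(p_lower, p_upper)
```

The intersection of the two one-sided sets is the interval from `p_lower` to `p_upper`, whose endpoints solve $(y-np)^{2}=z^{2}np(1-p)$; this is the interval usually called the Wilson score interval.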

Figure 2 shows the log likelihood ratio statistic $S$ for $y=6$ and its distribution on the other 20 values in the sample space. The vertical slices at $p=.50$ and $p=.55$ correspond to those from Figure 1 but the circles are only plotted when the slope of the intersecting curve is negative. Each vertical slice has 6 points of intersection corresponding to samples as extreme as $y=6$ . The resulting p-value is the same as for the score. This will be true for any vertical slice so that inference from the score and the signed log likelihood ratio are identical in this example. This will not be true when the curves of the estimator $g$ intersect. Also, inference from $g$ and unsigned scalar potential function $G$ will not be identical. In particular, the score and unsigned log likelihood ratio are not identical in this example.

Example 2.

We consider the same population as before but now the variable of interest is a measured quantity and we choose $M_{\mathcal{X}}$ to be the Cauchy family so that for a random sample of size $n$, $M$ is given by (3). For comparison we also consider models from the Normal family for which the family of sampling distributions is given by (2); we use $M_{{\tt Gaus}}$ to identify this manifold. For parameterization $\theta$, the graph of a generalized estimate $\bar{g}_{y}$ for an observation $y=\left(x_{1},x_{2},\ldots,x_{n}\right)^{{\tt t}}$ is a curve over the parameter space $\Theta$. This corresponds to the black curve in the previous example. The distribution of the estimator $\bar{g}$ is more difficult to represent since there is a continuum of curves indexed by $y$. For $M_{{\tt Gaus}}$ there is also a continuum of curves but now the sufficient statistic $\bar{x}=n^{-1}\sum x_{i}$ provides a one dimensional index. Nevertheless, the properties of the vertical distributions for $M$ and $M_{{\tt Gaus}}$ still hold and confidence regions for $g_{y}$ are defined in the same way.
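For the Cauchy family (3) the standardized score estimate can be evaluated on a grid of location values, giving the analogue of the black curve in Figure 1. The sketch below is illustrative only: the sample `x` is hypothetical, the function name and grid are arbitrary choices, and the standardization uses the fact that the Fisher information of one unit-scale Cauchy observation is $1/2$.

```python
from math import sqrt

def cauchy_std_score(x, a):
    """Standardized score for the Cauchy location family (3) at location a."""
    n = len(x)
    # Score: derivative in a of -sum(log(1 + (x_i - a)^2)).
    s = sum(2 * (xi - a) / (1 + (xi - a) ** 2) for xi in x)
    # Fisher information per unit-scale Cauchy observation is 1/2, so V(s) = n/2.
    return s / sqrt(n / 2)

x = [-1.2, 0.3, 0.8, 1.1, 9.5]                 # hypothetical sample
grid = [i / 100 for i in range(-500, 1501)]    # location values from -5 to 15
curve = [cauchy_std_score(x, a) for a in grid]

# Grid points within two standard deviations of zero. Each term of the score is
# bounded and tends to 0 as |a| grows, so this set need not be a single interval.
region = [a for a, v in zip(grid, curve) if abs(v) <= 2]
print(len(region), min(region), max(region))
```

Each summand of the score lies between $-1$ and $1$, so the standardized score is bounded by $\sqrt{2n}$ in absolute value; this is one way the Cauchy curves differ from the normal case, where the score is linear in the location.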

3 Multi-parameter Families

We consider inference for a parameter $\theta=\left(\theta^{1},\theta^{2},\ldots,\theta^{k}\right)^{{\tt t}}\in\mathbb{R}^{k}$ in the presence of a $k^{\prime}$-dimensional nuisance parameter $\undertilde{\theta}=(\undertilde{\theta}^{1},\undertilde{\theta}^{2},\ldots,\undertilde{\theta}^{k^{\prime}})^{{\tt t}}$ so that $M$ is a manifold of dimension $(k+k^{\prime})$ and $\underline{\theta}^{{\tt t}}=(\theta^{{\tt t}},\undertilde{\theta}^{{\tt t}})$ is a global parameterization $\underline{\theta}:M\rightarrow\underline{\Theta}$. We use $\overline{\nabla}$, $\nabla$, and $\widetilde{\nabla}$ to indicate differentiation with respect to $\underline{\theta}$, $\theta$, and $\undertilde{\theta}$, respectively, so that

\undertilde{s}=\widetilde{\nabla}\ell=(\partial\ell/\partial\undertilde{\theta}^{1},\partial\ell/\partial\undertilde{\theta}^{2},\ldots,\partial\ell/\partial\undertilde{\theta}^{k^{\prime}})^{{\tt t}}.

Note that subscripts are used for the components of $g$ while superscripts are used for $\theta$. This convention allows us to use the Einstein summation convention for calculations involving bases. It also reminds us that the component $g_{a}$ is not a point estimate for $\theta^{a}$; if it were, we would want to use superscripts for the components of $g$. While $\theta$ and $g$ are both $k$-tuples, geometrically $\theta$ is a contra-variant (tangent) vector while $g$ is a covariant vector as its components co-vary with the change of basis.

Generalized estimators may depend on the value of the nuisance parameter but we can make them independent of the nuisance parameterization by restricting to functions that are orthogonal to $\undertilde{s}$ . For any fixed $m_{\circ}\in M$ there is a $k^{\prime}$ -dimensional submanifold through $m_{\circ}$

M|_{m_{\circ}}=\left\{m\in M:\theta(m)=\theta_{\circ}\right\}

where $\theta_{\circ}=\theta(m_{\circ})$. The tangent space of $M|_{m_{\circ}}$ at $m\in M|_{m_{\circ}}$ is

\widetilde{T}_{m}M=\mbox{span}\{\undertilde{s}(\cdot,\underline{\theta})\}|_{\underline{\theta}^{{\tt t}}=(\theta_{\circ}^{{\tt t}},\undertilde{\theta}^{{\tt t}})}.

We will require estimators to be orthogonal to $\widetilde{T}_{m}M$ and so define

H_{m}^{\bot}=\left\{h\in H_{M}:E_{m}h=0,h\perp_{m}\widetilde{T}_{m}M\right\}.

Equations (4) and (5) for the one-dimensional case become

	$\displaystyle H_{m}$	$\displaystyle=$	$\displaystyle H_{m}^{\perp}\oplus\widetilde{T}_{m}M\oplus H_{m}^{0}$		(16)
	$\displaystyle H\!M$	$\displaystyle=$	$\displaystyle H^{\perp\!}M\oplus\widetilde{T}\!M\oplus H^{0}\!M$

When $M$ is parameterized by $\underline{\theta}$, $m$ in (16) is replaced with $\underline{\theta}=\underline{\theta}(m)$.

Definition 5.

A generalized estimator for a $k$-dimensional parameter $\theta$ is a function

g:\mathcal{Y}\times\underline{\Theta}\longrightarrow\mathbb{R}^{k}

and $g_{y}=g(y,\cdot)$ is the corresponding generalized estimate at $y$ if

	$\displaystyle\mathrm{(I)}$	$\displaystyle\ \ g\left(y,\cdot\right)\in C^{1}(\underline{\Theta},\mathbb{R}^{k})\mbox{\ a.e.}\ y$
	$\displaystyle\mathrm{(II)}$	$\displaystyle\ \ g(\cdot,\underline{\theta})\in H_{\underline{\theta}}^{\bot}\mbox{ for all }\underline{\theta}\in\underline{\Theta}$
	$\displaystyle\mathrm{(III)}$	$\displaystyle\ \ \mathsf{V}\left(g\right)>0$

where $\mathsf{V}(g)=\mathsf{E}(gg^{\mathsf{t}})\in C^{1}\left(\underline{\Theta},\mathbb{R}^{k\times k}\right)$.

The space of generalized estimators for $\theta$ is $\mathcal{G}$, which we write as $\mathcal{G}_{\Theta}$ if we consider more than one parameterization. If $f_{\underline{\theta}}=f(\cdot,\underline{\theta})\in H_{\underline{\theta}}$ for all $\underline{\theta}\in\underline{\Theta}$ and $f$ satisfies conditions (I) and (III) of Definition 5 but $f_{\underline{\theta}}\not\in H_{\underline{\theta}}^{\perp}$, then $f$ is a pre-estimator. The orthogonalization of $f$ at $\underline{\theta}$ is

f_{\underline{\theta}}^{\bot}=f_{\underline{\theta}}-f_{\underline{\theta}}^{\top}\in H_{\underline{\theta}}^{\perp}

(17)

where $f_{\underline{\theta}}^{\top}=E_{\underline{\theta}}(f_{\underline{\theta}})+\widetilde{P}_{\underline{\theta}}f_{\underline{\theta}}$ and $\widetilde{P}_{\underline{\theta}}$ is the orthogonal projection onto $\widetilde{T}_{\underline{\theta}}\!M$. Since (17) holds for all $\underline{\theta}$, and expectation and orthogonal projection are smooth functions, we have

f^{\perp}=f-f^{\top}\in\mathcal{G}

where $f^{\top}=\mathsf{E}f+\widetilde{\mathsf{P}}f$ .
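The orthogonalization $f^{\perp}=f-\mathsf{E}f-\widetilde{\mathsf{P}}f$ can be illustrated numerically. The sketch below is not part of the paper's development: it replaces $\mathsf{E}$ and $\widetilde{\mathsf{P}}$ by their sample analogues (centering and a least-squares projection onto the nuisance score). The function name `orthogonalize`, the normal model with mean of interest and nuisance standard deviation, and the particular pre-estimator are all assumptions chosen for illustration.

```python
import numpy as np

def orthogonalize(f, s_nuis):
    """Sample analogue of f_perp = f - E f - P~ f.

    f      : (n, k) array, pre-estimator evaluated at n draws
    s_nuis : (n, k') array, nuisance score evaluated at the same draws
    """
    f = np.asarray(f, dtype=float)
    s_nuis = np.asarray(s_nuis, dtype=float)
    f_c = f - f.mean(axis=0)              # subtract the sample version of E f
    s_c = s_nuis - s_nuis.mean(axis=0)    # nuisance score, centered in-sample
    # Least-squares coefficients of f on the nuisance score (sample projection).
    coef, *_ = np.linalg.lstsq(s_c, f_c, rcond=None)
    return f_c - s_c @ coef

# Hypothetical setting: N(mu, sigma^2), mu of interest, sigma a nuisance parameter.
rng = np.random.default_rng(0)
mu, sigma, n = 1.0, 2.0, 5000
z = rng.normal(mu, sigma, size=n) - mu
s_nuis = ((z**2 - sigma**2) / sigma**3).reshape(-1, 1)   # d ell / d sigma per draw
f = (z + 0.5 * (z**2 - sigma**2)).reshape(-1, 1)          # an assumed pre-estimator
f_perp = orthogonalize(f, s_nuis)

mean_after = float(f_perp.mean())                  # sample version of E f_perp
cross_moment = float((f_perp * s_nuis).mean())     # sample version of E(f_perp s~)
```

Both `mean_after` and `cross_moment` are zero up to floating-point error, which mirrors the two conditions that define $H_{m}^{\bot}$: zero mean and orthogonality to the nuisance tangent space.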

The score $\nabla\ell$ is a pre-estimator, so we define $s$ to be the orthogonalized score

s=(\nabla\ell)^{\bot}\in\mathcal{G}.

The Fisher information for $\theta$ is $I=\mathsf{V}\left(\nabla\ell\right)$ and the nuisance orthogonalized Fisher information is $I^{\perp}=\mathsf{V}\left((\nabla\ell)^{\bot}\right)=\mathsf{V}\left(s\right)$; both can be functions of $\undertilde{\theta}$ but only $I^{\perp}$ is the same for all nuisance parameterizations.

The relationship between the score (information) and the orthogonalized score (orthogonalized information) expressed in the $\underline{\theta}$ parameterization is

	$\displaystyle\left(\nabla\ell\right)^{\perp}$	$\displaystyle=$	$\displaystyle\nabla\ell-I_{\nabla\widetilde{\nabla}}\undertilde{I}^{-1}\widetilde{\nabla}\ell$
	$\displaystyle I^{\perp}$	$\displaystyle=$	$\displaystyle I-I_{\nabla\widetilde{\nabla}}\undertilde{I}^{-1}I_{\widetilde{\nabla}\nabla}$

where

\underline{I}=I_{\overline{\nabla}\overline{\nabla}}=\begin{pmatrix}I&I_{\nabla\widetilde{\nabla}}\\ I_{\widetilde{\nabla}\nabla}&\undertilde{I}\end{pmatrix}

and $I=I_{\nabla\nabla}$ and $\undertilde{I}=I_{\widetilde{\nabla}\widetilde{\nabla}}$ are the Fisher informations for $\theta$ and $\undertilde{\theta}$. When $I_{\nabla\widetilde{\nabla}}$ vanishes on $\underline{\Theta}$, parameterizations $\theta$ and $\undertilde{\theta}$ are orthogonal.
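The expression for $I^{\perp}$ is the Schur complement of $\undertilde{I}$ in the full information matrix $\underline{I}$. A minimal numerical sketch follows; the function name `orthogonalized_information` and the numerical matrices are assumptions for illustration and do not come from a specific model in the paper.

```python
import numpy as np

def orthogonalized_information(I_full, k):
    """I_perp = I - I_{12} I_{22}^{-1} I_{21}, with I the leading k x k block."""
    I_full = np.asarray(I_full, dtype=float)
    I_int = I_full[:k, :k]      # I: information for the parameter of interest
    I_cross = I_full[:k, k:]    # cross block between interest and nuisance
    I_nuis = I_full[k:, k:]     # information for the nuisance parameter
    # The full matrix is symmetric, so the lower-left block is I_cross.T.
    return I_int - I_cross @ np.linalg.solve(I_nuis, I_cross.T)

# Assumed 2 x 2 full information matrix (k = 1, k' = 1).
I_full = np.array([[4.0, 2.0],
                   [2.0, 3.0]])
I_perp = orthogonalized_information(I_full, k=1)        # 4 - 2 * 2 / 3 = 8/3

# With a zero cross block the parameters are orthogonal and I_perp equals I.
I_orth = orthogonalized_information(np.diag([4.0, 3.0]), k=1)
```

In the first case the nuisance parameter reduces the information for $\theta$ from 4 to 8/3; in the orthogonal case nothing is lost.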

The definition of the scalar potential in the multi-parameter case is straightforward. As in the scalar parameter case, the log likelihood $\ell$ is the scalar potential for $s$.

Definition 6.

A scalar potential of $g\in\mathcal{G}$ is any function $G:\mathcal{Y}\times\underline{\Theta}\longrightarrow\mathbb{R}$ such that $g=(\nabla G)^{\bot}$.

The multivariate version of (13) is

\nabla\mathsf{E}(f^{{\tt t}})=\mathsf{E}\left(\nabla f^{{\tt t}}\right)+\left(\nabla\mathsf{E}\right)(f^{{\tt t}})

(18)

where

\left(\nabla\mathsf{E}\right)\left(f^{{\tt t}}\right)=\mathsf{E}\left(\left(\nabla\ell\right)f^{{\tt t}}\right).

For $g\in\mathcal{G}$ we have $\mathsf{E}(g^{{\tt t}})=0$, so the left side of (18) vanishes when $f=g$. Since $g\in H_{M}^{\perp}$ we also have $\mathsf{E}\left(\left(\nabla\ell\right)g^{{\tt t}}\right)=\mathsf{E}\left(sg^{{\tt t}}\right)$, so that the multivariate version of the score equation (14) is

\mathsf{E}\left(\nabla g^{{\tt t}}\right)+\mathsf{E}\left(sg^{{\tt t}}\right)=0.

(19)

Differentiating with respect to the nuisance parameter we obtain

\mathsf{E}(\widetilde{\nabla}g^{{\tt t}})+\mathsf{E}(\undertilde{s}g^{{\tt t}})=0

(20)

so that $g$ being nuisance orthogonal means that the average change of $g$ in the direction of the nuisance parameter is zero.

For the mean slope to be meaningful we need to use its standardized version.

Definition 7.

For $g\in\mathcal{G}$ , define

\bar{g}=\mathsf{V}^{-1/2}g

where $\mathsf{V}=\mathsf{V}\left(g\right)$ so that $\mathsf{V}\left(\bar{g}\right)$ is $I_{\mathsf{id}}$, the $k\times k$ identity matrix. Any $g$ such that $\mathsf{V}\left(g\right)=I_{\mathsf{id}}$ is called a standardized estimator.

Definition 8.

The information for $\theta$ utilized by $g$ is

	$\displaystyle\Lambda\left(g\right)$	$\displaystyle=\left(\mathsf{E}\nabla\bar{g}^{\mathsf{t}}\right)\left(\mathsf{E}\nabla\bar{g}^{\mathsf{t}}\right)^{\mathsf{t}}$
		$\displaystyle=\left(\mathsf{E}\nabla g^{\mathsf{t}}\right)\mathsf{V}^{-1}(g)\left(\mathsf{E}\nabla g^{\mathsf{t}}\right)^{\mathsf{t}}.$

The scalar information for $\theta$ utilized by $g$ is

\lambda(g)=\mathop{tr}\Lambda(g).

Note $\Lambda(g)\in C^{1}(\underline{\Theta},\mathbb{R}^{k}\times\mathbb{R}^{k})$. Using the Frobenius norm for matrix $A$, $||A||=\sqrt{\mathop{tr}(A^{{\tt t}}A)}$, we see that the scalar information is the square of the norm of $\mathsf{E}\nabla\bar{g}^{{\tt t}}$

\lambda(g)=||\mathsf{E}\nabla\bar{g}^{{\tt t}}||^{2}.
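Definition 8 can be evaluated directly once the mean slope $\mathsf{E}\nabla g^{\mathsf{t}}$ and the variance $\mathsf{V}(g)$ are available. The following sketch assumes both are supplied as $k\times k$ arrays; the function names and the numerical values are hypothetical. It also checks one algebraic consequence of the second expression in Definition 8: replacing $g$ by $Ag$ for a constant invertible matrix $A$ leaves $\Lambda$ unchanged.

```python
import numpy as np

def information(mean_slope, V):
    """Lambda(g) = D V^{-1} D^t with D = E(nabla g^t) and V = V(g)."""
    D = np.asarray(mean_slope, dtype=float)
    V = np.asarray(V, dtype=float)
    return D @ np.linalg.solve(V, D.T)

def scalar_information(mean_slope, V):
    """lambda(g) = tr Lambda(g)."""
    return float(np.trace(information(mean_slope, V)))

# Assumed inputs for illustration: a mean slope matrix D and a variance matrix V.
D = np.array([[-2.0, 0.5],
              [0.0, -1.0]])
V = np.array([[2.0, 0.3],
              [0.3, 1.0]])
Lam = information(D, V)
lam = scalar_information(D, V)

# Rescaling g -> A g changes D to D A^t and V to A V A^t; Lambda is unchanged.
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
Lam_rescaled = information(D @ A.T, A @ V @ A.T)
```

The invariance is the reason the standardized version $\bar{g}$ can be used in the definition without loss: the information depends on $g$ only through quantities that are unaffected by constant invertible rescaling.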


By replacing $\nabla$ with $\widetilde{\nabla}$ in Definition 8 we could define $\undertilde{\Lambda}(g)$, the information for $\undertilde{\theta}$. However, equation (20) shows $\undertilde{\Lambda}(g)=0$ for all $g\in\mathcal{G}$. Restricting estimators to be orthogonal to the space spanned by the nuisance parameters makes inferences independent of the choice of the nuisance parameter but also means that estimators for the parameter of interest have no information for the nuisance parameter.

Theorem 2.

For a $k$-dimensional parameter $\theta$, let $s=(\nabla\ell)^{\perp}$ and let $I^{\perp}=\mathsf{V}(s)$ be the orthogonalized Fisher information for $\theta$. For any $g\in\mathcal{G}$, $\Lambda(g)\leq I^{\perp}$ and $s$ attains this bound, $\Lambda(s)=I^{\perp}$. Furthermore,

	$\displaystyle\Lambda(g)$	$\displaystyle=$	$\displaystyle\mathsf{V}(\mathsf{P}_{g}s)=\mathsf{V}(\mathsf{P}_{g}\nabla\ell)$
		$\displaystyle=$	$\displaystyle(I^{\perp})^{1/2}\mathsf{R}\mathsf{R}^{{\tt t}}(I^{\perp})^{1/2}$

where $\mathsf{R}=\mathsf{E}(\bar{s}\bar{g}^{{\tt t}})$ is the correlation matrix between $s$ and $g$.

Proof.

The displayed equations in the Theorem are obtained from the score equation (19) which gives

\Lambda(g)=\mathsf{E}(s\bar{g}^{{\tt t}})\mathsf{E}(\bar{g}s^{{\tt t}}).

The first equation follows from the definition of the projection and its variance: $\mathsf{P}_{g}s=\mathsf{E}(s\bar{g}^{{\tt t}})\bar{g}$, so $\mathsf{V}(\mathsf{P}_{g}s)=\mathsf{E}(s\bar{g}^{{\tt t}})\mathsf{E}(\bar{g}s^{{\tt t}})$. The second equation follows because $\nabla\ell=s+(\nabla\ell)^{\top}$ and $g$ is orthogonal to $\left(\nabla\ell\right)^{\top}$. The third equation follows from $\mathsf{E}(s\bar{g}^{{\tt t}})=(I^{\bot})^{1/2}\mathsf{E}(\bar{s}\bar{g}^{{\tt t}})$ since $\mathsf{V}(s)=I^{\bot}$. The inequality $\Lambda(g)\leq I^{\bot}$ follows because the squared length of a projection cannot exceed the squared length of the original vector. Taking $g=s$ gives $\mathsf{P}_{s}s=s$, so $\Lambda(s)=\mathsf{V}(s)=I^{\bot}$. ∎

When there are no nuisance parameters Theorem 2 holds with $I^{\bot}=I$ and $s=\nabla\ell$ .
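The identities in Theorem 2 can be checked numerically from the joint second moments of $s$ and $g$. The sketch below assumes an arbitrary positive definite joint moment matrix for $(s,g)$ with $k=2$, uses the score equation (19) to replace $\mathsf{E}\nabla g^{{\tt t}}$ by $-\mathsf{E}(sg^{{\tt t}})$, and interprets $\leq$ as the positive semidefinite ordering; that reading of the inequality, the helper `sym_sqrt`, and the simulated matrix are assumptions made for illustration.

```python
import numpy as np

def sym_sqrt(S):
    """Symmetric square root of a symmetric positive definite matrix."""
    w, U = np.linalg.eigh(S)
    return (U * np.sqrt(w)) @ U.T

# Assumed joint second-moment matrix of (s, g) with k = 2, for illustration only.
rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
Sigma = B @ B.T + 0.5 * np.eye(4)      # positive definite by construction
V_s = Sigma[:2, :2]                    # V(s) = I_perp
C = Sigma[:2, 2:]                      # E(s g^t)
V_g = Sigma[2:, 2:]                    # V(g)

# Score equation (19): E(nabla g^t) = -E(s g^t), so Lambda(g) = C V(g)^{-1} C^t.
Lam = C @ np.linalg.solve(V_g, C.T)

# Correlation form: R = E(s_bar g_bar^t) = V(s)^{-1/2} C V(g)^{-1/2}.
root_s, root_g = sym_sqrt(V_s), sym_sqrt(V_g)
R = np.linalg.solve(root_s, C) @ np.linalg.inv(root_g)
Lam_corr = root_s @ R @ R.T @ root_s

# Lambda(g) <= I_perp holds when I_perp - Lambda(g) has no negative eigenvalue.
gap_eigs = np.linalg.eigvalsh(V_s - Lam)
```

Here `Lam` and `Lam_corr` agree up to rounding, and the eigenvalues in `gap_eigs` are nonnegative because $I^{\perp}-\Lambda(g)$ is the variance of the part of $s$ that is orthogonal to $g$.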

Definition 9.

The $\Lambda$-efficiency of $g$ is

\mbox{Eff}^{\Lambda}\left(g\right)=(I^{\perp})^{-1/2}\Lambda(g)(I^{\perp})^{-1/2}.

Corollary 2 follows immediately from Theorem 2.

Corollary 2.

	$\displaystyle\mbox{Eff}^{\Lambda}\left(g\right)$	$\displaystyle=$	$\displaystyle\mathsf{V}(\mathsf{P}_{g}\bar{s})$
		$\displaystyle=$	$\displaystyle\mathsf{R}\mathsf{R}^{{\tt t}}.$

4 Examples

4.1 Normal and $t$-distributions

We consider two one-dimensional manifolds: the normal family and the family of $t$ distributions with 3 degrees of freedom. Both are location families so Fisher information is the same at each distribution in the manifold. We compare three estimators: the sample mean, sample median and the mle obtained from the $t_{3}$ distribution. We did not include the score for the $t_{3}$ distribution since it is very close to the corresponding mle. The sample mean is the mle for normal data.

The sample mean attains the information bound for the normal family and the $t_{3}$ score attains the information bound for the $t_{3}$ family. We use the information of these estimators to assess the cost of model misspecification and explore the relationship between information and the tails of the distribution.

Figure 3 is based on 100,000 samples of size 10 from a normal distribution and another 100,000 samples of size 10 from the $t_{3}$ distribution. For the graph on the left, 99 quantiles, from .005 to .995, were calculated from the 100,000 sample means for the normal data. Using the empirical cdf for the 100,000 medians, these 99 quantiles gave 100 tail areas (the median was included in both tails). Each tail area $T\!A$ was converted to a $\zeta$-score that measures the distance into the tail of the distribution. For a continuous random variable $X$ define $\zeta:X\rightarrow\mathbb{R}$ by

\zeta=\left\{\begin{array}{ll}\log_{2}(2\mbox{Pr}(X\leq x))&\mbox{if }\mbox{Pr}(X\leq x)\leq 1/2\\ -\log_{2}(2\mbox{Pr}(X\geq x))&\mbox{if }\mbox{Pr}(X\leq x)>1/2\end{array}.\right.
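The $\zeta$-score is straightforward to compute from a cdf value. In the sketch below the function name `zeta_score` and its argument are assumptions; for a continuous $X$ the upper tail probability $\mbox{Pr}(X\geq x)$ is taken to be one minus the cdf value.

```python
import math

def zeta_score(cdf_value):
    """zeta-score from p = Pr(X <= x) for a continuous random variable X."""
    p = cdf_value
    if not 0.0 < p < 1.0:
        raise ValueError("cdf_value must lie strictly between 0 and 1")
    if p <= 0.5:
        return math.log2(2.0 * p)           # lower tail: zeta <= 0
    return -math.log2(2.0 * (1.0 - p))      # upper tail: Pr(X >= x) = 1 - p

# Scores at a few cdf values, including the extreme quantiles .005 and .995.
scores = [zeta_score(p) for p in (0.005, 0.25, 0.5, 0.75, 0.995)]
```

The score is zero at the median and changes by one unit each time the tail area is halved, so cdf values of .25 and .75 give scores of $-1$ and $1$.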
