Probabilistic Foundations of Econometrics, part 1

In a series of posts, I want to go into the details of the history and foundations of econometric and machine learning models. It will be a kind of online version of our joint paper with Emmanuel Flachaire and Antoine Ly, Econometrics and Machine Learning (initially written in French), which will appear soon in the journal Economics and Statistics. This is the first one…

The importance of probabilistic models in economics is rooted in Working’s (1927) questions and the attempts to answer them in Tinbergen’s two volumes (1939). The latter subsequently generated a great deal of work, as recalled by Duo (1993) in his book on the foundations of econometrics, and more particularly in the first chapter, “The Probability Foundations of Econometrics”. It should be recalled that Trygve Haavelmo was awarded the Nobel Prize in Economics in 1989 for his “clarification of the foundations of the probabilistic theory of econometrics”. As Haavelmo (1944) showed (initiating a profound change in econometric theory in the 1930s, as recalled in Morgan’s Chapter 8 (1990)), econometrics is fundamentally based on a probabilistic model, for two main reasons. First, the use of statistical quantities (or “measures”) such as means, standard errors and correlation coefficients for inferential purposes can only be justified if the process generating the data can be expressed in terms of a probabilistic model. Second, the probability approach is relatively general, and is particularly well suited to the analysis of “dependent” and “non-homogeneous” observations, as are often found in economic data.

We will then assume that there is a probability space $(\Omega,\mathcal{F},\mathbb{P})$ such that observations $(y_i,\mathbf{x}_i)$ are seen as realizations of random variables $(Y_i, \mathbf{X}_i)$. In practice, however, we are not very interested in the joint law of the pair $(Y, \mathbf{X})$: the law of $\mathbf{X}$ is unknown, and it is the law of $Y$ conditional on $\mathbf{X}$ that we will be interested in. In the following, we will denote by $x$ a single observation, $\mathbf{x}$ a vector of observations, $X$ a random variable, and $\mathbf{X}$ a random vector. By abuse of notation, $\mathbf{X}$ may also designate the matrix of individual observations (denoted $\mathbf{x}_i$), depending on the context.

Foundations of mathematical statistics

As recalled in Vapnik’s (1998) introduction, inference in parametric statistics is based on the following belief: the statistician knows the problem to be analyzed well; in particular, he knows the physical law that generates the stochastic properties of the data, and the function to be found is written via a finite number of parameters[1]. To find these parameters, the maximum likelihood method is used. The purpose of the theory is to justify this approach (by discovering and describing its favorable properties). We will see that in learning, the philosophy is very different, since we have no reliable a priori information on the statistical law underlying the problem, nor even on the function we would like to approximate (we will then propose methods to construct an approximation from the data at our disposal, as in Vapnik (1998)). A “golden age” of parametric inference, from 1930 to 1960, laid the foundations of mathematical statistics, which can be found in all statistical textbooks, even today. As Vapnik (1998) states, the classical parametric paradigm is based on the following three beliefs:

To find a functional relationship from the data, the statistician is able to define a set of functions, linear in their parameters, that contain a good approximation of the desired function. The number of parameters describing this set is small.
The statistical law underlying the stochastic component of most real-life problems is the normal law. This belief has been supported by reference to the central limit theorem, which states that, under fairly general conditions, the (suitably normalized) sum of a large number of random variables is approximated by the normal law.
The maximum likelihood method is a good tool for estimating parameters.

In this section we will come back to the construction of the econometric paradigm, directly inspired by that of classical inferential statistics.

Conditional laws and likelihood

Linear econometrics has been constructed under the assumption of individual data, which amounts to assuming independent variables $(Y_i, \mathbf{X}_i)$ (it is also possible to imagine temporal observations, in which case we would have a process $(Y_t, \mathbf{X}_t)$, but we will not discuss time series here). More precisely, we will assume that, conditionally on the explanatory variables $\mathbf{X}_i$, the variables $Y_i$ are independent. We will also assume that these conditional laws remain in the same parametric family, but that the parameter is a function of $\mathbf{x}$. In the Gaussian linear model it is assumed that: $(Y\vert \mathbf{X}=\mathbf{x})\overset{\mathcal{L}}{\sim}\mathcal{N}(\mu(\mathbf{x}),\sigma^2)~~~~ (1)$ where $\mu(\mathbf{x})=\beta_0+\mathbf{x}^T\mathbf{\beta}$ and $\mathbf{\beta}\in\mathbb{R}^{p}$.

It is usually called a ‘linear’ model since $\mathbb{E}[Y\vert \mathbf{X}=\mathbf{x}]=\beta_0+\mathbf{x}^T\mathbf{\beta}$ is a linear combination of the covariates[2]. It is said to be a homoscedastic model if $Var[Y|\mathbf{X}=\mathbf{x}]=\sigma^2$, where $\sigma^2$ is a positive constant. To estimate the parameters, the traditional approach is to use the maximum likelihood estimator, as initially suggested by Ronald Fisher. In the case of the Gaussian linear model, the log-likelihood is written: $\log\mathcal{L}(\beta_0, \mathbf{\beta},\sigma^2\vert \mathbf{y},\mathbf{x}) = -\frac{n}{2}\log[2\pi\sigma^2] - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i-\beta_0-\mathbf{x}_i^T\mathbf{\beta})^2$ Note that the term on the right, measuring a distance between the data and the model, will be interpreted as the deviance in generalized linear models. Then we will set: $(\widehat{\beta}_0,\widehat{\mathbf{\beta}},\widehat{\sigma}^2)=\text{argmax}\left\lbrace\log\mathcal{L}(\beta_0, \mathbf{\beta},\sigma^2\vert \mathbf{y},\mathbf{x})\right\rbrace$ Since $\beta_0$ and $\mathbf{\beta}$ enter the log-likelihood only through the sum of squares, the maximum likelihood estimator of these coefficients is obtained by minimizing the sum of squared errors (the so-called “least squares” estimator), which we will find again in the “machine learning” approach.
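As a minimal numerical sketch of this equivalence, the snippet below evaluates the Gaussian log-likelihood written above on simulated data and compares its value at the least squares solution with its value at perturbed parameters. The simulated design, the parameter values and the helper name `gaussian_loglik` are assumptions made for illustration only.

```python
import numpy as np

def gaussian_loglik(beta0, beta, sigma2, y, X):
    """Log-likelihood of the Gaussian linear model for given parameters."""
    n = y.shape[0]
    resid = y - beta0 - X @ beta
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum(resid**2) / (2 * sigma2)

# Simulated data (all numerical values are illustrative assumptions).
rng = np.random.default_rng(0)
n, p = 200, 2
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([2.0, -0.5]) + rng.normal(scale=0.7, size=n)

# Least squares on the design matrix augmented with a column of ones.
X1 = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
beta0_hat, beta_hat = coef[0], coef[1:]

# The maximum likelihood estimate of sigma^2 is the mean squared residual.
sigma2_hat = np.mean((y - X1 @ coef) ** 2)

ll_hat = gaussian_loglik(beta0_hat, beta_hat, sigma2_hat, y, X)
ll_shifted = gaussian_loglik(beta0_hat + 0.1, beta_hat, sigma2_hat, y, X)
ll_inflated = gaussian_loglik(beta0_hat, beta_hat, 1.2 * sigma2_hat, y, X)
print(ll_hat, ll_shifted, ll_inflated)
```

Moving either the intercept or the variance away from these estimates lowers the log-likelihood, which is what the argmax above expresses.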

The first order conditions yield the normal equations, which in matrix form read $\mathbf{X}^T[\mathbf{y}-\mathbf{X}\mathbf{\beta}]=\mathbf{0}$, which can also be written $(\mathbf{X}^T \mathbf{X})\mathbf{\beta}=\mathbf{X}^T \mathbf{y}$. If $\mathbf{X}$ is a full (column) rank matrix, then we find the classical estimator: $\widehat{\mathbf{\beta}}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}=\mathbf{\beta}+(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{\varepsilon}~~~(2)$ using the error-term notation (as is common in econometrics), $y=\mathbf{x}^T\mathbf{\beta}+\varepsilon$. The Gauss-Markov theorem ensures that this estimator is the unbiased linear estimator with minimum variance. It can then be shown that $\widehat{\mathbf{\beta}}\sim\mathcal{N}(\mathbf{\beta},\sigma^2(\mathbf{X}^T\mathbf{X})^{-1})$, and in particular, if we simply need the first two moments: $\mathbb{E}[\widehat{\mathbf{\beta}}]=\mathbf{\beta}~~~Var[\widehat{\mathbf{\beta}}]=\sigma^2 [\mathbf{X}^T\mathbf{X}]^{-1}$ In fact, the normality hypothesis makes it possible to make a link with mathematical statistics, but it is possible to construct the estimator given by equation (2) without that Gaussian assumption. Hence, if we assume that $Y|\mathbf{X}$ has the same distribution as $\mathbf{x}^T\mathbf{\beta}+\varepsilon$, where $\mathbb{E}[\varepsilon]=0$, $Var[\varepsilon]=\sigma^2$ and $Cov[X_j,\varepsilon]=0$ for all $j$, then $\widehat{\mathbf{\beta}}$ is an unbiased estimator of $\mathbf{\beta}$ with smallest variance[3] among unbiased linear estimators. Furthermore, even if we cannot get normality in finite samples, asymptotically this estimator is Gaussian, with $\sqrt{n}(\widehat{\mathbf{\beta}}-\mathbf{\beta})\overset{\mathcal{L}}{\rightarrow}\mathcal{N}(\mathbf{0},\mathbf{\Sigma})$ as $n\rightarrow\infty$, for some matrix $\mathbf{\Sigma}$.
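
The normal equations can be checked numerically. The sketch below, on simulated data with assumed parameter values, solves $(\mathbf{X}^T \mathbf{X})\mathbf{\beta}=\mathbf{X}^T \mathbf{y}$, verifies that the residuals are orthogonal to the columns of $\mathbf{X}$, and verifies that the estimator equals $\mathbf{\beta}$ plus a term that is linear in $\mathbf{\varepsilon}$. Here the intercept is included as a column of ones in $\mathbf{X}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
beta = np.array([1.0, 2.0, -0.5])   # (beta_0, beta_1, beta_2): assumed values
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
eps = rng.normal(scale=0.7, size=n)
y = X @ beta + eps

# Solve the normal equations (X^T X) beta = X^T y; X has full column rank here.
XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)

# First-order condition: residuals are orthogonal to the columns of X.
foc = X.T @ (y - X @ beta_hat)

# Error decomposition: beta_hat = beta + (X^T X)^{-1} X^T eps.
beta_decomp = beta + np.linalg.solve(XtX, X.T @ eps)

# Estimated variance matrix sigma^2 (X^T X)^{-1}, with sigma^2 replaced by
# the usual unbiased estimate RSS / (n - number of columns).
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - X.shape[1])
var_beta_hat = sigma2_hat * np.linalg.inv(XtX)
print(beta_hat, np.sqrt(np.diag(var_beta_hat)))
```

Solving the linear system with `np.linalg.solve` rather than inverting $\mathbf{X}^T\mathbf{X}$ explicitly is a numerical design choice; the explicit inverse is only formed for the variance matrix.
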
The condition of having a full rank matrix $\mathbf{X}$ can be (numerically) restrictive in high dimensions. If it is not satisfied, $(\mathbf{X}^T \mathbf{X})^{-1}\mathbf{X}^T$ does not exist. If $\mathbb{I}$ denotes the identity matrix, however, it should be noted that $(\mathbf{X}^T \mathbf{X}+\lambda\mathbb{I})^{-1}\mathbf{X}^T$ still exists, whatever $\lambda>0$. The associated estimator is called the ridge estimator of level $\lambda$ (introduced in the 1960s by Hoerl (1962), and associated with a regularization studied by Tikhonov (1963)). This estimator naturally appears in a Bayesian econometric context.
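
To illustrate the rank condition, here is a small sketch in which one column of $\mathbf{X}$ is duplicated, so that $\mathbf{X}^T\mathbf{X}$ is singular while $\mathbf{X}^T \mathbf{X}+\lambda\mathbb{I}$ remains invertible. The data, the value of $\lambda$ and the helper name `ridge` are illustrative assumptions; for simplicity the intercept is penalized like the other coefficients, which is not what is usually done in practice.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x1 = rng.normal(size=n)
# The third column duplicates the second, so X is not of full column rank.
X = np.column_stack([np.ones(n), x1, x1])
y = 1.0 + 2.0 * x1 + rng.normal(scale=0.5, size=n)

rank = np.linalg.matrix_rank(X)   # below the number of columns
print(rank, X.shape[1])

def ridge(X, y, lam):
    """Ridge estimator (X^T X + lam I)^{-1} X^T y, for lam > 0.

    Simplification: the intercept column is penalized too.
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

lam = 1.0
beta_ridge = ridge(X, y, lam)

# Eigenvalues of X^T X + lam I are bounded below by lam, hence invertibility.
min_eig = np.linalg.eigvalsh(X.T @ X + lam * np.eye(X.shape[1])).min()
print(beta_ridge, min_eig)
```

With two identical columns, the ridge solution splits the effect equally between them, whereas the unpenalized normal equations have infinitely many solutions.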

Residuals

It is not uncommon to introduce the linear model from the distribution of the residuals, as we mentioned earlier. Equation (1) is thus often written as: $y_i=\beta_0+\mathbf{x}_i^T\mathbf{\beta}+\varepsilon_i~~~~(3)$ where the $\varepsilon_i$’s are realizations of independent and identically distributed (i.i.d.) random variables from some $\mathcal{N}(0,\sigma^2)$ distribution. With a vector notation, we will write $\mathbf{\varepsilon}\overset{\mathcal{L}}{\sim}\mathcal{N}(\mathbf{0},\sigma^2\mathbb{I})$. The estimated residuals are defined as: $\widehat{\varepsilon}_i =y_i-[\widehat{\beta}_0+\mathbf{x}_i^T\widehat{\mathbf{\beta}}]$ Those (estimated) residuals are basic tools for diagnosing the relevance of the model.

An extension of the model described by equation (1) has been proposed to take into account possible heteroscedasticity: $(Y\vert \mathbf{X}=\mathbf{x})\overset{\mathcal{L}}{\sim}\mathcal{N}(\mu(\mathbf{x}),\sigma^2(\mathbf{x}))$ where $\sigma^2(\mathbf{x})$ is a positive function of the explanatory variables. This model can be rewritten as: $y_i=\beta_0+\mathbf{x}_i^T\mathbf{\beta}+\sigma(\mathbf{x}_i)\cdot\varepsilon_i$ where residuals are still i.i.d., with unit variance, $\varepsilon_i=\frac{y_i-[\beta_0+\mathbf{x}_i^T\mathbf{\beta}]}{\sigma(\mathbf{x}_i)}$ While equations based on an error term are popular in linear econometrics (when the dependent variable is continuous), they are much less common in count models or logistic regression.
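A short simulation may help to fix ideas. In the sketch below, $\sigma(\mathbf{x})$ denotes the conditional standard deviation (so that $\sigma^2(\mathbf{x})$ is the conditional variance), and its functional form, like the parameter values, is a purely hypothetical choice. Dividing the raw errors by $\sigma(\mathbf{x}_i)$ recovers errors with unit variance.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
x = rng.uniform(1.0, 3.0, size=n)

def sigma(x):
    """Hypothetical conditional standard deviation, increasing in x."""
    return 0.5 * x

beta0, beta1 = 1.0, 2.0                   # assumed true parameters
eps = rng.normal(size=n)                  # i.i.d. errors with unit variance
y = beta0 + beta1 * x + sigma(x) * eps    # noise scaled by sigma(x), not sigma^2(x)

raw = y - (beta0 + beta1 * x)             # heteroscedastic errors
standardized = raw / sigma(x)             # recovers eps

# The spread of raw errors grows with x; standardized ones have a stable spread.
low, high = x < 2.0, x >= 2.0
print(raw[low].std(), raw[high].std())
print(standardized[low].std(), standardized[high].std())
```

In practice the true parameters are unknown, so the same standardization would be applied to estimated residuals with an estimated $\sigma(\cdot)$.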

However, writing the model with an error term (as in equation (3)) raises many questions about the representation of an economic relationship between two quantities. For example, it can be assumed that there is a relationship (linear to begin with) between the quantity of a traded good, $q$, and its price $p$. This allows us to imagine a supply equation $q_i=\beta_0+\beta_1 p_i+u_i$ ($u_i$ being an error term) where the quantity sold depends on the price, but in an equally legitimate way, one can imagine that the price depends on the quantity produced (what one could call a demand equation), $p_i=\alpha_0+\alpha_1 q_i+v_i$ ($v_i$ denoting another error term). Historically, the error term in equation (3) could be interpreted as an idiosyncratic error on the variable $y$, the so-called explanatory variables being assumed to be fixed. But this interpretation often complicates the link between an economic relationship and an economic model: economic theory speaks abstractly about a relationship between magnitudes, while the econometric model imposes a specific shape (which magnitude is $y$ and which is $x$), as shown in more detail in Morgan (1990), Chapter 7.
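
The asymmetry between the two equations can be seen numerically. The sketch below, on hypothetical simulated prices and quantities, fits both regressions by least squares: the slope obtained by regressing $p$ on $q$ is not the reciprocal of the slope obtained by regressing $q$ on $p$, and the product of the two slopes equals the squared correlation. The data-generating values and the helper name `slope` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000
# Hypothetical price and quantity data (illustration only).
p = rng.normal(loc=10.0, scale=1.0, size=n)
q = 5.0 + 0.8 * p + rng.normal(scale=1.0, size=n)

def slope(x, y):
    """Least squares slope of y regressed on x (with an intercept)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

b1 = slope(p, q)   # q_i = b0 + b1 p_i + u_i
a1 = slope(q, p)   # p_i = a0 + a1 q_i + v_i
r = np.corrcoef(p, q)[0, 1]

# Inverting the first fitted line gives slope 1 / b1, which differs from a1
# unless the fit is perfect: the product b1 * a1 equals r^2.
print(b1, 1.0 / b1, a1, r**2)
```

Choosing which variable carries the error term is therefore not innocuous: the two fitted lines coincide only when the correlation is perfect.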

(references mentioned above are online here). To be continued…

[1] This approach can be compared to structural econometrics, as presented for example in Keane (2010).

[2] Here, we will try to distinguish $\beta_0$, the intercept, from the other parameters $\mathbf{\beta}$, since they are treated differently in many extensions (e.g. regularization). Nevertheless, in many general formulas $\mathbf{\beta}$ will denote the joint vector $(\beta_0, \mathbf{\beta})$, to keep the notation light.

[3] In the sense that the difference between variance matrices is a positive semidefinite matrix.