
I have seen some questions here about what it means in layman terms, but these are too layman for my purpose here. I am trying to mathematically understand what the AIC score means.

But at the same time, I don't want a rigorous proof that would make me lose sight of the more important points. For example, if this were calculus, I would be happy with infinitesimals, and if this were probability theory, I would be happy without measure theory.

My attempt

By reading here, and with some notational sugar of my own, $\text{AIC}_{m,D}$ is the AIC score of model $m$ on dataset $D$, defined as follows: $$ \text{AIC}_{m,D} = 2k_m - 2 \ln(L_{m,D}) $$ where $k_m$ is the number of parameters of model $m$, and $L_{m,D}$ is the maximum likelihood function value of model $m$ on dataset $D$.
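To make the definition concrete, here is a minimal sketch (my own, in Python with NumPy/SciPy; the helper name `aic` and the simulated dataset are made up) that computes the score for a Gaussian model fitted by maximum likelihood:

```python
import numpy as np
from scipy import stats

def aic(log_likelihood, k):
    """AIC = 2k - 2*ln(L), with the likelihood evaluated at the MLE."""
    return 2 * k - 2 * log_likelihood

# Hypothetical dataset D and a Gaussian model m fitted by maximum likelihood.
rng = np.random.default_rng(0)
D = rng.normal(loc=3.0, scale=2.0, size=200)

mu_hat, sigma_hat = stats.norm.fit(D)                  # MLE of (mu, sigma)
log_L = stats.norm.logpdf(D, mu_hat, sigma_hat).sum()  # ln(L_{m,D})
k = 2                                                  # number of fitted parameters

print(aic(log_L, k))
```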

Here is my understanding of what the above implies:

$$ m = \underset{\theta}{\text{arg max}\,} \Pr(D|\theta) $$

This way:

  • $k_m$ is the number of parameters of $m$.
  • $L_{m,D} = \Pr(D|m) = \mathcal{L}(m|D)$.

Let's now rewrite AIC: $$\begin{split} \text{AIC}_{m,D} =& 2k_m - 2 \ln(L_{m,D})\\ =& 2k_m - 2 \ln(\Pr(D|m))\\ =& 2k_m - 2 \log_e(\Pr(D|m))\\ \end{split}$$

Obviously, $\Pr(D|m)$ is the probability of observing dataset $D$ under model $m$. So the better the model $m$ fits the dataset $D$, the larger $\Pr(D|m)$ becomes, and thus the smaller the term $-2\log_e(\Pr(D|m))$ becomes.

So clearly AIC rewards models that fit their datasets (because smaller $\text{AIC}_{m,D}$ is better).

On the other hand, the term $2k_m$ clearly punishes models with more parameters by making $\text{AIC}_{m,D}$ larger.

In other words, AIC seems to be a measure that:

  • Rewards accurate models (those that fit $D$ better) logarithmically. E.g. it rewards an increase in fitness from $0.4$ to $0.5$ more than it rewards an increase in fitness from $0.8$ to $0.9$. This is shown in the figure and the short numerical check below.
  • Rewards reduction in parameters linearly. So a decrease in parameters from $9$ down to $8$ is rewarded as much as a decrease from $2$ down to $1$.

[Figure: plot of $-2\ln(\Pr(D|m))$ against $\Pr(D|m)$, illustrating the diminishing reward for improvements in fit.]
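As a quick numeric check of the logarithmic reward (my own sketch; it just evaluates $-2\ln(p)$ at the values mentioned above):

```python
import numpy as np

# Improvement in -2*ln(p) for the two jumps in fitness mentioned above.
for p_lo, p_hi in [(0.4, 0.5), (0.8, 0.9)]:
    reward = -2 * np.log(p_lo) + 2 * np.log(p_hi)
    print(f"{p_lo} -> {p_hi}: AIC drops by {reward:.3f}")
# 0.4 -> 0.5: AIC drops by 0.446
# 0.8 -> 0.9: AIC drops by 0.236
```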

In other words (again), AIC defines a trade-off between the importance of simplicity and the importance of fitness.

In other words (again), AIC seems to suggest that:

  • The marginal importance of fitness diminishes.
  • But the importance of simplicity never diminishes; it is always equally important.

Q1: But a question is: why should we care about this specific fitness-simplicity trade-off?

Q2: Why $2k$ and why $2 \log_e(\ldots)$? Why not just: $$\begin{split} \text{AIC}_{m,D} =& 2k_m - 2 \ln(L_{m,D})\\ =& 2(k_m - \ln(L_{m,D}))\\ \frac{\text{AIC}_{m,D}}{2} =& k_m - \ln(L_{m,D})\\ \text{AIC}_{m,D,\text{SIMPLE}} =& k_m - \ln(L_{m,D})\\ \end{split}$$ i.e. $\text{AIC}_{m,D,\text{SIMPLE}}$ should in my view be equally useful to $\text{AIC}_{m,D}$ and should be able to serve for relatively comparing different models (it's just not scaled by $2$; do we need this?).
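For concreteness, a tiny check (my own sketch; the $(k, \ln L)$ pairs are made up) that the halved criterion ranks candidate models identically:

```python
import numpy as np

# made-up (k, ln L) pairs for three candidate models
models = [(1, -120.3), (3, -112.8), (5, -111.9)]

aic        = [2 * k - 2 * lnL for k, lnL in models]
aic_simple = [k - lnL for k, lnL in models]

# the halved criterion is a monotone rescaling, so both pick the same model
print(int(np.argmin(aic)) == int(np.argmin(aic_simple)))  # True
```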

Q3: How does this relate to information theory? Could someone derive this from an information theoretical start?

  • What does your notation in $m=\arg \max_\theta \Pr(D|\theta)$ mean? Are you implying something about model choice there? What you had above does not really imply that AIC requires you to choose a model. Q2, as you say, is something pretty arbitrary in some sense, but comes from making AIC an estimate for the Kullback-Leibler divergence, which also relates to the answer for Q1 and gives some meaning to quantities like $\exp((\text{AIC}_m-\min(\text{AIC}_1,\ldots,\text{AIC}_M))/2)$.
    – Björn
    Commented Jun 1, 2016 at 5:49
  • $\text{arg max}_{\theta} \Pr(D|\theta)$ means keep looking through many $\theta$s until you find one that maximizes the probability $\Pr(D|\theta)$. Each $\theta$ is a tuple/vector of parameters that defines our model that tries to explain dataset $D$. So essentially it says: we have dataset $D$, what is the probability that it was generated by a model parametrized by $\theta$? Our model $m$ is essentially the $\theta$ that solves this maximization problem.
    – caveman
    Commented Jun 1, 2016 at 6:00
  • Sorry, but are you looking across multiple models (since you write $m=\ldots$), or are you talking about the maximum likelihood estimate $\hat{\theta} := \arg\max_\theta P_\text{given model}(D|\theta)$? Also note $P_\text{given model}(D|\theta)$ is the probability of the data having arisen under the given model and for the given parameters, not the probability that the data was generated by that model parameterized by $\theta$.
    – Björn
    Commented Jun 1, 2016 at 6:55
  • MLE is what I mean. But I'm just trying to say that the parameter tuple $\theta$ is so comprehensive that it also defines the model. Also I can have multiple models, say $m_1,m_2$, each with a different AIC score $\text{AIC}_1, \text{AIC}_2$. I am just making this notation up because I think it's simpler. Am I being terribly wrong, or unnecessarily confusing things? (And thank you for correcting me on what the MLE means.)
    – caveman
    Commented Jun 1, 2016 at 8:38
  • A derivation of AIC as an approximation to expected K-L information loss is given in Pawitan (2001), In All Likelihood, Ch 13.
    Commented Jun 21, 2016 at 9:41

3 Answers


This question by caveman is popular, but there were no attempted answers for months until my controversial one. It may be that the actual answer below is not, in itself, controversial, merely that the questions are "loaded" questions, because the field seems (to me, at least) to be populated by acolytes of AIC and BIC who would rather use OLS than each others' methods. Please look at all the assumptions listed, and restrictions placed on data types and methods of analysis, and please comment on them; fix this, contribute. Thus far, some very smart people have contributed, so slow progress is being made. I acknowledge contributions by Richard Hardy and GeoMatt22, kind words from Antoni Parellada, and valiant attempts by Cagdas Ozgenc and Ben Ogorek to relate K-L divergence to an actual divergence.

Before we begin, let us review what AIC is. One source for this is Prerequisites for AIC model comparison and another is from Rob J Hyndman. Specifically, AIC is calculated as

$$2k - 2 \log(L(\theta))\,,$$

where $k$ is the number of parameters in the model and $L(\theta)$ is the likelihood function. AIC compares the trade-off between variance ($2k$) and bias ($2\log(L(\theta))$) arising from modelling assumptions. From Facts and fallacies of the AIC, point 3: "The AIC does not assume the residuals are Gaussian. It is just that the Gaussian likelihood is most frequently used. But if you want to use some other distribution, go ahead." The AIC is a penalized likelihood, whichever likelihood you choose to use. For example, to compute AIC for Student's-t distributed residuals, we could use the maximum-likelihood solution for Student's-t. The log-likelihood usually applied for AIC is derived from the Gaussian log-likelihood and is given by

$$ \log(L(\theta)) =-\frac{|D|}{2}\log(2\pi) -\frac{1}{2} \log(|K|) -\frac{1}{2}(x-\mu)^T K^{-1} (x-\mu), $$

$K$ being the covariance structure of the model, $|D|$ the sample size (the number of observations in the dataset), $\mu$ the mean response, and $x$ the dependent variable. Note that, strictly speaking, it is unnecessary for AIC to correct for the sample size, because AIC is not used to compare datasets, only models fitted to the same dataset. Thus we do not have to investigate whether the sample-size correction is done correctly, but we would have to worry about this if we could somehow generalize AIC to be useful between datasets. Similarly, much is made of requiring $|D| \gg k > 2$ to ensure asymptotic efficiency. A minimalist view might consider AIC to be just an "index," making $|D| > k$ relevant and $|D| \gg k$ irrelevant. However, some attention has been given to this in the form of an altered AIC, called AIC$_c$, proposed for $|D|$ not much larger than $k$; see the second paragraph of the answer to Q2 below. This proliferation of "measures" only reinforces the notion that AIC is an index. However, caution is advised when using the "i" word, as some AIC advocates equate use of the word "index" with the same fondness as might be attached to referring to their ontogeny as extramarital.
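As a sanity check of the log-likelihood formula above, here is a small sketch (mine; the toy values of $x$, $\mu$, and $K$ are placeholders) evaluating it directly and cross-checking against SciPy's multivariate normal density:

```python
import numpy as np
from scipy.stats import multivariate_normal

n = 5                                      # |D|, the sample size
x = np.array([1.2, 0.7, 1.9, 0.3, 1.1])    # dependent variable (toy values)
mu = np.full(n, 1.0)                       # mean response
K = 0.5 * np.eye(n) + 0.1                  # covariance structure of the model

# direct evaluation of the log-likelihood formula above
log_L = (-0.5 * n * np.log(2 * np.pi)
         - 0.5 * np.log(np.linalg.det(K))
         - 0.5 * (x - mu) @ np.linalg.solve(K, x - mu))

# cross-check against SciPy's multivariate normal density
assert np.isclose(log_L, multivariate_normal(mean=mu, cov=K).logpdf(x))
print(log_L)
```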

Q1: But a question is: why should we care about this specific fitness-simplicity trade-off?

Answer in two parts. First, the specific question. You should only care because that was the way it was defined. If you prefer, there is no reason not to define a CIC, a caveman information criterion; it will not be AIC, but CIC would produce the same model rankings as AIC, since the choice of multiplier does not affect the trade-off between goodness of fit and positing simplicity. Any constant that could have been used as an AIC multiplier, including one, would have had to be chosen and adhered to, as there is no reference standard to enforce an absolute scale. However, adhering to a standard definition is not arbitrary in the sense that there is room for one and only one definition, or "convention," for a quantity, like AIC, that is defined only on a relative scale. Also see AIC assumption #3, below.

The second answer to this question pertains to the specifics of the AIC trade-off between goodness of fit and positing simplicity, irrespective of how its constant multiplier was chosen. That is, what actually affects the "trade-off"? One thing that affects it is a degrees-of-freedom adjustment for the number of parameters in a model, which led to the definition of a "new" AIC, called AIC$_c$, as follows:

$$\begin{align}AIC_c &= AIC + \frac{2k(k + 1)}{n - k - 1}\\ &= \frac{2kn}{n-k-1} - 2 \ln{(L)}\end{align} \,,$$

where $n$ is the sample size. Since the weighting is now slightly different when comparing models having different numbers of parameters, AIC$_c$ selects models differently than AIC itself, and identically to AIC when the two models are different but have the same number of parameters. Other methods will also select models differently; for example, "The BIC [sic, Bayesian information criterion] generally penalizes free parameters more strongly than the Akaike information criterion, though it depends..." ANOVA would also penalize supernumerary parameters, using partial probabilities of the indispensability of parameter values, and would do so differently; in some circumstances it would be preferable to AIC. In general, any method of assessing the appropriateness of a model has its advantages and disadvantages. My advice would be to test the performance of any model selection method for its application to the data regression methodology more vigorously than testing the models themselves. Any reason to doubt? Yup, care should be taken when constructing or selecting any model test, to select methods that are methodologically appropriate. AIC is useful for a subset of model evaluations; for that see Q3, next. For example, extracting information with model A may be best performed with regression method 1, and for model B with regression method 2, where model B and method 2 sometimes yield non-physical answers, where neither regression method is MLR, where the residuals are a multi-period waveform with two distinct frequencies for either model, and where the reviewer asks "Why don't you calculate AIC?"
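A minimal sketch of the AIC$_c$ correction above (mine; the plugged-in numbers are arbitrary), showing that it matters for small $n$ and vanishes for large $n$:

```python
def aicc(aic, k, n):
    """Small-sample corrected AIC; requires n > k + 1."""
    return aic + 2 * k * (k + 1) / (n - k - 1)

# the correction matters when n is small relative to k ...
print(aicc(aic=100.0, k=5, n=20))    # about 104.3
# ... and vanishes as n grows
print(aicc(aic=100.0, k=5, n=2000))  # about 100.03
```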

Q3: How does this relate to information theory?

MLR assumption #1. AIC is predicated upon the applicability of maximum likelihood regression (MLR) to a regression problem. There is only one circumstance in which ordinary least squares regression and maximum likelihood regression have been pointed out to me as being the same: when the residuals from ordinary least squares (OLS) linear regression are normally distributed and MLR has a Gaussian loss function. In other cases of OLS linear regression, for nonlinear OLS regression, and for non-Gaussian loss functions, MLR and OLS may differ. There are many other regression targets than OLS or MLR or even goodness of fit, and frequently a good answer has little to do with either, e.g., for most inverse problems. There are highly cited attempts (e.g., 1100 times) to generalize AIC to quasi-likelihood so that the dependence on maximum likelihood regression is relaxed to admit more general loss functions. Moreover, MLR for Student's-t, although not in closed form, is robustly convergent. Since Student's-t residual distributions are both more common and more general than, as well as inclusive of, Gaussian conditions, I see no special reason to use the Gaussian assumption for AIC.
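A quick illustration of that last point (my own sketch, assuming SciPy's `t.fit` as the numerical MLE for Student's-t; the simulated residuals are made up): with heavy-tailed residuals, the Student's-t likelihood usually yields the lower AIC even though it spends one more parameter.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
resid = stats.t.rvs(df=3, size=500, random_state=rng)  # heavy-tailed "residuals"

# Gaussian likelihood: 2 parameters (mu, sigma)
mu, sigma = stats.norm.fit(resid)
aic_gauss = 2 * 2 - 2 * stats.norm.logpdf(resid, mu, sigma).sum()

# Student's-t likelihood: 3 parameters (df, loc, scale), fitted numerically
df, loc, scale = stats.t.fit(resid)
aic_t = 2 * 3 - 2 * stats.t.logpdf(resid, df, loc, scale).sum()

print(aic_gauss, aic_t)  # the t likelihood typically gives the smaller AIC here
```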

MLR assumption #2. MLR is an attempt to quantify goodness of fit. It is sometimes applied when it is not appropriate, for example to trimmed-range data when the model used is not trimmed. Goodness of fit is all fine and good if we have complete information coverage. In time series, we do not usually have fast enough information to understand fully what physical events transpire initially, or our models may not be complete enough to examine very early data. Even more troubling is that one often cannot test goodness of fit at very late times, for lack of data. Thus, goodness of fit may only be modelling 30% of the area under the fitted curve, and in that case we are judging an extrapolated model on the basis of where the data is, without examining what that means. In order to extrapolate, we need to look not only at the goodness of fit of 'amounts' but also at the derivatives of those amounts, failing which we have no "goodness" of extrapolation. Thus, fit techniques like B-splines find use because they can more smoothly predict what the data is when the derivatives are fit, or alternatively inverse-problem treatments, e.g., ill-posed integral treatment over the whole model range, like error-propagation adaptive Tikhonov regularization.

Another complicated concern: the data can tell us what we should be doing with it. What we need for goodness of fit (when appropriate) is residuals that are distances, in the sense that a standard deviation is a distance. That is, goodness of fit would not make much sense if a residual that is twice as long as a single standard deviation were not also of length two standard deviations. Selection of data transforms should be investigated prior to applying any model selection/regression method. If the data has proportional-type error, typically taking the logarithm before selecting a regression is not inappropriate, as it then transforms standard deviations into distances. Alternatively, we can alter the norm to be minimized to accommodate fitting proportional data. The same applies to a Poisson error structure: we can either take the square root of the data to normalize the error, or alter our norm for fitting. There are problems that are much more complicated or even intractable if we cannot alter the norm for fitting, e.g., Poisson counting statistics from nuclear decay, where the radionuclide decay introduces an exponential time-based association between the counting data and the actual mass that would have been emanating those counts had there been no decay. Why? If we decay back-correct the count rates, we no longer have Poisson statistics, and residuals (or errors) from the square root of corrected counts are no longer distances. If we then want to perform a goodness-of-fit test of decay-corrected data (e.g., AIC), we would have to do it in some way that is unknown to my humble self. Open question to the readership: if we insist on using MLR, can we alter its norm to account for the error type of the data (desirable), or must we always transform the data to allow MLR usage (not as useful)? Note, AIC does not compare regression methods for a single model; it compares different models for the same regression method.
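A small sketch of the two transforms mentioned (my own, on simulated data): the logarithm roughly stabilizes proportional error, and the square root roughly stabilizes Poisson error.

```python
import numpy as np

rng = np.random.default_rng(3)
signal = np.linspace(10.0, 1000.0, 50)

# proportional (multiplicative) error: the raw spread grows with the signal,
# but is roughly constant after taking logarithms
y_prop = signal * np.exp(rng.normal(0.0, 0.1, size=signal.size))
print(np.std(y_prop[:10] - signal[:10]), np.std(y_prop[-10:] - signal[-10:]))
print(np.std(np.log(y_prop[:10] / signal[:10])),
      np.std(np.log(y_prop[-10:] / signal[-10:])))

# Poisson error: variance equals the mean, but the square root of the counts
# has a roughly constant standard deviation of about 0.5
counts = rng.poisson(lam=signal)
print(np.std(np.sqrt(counts) - np.sqrt(signal)))
```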

AIC assumption #1. It would seem that MLR is not restricted to normal residuals; for example, see this question about MLR and Student's-t. Next, let us assume that MLR is appropriate to our problem so that we track its use for comparing AIC values in theory. Next we assume that we have 1) complete information, and 2) the same type of distribution of residuals (e.g., both normal, both Student's-t) for at least two models. That is, it so happens that two models have the same type of distribution of residuals. Could that happen? Yes, probably, but certainly not always.

AIC assumption #2. AIC relates the negative logarithm of the quantity (number of parameters in the model divided by the Kullback-Leibler divergence). Is this assumption necessary? In the general-loss-functions paper a different "divergence" is used. This leads us to ask: if that other measure is more general than the K-L divergence, why are we not using it for AIC as well?

The mismatched information for AIC from the Kullback-Leibler divergence is this: "Although ... often intuited as a way of measuring the distance between probability distributions, the Kullback–Leibler divergence is not a true metric." We shall see why shortly.

The K-L argument gets to the point where the difference between two things, the model ($P$) and the data ($Q$), is

$$D_{\mathrm{KL}}(P\|Q) = \int_X \log\!\left(\frac{{\rm d}P}{{\rm d}Q}\right) \frac{{\rm d}P}{{\rm d}Q} \, {\rm d}Q \,,$$

which we recognize as the entropy of $P$ relative to $Q$.
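A numerical sketch of this divergence for two concrete distributions (my own, using SciPy quadrature; the two Gaussians are arbitrary choices), which also shows the asymmetry behind the "not a true metric" remark above:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# two arbitrary continuous distributions standing in for P and Q
P = stats.norm(loc=0.0, scale=1.0)
Q = stats.norm(loc=1.0, scale=2.0)

def kl(a, b):
    """Numerical K-L divergence D(a || b) by quadrature."""
    integrand = lambda x: a.pdf(x) * (a.logpdf(x) - b.logpdf(x))
    return quad(integrand, -np.inf, np.inf)[0]

print(kl(P, Q), kl(Q, P))  # unequal: the divergence is not symmetric, so not a metric
# closed form for these two Gaussians, as a check: ln(2) + (1 + 1)/(2*4) - 1/2 ~ 0.443
```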

AIC assumption #3. Most formulas involving the Kullback–Leibler divergence hold regardless of the base of the logarithm. The constant multiplier might have more meaning if AIC were relating more than one data set at a time. As it stands, when comparing methods, if $\text{AIC}_{\text{data},\,\text{model 1}}<\text{AIC}_{\text{data},\,\text{model 2}}$, then multiplying both sides by any positive number preserves the inequality. Since the multiplier is arbitrary, setting the constant to a specific value as a matter of definition is also not inappropriate.

AIC assumption #4. That would be that AIC measures Shannon entropy, or "self-information." What we need to know is: "Is entropy what we need for a metric of information?"

To understand what "self-information" is, it behooves us to normalize information in a physical context, any one will do. Yes, I want a measure of information to have properties that are physical. So what would that look like in a more general context?

The Gibbs free-energy equation ($\Delta G = \Delta H - T\Delta S$) relates the change in energy to the change in enthalpy minus the absolute temperature times the change in entropy. Temperature is an example of a successfully normalized information content, because if one hot and one cold brick are placed in contact with each other in a thermally closed environment, then heat will flow between them. Now, if we jump at this without thinking too hard, we say that heat is the information. But it is the relative information that predicts the behaviour of the system: information flows until equilibrium is reached, but equilibrium of what? Temperature, that is what, not heat as in the particle velocities of particular particle masses. I am not talking about molecular temperature; I am talking about the gross temperature of two bricks, which may have different masses, be made of different materials, have different densities, and so on, and none of that do I have to know: all I need to know is that the gross temperature is what equilibrates. Thus, if one brick is hotter, it has more relative information content, and when colder, less.

Now, if I am told one brick has more entropy than the other, so what? That, by itself, will not predict whether it will gain or lose entropy when placed in contact with another brick. So, is entropy alone a useful measure of information? Yes, but only if we are comparing the same brick to itself, thus the term "self-information."

From that comes the last restriction: to use K-L divergence, all bricks must be identical. Thus, what makes AIC an atypical index is that it is not portable between data sets (e.g., different bricks), which is not an especially desirable property, and one that might be addressed by normalizing information content. Is K-L divergence linear? Maybe yes, maybe no. However, that does not matter; we do not need to assume linearity to use AIC, and, for example, I do not think entropy itself is linearly related to temperature. In other words, we do not need a linear metric to use entropy calculations.

It has been said that, "In itself, the value of the AIC for a given data set has no meaning." On the optimistic side, models that have close results can be differentiated by smoothing to establish confidence intervals, and much, much more.

  • Could you indicate the main difference between the new answer and the old deleted answer? It seems there is quite some overlap.
    Commented Sep 13, 2016 at 11:51
  • I was in the middle of editing my answer for some hours when it was deleted. There were a lot of changes compared to when I started, as it was a work in progress; it took a lot of reading and thinking, and my colleagues on this site do not seem to care for it, but are not helping answer anything. AIC, it seems, is too good for critical review; how dare I? I completed my edit and re-posted it. I want to know what is incorrect about my answer. I worked hard on it, and have tried to be truthful, and no one else has bothered.
    – Carl
    Commented Sep 13, 2016 at 14:44
  • Don't get upset. My first experience here was also frustrating, but later I learned to ask questions in an appropriate way. Keeping a neutral tone and avoiding strong opinions that are not based on hard facts would be a good first step, IMHO. (I have upvoted your question, by the way, but still hesitate about the answer.)
    Commented Sep 13, 2016 at 14:53
  • +1 Just for your preamble. Now I'll keep on reading the answer.
    Commented Sep 13, 2016 at 22:12
  • @AntoniParellada You have helped just by keeping the question from being deleted, which I appreciate. Working through AIC has been difficult, and I do need help with it. Sure, some of my insights are good, but I also have hoof-in-mouth disease, which other minds are better at catching than I.
    – Carl
    Commented Sep 13, 2016 at 22:34

AIC is an estimate of twice the model-driven additive term to the expected Kullback-Leibler divergence between the true distribution $f$ and the approximating parametric model $g$.

K-L divergence is a topic in information theory and works intuitively (though not rigorously) as a measure of distance between two probability distributions. In my explanation below, I'm referencing these slides from Shuhua Hu. This answer still needs a citation for the "key result."

The K-L divergence between the true model $f$ and approximating model $g_{\theta}$ is $$ d(f, g_{\theta}) = \int f(x) \log(f(x)) dx -\int f(x) \log(g_{\theta}(x)) dx$$

Since the truth is unknown, data $y$ are generated from $f$ and maximum likelihood estimation yields the estimator $\hat{\theta}(y)$. Replacing $\theta$ with $\hat{\theta}(y)$ in the equations above means that both the second term in the K-L divergence formula and the K-L divergence itself are now random variables. The "key result" in the slides is that the average of the second additive term with respect to $y$ can be estimated by a simple function of the likelihood function $L$ (evaluated at the MLE) and $k$, the dimension of $\theta$: $$ -\text{E}_y\left[\int f(x) \log(g_{\hat{\theta}(y)}(x)) \, dx \right] \approx -\log(L(\hat{\theta}(y))) + k.$$

AIC is defined as twice the expectation above (HT @Carl), and smaller (more negative) values correspond to a smaller estimated K-L divergence between the true distribution $f$ and the modeled distribution $g_{\hat{\theta}(y)}$.
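A rough simulation check of the key result above (my own sketch, under strong simplifying assumptions: $f$ and the approximating $g$ both Gaussian with known variance, so $k = 1$ and the inner integral has a closed form):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, n_rep = 30, 1, 20000
sigma = 1.0                      # known variance: only the mean is estimated

lhs, rhs = [], []
for _ in range(n_rep):
    y = rng.normal(0.0, sigma, size=n)   # data from the true f = N(0, 1)
    mu_hat = y.mean()                    # MLE under the approximating model g

    # log L(theta_hat(y)) on the observed data
    log_L = (-0.5 * n * np.log(2 * np.pi * sigma**2)
             - 0.5 * np.sum((y - mu_hat) ** 2) / sigma**2)

    # -E_x[ log g_{theta_hat(y)}(x) ] over a fresh dataset x ~ f of the same size;
    # for this toy model E_x[ sum (x_i - mu_hat)^2 ] = n * (sigma^2 + mu_hat^2)
    neg_cross = (0.5 * n * np.log(2 * np.pi * sigma**2)
                 + 0.5 * n * (sigma**2 + mu_hat**2) / sigma**2)

    lhs.append(neg_cross)
    rhs.append(-log_L + k)

print(np.mean(lhs), np.mean(rhs))  # the two averages agree closely
```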

  • As you know, the term deviance when applied to log-likelihood is jargon and inexact. I omitted discussion of this because only monotonicity is required for AIC differences to have comparative worth, not linearity. So, I fail to see the relevance of trying overly hard to "visualize" something that likely is not there, and is not needed anyway.
    – Carl
    Commented Sep 21, 2016 at 17:10
  • I see your point that the last paragraph adds a red herring, and I realize that nobody needs to be convinced that $2x$ ranks the same as $x$. Would it be fair to say that the quantity is multiplied by 2 "by convention"?
    – Ben Ogorek
    Commented Sep 21, 2016 at 23:16
  • Something like that. Personally, I would vote for "is defined as," because it was initially chosen that way. Or, to put this in temporal perspective, any constant that could have been used, including one, would have had to be chosen and adhered to, as there is no reference standard to enforce a scale.
    – Carl
    Commented Sep 21, 2016 at 23:57

A simple point of view for your first two questions is that the AIC is related to the expected out-of-sample error rate of the maximum likelihood model. The AIC criterion is based on the relationship (Elements of Statistical Learning equation 7.27) $$ -2 \, \mathrm{E}[\ln \mathrm{Pr}(D|\theta)] \approx -\frac{2}{N} \, \mathrm{E}[\ln L_{m,D}] + \frac{2k_m}{N} = \frac{1}{N} E[\mathrm{AIC}_{m,D}] $$ where, following your notation, $k_m$ is the number of parameters in the model $m$ whose maximum likelihood value is $L_{m,D}$.

The term on the left is the expected out-of-sample "error" rate of the maximum likelihood model $m = \{ \theta \}$, using the log of the probability as the error metric. The $-2$ factor is the traditional correction used to construct the deviance (useful because, in certain situations, it follows a chi-squared distribution).

The right-hand side consists of the in-sample "error" rate estimated from the maximized log-likelihood, plus the term $2k_m/N$ correcting for the optimism of the maximized log-likelihood, which has the freedom to overfit the data somewhat.

Thus, the AIC is an estimate of the out-of-sample "error" rate (deviance) times $N$.
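A small simulation sketch of that relationship (my own, for a Gaussian linear model with known noise variance $\sigma^2 = 1$, so $k_m = p$, the number of coefficients): the in-sample deviance per observation is optimistic by about $2k_m/N$ relative to the out-of-sample one.

```python
import numpy as np

rng = np.random.default_rng(5)
N, p, n_rep = 100, 4, 5000
X = rng.normal(size=(N, p))              # fixed design; noise variance known (= 1)

gap = []
for _ in range(n_rep):
    beta = rng.normal(size=p)
    y_train = X @ beta + rng.normal(size=N)   # in-sample responses
    y_test  = X @ beta + rng.normal(size=N)   # fresh responses at the same X

    beta_hat = np.linalg.lstsq(X, y_train, rcond=None)[0]

    # deviance per observation, dropping the constant (it cancels in the gap)
    dev_in  = np.sum((y_train - X @ beta_hat) ** 2) / N
    dev_out = np.sum((y_test  - X @ beta_hat) ** 2) / N
    gap.append(dev_out - dev_in)

print(np.mean(gap), 2 * p / N)  # the in-sample optimism is about 2k/N
```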

