
What does the Akaike Information Criterion (AIC) score of a model mean?

I have seen some questions here about what it means in layman's terms, but those are too informal for my purpose here. I am trying to understand mathematically what the AIC score means.

At the same time, I don't want a rigorous proof that would obscure the more important points. For example, if this were calculus, I would be happy with infinitesimals, and if this were probability theory, I would be happy without measure theory.

My attempt

Based on my reading here, and with some notational sugar of my own, $\text{AIC}_{m,D}$ is the AIC of model $m$ on dataset $D$, defined as follows: $$ \text{AIC}_{m,D} = 2k_m - 2 \ln(L_{m,D}) $$ where $k_m$ is the number of parameters of model $m$, and $L_{m,D}$ is the maximized value of the likelihood function of model $m$ on dataset $D$.
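To make the definition concrete, here is a minimal Python sketch (the Gaussian model and the simulated dataset are just illustrative assumptions of mine, assuming NumPy and SciPy are available) that fits a model by maximum likelihood and computes $\text{AIC}_{m,D}$ from $k_m$ and $\ln(L_{m,D})$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
D = rng.normal(loc=5.0, scale=2.0, size=100)     # illustrative dataset D

# MLE for a Gaussian: sample mean and (biased) sample standard deviation.
mu_hat, sigma_hat = D.mean(), D.std()
k_m = 2                                          # number of fitted parameters
log_L = stats.norm.logpdf(D, loc=mu_hat, scale=sigma_hat).sum()  # ln(L_{m,D})

AIC = 2 * k_m - 2 * log_L
print(AIC)
```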

Here is my understanding of what the above implies:

$$ m = \underset{\theta}{\text{arg max}\,} \Pr(D|\theta) $$

This way:

  • $k_m$ is the number of parameters of $m$.
  • $L_{m,D} = \Pr(D|m) = \mathcal{L}(m|D)$.

Let's now rewrite AIC: $$\begin{split} \text{AIC}_{m,D} =& 2k_m - 2 \ln(L_{m,D})\\ =& 2k_m - 2 \ln(\Pr(D|m))\\ =& 2k_m - 2 \log_e(\Pr(D|m))\\ \end{split}$$

Obviously, $\Pr(D|m)$ is the probability of observing dataset $D$ under model $m$. So the better the model $m$ fits the dataset $D$, the larger $\Pr(D|m)$ becomes, and thus the smaller the term $-2\log_e(\Pr(D|m))$ becomes.

So AIC clearly rewards models that fit their datasets well (because a smaller $\text{AIC}_{m,D}$ is better).

On the other hand, the term $2k_m$ clearly punishes models with more parameters by making $\text{AIC}_{m,D}$ larger.

In other words, AIC seems to be a measure that:

  • Rewards accurate models (those that fit $D$ better) logarithmically. E.g. it rewards an increase in fitness from $0.4$ to $0.5$ more than it rewards an increase in fitness from $0.8$ to $0.9$. This is shown in the figure below and checked numerically after it.
  • Rewards reduction in parameters linearly. So a decrease in parameters from $9$ down to $8$ is rewarded as much as a decrease from $2$ down to $1$.

[Figure: plot illustrating the logarithmic reward for fitness described above.]
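As a quick numerical check of that logarithmic reward (a minimal Python sketch; the likelihood values $0.4$, $0.5$, $0.8$, $0.9$ are just the hypothetical numbers from the first bullet):

```python
import numpy as np

# Penalty term of AIC as a function of the (maximized) likelihood value p.
def penalty(p):
    return -2 * np.log(p)

# The same 0.1 improvement in likelihood is rewarded more at 0.4 than at 0.8.
print(penalty(0.4) - penalty(0.5))  # ~0.446
print(penalty(0.8) - penalty(0.9))  # ~0.236
```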

In other words (again), AIC defines a trade-off between the importance of simplicity and the importance of fitness.

In other words (again), AIC seems to suggest that:

  • The importance of fitness diminishes.
  • But the importance of simplicity never diminishes; it remains constant.
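For example (made-up numbers): if adding one parameter raises the maximized log-likelihood from $-120$ to $-118$, the fit term improves by $4$ while the penalty grows by only $2$, so AIC prefers the larger model: $$\begin{split} \text{AIC}_A =& 2(2) - 2(-120) = 244\\ \text{AIC}_B =& 2(3) - 2(-118) = 242\\ \end{split}$$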

Q1: Why should we care about this specific fitness-simplicity trade-off?

Q2: Why $2k$ and why $2 \log_e(\ldots)$? Why not just: $$\begin{split} \text{AIC}_{m,D} =& 2k_m - 2 \ln(L_{m,D})\\ =& 2(k_m - \ln(L_{m,D}))\\ \frac{\text{AIC}_{m,D}}{2} =& k_m - \ln(L_{m,D})\\ \text{AIC}_{m,D,\text{SIMPLE}} =& k_m - \ln(L_{m,D})\\ \end{split}$$ i.e. $\text{AIC}_{m,D,\text{SIMPLE}}$ should in my view be just as useful as $\text{AIC}_{m,D}$ and should be able to serve for comparing different models relative to each other (it's just not scaled by $2$; do we need this?).
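Here is a small Python sketch, with made-up $(k_m, \ln L_{m,D})$ values, showing that the two criteria rank models identically, since dividing by the positive constant $2$ is a monotone transformation:

```python
# Made-up (k_m, ln L_{m,D}) values for three hypothetical models.
models = {"m1": (2, -120.0), "m2": (5, -110.0), "m3": (9, -108.0)}

aic        = {name: 2 * k - 2 * logL for name, (k, logL) in models.items()}
aic_simple = {name: k - logL for name, (k, logL) in models.items()}

print(sorted(aic, key=aic.get))                # ranking by AIC
print(sorted(aic_simple, key=aic_simple.get))  # identical ranking by AIC/2
```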

Q3: How does this relate to information theory? Could someone derive it from an information-theoretic starting point?

Comments

  • As you know, the term deviance, when applied to log-likelihood, is jargon and inexact. I omitted discussion of this because only monotonicity is required for AIC differences to have comparative worth, not linearity. So, I fail to see the relevance of trying overly hard to "visualize" something that likely is not there, and not needed anyway.
    – Carl, Sep 21, 2016 at 17:10
  • 2
    $\begingroup$ I see your point that the last paragraph adds a red herring, and I realize that nobody needs to be convinced that 2 * x ranks the same as x. Would if be fair to say that the quantity is multiplied by 2 "by convention"? $\endgroup$
    – Ben Ogorek
    Commented Sep 21, 2016 at 23:16
  • 2
    $\begingroup$ Something like that. Personally, I would vote for "is defined as," because it was initially chosen that way. Or to put this in temporal perspective, any constant that could have been used, including one times, would have to have been chosen and adhered to, as there is no reference standard to enforce a scale. $\endgroup$
    – Carl
    Commented Sep 21, 2016 at 23:57
