Minimum description length

The minimum description length (MDL) principle is a formalization of Occam's Razor in which the best hypothesis for a given set of data is the one that leads to the greatest compression of the data. MDL is important in information theory and learning theory.

Any set of data can be represented by a string of symbols from a finite (say, binary) alphabet. "The fundamental idea behind the MDL Principle is that any regularity in a given set of data can be used to compress the data, i.e. to describe it using fewer symbols than needed to describe the data literally." (Grünwald, 1998. See the link below.) Since we want to select the hypothesis that captures the most regularity in the data, we look for the hypothesis with which the best compression can be achieved.
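For instance, a highly regular string can be described in far fewer symbols than an irregular one of the same length. The following sketch illustrates this using Python's general-purpose zlib compressor as a stand-in for "description length" (the two strings are made up for illustration):

```python
import random
import zlib

# Two byte strings of the same length: one with obvious regularity,
# one pseudo-random (seeded, so the demonstration is reproducible).
regular = b"ab" * 32
random.seed(0)
irregular = bytes(random.randrange(256) for _ in range(64))

# A general-purpose compressor exploits the regularity in the first
# string but cannot shorten the second.
print(len(zlib.compress(regular)))    # far fewer than 64 bytes
print(len(zlib.compress(irregular)))  # at least 64 bytes, plus header overhead
```

The compressed length of the regular string is much smaller, reflecting the regularity that a hypothesis about the data could capture; the irregular string admits no such compression.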

In order to do this, we must first fix a code to compress the data. The most general way to do this is to choose a (Turing-complete) computer language, and then write a program in that language that outputs the data. This program thus represents the data. The length of the shortest program that outputs the data is called the Kolmogorov complexity of the data. This is the central idea of Ray Solomonoff's idealized theory of inductive inference.

However, this mathematical theory does not provide a practical way of doing inference. The most important reasons for this are:

  • Kolmogorov complexity is uncomputable: there exists no computer program that, when input an arbitrary sequence of data, outputs the shortest program that produces the data. Even if we accidentally find the shortest program that outputs the data, it is in general not possible to know that it is the shortest.
  • The Kolmogorov complexity depends on what computer language is used to describe programs. It is only defined up to a constant number of bits. If only a small amount of data is available, then such constants may have a very large influence on the inference results: good results cannot be guaranteed when one is working with limited data.

MDL is an attempt to remedy these problems by:

  • Restricting the set of allowed codes in such a way that it becomes possible (computable) to find the shortest codelength of the data, relative to the allowed codes, and
  • Choosing a code that is reasonably efficient whatever the data at hand. This point is somewhat elusive, and much research is still going on in this area.

Rather than "programs", in MDL theory one usually speaks of candidate hypotheses, models or codes. The set of allowed codes is then called the model class. (To confuse matters, some authors refer to the model class as the model.) The candidate whose description, together with the description of the data, has the shortest total length is then selected.

MDL was not the first attempt to do hypothesis selection through minimizing description length; as early as 1968 Wallace and Boulton pioneered a related concept called Minimum Message Length (MML). MDL was introduced by Jorma Rissanen in 1978; it differs from MML in several ways, most notably (at least in most of J. Rissanen's early MDL papers) in its extensive use of one-part rather than two-part codes.
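A two-part code first describes a hypothesis and then describes the data with the help of that hypothesis. The following toy sketch illustrates the idea, assuming a Bernoulli model class whose parameter is quantized to k fractional bits (the model class, function name, and data here are illustrative only, not from any particular MDL paper):

```python
import math

def two_part_length(data, k):
    """Total two-part code length in bits: L(H) + L(D|H), where the
    hypothesis H is a Bernoulli parameter quantized to k fractional bits."""
    n, ones = len(data), sum(data)
    grid = 2 ** k
    # Quantize the maximum-likelihood estimate, avoiding the endpoints 0 and 1.
    theta = min(grid - 1, max(1, round(ones / n * grid))) / grid
    model_bits = k  # describing one of 2**k parameter values costs k bits
    data_bits = -sum(math.log2(theta if x else 1 - theta) for x in data)
    return model_bits + data_bits

data = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1] * 4
best_k = min(range(1, 11), key=lambda k: two_part_length(data, k))
print(best_k)  # a coarse quantization wins: extra precision costs more bits than it saves
```

Minimizing the total length trades off hypothesis complexity (more precision bits) against fit (shorter data description), which is the essence of two-part MDL and MML.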

Central to MDL theory is the one-to-one correspondence between code length functions and probability distributions. (The lemma involved is the Kraft-McMillan inequality.) For any probability distribution <math>P</math>, it is possible to construct a code <math>C</math> such that the length (in bits) of <math>C(x)</math> is equal to <math>-\log_2 P(x)</math>; this code minimizes the expected code length. Conversely, given a code <math>C</math>, one can construct a probability distribution <math>P</math> such that the same holds. (Rounding issues are ignored here.) In other words, searching for an efficient code reduces to searching for a good probability distribution, and vice versa.
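This correspondence can be checked directly for a small example. The sketch below (the four-symbol distribution is made up for illustration) computes the ideal code lengths for a distribution and verifies that they satisfy the Kraft inequality, so a prefix code with these lengths exists:

```python
import math

# An illustrative probability distribution over four symbols.
P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Ideal code lengths L(x) = -log2 P(x); rounding up preserves the
# Kraft inequality, so a prefix code with these lengths exists.
lengths = {x: math.ceil(-math.log2(p)) for x, p in P.items()}
print(lengths)                                 # {'a': 1, 'b': 2, 'c': 3, 'd': 3}
print(sum(2 ** -l for l in lengths.values()))  # 1.0, i.e. <= 1: Kraft holds

# Conversely, a code with lengths L(x) induces the distribution Q(x) = 2**-L(x).
Q = {x: 2 ** -l for x, l in lengths.items()}
```

Here the distribution is dyadic, so the induced <code>Q</code> coincides with <code>P</code> exactly; in general, rounding introduces at most one extra bit per symbol.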

This has led some researchers to view MDL as equivalent to Bayesian inference: the code length of the model and the code length of model and data together correspond to the prior probability and the marginal likelihood, respectively, in the Bayesian framework. This point of view is expressed, for example, in David MacKay's Information Theory, Inference, and Learning Algorithms. However, while Bayesian machinery is often useful in constructing efficient MDL codes, the MDL framework also accommodates codes that do not fit into any Bayesian framework. An example is the Shtarkov "normalized maximum likelihood" code, which plays a central role in current MDL theory but has no equivalent in Bayesian inference. Furthermore, the MDL Principle prefers some priors over others. While the same priors tend to be favored in so-called objective Bayesian analysis, they are favored there for different reasons.
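The correspondence can be made concrete in symbols (writing <math>H</math> for a hypothesis and <math>D</math> for the data; this notation is not used elsewhere in the article). A two-part code first describes the hypothesis and then the data with its help, so its length is

<math>-\log_2 P(H) - \log_2 P(D \mid H) = -\log_2 \big( P(H)\, P(D \mid H) \big),</math>

the negative logarithm of prior times likelihood, while a one-part (mixture) code achieves length

<math>-\log_2 \sum_H P(H)\, P(D \mid H) = -\log_2 P(D),</math>

the negative logarithm of the Bayesian marginal likelihood.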

External links

  • Homepage of Jorma Rissanen, containing lecture notes and other recent material on MDL.
  • P. Grünwald, M. A. Pitt and I. J. Myung (eds.), Advances in Minimum Description Length: Theory and Applications, MIT Press, April 2005.
