
Prediction and Entropy of Languages

Controls: corpus language (e.g., Spanish); n-grams
Approaching the entropy (H) of a language through corpus n-grams
Shannon's information entropy [1] is defined by H(Q) = -∑ P(n) log₂ P(n), where P(n) is the probability of n and the sum runs over the elements of Q. Shannon's entropy is a measure of uncertainty.
One application is measuring the redundancy of a language from the frequencies of its letter n-grams (single letters, pairs of letters, triplets, and so on). This Demonstration shows the frequencies of n-grams calculated from the United Nations' Universal Declaration of Human Rights in 20 languages and illustrates the entropy rate calculated from these n-gram frequency distributions. The entropy of a language estimates the probabilistic information content of each letter in that language, and so is also a measure of the language's predictability and redundancy.
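A sketch of how such an estimate can be computed in Wolfram Language follows; the sample sentence, the cleaning step, and the helper names (ngrams, probs, entropyRate) are illustrative assumptions, not the Demonstration's actual source:

    text = ToLowerCase["all human beings are born free and equal in dignity and rights"];
    chars = Characters[StringReplace[text, Except[LetterCharacter | " "] -> ""]];
    ngrams[n_] := Partition[chars, n, 1]                          (* overlapping letter n-grams *)
    probs[n_] := N[Normalize[Values[Counts[ngrams[n]]], Total]]   (* relative n-gram frequencies *)
    entropyRate[n_] := -Total[# Log2[#] & /@ probs[n]]/n          (* bits per letter *)
    entropyRate /@ {1, 2, 3}

Dividing the n-gram entropy by n gives an entropy rate in bits per letter; for a sufficiently large corpus this rate decreases as n grows, approaching the entropy of the language and thereby reflecting its redundancy.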