# Collocation by Symmetric Conditional Probability

Collocation by Symmetric Conditional Probability

Roughly speaking, a "collocation" is an -gram (subsequence) that appears more frequently in some sequence than would be expected if the constituent parts of the -gram were drawn at random. One way of making this definition more precise is to use the notion of "symmetric conditional probability". For a bigram (i.e., an -gram with two parts), its symmetric conditional probability is the product of (1) the conditional probability that the second half of the bigram would appear given the first half of the bigram and (2) the conditional probability that the first half of the bigram would appear given the second half of the bigram. Bigrams with high specific conditional probability may be thought of as collocations. This concept can be extended to -grams by cutting the -gram at all positions (where is the length of the -gram), finding the symmetric conditional probability of this "pseudo-bigram", and then taking the mean of the results.

n

n

n

n

n

1…n-1

n

n

This Demonstration shows how this concept can be built into an algorithm when the sequence is stored as a "trie", that is, a data structure that converts "sentences" of symbols into a tree structure by requiring that all -grams up to a certain length that appear in a sentence have as their parent "most" of that -gram, that is, all but the rightmost element of the original -gram. You select the elementary cellular automaton that generates the sentences. You select whether 3 or 4 is the maximum length of -grams examined. You further select which -gram of length 2 or more you wish to consider. And you select where you want to slice the -gram to convert it into a pseudo-bigram. The system responds with a figure showing the conditional probability of the second part of the pseudo-bigram given the first part of the bigram and a figure showing the conditional probability of the first part of the pseudo-bigram given the second part of the pseudo-bigram. The green circle shows the -gram you selected. The purple dashed -gram shows the relevant parent of that -gram and the blue dashed -grams show all appropriate children of the parent. The bottom row of the output shows how symmetric conditional probability (the mean of the products of the conditional probabilities over all possible slicings of the -gram) is done.

n

n

n

n

n

n

n

n

n

n

n