# Collocation by Chi Square

Collocation by Chi Square

Roughly speaking, a "collocation" is an -gram (subsequence) that appears more frequently in some sequence than would be expected if the constituent parts of the -gram were drawn at random. One way of making this definition more precise is to create a matrix. This Demonstration shows how this concept can be built into an algorithm when the sequence is stored as a "trie", that is, a data structure that converts "sentences" of symbols into a tree structure by requiring that all -grams up to a certain length that appear in a sentence have as their parent "most" of that -gram, that is, all but the rightmost element of the original -gram.

n

n

χ

2

n

n

n

You select the elementary cellular automaton that generates the sentences. You select whether three or four is the maximum length of -grams examined. You further select which -gram you wish to consider. And you select where you want to slice the -gram to convert it into a pseudo-bigram. The system responds with two figures: a trie and a reverse trie, both of which label edges according to the number of times the destination -gram follows the source -gram. In the trie, the -gram you selected is circled in green. The first part of your -gram is circled in magenta. -grams that share the first part of your -gram but that do not share the second part are circled in blue. In the reverse trie, the -gram you selected is again circled in green. The second part of your -gram is circled in magenta. -grams that share the second part of your -gram but that do not share the first part are circled in cyan. -grams that share neither the first nor second part of your -gram are circled in orange. The system then produces the corresponding matrix with the elements of the matrix color coded in the same way as the -grams in the figures. Thus, if blue nodes have incoming edge counts of 7, 7, 15, 17, 15, 11, 21, the top right element of the matrix will be the sum of these edge counts (93) and colored blue. The bottom row of the output shows the values associated with the matrix and shows the likelihood that the sequence of values in the -gram would occur as frequently as would be the case if the values were chosen at random.

n

n

n

n

n

n

n

n

n

n

n

n

n

n

n

χ

2

n

χ

2

χ

2

n