Compute the Profile

The standard algorithm for computing the Profile (the baseline) scans through the text and counts the trigrams. The 26 letters of the Latin alphabet plus the space give rise to 27^3 = 19,683 possible trigrams, so we can accumulate the trigram counts into a 19,683-dimensional vector and compare such vectors to find the language with the most similar profile. This is straightforward with trigrams, but it becomes unwieldy with higher-order n-grams, where the number of possible n-grams grows into the millions (the number of possible pentagrams is 27^5 = 14,348,907). The standard algorithm therefore generalizes poorly.
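
The baseline described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. The function names (normalize, trigram_profile, cosine_similarity) are hypothetical, the normalization step (lowercasing and mapping every other character to a space) is an assumption, and cosine similarity is only one possible choice of comparison, since the text does not name a specific measure.

```python
import math

ALPHABET = "abcdefghijklmnopqrstuvwxyz "   # 26 letters plus the space
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}
BASE = len(ALPHABET)                       # 27


def normalize(text):
    """Lowercase, map characters outside the alphabet to a space,
    and collapse runs of spaces (an assumed preprocessing step)."""
    chars = [ch if ch in INDEX else " " for ch in text.lower()]
    return " ".join("".join(chars).split())


def trigram_profile(text):
    """Accumulate trigram counts into a 27^3 = 19,683-dimensional vector."""
    counts = [0] * BASE ** 3
    s = normalize(text)
    for i in range(len(s) - 2):
        a, b, c = INDEX[s[i]], INDEX[s[i + 1]], INDEX[s[i + 2]]
        # Treat the trigram as a three-digit number in base 27.
        counts[(a * BASE + b) * BASE + c] += 1
    return counts


def cosine_similarity(u, v):
    """One possible similarity measure between two profiles (an assumption)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0
```

To identify a language, one would build a profile from a reference text for each candidate language and pick the candidate whose profile scores highest against the profile of the unknown text. The dense vector works here because 19,683 counters are cheap to store. The same layout for pentagrams would need 27^5 = 14,348,907 counters per profile, almost all of them zero.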