Identify the Language of a Sentence

The task is to identify the language of a sentence from its three-letter sequences, called trigrams.

We compare the trigram profile of the sentence to the trigram profiles of 21 languages and choose the language with the most similar profile. A profile is essentially a histogram of trigram frequencies in the text in question.
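The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of the cited papers: the function names (`trigram_profile`, `cosine_similarity`, `identify_language`) are hypothetical, cosine similarity is assumed as the similarity measure because the text does not name one, and the two toy reference profiles stand in for the 21 language profiles.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Return a histogram (Counter) of the overlapping trigrams in text."""
    # Assumption: reduce the text to lower-case a-z plus single spaces.
    cleaned = "".join(c if c.isascii() and c.isalpha() else " " for c in text.lower())
    cleaned = " ".join(cleaned.split())
    return Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))


def cosine_similarity(p, q):
    """Cosine of the angle between two sparse trigram histograms."""
    dot = sum(count * q[tri] for tri, count in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def identify_language(sentence, language_profiles):
    """Pick the language whose profile is most similar to the sentence's."""
    sentence_profile = trigram_profile(sentence)
    return max(
        language_profiles,
        key=lambda lang: cosine_similarity(sentence_profile, language_profiles[lang]),
    )


# Toy reference profiles built from one sentence each (illustrative only;
# real profiles would be built from large text samples per language).
toy_profiles = {
    "english": trigram_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "german": trigram_profile("der schnelle braune fuchs springt über den faulen hund"),
}

guess = identify_language("the dog and the fox", toy_profiles)
```

With realistic profiles the same `identify_language` call would simply receive a dictionary with 21 entries instead of two.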

The standard algorithm for computing the profile, the baseline, scans through the text and counts the trigrams. The Latin alphabet of 26 letters and the space give rise to 27^3 = 19,683 possible trigrams, and so we can accumulate the trigram counts into a 19,683-dimensional vector and compare such vectors to find the language with the most similar profile. This is straightforward with trigrams, but it gets complicated with higher-order n-grams, where the number of possible n-grams grows into the millions (the number of possible pentagrams is 27^5 = 14,348,907). The standard algorithm generalizes poorly.
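As a rough sketch of the baseline, each n-gram can be read as a base-27 number and used as an index into a dense count vector. The names `ALPHABET`, `ngram_index`, `count_ngrams`, and `possible_ngrams` are hypothetical, and the input is assumed to be already reduced to lower-case letters and spaces.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # 26 letters plus the space
SYMBOL_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def ngram_index(ngram):
    """Interpret the n-gram as a base-27 number (its slot in the count vector)."""
    index = 0
    for ch in ngram:
        index = index * len(ALPHABET) + SYMBOL_INDEX[ch]
    return index


def possible_ngrams(n):
    """Number of distinct n-grams over the 27-symbol alphabet."""
    return len(ALPHABET) ** n


def count_ngrams(text, n=3):
    """Scan the text once and accumulate n-gram counts in a dense vector."""
    # Assumption: text contains only the 27 symbols in ALPHABET.
    counts = [0] * possible_ngrams(n)
    for i in range(len(text) - n + 1):
        counts[ngram_index(text[i:i + n])] += 1
    return counts


profile = count_ngrams("the cat sat on the mat")
```

The vector length is fixed by `possible_ngrams(n)` regardless of how short the text is, which is why the dense representation becomes unwieldy as n grows.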

RAHIMI, Abbas, DATTA, Sohum, KLEYKO, Denis, FRADY, Edward Paxon, OLSHAUSEN, Bruno, KANERVA, Pentti and RABAEY, Jan M., 2017. High-Dimensional Computing as a nanoscalable paradigm. IEEE Transactions on Circuits and Systems I: Regular Papers. 2017. Vol. 64, no. 9, p. 2508–2521. [Accessed 4 March 2024].

A. Joshi, J. T. Halseth, and P. Kanerva, “Language geometry using random indexing,” in Proc. 10th Int. Conf., Quant. Interact., San Francisco, CA, USA, Jul. 2016, pp. 265–274.
