The task is to identify the language of a sentence from its three-letter sequences, called trigrams.
We compare the trigram profile of the sentence to the trigram profiles of 21 languages and choose the language with the most similar profile. A profile is essentially a histogram of trigram frequencies in the text in question.
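As a rough illustration of profile comparison, the sketch below builds a profile as a histogram of overlapping trigrams and picks the closest language. The function names (`trigram_profile`, `cosine`, `identify`) are hypothetical, and cosine similarity is an assumption made here; the text above only asks for "the most similar profile" without fixing a measure.

```python
from collections import Counter
from math import sqrt


def trigram_profile(text):
    """Histogram of the overlapping three-letter sequences in text."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(p, q):
    """Cosine similarity of two profiles (assumed similarity measure)."""
    dot = sum(count * q[gram] for gram, count in p.items())
    norm_p = sqrt(sum(c * c for c in p.values()))
    norm_q = sqrt(sum(c * c for c in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def identify(sentence, language_profiles):
    """Return the key of the language profile most similar to the sentence."""
    query = trigram_profile(sentence)
    return max(language_profiles,
               key=lambda lang: cosine(query, language_profiles[lang]))
```

In use, `language_profiles` would be a dictionary mapping each of the 21 language names to a profile built from a large sample of that language, and `identify(sentence, language_profiles)` returns the best-matching name.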
The standard algorithm for computing the profile – the baseline – scans through the text and counts the trigrams. The Latin alphabet of 26 letters and the space give rise to 27^3 = 19,683 possible trigrams, and so we can accumulate the trigram counts into a 19,683-dimensional vector and compare such vectors to find the language with the most similar profile. This is straightforward with trigrams, but it gets complicated with higher-order n-grams, when the number of possible n-grams grows into the millions (the number of possible pentagrams is 27^5 = 14,348,907). The standard algorithm generalizes poorly.
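The baseline can be sketched as follows: each n-gram is read as a base-27 number that indexes a dense count vector of length 27^n. The names `ALPHABET`, `normalize` and `ngram_vector` are hypothetical, and folding every character outside a–z to a space is an assumption made here, not something the text specifies.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # 26 letters plus the space
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def normalize(text):
    """Lower-case the text and map anything outside the alphabet to a space
    (an assumed preprocessing step)."""
    return "".join(ch if ch in INDEX else " " for ch in text.lower())


def ngram_vector(text, n=3):
    """Scan the text and accumulate n-gram counts into a 27**n vector."""
    counts = [0] * (27 ** n)
    codes = [INDEX[ch] for ch in normalize(text)]
    for i in range(len(codes) - n + 1):
        slot = 0
        for code in codes[i:i + n]:
            slot = slot * 27 + code  # read the n-gram as a base-27 number
        counts[slot] += 1
    return counts
```

With `n=3` the vector has 19,683 entries; with `n=5` the same code allocates 14,348,907 counters per profile, nearly all of them zero for a single sentence, which is the poor scaling described above.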
~
RAHIMI, Abbas, DATTA, Sohum, KLEYKO, Denis, FRADY, Edward Paxon, OLSHAUSEN, Bruno, KANERVA, Pentti and RABAEY, Jan M., 2017. High-Dimensional Computing as a nanoscalable paradigm. IEEE Transactions on Circuits and Systems I: Regular Papers. 2017. Vol. 64, no. 9, p. 2508–2521. [Accessed 4 March 2024].
A. Joshi, J. T. Halseth, and P. Kanerva, “Language geometry using random indexing,” in Proc. 10th Int. Conf., Quant. Interact., San Francisco, CA, USA, Jul. 2016, pp. 265–274.