We propose Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text, such as sentences, paragraphs, and documents. Our algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-of-words models. Empirical results show that Paragraph Vectors outperform bag-of-words models as well as other techniques for text representations. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks.
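To make the core idea concrete, the sketch below uses the gensim library's Doc2Vec, an independent open-source implementation of Paragraph Vector; it is not the authors' code, and the toy corpus, document tags, and hyperparameter values are illustrative assumptions rather than settings from the paper.

```python
# Illustrative sketch of Paragraph Vector via gensim's Doc2Vec.
# Corpus, tags, and hyperparameters are assumptions for demonstration only.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each variable-length document gets a tag, so it receives its own
# trainable fixed-length vector.
corpus = [
    TaggedDocument(words=["the", "movie", "was", "great"], tags=["doc0"]),
    TaggedDocument(words=["the", "plot", "was", "dull"], tags=["doc1"]),
]

# dm=1 selects the Distributed Memory variant: the document vector is
# trained jointly with word vectors to predict words in the document.
model = Doc2Vec(corpus, vector_size=50, window=2, min_count=1,
                epochs=40, dm=1)

# Fixed-length representation learned for a training document ...
vec0 = model.dv["doc0"]

# ... and a vector inferred for an unseen, variable-length piece of text.
new_vec = model.infer_vector(["a", "wonderful", "film"])
print(vec0.shape, new_vec.shape)  # (50,) (50,)
```

Because every document, regardless of length, maps to a vector of the same dimensionality, the output can be fed directly to standard classifiers, which is what enables the text classification and sentiment analysis experiments described above.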