Word Embeddings and Document Vectors: Part 1. Similarity
This similarity can be as simple as a categorical feature value, such as the color or shape of the objects we are classifying, or a more complex function of all the categorical and/or continuous feature values that these objects possess. Documents can be classified as well, using quantifiable attributes such as size, file extension, etc. Easy! But unfortunately, it is the meaning/import of the text contained in the document that we are usually interested in for classification. The ingredients of text are words (and throw in punctuation as well), and the meaning of a text snippet is not a deterministic function of these constituents. We know that the same set of words in a different order, or simply with different punctuation, can convey different meanings.
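To make that last point concrete, here is a minimal sketch in Python. The helper `bag_of_words` and the sample sentences are hypothetical and purely illustrative. The helper reduces a snippet to its word counts, discarding order and punctuation, and sentences with clearly different meanings end up with identical counts.

```python
from collections import Counter
import re


def bag_of_words(text):
    """Hypothetical helper: count the words in a snippet, ignoring order and punctuation."""
    # Lowercase the text and keep runs of letters/apostrophes as tokens.
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Same words, different order: who bit whom changes.
order_a = "The dog bit the man."
order_b = "The man bit the dog."

# Same words, different punctuation: the comma changes the meaning.
punct_a = "Let's eat, grandma!"
punct_b = "Let's eat grandma!"

# Both comparisons evaluate to True: the word counts alone cannot tell the pairs apart.
same_counts_order = bag_of_words(order_a) == bag_of_words(order_b)
same_counts_punct = bag_of_words(punct_a) == bag_of_words(punct_b)
```

Any representation built only from which words occur, and how often, will treat each of these pairs as the same text.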