How to create a Bag of Words embedding in R?

Thibaut · Aug 25, 2021
Bag of Words — Photo by Glen Carrie on Unsplash

Bag of Words embedding is a Natural Language Processing technique that embeds sentences into a fixed-size numeric vector. The goal is to use this vector as an input for a machine learning algorithm.

Bag of Words is simple to understand and is a great technique when you want to keep track of the exact contribution of each word in your algorithm. It can also be challenging to use because it creates very long and sparse vectors, and many algorithms do not cope well with highly sparse, high-dimensional data.

There are different ways to weight the words in a bag of words. The easiest to grasp is probably term frequency: the weight of a word in a document is simply the number of times it occurs in that document.
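As a quick illustration, here is a minimal base R sketch of term-frequency weighting on a single made-up sentence (the sentence, the variable names, and the simple regex tokeniser are just for illustration):

```r
sentence <- "R makes text mining fun, really fun!"

# Lowercase and split on anything that is not a letter
tokens <- unlist(strsplit(tolower(sentence), "[^a-z]+"))
tokens <- tokens[tokens != ""]  # drop empty strings left by punctuation

# Term frequency: how many times each word occurs in the sentence
table(tokens)
# fun makes mining r really text
#   2     1      1 1      1    1
```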

Let’s imagine we have these 3 sentences, forming what we call a corpus of 3 documents:

I love chocolate!

I love you, I do love you…

Do you love chocolate?

Here is how each document would be embedded in a Bag of Words vector, weighting the words by count:

Bag of Words with Term Frequency — Own work — CC BY 3.0

The matrix formed by all the vectors of the corpus’ documents is called a Document Term Matrix. In real life, these matrices are extremely big and sparse.
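To make this concrete, here is a minimal base R sketch (no external packages) that builds this Document Term Matrix from the three example documents. The variable names (corpus, dtm) and the simple regex tokeniser are my own choices for this sketch, not a standard API:

```r
# The three example documents above
corpus <- c("I love chocolate!",
            "I love you, I do love you...",
            "Do you love chocolate?")

# Tokenise each document: lowercase and split on anything that is not a letter
tokens <- lapply(corpus, function(doc) {
  words <- unlist(strsplit(tolower(doc), "[^a-z]+"))
  words[words != ""]
})

# Vocabulary: every distinct word seen in the corpus
vocabulary <- sort(unique(unlist(tokens)))

# Document Term Matrix: one row per document, one column per word,
# each cell holding the term frequency (raw count)
dtm <- t(sapply(tokens, function(words) {
  table(factor(words, levels = vocabulary))
}))
rownames(dtm) <- paste0("doc", seq_along(corpus))

dtm
#      chocolate do i love you
# doc1         1  0 1    1   0
# doc2         0  1 2    2   2
# doc3         1  1 0    1   1
```

On a real corpus you would typically rely on a dedicated package such as tm (with its DocumentTermMatrix() function), text2vec, or quanteda, which store the result as a sparse matrix rather than a dense one.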
