An introduction to classification algorithms

Thibaut
15 min read · Aug 27, 2021
Photo by Vinicius “amnx” Amano on Unsplash

In the previous article, I covered how to build a Bag of Words in R and tested it on two different datasets. This is a good occasion to say a little about some commonly used classification algorithms.

I will give some code samples in R, but they are easy to port to Python or Julia.

All the models are available in this GitHub repository.

Our task

We are working with the dataset from the Toxic Comment Classification Challenge on Kaggle.

The task is to predict whether a new comment is toxic, using a labelled corpus of toxic and non-toxic comments posted on Wikipedia.

The goal of the Toxic Comment Classification Challenge was to help build the Perspective API, which identifies toxicity in online discussions.

Comments are transformed into numerical vectors using Bag of Words

Each comment is embedded as a numerical vector using the Bag of Words technique. Each meaningful word or pair of adjacent words (bigram) in our dictionary is one coordinate of the vector. The resulting vectors are very long and sparse, because any single comment uses only a tiny fraction of the dictionary.
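To make the embedding concrete, here is a minimal sketch in base R. It is not the code from the previous article: the two toy comments and the helper names `tokenize` and `terms_of` are made up for illustration, and no filtering of "meaningful" words is applied.

```r
# Toy corpus (invented comments, not taken from the Kaggle dataset)
comments <- c(
  "you are great",
  "you are what you are and you are wrong"
)

# Lowercase a comment and split it on whitespace
tokenize <- function(text) {
  strsplit(tolower(text), "\\s+")[[1]]
}

# Words plus bigrams (pairs of adjacent words) for one comment
terms_of <- function(text) {
  words <- tokenize(text)
  bigrams <- if (length(words) > 1) {
    paste(head(words, -1), tail(words, -1))
  } else {
    character(0)
  }
  c(words, bigrams)
}

term_lists <- lapply(comments, terms_of)

# The dictionary: every distinct word or bigram seen in the corpus
dictionary <- sort(unique(unlist(term_lists)))

# One row per comment, one column per dictionary entry, cells hold counts
bow <- t(vapply(term_lists, function(terms) {
  tabulate(match(terms, dictionary), nbins = length(dictionary))
}, integer(length(dictionary))))
colnames(bow) <- dictionary

print(bow)

# Share of zero cells: a rough measure of how sparse the matrix is
sparsity <- mean(bow == 0)
```

In this toy example the bigram "you are" appears three times in the second comment, so `bow[2, "you are"]` is 3. On a real corpus the dictionary holds many thousands of entries, so a sparse matrix representation would normally replace the dense matrix used here.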

For each comment, we list the words and bigrams used. If the bigram “you are” is used 3 times in the document, we attribute a…
