The animation above illustrates an example of a 2D plane separating setosa and versicolor points in 3D space. Our parameters define the plane — also known as the decision boundary — and when they are multiplied by the coordinates of a point (the features), the result is positive if the point lies above the plane and negative if it lies below it. This is the core idea that makes our algorithm possible. Our job is to find a set of parameters that separates the classes perfectly (like the one above).
The four features of every sample will be represented by a four-dimensional vector $x$. The vector $\theta$ will be a four-dimensional vector containing the weights. If $b$ is the bias, we can mathematically represent our model like this:
€€ \theta_1x_1 + \theta_2x_2 + \theta_3x_3 + \theta_4x_4 + b €€
With four coordinates ($x_1$ to $x_4$), this expression defines a three-dimensional analogue of a plane in 4D space rather than a 2D plane in 3D space like the one shown in the animation. Impossible to visualize, yes, but that doesn’t change a thing. Pretty cool, right?
A brief note about terminology: we can refer to a line in 2D space, a plane in 3D space, and their higher-dimensional analogues by the same name: hyperplane. More generally, a hyperplane is simply a flat subspace of dimension $n-1$ inside an enclosing space of dimension $n$. As we will see, this will be a convenient term as we expand our algorithms to higher dimensions.
It is worth repeating that – for 3D space – when this expression is positive, the point $x$ is located above the plane, and when the expression is negative, $x$ is located beneath the plane. The plane itself is defined by the parameters $\theta$ and $b$, where $\theta$ is responsible for “tilting” the plane, and $b$ either pushes or pulls the plane up or down in the coordinate space. Note that all of this holds true for higher dimensional hyperplanes as well.
Keep in mind that our weights and features are represented as vectors. This allows us to rewrite the above expression by taking the dot product of $x$ and $\theta$, and then adding $b$ to the resulting scalar:
€€ (x \cdot \theta) + b €€
To make things even more concise, we will add another element $x_0$ to our vector $x$ and make sure that it is always equal to 1. We will also add a corresponding weight $\theta_0$ to our vector $\theta$. This allows us to treat the bias (now represented as $\theta_0$) as part of our vector operations, so we can simplify the above expression even further:
€€ x \cdot \theta €€
Spelled out, this is now equal to:
€€ \begin{Bmatrix}1 \cr x_1 \cr x_2\cr x_3 \cr x_4 \end{Bmatrix} \cdot \begin{Bmatrix}\theta_0 \cr \theta_1 \cr \theta_2 \cr \theta_3 \cr \theta_4\end{Bmatrix} = \theta_0 + \theta_1x_1 + \cdots + \theta_4x_4 €€
All of this may seem like mere mathematical housekeeping, but as we will see, it will keep our code neat and tidy, and — much more importantly — it will allow us to vectorize our computations. Vectorization is important because it enables significant computational speedups once we begin to handle more complex models with larger amounts of data. (For a small dataset like ours, it won’t really make a difference, but it is good practice to keep it in mind.)
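To make this concrete, here is a tiny NumPy sketch of the vectorized expression above. The feature values are just example numbers, and the $\theta$ values are arbitrary placeholders rather than learned parameters:

```python
import numpy as np

# One sample: a leading 1 for the bias, followed by the four features
x = np.array([1.0, 5.1, 3.5, 1.4, 0.2])

# The parameter vector: theta_0 is the bias, theta_1..theta_4 are the weights
theta = np.array([0.1, -0.4, 0.8, 0.3, -0.2])

# A single dot product replaces the explicit sum
# theta_0 + theta_1*x_1 + ... + theta_4*x_4
z = np.dot(x, theta)
```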
We will now define a function known as the Heaviside step function, or unit step function:
€€ \sigma(z)=\begin{cases} 1 & {\small z > 0,} \cr 0 & \text{otherwise} \end{cases} €€
If it isn’t immediately obvious from this definition, the Heaviside step function takes a scalar input $z$ and returns 1 if $z$ is positive and 0 otherwise. Its sole purpose here is to classify points below the decision boundary as 0 and points above the decision boundary as 1. The output will thus be our prediction of the Iris type.
For each sample, the output $\hat{y}$ of our model will be:
€€ \hat{y} = \sigma(x \cdot \theta) €€
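In code, the step function and the resulting prediction only take a couple of lines. The sketch below is just an illustration, with function names of my own choosing; we will fold the same logic into a class later on:

```python
import numpy as np

def heaviside(z):
    # 1 for positive inputs, 0 otherwise
    return 1 if z > 0 else 0

def predict_single(x, theta):
    # x and theta are five-dimensional vectors (bias element included)
    return heaviside(np.dot(x, theta))
```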
Now to the most important part. In order for the perceptron to ‘learn’ a function — which is to say a bias and a set of weights — that produces a plane separating the setosa and versicolor points, we need a learning algorithm. Fortunately, the perceptron learning algorithm is super simple:
€€ \Delta \theta_i = (y^{(j)} - \hat{y}^{(j)})x_i^{(j)} €€
Here, $j$ indexes the sample and $i$ indexes the feature (and the corresponding weight). $y^{(j)}$ is the true label of the $j$th sample, and $\Delta \theta_i$ is the amount we add to $\theta_i$.
If you are not used to mathematical notation, this may seem a little intimidating. Not to worry; it is easy to see what is happening here by looking at an example.
Let’s assume that, during training, the model mistakes a setosa sample for a versicolor sample. The Heaviside step function would then output a 1 when it should have given us a 0. In this case, $\Delta \theta_i$ would be:
€€ \Delta \theta_i = (0 - 1)x_i^{(j)} €€
If $x_i^{(j)}$ were, say, the petal length of our misclassified sample with a value of 1.9, the equation would look like this:
€€ \Delta \theta_i = (-1) \cdot 1.9 = -1.9 €€
We would then subtract 1.9 from $\theta_i$’s current value.
To see why we subtract, remember that the output of the model was 1 when it should have been 0. This means that the result of the dot product $(x \cdot \theta)$ was positive when it should have been negative, and therefore that the current parameter values are too high. Had we predicted a 0 when the true label was 1, the value of $\theta_i$ would have been increased by 1.9 instead.
The same learning rule applies to the bias. However, since the bias’s corresponding “feature” $x_0$ is always 1, this leaves us with:
€€ \Delta \theta_0 = (y^{(j)} - \hat{y}^{(j)}) €€
And that’s it!
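Written out in code, a single update of all five parameters for one sample $(x, y)$ might look like this, reusing the heaviside helper sketched earlier (and assuming x already contains the leading 1):

```python
# If the prediction is correct, y - y_hat is 0 and nothing changes;
# otherwise every parameter moves by +/- its corresponding feature value.
y_hat = heaviside(np.dot(x, theta))
theta = theta + (y - y_hat) * x
```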
A cool thing about the perceptron algorithm is that it is guaranteed to find a set of parameters that separates the two classes with 100% accuracy (provided, of course, that the data is linearly separable). The question is whether these parameters generalize and enable the model to correctly predict the classes of samples that weren’t in the training set.
To start off, make sure that you have Python and NumPy installed on your machine. If you are not already familiar with NumPy, it is an essential Python package for matrix and vector operations. There will be tons of these going forward.
To import and handle the Iris dataset, we will use Pandas. Both NumPy and Pandas can be installed by running `pip3 install numpy` and `pip3 install pandas` in the command line. Once you’ve got everything set up, we’ll create a new Python file and do the requisite imports at the top. We will also load the Iris dataset made available on UCI’s website. As a courtesy, you may want to download the file and reference it locally in order to spare UCI unnecessary traffic.
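Here is a rough sketch of what the imports and data preparation could look like, assuming the iris.data file has been downloaded next to the script. The column names and the shuffled 50/50 split below are simply one way of doing it:

```python
import numpy as np
import pandas as pd

# The UCI file has no header row; the last column holds the class label
df = pd.read_csv('iris.data', header=None,
                 names=['sepal_length', 'sepal_width',
                        'petal_length', 'petal_width', 'species'])

# Keep only setosa and versicolor, and encode setosa as 0, versicolor as 1
df = df[df['species'].isin(['Iris-setosa', 'Iris-versicolor'])]
y_all = np.where(df['species'] == 'Iris-versicolor', 1, 0)
features = df.drop(columns='species').values

# Prepend the constant bias "feature" x_0 = 1 to every sample
X_all = np.hstack([np.ones((features.shape[0], 1)), features])

# Shuffle and split 50/50 into training and test sets
rng = np.random.default_rng(0)
indices = rng.permutation(len(X_all))
X_train, y_train = X_all[indices[:50]], y_all[indices[:50]]
X_test, y_test = X_all[indices[50:]], y_all[indices[50:]]
```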
I hope the comments make it clear what is going on here.
We will use the training samples to train the model, and our test samples to verify whether the model generalizes and correctly predicts the classes of unseen samples. A 50/50 split between training and test samples is not very common (usually it is 60/40 or 70/30 with the majority going to the training samples), but it should suffice for this example.
Note that all training samples are now stored in a 50x5 matrix $X$. If you are unfamiliar with matrices, they can simply be thought of as containers of multiple vectors. Thus, the 50x5 matrix contains 50 samples/vectors, and each of those vectors contains the constant bias element $x_0 = 1$ followed by the four features. In other words, our matrix $X$ contains 50 $x$s.
Now that we have our data organized and ready, it is time to code the actual perceptron. Let’s start by creating a `Perceptron` class with a `fit` method that takes three parameters: `X`, `y`, and `epochs`. `X` will be our matrix of features, `y` will be a vector containing the corresponding target values, and `epochs` will be an integer specifying how many times to loop over the training data. If we didn’t specify the number of epochs, we would only have 50 iterations in which to update our parameters, which probably wouldn’t be enough for them to converge on good values.
Believe it or not, this is the whole perceptron algorithm!
We start by initializing five random weights between 0 and 1. Then, the `fit` method loops over all samples `epochs` number of times. In the innermost loop, we take the dot product between each feature vector `X[i]` and the parameter vector `self.theta`. The result of the dot product is handed to the Heaviside step function, which yields our `y_hat`, and the parameters are then updated.
To benchmark how well our model performs, we will add a `predict` method to our class that uses the learned parameters `self.theta` to predict the classes of our test samples:
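A sketch of that method, placed inside the class alongside fit, could look like this:

```python
    def predict(self, X):
        # Apply the learned parameters to every sample in X and
        # return the resulting 0/1 predictions as an array
        return np.array([self.heaviside(np.dot(x, self.theta)) for x in X])
```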
All that is left to do is feed some data to a `Perceptron` object:
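For example (the epoch count of 10 is an arbitrary choice; increase it if the parameters haven’t converged):

```python
perceptron = Perceptron()
perceptron.fit(X_train, y_train, epochs=10)

predictions = perceptron.predict(X_test)
correct = np.sum(predictions == y_test)
print(f'Accuracy: {correct / len(y_test):.2f} ({correct}/{len(y_test)} test samples correct)')
```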
When running the script, you should see the following output:
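(The exact wording depends on the print statement above; with the sketch it would be along these lines.)

```
Accuracy: 1.00 (50/50 test samples correct)
```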
Because the data is linearly separable, the algorithm found parameter values for $\theta$ that separate the two classes, and it is now able to correctly classify all 50 test samples. Had the data not been linearly separable, it would have been a whole other story.
In the next post, we will cover some slightly more complex terrain in order to see how we can get closer to modelling data that is not linearly separable.