By Andrew C. Oliver, Columnist, InfoWorld

How classification and clustering work: the easy way

People are often confused about what classification and clustering are and how they differ. So here is an explanation done the old-fashioned way: in an Excel spreadsheet.

Machine learning gets a lot of buzz. The two most talked-about classes of algorithms are classification and clustering. Classification is assigning things a label. Clustering is grouping things that look like they go together. Yet people are often confused about what these are and what the difference is.

That confusion is partly because many explanations quickly go into a bunch of formulas. Instead, here is an explanation of clustering and classifying things the old-fashioned way: in an Excel spreadsheet.

How classification works

Let’s say that you want to predict which students will likely graduate and which students will likely drop out. Perhaps you want to flag the likely dropouts so you can assign them a counselor. So, you have two labels: at-risk and low-risk. To do this using classification, you need a training set of students whose outcome is already known: those who graduated and those who dropped out.

(Please note that I acquired this data the same way a stable genius does: I made it up. Don’t use it for anything but understanding what classification means.)

Forget the algorithm for now. Let’s use this spreadsheet:

In the sheet’s data are some patterns among GPA, number of suspensions, and whether the student has been expelled. Mentally, you can make some correlations and note some exceptions.


Classification example in Excel

The source data for the classification example.

So, based on the following data, can you decide who is likely to graduate? If so, congratulations! You’re a classification algorithm.
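If you would rather see that mental exercise as code, here is a minimal sketch in Python. It is not the spreadsheet's data or any particular library's method: the training rows are invented for illustration, and the technique is a bare-bones nearest-neighbor lookup (label a new student the same way as the most similar known student). The three students being classified are the Student AA, BB, and CC rows that appear in this article, on the assumption that their columns are suspensions, expelled, and GPA.

```python
# Hypothetical training rows, invented for this sketch (NOT the article's spreadsheet).
# Each row is (gpa, suspensions, expelled, label).
TRAINING = [
    (3.8, 0, False, "low-risk"),
    (3.1, 1, False, "low-risk"),
    (2.9, 0, False, "low-risk"),
    (1.9, 3, False, "at-risk"),
    (1.4, 2, True, "at-risk"),
    (2.2, 4, True, "at-risk"),
]


def distance(a, b):
    """Squared distance between two (gpa, suspensions, expelled) tuples."""
    # float() turns the expelled flag into 0.0 or 1.0 so it can be compared numerically.
    return sum((float(x) - float(y)) ** 2 for x, y in zip(a, b))


def classify(gpa, suspensions, expelled):
    """Return the label of the most similar training student (1-nearest neighbor)."""
    features = (gpa, suspensions, expelled)
    nearest = min(TRAINING, key=lambda row: distance(features, row[:3]))
    return nearest[3]


# Students from the article, assuming the columns are suspensions, expelled, GPA.
students = {
    "Student AA": (4.0, 1, False),
    "Student BB": (1.5, 1, True),
    "Student CC": (3.2, 0, False),
}

for name, (gpa, suspensions, expelled) in students.items():
    print(name, classify(gpa, suspensions, expelled))
```

The point is the same one the spreadsheet makes: the labels come from examples whose outcome is already known, and a new student gets the label of whatever known students he or she most resembles. A real classifier would weigh the columns more carefully than this sketch does.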

How clustering works

Now let’s look at clustering. I have no labels for this data set. I just want the computer to effectively find the ones that are like the other ones and group them.

This data also has some patterns in it that you can see: The first and last columns are probably meaningless for grouping purposes. However, there are several rows that have 1 1 1 in the first field. In fact, there are some that have 1 1 1 and then 0 0 0 and then 1 1 1. Now group those rows into a cluster.

You can probably find the opposite pattern as well. That is another cluster.

You may also find some smaller matches, like 1 1 1 0 0 0 1 1 (it’s not in the sample data here, so you’re not missing anything). Group that one; it can also be a cluster.


Clustering example in Excel

The source data for the clustering example

There are various algorithms that do this computationally. Some even do different forms of classification and clustering. However, the basic idea is that this is something you can do in Excel.
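For the curious, the grouping-by-eye described above can be sketched in a few lines of Python. This is only one simple way to do it, a greedy grouping by how many positions differ between two rows (the Hamming distance), and not a named algorithm from this article. The rows are invented to mimic the 1 1 1 0 0 0 1 1 1 pattern and its opposite, and the `max_diff` threshold is an assumption chosen for the example. Notice that no labels appear anywhere.

```python
# Hypothetical rows of 0/1 fields, invented to mimic the patterns described in the text.
# The meaningless first and last columns are assumed to have been dropped already.
ROWS = [
    (1, 1, 1, 0, 0, 0, 1, 1, 1),
    (0, 0, 0, 1, 1, 1, 0, 0, 0),
    (1, 1, 1, 0, 0, 0, 1, 1, 0),
    (0, 0, 0, 1, 1, 1, 0, 0, 0),
    (1, 1, 1, 0, 0, 0, 1, 1, 1),
    (0, 1, 0, 1, 1, 1, 0, 0, 0),
]


def hamming(a, b):
    """Count the positions where two rows differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster(rows, max_diff=1):
    """Greedily group rows: a row joins the first cluster whose first member
    differs from it in at most max_diff positions, else it starts a new cluster."""
    clusters = []
    for row in rows:
        for group in clusters:
            if hamming(row, group[0]) <= max_diff:
                group.append(row)
                break
        else:
            # No existing cluster was close enough, so this row starts its own.
            clusters.append([row])
    return clusters


for number, group in enumerate(cluster(ROWS), start=1):
    print("Cluster", number)
    for row in group:
        print("  ", row)
```

With these made-up rows, the 1 1 1 0 0 0 1 1 1 lookalikes land in one group and the opposite pattern lands in another, which is all clustering means: like goes with like, and nobody told the code what the groups were supposed to be.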


Andrew C. Oliver is a columnist and software developer with a long history in open source, databases, and cloud computing. He founded Apache POI and served on the board of the Open Source Initiative. Oliver has helped with marketing in startups including JBoss, Lucidworks, and Couchbase. He advises startups on marketing, growth, and outreach.

Student	Suspensions	Expelled	GPA
Student AA	1	no	4
Student BB	1	yes	1.5
Student CC	0	no	3.2