An Overview of K-Means & DBSCAN Clustering
Data scientists are always looking for better ways to sort data quickly and classify it in the most consumable manner possible. There are many methods to choose from, but two of the most widely used clustering algorithms are k-means and DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
Both algorithms use what is known as unsupervised learning to group data points by similarity. Where k-means is a relatively simple clustering algorithm, DBSCAN is slightly more complicated, and for good reason: on data with irregularly shaped clusters or noise, DBSCAN's results tend to be more reliable than those of k-means, though that doesn't mean k-means is not useful.
What Is Clustering?
Before we get into the details of what k-means and DBSCAN represent, it is important to understand what clustering is.
Clustering is an ML (machine learning) technique that groups data points together based on how similar they are to one another. The technology is used by a wide range of industries, including but not limited to:
· Uber for route optimization,
· Amazon, Spotify, YouTube, Netflix, and other similar platforms to ‘recommend’ products,
· Data mining and more.
Here is a simple diagram to explain how data clustering works:
Both k-means and DBSCAN sort data into groups, but they do so differently, and so their results differ. DBSCAN takes a bit more time and effort, but its results are often more reliable.
What Is K-Means Clustering?
K-means clustering (devised by MacQueen, 1967) is one of the most basic clustering algorithms out there. The beauty of this algorithm lies in its speed and relative accuracy; because it is so easy to implement, k-means clustering is much more widely used than DBSCAN.
It is a centroid-based (partitioning) clustering algorithm, commonly implemented in Python via libraries such as scikit-learn, though the algorithm itself is language-agnostic. Once executed, it sorts the data in a given sample space into k similarity groups, with similarity usually measured by Euclidean distance. K-means clustering's main objective is to minimize an objective function: the sum of squared distances between each point and the centroid of its cluster.
Here is a chart to show the workings of the algorithm:
The answer you get is an assignment of every point to one of the k clusters. Unfortunately, k-means clustering assumes that for every cluster, each direction is equally important. Most of the time this doesn't present an issue, but sometimes data can be oddly shaped, e.g., clusters may be elongated or non-spherical, and k-means will split or merge them incorrectly.
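To make the assign-and-update loop concrete, here is a minimal sketch of k-means in plain NumPy. The data, value of k, and iteration count are made up purely for illustration; a real workflow would use scikit-learn's KMeans instead.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))   # 30 illustrative points in 2-D
k = 3
# Pick k random points as the initial centroids
centroids = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(10):
    # Assignment step: each point joins its nearest centroid (Euclidean distance)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each centroid moves to the mean of its assigned points
    # (a centroid that loses all its points simply stays put)
    centroids = np.array([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(k)
    ])

# The objective being minimized: within-cluster sum of squared distances
sse = ((X - centroids[labels]) ** 2).sum()
```

Each pass through the loop can only lower (or keep equal) the SSE, which is why the algorithm converges.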
Some of the most common parameters for k-means clustering when using Python's scikit-learn include:
· init — This parameter controls how the algorithm is initialized. Setting it to "random" picks the initial centroids at random; scikit-learn's default, "k-means++", spreads the initial centroids apart in a principled way, which speeds up convergence.
· n_clusters — This parameter sets the value of k, the number of clusters to form. It is perhaps the most important parameter in this algorithm.
· n_init — This sets the number of independent initializations to perform. By default, scikit-learn runs the algorithm 10 times and keeps the run with the lowest SSE (sum of squared errors, also called inertia).
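The parameters above map directly onto scikit-learn's KMeans class. A short sketch on synthetic data (the blob layout is invented purely for demonstration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: 300 points drawn around 4 centers (illustrative only)
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=42)
labels = km.fit_predict(X)

# inertia_ is the SSE of the best of the n_init runs
print(km.inertia_)
```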
A visual representation of what k-means clustering does for you is as follows:
Assuming the data points fall at random into the circled regions, running the k-means algorithm classifies them as the image shows.
DBSCAN Clustering
DBSCAN, or Density-Based Spatial Clustering of Applications with Noise, is an algorithm that classifies data (and noise) and separates the different classes spatially. The reason DBSCAN's results tend to be more reliable (at the cost of more computation) is that while sorting the data, it detects outliers and identifies noise, setting it aside separately.
For that to happen, it builds clusters only around data points that have a minimum number of neighboring points (M) within a radius (R).
A key factor in DBSCAN's reliability is that it may revisit data points while expanding clusters outward from dense regions, which also makes the model a bit slower. It groups objects that fall within a radius ε (epsilon) of one another.
DBSCAN doesn't rely on guesswork about the number of clusters; instead, it infers them from two hyperparameters: epsilon and minPoints.
- Epsilon (ε): This is the distance measure to check the density of points in any given neighborhood.
- minPoints (n): This is the minimum number of points that need to be in a neighborhood for it to be considered dense.
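In scikit-learn these two hyperparameters are exposed as eps and min_samples. A small sketch on synthetic blobs (the blob centers and the eps/min_samples values are assumptions chosen to suit this toy data):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Three well-separated synthetic blobs (layout chosen for illustration)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [0, 5]],
                  cluster_std=0.4, random_state=0)

db = DBSCAN(eps=0.5, min_samples=5)   # eps = epsilon, min_samples = minPoints
labels = db.fit_predict(X)

# DBSCAN labels noise as -1, so exclude it when counting clusters
n_clusters = len(set(labels) - {-1})
```

Note that, unlike k-means, the number of clusters was never passed in; it falls out of the density parameters.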
Here is a chart showing how the algorithm works:
For reference, core points are those data points that have at least M points within the specified radius R. Border points have fewer than M points within R themselves but lie within R of a core point. Outlier points are neither core points nor within reach of one; the algorithm labels them as noise.
Core points are identified with the help of two main parameters, min_samples and eps (scikit-learn's names for minPoints and epsilon). When classifying data, the DBSCAN algorithm checks whether there are at least min_samples data points within a distance eps of a given point. If there are, that point is given core status.
Data points that fall within the eps range of core samples are then put into the same cluster by DBSCAN.
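A tiny hand-built example makes the core/border/noise distinction visible (the coordinates and parameter values here are invented for illustration):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Five closely spaced points plus one far-away outlier
X = np.array([[0.0], [0.1], [0.2], [0.3], [0.4], [5.0]])

# In scikit-learn, min_samples counts the point itself, so a core point
# needs at least 3 points (itself included) within eps = 0.15
db = DBSCAN(eps=0.15, min_samples=3).fit(X)

print(db.core_sample_indices_)  # interior points granted core status
print(db.labels_)               # cluster ids; -1 marks noise
```

The endpoints of the dense run become border points (too few neighbors to be core, but within eps of one), and the outlier at 5.0 is labeled -1.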
This is what the result looks like when the same sample fed to the k-means algorithm is used. The distinction between clusters is much clearer here, and the results are more consistent with this model.
There is still some noise, but it is (relatively) easy to sort out. Tuning eps and min_samples and re-running the algorithm can further refine the results.
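To see the two algorithms diverge on awkwardly shaped data, a common sanity check is scikit-learn's two-moons dataset (the eps and min_samples values here are assumptions tuned to this particular toy data):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaved half-circles: a shape k-means handles poorly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# k-means splits the moons with a straight boundary, while DBSCAN,
# following density, typically recovers each moon as its own cluster
```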
Neither k-means nor DBSCAN can be considered the best clustering algorithm out there, but they are among the most commonly used because of how easy they are to implement. Other models can be more reliable, yes, but they demand more computing power, have larger storage requirements, and may take more time. After all, efficient resource allocation and results are part of a data scientist's job!