
Clustering is a widespread unsupervised machine learning technique whose aim is to group “similar” data points together in any data space. Clustering is used to group unlabeled data sets in many different application areas, such as market research, image processing, and pattern recognition. In the DEEP-EST project, clustering is used to remove noise and to identify buildings and other structures in a LiDAR point cloud. The picture below shows the result of clustering a LiDAR data set: different colors represent different clusters.


Image (CC BY-NC-ND 4.0) from Götz, Richerzhagen, Bodenstein, Cavallaro, Glock, Riedel, Benediktsson: On Scalable Data Mining Techniques, Procedia Computer Science, Volume 51, Elsevier, 2015.

Since its conception, Density-Based Spatial Clustering of Applications with Noise (DBSCAN) has found extensive use in both academia and industry. It remains to this day one of the most cited algorithms in its field, and its main strengths lie in its ability to discover clusters of arbitrary shape while remaining resistant to outliers and noise. However, due to its inherently sequential formulation, the parallelization of DBSCAN remains an open and active research topic. This is mainly because most parallel DBSCAN algorithms rely on established data-partitioning methods that are sequential in nature and difficult to parallelize, or lack an algorithmic design that can take advantage of the massive parallelism offered by GPGPU accelerators.
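To make the sequential nature of the algorithm concrete, the classic DBSCAN procedure can be sketched in a few lines of Python. This is a naive O(n²) illustration for small inputs, not the project's implementation: every point with at least min_pts neighbors within radius eps becomes a core point, and clusters grow outward from core points, with unreachable points labeled as noise.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id; -1 marks noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Brute-force eps-neighborhood query (includes the point itself).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:           # not a core point
            labels[i] = NOISE
            continue
        labels[i] = cluster                # start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:         # border point: claim it, don't expand
                labels[j] = cluster
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_seeds = neighbors(j)
            if len(j_seeds) >= min_pts:    # j is also a core point: expand
                queue.extend(j_seeds)
        cluster += 1
    return labels
```

Because the cluster expansion loop grows one cluster at a time from a seed point, a direct parallelization is not obvious, which is exactly the difficulty described above.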


The University of Iceland, a member of the DEEP-EST consortium, is developing a new DBSCAN algorithm, NextDBSCAN, built on a novel data structure that suits all types of parallel processing, including GPGPU accelerators, and performs very fast data partitioning and comparison in parallel. In recent benchmarks, the latest version of NextDBSCAN vastly outperformed previous state-of-the-art parallel DBSCAN algorithms used in HPC: it was the fastest algorithm in all benchmarked cases, measuring up to 400x faster than the second-fastest algorithm.
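The exact data structure used by NextDBSCAN is not described here, but one common way to make the eps-neighborhood queries of DBSCAN parallel-friendly is grid-based indexing: hash every point into a cell of side eps, after which each query only needs to scan the adjacent cells, and both the hashing and the queries are independent per point. The sketch below illustrates that general idea only; it is a hypothetical example, not NextDBSCAN's actual structure.

```python
from collections import defaultdict
from itertools import product
import math

def build_grid(points, eps):
    """Hash each point index into a cell of side eps (parallelizable per point)."""
    grid = defaultdict(list)
    for i, p in enumerate(points):
        cell = tuple(math.floor(c / eps) for c in p)
        grid[cell].append(i)
    return grid

def grid_neighbors(points, grid, eps, i):
    """eps-neighborhood of point i, scanning only the 3^d adjacent cells."""
    p = points[i]
    cell = tuple(math.floor(c / eps) for c in p)
    out = []
    for offset in product((-1, 0, 1), repeat=len(p)):
        adj = tuple(c + o for c, o in zip(cell, offset))
        for j in grid.get(adj, ()):
            if math.dist(p, points[j]) <= eps:
                out.append(j)
    return out
```

Since every point can be hashed and queried independently, this style of indexing maps naturally onto multi-core CPUs and GPGPUs, unlike tree-based spatial indexes that are built sequentially.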


The figure below shows the total runtime difference, including I/O, between the current CPU and GPGPU versions of NextDBSCAN. The measurements were carried out on a single node of the DEEP-EST Data Analytics Module (DAM), which is well suited for the task with its strong multi-core CPUs, abundant RAM, and powerful GPGPU accelerators. The input was a tile from AHN3, a LiDAR data set containing the digital elevation of the whole of the Netherlands, consisting of 527 million points in three dimensions. For the chosen input parameters, DBSCAN found approximately 100,000 clusters.

Notably, for this case NextDBSCAN completed in minutes where the other parallel DBSCAN algorithms needed hours. Additionally, as the figure illustrates, the difference between the CPU and GPGPU versions was significant, especially when fewer CPU cores were used. This difference is expected to grow further with a larger input data set or one with more than three dimensions. Note that since the GPGPU-based version runs mainly on the GPGPU, the number of CPU cores used matters little, which allows computing even with a less powerful CPU.