The CLUE algorithm is a clustering algorithm written at CERN (https://www.frontiersin.org/articles/10.3389/fdata.2020.591315/full).
The original algorithm was designed to work in 2 dimensions, with the data distributed in parallel layers. Unlike other clustering algorithms, CLUE takes the coordinates of the points and also their weight, which represents their energy, and calculates the energy density of each point. This energy density is used to find the seeds for each cluster, their followers and the outliers, which are dismissed as noise. CLUE takes 4 parameters in input:
dc
, which is the side of the box inside of which the density of a point is calculated;rhoc
, which is the minimum energy density that a point must have to not be considered an outlier,dm
, which is the side of the box inside of which the followers of a point are searched;pointsPerBin
, which is the average number of points that are to be found inside a bin. This value allows to control the size of the bins.
This library generalizes the original algorithm, making it N-dimensional, and turns it into a general purpose algorithm, usable by any user and applicaple to a wider range of applications, in particular outside particle physics.
The C++ code is binded using PyBind11, and the module is created locally during the installation of the library.
In the library is defined the clusterer
class, which contains the methods for reading the data, running the algorithm, plotting the data both in input and output, and others.
Outside of the class is also defined the function test_blobs, which takes the number of points and the number of dimensions, and is a way to test quickly the library, producing some N-dimensional blobs.
Below is shown a basic example of how the library can be used:
import CLUEstering as clue
clust = clue.clusterer(.4, 2., 1.5)
clust.read_data(clue.test_blobs(1000,2))
clust.run_clue()
clust.cluster_plotter()
The main dependencies of CLUEstering are Boost (version 1.75.0+) and Alpaka. If alpaka is not found, it will be automatically fetched from the official repository, so it's not mandatory to install it manually.
To install the library, first clone the repository recursively:
git clone --recursive https://github.com/cms-patatrack/CLUEstering.git
alternatively, clone and update the submodules manually:
git clone https://github.com/cms-patatrack/CLUEstering.git
git submodule update --init --recursive
Then, inside the root directory install the library with pip:
pip install -v .
where the -v
flag is optional but suggested because provides more details during the compilation process.
The library is also available on the PyPi repository, and can be installed with:
pip install -v CLUEstering
Since version 2.0.0
the pybind module is compiled for all the supported backends using the Alpaka
portability library (https://github.com/alpaka-group/alpaka).
Currently the supported backends include:
- CPU serial
- CPU parallel using TBB
- NVIDIA GPUs
- AMD GPUs
The modules are compiled automatically at the moment of installation, and the user can choose the backend to use when running by passing a parameter to the
run_clue
method.
clust.run_clue("cpu serial")
clust.run_clue("cpu tbb")
clust.run_clue("gpu cuda")
clust.run_clue("gpu hip")
If no argument is passed, by default the serial backend is used.
It is possible to list all the available devices with the list_devices
method. If no argument is passed, the method lists all the devices for all the backends,
but it's also possible to specify the backend whose devices want to be listed.
# list devices for all backends
c.list_devices()
# specify the backend
c.list_devices('cpu serial')
c.list_devices('cpu tbb')
c.list_devices('gpu cuda')
c.list_devices('gpu hip')
The clusterer
class represents a wrapper class around the method mainRun
, which is binded from C++
and that is the method that runs the CLUE algorithm.
When an instance of this class is created, it requires at least three parameters: dc
, rhoc
and dm
. There is a fourth parameter, pPBin
, which represents the desired average number of points found in each of the bins that the clustering space is divided into. This parameter has a default value of 10
.
The parameters dc
, rhoc
and dm
must be floats
or a type convertible to a float
. ppBin
, on the other hand, is an integer
.
The class has several methods:
read_data
, which takes the data in input and inizializes the class members. The data can be in the form of list, numpy array, dictionary, string containing the path to a csv file or pandas DataFrame;change_coordinates
, which allows to change the coordinate system used for clustering;change_domains
, which allows to change the domain ranges of any eventual periodic coordinates;choose_kernel
, which allows to change the convolution kernel used when calculating the local density of each point. The default kernel is a flat kernel with parameter0.5
, but it can be changed to an exponential or gaussian kernel, or a custom kernel, which is user defined and can be any continuous function;run_clue
, which runs the CLUE algorithm;list_devices
, which lists all the available devices for the supported backends;input_plotter
, which plots all the points in input. This method is useful for getting an idea of the shape of the dataset before clustering. In addition to some plot customizations (like the colour or the size of the points, the addition of a grid, the axis labels and so on) it's also possible to pass the functions for the change of coordinates and change the coordinate system used for plotting.cluster_plotter
, which plots the data using a different colour for each cluster. The seeds are indicated by stars and the outliers by small grey crosses.to_csv
, which takes two strings, the first containing the path to a folder and the second containing the desired name for the csv file (also with the .csv suffix) and produces the csv file containing the cluster informations.
Data is read with the read_data
method.
For the data to be acceptable, it must contain the values of at least one coordinate for all the points, and their weights
. The weights
of the points represent their relative importance in the dataset, so in most cases they can all be set to 1. There are several accepted formats for providing the data:
string
, where the string contains the relative path to acsv
file. The file must contain at least one column for the coordinates, which must be named asx*
(x0
,x1
, ecc.) and one column for theweight
pandas.DataFrame
, where the columns for the coordinates must be namedx*
(x0
,x1
, ecc.) and one column should contain theweight
list
ornp.ndarray
, where the coordinate data should be provided as a list of lists (or array of arrays), and the weights inserted in a second, separate list/arraydictionary
, where the coordinate data must be contained in lists, each with keyx*
(x0
,x1
, ecc.) and the weights in another list with keyweight
If a user wants to test the library without using real data, they can easily do so using the test_blobs
method.
test_blobs
generates a dataset containing any number of gaussian blobs (4 by default), i.e. regular distributions of points distributed gaussianly in a round shape. These blobs can be generated in 2 or 3 dimensions.
It is possible to customize:
- the number of points in the dataset, through the parameter
n_samples
- how spread out the points are, through the
mean
andsigma
parameters - the spatial span over which these blobs are placed, throught the parameters
x_max
andy_max
. By modifying this parameters it's possible to make the blobs more crammed or more well separated, thus making it harder or easier to clusterize them
Since version 1.3.0
it is possible to choose what kernel to use when calculating the local density of each point.
The default choices are gaussian
, exponential
and flat
. Each of the kernels require a list of parameters:
flat
, takes a single parameterexp
, takes two parameters, the amplitude and the mean.gaus
, takes three parameters, the amplitude, the mean and the standard deviation.
The default kernel is a flat kernel with parameter 0.5
.
The different kernels can be chosen using the choose_kernel
method. It is also possible to use a user-defined kernel, by passing it as a function object. Custom kernels require an empty list to be passed as parameters.
The functions used to define a custom kernel should take three parameters:
- The distance
dist_ij
between the two points. - The id
point_id
of the fixed point. - The id
j
of the neighbouring points.
import CLUEstering as clue
clust = clue.clusterer(1., 5., 1.5) # the default kernel is flat(0.5)
# Change to an exponential kernel
clust.choose_kernel('exp', [1. 1.5])
# Now use a custom kernel, a linear function
clust.choose_kernel('custom', [], lambda x, y, z: 2 * x)
The input_plotter
and cluster_plotter
methods two plotting methods based on matplotlib. input_plotter
is intenteded to be used as a way to observe the data before clustering and getting an idea of the expected result, whereas cluster_plotter
plots the results of the clustering, plotting the points corresponding to the same cluster with the same colour and the outliers as small grey crosses.
Both methods allow for a wide range of customizations:
- the title of the plot and its size can be changed with the
plot_title
andtitle_size
parameters - the labels of the axis can be changed with the
x_label
,y_label
andz_label
parameters - the size and colour of the points. In the cluster plotter it is possible to change the size of the three classes of points (normal, seed and outlier) singularly
- it's possible to add a grid to the point with the
grid
boolean parameter, and change its style and size with thegrid_style
andgrid_size
parameters - the ticks on the axis can be changed with the
x_ticks
,y_ticks
andz_ticks
parameters
Since version 1.4.0
both plotting methods can take function objects as kwargs, where the functions represent equations for the change of coordinates, thus allowing to change the coordinate system used for plotting.
import CLUEstering as clue
clust = clue.clusterer(1., 5., 1.5)
clust.read_data(data)
# Plot the data in polar coordinates
clust.input_plotter(x0=lambda x: np.sqrt(x[0]**2 + x[1]**2),
x1=lambda x: np.arctan2(x[1], x[0]))
clust.run_clue()
This is particularly useful when the data has some kind of simmetry, which allows it to be clustered more easily in another coordinate system.
The results of the clustering can be saved to a csv file for later analysis, using the method to_csv
.
This method taks as parameter the path to the output_folder
and the file_name
, where the name of the file must include the .csv
suffix.
The data saved on the file include the entire input, so the points' coordinates and weights, as well as their cluster_id
s, an integer that indicates to what cluster they belong, and is_seed
, a boolean value indicating whether a point is a seed or not.
import CLUEstering as clue
clust = clue.clusterer(1., 5., 1.5)
clust.read_data(data)
clust.run_clue()
clust.to_csv('./output/', 'data_results.csv')