TD 1 - Clustering via Mode Seeking



Your opinion about today's lecture: Your opinion about the last exercise or lab session:

The goal of this lab session is to implement the graph-based hill-climbing algorithm of Koontz et al., together with its variant ToMATo based on topological persistence. Both were presented during the lecture.

Graph-based hill-climbing

Class Point

Create a class Point with the following fields and methods: Add the following constructor Point(Scanner sc), which creates a point from an input stream (e.g. file or standard input):
Point(Scanner sc){
	String line = sc.nextLine();
	String[] tokens = line.replaceFirst("[ |\\t]+", "").split("[ |\\t]+");
	int d = tokens.length;  // number of dimensions
	coords = new double[d];
	for (int i=0; i<d; i++)
		coords[i] = Double.parseDouble(tokens[i]);
}

Class HillClimbing

Create now a class HillClimbing with the following fields: Add the following methods:

Tests

We provide 3 data sets for your tests: test.xy (256 points), crater.xy (same distribution, 10000 points), and spirals.xy (about 70000 points). All 3 of them are 2-dimensional and therefore easy to visualize, which should make your life easier for debugging. Here are some pretty pictures (from left to right: test, crater, spirals):




These pictures were captured from the window produced by the following class: ClusteringWindow.java, which you can download and put together with the other source files. To use it, simply add the following instruction:
new ClusteringWindow(cloud, label, neighbors, k);
where each parameter corresponds to one of the fields of your class HillClimbimg, except for k which you must choose smaller than the number of neighbors in your neighborhood graph.

We recommend you test your code on the data set test first, to have fast runs and easy debugging. Then you can proceed with crater (takes a few dozens of seconds to handle on a decent machine), and then finally spirals (takes about half an hour). Try the following sets of parameters:

// parameters for test.xy
kDensity = 10;
kGraph = 5;

// parameters for crater.xy
kDensity = 50;
kGraph = 15;

// parameters for spirals.xy
kDensity = 100;
kGraph = 30;
You should obtain the following results (19 clusters for test, 92 clusters for crater, 273 clusters for spirals):




ToMATo

Add the following method to your class HillClimbing: Modify now your method computePersistence as follows: You can test your code on the previous data sets using the following sets of parameters:
// parameters for test.xy
String filename = "test.xy";
kDensity = 10;
kGraph = 5;
tau = 0.35;

// parameters for crater.xy
String filename = "crater.xy";
kDensity = 50;
kGraph = 15;
tau = 2;

// parameters for spirals.xy
kDensity = 100;
kGraph = 30;
tau = 0.03;
You should obtain the following results (6 clusters for test and crater, 2 clusters for spirals):




You can also make your method computePersistence output the sorted list of the prominences of the various density peaks. If you plot this sequence of prominences, then you will see the `prominence gap' corresponding to the aforementioned numbers of clusters.

Optimizations

The pacing phase of the pipeline is clearly the computation of the neighborhood graph. It takes quadratic time in the input point, regardless of the sparsity of the resulting graph. To optimize this phase you have several options:
  1. Implement your own data structure for efficient proximity queries, kd-tree for instance. For this you can refer to Section 1 of the following TD from INF562 for building and queryig a kd-tree, and if you have more time you can even do the whole TD which is about Mean-Shift, the other mode-seeking algorithm mentioned in today's lecture. :-)
  2. If your data points lie in 2D or 3D, then you can build their Delaunay triangulation. Then, the k-NN of each data point are found among its neighbors, their own neighbors, etc. within the triangulation, hence by a simple graph traversal. For this you can refer to the session on Delaunay triangulations in the course INF562.
  3. You can also convert your code to C++ and use one of the existing libraries for NN search, such as ANN (which is based on a variant of the kd-trees and is extremely efficient in low and medium dimensions). This is probably the best option for optimizing this part of you code.
The other part of the code that can be optimized is the persistence calculation. Until now we merely used the field parent to store the trees of the clusters during the merging process. The entailed complexity may be up to quadratic, although this rarely happens (if at all) in practice. Now you can implement a full-fledged union-find data structure, with intelligent merges and path compression. This should keep the persistence calculation to a near-linear complexity.
Here is our solution in Java (Point.java and HillClimbing.java) and in Python (HillClimbing.zip) without optimizations.