TD 1 - Nearest Neighbor Search
In this practical session we explore a variant of random-projection trees (RP-trees) designed to solve the vector quantization problem efficiently. The idea is to change the splitting rule at each node of the tree, so that the resulting quantization error scales with the intrinsic dimension of the data rather than with its ambient dimension.
The paper under study can be found here.
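To fix ideas before diving in, here is a minimal Python sketch of the flavor of this two-case splitting rule: cells that are roughly round are split at the median of a random projection, while elongated cells are split by distance to their mean. Recall that vector quantization seeks k centers minimizing the average squared distance from each data point to its nearest center; in the tree, the leaf means play the role of the centers. The sketch below is an illustration, not the paper's exact rule: the threshold constant c and the median split points in particular are placeholders, and the paper chooses its split points more carefully.

```python
import numpy as np
from scipy.spatial.distance import pdist

def average_diameter_sq(S):
    """Average squared interpoint distance of the cell
    (equal to twice the average squared distance to the mean)."""
    mu = S.mean(axis=0)
    return 2.0 * np.mean(np.sum((S - mu) ** 2, axis=1))

def diameter_sq(S):
    """Squared diameter of the cell (naive O(n^2) computation, fine for a sketch)."""
    d = pdist(S)
    return float(d.max()) ** 2 if d.size else 0.0

def rp_split(S, c=10.0, rng=None):
    """One split of a cell S (an (n, D) array) via a two-case rule.

    Roughly round cells are split by a random projection; elongated
    cells are split by distance to the mean. Returns a boolean mask
    selecting the 'left' child.
    """
    if rng is None:
        rng = np.random.default_rng()
    if diameter_sq(S) <= c * average_diameter_sq(S):
        # Cell is not too elongated: random-projection split.
        v = rng.standard_normal(S.shape[1])
        v /= np.linalg.norm(v)            # random unit direction
        proj = S @ v
        return proj <= np.median(proj)    # median split (a simplification)
    else:
        # Elongated cell: split by distance to the mean.
        mu = S.mean(axis=0)
        dist = np.linalg.norm(S - mu, axis=1)
        return dist <= np.median(dist)
```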
Reading the paper
Let us first read the paper together. Our goal is to spot the important parts quickly, and to get a fairly precise idea of their content within a limited amount of time. Typically, we should ask ourselves the following questions about the context of the paper:
- What is the problem considered? What is its mathematical formulation?
- What are the main bottlenecks? What are the usual workarounds?
- What are the challenges associated with these workarounds?
And now about the contribution:
- What is the proposed approach? How new is it?
- What are its competitors? What distinguishes it from them?
Finally about the method itself:
- What are the details of the method?
- Does it come with theoretical guarantees? Any restrictions or limitations?
- How does it compare with the state of the art, both in theory and in practice?
Going beyond the paper
Pick one or more questions left open in the paper. These can be of a theoretical nature, of a practical nature, or both. In the present case you can, for instance, investigate:
- The practical behavior of the method: implement it on your computer (for this you can get inspiration from here), and then test it against competing methods (typically k-means) on a range of datasets (for those you can look into the list of popular datasets on the UCI Machine Learning Repository); a sketch of such an experiment is given after this list. Don't hesitate to provide constructive criticism of the practical behavior of the method, in particular its dependence on the ambient and intrinsic dimensions, in view of what is claimed in the paper. Note: you can use PCA/MDS or non-linear variants to get an idea of the intrinsic dimension (see e.g. here).
- The theoretical guarantees associated with the method. These are fairly specific here: they compare the performance of the method against 1-means rather than k-means. Can you get bounds that involve the optimum of k-means in the low-intrinsic-dimension case?
- Variants of the method. Can you improve the splitting rule, so that better bounds can be achieved? What about variants of the method for different problems (e.g. k-center)?
- Any other question that we collectively consider interesting and relevant.
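For the first item above, here is a possible starting point, assuming the rp_split sketch given earlier: grow a tree by repeatedly splitting the largest cell until there are k leaves, take the leaf means as centers, and compare the resulting quantization error with scikit-learn's k-means. The dataset path, the value of k, and the 95% variance threshold in the PCA-based intrinsic-dimension proxy are all placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def grow_tree_centers(X, k, rng=None):
    """Split the largest cell with rp_split until there are k leaves;
    return the leaf means, which serve as the codebook."""
    if rng is None:
        rng = np.random.default_rng(0)
    leaves = [X]
    while len(leaves) < k:
        leaves.sort(key=len, reverse=True)
        S = leaves.pop(0)
        mask = rp_split(S, rng=rng)
        if mask.all() or not mask.any():  # degenerate split: stop early
            leaves.append(S)
            break
        leaves += [S[mask], S[~mask]]
    return np.array([S.mean(axis=0) for S in leaves])

def quantization_error(X, centers):
    """Mean squared distance from each point to its nearest center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d2.min(axis=1).mean()

X = np.loadtxt("some_uci_dataset.csv", delimiter=",")  # placeholder dataset
k = 64                                                 # placeholder codebook size

err_tree = quantization_error(X, grow_tree_centers(X, k))
km = KMeans(n_clusters=k, n_init=10).fit(X)
err_km = quantization_error(X, km.cluster_centers_)
print(f"RP-tree: {err_tree:.4g}   k-means: {err_km:.4g}")

# Crude intrinsic-dimension proxy: number of principal components
# needed to explain 95% of the variance.
ratios = np.cumsum(PCA().fit(X).explained_variance_ratio_)
print("intrinsic dim ~", int(np.searchsorted(ratios, 0.95)) + 1)
```

If the paper's claim holds, the gap between the two errors should remain moderate on datasets whose intrinsic dimension is low, even when the ambient dimension is large; this is exactly the dependence the first item asks you to probe.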
Last modification: Sep. 28, 2018