The goal of this lab session is to test the approaches for linear and non-linear dimensionality reduction that we saw during the lecture, to see their practical interest and limitations.
Before you start the lab, please refer to this page to install and configure R. To run R, simply type R in a terminal. You can write and execute commands in the R environment directly, or in dedicated source files (whose names should have the .R extension). To run a source file file.R in R, simply type source('file.R') in the R environment. To edit and execute source files you can use Emacs together with the ESS extension. Alternatively, you can use a dedicated IDE such as R-Studio. The choice of a particular IDE is always a matter of personal taste...
We will need the following packages:
Calling library() with no argument lists the installed packages. To use package pkg, type library(pkg). A package must be installed before it can be used; the command to install it is install.packages("pkg", dependencies = TRUE). The installation itself requires Internet access. It downloads either source or binary files, depending on your OS and architecture. If source files are downloaded, you may need make and a C++ compiler such as gcc to compile the package. Please refer to the dedicated webpage for more information, and note that some OSes, like Linux distributions, conveniently let you install packages globally (i.e. for all users) through their own package manager.

Implement PCA in the language R using the diagonalization of the covariance matrix, as seen during the lecture. Your function PCA should take a point cloud and a boolean as input. The cloud of n points in d dimensions will be given as a data frame with n rows and d columns. The boolean will indicate whether the data should be normalized, in addition to being centered, as a preprocessing step.
Your function should output a list whose first four cells contain respectively:
Here is our solution. If you decide not to implement PCA, then you should take the time to look at the solution carefully.
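As a point of comparison (not the course solution), here is a minimal sketch of such a PCA function. The names of the four output cells are our own choice and are only an assumption about what is expected:

```r
# A minimal PCA sketch: diagonalize the covariance matrix of the
# centered (and optionally normalized) data.
PCA <- function(X, normalize = FALSE) {
  X <- as.matrix(X)
  X <- scale(X, center = TRUE, scale = normalize)  # center, optionally normalize
  C <- cov(X)                                      # d x d covariance matrix
  eig <- eigen(C, symmetric = TRUE)                # eigenvalues come sorted decreasingly
  spectrum <- eig$values
  cumvar <- cumsum(spectrum) / sum(spectrum)       # cumulated variance ratio
  Y <- X %*% eig$vectors                           # embedding: projection on eigenvectors
  correlations <- cor(X, Y)                        # input variables vs. new variables
  list(spectrum = spectrum, cumvar = cumvar,
       embedding = Y, correlations = correlations)
}
```

The coordinates of the embedding are pairwise uncorrelated, since the covariance matrix of Y is exactly the diagonal matrix of eigenvalues.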
Now you can test your function on the decathlon and NBA data sets. To read in the data you can use the function read.table, which reads the content of a data file and stores it in a data frame. For plotting the results in 2d you can use the functions plot and text. An interesting diagram to plot as well is the correlation circle, which gives the correlations of the input variables to the new variables. To draw a circle in the plane you can use the function draw.circle from the plotrix package. Here are the results obtained on the decathlon data set using our PCA function (from left to right: spectrum of the covariance matrix, cumulated variance, 2d embedding, correlation circle):
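A correlation circle can be sketched as follows; to stay self-contained we trace the unit circle with lines() here instead of draw.circle from plotrix, and the arguments X (the data) and Y (the embedding) are our own assumed interface:

```r
# Sketch of a correlation circle: plot the correlation of each input
# variable with the first two new variables, inside the unit circle.
correlation_circle <- function(X, Y) {
  r <- cor(X, Y[, 1:2])               # correlations with the first two axes
  plot(r, xlim = c(-1, 1), ylim = c(-1, 1), asp = 1,
       xlab = "PC1", ylab = "PC2", pch = 20)
  text(r, labels = rownames(r), pos = 3)
  theta <- seq(0, 2 * pi, length.out = 200)
  lines(cos(theta), sin(theta))       # the unit circle (plotrix::draw.circle also works)
  abline(h = 0, v = 0, lty = 2)
  invisible(r)
}
```

Variables close to the circle are well represented by the first two axes; variables pointing in the same direction are positively correlated.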
Implement MDS in the language R using the diagonalization of the Gram matrix, as seen during the lecture. Your function MDS should take the same input as PCA. It should output a list with three cells only, containing respectively:
You can now apply MDS to the data set decathlon and should obtain the same result as with PCA. You can also apply it to the data set COIL-100 and should obtain the following result (from left to right: spectrum of the Gram matrix, 2d embedding, 3d embedding; to plot the result in 3d you can use the plot3d function from the rgl package):
If you want, you can also implement the Metric MDS variant and then check that you obtain the same results on the above data sets using the Euclidean distance matrix. Here is our solution.
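Here is a sketch of both variants, again not the course solution; the three output cells chosen for MDS (spectrum, cumulated variance, embedding) are our assumption, not necessarily the ones asked for above:

```r
# Classical MDS sketch: diagonalize the Gram matrix G = X X^T of the
# centered (optionally normalized) data.
MDS <- function(X, normalize = FALSE) {
  X <- as.matrix(X)
  X <- scale(X, center = TRUE, scale = normalize)
  G <- X %*% t(X)                        # n x n Gram matrix
  eig <- eigen(G, symmetric = TRUE)
  pos <- pmax(eig$values, 0)             # clip tiny negative eigenvalues
  Y <- eig$vectors %*% diag(sqrt(pos))   # embedding: V * Lambda^{1/2}
  list(spectrum = eig$values,
       cumvar = cumsum(pos) / sum(pos),
       embedding = Y)
}

# Metric-MDS variant: recover the Gram matrix from the squared distance
# matrix by double centering, G = -J D^2 J / 2 with J = I - (1/n) 11^T.
metricMDS <- function(D) {
  D2 <- as.matrix(D)^2
  n <- nrow(D2)
  J <- diag(n) - matrix(1 / n, n, n)
  G <- -0.5 * J %*% D2 %*% J
  eig <- eigen(G, symmetric = TRUE)
  eig$vectors %*% diag(sqrt(pmax(eig$values, 0)))
}
```

With a Euclidean distance matrix as input, both variants reproduce the pairwise distances of the original point cloud exactly, which is one way to check your implementation.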
Consider now the swiss-roll data set. Apply PCA to it and analyze the result. To visualize the embedding with colors, you can use the option col=v in the plot3d function, whose effect is to color data point i with the color indexed by the floor of the value stored in v[i]. Thus, you can for instance plot the final 2d embedding using colors from the original 2d embedding swiss_roll_sol, to see if both embeddings are consistent. Or, you can plot the original 3d data using colors from the final embedding, to see how it maps the data. Here is our solution.
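The experiment can be sketched on synthetic data (the lab's data file is not reproduced here, so we generate a swiss roll ourselves); PCA, being linear, projects the roll flat and mixes points that are far apart along the sheet:

```r
# Synthetic swiss roll: u parameterizes the spiral, h the height.
set.seed(4)
n <- 1000
u <- runif(n, 1.5 * pi, 4.5 * pi)
h <- runif(n, 0, 20)
X <- cbind(u * cos(u), h, u * sin(u))

# PCA embedding in 2d (covariance diagonalization, as above)
Xc <- scale(X, scale = FALSE)
pca <- eigen(cov(Xc), symmetric = TRUE)
Y <- Xc %*% pca$vectors[, 1:2]

# colors indexed by the position along the sheet
cols <- rainbow(100)[floor(99 * (u - min(u)) / (max(u) - min(u))) + 1]
# With rgl installed: plot3d(X, col = cols) shows the roll; then
# plot(Y, col = cols) shows how PCA collapses distant parts of the sheet.
```

Looking at the 2d embedding colored by u, you should see the color bands overlap, which is exactly the failure of a linear method on this data.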
In order to cope with the non-linearity of the original 3d embedding, we will use Isomap. Instead of implementing it (which would take too long for the session), we will use the implementation available in the vegan package. After installation, type ?isomap for more information on how to use the function. To use it you will need to compute the Euclidean distance matrix from the input data, for which the function dist can be used. The rest of the pipeline is the same as with MDS. Here is our solution.
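To make the pipeline concrete, here is a compact from-scratch sketch of what vegan's isomap computes (this is an illustration, not the vegan implementation, and would indeed be too slow for large inputs since Floyd-Warshall is O(n^3)): build a k-nearest-neighbor graph on the Euclidean distances, replace distances by shortest-path (geodesic) distances, then run classical MDS by double centering.

```r
# Isomap sketch: k-NN graph + shortest paths + classical MDS.
isomap_sketch <- function(X, k = 8, ndim = 2) {
  D <- as.matrix(dist(X))              # Euclidean distance matrix
  n <- nrow(D)
  # keep only the k nearest neighbors of each point, symmetrized
  G <- matrix(Inf, n, n)
  for (i in 1:n) {
    nb <- order(D[i, ])[2:(k + 1)]
    G[i, nb] <- D[i, nb]
    G[nb, i] <- D[nb, i]
  }
  diag(G) <- 0
  # Floyd-Warshall: geodesic (graph) distances
  for (m in 1:n) G <- pmin(G, outer(G[, m], G[m, ], `+`))
  # classical MDS on the geodesic distances (double centering)
  J <- diag(n) - matrix(1 / n, n, n)
  B <- -0.5 * J %*% (G^2) %*% J
  eig <- eigen(B, symmetric = TRUE)
  eig$vectors[, 1:ndim] %*% diag(sqrt(pmax(eig$values[1:ndim], 0)))
}
```

On an arc of a circle, for instance, the first coordinate of the output recovers the position along the arc, which is what "unrolling" means; vegan::isomap does the same job with a proper shortest-path algorithm.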
If you have some spare time (e.g. at home after the session), we recommend you play around with Isomap a bit, and test it against the faces data set illustrated in the top-left corner of the examples slide.
As we saw earlier with the COIL-100 data set, dimensionality reduction can help visualize the structure of the data even though it may not capture that structure directly.
Consider now the natural images data set. Apply PCA to it and analyze the result. In particular, visualize the embedding in 2d and 3d, look at the spectrum of the covariance matrix, study the correlations between the 8 new variables and the 8 initial ones. Here is our solution. What is your opinion on how well PCA performed?
If you have some time, try to run Isomap on the data set or on some subsampling of it. You can also try Kernel-PCA, e.g. MDS taking as input the Gram matrix obtained from the Gaussian kernel. What is your conclusion regarding the geometric structure of the data? Regarding its topology? In an upcoming session we will see that these are just false impressions and not reality.
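The Kernel-PCA variant mentioned above can be sketched as follows; the bandwidth sigma is a free parameter that you will need to tune to the data (the value below is an arbitrary default):

```r
# Kernel-PCA sketch: build the Gaussian-kernel Gram matrix, center it
# in feature space, and diagonalize it as in MDS.
kernelPCA <- function(X, sigma = 1, ndim = 2) {
  D2 <- as.matrix(dist(X))^2
  K <- exp(-D2 / (2 * sigma^2))        # Gaussian kernel Gram matrix
  n <- nrow(K)
  J <- diag(n) - matrix(1 / n, n, n)
  Kc <- J %*% K %*% J                  # double centering = centering in feature space
  eig <- eigen(Kc, symmetric = TRUE)
  eig$vectors[, 1:ndim] %*% diag(sqrt(pmax(eig$values[1:ndim], 0)))
}
```

Note that the output coordinates are mutually orthogonal, as in MDS, and that small sigma makes the embedding sensitive to local structure while large sigma makes Kernel-PCA behave like plain linear PCA.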