In this practical session, we will use the various TDA tools presented in class in order to run data science tasks (inference, clustering, classification) on a data set of 3D shapes. As in the first practical session, we will use Gudhi
(see first practical session for installation instructions). The different sections of this notebook can be run independently (except Section 0 which is mandatory), so feel free to start with the project that sounds the more interesting to you :-)
Note also that if you choose to switch from a section to another, make sure to clear all variables first (and run Section 0 again) since some variable names are shared between sections.
import gudhi as gd
print(gd.__version__)
3.6.0a0
import gudhi.clustering.tomato as gdt
import gudhi.representations as gdr
The TensorFlow
module of Gudhi
is only required in Section 4.
import gudhi.tensorflow as gdtf
Other than that, you are free to use whatever other Python package you feel comfortable with :-) We make some suggestions below (these dependencies are also required to run our solutions to the exercises).
import os
import sys
We will use three standard Python libraries: NumPy
, Scipy
and Matplotlib
.
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
%matplotlib notebook
In order to visualize 3D shapes, we will use meshplot
.
import meshplot as mp
Finally, some dependencies are section-specific: we list those below.
In Section 1, when running bootstrap tests on ToMATo, we will use the statistics
Python module. These tests will be based on the Laplace-Beltrami operator, which can be computed with robust_laplacian
.
import statistics
import robust_laplacian as rlap
In Section 2, we will use the networkx
package to visualize and run computations on Mapper graphs.
import networkx as nx
In Sections 3 and 4, when computing vectorizations and performing supervised machine learning and deep learning tasks, we will use various modules of Scikit-Learn
and TensorFlow
.
import sklearn.preprocessing as skp
import sklearn.neighbors as skn
import sklearn.model_selection as skm
import sklearn.decomposition as skd
import sklearn.manifold as skf
import sklearn.pipeline as skl
import sklearn.svm as sks
import sklearn.ensemble as ske
import itertools
import tensorflow as tf
We are good to go! First things first, we have to download the data set. It can be obtained here. Extract it, and save its path in the dataset_path
variable.
dataset_path = './3dshapes/'
As you can see, the data set in split in several categories (Airplane
, Human
, Teddy
, etc), each category having its own folder. Inside each folder, some 3D shapes (i.e., 3D triangulations) are provided in .off
format, and face (i.e., triangle) labels are provided in text files (extension .txt
).
Every data science project begins by some preprocessing ;-)
Write a function off2numpy
that reads information from an .off
file and store it in two NumPy
arrays, called vertices
(type float and shape number_of_vertices x 3---the 3D coordinates of the vertices) and faces
(type integer and shape number_of_faces x 3---the IDs of the vertices that create faces). Write also a function get_labels
that stores the face labels of a given 3D shape in a NumPy
array (type string or integer and shape [number_of_faces].
def off2numpy(shape_name):
with open(shape_name, 'r') as S:
S.readline()
num_vertices, num_faces, _ = [int(n) for n in S.readline().split(' ')]
info = S.readlines()
vertices = np.array([[float(coord) for coord in l.split(' ')] for l in info[0:num_vertices]])
faces = np.array([[int(coord) for coord in l.split(' ')[1:]] for l in info[num_vertices:]])
return vertices, faces
def get_labels(label_name, num_faces):
L = np.empty([num_faces], dtype='|S100')
with open(label_name, 'r') as S:
info = S.readlines()
labels, face_indices = info[0::2], info[1::2]
for ilab, lab in enumerate(labels):
indices = [int(f)-1 for f in face_indices[ilab].split(' ')[:-1]]
L[ np.array(indices) ] = lab[:-1]
return L
You can now apply your code and use meshplot
to visualize a given 3D shape, say 61.off
in Airplane
, and the labels on its faces.
vertices, faces = off2numpy(dataset_path + 'Airplane/61.off')
label_faces = get_labels(dataset_path + 'Airplane/61_labels.txt', len(faces))
mp.plot(vertices, faces, c=skp.LabelEncoder().fit_transform(label_faces))
Renderer(camera=PerspectiveCamera(children=(DirectionalLight(color='white', intensity=0.6, position=(0.0636740…
<meshplot.Viewer.Viewer at 0x7f2afa66ba90>
If meshplot
does not work, we also provide a fix with matplotlib
(it requires converting the face labels into point labels though).
def face2points(vals_faces, faces, num_vertices):
vals_points = np.empty([num_vertices], dtype=type(vals_faces))
for iface, face in enumerate(faces):
vals_points[face] = vals_faces[iface]
return vals_points
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(vertices[:,0], vertices[:,1], vertices[:,2], s=1,
c=skp.LabelEncoder().fit_transform(
face2points(label_faces, faces, len(vertices))))
plt.show()
In this section, our goal is to use the ToMATo algorithm to compute segmentations of 3D shapes, i.e., to assign labels to 3D shape vertices in an unsupervised way, that is, without training on known labels. This task was initially explored in this article.
Overall, the main idea is to run ToMATo on the neighborhood graph given by the triangulation, with the so-called Heat Kernel Signature (HKS) as the filter. This is motivated by the fact that the HKS function typically takes higher values on the parts of the 3D shape that are very curved (such as, e.g., the tips of fingers in human hand shapes).
The HKS was defined in this article. It is related to the heat equation on a given 3D shape $S$:
$$\Delta_S f = -\frac{\partial f}{\partial t}.$$More formally, the HKS function with parameter $t >0$ on a vertex $v\in S$, and denoted by ${\rm HKS}_t(v)$, is computed as:
$${\rm HKS}_t(v) = \sum_{i=0}^{+\infty} {\rm exp}(-\lambda_i\cdot t)\cdot \phi_i^2(v),$$where $\{\lambda_i, \phi_i\}_i$ are the eigenvalues and eigenvectors of $\Delta_S$.
Intuitively, ${\rm HKS}_t(v)$ is the amount of heat remaining on $v$ at time $t$, after unit sources of heat have been placed on each vertex at time t=0
.
Let's first pick a 3D shape. For instance, use Hand/181.off
(or any other one you would like to try).
vertices, faces = off2numpy(dataset_path + 'Hand/181.off')
mp.plot(vertices, faces)
Renderer(camera=PerspectiveCamera(children=(DirectionalLight(color='white', intensity=0.6, position=(-0.042500…
<meshplot.Viewer.Viewer at 0x7f2a7ad2b820>
Now, use robust_laplacian
to compute the first 200 eigenvalues and eigenvectors of its Laplacian (you can use the eigsh
function of SciPy
for diagonalizing the Laplacian).
laplacian, mass = rlap.mesh_laplacian(vertices, faces)
egvals, egvecs = sp.sparse.linalg.eigsh(laplacian, 200, mass, sigma=1e-8)
Write a function HKS
that uses these eigenvalues and eigenvectors, as well as a time parameter, to compute the HKS value on a given vertex.
def HKS(t, egvals, evecs):
return np.sum(np.multiply( np.exp(-egvals * t)[None,:], np.square(egvecs) ), axis=1)
Visualize the function values with meshplot
for different time parameters.
mp.plot(vertices, faces, c=HKS(1e-1, egvals, egvecs))
Renderer(camera=PerspectiveCamera(children=(DirectionalLight(color='white', intensity=0.6, position=(-0.042500…
<meshplot.Viewer.Viewer at 0x7f2a7ad36d90>
Recall that ToMATo requires, in addition to the filter, a neighborhood graph built on top of the data. Fortunately, we can use the triangulations of our 3D shapes as input graphs! Write a function get_neighborhood_graph_from_faces
that computes a neighborhood graph (in the format required by ToMATo) from the faces of a triangulation.
def get_neighborhood_graph_from_faces(faces, num_vertices):
NG = [[] for _ in range(num_vertices)]
for face in faces:
[i1, i2, i3] = face
NG[i1].append(i2)
NG[i2].append(i1)
NG[i2].append(i3)
NG[i3].append(i2)
NG[i1].append(i3)
NG[i3].append(i1)
NG = [np.unique(neighbs) for neighbs in NG]
return NG
Finally, apply ToMATo (with no prior on the number of clusters or merging threshold) on the neighborhood graph and the HKS function associated to a given time parameter.
neighborhood_graph = get_neighborhood_graph_from_faces(faces, len(vertices))
function = HKS(1e-1, egvals, egvecs)
tomato = gdt.Tomato(graph_type='manual', density_type='manual', n_clusters=None, merge_threshold=0)
tomato = tomato.fit(X=neighborhood_graph, weights=function)
Visualize the persistence diagram produced by ToMATo.
plt.figure()
tomato.plot_diagram()
plt.show()
How many points do you see standing out from the diagonal? Use this number to re-cluster.
tomato.n_clusters_ = 5
Visualize the 3D shape with the ToMATo labels.
mp.plot(vertices, faces, c=tomato.labels_)
Renderer(camera=PerspectiveCamera(children=(DirectionalLight(color='white', intensity=0.6, position=(-0.042500…
<meshplot.Viewer.Viewer at 0x7f2a79de35e0>
Does our segmentation make sense? Can you interpret the boundaries between labels?
Since the boundaries are driven by the elder rule, they can seem a bit shaggy. In order to fix this, we can use bootstrap-like smoothing. The idea is to first save the current ToMATo clustering obtained with filter $f$ (let's call it the initial clustering), and then perturb $f$ a little bit into another function $\tilde f$, and finally recompute clustering with ToMATo using $\tilde f$. Since clusters are now created with the maxima of $\tilde f$ (which will be different in general from those of $f$), we can use the initial clustering to relate the clusters of $\tilde f$ to those of $f$, by simply looking at which (initial) clusters do the maxima of $\tilde f$ belong to. If we repeat this procedure $N$ times, we will end up with a distribution (of size $N$) of candidate clusters for each vertex $v$. It suffices to pick the most frequent one for each vertex to get a smooth segmentation for the 3D shape. See also Section 6 in the article.
In order to implement this, write first a function get_indices_of_maxima
which computes the indices of the maxima associated to a set of ToMATo clusters.
def get_indices_of_maxima(label_points, function):
Li = np.copy(label_points)
for lab in np.unique(label_points):
inds = np.argwhere(label_points == lab).ravel()
imax = np.argmax(function[inds]).ravel()
Li[inds] = inds[imax]
return Li
Compute and plot these maxima on the 3D shape to make sure your code is working.
tomato_maxima = get_indices_of_maxima(tomato.labels_, function)
unique_tomato_maxima = np.unique(tomato_maxima)
vertex_show = np.concatenate([unique_tomato_maxima] + [neighborhood_graph[m] for m in unique_tomato_maxima])
mp.plot(vertices, faces, c=np.array([1 if i in vertex_show else 0 for i in range(len(vertices))]))
Renderer(camera=PerspectiveCamera(children=(DirectionalLight(color='white', intensity=0.6, position=(-0.042500…
<meshplot.Viewer.Viewer at 0x7f2a7ad66c10>
Now, use this function to write another function bootstrap_tomato
that perform a bootstrap smoothing of a set to ToMATo labels. This function will also take as arguments a number $N$ of bootstrap iterations, and a parameter $\epsilon$ controlling the amplitude of the uniform noise used to perturb the filter.
def bootstrap_tomato(tomato_maxima, epsilon, N=100, num_labels=None):
distribution_maxima = np.zeros(shape=[len(tomato_maxima), N])
for n in range(N):
if (n+1) % 100 == 0:
print(str(n+1) + '/' + str(N))
np.random.seed(n)
noisy_function = function + np.random.uniform(low=-epsilon, high=epsilon, size=function.shape)
tomato_boot = gdt.Tomato(graph_type='manual', density_type='manual', n_clusters=num_labels)
tomato_boot.fit(X=neighborhood_graph, weights=noisy_function)
maxima_boot = get_indices_of_maxima(tomato_boot.labels_, noisy_function)
distribution_maxima[:,n] = tomato_maxima[maxima_boot]
final_labels = np.array([int(statistics.mode(distribution_maxima[i,:])) for i in range(len(tomato_maxima))])
return skp.LabelEncoder().fit_transform(final_labels)
Apply the bootstrap smoothing and visualize the segmentation.
final_labels = bootstrap_tomato(tomato_maxima, epsilon=12, N=300, num_labels=5)
100/300 200/300 300/300
mp.plot(vertices, faces, c=final_labels)
Renderer(camera=PerspectiveCamera(children=(DirectionalLight(color='white', intensity=0.6, position=(-0.042500…
<meshplot.Viewer.Viewer at 0x7f2a79de37c0>
Is the segmentation any better? How does the result depend on the noise amplitude?
In this section, our goal is to use Mapper to produce 1-skeletons (i.e., graphs) of 3D shapes. We will also see how to partition this graph into different parts and run statistical tests to decide whether these parts should be considered as numerical artifacts or true signal.
Let's first pick a 3D shape. For instance, use Human/3.off
(or any other one you would like to try).
vertices, faces = off2numpy(dataset_path + 'Human/4.off')
mp.plot(vertices, faces, c=vertices[:,1])
Renderer(camera=PerspectiveCamera(children=(DirectionalLight(color='white', intensity=0.6, position=(0.0249469…
<meshplot.Viewer.Viewer at 0x7f2a79de3760>
In Gudhi
, Mapper is implemented as a specific case of Graph Induced Complex (GIC), see Definition 2.1 in this article. Indeed, given a fixed vertex cover, Mapper computed with hierarchical clustering with parameter $\delta > 0$ is (roughly) the same as GIC computed with neighborhood graph with parameter $\delta$.
Initiate a CoverComplex
from Gudhi
, and set its type to "GIC"
.
mapper_complex = gd.CoverComplex()
mapper_complex.set_verbose(True)
mapper_complex.set_type('GIC')
Define the point cloud on which Mapper is computed with the array vertices
, and the filter function as the height coordinate.
mapper_complex.set_point_cloud_from_range(vertices)
mapper_complex.set_function_from_coordinate(1)
Define the node color function (used only for visualization) as the height coordinate as well.
mapper_complex.set_color_from_coordinate(1)
Define the clustering algorithm by automatically tuning the $\delta$ parameter. This can be done by setting the neighborhood graph automatically with the set_graph_from_automatic_rips
function.
optimal_delta = mapper_complex.set_graph_from_automatic_rips()
print('Optimal delta = ' + str(optimal_delta))
Optimal delta = 0.060308542179891185
Finally, define the cover parameters: 20 intervals for the range of the filter (this parameter is called resolution), and 30% overlap (this one is called gain). Then, compute the cover using preimages of the intervals.
mapper_complex.set_resolution_with_interval_number(20)
mapper_complex.set_gain(0.3)
mapper_complex.set_cover_from_function()
We can now compute Mapper!
mapper_complex.find_simplices()
During computations, the pairwise distances are saved in a binary file matrix_dist
in order to save time for further computations. Hence, if you want to apply Mapper again on a different shape, make sure to remove this file!
The simplicial complex produced by Mapper can be obtained with the create_simplex_tree
function. However, its vertices are given integer IDs associated to the cover used to compute Mapper. For convenience, rename the vertices from 0 to number_of_vertices in increasing order of the initial IDs.
old_simplex_tree = mapper_complex.create_simplex_tree()
inv_name = {k[0][0]: i for i,k in enumerate(old_simplex_tree.get_skeleton(0))}
simplex_tree = gd.SimplexTree()
for s,f in old_simplex_tree.get_simplices():
simplex_tree.insert([inv_name[v] for v in s],f)
Gudhi
also computes the mean of the midpoint of the interval associated to each Mapper vertex, and store it as a filtration value. Check that you have correct filtration values in your simplex tree (at least by eye ;-)).
print('Mapper is of dimension ' + str(simplex_tree.dimension()) + ' - ' + \
str(simplex_tree.num_simplices()) + ' simplices - ' + \
str(simplex_tree.num_vertices()) + ' vertices.')
for simplex, filtration in simplex_tree.get_simplices():
print(simplex, filtration)
Mapper is of dimension 1 - 64 simplices - 32 vertices. [0, 2] -0.8407081688311687 [0] -0.8407081688311687 [1, 3] -0.833067465116279 [1] -0.833067465116279 [2, 4] -0.7668185750000001 [2] -0.8407081688311687 [3, 5] -0.7583590784313724 [3] -0.833067465116279 [4, 7] -0.6919797179487178 [4] -0.7668185750000001 [5, 6] -0.6849420547945205 [5] -0.7583590784313724 [6, 8] -0.6211396388888886 [6] -0.6849420547945205 [7, 9] -0.6207633522727273 [7] -0.6919797179487178 [8, 10] -0.5544817619047621 [8] -0.6211396388888886 [9, 11] -0.5484222588235294 [9] -0.6207633522727273 [10, 12] -0.47382282222222233 [10] -0.5544817619047621 [11, 13] -0.4763913452380952 [11] -0.5484222588235294 [12, 14] -0.4040252926829269 [12] -0.47382282222222233 [13, 14] -0.4040252926829269 [13] -0.4763913452380952 [14, 15] -0.3301595530726258 [14] -0.4040252926829269 [15, 16] -0.25517813559322017 [15] -0.3301595530726258 [16, 17] -0.17047311083743844 [16] -0.25517813559322017 [17, 18] -0.11781780035650609 [17] -0.17047311083743844 [18, 19] -0.047231902173913036 [18, 20] -0.03880118702290085 [18] -0.11781780035650609 [19, 23] 0.029644087499999992 [19] -0.047231902173913036 [20, 21] 0.03804111387900341 [20] -0.03880118702290085 [21, 22] 0.03804111387900341 [21, 24] 0.10657868303571423 [21, 25] 0.06448500000000014 [21] 0.03804111387900341 [22] 0.03804111387900341 [23, 27] 0.1136419285714285 [23] 0.029644087499999992 [24, 26] 0.17734622661870503 [24] 0.10657868303571423 [25] 0.06448500000000014 [26, 27] 0.17734622661870503 [26, 28] 0.2498483145833333 [26] 0.17734622661870503 [27] 0.1136419285714285 [28, 29] 0.3351474811912226 [28] 0.2498483145833333 [29, 30] 0.39582064645601994 [29] 0.3351474811912226 [30, 31] 0.44160636476426857 [30] 0.39582064645601994 [31] 0.44160636476426857
With the write_info
function, Gudhi can produce a .txt
file containing information about the 1-skeleton of the Mapper, that can be processed by an utility function, available here. Download and apply this utility function. This will produce an .html
file that you can visualize in your browser.
mapper_complex.write_info()
!python ../../Git/gudhi/src/Nerve_GIC/utilities/KeplerMapperVisuFromTxtFile.py -f matrix_sc.txt
'matrix_sc.html' is generated. You can now use your favorite web browser to visualize it.
Another (more convenient) way to visualize our complex is to plot its 1-skeleton in a Python figure with networkx
. Using networkx
documentation, write a function get_networkx
that turns a simplicial complex into a networkx
graph corresponding to the 1-skeleton. Make it so the networkx
graph has two attributes, "color"
and "size"
that contain the mean of the filter values of the points associated to the Mapper vertices, and the number of these points (i.e., the size of the preimages) respectively. For this, you can use subpopulation
method of Mapper, which returns the point IDs corresponding to every Mapper vertex.
def get_networkx(mapper_complex, simplex_tree, get_attrs=False):
G = nx.Graph()
for (splx,_) in simplex_tree.get_skeleton(1):
if len(splx) == 1:
G.add_node(splx[0])
if len(splx) == 2:
G.add_edge(splx[0], splx[1])
if get_attrs:
attrs = {k: vertices[np.array(mapper_complex.subpopulation(k)),1].mean() for k in G.nodes()}
nx.set_node_attributes(G, attrs, name='color')
attrs = {k: len(mapper_complex.subpopulation(k)) for k in G.nodes()}
nx.set_node_attributes(G, attrs, name='size')
return G
Apply your function and plot your graph with networkx.draw
.
G = get_networkx(mapper_complex, simplex_tree, get_attrs=True)
plt.figure()
nx.draw(G, node_size=[G.nodes[k]['size'] for k in G.nodes()],
node_color=[G.nodes[k]['color'] for k in G.nodes()])
plt.show()
As seen in class, we can now compute a bag-of-feature descriptor for Mapper, defined as the extended persistence diagram of the Mapper complex associated to the filter. Compute and visuzalize this descriptor with the compute_PD
function.
dgm = mapper_complex.compute_PD()
plt.figure()
plt.scatter([p[0] for p in dgm],[p[1] for p in dgm])
plt.plot([np.array(dgm).min(), np.array(dgm).max()], [np.array(dgm).min(), np.array(dgm).max()])
plt.show()
Can you guess the parts of the 3D shape that are associated to each persistence diagram point?
In order to understand the parts that are relevant, we can use the (empirical) bootstrap to generate a distribution of bottleneck distances (computed as the distances between our current persistence diagram and a distribution of persistence diagrams obtained from bootstrapped Mapper complexes), and use this distribution to derive confidence regions. Compute first such a distribution with the compute_distribution
function.
mapper_complex.compute_distribution()
Now, fix a confidence threshold, say 90%, and retrieve the bottleneck distance value $d_b^*$ such that 90% of distances are below this value. You can use the compute_distance_from_confidence_level
function for that.
dist_to_diag = mapper_complex.compute_distance_from_confidence_level(.9)
print('90% level bottleneck distance = ' + str(dist_to_diag))
90% level bottleneck distance = 0.05760889500379551
Finally, retrieve the points of our current persistence diagram whose distance to the diagonal is larger than $d_b^*$.
dgm_dists_to_diag = np.array([ np.abs(p[1]-p[0])/2 for p in dgm ])
robust_points = np.argwhere(dgm_dists_to_diag >= dist_to_diag).ravel()
print('Robust point indices are: ' + str(robust_points))
Robust point indices are: [0 2 4]
Some points were assessed as non robust, can you guess why?
Finally, one might ask whether there is a direct map from the points of the persistence diagrams to the parts of the 3D shape. It is actually a non-trivial question for Mappers of dimension greater than 2, but for Mappers in dimension 1, it is easier. Indeed, connected components and loops (corresponding to persistence diagram points in ${\rm Ext}_0$ and ${\rm Ext}_1$ respectively---see class) are standard graph features.
Compute and visualize the connected components with the connected_components
function of networkx
.
list_ccs = nx.connected_components(G)
for cc in list_ccs:
plt.figure()
nx.draw(G, node_size=[G.nodes[k]['size'] for k in G.nodes()],
node_color=[1 if k in cc else 0 for k in G.nodes()])
plt.show()
Compute and visualize the loops with the cycle_basis
function of networkx
.
list_cycles = nx.cycle_basis(G)
for cycle in list_cycles:
plt.figure()
nx.draw(G, node_size=[G.nodes[k]['size'] for k in G.nodes()],
node_color=[1 if k in cycle else 0 for k in G.nodes()])
plt.show()
Now, concerning branches, i.e., points in ${\rm Ord}_0$ and ${\rm Rel}_1$, the question is a bit more tricky, but fortunately one can use ToMATo as an approximate solution. This is because ToMATo keeps track of the points forming connected components that are merged later on, wich correspond to branches! Hence, one can apply ToMATo with the filter (resp. the opposite of the filter) to obtain the upward (resp. downward) branches.
Since ToMATo requires neighborhood graphs as inputs, write a function get_neighborhood_graph_from_simplex_tree
that computes the neighborhood graph associated to the 1-skeleton of a simplex tree, in a format that is acceptable for ToMATo.
def get_neighborhood_graph_from_simplex_tree(simplex_tree):
NG = [[] for _ in range(simplex_tree.num_vertices())]
for s,_ in simplex_tree.get_skeleton(1):
if len(s) == 2:
NG[s[0]].append(s[1])
NG[s[1]].append(s[0])
return NG
Now, compute this neighborhood graph and apply ToMATo using both the filter function and its opposite (with no prior on the number of clusters or merging threshold).
neighborhood_graph = get_neighborhood_graph_from_simplex_tree(simplex_tree)
function = np.array([G.nodes[k[0]]['color'] for k,_ in simplex_tree.get_skeleton(0)])
tomato_1 = gdt.Tomato(graph_type='manual', density_type='manual', n_clusters=None, merge_threshold=-1)
tomato_1 = tomato_1.fit(X=neighborhood_graph, weights=function)
tomato_2 = gdt.Tomato(graph_type='manual', density_type='manual', n_clusters=None, merge_threshold=-1)
tomato_2 = tomato_2.fit(X=neighborhood_graph, weights=-function)
Finally, visualize the ToMATo labels on the graph.
plt.figure()
nx.draw(G, node_size=[G.nodes[k]['size'] for k in G.nodes()], node_color=tomato_1.labels_[G.nodes()])
plt.show()
plt.figure()
nx.draw(G, node_size=[G.nodes[k]['size'] for k in G.nodes()], node_color=tomato_2.labels_[G.nodes()])
plt.show()
The branches should be detected and highlighted with different colors!
In this section, our goal is to compute confidence regions associated to the persistence diagram of a given 3D shape. We will study such regions for both the persistence diagram, and one of its representation, the persistence landscape.
Let's first pick a 3D shape. Let's first pick a 3D shape. For instance, use Hand/181.off
(or any other one you would like to try).
vertices, faces = off2numpy('3dshapes/Vase/361.off')
mp.plot(vertices, faces, c=vertices[:,1])
Renderer(camera=PerspectiveCamera(children=(DirectionalLight(color='white', intensity=0.6, position=(0.0170675…
<meshplot.Viewer.Viewer at 0x7f2a981e2a60>
The first standard way of obtaining confidence regions for (geometric) persistence diagrams is through the stability theorem (see class):
$$\mathbb{P}(d_b(D_{\rm Rips}(X),D_{\rm Rips}(\hat X_n)) \geq \delta)\leq \mathbb{P}(d_H(X,\hat X_n)\geq \delta/2),$$$$\mathbb{P}(d_b(D_{\rm Cech}(X),D_{\rm Cech}(\hat X_n)) \geq \delta)\leq \mathbb{P}(d_H(X,\hat X_n)\geq \delta),$$where $d_H(\cdot,\cdot)$ is the Hausdorff distance, defined, for any two compact spaces $X,Y\subset \mathbb{R}^d$, as
$$d_H(X,Y)={\rm min}\{{\rm max}_{x\in X}{\rm min}_{y\in Y}\|x-y\|, {\rm max}_{y\in Y}{\rm min}_{x\in X}\|y-x\|\}.$$Hence, it suffices to estimate $\mathbb{P}(d_H(X,\hat X_n)\geq \delta)$ in order to derive confidence regions for persistence diagrams. There exists an upper bound for this probability when $\hat X_n$ is drawn from an $(a,b)$-standard probability measure, however this bound depends on $a$ and $b$. In the following, we will rather use the subsampling method, that allows to estimate the probability solely from subsampling $\hat X_n$ with $s(n) =o\left(\frac{n}{{\rm log}(n)}\right)$ points, and computing $d_H(\hat X_n, \hat X_{s(n)})$. The exact procedure is described in Section 4.1 in this article.
Write a function hausdorff_distance
that computes the Hausdorff distance between the vertices of our 3D shape and a subset of these vertices.
def hausdorff_distance(vertices, sn):
n = len(vertices)
I = np.random.choice(n, sn, replace=False)
Icomp = np.setdiff1d(np.arange(n), I)
tree = skn.KDTree(vertices[I], leaf_size=2)
distances, _ = tree.query(vertices[Icomp], k=1)
hdist = max(distances)
return(hdist)
Now, write a function hausdorff_interval
that computes this Hausdorff distance many times and uses the corresponding distribution of Hausdorff distances in order to output the bottleneck distance value associated to a given confidence level (by computing the quantile---corresponding to this confidence level---of the distribution).
def hausdorff_interval(vertices, level=0.95, sn=100, N=1000):
distribution_hausdorff_distances = [hausdorff_distance(vertices, sn) for _ in range(N)]
diagram_quantile = np.quantile(distribution_hausdorff_distances, level)
return diagram_quantile
Apply your code to obtain a bottleneck distance associated to, say, 90% confidence.
n = len(vertices)
conf_bottleneck = hausdorff_interval(vertices=vertices, level=0.9, N=100, sn=int(10*n/np.log(n)**2))
print('Bottleneck distance associated to confidence level = ' + str(conf_bottleneck))
Bottleneck distance associated to confidence level = 0.16691767027658425
All right, now let's see which points of the persistence diagram are we going to label non-significant and discard. Compute the Rips and Alpha persistence diagrams of the points.
simplex_tree = gd.RipsComplex(points=vertices, max_edge_length=9*1e-2).create_simplex_tree(max_dimension=2)
pers_rips = simplex_tree.persistence()
simplex_tree = gd.AlphaComplex(points=vertices).create_simplex_tree()
for splx, filt in simplex_tree.get_filtration():
simplex_tree.assign_filtration(splx, filtration=np.sqrt(filt))
pers_alpha = simplex_tree.persistence()
Now, visualize the persistence diagrams with a band of size the previously computed bottleneck distance times 2 (for Alpha filtration) and 4 (for Rips filtration).
gd.plot_persistence_diagram(pers_alpha, band=2*conf_bottleneck)
<AxesSubplot:title={'center':'Persistence diagram'}, xlabel='Birth', ylabel='Death'>
gd.plot_persistence_diagram(pers_rips, band=4*conf_bottleneck)
<AxesSubplot:title={'center':'Persistence diagram'}, xlabel='Birth', ylabel='Death'>
Are you discarding many points? If yes, this could be because the confidence region is computed only from the stability property of persistence diagrams: subsampling the Hausdorff distance can sometimes be quite conservative. It would be more efficient to bootstrap the persistence diagrams themselves---this is the approach advocated in Section 6 of this article. However, this method was only proved for persistence diagrams obtained through the sublevel sets of kernel density estimators... But let's try it anyway! ;-)
Similarly than before, write bottleneck_distance_bootstrap
and bottleneck_interval
functions that compute the bottleneck distances between our current persistence diagram (in homology dimension 1) and the persistence diagrams of many bootstrap iterates.
dgm = simplex_tree.persistence_intervals_in_dimension(1)
def bottleneck_distance_bootstrap(dgm, vertices):
n = len(vertices)
I = np.random.choice(n, n, replace=True)
simplex_tree = gd.AlphaComplex(points=vertices[I]).create_simplex_tree()
for splx, filt in simplex_tree.get_filtration():
simplex_tree.assign_filtration(splx, filtration=np.sqrt(filt))
simplex_tree.persistence()
dgm_subsample = simplex_tree.persistence_intervals_in_dimension(1)
return gd.bottleneck_distance(dgm, dgm_subsample)
def bottleneck_interval(dgm, vertices, level=0.95, N=1000):
distribution_bottleneck_distances = [bottleneck_distance_bootstrap(dgm, vertices) for _ in range(N)]
bottleneck_diagram_quantile = np.quantile(distribution_bottleneck_distances, level)
return bottleneck_diagram_quantile
Compute the bottleneck distance associated to a confidence level and visualize it.
n = len(vertices)
conf_bottleneck_empirical = bottleneck_interval(dgm=dgm, vertices=vertices, level=0.9, N=100)
print('Bottleneck distance associated to confidence level = ' + str(conf_bottleneck_empirical))
Bottleneck distance associated to confidence level = 0.018634322836806633
gd.plot_persistence_diagram(pers_alpha, band=2*conf_bottleneck_empirical)
<AxesSubplot:title={'center':'Persistence diagram'}, xlabel='Birth', ylabel='Death'>
Are you discarding less points in the persistence diagram now?
Another approach with more theoretical guarantees is to use the persistence landscapes associated to the persistence diagram. Indeed, valid confidence regions can be easily obtained using, e.g., algorithm 1 in this article. In the following, we will fix a subsample size $s(n)$, and estimate $\mathbb{E}[\Lambda_{s(n)}]$, where $\Lambda_{s(n)}$ is the landscape of a random subsample of size $s(n)$ (i.e., drawn from a probability measure $\mu$ such as, e.g., the empirical measure).
Let's first make sure that we can compute landscapes ;-) Use Gudhi
to compute and plot the first six persistence landscapes associated to the persistence diagram computed above in homology dimension 1. Landscapes (and other vectorizations) are implemented with the API of Scikit-Learn
estimators, which means that you have to call the fit_transform
method on a list of persistence diagrams in order to get their landscapes.
dgm1 = simplex_tree.persistence_intervals_in_dimension(1)
landscape1 = gdr.Landscape(num_landscapes=6, resolution=100).fit_transform([dgm1])
plt.figure()
plt.plot(landscape1[0,0:100])
plt.plot(landscape1[0,100:200])
plt.plot(landscape1[0,200:300])
plt.plot(landscape1[0,300:400])
plt.plot(landscape1[0,400:500])
plt.plot(landscape1[0,500:600])
plt.show()
Write a function landscape_interval
that implements the landscape bootstrap procedure, that is, drawing many subsamples of size $s(n)$, computing their Alpha persistence diagrams and landscapes, computing the distribution of distances between each single landscape and their mean (multiplied by a random normal variable), and finally using the quantiles of this distribution in order to obtain confidence regions for the mean landscape.
def landscape_interval(vertices, sn=100, N=100, B=100, num_landscapes=6, resolution=100, landscape_estimator=None):
n = len(vertices)
list_sub_dgm1 = []
for _ in range(N):
sub_vertices = vertices[np.random.choice(n, sn, replace=True)]
sub_simplex_tree = gd.AlphaComplex(points=sub_vertices).create_simplex_tree()
for splx, filt in sub_simplex_tree.get_filtration():
sub_simplex_tree.assign_filtration(splx, filtration=np.sqrt(filt))
sub_simplex_tree.persistence()
sub_dgm1 = sub_simplex_tree.persistence_intervals_in_dimension(1)
list_sub_dgm1.append(sub_dgm1)
if landscape_estimator is None:
landscape_estimator = gdr.Landscape(num_landscapes=num_landscapes, resolution=resolution)
landscape_distrib = landscape_estimator.fit_transform(list_sub_dgm1)
mean_landscape = np.mean(landscape_distrib, axis=0)
landscape_differences = landscape_distrib - mean_landscape[None,:]
theta_distrib = [[] for _ in range(num_landscapes)]
for _ in range(B):
xi = np.random.normal(size=[N,1])
random_landscape_differences = np.abs(np.multiply(xi, landscape_differences).sum(axis=0))/np.sqrt(N)
for nl in range(num_landscapes):
theta_distrib[nl].append( random_landscape_differences[nl*resolution:(nl+1)*resolution].max() )
return landscape_estimator, mean_landscape, theta_distrib
Apply and visualize the confidence interval around the different landscapes.
N, B, num_landscapes, resolution = 10, 10, 6, 100
landscape_estimator, mean_landscape, theta_distrib = landscape_interval(vertices, sn=int(.9*len(vertices)),
N=N, B=B, num_landscapes=num_landscapes,
resolution=resolution)
q_alpha = [np.quantile(theta_distrib[nl], .9) for nl in range(num_landscapes)]
nl = 5
mean_curve = mean_landscape[nl*resolution:(nl+1)*resolution]
upper_curve = mean_curve+q_alpha[nl]/np.sqrt(N)
lower_curve = np.maximum(0,mean_curve-q_alpha[nl]/np.sqrt(N))
plt.figure()
plt.plot(mean_curve)
plt.fill_between(np.arange(resolution), lower_curve, upper_curve, alpha=0.2, color='tab:orange')
plt.show()
The confidence regions are much better now!
Another interesting property of mean landscapes is their robustness to noise:
$$\|\mathbb{E}[\Lambda_{s(n)}^X]-\mathbb{E}[\Lambda_{s(n)}^Y]\|_\infty\leq 2 \cdot s(n) \cdot d_{GW}(\mu,\nu),$$where $d_{GW}$ is the 1-Gromov-Wasserstein distance between probability measures. See Remark 6 in this article. We will now confirm this by adding outlier noise to the 3D shape and looking at the resulting mean landscape.
Create a noisy version of vertices
with some outlier noise.
num_noisy_points = 100
noisy_vertices = np.vstack([vertices, np.random.uniform(vertices.min(), vertices.max(), [num_noisy_points,3])])
Let's first compare the persistence landscapes of the two sets of vertices. Compute and visualize these landscapes on the same plot.
simplex_tree = gd.AlphaComplex(points=vertices).create_simplex_tree()
for splx, filt in simplex_tree.get_filtration():
simplex_tree.assign_filtration(splx, filtration=np.sqrt(filt))
simplex_tree.persistence()
dgm1 = simplex_tree.persistence_intervals_in_dimension(1)
noisy_simplex_tree = gd.AlphaComplex(points=noisy_vertices).create_simplex_tree()
for splx, filt in noisy_simplex_tree.get_filtration():
noisy_simplex_tree.assign_filtration(splx, filtration=np.sqrt(filt))
noisy_simplex_tree.persistence()
noisy_dgm1 = noisy_simplex_tree.persistence_intervals_in_dimension(1)
landscape1 = gdr.Landscape(num_landscapes=6, resolution=100).fit_transform([dgm1, noisy_dgm1])
plt.figure()
plt.plot(landscape1[0,0:100])
plt.plot(landscape1[0,100:200])
plt.plot(landscape1[0,200:300])
plt.plot(landscape1[0,300:400])
plt.plot(landscape1[0,400:500])
plt.plot(landscape1[0,500:600])
plt.plot(landscape1[1,0:100], linestyle='--')
plt.plot(landscape1[1,100:200], linestyle='--')
plt.plot(landscape1[1,200:300], linestyle='--')
plt.plot(landscape1[1,300:400], linestyle='--')
plt.plot(landscape1[1,400:500], linestyle='--')
plt.plot(landscape1[1,500:600], linestyle='--')
plt.show()
As one can see, they are quite different. By contrast, computing the mean landscape with subsamples is much more robust, as we will now see.
Compute the mean persistence landscape of the noisy point cloud, and visualize it next to the mean persistence landscape of the clean point cloud.
_, noisy_mean_landscape, noisy_theta_distrib = landscape_interval(noisy_vertices, sn=int(.9*len(vertices)),
N=N, B=B, num_landscapes=num_landscapes,
resolution=resolution,
landscape_estimator=landscape_estimator)
noisy_q_alpha = [np.quantile(noisy_theta_distrib[nl], .9) for nl in range(num_landscapes)]
nl = 5
mean_curve = mean_landscape[nl*resolution:(nl+1)*resolution]
upper_curve = mean_curve+q_alpha[nl]/np.sqrt(N)
lower_curve = np.maximum(0,mean_curve-q_alpha[nl]/np.sqrt(N))
noisy_mean_curve = noisy_mean_landscape[nl*resolution:(nl+1)*resolution]
noisy_upper_curve = noisy_mean_curve+noisy_q_alpha[nl]/np.sqrt(N)
noisy_lower_curve = np.maximum(0,noisy_mean_curve-noisy_q_alpha[nl]/np.sqrt(N))
plt.figure()
plt.plot(mean_curve, c='red')
plt.fill_between(np.arange(resolution), lower_curve, upper_curve, alpha=0.2, color='tab:orange')
plt.plot(noisy_mean_curve, c='black')
plt.fill_between(np.arange(resolution), noisy_lower_curve, noisy_upper_curve, alpha=0.2, color='tab:blue')
plt.show()
Now, these landscapes looks much more in agreement!
In this section, our goal is to use persistence diagrams for classifying and segmenting 3D shapes with supervised machine learning.
Let's start with classification. We will compute persistence diagrams for all shapes in different categories, and train a classifier from Scikit-Learn
to predict the category from the persistence diagrams. Since Gudhi
requires simplex trees from the persistence diagram computations, write a get_simplex_tree_from_faces
function that builds a simplex tree from the faces of a given 3D shape triangulation.
def get_simplex_tree_from_faces(faces):
simplex_tree = gd.SimplexTree()
for face in faces:
simplex_tree.insert(face, -1e10)
return simplex_tree
Now, compute all the persistence diagrams (in homology dimension 0) associated to the sublevel sets of the third coordinate from a few categories, and retrieve their corresponding labels.
all_categories = os.listdir(dataset_path)
print(all_categories)
categories_to_classify = [all_categories[1], all_categories[2], all_categories[12]]
['Airplane', 'Ant', 'Armadillo', 'Bearing', 'Bird', 'Bust', 'Chair', 'Cup', 'Fish', 'FourLeg', 'Glasses', 'Hand', 'Human', 'Mech', 'Octopus', 'Plier', 'Table', 'Teddy', 'Vase']
dgms, labels = [], []
for label in all_categories[:3]:
shapes = os.listdir(dataset_path + label + '/')
for file in shapes:
if file[-4:] == '.off':
vertices, faces = off2numpy(dataset_path + label + '/' + file)
st = get_simplex_tree_from_faces(faces)
filtration = vertices[:,2]
for v in range(len(vertices)):
st.assign_filtration([v], filtration[v])
st.make_filtration_non_decreasing()
st.persistence()
dgms.append(st.persistence_intervals_in_dimension(0))
labels.append(label)
le = skp.LabelEncoder().fit(labels)
As discussed in class, it is not very convenient to use persistence diagrams directly for machine learning purposes (except for a few methods such as $K$-nearest neighbors). What we need is to define a vectorization, that is, a map $\Phi:\mathcal{D}\rightarrow\mathcal{H}$ sending persistence diagrams into a Hilbert space, or equivalently, a symmetric kernel function $k:\mathcal{D}\times \mathcal{D} \rightarrow \mathbb{R}$ such that $k(D,D')=\langle \Phi(D),\Phi(D')\rangle$. Fortunately, there are already a bunch of such maps and kernels in Gudhi
:-)
In the following we will compute and visualize the most popular kernels on some persistence diagrams. Pick first a specific persistence diagram and use DiagramSelector
to remove its points with infinite coordinates.
diagram = dgms[30]
gd.plot_persistence_diagram(diagram)
<AxesSubplot:title={'center':'Persistence diagram'}, xlabel='Birth', ylabel='Death'>
[diagram] = gdr.DiagramSelector(use=True, point_type='finite').fit_transform([diagram])
Now, let's see what Gudhi
has to offer to vectorize persistence diagrams with Scikit-Learn
estimator-like classes, that is, with classes that have fit
, transform
, and fit_transform
methods, see this article for more details. For each vectorization mentioned below, we recommend you to play with its parameters and infer their influence on the ouput in order to get some intuition.
The first vectorization method that was introduced historically is the persistence landscape. A persistence landscape is basically obtained by rotating the persistence diagram by $-\pi/4$ (so that the diagonal becomes the $x$-axis), and then putting tent functions on each point. The $k$th landscape is then defined as the $k$th largest value among all these tent functions. It is eventually turned into a vector by evaluating it on a bunch of uniformly sampled points on the $x$-axis.
Compute and visualize the first landscape of the persistence diagram for various parameters.
LS = gdr.Landscape(resolution=1000, num_landscapes=3)
L = LS.fit_transform([diagram])
plt.figure()
plt.plot(L[0][:1000])
plt.plot(L[0][1000:2000])
plt.plot(L[0][2000:3000])
plt.show()
A variation, called the silhouette, takes a weighted average of these tent functions instead. Here, we weight each tent function by the distance of the corresponding point to the diagonal.
SH = gdr.Silhouette(resolution=1000, weight=lambda x: np.power(x[1]-x[0],2))
sh = SH.fit_transform([diagram])
plt.figure()
plt.plot(sh[0])
plt.show()
The second method is the persistence image. A persistence image is obtained by rotating by $-\pi/4$, centering Gaussian functions on all diagram points (usually weighted by a parameter function, such as, e.g., the squared distance to the diagonal) and summing all these Gaussians. This gives a 2D function, that is pixelized into an image.
PI = gdr.PersistenceImage(bandwidth=5*1e-2, weight=lambda x: x[1]**0, \
im_range=[-.5,.5,0,.5], resolution=[100,100])
pi = PI.fit_transform([diagram])
plt.figure()
plt.imshow(np.flip(np.reshape(pi[0], [100,100]), 0))
plt.show()
Gudhi
also has a variety of metrics and kernels, which sometimes perform better than explicit vectorizations such as the ones described above. Pick another persistence diagram, and get familiar with the bottleneck and the Wasserstein distances between them. Note that you can call them in different ways in Gudhi
, there are bottleneck_distance
and wasserstein_distance
functions for instance, but there are also wrappers of these functions into estimator classes BottleneckDistance
and WassersteinDistance
(with fit
and transform
methods). These classes are especially useful when doing model selection with Scikit-Learn
, see below.
[diagram_bis] = gdr.DiagramSelector(use=True, point_type='finite').fit_transform([dgms[20]])
gd.plot_persistence_diagram(diagram_bis)
<AxesSubplot:title={'center':'Persistence diagram'}, xlabel='Birth', ylabel='Death'>
BD = gdr.BottleneckDistance(epsilon=.001)
BD.fit([diagram])
bd = BD.transform([diagram_bis])
print("Bottleneck distance is " + str(bd[0][0]))
Bottleneck distance is 0.08450990809339204
WD = gdr.WassersteinDistance(internal_p=2, order=2)
WD.fit([diagram])
wd = WD.transform([diagram_bis])
print("Wasserstein distance is " + str(wd[0][0]))
Wasserstein distance is 0.15298343201471196
Gudhi
also allows to use standard kernels such as, among others, the persistence scale space kernel, persistence Fisher kernel, sliced Wasserstein kernel, etc. Try computing the kernel values for your pair of diagrams.
PSS = gd.representations.PersistenceScaleSpaceKernel(bandwidth=1.)
PSS.fit([diagram])
pss = PSS.transform([diagram_bis])
print("PSS kernel is " + str(pss[0][0]))
PSS kernel is 1.4115709564014576
PF = gd.representations.PersistenceFisherKernel(bandwidth_fisher=.1, bandwidth=.1, kernel_approx=None)
PF.fit([diagram])
pf = PF.transform([diagram_bis])
print("PF kernel is " + str(pf[0][0]))
PF kernel is 0.6536253395828673
SW = gd.representations.SlicedWassersteinKernel(bandwidth=1, num_directions=100)
SW.fit([diagram])
sw = SW.transform([diagram_bis])
print("SW kernel is " + str(sw[0][0]))
SW kernel is 0.7619287943683518
Before trying to classify the persistence diagrams, let's do a quick dimension reduction with PCA. Apply PCA
, KernelPCA
or MDS
(available in Scikit-Learn
) on the explicit maps (landscapes, images, etc), kernel matrices (Fisher, sliced Wasserstein, etc) and distance matrices (bottleneck, Wasserstein, etc) respectively.
dgms = gdr.DiagramSelector(use=True, point_type='finite').fit_transform(dgms)
silhouettes = gdr.Silhouette(resolution=1000, weight=lambda x: np.power(x[1]-x[0],0)).fit_transform(dgms)
sliced_wass_kernel = gdr.SlicedWassersteinKernel(num_directions=100).fit_transform(dgms)
bottleneck_matrix = gdr.BottleneckDistance().fit_transform(dgms)
pca_silhouettes = skd.PCA(n_components=2).fit_transform(silhouettes)
kpca_sliced_wass_kernel = skd.KernelPCA(n_components=2).fit_transform(sliced_wass_kernel)
mds_bottleneck_matrix = skf.MDS(n_components=2, dissimilarity='precomputed').fit_transform(bottleneck_matrix)
integer_labels = le.transform(labels)
label_indices = [(l,le.classes_[l],np.argwhere(integer_labels==l).ravel()) for l in range(integer_labels.max()+1)]
plt.figure(figsize=(10,5))
ax1 = plt.subplot(131)
for il,l,li in label_indices:
ax1.scatter(pca_silhouettes[li,0], pca_silhouettes[li,1], label=l)
ax1.legend()
ax1.set_title('Silhouette')
ax2 = plt.subplot(132)
for il,l,li in label_indices:
ax2.scatter(kpca_sliced_wass_kernel[li,0], kpca_sliced_wass_kernel[li,1], label=l)
ax2.legend()
ax2.set_title('SWK')
ax3 = plt.subplot(133)
for il,l,li in label_indices:
ax3.scatter(mds_bottleneck_matrix[li,0], mds_bottleneck_matrix[li,1], label=l)
ax3.legend()
ax3.set_title('Bottleneck')
plt.show()
Is there any method that looks better in separating the categories, at least by eye?
All right, let's try classification now! Shuffle the data, and create a random 80/20 train/test split.
np.random.seed(0)
test_size = 0.2
perm = np.random.permutation(len(labels))
limit = int(test_size * len(labels))
test_sub, train_sub = perm[:limit], perm[limit:]
train_labs = np.array(labels)[train_sub]
test_labs = np.array(labels)[test_sub]
train_dgms = [dgms[i] for i in train_sub]
test_dgms = [dgms[i] for i in test_sub]
Here is the best thing about having estimator-like classes: they can be integrated flawlessly in a Pipeline
of Scikit-Learn
for model selection and cross-validation! A Pipeline
is itself an estimator, and is initialized as with a list of estimators. It will just sequentially apply the fit_transform
methods of the estimators in the list.
Define a Pipeline
with four estimators: one for selecting the finite persistence diagram points, one for scaling (or not) the persistence diagrams (with DiagramScaler
), one for vectorizing persistence diagrams, and one for performing the final prediction. See the documentation.
pipe = skl.Pipeline([("Separator", gdr.DiagramSelector(limit=np.inf, point_type="finite")),
("Scaler", gdr.DiagramScaler(scalers=[([0,1], skp.MinMaxScaler())])),
("TDA", gdr.PersistenceImage()),
("Estimator", sks.SVC())])
Now, define a grid of parameter that will be used in cross-validation.
param = [{"Separator__use": [True],
"Scaler__use": [False, True],
"TDA": [gdr.SlicedWassersteinKernel()],
"TDA__bandwidth": [0.1, 1.0],
"TDA__num_directions": [20],
"Estimator": [sks.SVC(kernel="precomputed", gamma="auto")]},
{"Separator__use": [True],
"Scaler__use": [False, True],
"TDA": [gdr.Silhouette()],
"TDA__resolution": [100],
"Estimator": [ske.RandomForestClassifier()]},
{"Separator__use": [True],
"Scaler__use": [False, True],
"TDA": [gdr.BottleneckDistance()],
"TDA__epsilon": [0.1],
"Estimator": [skn.KNeighborsClassifier(metric="precomputed")]}
]
Define and train the model.
model = skm.GridSearchCV(pipe, param, cv=3)
model = model.fit(train_dgms, train_labs)
Check the parameters that were chosen during model selection, and evaluate your model on the test set.
print(model.best_params_)
{'Estimator': SVC(gamma='auto', kernel='precomputed'), 'Scaler__use': True, 'Separator__use': True, 'TDA': SlicedWassersteinKernel(bandwidth=0.1, num_directions=20), 'TDA__bandwidth': 0.1, 'TDA__num_directions': 20}
print("Train accuracy = " + str(model.score(train_dgms, train_labs)))
print("Test accuracy = " + str(model.score(test_dgms, test_labs)))
Train accuracy = 1.0 Test accuracy = 0.75
How is your score? If it is bad, you can always increase the parameter and/or classifier search, but this can quickly become quite cumbersome. Moreover, a potential source of error comes from the fact that the third coordinate do not necessarily correspond to the height (i.e., the 3D shapes are not embedded in a consistent way). This is where persistence differentiation can come to the rescue! Indeed, instead of picking a specific coordinate, we can try to optimize a linear combination of the coordinates:
$$f_\alpha: x\mapsto \sum_{i=1}^d \alpha_i x_i,$$such that the persistence diagrams of the same category are close, while persistence diagrams from different categories are far away from each other. This means minimizing a loss of the form:
$$\alpha^* = {\rm min}_\alpha \sum_l \frac{\sum_{y_i=y_j=l}d(D_{f_\alpha}(x_i),D_{f_\alpha}(x_j))}{\sum_{y_i=l,y_j}d(D_{f_\alpha}(x_i),D_{f_\alpha}(x_j))},$$where $d$ is any (pseudo)-distance between persistence diagrams, that can be differentiated through a deep learning library (such as TensorFlow
or PyTorch
). For instance, the sliced Wasserstein distance is quite easy to compute with standard deep learning libraries since it only involves projecting points onto lines. See this article.
Write a deep_swd
function that computes the sliced Wasserstein distance between persistence diagrams with TensorFlow
or PyTorch
operations.
def deep_swd(dgms, ccards, thetas):
projected_dgms = tf.linalg.matmul(tf.concat(dgms,axis=0), .5*tf.ones([2,2], tf.float32))
dgms_big = tf.concat([tf.reshape(
tf.concat([dgm, projected_dgms[:ccards[idg]], projected_dgms[ccards[idg+1]:]], axis=0),
[-1,2,1,1]) for idg, dgm in enumerate(dgms)], axis=2)
cosines, sines = tf.math.cos(thetas), tf.math.sin(thetas)
vecs = tf.concat([tf.reshape(cosines,[1,1,1,-1]), tf.reshape(sines,[1,1,1,-1])], axis=1)
theta_projs = tf.sort(tf.math.reduce_sum(tf.math.multiply(dgms_big, vecs), axis=1), axis=0)
t1 = tf.reshape(theta_projs, [ccards[-1],-1,1,100])
t2 = tf.reshape(theta_projs, [ccards[-1],1,-1,100])
dists = tf.math.reduce_mean(tf.math.reduce_sum(tf.math.abs(t1-t2), axis=0), axis=2)
return dists
Now, just as before, split the data into train/test, but this time, collect the vertices, simplex trees and labels (it is useless to compute persistence diagrams since they will be recomputed after each gradient descent iteration and update of $\alpha$).
shape_vertices, simplex_trees, labels = [], [], []
for label in all_categories[:3]:
shapes = os.listdir(dataset_path + label + '/')
for file in shapes:
if file[-4:] == '.off':
vertices, faces = off2numpy(dataset_path + label + '/' + file)
st = get_simplex_tree_from_faces(faces)
shape_vertices.append(vertices)
simplex_trees.append(st)
labels.append(label)
labels = le.fit_transform(labels)
test_size = 0.2
perm = np.random.permutation(len(labels))
limit = int(test_size * len(labels))
test_sub, train_sub = perm[:limit], perm[limit:]
train_labs = np.array(labels)[train_sub]
test_labs = np.array(labels)[test_sub]
train_sts = [simplex_trees[i] for i in train_sub]
test_sts = [simplex_trees[i] for i in test_sub]
train_3ds = [shape_vertices[i] for i in train_sub]
test_3ds = [shape_vertices[i] for i in test_sub]
Initialize the alpha values, as well as the angles used for computing the sliced Wasserstein distances (and make sure these angles are not optimized during training). Define also the iteration number, batch size, learning rate and optimizer.
alphas = tf.Variable(initial_value=np.array([0,0,1], dtype=np.float32), trainable=True)
thetainit = np.linspace(-np.pi/2, np.pi/2, num=100)
thetas = tf.Variable(initial_value=np.array(thetainit, dtype=np.float32), trainable=False)
num_epochs = 10
batch_size = 100
lr = tf.keras.optimizers.schedules.ExponentialDecay(0.1, decay_steps=1e5, decay_rate=0.99, staircase=True)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
Run gradient descent! For this, you can use the LowerStarSimplexTreeLayer
class from Gudhi
, which computes persistence diagrams from simplex trees in a differentiable way with TensorFlow
operations. Make sure to save the loss value at each iteration.
losses = []
for epoch in range(num_epochs+1):
print(str(epoch) + '/' + str(num_epochs))
np.random.seed(int(1e2*epoch))
npts, total_n, batch_i = 0, len(train_labs), 0
perm = np.random.permutation(total_n)
Lgrads = []
while npts < total_n:
batch = perm[batch_i*batch_size:(batch_i+1)*batch_size]
npts += batch_size
batch_i += 1
weight = len(batch)/total_n
batch_labs = [train_labs[i] for i in batch]
with tf.GradientTape() as tape:
dgms, cards = [], [0]
for i in batch:
filtration = tf.math.reduce_sum(tf.math.multiply(alphas, train_3ds[i]), axis=1)
sl = gdtf.LowerStarSimplexTreeLayer(simplextree=train_sts[i], homology_dimensions=[0])
dgm = sl.call(filtration)[0][0]
dgms.append(dgm)
cards.append(dgm.shape[0])
ccards = np.cumsum(cards)
dists = deep_swd(dgms, ccards, thetas)
loss = 0.
classes = np.unique(batch_labs)
for l in classes:
lidxs = np.argwhere(np.array(batch_labs) == l).ravel()
idxs1 = list(itertools.product(lidxs, lidxs))
idxs2 = list(itertools.product(lidxs, range(len(batch))))
cost1 = tf.math.reduce_sum(tf.gather_nd(dists, idxs1))
cost2 = tf.math.reduce_sum(tf.gather_nd(dists, idxs2))
if cost2 > 0:
loss += cost1 / cost2
else:
loss += cost1
gradients = tape.gradient(loss, [alphas])
Lgrads.append(weight*gradients[0])
final_grad = gradients
final_grad[0] = tf.math.add_n(Lgrads)
optimizer.apply_gradients(zip(final_grad, [alphas]))
losses.append(loss.numpy())
0/10 1/10 2/10 3/10 4/10 5/10 6/10 7/10 8/10 9/10 10/10
Visualize the losses. Is it decreasing? What are the final alpha values?
plt.figure()
plt.plot(losses)
plt.show()
final_alphas = alphas.numpy()
print(final_alphas)
[ 0.38129476 -0.00895035 1.4207355 ]
We can now use these values to train a model again with this new filtration, and check whether the accuracy is better now!
dgms, labels = [], []
for label in all_categories[:3]:
shapes = os.listdir(dataset_path + label + '/')
for file in shapes:
if file[-4:] == '.off':
vertices, faces = off2numpy(dataset_path + label + '/' + file)
st = get_simplex_tree_from_faces(faces)
filtration = np.multiply(final_alphas[None,:], vertices).sum(axis=1)
for v in range(len(vertices)):
st.assign_filtration([v], filtration[v])
st.make_filtration_non_decreasing()
st.persistence()
dgms.append(st.persistence_intervals_in_dimension(0))
labels.append(label)
test_size = 0.2
perm = np.random.permutation(len(labels))
limit = int(test_size * len(labels))
test_sub, train_sub = perm[:limit], perm[limit:]
train_labs = np.array(labels)[train_sub]
test_labs = np.array(labels)[test_sub]
train_dgms = [dgms[i] for i in train_sub]
test_dgms = [dgms[i] for i in test_sub]
model = skm.GridSearchCV(pipe, param, cv=3)
model = model.fit(train_dgms, train_labs)
print(model.best_params_)
{'Estimator': SVC(gamma='auto', kernel='precomputed'), 'Scaler__use': False, 'Separator__use': True, 'TDA': SlicedWassersteinKernel(num_directions=20), 'TDA__bandwidth': 1.0, 'TDA__num_directions': 20}
print("Train accuracy = " + str(model.score(train_dgms, train_labs)))
print("Test accuracy = " + str(model.score(test_dgms, test_labs)))
Train accuracy = 0.9583333333333334 Test accuracy = 0.9166666666666666
Yay! That's definitely better!
If you managed to go that far, congrats, you are basically a TDA expert now ;-) Do not hesitate to reuse these pieces of code for your own research, and let us know if you have any comment/question/suggestion!