CATS: Computations And Topological Statistics<br> INRIA Associated Team<br> DataShape

CATS: Computations And Topological Statistics

INRIA Associated Team between DataShape and Carnegie Mellon University (CMU)

Team members

On the INRIA side:

Jean-Daniel Boissonnat, Directeur de recherche, DataShape group.
Frédéric Chazal (PI), Directeur de recherche, DataShape group
Steve Oudot, Chargé de recherche, DataShape group.
Bertrand Michel, MdC en délégation, DataShape group - Sept. 2013 to Sept. 2015.
Clément Levrard, post-doc, DataShape group - Sept. 2014 to Sept. 2015.
Eddie Aamari, Ph.D. student, DataShape group - Sept. 2014 to Sept. 2017.
Thomas Bonnis, Ph.D. student, DataShape group - Dec. 2013 to Dec. 2016.
Claire Brecheteau, PhD student, DataShape group - Sept. 2015 to Sept. 2018.
Mathieu Carrière, Ph.D. student, DataShape group - Nov. 2014 to Nov. 2017.

On the CMU side:

Larry Wasserman (co-PI), Professor, Department of Statistics and Machine Learning Department.
Alessandro Rinaldo, Associate Professor, Department of Statistics.
Jessi Cisewski, Visiting Assistant Professor, Department of Statistics.
Yen-Chi Chen, Ph.D student, Dept. of Statistics, 2013-Present.
Jisu Kim, Ph.D. student, Dept. of Statistics, Aug 2013-Present.
Fabrizio Lecci, Ph.D. student, Dept. of Statistics, Aug 2011-July 2015.
Jaehyeok Shi, Ph.D student, Dept. of Statistics, 2016-Present.

Motivations and cientific objectives

With the recent explosion in the amount and variety of available data, identifying, extracting and exploiting their underlying geometric structures has become a problem of fundamental im- portance for data analysis and statistical learning. On the theoretical side, the topological and geometric properties of data are of great help to analyze them and can be used for further learn- ing or classification tasks. On the algorithmic and applied side, understanding the underlying geometric structure of data can help face the so-called curse of dimensionality phenomenon, and down the road lead to drastic improvements in the complexity of algorithms. There exist various statistical and machine learning methods that intend to uncover the geometric structure of data, such as clustering, manifold learning and non linear dimensionality reduction, principal curves, sets estimation, to name a few. Most of them assume the underlying structure to have a very simple geometry, diffeomorphic to a disc or isometric to an open set of a Euclidean space. Furthermore the only topological information they seek for is connectivity. On another hand, with the emergence of distance based approaches and persistent topology, geometric inference and computational topology have recently known an important development. New mathematically well-founded theories gave birth to the field of Topological Data Analysis (TDA) which is knowing an increasing interest both in academy and industry. So far the obtained results rely mostly on deterministic assumptions which are not satisfactory from a statistical viewpoint. As a consequence the corresponding methods remain exploratory and do not benefit from a sound probabilistic framework. Despite a few notable attempts to overcome this issue, the development of a statistical approach to Topological Data Analysis is still in its infancy. As data are becoming larger and larger, the development of rigorous statistical approaches for TDA but also of algorithmic tools implementing them are challenges of fundamental importance that we intend to address.

Our conviction, at the root of this project, is that there is a real need to combine statistical and topological/geometric approaches in a common framework, in order to face the challenges raised by the inference and the study of topological and geometric properties of the wide variety of larger and larger available data. We are also convinced that these challenges need to be addressed from both the mathematical side and the algorithmic/application side. Our main objectives for the three years are two-fold. First on the theoretical side, we intend to set up and develop the mathematical and algorithmic foundations of Statistical Topological Data Analysis (Statistical TDA); second, we intend to develop software packages implementing state-of-the-art methods and making our tools accessible and easy to use to a broad audience of data scientists.

1. Mathematical foundations of Statistical TDA. Our objective is to show that, thanks to stability properties coming from geometric inference and persistent homology theory, topological and geometric properties of data, such as e.g. persistence diagrams, can be efficiently inferred in various statistical settings. We intend to propose various estimators coming with statistical guarantees (convergence results, convergence rates,...) to which classical statistical tools and constructions can be applied (bootstrap, confidence bands, tests,...). Our ultimate objective is to provide a well-founded and effective statistical toolbox for the understanding of the topology and geometry of data.

2. Efficient easy-to-use software tools and applications. The DataShape group recently released Gudhi , a C++ library -- supported by the ERC project of Jean-Daniel Boissonnat -- dedicated to state-of-the-art computational topology and geometry in high dimensional spaces. On the other hand, building on the first results of our starting collaboration, the CMU group recently released a R package, TDA for doing statistical analysis of persistent homology. We will combine our software developments efforts to design software tools for Statistical TDA that include robust and efficient computational topology algorithms for statistical TDA through a simple interface that can be used by data scientists without strong expertize in topology and geometry. We also intend to use this tools to validate our methods on real data.

Scientific results and publications 2016

Approximation and geometry of the reach
[1. E. Aamari, J. Kim, F. Chazal, B. Michel, A. Rinaldo, L. Wasserman]
In preparation (Expected on Dec. 2016).

The reach of a submanifold M in Rd is the smallest distance between a point of the medial axis of M – i.e. the sets of points of Rd that have at least 2 closest points on Rd – and M. It plays a fundamental role in many geometric inference problems but, as it depends both on the curvature of M and of the way M is embedded in Rd, it is known to be difficult to estimate. In this direction, we have designed a new algorithm to estimate the reach of a submanifold from a point cloud randomly sampled on it. We have studied the convergence rates of our algorithm (as a function of the sample size) under a rather general statistical model, and we have obtained almost minimax rates.

The distance-to-a-measure signature for a bootstrap test of isomorphism between metric measure spaces
[1. C. Brecheteau]
In preparation (Expected on Dec. 2016).

Building on the results for Distance-To-Measure (DTM) functions obtained during the first year, Claire Brécheteau (PhD student, Inria) proposed a new statistical test to compare, point clouds randomly sampled from metric spaces endowed with a probability measure. The main idea of this test is to compare probability distributions on the real line obtained as the push-forward of the empirical measures associated to the considered samples by the DTM function. As it relies on the DTM, this test takes into account the geometric structure of the data. Moreover, it allows to compare point clouds that are not necessarily embedded in an Euclidean space and to compare them up to isomorphism.

On-going projects and objectives 2016-17

Several projects are currently running and are expected to lead to new results by the end of 2015 or in 2016:

Statistical approaches for multi-dimensional persistence.
Persistent homology for density estimates and distance-to-measure.
Topological denoising.
Interfacing the Gudhi library to the R-package TDA (continuing).

Visits, exchanges and collaborations 2016

The EA activities and collaborations are organized around mutual visits between members of the two groups. Long-term visits (2-3 months) of Ph.D. students are of fundamental importance to maintain the dynamism of the group. Regular Skype meetings (every 1 or 2 weeks) are also organized to maintain close collaborations. The following visits have been organized in 2015:

Jisu Kim: visit to INRIA - 1 week (May) - Reach estimation and TDA package.
Eddie Aamari: visit to CMU - 1 month (July) - reach estimation (noise and unknown tangent space).
Jisu Kim: visit to INRIA - 1 week (December) - Reach estimation and TDA package.
Jaehyeok Shi : visit to INRIA - 2 weeks (October). Confidence bands for persistence landscape of kernel density estimators.

Scientific results and publications 2015

Subsampling Methods for Persistent Homology
[F. Chazal, B.T. Fasy, F. Lecci, B. Michel, A. Rinaldo, L. Wasserman]
In proc. International Conference on Machine Learning (ICML 2015).

Persistent homology is a multiscale method for analyzing the shape of sets and functions from point cloud data arising from an unknown distribution supported on those sets. When the size of the sample is large, direct computation of the persistent homology is prohibitive due to the combinatorial nature of the existing algorithms. We propose to compute the persistent homology of several subsamples of the data and then combine the resulting estimates. We study the risk of two estimators and we prove that the subsampling approach carries stable topological information while achieving a great reduction in computational complexity.

Robust Topological Inference: Distance To a Measure and Kernel Distance
[F. Chazal, B.T. Fasy, F. Lecci, B. Michel, A. Rinaldo, L. Wasserman]
Submitted.

Let P be a distribution with support S. The salient features of S can be quantified with persistent homology, which summarizes topological features of the sublevel sets of the distance function (the distance of any point x to S). Given a sample from P we can infer the persistent homology using an empirical version of the distance function. However, the empirical distance function is highly non-robust to noise and outliers. Even one outlier is deadly. The distance-to-a-measure (DTM), introduced by Chazal et al. (2011), and the kernel distance, introduced by Phillips et al. (2014), are smooth functions that provide useful topological information but are robust to noise and outliers. Chazal et al. (2014) derived concentration bounds for DTM. Building on these results, we derive limiting distributions and confidence sets, and we propose a method for choosing tuning parameters.

Visits, exchanges and collaborations 2015

Jisu Kim: visit to INRIA - 2 months (May 11 to July 17) - Reach estimation and TDA package.
Eddie Aamari: visit to CMU - 3 months (Sept. to December) - tangent space estimation, reach estimation (noise and unknown tangent space).
Frédéric Chazal: visit to CMU - 9 days (Sept. 10 to Sept. 18) - Meeting about on-going works and 2016 objectives.
Bertrand Michel: visit to CMU - 10 days (Sept. 9 to Sept. 18) - Meeting about on-going works and 2016 objectives.
Jessi Cisewski: visit to INRIA - 1 week (Oct.) - TDA applied to astrophysics.
Yen-Chi Chen: visit to INRIA - 10 days (Dec. expected).