ICML UDL POSTER
Deeper Connections between Neural Networks and Gaussian Processes
Speed-up Active Learning
Evgenii Tsymbalov, Sergei Makarychev, Alexander Shapeev, Maxim Panov
Skoltech, Russia
Online poster
ICML UDL Workshop
June 14, 2019
Active learning
- In many applications, labeled (annotated) data is very limited.
- Unlabelled data is usually widely available.
- Labeling (annotation) is often expensive.
- Thus, a clever choice of points to annotate is needed.
- Active learning: use the machine learning model itself to select the points to annotate (see the sketch after this list).
- Applications: industrial design, chemoinformatics, materials design, human annotation (NLP, images), ...
- We focus on high-dimensional regression problems.
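A minimal sketch of the pool-based active-learning loop we have in mind (all names are illustrative: model.fit, model.uncertainty and the annotate oracle are hypothetical stand-ins, not an actual API):

    import numpy as np

    def active_learning_loop(model, x_train, y_train, x_pool, annotate,
                             n_iterations=10, batch_size=20):
        # Repeatedly: train the model, score the unlabeled pool by uncertainty,
        # annotate the most uncertain points, and move them to the training set.
        for _ in range(n_iterations):
            model.fit(x_train, y_train)
            scores = model.uncertainty(x_pool)         # acquisition scores
            idx = np.argsort(scores)[-batch_size:]     # most uncertain pool points
            y_new = annotate(x_pool[idx])              # query the labeling oracle
            x_train = np.vstack([x_train, x_pool[idx]])
            y_train = np.concatenate([y_train, y_new])
            x_pool = np.delete(x_pool, idx, axis=0)
        return model, x_train, y_train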
Problem statement
Active learning needs uncertainty estimates (UE) to choose the points, and these are:
- natural for Gaussian Processes (see the sketch after this list);
- easy to compute for Random Forests;
- ...but things are harder for Neural Networks.
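For a GP, the predictive standard deviation is a built-in UE; a minimal sketch with scikit-learn (the kernel and the toy data are placeholders):

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    x_train = np.random.uniform(-1, 1, size=(50, 2))
    y_train = np.sin(x_train).sum(axis=1)
    x_pool = np.random.uniform(-2, 2, size=(1000, 2))

    gp = GaussianProcessRegressor(kernel=RBF(), alpha=1e-6).fit(x_train, y_train)
    mean, std = gp.predict(x_pool, return_std=True)   # std is the uncertainty estimate
    query = x_pool[np.argsort(std)[-10:]]             # 10 most uncertain pool points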
UE for NNs
Types of uncertainty estimates for NNs:
- ensembling (accurate but costly);
- Bayesian NNs (natural, but it may be complicated to reach state-of-the-art accuracy);
- Dropout-based (stochastic output: MC-Dropout; see the sketch after this list).
Yet some problems arise with MC-Dropout:
- it is hard to sample more than one point efficiently;
- predictions are overconfident for out-of-sample points.
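A minimal MC-Dropout sketch in PyTorch (the model and the number of passes T are assumptions): keep dropout layers active at prediction time and use the spread of the stochastic outputs as the UE.

    import torch

    def mc_dropout_std(model, x, T=100):
        # Standard deviation of T stochastic forward passes with dropout kept on.
        model.eval()
        for m in model.modules():                  # switch only dropout layers to train mode
            if isinstance(m, torch.nn.Dropout):
                m.train()
        with torch.no_grad():
            samples = torch.stack([model(x) for _ in range(T)])   # (T, batch, out_dim)
        return samples.std(dim=0)                  # per-point uncertainty estimate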
From NN to GP
- Connections between neural networks and Gaussian processes have recently gained significant attention [Matthews et al., 2018], [Lee et al., 2017] (both ICLR'18).
- They show that NNs with purely random weights can be approximated by a GP in the infinite network width limit.
Here we consider a simple fully connected NN with dropout between the hidden layers. We focus on the output values at different input points for different realizations of the dropout mask.
- When two points x1 and x2 are close to each other in the feature space, the MC-Dropout realizations of the trained NN are correlated, and their distributions are Gaussian-like.
- When two points x1 and x3 are far from each other in the feature space, the correlation is lost, yet the distributions are still fairly Gaussian (see the sketch below).
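A sketch of this observation (the dropout realizations are a random stand-in here; in practice they come from repeated stochastic forward passes of the trained NN, as in the MC-Dropout snippet above):

    import numpy as np

    # outputs[t, i] = NN output at point x_i under the t-th dropout mask
    T, n_points = 1000, 3                          # x1, x2 (close), x3 (far)
    outputs = np.random.randn(T, n_points)         # stand-in for real dropout realizations

    corr = np.corrcoef(outputs, rowvar=False)      # (n_points, n_points) correlation matrix
    # For the nearby pair (x1, x2) the off-diagonal entry is close to 1;
    # for the distant pair (x1, x3) it is close to 0, while each marginal stays near-Gaussian.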
Algorithm
Based on the observed near-Gaussian behaviour, we propose the following algorithm (a minimal code sketch is given after the list of benefits below).
Schematic representation
Benefits of the proposed approach:
- GPs allow sampling points sequentially by recomputing the uncertainty estimates.
- GP uncertainty estimates are high for out-of-sample points.
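A minimal sketch of the resulting acquisition step as we read it from the poster (regularization, the sampling scheme and the jitter value are assumptions; see the paper for the exact procedure): estimate the covariance of dropout realizations jointly on training and pool points, treat it as a GP covariance, and rank pool points by the GP posterior variance.

    import numpy as np

    def nngp_posterior_variance(train_samples, pool_samples, jitter=1e-6):
        # train_samples: (T, n_train), pool_samples: (T, n_pool) --
        # NN outputs under T dropout masks at the training and pool points.
        joint = np.hstack([train_samples, pool_samples])
        K = np.cov(joint, rowvar=False)                       # empirical GP covariance
        n_train = train_samples.shape[1]
        K_tt = K[:n_train, :n_train] + jitter * np.eye(n_train)
        K_tp = K[:n_train, n_train:]
        K_pp = K[n_train:, n_train:]
        # GP posterior variance of the pool points given the training points
        return np.diag(K_pp - K_tp.T @ np.linalg.solve(K_tt, K_tp))

Pool points with the largest posterior variance are annotated first; after a point is added, the variance can be cheaply recomputed to select the next one sequentially.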
Experiments
Airline delays
- Comparison on the airline delays dataset [Hensman et al., 2017] with Bayes-by-Backprop and the Noise Contrastive Prior (NCP) from [Hafner et al., 2018].
- Small 50 x 2 NN, NCP-based loss function.
- 50K train set, 100K test set (shifted in time).
- Here and below, MCDUE refers to the Monte-Carlo Dropout Uncertainty Estimate, NNGP to the proposed approach, and the NCP suffix denotes the NCP-based loss function. Other labels are as in [Hafner et al., 2018].
UCI datasets
We compare the proposed approach with MCDUE and random sampling on a variety of UCI regression datasets:
The comparison is made by means of Dolan-Moré curves (see the sketch below).
The 6 x 25 = 125 experiments result in the following curve:
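A sketch of how a Dolan-Moré (performance-profile) curve can be computed (the error matrix is a placeholder): normalize each method's error by the best error on each problem, then plot the share of problems where this ratio stays below a threshold.

    import numpy as np

    def dolan_more_curve(errors, taus):
        # errors: (n_problems, n_methods), e.g. the final RMSE of each method on each experiment.
        # Returns rho[k, s]: share of problems where method s is within factor taus[k] of the best.
        ratios = errors / errors.min(axis=1, keepdims=True)   # performance ratios r_{p,s}
        return np.array([(ratios <= tau).mean(axis=0) for tau in taus])

    taus = np.linspace(1.0, 3.0, 100)
    # rho = dolan_more_curve(rmse_per_experiment, taus)   # higher curve = better method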
SchNet
- SchNet is a state-of-the-art deep learning architecture for molecules and materials.
- For UE, we use dropout placed between the fully connected layers.
- For energy prediction on the QM9 dataset, the experiment showed a 25% decrease in RMSE.
- Another view: to reach an error of 2 kcal/mol, we need half as much additional data, which is very valuable for costly quantum-mechanical calculations!
Summary
- A novel method for UE in deep neural networks;
- State-of-the-art results in the context of active learning for different problems and architectures;
- To be presented at IJCAI 2019.
Growth areas:
- applicability of modern methods for GP speed-up to improve scalability and accuracy;
- CNN applications for images, RNN applications for text;
- calibrated UE;
- cost-sensitive AL.
Acknowledgements
E.T. and A.S. were supported by the Skoltech NGP Program No. 2016-7/NGP (a Skoltech-MIT joint project).