Table of contents for Large scale kernel machines / edited by Léon Bottou ... [et al.].

Contents
Preface v
1 Support Vector Machine Solvers 1
Léon Bottou, Chih-Jen Lin
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Sparsity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5 Early SVM Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.6 The Decomposition Method . . . . . . . . . . . . . . . . . . . . . . . 16
1.7 A Case Study: LIBSVM . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.8 Conclusion and Outlook . . . . . . . . . . . . . . . . . . . . . . . . . 26
2 Training a Support Vector Machine in the Primal 29
Olivier Chapelle
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2 Links between primal and dual optimization . . . . . . . . . . . . . . 30
2.3 Primal objective function . . . . . . . . . . . . . . . . . . . . . . . . 32
2.4 Newton optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.6 Advantages of primal optimization . . . . . . . . . . . . . . . . . . . 44
2.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3 Fast Kernel Learning with Sparse Inverted Index 51
Patrick Haffner, Stephan Kanthak
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2 Sequential Kernel Learning . . . . . . . . . . . . . . . . . . . . . . . 52
3.3 Sparse Matrix-Vector Multiplication . . . . . . . . . . . . . . . . . . 55
3.4 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.5 Speeding up SVM training . . . . . . . . . . . . . . . . . . . . . . . . 65
3.6 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.7 Large-Scale Experiments . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.8 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4 Large Scale Learning with String Kernels 73
Sören Sonnenburg, Gunnar Rätsch, Konrad Rieck
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.2 String Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3 Sparse Feature Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.4 Speeding up SVM Training and Testing . . . . . . . . . . . . . . . . 83
4.5 Benchmark Experiments . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.6 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5 Large Scale Parallel SVM Implementation 105
Igor Durdanovic, Eric Cosatto, Hans-Peter Graf
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.2 Accelerating SVMs: Previous approaches . . . . . . . . . . . . . . . . 107
5.3 The Sequential Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 111
5.4 Spread-Kernel Optimization (full data) (SKOFD) . . . . . . . . . . . 118
5.5 Spread-Kernel Optimization (split data) (SKOSD) . . . . . . . . . . 121
5.6 Networking Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.7 Theoretical Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.8 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6 A Distributed Sequential Solver for Large Scale SVMs 141
Elad Yom-Tov
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
6.2 A distributed solution of the SVM problem using a sequential solver 146
6.3 Metrics for comparing SVMs . . . . . . . . . . . . . . . . . . . . . . 147
6.4 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
6.5 A model of the computational time . . . . . . . . . . . . . . . . . . . 153
6.6 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
7 Newton Methods for Fast Semi-supervised Linear SVMs 157
Vikas Sindhwani, S. Sathiya Keerthi
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
7.2 Modified Finite Newton Linear ℓ2-SVM . . . . . . . . . . . . . . . . 159
7.3 Fast Multi-switch Transductive SVMs . . . . . . . . . . . . . . . . . 163
7.4 Semi-supervised SVMs based on Deterministic Annealing . . . . . . 166
7.5 Empirical Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
7.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
8 The Improved Fast Gauss Transform with Applications to Machine Learning 177
Vikas Chandrakant Raykar, Ramani Duraiswami
8.1 Computational curse of non-parametric methods . . . . . . . . . . . 177
8.2 Bottleneck computational primitive: Weighted superposition of kernels 178
8.3 Structured matrices and ε-exact approximation . . . . . . . . . . . . 179
8.4 Motivating example: polynomial kernel . . . . . . . . . . . . . . . . 180
8.5 Sum of Gaussian kernels: the discrete Gauss transform . . . . . . . . 181
8.6 Bringing computational tractability to the discrete Gauss transform 182
8.7 Multi-index notation . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
8.8 The improved fast Gauss transform . . . . . . . . . . . . . . . . . . . 187
8.9 IFGT vs FGT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
8.10 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . 193
8.11 Fast multivariate kernel density estimation . . . . . . . . . . . . . . 198
8.12 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
9 Approximation Methods for Gaussian Process Regression 205
Joaquin Quiñonero-Candela, Carl Edward Rasmussen, Christopher K. I. Williams
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
9.2 Gaussian Process Regression . . . . . . . . . . . . . . . . . . . . . . 206
9.3 Sparse Approximations Based on Inducing Variables . . . . . . . . . 208
9.4 Fast Matrix Vector Multiplication (MVM) Approximations . . . . . 219
9.5 Selecting the Inducing Variables . . . . . . . . . . . . . . . . . . . . 220
9.6 Approximate Evidence and Hyperparameter Learning . . . . . . . . 222
9.7 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
9.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
10 Brisk Kernel ICA 227
Stefanie Jegelka, Arthur Gretton
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
10.2 Independent component analysis . . . . . . . . . . . . . . . . . . . . 229
10.3 Independence measures based on RKHS covariance operators . . . . 230
10.4 Gradient descent on the orthogonal group . . . . . . . . . . . . . . . 234
10.5 Complexity analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
10.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
10.7 Conclusions and Future Directions . . . . . . . . . . . . . . . . . . . 250
11 Building SVMs with Reduced Classifier Complexity 253
S. Sathiya Keerthi, Olivier Chapelle, Dennis DeCoste
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
11.2 The Basic optimization . . . . . . . . . . . . . . . . . . . . . . . . . 257
11.3 Selection of new basis element . . . . . . . . . . . . . . . . . . . . . . 259
11.4 Hyperparameter tuning . . . . . . . . . . . . . . . . . . . . . . . . . 265
11.5 Comparison with Kernel Matching Pursuit . . . . . . . . . . . . . . 266
11.6 Additional tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
11.7 Comparison with standard SVM training . . . . . . . . . . . . . . . 272
11.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
12 Trading Convexity for Scalability 277
Ronan Collobert, Fabian Sinz, Jason Weston, Léon Bottou
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
12.2 The Concave-Convex Procedure . . . . . . . . . . . . . . . . . . . . . 278
12.3 Non-Convex SVMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
12.4 Experiments with Non-Convex SVMs . . . . . . . . . . . . . . . . . . 283
12.5 Non-Convex Transductive SVMs . . . . . . . . . . . . . . . . . . . . 287
12.6 Experiments with TSVMs . . . . . . . . . . . . . . . . . . . . . . . . 294
12.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
13 Training Invariant SVMs using Selective Sampling 303
Gaëlle Loosli, Léon Bottou, Stéphane Canu
13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
13.2 Online algorithm with selective sampling . . . . . . . . . . . . . . . . 306
13.3 Invariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
13.4 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
13.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
14 Scaling Learning Algorithms towards AI 323
Yoshua Bengio, Yann LeCun
14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
14.2 Learning Models Towards AI . . . . . . . . . . . . . . . . . . . . . . 326
14.3 Learning Architectures, Shallow and Deep . . . . . . . . . . . . . . . 331
14.4 Fundamental Limitation of Local Learning . . . . . . . . . . . . . . . 339
14.5 Deep Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
14.6 Experiments with Visual Pattern Recognition . . . . . . . . . . . . . 350
14.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
References 363
Contributors 389
Index 393

Library of Congress Subject Headings for this publication:

Data structures (Computer science).
Machine learning.