Practical and Fast Momentum-Based Power Methods

Year: 2021
Type: Conference paper
Authors: Tahseen Rabbani, Apollo Jain, Arjun Rajkumar, Furong Huang
Source: 2nd Annual Conference on Mathematical and Scientific Machine Learning, Proceedings of Machine Learning Research, vol. 145:1-36, 2021.
URL: https://msml21.github.io/papers/id65.pdf

The power method is a classical algorithm with broad applications in machine learning tasks, including streaming PCA, spectral clustering, and low-rank matrix approximation. The distilled purpose of the vanilla power method is to determine the eigenvalue of largest modulus of a matrix, together with its associated eigenvector. A momentum-based scheme can be used to accelerate the power method, but achieving an optimal convergence rate with existing algorithms critically relies on additional spectral information that is unavailable at run-time, and sub-optimal initializations can result in divergence. In this paper, we provide a pair of novel momentum-based power methods, which we call the delayed momentum power method (DMPower) and a streaming variant, the delayed momentum streaming method (DMStream). Our methods leverage inexact deflation and are capable of achieving near-optimal convergence with far less restrictive hyperparameter requirements. We provide convergence analyses for both algorithms through the lens of perturbation theory. Further, we experimentally demonstrate that DMPower routinely outperforms the vanilla power method and that both algorithms match the convergence speed of an oracle running existing accelerated methods with perfect spectral knowledge.
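
For context, the momentum scheme referenced in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration of heavy-ball power iteration, not the paper's DMPower algorithm: the update $x_{t+1} = Ax_t - \beta x_{t-1}$ achieves its best rate at $\beta = \lambda_2^2/4$, which presumes knowledge of the second eigenvalue $\lambda_2$; the convergence test and default parameters below are illustrative assumptions.

```python
import numpy as np

def momentum_power_method(A, beta=0.0, tol=1e-8, max_iter=10_000, seed=0):
    """Power iteration with an optional heavy-ball momentum term.

    beta=0 recovers the vanilla power method. The optimal beta equals
    lambda_2**2 / 4, which requires the second eigenvalue, unknown at
    run-time (the hyperparameter gap DMPower addresses).
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[0])
    x /= np.linalg.norm(x)
    x_prev = np.zeros_like(x)
    for _ in range(max_iter):
        x_next = A @ x - beta * x_prev        # heavy-ball update
        scale = np.linalg.norm(x_next)
        x_prev = x / scale                    # rescale history by the same factor
        x_old, x = x, x_next / scale
        if 1.0 - abs(x @ x_old) < tol:        # successive iterates aligned up to sign
            break
    return x @ A @ x, x                       # Rayleigh quotient, unit eigenvector
```

On a small symmetric test matrix, `momentum_power_method(A)` returns the dominant eigenvalue and a unit eigenvector; a well-chosen `beta` accelerates convergence, while a poorly chosen one can diverge, mirroring the sensitivity noted in the abstract.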

Performance of DMPower on Spectral Clustering. We depict the performance of DMPower ($\rho=\sqrt[3]{\epsilon}$) on dividing a collection of data points into their natural, geometrically-partitioned clusters. The first series (a-e) is the concentric-circles dataset; the second series (f-j) is the half-moons dataset. In both series, the first image depicts the unlabeled arrangement, the second depicts a naive application of k-means without spectral clustering, and the last three depict k-means assisted by spectral clustering over progressively tighter $\epsilon$ error thresholds (thus requiring more accurate approximations of the affinity matrix's eigenvectors). As $\epsilon$ shrinks, the separation of the data improves, achieving perfect classification by $\epsilon=10^{-8}$.
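
To make the pipeline in this caption concrete, below is a minimal spectral-clustering sketch on the half-moons dataset. It assumes a Gaussian affinity kernel with a hypothetical bandwidth `sigma`, and substitutes NumPy's dense eigensolver for the paper's DMPower-based eigenvector approximation; the dataset size and parameters are illustrative, not taken from the experiments.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

def spectral_clusters(X, k=2, sigma=0.1):
    """k-means on the leading eigenvectors of a normalized graph Laplacian.

    The eigenvectors are computed exactly here for brevity; in the paper's
    experiments they are approximated iteratively to a target epsilon accuracy.
    """
    # Gaussian (RBF) affinity matrix over pairwise squared distances.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma**2))
    # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    L = np.eye(len(X)) - W / np.sqrt(np.outer(d, d))
    # The k smallest eigenvectors of L span the cluster-indicator space.
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :k]
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # row-normalize the embedding
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)

X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)
labels = spectral_clusters(X)   # separates the two moons that raw k-means on X misses
```

Running `KMeans` directly on `X` splits the moons incorrectly, while clustering in the eigenvector embedding recovers them, the same contrast the caption describes between the naive k-means panels and the spectrally assisted ones.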