Biometrika. Author manuscript; available in PMC 2022 Feb 4.

Statistical properties of sketching algorithms

Summary

Sketching is a probabilistic data compression technique that has been largely developed by the computer science community. Numerical operations on big datasets can be intolerably slow; sketching algorithms address this issue by generating a smaller surrogate dataset. Typically, inference proceeds on the compressed dataset. Sketching algorithms generally use random projections to compress the original dataset, and this stochastic generation process makes them amenable to statistical analysis. We argue that the sketched data can be modelled as a random sample, thus placing this family of data compression methods firmly within an inferential framework. In particular, we focus on the Gaussian, Hadamard and Clarkson–Woodruff sketches and their use in single-pass sketching algorithms for linear regression with huge samples. We explore the statistical properties of sketched regression algorithms and derive new distributional results for a large class of sketching estimators. A key result is a conditional central limit theorem for data-oblivious sketches. An important finding is that the best choice of sketching algorithm in terms of mean squared error is related to the signal-to-noise ratio in the source dataset. Finally, we demonstrate the theory and the limits of its applicability on two datasets.

Keywords: Computational efficiency, Random projection, Randomized numerical linear algebra, Sketching

1. Introduction

Sketching is a general probabilistic data compression technique involving random projections (Cormode, 2011). Even routine calculations can be prohibitively computationally expensive if performed on massive datasets. Computational time can be reduced to an acceptable level by allowing some approximation error in the results. Sketching algorithms simplify the computational task by generating a compressed version of the original dataset that then serves as a surrogate for calculations. The compressed dataset is referred to as a sketch, because it acts as a compact representation of the full dataset. Sketching algorithms use a randomized compression stage, which makes them interesting from a statistical viewpoint. Sketching algorithms for linear regression have attracted significant attention in the numerical linear algebra and theoretical computer science communities (Mahoney, 2011; Woodruff, 2014).

To describe sketched regression in more detail, we first assume that the data consist of a length-n response vector y and an n × p matrix of covariates, X, which is of full rank. It is assumed throughout that n > p. The objective is to find the least squares coefficients. Given sufficient computational resources, these can be computed exactly as

\beta_F = (X^T X)^{-1} X^T y,

where the subscript F indicates the connection to the full dataset. Only two quantities are needed to determine β_F: the Gram matrix X^T X and the marginal associations X^T y. Calculation of X^T X requires O(np^2) operations, while computation of X^T y needs only O(np) calculations. There are two broad methods for sketched regression, namely complete sketching and partial sketching. Complete sketching is based on approximating both X^T X and X^T y, whereas partial sketching approximates only the Gram matrix. Drineas et al. (2006) established many important results for complete sketching, and Dhillon et al. (2013) and Pilanci & Wainwright (2016) derived foundational results for partial sketching.

Sketching algorithms use random linear mappings to reduce the size of the dataset from n to k observations. The random linear mapping can be represented as a k × n sketching matrix S. Complete sketching generates a length-k sketched response vector ỹ and a k × p matrix of sketched predictors X̃. The sketched data are computed through the linear mappings ỹ = Sy and X̃ = SX. Assuming that X̃ is of rank p, the complete sketching estimator β_S is defined to be the set of least squares coefficients using the sketched responses and predictors,

\beta_S = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T \tilde{y}.
(1)

The partial sketching estimator, β_P, is defined as

\beta_P = (\tilde{X}^T \tilde{X})^{-1} X^T y.
(2)

The key difference between (1) and (2) is that the partial sketching estimator β_P is constructed using the exact marginal associations X^T y. Given the sketched data, computation of β_S or β_P requires only O(kp^2) operations, compared with the O(np^2) operations required for β_F.

There is a large literature concerned with designing appropriate distributions for the random sketching matrix S. Our focus is on data-oblivious random projections, such that the distribution of the sketching matrix is not a function of the source data (y, X). An example is the Gaussian sketch, where each element is independently distributed as an N(0, 1/k) variate. We also consider the Hadamard sketch and the Clarkson–Woodruff sketch, random projections that exploit structure and sparsity for computational efficiency. A motivation for this work is that there are no clear ties between data-oblivious random projections and classical subsampling techniques.
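To fix ideas, the following minimal numpy sketch implements a Gaussian sketch together with the estimators in (1) and (2); the function names, the simulated data and the choices of n, p and k are illustrative rather than taken from the paper.

```python
import numpy as np

def gaussian_sketch(n, k, rng):
    """k x n sketching matrix with independent N(0, 1/k) entries."""
    return rng.normal(scale=1.0 / np.sqrt(k), size=(k, n))

def complete_sketch_estimator(S, X, y):
    """beta_S = (X~^T X~)^{-1} X~^T y~ with y~ = S y, X~ = S X; equation (1)."""
    Xt, yt = S @ X, S @ y
    return np.linalg.solve(Xt.T @ Xt, Xt.T @ yt)

def partial_sketch_estimator(S, X, y):
    """beta_P = (X~^T X~)^{-1} X^T y; only the Gram matrix is sketched; equation (2)."""
    Xt = S @ X
    return np.linalg.solve(Xt.T @ Xt, X.T @ y)

# Toy illustration with simulated data (n, p and k are arbitrary choices).
rng = np.random.default_rng(0)
n, p, k = 20_000, 10, 500
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)
S = gaussian_sketch(n, k, rng)
beta_F = np.linalg.solve(X.T @ X, X.T @ y)      # full least squares
beta_S = complete_sketch_estimator(S, X, y)     # equation (1)
beta_P = partial_sketch_estimator(S, X, y)      # equation (2)
```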

Most existing results on the accuracy of sketching are universal worst-case bounds (Woodruff, 2014; Mahoney & Drineas, 2016). This is typical for randomized algorithms; however, a more detailed error analysis can provide important insights (Halko et al., 2011). We investigate the statistical properties of β_P and β_S when data-oblivious sketches are used. An important finding is that the signal-to-noise ratio in the source dataset strongly influences the relative efficiency of complete to partial sketching. The statistical analysis also allows the construction of exact confidence intervals for the Gaussian sketch and asymptotic confidence intervals for other random projections, paving the way for their wider use in the statistical community.

At its core, sketched regression is a randomized algorithm for approximate computation of β_F. Repeated application of the sketching algorithm to the same dataset will produce different results. The first stage in our analysis is to establish the distributional properties of the sketching estimators with the source dataset held fixed. An important result is a conditional central limit theorem for the sketched dataset that connects the Hadamard and Clarkson–Woodruff projections to the Gaussian sketch. The conditional analysis of the randomized algorithms is then extended to cover situations where sketching is used for approximate statistical inference. Given a statistical model for the response y = Xβ_0 + ε, with population parameter β_0 and error term ε, distributional properties of β_P and β_S can be determined by integrating over the conditional distributions of the sketching estimators that take y and X as fixed.

2. Background and related work

2.1. Preliminaries

We define a number of quantities related to the full dataset before moving on. The total, residual and model sums of squares are given by TSS_F = y^T y, RSS_F = ‖y − Xβ_F‖_2^2 and MSS_F = ‖Xβ_F‖_2^2, respectively, with TSS_F = MSS_F + RSS_F. The proportion of variance explained by the model is R_F^2 = MSS_F / TSS_F. These values will be important in characterizing the behaviour of β_S and β_P. The source data are generically represented by the n × d matrix A = (y, X).

There are two general categories of distributions for the random matrix S: data-aware random projections and data-oblivious random projections. A data-aware random projection uses information in the source data (y, X) to generate S. In contrast, a data-oblivious random projection can be sampled without knowledge of y or X. Data-aware random projections are closely connected to finite population sampling methods in the statistics literature (Ma & Sun, 2015). Our main focus is on data-oblivious random projections, as their mechanism for data compression is not obviously tied to subsampling. Data-oblivious random projections generate a dataset of k pseudo-observations using the source dataset as a component in the generative process.

2.2. Data-oblivious sketches

The Gaussian sketch was one of the first projections proposed for sketched regression (Sarlos, 2006). Recall that a Gaussian sketch is formed by independently sampling each element of S from an N(0, 1/k) distribution. A drawback of the Gaussian sketch is that computation of the sketched data is quite demanding, taking O(ndk) operations. Therefore, work has been done on designing more computationally efficient random projections. Woodruff (2014) gives an excellent survey of work in this area.

The Hadamard sketch is a structured random matrix (Ailon & Chazelle, 2009). The sketching matrix is formed as S = Φ HD / √k, where Φ is a k × n matrix and H and D are both n × n matrices. The fixed matrix H is a Hadamard matrix of order n. A Hadamard matrix is a square matrix with elements that are either +1 or −1 and orthogonal rows. Although Hadamard matrices do not exist for all integers n, the source dataset can be padded with zeros so that a conformable Hadamard matrix is available. The matrix D is a diagonal matrix whose n diagonal entries are independent Rademacher random variables. The random matrix Φ subsamples k rows of H with replacement. The structure of the Hadamard sketch allows for fast matrix multiplication, reducing calculation of the sketched dataset to O(nd log k) operations.
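The following snippet is one way to realise S = ΦHD/√k with standard scientific Python tools; it is a dense, illustrative construction (the data are zero-padded up to a power of two so that scipy.linalg.hadamard applies, and the helper name is ours), and it does not use the fast Walsh–Hadamard transform that yields the O(nd log k) cost quoted above.

```python
import numpy as np
from scipy.linalg import hadamard

def hadamard_sketch(A, k, rng):
    """Apply a Hadamard sketch S = Phi H D / sqrt(k) to the n x d matrix A.

    The projection is formed densely for clarity; the O(nd log k) cost quoted
    in the text requires the fast Walsh-Hadamard transform instead.
    """
    n, d = A.shape
    n_pad = 1 << (n - 1).bit_length()                 # next power of two
    A_pad = np.vstack([A, np.zeros((n_pad - n, d))])  # zero-pad the source data
    D = rng.choice([-1.0, 1.0], size=n_pad)           # Rademacher diagonal of D
    H = hadamard(n_pad)                               # +/-1 matrix with orthogonal rows
    rows = rng.integers(0, n_pad, size=k)             # Phi: sample k rows with replacement
    return (H[rows] @ (D[:, None] * A_pad)) / np.sqrt(k)

# Example: rng = np.random.default_rng(0); A_sk = hadamard_sketch(A, k=500, rng=rng)
```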

The Clarkson–Woodruff sketch is a sparse random matrix (Clarkson & Woodruff, 2013). The projection can be represented as the product of two independent random matrices, S = ΓD, where Γ is a random k × n matrix and D is a random n × n matrix. The matrix Γ is initialized as a matrix of zeros. Independently in each column, one element is selected uniformly at random and set to +1. The matrix D is a diagonal matrix whose n diagonal entries are independent Rademacher random variables. The sparsity of the Clarkson–Woodruff sketch speeds up matrix multiplication, decreasing the complexity of generating the sketched dataset to O(nd).
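A sparse construction of the Clarkson–Woodruff projection might look as follows; the helper name and the use of scipy.sparse are our own choices. Recent SciPy releases also provide scipy.linalg.clarkson_woodruff_transform, which can be used instead of this hand-rolled version.

```python
import numpy as np
from scipy.sparse import csr_matrix

def clarkson_woodruff_sketch(A, k, rng):
    """Apply a Clarkson-Woodruff sketch S = Gamma D to the n x d matrix A.

    Each column of Gamma has a single 1 in a uniformly chosen row, and D holds
    independent Rademacher signs, so forming S A costs O(nd) operations.
    """
    n, d = A.shape
    rows = rng.integers(0, k, size=n)             # row of the nonzero entry in each column
    signs = rng.choice([-1.0, 1.0], size=n)       # Rademacher diagonal of D
    S = csr_matrix((signs, (rows, np.arange(n))), shape=(k, n))
    return S @ A
```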

3. Gaussian sketching

3.1. Complete sketching

The Gaussian sketch is mathematically tractable, and it is possible to establish a number of exact finite-sample results regarding the performance of the sketching estimators. In this section we derive the distribution of β_S in the case where a Gaussian sketch is used. As mentioned previously, all results treat y and X as fixed. The variability in β_S is solely due to the use of the random sketching matrix S. Let (ỹ_j, x̃_j^T) (j = 1, ..., k) refer to the jth row of the sketched data matrix Ã = (ỹ, X̃). Similarly, let s_j^T denote the jth row of the sketching matrix S. The sketched dataset consists of k random units (ỹ_j, x̃_j^T) (j = 1, ..., k). The jth sketched response is given by ỹ_j = s_j^T y and the jth sketched predictor is calculated as x̃_j^T = s_j^T X (j = 1, ..., k). The k sketched instances are independently distributed, because the rows of the sketching matrix are independent.

It can be shown that the joint distribution of the sketched data, p(ỹ | X̃, y, X) p(X̃ | y, X), has the structure of a hierarchical Gaussian linear model. The sketched dataset has a multivariate normal distribution, conditional on the source dataset. This is because the sketched dataset can be expressed as a linear combination of Gaussian random variables. Specifically, row j in the sketched dataset is (ỹ_j, x̃_j^T) = s_j^T A. Given the source dataset A = (y, X), A^T s_j is a linear combination of independent Gaussians, as s_j ~ N(0, I_n / k), and so (ỹ_j, x̃_j^T) must be jointly normally distributed, conditional on the source data A = (y, X). It is easily shown that the conditional joint distribution of the sketched responses and predictors is then

\begin{pmatrix} \tilde{y}_j \\ \tilde{x}_j \end{pmatrix} \Bigg|\, y, X \sim N\left\{ \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \; \frac{1}{k} \begin{pmatrix} y^T y & y^T X \\ X^T y & X^T X \end{pmatrix} \right\} \quad (j = 1, \ldots, k).

From standard results on the multivariate normal distribution, it follows that the conditional distribution of ỹ_j given x̃_j is also normal, with conditional mean E_S(ỹ_j | x̃_j, y, X) = x̃_j^T β_F. The subscript S is used with the expectation operator to emphasize that the only random quantity is the sketching matrix. The conditional distribution of ỹ_j given the sketched predictors x̃_j and the source dataset (y, X) is

\tilde{y}_j \mid \tilde{x}_j, y, X \sim N\left( \tilde{x}_j^T \beta_F, \; \frac{\mathrm{RSS}_F}{k} \right) \quad (j = 1, \ldots, k).

This is the exact form of a standard Gaussian linear model, where the regression coefficient is β_F and the conditional variance is RSS_F / k. The distribution p(X̃ | y, X) is easily obtained, as the marginal distribution of x̃_j is also multivariate normal,

\tilde{x}_j \mid y, X \sim N(0, \; X^T X / k) \quad (j = 1, \ldots, k).

A Gaussian sketch effectively simulates a series of observations from a Gaussian linear model parameterized in terms of β_F and RSS_F, where the design matrix has a matrix normal distribution. The distribution of β_S conditional on the sketched predictors X̃ follows immediately from standard results on linear models (Searle, 1997, Ch. 3). To obtain the marginal distribution of β_S, it is necessary to integrate over the random sketched design matrix X̃. Using properties of the normal distribution (Eaton, 2007), it is possible to show that (X̃^T X̃) | y, X ~ Wishart(k, X^T X / k). Hence,

(\tilde{X}^T \tilde{X})^{-1} \mid y, X \sim \mathrm{IW}\{k, \; k (X^T X)^{-1}\},

where IW denotes the inverse Wishart distribution. The marginal distribution of β_S can then be described using the normal inverse Wishart distribution (Gelman et al., 2014, p. 73). The following theorem characterizes the distribution of β_S under the Gaussian sketch.

Theorem 1

Suppose that β_S is computed using a Gaussian sketch and that k ⩾ p. Then:

  • (i)
    the conditional distribution of β_S is
    \beta_S \mid \tilde{X}, y, X \sim N\{ \beta_F, \; (\mathrm{RSS}_F / k) (\tilde{X}^T \tilde{X})^{-1} \};
  • (ii)
    the marginal distribution of β_S is
    \beta_S \mid y, X \sim \mathrm{Student}\{ \beta_F, \; \mathrm{RSS}_F (X^T X)^{-1} / (k - p + 1), \; k - p + 1 \}.

For a proof see the Supplementary Material.

An immediate consequence of (i) is the ability to generate exact confidence intervals for the elements of β_S, an approach that does not seem to have been considered in the existing literature; a minimal implementation is sketched at the end of this subsection. The variance of β_S,

\mathrm{var}(\beta_S \mid y, X) = \frac{\mathrm{RSS}_F}{k - p - 1} (X^T X)^{-1},
(3)

is not dependent on the compression ratio k/n. Although RSS_F can be expected to grow linearly with n, this will generally be counterbalanced by (X^T X)^{-1} decreasing linearly with n.
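As a concrete illustration of the interval construction implied by Theorem 1(i), here is a minimal sketch; the function name is ours, and it assumes RSS_F (or a pilot estimate of it) is available to the analyst.

```python
import numpy as np
from scipy import stats

def gaussian_sketch_intervals(X_sk, y_sk, rss_f, level=0.95):
    """Exact confidence intervals for the elements of beta_F via Theorem 1(i).

    Conditional on the sketched predictors, beta_S is normal with mean beta_F
    and covariance (RSS_F / k)(X~^T X~)^{-1}; rss_f is assumed to be supplied
    by the caller (from the full data or a pilot estimate).
    """
    k = X_sk.shape[0]
    G_inv = np.linalg.inv(X_sk.T @ X_sk)
    beta_s = G_inv @ (X_sk.T @ y_sk)
    se = np.sqrt(rss_f / k * np.diag(G_inv))
    z = stats.norm.ppf(0.5 + level / 2)
    return beta_s - z * se, beta_s + z * se
```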

3.2. Partial sketching

Partial sketching was first proposed by Dhillon et al. (2013) using uniform subsampling, and was later studied for general sketches by Pilanci & Wainwright (2016). Existing results on partial sketching highlight that the model sum of squares influences the approximation error of the partial sketching estimator β_P. It is easy to see that the variance of the partial sketching estimator will not be a function of the residual sum of squares. From the normal equations it follows that X^T y = X^T X β_F. Using this property, we see that, conditional on y and X, the variance of the random linear combination β_P = (X^T S^T S X)^{-1} X^T y = (X^T S^T S X)^{-1} X^T X β_F will be a function of the covariates X and the fitted values X β_F. The residual vector has no influence on the variance of the partial sketching estimator, and as such the variance of β_P will not be related to the residual sum of squares. This suggests that when the noise level is high, partial sketching may become preferable to complete sketching (Dhillon et al., 2013; Becker et al., 2015).

The hierarchical model for complete sketching provides an intuitive statistical perspective on the mechanics of the algorithm. Partial sketching seems to lack a similar conceptual device. The least squares coefficients can be represented as the solution to the linear system of equations X^T X b = X^T y. Partial sketching simply returns the solution, b, to the approximate linear system X̃^T X̃ b = X^T y. Lacking a convenient representation for the estimator, we must proceed in a more pedestrian manner. The mean squared error of the estimator β_P can be determined using only mean and variance information, and this will be the goal for now. The key observation is that (X̃^T X̃)^{-1} | y, X ~ IW{k, k(X^T X)^{-1}}. Conditional on y and X, the estimator β_P = (X̃^T X̃)^{-1} X^T y is a linear combination of the elements of an inverse Wishart random matrix. However, this is a nonstandard distribution, and it is difficult to express the distribution function of β_P directly. Despite this obstacle, it is straightforward to determine the mean and variance of β_P. From properties of the inverse Wishart distribution, it can be seen that the partial sketching estimator is biased, with mean

E_S(\beta_P \mid y, X) = \frac{k}{k - p - 1} \beta_F,

where it is assumed that k > p + 3. This motivates an alternative unbiased estimator,

\beta_P^* = \frac{k - p - 1}{k} (\tilde{X}^T \tilde{X})^{-1} X^T y = \frac{k - p - 1}{k} \beta_P.

Determining the variance of β_P and of the unbiased β_P^* is a more lengthy computation, which is given in the Supplementary Material. The variance of the unbiased estimator β_P^* is

\mathrm{var}(\beta_P^* \mid y, X) = \frac{k - p - 1}{(k - p)(k - p - 3)} \left\{ \mathrm{MSS}_F (X^T X)^{-1} + \frac{k - p + 1}{k - p - 1} \beta_F \beta_F^T \right\}.
(4)

By making a connection with method-of-moments estimation, it is possible to establish asymptotic normality of both β_P and β_P^* as k tends to infinity. This motivates the construction of approximate confidence intervals. As the exact variance is unknown, we propose the following estimator of var(β_P^* | y, X) using the sketched model sum of squares MSS_S:

\frac{k - p - 1}{(k - p)(k - p - 3)} \left\{ \frac{k - p - 1}{k} \mathrm{MSS}_S (\tilde{X}^T \tilde{X})^{-1} + \beta_P^* \beta_P^{*T} \right\}.
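A direct implementation of the bias correction and of the variance estimator above might look as follows; the function names are ours, and the computation of MSS_S, the sketched model sum of squares referred to in the text, is deliberately left to the caller.

```python
import numpy as np

def partial_sketch_unbiased(X_sk, X, y):
    """Bias-corrected partial sketching estimator beta_P^* = ((k-p-1)/k) beta_P."""
    k, p = X_sk.shape
    G_inv = np.linalg.inv(X_sk.T @ X_sk)
    return (k - p - 1) / k * (G_inv @ (X.T @ y))

def partial_sketch_var_estimate(X_sk, beta_p_star, mss_s):
    """Plug-in estimate of var(beta_P^* | y, X) from the display above.

    mss_s is the sketched model sum of squares MSS_S; how it is obtained is
    left to the caller.
    """
    k, p = X_sk.shape
    G_inv = np.linalg.inv(X_sk.T @ X_sk)
    c = (k - p - 1) / ((k - p) * (k - p - 3))
    return c * ((k - p - 1) / k * mss_s * G_inv
                + np.outer(beta_p_star, beta_p_star))
```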

3.3. Relative efficiency

The relative efficiencies of complete and partial sketching are also of interest. As the plug-in estimator β_P has a greater mean squared error than β_P^*, it will not be considered in this subsection. The performance of the complete sketching estimator β_S and the unbiased partial sketching estimator β_P^* will be compared in terms of mean squared error. As both β_S and β_P^* are unbiased, the mean squared errors can be computed from var(β_S | y, X) and var(β_P^* | y, X). Comparing (3) and (4), it can be seen that the variance of β_P^* depends on MSS_F whereas the variance of β_S depends on RSS_F. This suggests that the signal-to-noise ratio in the source dataset will be an influential factor in determining which estimator is more efficient. In the Supplementary Material it is shown that for k > p + 3 the relative efficiency can be bounded in terms of the signal-to-noise ratio:

\frac{R_F^2}{1 - R_F^2} \;\leq\; \frac{E_S(\| \beta_P^* - \beta_F \|_2^2 \mid y, X)}{E_S(\| \beta_S - \beta_F \|_2^2 \mid y, X)} \;\leq\; \frac{2(k - p - 1)}{k - p - 3} \, \frac{R_F^2}{1 - R_F^2}.

When R_F^2 is close to 1, complete sketching can be orders of magnitude more efficient than partial sketching; and when R_F^2 is close to 0, partial sketching can be orders of magnitude more efficient than complete sketching.

3.4. Combined estimator

So far we have assumed that an analyst must choose between the two methods; but obtaining both β_P^* and β_S from a single sketch is computationally cheap and may be an attractive strategy. The most demanding operation with the sketched data is calculating (X̃^T X̃)^{-1}. Given this quantity, it is economical to compute both β_S and β_P^*. Becker et al. (2015) mentioned that they were investigating such a strategy, but did not give any details. Our development of a combined estimator is motivated by the fact that, even when using a single sketch (ỹ, X̃), the two estimators are uncorrelated, that is, cov(β_P^*, β_S | y, X) = 0. This is established in the Supplementary Material by taking iterated expectations and using the hierarchical model from § 3.1. A simple strategy is then to take a weighted combination of β_S and β_P^*. A combined estimator β_C can be defined as

\beta_C = \phi \beta_S + (1 - \phi) \beta_P^*

for some 0 ⩽ φ ⩽ 1. The value of φ that minimizes the mean squared error is φ_opt = tr{var(β_P^* | y, X)} / [tr{var(β_P^* | y, X)} + tr{var(β_S | y, X)}]. Use of the weighted estimator is expected to be most beneficial when the signal-to-noise ratio is moderate, that is, R_F^2 ≈ 0.5. When the signal-to-noise ratio is either very high or very low, there is little advantage in using the weighted estimator, as either the complete or the partial estimator will dominate.
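In code, the combined estimator is a one-liner once trace estimates of the two variances are available; the following hedged sketch (names ours) leaves the variance traces to be supplied by the caller, for example from the plug-in formulas of §§ 3.1–3.2.

```python
import numpy as np

def combined_estimator(beta_s, beta_p_star, var_s_trace, var_p_trace):
    """Weighted combination beta_C = phi beta_S + (1 - phi) beta_P^*.

    var_s_trace and var_p_trace are estimates of tr var(beta_S | y, X) and
    tr var(beta_P^* | y, X); the MSE-optimal weight follows because the two
    estimators are unbiased and uncorrelated.
    """
    phi = var_p_trace / (var_p_trace + var_s_trace)   # phi_opt
    return phi * beta_s + (1.0 - phi) * beta_p_star, phi
```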

3.5. One-step correction

As noted by a referee, the combined estimator is related to another strategy in the sketching literature for improving β_S. Dhillon et al. (2013) and Pilanci & Wainwright (2016) proposed a refinement procedure that uses gradient information from the source dataset. The one-step corrected estimator is defined as

\beta_H = \beta_S + (\tilde{X}^T \tilde{X})^{-1} X^T (y - X \beta_S) = \{ I - (\tilde{X}^T \tilde{X})^{-1} X^T X \} \beta_S + (\tilde{X}^T \tilde{X})^{-1} X^T y.
(5)

Now the least squares solution β_F satisfies X^T (y − X β_F) = 0, so

\beta_F = \beta_F + (\tilde{X}^T \tilde{X})^{-1} X^T (y - X \beta_F) = \{ I - (\tilde{X}^T \tilde{X})^{-1} X^T X \} \beta_F + (\tilde{X}^T \tilde{X})^{-1} X^T y.
(6)

Subtracting (6) from (5) gives the following expression for the error:

\beta_H - \beta_F = \{ I - (\tilde{X}^T \tilde{X})^{-1} X^T X \} (\beta_S - \beta_F).
(7)

The one-step estimator can be interpreted as a single step of the iterative Hessian sketch proposed by Pilanci & Wainwright (2016), initialized at β_S. Setting H̃ = (X̃^T X̃)^{-1} X^T X, it follows from (7) and Theorem 1(i) that

E_S(\| \beta_H - \beta_F \|_2^2 \mid y, X) = E_{\tilde{X}} \left[ \mathrm{tr}\left\{ k^{-1} \mathrm{RSS}_F (\tilde{X}^T \tilde{X})^{-1} (I - \tilde{H})^T (I - \tilde{H}) \right\} \right].
(8)

The key terms in (8) are the random matrices (X̃^T X̃)^{-1} and H̃ = (X̃^T X̃)^{-1} X^T X. As (X̃^T X̃)^{-1} | y, X ~ IW{k, k(X^T X)^{-1}}, it is possible to evaluate the expectation in (8) using the first, second and third moments of the inverse Wishart distribution. The exact expression for (8) is lengthy and is given in the Supplementary Material. The main conclusions are that the one-step estimator β_H can have a larger mean squared error than β_S when the ratio k/p of sketch size to number of variables is close to 1. As k/p increases, the one-step estimator becomes more efficient than both β_S and β_C with the optimal weight φ_opt. The relative efficiency of β_C to β_S is at most 2. The relative efficiency of β_H to β_S can be much higher, provided that k/p is sufficiently large.
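For completeness, the one-step correction in (5) can be computed from a single sketch plus one extra pass over the source data; a minimal sketch (function name ours):

```python
import numpy as np

def one_step_correction(beta_s, X_sk, X, y):
    """One-step corrected estimator beta_H of equation (5).

    Combines the exact full-data gradient X^T (y - X beta_S) with the sketched
    Gram matrix acting as an approximate Hessian; iterating this map gives the
    iterative Hessian sketch initialized at beta_S.
    """
    grad = X.T @ (y - X @ beta_s)                        # one extra pass over the data
    return beta_s + np.linalg.solve(X_sk.T @ X_sk, grad)
```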

4. Asymptotics

4.1. Preliminaries

Finite-sample distributions of random projection estimators can be mathematically intractable, and thus asymptotic analysis can be a powerful tool (Diaconis & Freedman, 1984; Li et al., 2006). It is very difficult to establish meaningful finite-sample results for the Hadamard and Clarkson–Woodruff sketches, as they are discrete distributions over an enormous combinatorial space. Instead, it is useful to study the large-n distribution of the estimators β_S and β_P to obtain an interpretable expression.

As β_F is the estimand in sketching algorithms, conditioning on the source data is required in the asymptotic analysis. To elaborate, let A_(n) = (y_(n), X_(n)) represent the n × d source data matrix, assumed to be of full column rank. Any source data matrix A_(n) has a set of associated least squares coefficients, which will be denoted by β_F(n) here. The overall goal is to determine the asymptotic form of the distributions p(β_S | A_(n)) and p(β_P | A_(n)) for an arbitrary large dataset A_(n). To take limits, we employ a fixed sequence of n × d datasets, all of rank d.

Some related work has been done by Ma et al. (2015), who developed Taylor series approximations for the bias and variance of data-aware sketched regression estimators, where the asymptotic expansion is taken in the sketch size k. In independent work, Dobriban & Liu (2019) examined the behaviour of data-oblivious sketching algorithms in the asymptotic regime where k, d →∞, using elements of random matrix theory. Our work is novel, as we study data-oblivious random projections in the regime where k and d are fixed, while taking limits in the number n of source observations.

4.2. Sketching central limit theorem

A central limit theorem for sparse sketching matrices with independent entries is given in Li et al. (2006). The Clarkson–Woodruff sketch and the Hadamard sketch have dependent entries, so we use a different method of proof. Under some regularity conditions, the Hadamard and Clarkson–Woodruff sketches produce sketched data that asymptotically have the same matrix normal distribution as under the Gaussian sketch.

The k × d random matrix à is the output of a stochastic process governed by the fixed n × d source dataset A (n) and the distribution of the random k × n sketching matrix S. Eachcolumnof the sketched dataset is a linear combination of random vectors, the number of which increases with n. Under an assumption on the limiting leverage scores of the source data matrix, we can establish a central limit theorem for the sketched dataset. The leverage scores of the observations in the source data matrix have been identified as an important structural property of sketching algorithms (Mahoney & Drineas, 2016). Assumption 1 highlights their role in establishing asymptotic normality of the sketched data matrix.

Assumption 1

Let A_(n) = U_(n) D_(n) V_(n)^T be the singular value decomposition of the n × d source dataset, and let u_(n)i^T be the ith row of U_(n). The maximum leverage score tends to zero, that is,

\lim_{n \to \infty} \max_{i = 1, \ldots, n} \| u_{(n)i} \|_2^2 = 0.
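Assumption 1 can be checked numerically for a given design; the following sketch computes the maximum leverage score via a thin QR factorization (an implementation choice of ours, equivalent to using the left singular vectors).

```python
import numpy as np

def max_leverage_score(A):
    """Largest leverage score max_i ||u_(n)i||_2^2 of the rows of A.

    A thin QR factorization gives an orthonormal basis for the column space
    of A, so the squared row norms of Q equal the leverage scores.
    """
    Q, _ = np.linalg.qr(A, mode='reduced')
    return float(np.max(np.sum(Q ** 2, axis=1)))
```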

Theorem 2 is the sketching central limit theorem. Its proof is given in the Supplementary Material.

Theorem 2

Consider a sequence of arbitrary n × d data matrices A_(n), where d is fixed. Let A_(n) = U_(n) D_(n) V_(n)^T be the singular value decomposition of A_(n), and let S be a k × n Hadamard or Clarkson–Woodruff sketching matrix, where k is also fixed. Suppose that Assumption 1 is satisfied. Then, as n tends to infinity, the following convergence in distribution holds:

\{ \tilde{A} V_{(n)} D_{(n)}^{-1} \mid A_{(n)} \} \to \mathrm{MN}(0, \; I_k, \; I_d / k),

where MN denotes the matrix normal distribution.

4.3. Sketching estimators

The central limit theorem for the sketched data suggests that the results on β_S and β_P for the Gaussian sketch will also hold approximately for the Hadamard and Clarkson–Woodruff sketches for large n. To establish convergence of the estimators, it helps to make an extra assumption on the sequence of source datasets.

Assumption 2

We have that

\lim_{n \to \infty} n^{-1} \begin{pmatrix} y_{(n)}^T y_{(n)} & y_{(n)}^T X_{(n)} \\ X_{(n)}^T y_{(n)} & X_{(n)}^T X_{(n)} \end{pmatrix} = Q

for some positive-definite matrix Q.

The limiting matrix Q allows one to avoid specifying a probability model for the source dataset, without overcomplicating the mathematical analysis. Under Assumptions 1 and 2, it is possible to establish an asymptotic result for β_S and β_P.

Theorem 3

Suppose that Assumptions 1 and 2 hold, k ⩾ p, and β_S is computed using a Hadamard or Clarkson–Woodruff sketch. Let (X̃^T X̃)^+ denote the Moore–Penrose pseudo-inverse of X̃^T X̃. Let

\tilde{C}_{(n)} = \frac{\mathrm{RSS}_{F(n)}}{k} (\tilde{X}^T \tilde{X})^{+}, \qquad C_{(n)} = \frac{\mathrm{RSS}_{F(n)}}{k - p + 1} (X_{(n)}^T X_{(n)})^{-1}.

Then, as n → ∞, the following convergence results hold in distribution:

  • (i)
    \{ C_{(n)}^{-1/2} (\beta_S - \beta_{F(n)}) \mid A_{(n)} \} \to \mathrm{Student}(0, \; I_p, \; k - p + 1);
  • (ii)
    \{ \tilde{C}_{(n)}^{-1/2} (\beta_S - \beta_{F(n)}) \mid A_{(n)} \} \to N(0, \; I_p).

The proof is given in the Supplementary Material. For large n we expect β_S to be approximately distributed as per Theorem 1 for both the Hadamard and the Clarkson–Woodruff sketches.
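One practical use of Theorem 3 is to check, by simulation, the coverage of the resulting normal intervals on a given dataset, in the spirit of the experiments of § 6; the following sketch does this for the Clarkson–Woodruff projection (all function names and defaults are ours).

```python
import numpy as np
from scipy import stats
from scipy.sparse import csr_matrix

def cw_interval_coverage(X, y, k, n_rep=200, level=0.95, seed=0):
    """Monte Carlo check of the normal intervals implied by Theorem 3(ii).

    Repeatedly applies a Clarkson-Woodruff sketch to fixed (y, X) and records
    how often the interval for each coordinate covers the full-data value.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta_f = np.linalg.solve(X.T @ X, X.T @ y)
    rss_f = float(np.sum((y - X @ beta_f) ** 2))
    z = stats.norm.ppf(0.5 + level / 2)
    hits = 0
    for _ in range(n_rep):
        rows = rng.integers(0, k, size=n)
        signs = rng.choice([-1.0, 1.0], size=n)
        S = csr_matrix((signs, (rows, np.arange(n))), shape=(k, n))
        X_sk, y_sk = S @ X, S @ y
        G_plus = np.linalg.pinv(X_sk.T @ X_sk)          # (X~^T X~)^+
        beta_s = G_plus @ (X_sk.T @ y_sk)
        se = np.sqrt(rss_f / k * np.diag(G_plus))       # sqrt of diag of C~_(n)
        hits += int(np.sum(np.abs(beta_s - beta_f) <= z * se))
    return hits / (n_rep * p)
```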

It is harder to establish a comparable limit theorem for β_P^*, because of the nonstandard distribution of β_P^* when a Gaussian sketch is used. Instead, we wish to show that the partial sketching estimators under the Hadamard and Clarkson–Woodruff sketches have mean and variance properties similar to those of the Gaussian partial sketching estimator. Convergence in moments can be established given a stability condition on the singular values of the sketched data matrix.

Assumption 3

The sequence of source datasets is such that E_S{ σ_min^{-4}(n^{-1} X̃^T X̃) | y, X } is finite for large enough n, where σ_min(·) denotes the minimum singular value of a matrix.

This additional regularity condition enables a formal limit theorem for the moments of β_P^* to be established.

Theorem 4

Suppose that Assumptions 1–3 hold, k > p + 3, and β_P^* is computed using a Hadamard or Clarkson–Woodruff sketch. Let

C_{(n)} = \frac{k - p - 1}{(k - p)(k - p - 3)} \left\{ \mathrm{MSS}_{F(n)} (X_{(n)}^T X_{(n)})^{-1} + \frac{k - p + 1}{k - p - 1} \beta_{F(n)} \beta_{F(n)}^T \right\}.

Then, as n → ∞:

  • (i)
    E_S\{ \beta_P^* - \beta_{F(n)} \mid A_{(n)} \} \to 0;
  • (ii)
    \mathrm{var}_S\{ C_{(n)}^{-1/2} (\beta_P^* - \beta_{F(n)}) \mid A_{(n)} \} \to I_p.

The proof is given in the Supplementary Material. This theorem suggests that the conditional bias and variance of β_P^* under the Clarkson–Woodruff and Hadamard sketches should be approximately equal to those under the Gaussian sketch. The results here are meant to provide useful heuristics for assessing the uncertainty associated with the output of the randomized approximation algorithm. There is a need to quantify the approximation error of sketching algorithms and to communicate it to end users (Lopes et al., 2018), and the asymptotic results developed in this section may be helpful for this purpose.

5. Unconditional results

The previous analysis treated the source dataset as fixed in order to isolate the approximation error introduced by the random projection. When sketching is used for statistical inference, the hierarchical model of § 3.1 can be extended to include a source of variation at the population level. We take the design matrix X to be fixed and treat the response y as random. The assumed data-generating process is y = X β_0 + ε, where ε is a vector of n independent and identically distributed random variables with mean zero and variance σ^2. Let γ^2 represent the average mean function sum of squares, so γ^2 = ‖X β_0‖_2^2 / n. As shown in Searle (1997), at the population level the ordinary least squares estimator satisfies E_y(β_F | X) = β_0, var_y(β_F | X) = σ^2 (X^T X)^{-1}, E_y(RSS_F | X) = (n − p) σ^2 and E_y(MSS_F | X) = p σ^2 + n γ^2. Taking iterated expectations, it can be seen that the Gaussian sketch gives an unbiased estimator of the population parameter β_0: E_y(β_S | X) = E_y{E_S(β_S | y, X)} = E_y(β_F | X) = β_0. The same argument shows that E_y(β_P^* | X) = β_0. In the Supplementary Material, we use the law of total variance to determine the unconditional variances

\mathrm{var}_y(\beta_S \mid X) = \sigma^2 (X^T X)^{-1} + \frac{(n - p)\, \sigma^2}{k - p - 1} (X^T X)^{-1},
\mathrm{var}_y(\beta_P^* \mid X) = \sigma^2 (X^T X)^{-1} + \frac{k - p - 1}{(k - p)(k - p - 3)} \left[ (p \sigma^2 + n \gamma^2)(X^T X)^{-1} + \frac{k - p + 1}{k - p - 1} \left\{ \sigma^2 (X^T X)^{-1} + \beta_0 \beta_0^T \right\} \right].

For large n, the dominant term in the unconditional variance of β_S is (n − p) σ^2 (X^T X)^{-1} / (k − p − 1), whereas the dominant term in the unconditional variance of β_P^* is proportional to n γ^2 (X^T X)^{-1}, a function of the average mean function sum of squares γ^2. We reach conclusions similar to those of the conditional analysis in § 3.3, in that β_S is expected to be more efficient when the signal-to-noise ratio is high, while β_P^* is expected to be more efficient when the signal-to-noise ratio is low. Under Assumptions 1–3, the variance expressions give asymptotic approximations for the Hadamard and Clarkson–Woodruff projections. These results can be extended to accommodate more complicated error models for ε, provided it is still possible to determine E_y(β_F | X), var_y(β_F | X), E_y(RSS_F | X) and E_y(MSS_F | X). Raskutti & Mahoney (2016) provide further results on the performance of sketching estimators from an inferential perspective.
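The two unconditional variance formulas are straightforward to evaluate for given σ^2, γ^2, n and k, which makes it easy to explore the efficiency trade-off numerically; a small sketch (function name ours):

```python
import numpy as np

def unconditional_variances(XtX_inv, beta0, sigma2, gamma2, n, k):
    """Evaluate the unconditional variance formulas displayed above.

    Returns var_y(beta_S | X) and var_y(beta_P^* | X) for a given noise
    variance sigma2 and average mean-function sum of squares gamma2.
    """
    p = XtX_inv.shape[0]
    var_s = sigma2 * XtX_inv + (n - p) * sigma2 / (k - p - 1) * XtX_inv
    c = (k - p - 1) / ((k - p) * (k - p - 3))
    var_p = sigma2 * XtX_inv + c * (
        (p * sigma2 + n * gamma2) * XtX_inv
        + (k - p + 1) / (k - p - 1) * (sigma2 * XtX_inv + np.outer(beta0, beta0))
    )
    return var_s, var_p
```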

6. Data application

6.1. Human leukocyte antigen locus dataset

We compare the performance of the sketching estimators on a genetic dataset from the UK Biobank database. We use a small extract of the data in Astle et al. (2016). The selected response variable is mean red cell volume, taken from the full blood count assay and with adjustments for various technical and environmental covariates. Genome-wide imputed genotype data in expected allele dose format were available on n = 132 353 study subjects (Howie et al., 2009; Bycroft et al., 2018). We consider 1000 genetic variants in the human leukocyte antigen, HLA, region of chromosome 6, selected so that no pair of variants had a squared Pearson correlation of posterior expected allele doses greater than 0.8. We chose to focus on this region because many associations have been discovered in a genome-wide scan using univariable models; these associations were with variants having different allele frequencies, which suggests multiple distinct causal variants in the region. The aim is to perform a multivariable regression analysis to obtain variant effect size estimates that are conditional on the other variants in the region.

An early theoretical finding was that the partial sketching estimator β_P is biased. One thousand sketches were taken to estimate the bias E_S(β_P − β_F | y, X) with k = 1500. We also computed the bias-corrected estimator β_P^* in each replication. Figure 1 plots the average value of the estimators against the true value of the least squares coefficient computed from the full dataset. The top row shows results for β_P, and the bottom row shows results for β_P^*. The left, middle and right columns display results for the Gaussian, Hadamard and Clarkson–Woodruff sketches, respectively. The solid line in each panel is the identity line. The dashed line in the top row represents the theoretical bias, with slope k/(k − p − 1).

Fig. 1. Bias of partial sketching estimators on the HLA dataset: panels (a)–(c) show results for β_P and panels (d)–(f) results for the bias-corrected estimator β_P^*; mean estimates are plotted against the true values. In this scenario n = 132 353, p = 1000 and k = 1500. The solid line in each panel is the identity line, and the dashed line in panels (a)–(c) represents the theoretical bias factor.

The results in panels (a)–(c) show that β_P is biased for each of the random projections, and the bias closely matches the theoretical factor. Panels (d)–(f) show that the adjusted estimator β_P^* appears to be unbiased, with the mean values falling close to the identity line.

We also compared the complete and partial sketching estimators in terms of mean squared error and coverage of confidence intervals at k = 1500 and k = 10 000. Moreover, we compared the data-oblivious sketches with simple uniform subsampling with replacement. Table 1 reports the mean squared error for each of the estimators. The signal-to-noise ratio is quite low for this dataset, with R_F^2 = 0.02. Given the low signal-to-noise ratio, we expect partial sketching to be much more efficient than complete sketching on this dataset. The simulation results support this prediction, with β_P^* having a mean squared error roughly 60 times smaller than that of β_S at both values of k. The results are very similar for each of the random projections, suggesting that the asymptotic approximations are reasonable for this dataset. For k = 1500, the mean squared error of β_P is approximately 10 times that of β_P^*. For k = 10 000 there is less of a difference, as the ratio k/(k − p − 1) is closer to 1.

Table 1

Mean squared errors of sketching estimators on the HLA dataset
                           k = 1500                              k = 10 000
                     β_S        β_P        β_P^*           β_S           β_P            β_P^*
Gaussian            238 (3)    39 (0.7)   3.8 (0.08)      13.3 (0.17)   0.28 (0.004)   0.21 (0.002)
Hadamard            238 (4)    39 (0.7)   3.8 (0.07)      12.5 (0.16)   0.26 (0.003)   0.20 (0.002)
Clarkson–Woodruff   241 (3)    38 (0.8)   4.0 (0.05)      13.2 (0.16)   0.28 (0.004)   0.21 (0.002)
Uniform             375 (15)   105 (7.6)  10.7 (0.55)     13.8 (0.20)   0.38 (0.007)   0.29 (0.005)

Table 2 summarizes the coverage of 95% confidence intervals for the sketching estimators. We report the overall proportion of intervals containing the true value of the least squares estimate β_F over the 250 sketches and p = 1000 coefficients. The observed coverage is close to the nominal level of 0.95 at both values of k. The different random projections give very similar results, suggesting that the use of asymptotic approximations is again reasonable for this dataset. The intervals for the Hadamard sketch appear to be slightly conservative at k = 10 000.

Table 2

Coverage of confidence intervals; the largest standard error is 0.004
                     HLA (k = 1500)       HLA (k = 10 000)     Flights (k = 1500)
                     β_S      β_P^*       β_S      β_P^*       β_S      β_P^*
Gaussian            0.950    0.953       0.950    0.951       0.948    0.951
Hadamard            0.949    0.949       0.954    0.954       0.950    0.948
Clarkson–Woodruff   0.947    0.952       0.951    0.950       0.948    0.947

Table 3 reports the average sketching times for the data-oblivious sketches. We computed 10 sketches using each projection. The Gaussian sketch is an order of magnitude slower than the Hadamard projection and two orders of magnitude slower than the Clarkson–Woodruff sketch.

Table 3

Timings for sketching: average times to compute the sketched dataset à = SA, in seconds
                     HLA (k = 1500)    HLA (k = 10 000)    Flights (k = 5000)
Gaussian                 522               3479                 404
Hadamard                  57                 65                   5.8
Clarkson–Woodruff          5.3                5.4                 0.2

6.2. New York flights dataset

We also evaluated the sketching algorithms on the New York flights dataset available in the R (R Development Core Team, 2021) package nycflights13 (Wickham, 2014). Arrival delay was taken as the response, and departure delay, distance, departure time, origin, month and day were chosen to be the covariates. Rows of the dataset with missing data were omitted, so that we were left with n = 327 346 and d = 47. The goal is to compare the accuracy of the various sketches on real data rather than to build a statistical model for the flights dataset. We compare the mean squared error of the estimators and the coverage of confidence intervals for k = 5000. In contrast to the HLA dataset, the flights dataset has a very high R_F^2 value of 0.99. We took 500 sketches to compare complete and partial sketching; see Table 4 for details.

Table 4

Mean squared errors of sketching estimators (with standard errors in parentheses) on the flights dataset with k = 5000
                     β_S       β_P             β_P^*
Gaussian            60 (2)    14900 (400)     14900 (400)
Hadamard            63 (2)    14800 (500)     13900 (400)
Clarkson–Woodruff   66 (2)    15000 (500)     13800 (400)
Uniform             64 (2)    14600 (500)     14600 (400)

7. Discussion

In recent years work has been done to adapt sketching methods for statistical inference in large datasets, building upon the worst-case bounds developed in the computer science literature. Geppert et al. (2017) and Bardenet & Maillard (2015) investigated sketching algorithms for Bayesian regression, and derived bounds on the difference between the sketched posterior distribution and the full-data posterior distribution. Only complete sketching was considered in those works. The results in this paper on the advantages of partial sketching could motivate adaptations that make use of the exact marginal associations X^T y. Sketching ideas have been used to develop methods for approximate nonlinear regression (Banerjee et al., 2013; Avron et al., 2014). The goodness of fit of the model may also influence the relative efficiency of different sketching algorithms in more complex regression tasks. A related branch of work uses random projections to reduce the number of predictors in regression and classification problems (Shah & Meinshausen, 2018; Guhaniyogi & Dunson, 2015; Cannings & Samworth, 2017).

Supplementary Material

Supplementary material available at Biometrika online includes proofs of all the theorems.


Acknowledgements

This research was conducted using the UK Biobank resource. Richardson was supported by the UKRI Medical Research Council and the Alan Turing Institute. Astle was supported by NHS Blood and Transplant and the National Institute for Health Research Blood and Transplant Research Unit. Many thanks to Rajen Shah for helpful discussions, and the reviewers and associate editor for insightful comments that have improved the quality of the manuscript.

References

  • Ailon N, Chazelle B. The fast Johnson–Lindenstrauss transform and approximate nearest neighbors. SIAM J Comp. 2009;39:302–22.
  • Astle WJ, Elding H, Jiang T, Allen D, Ruklisa D, Mann AL, Mead D, Bouman H, Riveros-Mckay F, Kostadima MA, et al. The allelic landscape of human blood cell trait variation and links to common complex disease. Cell. 2016;167:1415–29.
  • Avron H, Nguyen H, Woodruff D. Subspace embeddings for the polynomial kernel. Proc 27th Int Conf Neural Information Processing Systems (NIPS’14). Cambridge, Massachusetts: MIT Press; 2014. pp. 2258–66.
  • Banerjee A, Dunson DB, Tokdar ST. Efficient Gaussian process regression for large datasets. Biometrika. 2013;100:75–89.
  • Bardenet R, Maillard O-A. A note on replacing uniform subsampling by random projections in MCMC for linear regression of tall datasets. HAL preprint hal-01248841; 2015. https://hal.archives-ouvertes.fr/hal-01248841
  • Becker S, Kawas B, Petrik M, Ramamurthy K. Robust partially-compressed least-squares. arXiv:1510.04905v1; 2015.
  • Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, Motyer A, Vukcevic D, Delaneau O, O’Connell J, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–9.
  • Cannings TI, Samworth RJ. Random-projection ensemble classification. J R Statist Soc B. 2017;79:959–1035.
  • Clarkson KL, Woodruff DP. Low rank approximation and regression in input sparsity time. Proc 45th Annual ACM Sympos Theory of Computing (STOC’13). New York; 2013. pp. 81–90.
  • Cormode G. Sketch techniques for approximate query processing. Foundations and Trends in Databases. Hanover, Massachusetts: NOW Publishers; 2011.
  • Dhillon P, Lu Y, Foster DP, Ungar L. New subsampling algorithms for fast least squares regression. Proc 26th Int Conf Neural Information Processing Systems (NIPS’13). Red Hook, New York; 2013. pp. 360–8.
  • Diaconis P, Freedman D. Asymptotics of graphical projection pursuit. Ann Statist. 1984;12:793–815.
  • Dobriban E, Liu S. Asymptotics for sketching in least squares regression. Advances in Neural Information Processing Systems 32 (Proc NeurIPS 2019). La Jolla, California; 2019. pp. 3675–85.
  • Drineas P, Mahoney MW, Muthukrishnan S. Sampling algorithms for ℓ2 regression and applications. Proc 17th Annual ACM-SIAM Sympos Discrete Algorithms (SODA ’06). Philadelphia; 2006. pp. 1127–36.
  • Eaton ML. Multivariate Statistics: A Vector Space Approach. Beachwood, Ohio: Institute of Mathematical Statistics; 2007.
  • Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB. Bayesian Data Analysis. 3rd ed. Boca Raton, Florida: Chapman & Hall; 2014.
  • Geppert LN, Ickstadt K, Munteanu A, Quedenfeld J, Sohler C. Random projections for Bayesian regression. Statist Comp. 2017;27:79–101.
  • Guhaniyogi R, Dunson DB. Bayesian compressed regression. J Am Statist Assoc. 2015;110:1500–14.
  • Halko N, Martinsson PG, Tropp JA. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev. 2011;53:217–88.
  • Howie BN, Donnelly P, Marchini J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 2009;5:e1000529.
  • Li P, Hastie TJ, Church KW. Very sparse random projections. Proc 12th ACM-SIGKDD Int Conf Knowledge Discovery and Data Mining. New York; 2006. pp. 287–96.
  • Lopes M, Wang S, Mahoney M. Error estimation for randomized least-squares algorithms via the bootstrap. In: Dy J, Krause A, editors. Proc 35th Int Conf Machine Learning; 2018. pp. 3217–26.
  • Ma P, Mahoney MW, Yu B. A statistical perspective on algorithmic leveraging. J Mach Learn Res. 2015;16:861–911.
  • Ma P, Sun X. Leveraging for big data regression. WIREs Comp Statist. 2015;7:70–6.
  • Mahoney M. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning. Vol. 3. Hanover, Massachusetts: NOW Publishers; 2011. pp. 123–224.
  • Mahoney M, Drineas P. Structural properties underlying high-quality randomized numerical linear algebra algorithms. In: Buhlmann P, Drineas P, Kane M, van de Laan M, editors. Handbook of Big Data. Boca Raton, Florida: Chapman & Hall; 2016. pp. 137–54.
  • Pilanci M, Wainwright MJ. Iterative Hessian sketch: Fast and accurate solution approximation for constrained least-squares. J Mach Learn Res. 2016;17:1842–79.
  • R Development Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2021. http://www.R-project.org
  • Raskutti G, Mahoney MW. A statistical perspective on randomized sketching for ordinary least-squares. J Mach Learn Res. 2016;17:7508–38.
  • Sarlos T. Improved approximation algorithms for large matrices via random projections. 47th Annual IEEE Sympos Foundations of Computer Science (FOCS’06). New York; 2006. pp. 143–52.
  • Searle SR. Linear Models. New York: Wiley; 1997.
  • Shah RD, Meinshausen N. Min-wise hashing for large-scale regression and classification with sparse data. arXiv:1308.1269v4; 2018.
  • Wickham H. nycflights13: Flights that Departed NYC in 2013. R package version 0.1; 2014. https://cran.r-project.org/web/packages/nycflights13/
  • Woodruff DP. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science. Vol. 10. Hanover, Massachusetts: NOW Publishers; 2014. pp. 1–157.