Quantum-Inspired Stochastic Regression
Reference: https://arxiv.org/abs/1811.04909
Definitions and Assumptions
Let $b \in \CC^m$ and $A \in \CC^{m \times n}$ such that $\Vert A \Vert \leq 1$, where $\Vert \cdot \Vert$ denotes the operator norm (or spectral norm). Furthermore, require that $\rank(A) = k$ and $\Vert A^+ \Vert \leq \kappa$ where $A^+$ is the pseudoinverse of $A$. Observe that $\Vert A \Vert \leq 1$ is equivalent to $A$ having maximum singular value at most $1$ (to see this, simply apply the Spectral Theorem to the Hermitian matrix $A^\dagger A$). Similarly, $A^+$ has the singular values of $A$ inverted, so $\Vert A^+ \Vert$ equals the reciprocal of the minimum nonzero singular value of $A$. Therefore, the condition number of $A$ satisfies $\Vert A \Vert \Vert A^+ \Vert \leq \kappa$.
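As a quick numerical sanity check of these norm facts (with an arbitrary random matrix rescaled so that $\Vert A \Vert = 1$; all values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
A /= np.linalg.norm(A, 2)          # rescale so the operator norm is exactly 1

sigma = np.linalg.svd(A, compute_uv=False)
A_pinv = np.linalg.pinv(A)

# operator norm = largest singular value
assert np.isclose(np.linalg.norm(A, 2), sigma.max())
# norm of the pseudoinverse = 1 / (smallest nonzero singular value)
assert np.isclose(np.linalg.norm(A_pinv, 2), 1 / sigma[sigma > 1e-12].min())
# condition number = ||A|| * ||A^+||
print(np.linalg.norm(A, 2) * np.linalg.norm(A_pinv, 2))
```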
So, define $x$ to be the least-squares solution to the linear system $Ax = b$ i.e. $x = A^+ b$. Then, in terms of these definitions, we define two primary goals:
- Query a vector $\tilde{x}$ s.t. $\Vert \tilde{x} - x \Vert \leq \epsilon \Vert x \Vert$
- Sample from a distribution that approximates \(\frac{|x_j|^2}{\Vert x \Vert^2}\) within total variation distance $2\epsilon$.
To do this, we simply assume that we have length-squared sampling access to $A$: we can query individual entries of $A$, and we can sample row indices of $A$ from the distribution $\frac{\Vert A_{i, \cdot}\Vert^2}{\Vert A \Vert^2_F}$ (and, within a row, sample column indices from $\frac{|A_{ij}|^2}{\Vert A_{i,\cdot}\Vert^2}$).
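To make this concrete, here is a minimal sketch of what such sampling-and-query access could look like when $A$ is stored as a dense numpy array. A real implementation would maintain a data structure (e.g., sampling trees over the row norms) supporting these operations quickly; the class below is purely illustrative:

```python
import numpy as np

class LengthSquaredAccess:
    """Toy sampling-and-query access to a dense matrix A (illustrative only)."""

    def __init__(self, A, rng=None):
        self.A = A
        self.rng = rng or np.random.default_rng()
        self.row_norms_sq = np.sum(np.abs(A) ** 2, axis=1)   # ||A_{i,.}||^2
        self.frob_sq = self.row_norms_sq.sum()                # ||A||_F^2

    def sample_row(self):
        # i ~ ||A_{i,.}||^2 / ||A||_F^2
        return self.rng.choice(self.A.shape[0], p=self.row_norms_sq / self.frob_sq)

    def sample_entry_in_row(self, i):
        # j ~ |A_{ij}|^2 / ||A_{i,.}||^2
        w = np.abs(self.A[i]) ** 2
        return self.rng.choice(self.A.shape[1], p=w / w.sum())

    def query(self, i, j):
        return self.A[i, j]
```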
Sequence of Approximations
First, we’ll summarize the sequence of approximations that we’ll perform using length-squared sampling techniques. We’ll describe these steps in depth in the following sections.
Of course, we know that the least-squares solution of the linear system can be written via the normal equations as
\[x = A^+ b = (A^\dagger A)^+ A^\dagger b\]So, we first approximate $A^\dagger A$ by $R^\dagger R$, where $R \in \CC^{r \times n}$ with $r \ll m$ is constructed by length-squared sampling $r$ rows of $A$. Now, denote the spectral decomposition
\[A^\dagger A \approx R^\dagger R = \sum_{l=1}^k \sigma_l^2\ket{v^{(l)}}\bra{v^{(l)}}\]where of course $\sigma_l$ and $\ket{v^{(l)}} \in \CC^n$ are the singular values and right singular vectors of $R$, respectively.
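To make the notation concrete, a quick numpy check (with a small, arbitrary real $R$) that $R^\dagger R$ indeed equals $\sum_l \sigma_l^2 \ket{v^{(l)}}\bra{v^{(l)}}$ over the right singular vectors of $R$:

```python
import numpy as np

rng = np.random.default_rng(1)
R = rng.standard_normal((10, 30))            # r x n, with r << n in practice

_, sigma, Vh = np.linalg.svd(R, full_matrices=False)
# Sum of sigma_l^2 |v^(l)><v^(l)| over the right singular vectors v^(l) = Vh[l]
recon = sum(s**2 * np.outer(v, v) for s, v in zip(sigma, Vh))
assert np.allclose(R.T @ R, recon)
```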
We see that computing these right singular vectors of $R$ can still be computationally prohibitive given the dimension $n$. Hence, we use length-squared sampling again, this time on the columns of $R$, to obtain a matrix $C \in \CC^{r \times c}$ with $c \ll n$. Now, the left singular vectors of $C$, which we denote $\ket{w^{(l)}} \in \CC^r$, can be efficiently computed via standard SVD methods. So,
\[RR^\dagger \approx CC^\dagger = \sum_{l=1}^k \tilde{\sigma}_l^2\ket{w^{(l)}}\bra{w^{(l)}}\]We can then show that
\[\ket{\tilde{v}^{(l)}} := R^\dagger \ket{w^{(l)}} / \tilde{\sigma}_l\]provides a good approximation of $\ket{v^{(l)}}$. Note that the $\tilde{\sigma}_l$ are the singular values of $C$, which approximate the singular values of $R$, which in turn approximate the singular values of $A$. This follows from $A^\dagger A \approx R^\dagger R$ and $RR^\dagger \approx CC^\dagger$ by the Hoffman–Wielandt inequality, detailed in Lemma 2.7 of Kannan and Vempala and stated without proof below.
Hoffman–Wielandt inequality
If $P, Q$ are two real, symmetric $n \times n$ matrices and $\lambda_1, \cdots, \lambda_n$ denote their eigenvalues in non-decreasing order, then
\[\sum_{t=1}^n(\lambda_t(P) - \lambda_t(Q))^2 \leq \Vert P - Q \Vert_F^2\]At this point, it seems like we haven’t made much progress, since computing $R^\dagger \ket{w^{(l)}}$ is still expensive. However, it turns out that all we need to enable query access to $\tilde{x}$ is the ability to efficiently estimate the trace inner product $\tr(U^\dagger V)$, where $U$ and $V$ are operators such that $U$ can be length-squared sampled and $V$ can be queried. To see this, we write our solution $\tilde{x}$ in terms of the approximations thus far
\[\tilde{x} \approx A^+ \ket{b} = (A^\dagger A)^+ A^\dagger \ket{b}\\ \approx (R^\dagger R)^+ A^\dagger \ket{b}\\ \approx \sum_{l = 1}^k \frac{1}{\tilde{\sigma}_l^2} \ket{\tilde{v}^{(l)}} \bra{\tilde{v}^{(l)}} A^\dagger \ket{b}\]Hence, for each $l$, define $U := A$ and $V := \ket{b}\bra{\tilde{v}^{(l)}}$, in which case
\[\tr(U^\dagger V) = \tr(A^\dagger \ket{b} \bra{\tilde{v}^{(l)}}) \\ = \tr(\bra{\tilde{v}^{(l)}} A^\dagger \ket{b} )\\ = \bra{\tilde{v}^{(l)}} A^\dagger \ket{b}\]since $\bra{\tilde{v}^{(l)}} A^\dagger \ket{b}$ is a scalar. Therefore, say that
\[\tilde{\lambda}_l \approx \tr(A^\dagger \ket{b} \bra{\tilde{v}^{(l)}})\]and assume that we can compute and memoize these scalars $\tilde{\lambda}_l$ efficiently. In which case,
\[\tilde{x} \approx \sum_{l = 1}^k \frac{1}{\tilde{\sigma}_l^2} \ket{\tilde{v}^{(l)}} \tilde{\lambda}_l\]Recalling the definition of $\ket{\tilde{v}^{(l)}}$,
\[= \sum_{l = 1}^k \frac{1}{\tilde{\sigma}_l^3} R^\dagger \ket{w^{(l)}} \tilde{\lambda}_l\\ = R^\dagger \sum_{l = 1}^k \frac{1}{\tilde{\sigma}_l^3} \ket{w^{(l)}} \tilde{\lambda}_l\]and so defining $z := \sum_{l = 1}^k \frac{1}{\tilde{\sigma}_l^3} \ket{w^{(l)}} \tilde{\lambda}_l$,
\[= R^\dagger z\]We can compute $z$ efficiently (and memoize it for future queries) because it is a linear combination of $k$ left singular vectors in $\CC^r$. So, say that we wish to query an element $\tilde{x}_j$. We simply query the column $R_{\cdot, j} \in \CC^r$ (equivalently, row $j$ of $R^\dagger$) and compute $\tilde{x}_j = R_{\cdot, j}^\dagger z$. Hence, we’ve achieved our first goal.
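As a concrete illustration of the trace-inner-product trick, here is a minimal numpy sketch (real-valued for simplicity): we sample entries of $U$ from the length-squared distribution over its entries and average importance-weighted queries of $V$. The dense arrays, the sample count, and the helper name `estimate_trace_inner_product` are illustrative choices, not part of the reference algorithm's interface. Once the $\tilde{\lambda}_l$ (and hence $z$) are memoized, querying $\tilde{x}_j = R_{\cdot, j}^\dagger z$ is a single $r$-dimensional inner product.

```python
import numpy as np

def estimate_trace_inner_product(U, query_V, num_samples, rng):
    """Estimate tr(U^dagger V) = sum_ij conj(U_ij) V_ij.

    U       : dense array we can length-squared sample from (illustrative stand-in).
    query_V : function (i, j) -> V_ij, i.e. query access to V.
    """
    m, n = U.shape
    p = (np.abs(U) ** 2).ravel()
    frob_sq = p.sum()
    p /= frob_sq
    idx = rng.choice(m * n, size=num_samples, p=p)   # (i, j) ~ |U_ij|^2 / ||U||_F^2
    est = 0.0
    for flat in idx:
        i, j = divmod(flat, n)
        est += frob_sq * query_V(i, j) / U[i, j]     # unbiased: conj(U_ij) V_ij / p_ij
    # A plain mean; in practice one would boost confidence, e.g. by taking a
    # median of several independent estimates.
    return est / num_samples

# Example: estimate lambda_l = <v_tilde| A^T |b> with U = A and V = |b><v_tilde|
rng = np.random.default_rng(2)
A = rng.standard_normal((200, 50)); A /= np.linalg.norm(A, 2)
b = rng.standard_normal(200)
v_tilde = rng.standard_normal(50); v_tilde /= np.linalg.norm(v_tilde)
est = estimate_trace_inner_product(A, lambda i, j: b[i] * v_tilde[j], 20000, rng)
print(est, v_tilde @ A.T @ b)   # noisy estimate vs exact value
```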
In order to achieve our second goal, enabling sample access to a distribution that approximates $\frac{\vert x_j \vert^2}{\Vert x \Vert^2}$, we require one more trick: rejection sampling, which we detail in the QML Notes linked at the end of this document.
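For orientation, here is a hedged numpy sketch of one standard quantum-inspired rejection-sampling scheme for the vector $R^\dagger z$: sample a row index $i$ with probability proportional to $|z_i|^2 \Vert R_{i,\cdot}\Vert^2$, propose $j$ from the length-squared distribution of that row, and accept with a ratio bounded by Cauchy–Schwarz. The dense $R$ and the function name are illustrative only; see the notes for the version with rigorous guarantees.

```python
import numpy as np

def sample_from_Rdagger_z(R, z, rng, max_tries=100000):
    """Rejection-sample j with probability |(R^dagger z)_j|^2 / ||R^dagger z||^2."""
    r, n = R.shape
    row_norms_sq = np.sum(np.abs(R) ** 2, axis=1)
    # Proposal over rows: i ~ |z_i|^2 ||R_{i,.}||^2
    row_weights = np.abs(z) ** 2 * row_norms_sq
    row_p = row_weights / row_weights.sum()
    for _ in range(max_tries):
        i = rng.choice(r, p=row_p)
        # Proposal over columns within row i: j ~ |R_ij|^2 / ||R_{i,.}||^2
        col_p = np.abs(R[i]) ** 2 / row_norms_sq[i]
        j = rng.choice(n, p=col_p)
        # Accept with probability |(R^dagger z)_j|^2 / (r * sum_i |z_i R_ij|^2) <= 1,
        # where the upper bound follows from Cauchy--Schwarz.
        target = np.abs(np.vdot(R[:, j], z)) ** 2
        envelope = r * np.sum(np.abs(z * R[:, j]) ** 2)
        if rng.random() < target / envelope:
            return j
    raise RuntimeError("rejection sampling did not accept within max_tries")
```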
All in all, we’ve performed the chain of approximations,
\[x = A^+ \ket{b} = (A^\dagger A)^+ A^\dagger \ket{b}\\ \approx (R^\dagger R)^+ A^\dagger \ket{b} = \sum_{l = 1}^k \frac{1}{\sigma_l^2} \ket{v^{(l)}} \bra{v^{(l)}} A^\dagger \ket{b}\\ \approx \sum_{l = 1}^k \frac{1}{\tilde{\sigma}_l^2} \ket{\tilde{v}^{(l)}} \bra{\tilde{v}^{(l)}} A^\dagger \ket{b} \qquad \Big(\ket{\tilde{v}^{(l)}} := R^\dagger \ket{w^{(l)}} / \tilde{\sigma}_l\Big)\\ \approx \sum_{l = 1}^k \frac{1}{\tilde{\sigma}_l^2} \ket{\tilde{v}^{(l)}} \tilde{\lambda}_l = R^\dagger \sum_{l = 1}^k \frac{1}{\tilde{\sigma}_l^3} \ket{w^{(l)}} \tilde{\lambda}_l = R^\dagger z\]Now that we’ve sketched the steps of this process, we detail each approximation and show that we can achieve the claimed correctness and complexity bounds.
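Before detailing each step, a compact end-to-end numpy sketch of this chain may help fix ideas. The problem sizes, the choices of $r$ and $c$, the low-rank test matrix, and the singular-value threshold are all arbitrary illustrative values, and the scalars $\tilde{\lambda}_l$ are computed exactly here rather than estimated via sampled trace inner products:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, k_true, r, c = 5000, 1000, 5, 500, 500

# A low-rank, well-conditioned A with ||A|| = 1 (illustrative construction).
U0 = np.linalg.qr(rng.standard_normal((m, k_true)))[0]
V0 = np.linalg.qr(rng.standard_normal((n, k_true)))[0]
A = U0 @ np.diag(np.linspace(1.0, 0.8, k_true)) @ V0.T
b = A @ rng.standard_normal(n)                     # a consistent right-hand side

# Step 1: length-squared sample r rows of A; rescale so that E[R^T R] = A^T A.
p_rows = np.sum(A**2, axis=1) / np.sum(A**2)
rows = rng.choice(m, size=r, p=p_rows)
R = A[rows] / np.sqrt(r * p_rows[rows])[:, None]

# Step 2: length-squared sample c columns of R; rescale so that E[C C^T] = R R^T.
p_cols = np.sum(R**2, axis=0) / np.sum(R**2)
cols = rng.choice(n, size=c, p=p_cols)
C = R[:, cols] / np.sqrt(c * p_cols[cols])

# Step 3: SVD of the small r x c matrix C: left singular vectors w^(l), values sigma_tilde_l.
W, sigma_t, _ = np.linalg.svd(C, full_matrices=False)
k = int(np.sum(sigma_t > 0.5))                     # crude threshold to drop pure sampling noise

# Step 4: v_tilde^(l) = R^T w^(l) / sigma_tilde_l and lambda_tilde_l = <v_tilde^(l)| A^T |b>
#         (computed exactly here; the algorithm estimates these via trace inner products).
V_tilde = (R.T @ W[:, :k]) / sigma_t[:k]
lam = V_tilde.T @ (A.T @ b)

# Step 5: z = sum_l (lambda_tilde_l / sigma_tilde_l^3) w^(l);  x_tilde = R^T z.
z = W[:, :k] @ (lam / sigma_t[:k] ** 3)
x_tilde = R.T @ z                                  # formed in full only to compare against x
x = np.linalg.pinv(A) @ b
print("relative error:", np.linalg.norm(x_tilde - x) / np.linalg.norm(x))
```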
Computing Approximate Singular Vectors
As described above, we begin by length-squared sampling the original matrix $A \in \CC^{m \times n}$. So, pick row index $i$ with probability $p_i = \frac{\Vert A_{i, \cdot}\Vert^2}{\Vert A \Vert_F^2}$ and output the random row $Y = A_{i, \cdot} / \sqrt{p_i} = \frac{A_{i, \cdot}}{\Vert A_{i, \cdot}\Vert}\Vert A\Vert_F$. After sampling $r$ rows, we implicitly define the matrix $R$ to be the concatenation of these random rows. Furthermore, we rescale $R$ so that $E[R^{\dagger} R] = A^\dagger A$, which requires normalizing by $\sqrt{r}$. Therefore,
\[R = \frac{1}{\sqrt{r}} \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_r \end{bmatrix} \in \CC^{r \times n}\]
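A small numpy experiment (with an arbitrary test matrix) illustrating that this rescaling gives $E[R^\dagger R] = A^\dagger A$, with the Frobenius error shrinking roughly like $1/\sqrt{r}$ as $r$ grows:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((1000, 50))
A /= np.linalg.norm(A, 2)                       # enforce ||A|| <= 1

p = np.sum(A**2, axis=1) / np.sum(A**2)         # length-squared row distribution

for r in (100, 1000, 10000):
    rows = rng.choice(A.shape[0], size=r, p=p)
    R = A[rows] / np.sqrt(r * p[rows])[:, None]  # row i scaled by 1 / sqrt(r * p_i)
    err = np.linalg.norm(R.T @ R - A.T @ A, 'fro')
    print(r, err)                                # error shrinks roughly like 1/sqrt(r)
```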
Writing in progress… For the filled-in details, please check out my QML Notes.