Search | arXiv e-print repository

Matrix Norms in Data Streams: Faster, Multi-Pass and Row-Order

Authors: Vladimir Braverman, Stephen R. Chestnut, Robert Krauthgamer, Yi Li, David P. Woodruff, Lin F. Yang

Abstract: A central problem in data streams is to characterize which functions of an underlying frequency vector can be approximated efficiently. Recently there has been considerable effort in extending this problem to that of estimating functions of a matrix that is presented as a data stream. This setting generalizes classical problems to the analogous ones for matrices. For example, instead of estimating frequent-item counts, we now wish to estimate "frequent-direction" counts. A related example is to estimate norms, which now correspond to estimating a vector norm on the singular values of the matrix. Despite recent efforts, the current understanding for such matrix problems is considerably weaker than that for vector problems. We study a number of aspects of estimating matrix norms in a stream that have not previously been considered: (1) multi-pass algorithms, (2) algorithms that see the underlying matrix one row at a time, and (3) time-efficient algorithms. Our multi-pass and row-order algorithms use less memory than what is provably required in the single-pass and entrywise-update models, and thus give separations between these models (in terms of memory). Moreover, all of our algorithms are considerably faster than previous ones. We also prove a number of lower bounds, and obtain, for instance, a near-complete characterization of the memory required of row-order algorithms for estimating Schatten $p$-norms of sparse matrices.

Submitted 24 October, 2018; v1 submitted 19 September, 2016; originally announced September 2016.

Comments: Merged works

arXiv:1603.00759 [pdf, ps, other]

BPTree: an $\ell_2$ heavy hitters algorithm using constant memory

Authors: Vladimir Braverman, Stephen R. Chestnut, Nikita Ivkin, Jelani Nelson, Zhengyu Wang, David P. Woodruff

Abstract: The task of finding heavy hitters is one of the best known and well studied problems in the area of data streams. One is given a list $i_1,i_2,\ldots,i_m\in[n]$ and the goal is to identify the items among $[n]$ that appear frequently in the list. In sub-polynomial space, the strongest guarantee available is the $\ell_2$ guarantee, which requires finding all items that occur at least $ε\|f\|_2$ times in the stream, where the vector $f\in\mathbb{R}^n$ is the count histogram of the stream with $i$th coordinate equal to the number of times $i$ appears, $f_i:=\#\{j\in[m]:i_j=i\}$. The first algorithm to achieve the $\ell_2$ guarantee was the CountSketch of [CCF04], which requires $O(ε^{-2}\log n)$ words of memory and $O(\log n)$ update time and is known to be space-optimal if the stream allows for deletions. The recent work of [BCIW16] gave an improved algorithm for insertion-only streams, using only $O(ε^{-2}\log ε^{-1}\log\log n)$ words of memory. In this work, we give an algorithm BPTree for $\ell_2$ heavy hitters in insertion-only streams that achieves $O(ε^{-2}\log ε^{-1})$ words of memory and $O(\log ε^{-1})$ update time, which is the optimal dependence on $n$ and $m$. In addition, we describe an algorithm for tracking $\|f\|_2$ at all times with $O(ε^{-2})$ memory and update time. Our analyses rely on bounding the expected supremum of a Bernoulli process involving Rademachers with limited independence, which we accomplish via a Dudley-like chaining argument that may have applications elsewhere.

Submitted 9 November, 2017; v1 submitted 2 March, 2016; originally announced March 2016.

Comments: v4: PODS'17 camera-ready version, includes improved space l_2 tracking (by log(1/epsilon) factor); v3: fixed accidental mis-sorting of author last names; v2: added section explaining why pick-and-drop sampling fails for l2 heavy hitters, and fixed minor typos

arXiv:1601.07473 [pdf, ps, other]

Streaming Space Complexity of Nearly All Functions of One Variable on Frequency Vectors

Authors: Vladimir Braverman, Stephen R. Chestnut, David P. Woodruff, Lin F. Yang

Abstract: A central problem in the theory of algorithms for data streams is to determine which functions on a stream can be approximated in sublinear, and especially sub-polynomial or poly-logarithmic, space. Given a function $g$, we study the space complexity of approximating $\sum_{i=1}^n g(|f_i|)$, where $f\in\mathbb{Z}^n$ is the frequency vector of a turnstile stream. This is a generalization of the well-known frequency moments problem, and previous results apply only when $g$ is monotonic or has a special functional form. Our contribution is to give a condition such that, except for a narrow class of functions $g$, there is a space-efficient approximation algorithm for the sum if and only if $g$ satisfies the condition. The functions $g$ that we are able to characterize include all convex, concave, monotonic, polynomial, and trigonometric functions, among many others, and this is the first such characterization for non-monotonic functions. Thus, for nearly all functions of one variable, we answer the open question from the celebrated paper of Alon, Matias and Szegedy (1996).

Submitted 27 January, 2016; originally announced January 2016.

arXiv:1512.07126 [pdf, ps, other]

Sublinear Bounds for a Quantitative Doignon-Bell-Scarf Theorem

Authors: Stephen R. Chestnut, Robert Hildebrand, Rico Zenklusen

Abstract: The recent paper "A quantitative Doignon-Bell-Scarf Theorem" by Aliev et al. generalizes the famous Doignon-Bell-Scarf Theorem on the existence of integer solutions to systems of linear inequalities. Their generalization examines the number of facets of a polyhedron that contains exactly $k$ integer points in $\mathbb{R}^n$. They show that there exists a number $c(n,k)$ such that any polyhedron in $\mathbb{R}^n$ that contains exactly $k$ integer points has a relaxation to at most $c(n,k)$ of its inequalities that will define a new polyhedron with the same integer points. They prove that $c(n,k) = O(k2^n)$. In this paper, we improve the bound asymptotically to be sublinear in $k$. We also provide lower bounds on $c(n,k)$, along with other structural results. For dimension $n=2$, our bounds are asymptotically tight to within a constant.

Submitted 30 August, 2017; v1 submitted 22 December, 2015; originally announced December 2015.

arXiv:1511.02486 [pdf, ps, other]

Hardness and Approximation for Network Flow Interdiction

Authors: Stephen R. Chestnut, Rico Zenklusen

Abstract: In the Network Flow Interdiction problem an adversary attacks a network in order to minimize the maximum s-t-flow. Very little is known about the approximability of this problem despite decades of interest in it. We present the first approximation hardness, showing that Network Flow Interdiction and several of its variants cannot be much easier to approximate than Densest k-Subgraph. In particular, any $n^{o(1)}$-approximation algorithm for Network Flow Interdiction would imply an $n^{o(1)}$-approximation algorithm for Densest k-Subgraph. We complement this hardness result with the first approximation algorithm for Network Flow Interdiction, which has approximation ratio 2(n-1). We also show that Network Flow Interdiction is essentially the same as the Budgeted Minimum s-t-Cut problem, and transferring our results gives the first approximation hardness and algorithm for that problem, as well.

Submitted 8 November, 2015; originally announced November 2015.

arXiv:1511.02484 [pdf, ps, other]

Interdicting Structured Combinatorial Optimization Problems with {0,1}-Objectives

Authors: Stephen R. Chestnut, Rico Zenklusen

Abstract: Interdiction problems ask about the worst-case impact of a limited change to an underlying optimization problem. They are a natural way to measure the robustness of a system, or to identify its weakest spots. Interdiction problems have been studied for a wide variety of classical combinatorial optimization problems, including maximum $s$-$t$ flows, shortest $s$-$t$ paths, maximum weight matchings, minimum spanning trees, maximum stable sets, and graph connectivity. Most interdiction problems are NP-hard, and furthermore, even designing efficient approximation algorithms that allow for estimating the order of magnitude of a worst-case impact has turned out to be very difficult. Not very surprisingly, the few known approximation algorithms are heavily tailored for specific problems. Inspired by an approach of Burch et al. (2003), we suggest a general method to obtain pseudoapproximations for many interdiction problems. More precisely, for any $α>0$, our algorithm will return either a $(1+α)$-approximation, or a solution that may overrun the interdiction budget by a factor of at most $1+α^{-1}$ but is also at least as good as the optimal solution that respects the budget. Furthermore, our approach can handle submodular interdiction costs when the underlying problem is to find a maximum weight independent set in a matroid, as for example the maximum weight forest problem. The approach can sometimes be refined by exploiting additional structural properties of the underlying optimization problem to obtain stronger results. We demonstrate this by presenting a PTAS for interdicting $b$-stable sets in bipartite graphs.

Submitted 8 November, 2015; originally announced November 2015.

arXiv:1511.01111 [pdf, other]

Streaming Symmetric Norms via Measure Concentration

Authors: Jaroslaw Blasiok, Vladimir Braverman, Stephen R. Chestnut, Robert Krauthgamer, Lin F. Yang

Abstract: We characterize the streaming space complexity of every symmetric norm $l$ (a norm on $\mathbb{R}^n$ invariant under sign-flips and coordinate-permutations), by relating this space complexity to the measure-concentration characteristics of $l$. Specifically, we provide nearly matching upper and lower bounds on the space complexity of calculating a $(1\pm ε)$-approximation to the norm of the stream, for every $0<ε\leq 1/2$. (The bounds match up to $poly(ε^{-1} \log n)$ factors.) We further extend those bounds to any large approximation ratio $D\geq 1.1$, showing that the decrease in space complexity is proportional to $D^2$, and that this factor is the best possible. All of the bounds depend on the median of $l(x)$ when $x$ is drawn uniformly from the $l_2$ unit sphere. The same median governs many phenomena in high-dimensional spaces, such as large-deviation bounds and the critical dimension in Dvoretzky's Theorem. The family of symmetric norms contains several well-studied norms, such as all $l_p$ norms, and indeed we provide a new explanation for the disparity in space complexity between $p\le 2$ and $p>2$. In addition, we apply our general results to easily derive bounds for several norms that were not studied before in the streaming model, including the top-$k$ norm and the $k$-support norm, which was recently employed for machine learning tasks. Overall, these results make progress on two outstanding problems in the area of sublinear algorithms (Problems 5 and 30 in http://sublinear.info).

Submitted 26 June, 2017; v1 submitted 3 November, 2015; originally announced November 2015.

Comments: published in STOC 2017

arXiv:1511.00661 [pdf, ps, other]

Beating CountSketch for Heavy Hitters in Insertion Streams

Authors: Vladimir Braverman, Stephen R. Chestnut, Nikita Ivkin, David P. Woodruff

Abstract: Given a stream $p_1, \ldots, p_m$ of items from a universe $\mathcal{U}$, which, without loss of generality we identify with the set of integers $\{1, 2, \ldots, n\}$, we consider the problem of returning all $\ell_2$-heavy hitters, i.e., those items $j$ for which $f_j \geq ε\sqrt{F_2}$, where $f_j$ is the number of occurrences of item $j$ in the stream, and $F_2 = \sum_{i \in [n]} f_i^2$. Such a guarantee is considerably stronger than the $\ell_1$-guarantee, which finds those $j$ for which $f_j \geq εm$. In 2002, Charikar, Chen, and Farach-Colton suggested the CountSketch data structure, which finds all such $j$ using $Θ(\log^2 n)$ bits of space (for constant $ε> 0$). The only known lower bound is $Ω(\log n)$ bits of space, which comes from the need to specify the identities of the items found. In this paper we show it is possible to achieve $O(\log n \log \log n)$ bits of space for this problem. Our techniques, based on Gaussian processes, lead to a number of other new results for data streams, including (1) The first algorithm for estimating $F_2$ simultaneously at all points in a stream using only $O(\log n\log\log n)$ bits of space, improving a natural union bound and the algorithm of Huang, Tai, and Yi (2014). (2) A way to estimate the $\ell_{\infty}$ norm of a stream up to additive error $ε\sqrt{F_2}$ with $O(\log n\log\log n)$ bits of space, resolving Open Question 3 from the IITK 2006 list for insertion only streams.

Submitted 2 November, 2015; originally announced November 2015.
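
The two guarantees compared in the abstract above can be made concrete with a small exact computation. The following is only an illustrative sketch of the definitions $f_j \geq ε\sqrt{F_2}$ and $f_j \geq εm$: the function names and the example stream are hypothetical and are not taken from the paper, and the code stores the full histogram rather than using a small-space structure such as CountSketch.

```python
from collections import Counter
from math import sqrt


def l2_heavy_hitters(stream, eps):
    """Items j with f_j >= eps * sqrt(F_2), computed exactly (not in small space)."""
    freq = Counter(stream)
    f2 = sum(count * count for count in freq.values())  # second frequency moment F_2
    return {item for item, count in freq.items() if count >= eps * sqrt(f2)}


def l1_heavy_hitters(stream, eps):
    """Items j with f_j >= eps * m, where m is the stream length."""
    freq = Counter(stream)
    m = len(stream)
    return {item for item, count in freq.items() if count >= eps * m}


# Hypothetical stream: item 0 appears 10 times, items 1..90 appear once each.
example_stream = [0] * 10 + list(range(1, 91))
```

In this example $m = 100$ and $F_2 = 10^2 + 90 = 190$, so with $ε = 0.5$ item 0 clears the $\ell_2$ threshold of roughly 6.9 but falls well short of the $\ell_1$ threshold of 50.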

arXiv:1408.5096 [pdf, ps, other]

Universal sketches for the frequency negative moments and other decreasing streaming sums

Authors: Vladimir Braverman, Stephen R. Chestnut

Abstract: Given a stream with frequencies $f_d$, for $d\in[n]$, we characterize the space necessary for approximating the frequency negative moments $F_p=\sum |f_d|^p$, where $p<0$ and the sum is taken over all items $d\in[n]$ with nonzero frequency, in terms of $n$, $ε$, and $m=\sum |f_d|$. To accomplish this, we actually prove a much more general result. Given any nonnegative and nonincreasing function $g$, we characterize the space necessary for any streaming algorithm that outputs a $(1\pm ε)$-approximation to $\sum g(|f_d|)$, where again the sum is over items with nonzero frequency. The storage required is expressed in the form of the solution to a relatively simple nonlinear optimization problem, and the algorithm is universal for $(1\pm ε)$-approximations to any such sum where the applied function is nonnegative, nonincreasing, and has the same or smaller space complexity as $g$. This partially answers an open question of Nelson (IITK Workshop Kanpur, 2009).

Submitted 16 February, 2015; v1 submitted 21 August, 2014; originally announced August 2014.

Comments: 19 pages

arXiv:1208.4125 [pdf, ps, other]

Counting Spanning Trees of Threshold Graphs

Authors: Stephen R. Chestnut, Donniell E. Fishkind

Abstract: Cayley's formula states that there are $n^{n-2}$ spanning trees in the complete graph on $n$ vertices; it has been proved in more than a dozen different ways over its 150-year history. The complete graphs are a special case of threshold graphs, and using Merris' Theorem and the Matrix Tree Theorem, there is a strikingly simple formula for counting the number of spanning trees in a threshold graph on $n$ vertices; it is simply the product, over $i=2,3,\ldots,n-1$, of the number of vertices of degree at least $i$. In this manuscript, we provide a direct combinatorial proof for this formula which does not use the Matrix Tree Theorem; the proof is an extension of Joyal's proof for Cayley's formula. Then we apply this methodology to give a formula for the number of spanning trees in any difference graph.

Submitted 8 January, 2013; v1 submitted 20 August, 2012; originally announced August 2012.

Comments: 14 pages, 5 figures

MSC Class: 05A19
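
The degree-product formula quoted in the abstract above can be checked numerically on small cases. The sketch below is a sanity check, not the paper's combinatorial proof: it assumes the standard construction of a threshold graph by repeatedly adding an isolated or a dominating vertex, all function names are hypothetical, and the comparison count deliberately uses the Matrix Tree Theorem (which the paper's proof avoids). It is intended for graphs on at least 3 vertices.

```python
from fractions import Fraction


def threshold_graph(creation):
    """Build adjacency sets from a creation sequence such as "idd".

    Starting from a single vertex, each 'i' adds an isolated vertex and
    each 'd' adds a vertex adjacent to all earlier vertices.
    """
    adj = [set()]
    for step in creation:
        new = len(adj)
        adj.append(set())
        if step == "d":
            for old in range(new):
                adj[old].add(new)
                adj[new].add(old)
    return adj


def trees_by_degree_formula(adj):
    """Product over i = 2..n-1 of the number of vertices with degree >= i."""
    n = len(adj)
    degrees = [len(neighbors) for neighbors in adj]
    product = 1
    for i in range(2, n):
        product *= sum(1 for d in degrees if d >= i)
    return product


def trees_by_matrix_tree(adj):
    """Determinant of the Laplacian with the last row and column removed."""
    size = len(adj) - 1
    m = [[Fraction(len(adj[r])) if r == c else Fraction(-1 if c in adj[r] else 0)
          for c in range(size)] for r in range(size)]
    det = Fraction(1)
    for col in range(size):
        # Exact Gaussian elimination over the rationals.
        pivot = next((r for r in range(col, size) if m[r][col] != 0), None)
        if pivot is None:
            return 0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size):
                m[r][c] -= factor * m[col][c]
    return int(det)
```

For the creation sequence "ddd" the graph is the complete graph on 4 vertices and both functions return $4^{4-2} = 16$, matching Cayley's formula; for "idd" (the complete graph on 4 vertices minus one edge) both return 8.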

Showing 1–10 of 10 results for author: Chestnut, S R