Where: Sudikoff 115
When: 2A hour: TTh 14:00-15:50, X-hr W 16:15-17:20
Who: Instructor: Amit Chakrabarti,
Sudikoff 107, Office hours: Mon 9:30-10:30 or by appointment
This course studies algorithms that process massive amounts of
data; so massive that they will not fit in a computer's storage.
As we shall see, this forces one to rethink even very basic
algorithmic problems, such as counting the number of distinct
elements, selection, and sorting. We shall study techniques to
summarize such large amounts of data into succinct "sketches"
that nevertheless retain important and useful information. We
shall introduce the necessary mathematical tools along the way.
Most techniques we shall study come from research in the last
decade or so, although a few date as far back as the 1970s. The
techniques we shall study have been applied successfully to web
data compression, approximate query processing in databases,
network measurement and signal processing. The course itself
will focus on the underlying techniques rather than the specific
Prerequisites: An undergraduate-level course
in Algorithms (such as our CS 31)
or else: A strong mathematics
background and permission of the instructor.
Lecture Plan: The following is a rough outline,
and will very likely be changed once the course gets going.
|1. ||Sep 17 || Data stream model; Our first algorithm: the majority problem. |
|2. ||Sep 22 || The distinct-elements problem; Randomized algorithms; Approximation. |
|3. ||Sep 24 || More modern results on distinct-elements. |
|4. ||Sep 26 || Count Sketch; Count-Min Sketch; majority revisited; The heavy-hitters problem. |
|5. ||Sep 29 || Frequency moments; The AMS algorithm; Improving a Basic Estimator. |
|6. ||Oct 1 || The amazing AMS second moment (F2) algorithm; Dimension reduction; Johnson-Lindenstrauss Lemma. |
|7. ||Oct 6 || Stable distributions; Estimating L1 distance; Indyk's algorithm and Lp norms, p ∈ (0,2). |
|8. ||Oct 8 || The rest of the Lp norms; Sketch using precision sampling. |
|9. ||Oct 13 || Proof of the Precision Sampling Lemma. |
|10. ||Oct 15 || The median and selection problems; Random order versus adversarial order. |
|11. ||Oct 20 || Geometric problems; Coresets; The min-enclosing-ball problem. |
|12. ||Oct 22 || Metric streams; Clustering; k-center, k-median, k-means. |
|13. ||Oct 29 || Finding a good k-center clustering; Doubling algorithm; Guha's algorithm. |
|14. ||Nov 3 || Triangle counting; Mininum Spanning Trees. |
|15. ||Nov 5 || Maximum Matchings. |
|16. ||Nov 10 || Communication complexity; The index, disjointness and gap-hamming problems. |
|17. ||Nov 12 || Some reductions from communication to streaming problems. |
|18. ||Nov 17 || First set of student presentations || |
|19. ||Nov 20 || (starting 11:30am) Second set of student presentations |
Textbooks: There is no set textbook for this course,
but we have some evolving in-house course
notes, thanks to the scribing efforts of students from the previous
offering of the course. A good fraction of the material we shall study only
resides in research papers. There is a good survey by Muthukrishnan, somewhat dated
at this point, but still very useful for the basics.
Homework Sets: We will have four homework sets in total.
Here is a description of the grading
principles for these homework sets. Please also note all other applicable policies, including how the honor
principle applies to this course.
Class Presentation: In lieu of a final exam,
students are required to read up on an advanced topic in data stream
algorithms (usually represented by a research paper), and give a
short, 30-minute presentation on its main ideas.
Below is a list of suggested papers for such a presentation.
Efficient sampling of non-strict turnstile data streams.
Barkay, Porat, Shalem. (TCS, 2015)
Plus, for background:
A Small Approximately Min-Wise Independent Family of Hash
Functions. Indyk. (J. Alg., 2001)
Stable Distributions, Pseudorandom Generators, Embeddings, and Data Stream Computation.
Indyk. (JACM, 2006)
Numerical Linear Algebra in the Streaming Model.
Clarkson, Woodruff. (STOC, 2009)
A Near-Optimal Algorithm for Estimating the Entropy of a Stream.
Chakrabarti, Cormode, McGregor. (TALG, 2010)
Estimating PageRank on Graph Streams.
Das Sarma, Gollapudi, Panigrahy. (JACM, 2011)
Sketching Information Divergences.
Guha, Indyk, McGregor. (JML, 2008)
Analyzing Graph Structure via Linear Measurements.
Ahn, Guha, McGregor. (SODA, 2012)
Parameterized Streaming Algorithms for Vertex Cover.
Chitnis, Cormode, Hajiaghayi, Monemizadeh. (SODA, 2015)
The Communication Complexity of Distributed ε-Approximations.
Huang, Yi. (FOCS, 2014)
Streaming and Fully Dynamic Centralized Algorithms for Constructing and Maintaining Sparse Spanners.
Elkin. (ACM TALG, 2011)
Counting Triangles in Real-World Graph Streams: Dealing with Repeated Edges and Time Windows.
Jha, Seshadri, Pinar. (ACM TKDD, 2015)