Where: Sudikoff 115
When: 2A hour: TTh 14:0015:50, Xhr W 16:1517:20
Who: Instructor: Amit Chakrabarti,
Sudikoff 107, Office hours: Mon 9:3010:30 or by appointment
What:
This course studies algorithms that process massive amounts of
data; so massive that they will not fit in a computer's storage.
As we shall see, this forces one to rethink even very basic
algorithmic problems, such as counting the number of distinct
elements, selection, and sorting. We shall study techniques to
summarize such large amounts of data into succinct "sketches"
that nevertheless retain important and useful information. We
shall introduce the necessary mathematical tools along the way.
Most techniques we shall study come from research in the last
decade or so, although a few date as far back as the 1970s. The
techniques we shall study have been applied successfully to web
data compression, approximate query processing in databases,
network measurement and signal processing. The course itself
will focus on the underlying techniques rather than the specific
applications.
Prerequisites: An undergraduatelevel course
in Algorithms (such as our CS 31)
or else: A strong mathematics
background and permission of the instructor.
Lecture Plan: The following is a rough outline,
and will very likely be changed once the course gets going.
I. Basic Algorithms: Estimating Statistics 
1.  Sep 17  Data stream model; Our first algorithm: the majority problem. 
2.  Sep 22  The distinctelements problem; Randomized algorithms; Approximation. 
3.  Sep 24  More modern results on distinctelements. 
4.  Sep 26  Count Sketch; CountMin Sketch; majority revisited; The heavyhitters problem. 
5.  Sep 29  Frequency moments; The AMS algorithm; Improving a Basic Estimator. 
6.  Oct 1  The amazing AMS second moment (F_{2}) algorithm; Dimension reduction; JohnsonLindenstrauss Lemma. 
7.  Oct 6  Stable distributions; Estimating L_{1} distance; Indyk's algorithm and L_{p} norms, p ∈ (0,2). 
8.  Oct 8  The rest of the L_{p} norms; Sketch using precision sampling. 
9.  Oct 13  Proof of the Precision Sampling Lemma. 
10.  Oct 15  The median and selection problems; Random order versus adversarial order. 
II. More Advanced Algorithms: Graphs, Geometry, Sequences 
11.  Oct 20  Geometric problems; Coresets; The minenclosingball problem. 
12.  Oct 22  Metric streams; Clustering; kcenter, kmedian, kmeans. 
13.  Oct 29  Finding a good kcenter clustering; Doubling algorithm; Guha's algorithm. 
14.  Nov 3  Triangle counting; Mininum Spanning Trees. 
15.  Nov 5  Maximum Matchings. 
III. Lower Bounds 
16.  Nov 10  Communication complexity; The index, disjointness and gaphamming problems. 
17.  Nov 12  Some reductions from communication to streaming problems. 
18.  Nov 17  First set of student presentations  
19.  Nov 20  (starting 11:30am) Second set of student presentations 
Textbooks: There is no set textbook for this course,
but we have some evolving inhouse course
notes, thanks to the scribing efforts of students from the previous
offering of the course. A good fraction of the material we shall study only
resides in research papers. There is a good survey by Muthukrishnan, somewhat dated
at this point, but still very useful for the basics.
Homework Sets: We will have four homework sets in total.
Here is a description of the grading
principles for these homework sets. Please also note all other applicable policies, including how the honor
principle applies to this course.
Class Presentation: In lieu of a final exam,
students are required to read up on an advanced topic in data stream
algorithms (usually represented by a research paper), and give a
short, 30minute presentation on its main ideas.
Below is a list of suggested papers for such a presentation.

Efficient sampling of nonstrict turnstile data streams.
Barkay, Porat, Shalem. (TCS, 2015)
Plus, for background:
A Small Approximately MinWise Independent Family of Hash
Functions. Indyk. (J. Alg., 2001)

Stable Distributions, Pseudorandom Generators, Embeddings, and Data Stream Computation.
Indyk. (JACM, 2006)

Numerical Linear Algebra in the Streaming Model.
Clarkson, Woodruff. (STOC, 2009)

A NearOptimal Algorithm for Estimating the Entropy of a Stream.
Chakrabarti, Cormode, McGregor. (TALG, 2010)

Estimating PageRank on Graph Streams.
Das Sarma, Gollapudi, Panigrahy. (JACM, 2011)

Sketching Information Divergences.
Guha, Indyk, McGregor. (JML, 2008)

Analyzing Graph Structure via Linear Measurements.
Ahn, Guha, McGregor. (SODA, 2012)

Parameterized Streaming Algorithms for Vertex Cover.
Chitnis, Cormode, Hajiaghayi, Monemizadeh. (SODA, 2015)

The Communication Complexity of Distributed εApproximations.
Huang, Yi. (FOCS, 2014)

Streaming and Fully Dynamic Centralized Algorithms for Constructing and Maintaining Sparse Spanners.
Elkin. (ACM TALG, 2011)

Counting Triangles in RealWorld Graph Streams: Dealing with Repeated Edges and Time Windows.
Jha, Seshadri, Pinar. (ACM TKDD, 2015)
