Probabilistic Topic Modeling of Legal Data
Joseph Futoma, Emily Eisner, David Rice
Introduction
The rise of big data in recent years has necessitated the development of new information retrieval techniques to organize, navigate, and search large amounts of data efficiently. One field where such methods may prove especially useful is legal studies, where large numbers of legal documents, including decisions, orders, and opinions, can be stored electronically in large repositories. This application is the primary focus of our project: we will apply existing probabilistic topic models, which are learning methods that infer the latent thematic structure of a corpus of documents, to real legal data. In court, judges and attorneys often draw on the decisions and work of other courts to support their own cases. A massive, searchable archive of digitized legal documents, automatically organized by theme without requiring humans to manually label each individual document, would therefore be extremely beneficial to legal professionals.
One fundamental limitation of many existing topic models, however, is their inability to scale to extremely large datasets. While simple topic models may be feasible for a corpus of legal documents from a single county, such methods are unlikely to work well on a much larger corpus, such as a growing statewide archive. As such, the development of approximate inference schemes that scale to massive datasets is an important area of research. In our project, we will apply an existing Bayesian nonparametric model with a tractable approximate online inference scheme that scales well to a massive dataset, and compare our results to those of a simpler model.
Methods
First, we will explore the simplest topic model, Latent Dirichlet Allocation (LDA), by implementing it ourselves with standard batch variational inference and applying it to our legal data. LDA is a generative probabilistic model that assumes the documents in a corpus share a fixed number of latent topics, which are distributions over the fixed vocabulary [1]. By assuming exchangeability and treating each document as a "bag of words", the model exploits conditional independence to obtain a simple, factored joint distribution, which makes computation efficient. Since the posterior distribution in LDA is intractable to compute, however, approximate inference techniques must be applied to estimate the latent variables and parameters of the model. In our implementation of LDA, we use batch mean-field variational inference as in [1], which transforms posterior inference into an optimization problem. It does so by positing a simpler variational distribution that makes additional independence assumptions, and then maximizing a lower bound on the log likelihood (which equivalently minimizes the KL divergence between the variational distribution and the true posterior). Typically this is accomplished with a coordinate ascent algorithm, as closed-form updates for the optimal variational parameters can be derived; a sketch of these updates is given below.
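As a concrete illustration, the following is a minimal sketch of one pass of the coordinate-ascent updates from [1], assuming documents are stored as a dense document-term count matrix; the hyperparameters alpha and eta, the fixed number of inner iterations, and all function and variable names are our own illustrative choices rather than part of the final implementation.

```python
import numpy as np
from scipy.special import digamma

# Minimal sketch of batch coordinate-ascent variational inference for LDA,
# following the updates in [1]. The hyperparameters alpha and eta, the fixed
# inner-iteration count, and all names are illustrative choices.

def e_step(counts, lam, alpha, n_iter=50):
    """counts: (D, V) word-count matrix; lam: (K, V) variational topic parameters."""
    D, V = counts.shape
    K = lam.shape[0]
    # E[log beta_kw] under q(beta_k) = Dirichlet(lambda_k)
    Elog_beta = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))
    gamma = np.ones((D, K))        # per-document Dirichlet parameters
    sstats = np.zeros_like(lam)    # sufficient statistics for the M-step
    for d in range(D):
        ids = np.nonzero(counts[d])[0]
        cts = counts[d, ids]
        for _ in range(n_iter):
            Elog_theta = digamma(gamma[d]) - digamma(gamma[d].sum())
            # phi_dwk is proportional to exp(E[log theta_dk] + E[log beta_kw])
            log_phi = Elog_theta[:, None] + Elog_beta[:, ids]
            phi = np.exp(log_phi - log_phi.max(axis=0))
            phi /= phi.sum(axis=0)
            gamma[d] = alpha + (phi * cts).sum(axis=1)
        sstats[:, ids] += phi * cts
    return gamma, sstats

def m_step(sstats, eta):
    # Closed-form update for the variational topic parameters lambda.
    return eta + sstats
```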
LDA with batch variational inference has several shortcomings, however, which have led to many natural extensions. The extension we implement from [2] improves on LDA in two important respects. First, it avoids LDA's assumption that there is a fixed number of topics in the corpus by exploiting current Bayesian nonparametric methods. In LDA, the number of topics is fixed a priori, but often this is not a reasonable assumption, especially for data where there is little prior knowledge about how many topics to expect. Following [6,7], this is overcome by developing a nonparametric model based on a hierarchical Dirichlet process (HDP), so that the number of topics is potentially infinite and is allowed to grow as we observe more data. Note that the Dirichlet process is a stochastic process that is the infinite-dimensional generalization of the Dirichlet distribution, and the hierarchical model uses multiple Dirichlet processes; see [6] for details on the HDP.
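To make this intuition concrete, the following is a brief sketch of the stick-breaking construction that generates Dirichlet process topic weights, under the finite truncation typically used by variational approaches; the concentration parameter and truncation level here are arbitrary illustrative values, not part of our model specification.

```python
import numpy as np

# Sketch of a truncated stick-breaking construction for Dirichlet process
# topic weights. The concentration parameter alpha0 and truncation level T
# are arbitrary illustrative values.

def stick_breaking(alpha0, T, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, alpha0, size=T)                        # v_k ~ Beta(1, alpha0)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return v * remaining                                     # pi_k = v_k * prod_{j<k} (1 - v_j)

pi = stick_breaking(alpha0=2.0, T=20)
print(pi.round(3))
print("mass left for new topics:", 1.0 - pi.sum())           # allows topics to grow with the data
```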
Another important shortcoming relates to the use of batch variational inference. In batch variational inference, we typically solve the resulting optimization problem with a coordinate ascent algorithm, alternating between analyzing every data point in the dataset and re-estimating the parameters that summarize its latent structure. For massive datasets this clearly becomes intractable, as every iteration of the algorithm requires re-examining all of the data. Stochastic optimization, as in [3,4,5], is one effective technique for avoiding this problem: we instead use noisy estimates of the gradient of the objective function, computed from small subsets of the data. The resulting inference algorithm resembles the batch algorithm but is applicable to massive, streaming data, hence the name "online variational inference". Finally, the technique in [2] further extends such online variational methods by intelligently splitting and merging portions of the variational posterior, which avoids becoming trapped in local optima better than traditional batch and online variational techniques. Following the terminology of [2], we refer to this inference algorithm as oHDP-SM (online hierarchical Dirichlet process - split-merge).
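The core stochastic update is simple to state; below is a minimal sketch, in the simpler LDA setting rather than the full oHDP-SM algorithm, of how a mini-batch yields a noisy estimate of the batch update that is blended into the current topic parameters with a decaying step size. It reuses the e_step sketch above, and the corpus size D, the step-size parameters tau0 and kappa, and the function names are illustrative assumptions.

```python
import numpy as np

# Sketch of a single online (stochastic) variational update in the simpler
# LDA setting, in the spirit of [4,5]; oHDP-SM [2] adds the nonparametric
# topic model and split-merge moves on top of this idea. Reuses the e_step
# sketch above; D, tau0, and kappa are illustrative assumptions.

def online_update(lam, minibatch_counts, D, t, alpha, eta, tau0=1.0, kappa=0.7):
    rho_t = (tau0 + t) ** (-kappa)                    # decaying step size, kappa in (0.5, 1]
    _, sstats = e_step(minibatch_counts, lam, alpha)  # local variational step on the mini-batch
    scale = D / minibatch_counts.shape[0]             # rescale statistics to the full corpus size
    lam_hat = eta + scale * sstats                    # noisy estimate of the batch update
    return (1.0 - rho_t) * lam + rho_t * lam_hat      # blended (natural-gradient) step
```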
Data
There are two novel data sets that we are considering for the implementation of these methods. The first is a collection of legal documents from a court, including judges' decisions and orders from selected cases. In addition, we will test our implementations on a corpus of Supreme Court decisions. Finally, because a massive dataset of legal material is difficult to acquire, we will test our scalable algorithm on millions of Wikipedia articles, which are readily available, to demonstrate the algorithm's effectiveness on large data. Ideally, it would yield similar results on a comparably large corpus of legal documents.
In addition, as a fallback plan, we already possess three interesting data sets that have been preprocessed. The first is a corpus of State of the Union addresses, from Washington through Bush. The second is a corpus of global constitutions, and the third is a corpus of NIPS papers spanning 17 years.
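For the raw legal collections, the documents will first need to be converted into the bag-of-words representation the models above expect; the following is a small illustrative sketch of one way to do this, where the example strings and vectorizer settings are placeholders rather than our actual preprocessing pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative preprocessing sketch: convert raw document text into the
# bag-of-words count matrix used by the inference sketches above. The
# example strings stand in for real court documents.

docs = [
    "The court finds that the motion to dismiss is denied.",
    "The defendant's appeal of the lower court's order is granted.",
]
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs).toarray()   # (D, V) document-term counts
vocab = vectorizer.get_feature_names_out()          # vocabulary terms
```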
Timeline
By the milestone we will have completed our implementations and begun to produce preliminary results. This will allow sufficient time to collect and organize our final results, polish the final paper, and create the poster. In the event that oHDP-SM proves too difficult to implement within the time constraints, we will instead implement a different online inference algorithm as a fallback, from [4] or [5].
References
[1] D. Blei, A. Ng, M. Jordan. Latent Dirichlet Allocation. JMLR, 2003.
[2] M. Bryant, E. Sudderth. Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes. NIPS, 2012.
[3] M. Hoffman, D. Blei, J. Paisley, C. Wang. Stochastic Variational Inference. JMLR (submitted), 2012.
[4] C. Wang, J. Paisley, D. Blei. Online Variational Inference for the Hierarchical Dirichlet Process. AISTATS, 2011.
[5] M. Hoffman, D. Blei, F. Bach. Online Learning for Latent Dirichlet Allocation. NIPS, 2010.
[6] Y. Teh, M. Jordan, M. Beal, D. Blei. Hierarchical Dirichlet Processes. JASA, 2006.
[7] D. Blei, T. Griffiths, M. Jordan, J. Tenenbaum. Hierarchical Topic Models and the Nested Chinese Restaurant Process. NIPS, 2003.