Learning to Classify and Forecast Telemarketing Phone Numbers

Project Proposal

Yiming Qi and Zhiyuan Zhang

April 21, 2009
CS 34 Spring 2009

Motivation

Almost everyone has experienced the annoyance of telemarketing phone calls, which occur at inopportune times and repeat with high frequency. While people can sign up for the National Do Not Call Registry, a list of phone owners whom telemarketers cannot legally call, there are several disadvantages to doing so: it is bothersome to sign up, the jurisdiction of the list is limited in scope, and it blocks all telemarketing calls uniformly, even if the person wants some calls to come through. Moreover, many telemarketers call from spoofed numbers to deceive caller ID users and evade legal telemarketing restrictions. It would therefore be useful to know whether an incoming call is a telemarketing call before picking it up. With 10 billion possible phone numbers in the US and no obvious underlying pattern among them, it is difficult to differentiate telemarketing numbers from unidentified numbers that might belong to friends, family, or clients. One of the authors has had extensive personal experience with undesirable telemarketing calls, and we are therefore enthusiastic about finding a solution to this real-life problem.

Goals

1. We will classify phone numbers as "undesirable" (telemarketers, debt collectors, political canvassers, prank callers, and any other complained-about numbers) and "desirable" (everything else).

2. Given an incoming phone number and a time, we will predict whether the call will be desirable or undesirable. This goal, while similar to the first, differs in that the time of day is used as an additional input for more accurate prediction; this means we will treat the data as sequential and as drawn from a non-stationary distribution.

Data

We obtain undesirable phone numbers and call times from www.callercomplaints.com. We chose the site because there is no official online registry of telemarketing phone numbers. Caller Complaints is an online list of phone numbers submitted by people wishing to make their complaints public, and it is the most complete data set we found online. Each complaint contains three important data fields: the type of complaint, which users categorize as telemarketing, debt collection, political calls, prank calling, or 'unknown'; the time (day, month, year, hour, minute, second) at which the complaint was filed; and the name of the person who filed it. The main drawback of the data set is that it is user-submitted, and therefore may contain phone numbers that are erroneously posted as undesirable. We expect these false positives to be a minor problem, because the learning algorithms should treat a small number of mislabeled examples as noise.

Caller Complaints has over 20,000 unique undesirable phone numbers and over 344,000 reports filed against those numbers (multiple reports may be filed against a single phone number). We noticed that the reports are filed at arbitrary times, many of which are during working hours, so it is reasonable to assume that users file the report shortly after they receive a call, instead of waiting until it is more convenient to do so (e.g., filing a report after work). We will use the time the complaint is filed as a proxy for the time of the call. As for the name of the person who filed the complaint, the site does not enforce registration, so any name may be used; therefore, we cannot link multiple complaints with the same name because they may be filed by different persons. We will not use the name of the user who filed the complaint.
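
As a rough illustration of using the filing time as a proxy for the call time, the sketch below buckets complaint timestamps by hour of day. The timestamp format string, the function name, and the sample records are assumptions made for illustration only; the actual layout of the site's data may differ.

```python
from collections import Counter
from datetime import datetime

# Hypothetical timestamp layout; the real complaint records may use another format.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def hourly_histogram(filed_at):
    """Count complaints per hour of day (0-23) from their filing timestamps."""
    hours = (datetime.strptime(ts, TIMESTAMP_FORMAT).hour for ts in filed_at)
    return Counter(hours)


# Made-up sample records, not real complaint data.
sample = ["2009-04-20 09:15:02", "2009-04-20 09:47:30", "2009-04-20 18:05:11"]
histogram = hourly_histogram(sample)
```

A histogram of this kind would be one simple way to check whether filings cluster during working hours, as assumed above.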

We obtain desirable phone numbers from www.yellowpages.com. This site is a directory of the phone numbers of businesses in the US, sorted by location and category. Because some businesses employ telemarketers, we will exclude any phone number found on Caller Complaints from the set of desirable phone numbers.
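
The exclusion step can be sketched as a set difference over normalized numbers. This is a minimal sketch, not the project's actual parser: the function names, the 0/1 label encoding, and the fictional 555 sample numbers are all assumptions introduced for illustration.

```python
import re


def normalize(number):
    """Reduce a US phone number string to its 10 digits, or None if malformed."""
    digits = re.sub(r"\D", "", number)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the leading country code
    return digits if len(digits) == 10 else None


def build_labeled_set(complaint_numbers, directory_numbers):
    """Label complained-about numbers 1 and the remaining directory numbers 0."""
    undesirable = {n for n in map(normalize, complaint_numbers) if n}
    # Any directory number that also appears in the complaints is excluded here.
    desirable = {n for n in map(normalize, directory_numbers) if n} - undesirable
    labeled = {n: 1 for n in undesirable}
    labeled.update({n: 0 for n in desirable})
    return labeled


# Fictional numbers for illustration only.
labeled = build_labeled_set(
    ["(603) 555-0100", "1-603-555-0101"],
    ["603-555-0101", "603.555.0102"],
)
```

In this sample, the number that appears in both sources keeps only its undesirable label.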

Methods

1. For the first goal, our candidate algorithms are boosting, neural networks, and support vector machines (SVMs). All three are suitable for this classification problem; if time allows, we will implement all three and train them on the data set, and otherwise we will implement and train only one. Of the approximately 20,000 undesirable phone numbers, we will train on roughly 16,000 (80%), together with an equal or larger number of desirable phone numbers. We will evaluate the learned models on a test set consisting of the remaining 20% of the undesirable numbers plus several thousand desirable numbers, and we will select the model that classifies this test set with the highest accuracy.

2. For the second goal, we will use either a hidden Markov model or a linear dynamical system to describe the data. Because both models require states and state transitions, we define the state as the phone number of the call that one specific person receives at time t. After some amount of time n has passed, the state transitions to a new phone number at time t+n, representing the same person receiving a call from the new number. Because we cannot observe whether two complaints were filed by the same person, the knowledge of whether multiple undesirable calls reached one person is hidden. This motivates our use of a hidden Markov model rather than an ordinary (fully observed) Markov model.
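
The 80/20 evaluation protocol described in item 1 could be sketched roughly as follows. The helper names, the placeholder phone numbers, and the two stand-in "models" are hypothetical and exist only to make the sketch runnable; they are not the boosting, neural network, or SVM classifiers proposed above.

```python
import random


def split_80_20(items, seed=0):
    """Shuffle reproducibly and split into 80% training and 20% test items."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5
    return shuffled[:cut], shuffled[cut:]


def accuracy(predict, examples):
    """Fraction of (number, label) pairs that the predictor labels correctly."""
    correct = sum(1 for number, label in examples if predict(number) == label)
    return correct / len(examples)


def select_best(models, test_examples):
    """Return the name of the model with the highest test-set accuracy."""
    return max(models, key=lambda name: accuracy(models[name], test_examples))


# Placeholder numbers (label 1 = undesirable, 0 = desirable), not real data.
undesirable = [f"603555{i:04d}" for i in range(100)]
desirable = [f"802555{i:04d}" for i in range(100)]
train_u, test_u = split_80_20(undesirable)
train_d, test_d = split_80_20(desirable)
test_examples = [(n, 1) for n in test_u] + [(n, 0) for n in test_d]

# Stand-in predictors used only to exercise the selection step.
models = {
    "always_undesirable": lambda number: 1,
    "by_area_code": lambda number: int(number.startswith("603")),
}
best = select_best(models, test_examples)
```

A fixed seed keeps the split reproducible, so that all candidate models are compared on the same test set.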

Literature Reviewed

Bishop, Christopher M., "Pattern Recognition and Machine Learning", Springer, 2006

Burges, Christopher J.C., A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 2, 121–167 (1998)

Lindsay, D., and Cox, S., Effective Probability Forecasting for Time Series Data Using Standard Machine Learning Techniques, Lecture Notes in Computer Science, Springer Berlin, Volume 3686, 2005

Raivio K., R. Vigário, and L. Koivisto. Biennial report 2006-2007, Chapter 16 Time series prediction, Otaniemi, April 2008

Timeline

April 30: Finish parsing and classifying data for supervised training.
May 7: Construct, train, and evaluate boosting and neural networks on phone number data.
May 12: Construct, train, and evaluate SVM on data. By the milestone, we will have chosen one of the three candidate learning algorithms based on their performance.
May 26: Construct, train, and evaluate a hidden Markov model or linear dynamical system on complaint report data.
June 2: Present project.