Keystroke Dynamics for User Authentication


 COSC-174 Project Milestone Review, Spring 2012

Patrick Sweeney



1. Problem Statement


Authenticating users on computer systems is an oft-studied but still open problem. There is a generally accepted hierarchy of authentication mechanisms, ranging from least secure to most secure [1, 2]:

1) Authenticating via something you know.

2) Authenticating via something you have.

3) Authenticating via something you are.

This project attempts to determine the identity of a user based on how they type a password or passphrase. The characteristics of the user's typing (known as keystroke dynamics or typing dynamics) are extracted and used to train a machine learning algorithm, implementing a form of biometric--something you are--authentication. By using hardware which is already present on computer systems (just the keyboard) and measuring a process users already perform (typing), I hope to produce a solution which is inexpensive, non-intrusive, and effective.

A machine learning algorithm is implemented to classify users based on their typing dynamics, and I compare the effectiveness of the system across passwords which are:

- Complex passwords of shorter length (8-12 characters) including letters, numbers, and special characters

- Complex passwords of moderate length (12-20 characters)

- Longer, text-only passphrases which exhibit more "steady state" or "natural" typing dynamics

- Longer passphrases which also contain some special characters

The progress toward meeting my project objectives is presented in Section 2, with results in Section 3. Section 4 outlines the remaining work I hope to accomplish this term.

2. Progress

I have met my milestone goals. I wrote a data collection program and collected typing dynamics metrics from 9 volunteers. I developed a feature set for differentiating users and parsed the raw data into those features. Finally, I implemented four multiclass machine learning algorithms, all based on support vector machines. Progress is detailed in the following sections.

2.1 Data Collection

I wrote a dialog-based program in C# to collect typing metrics (see Figure 1). This program presents the user with four different passwords which will subsequently be referred to as Groups 1-4*:

- Group 1: Z83u#af3*b

- Group 2: v_r=dEy=ke5e5ag8

- Group 3: This sentence contains no special characters or numbers.

- Group 4: Everyone knows the answer to: 2+2=4

 

*Note: Group 5, referenced later, combines all data collected for Groups 1-4.

The program records the time that each key is pressed and released with millisecond accuracy. The user types in each password/passphrase 20 times, and then exports a data file with the raw keystroke data.

Figure 1: Data collection program, written in C#.

 

A separate program parses the raw data into features. For the midterm review, I settled on a set of 14 features, some of which appear in other research [3, 4] and some of which are new or modified versions:

1) Mean key press duration for keys other than Space and Shift

2) Mean key press duration for Space

3) Mean key press duration for Shift

4) Mean time between consecutive key-down events

5) Mean time between Shift-down and key-down when Shift is used to modify a key

6) Mean time between key-up and Shift-up when Shift is used to modify a key

7) Standard deviation of feature (1)

8) Standard deviation of feature (2)

9) Standard deviation of feature (3)

10) Standard deviation of feature (4)

11) Standard deviation of feature (5)

12) Standard deviation of feature (6)

13) Number of times Backspace is used

14) Total time to type the password/passphrase

Features such as (1) and (4) are fairly standard in the research. I introduced (5) and (6) and, since I was providing the password text, made sure all the passwords exercised these key sequences at least once.
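As a rough illustration, here is a minimal sketch (in Matlab) of how a few of these features could be computed for a single repetition of a password. The variable names keyNames, downTimes, and upTimes are assumptions for this sketch, not identifiers from my actual parsing program; keyNames is a cell array of key labels, and downTimes/upTimes are millisecond timestamps, one entry per keystroke in press order.

    % Sketch: compute a handful of the 14 features for one repetition.
    isSpace = strcmp(keyNames, 'Space');
    isShift = strcmp(keyNames, 'Shift');
    isOther = ~isSpace & ~isShift;

    pressDur = upTimes - downTimes;            % key press durations (ms)
    latency  = diff(downTimes);                % time between consecutive key-downs

    f1  = mean(pressDur(isOther));             % (1) mean duration, non-Space/Shift keys
    f4  = mean(latency);                       % (4) mean down-to-down latency
    f7  = std(pressDur(isOther));              % (7) std dev of feature (1)
    f10 = std(latency);                        % (10) std dev of feature (4)
    f13 = sum(strcmp(keyNames, 'Backspace'));  % (13) number of Backspace presses
    f14 = upTimes(end) - downTimes(1);         % (14) total time to type the phrase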

2.2 Machine Learning Algorithm

I implemented four multiclass algorithms to compare performance, all based on support vector machines (SVMs). I used Matlab's built-in SVM implementation for the underlying 1 vs. 1 comparisons and extended it to perform the multiclass classification that I needed. Two of the algorithms were presented in class, the third comes from a separate research paper, and the fourth is my modification of the third.

1) One vs. Rest: In this implementation, one SVM is trained for each class in turn. All training examples are labeled as either the class of interest or not the class of interest, and the example to classify is run through each of the trained SVMs. Ideally only one class is identified as correct. However, it's possible that two or more similar classes are both identified as correct. It's also possible that NO class is identified as correct.

Performance: scales linearly with N (where N is the number of classes); however, each SVM is trained on the full training set, which increases the cost of every comparison.
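For reference, a minimal sketch of the One vs. Rest scheme using the svmtrain/svmclassify functions from Matlab's toolbox. The variable names (X, y, x) are illustrative only: X is an M-by-D matrix of feature vectors, y an M-by-1 vector of class labels 1..N, and x a single 1-by-D example.

    % One vs. Rest sketch: train one binary SVM per class (class k vs. everything
    % else), then see which classes claim the example x.
    N = numel(unique(y));
    svms = cell(N, 1);
    for k = 1:N
        svms{k} = svmtrain(X, double(y == k));   % label 1 = class k, 0 = the rest
    end

    predicted = [];
    for k = 1:N
        if svmclassify(svms{k}, x) == 1
            predicted(end+1) = k;                %#ok<AGROW>
        end
    end
    % numel(predicted): 1 = clean prediction, 0 = "No match", >1 = "Multi-error"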

2) Max Wins: This algorithm pits each class against every other class in a 1 vs. 1 SVM. The "winner" of each comparison receives a vote, and the class with the most votes after all pairings have been tested is declared the winner [5]. Although this algorithm guarantees that at least one class will be selected, ties can still produce multiple "winners."

Performance: scales on the order of N² as more classes are added, though the individual training sets are smaller than in One vs. Rest.
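A sketch of the Max Wins voting loop, assuming pairSVM{i,j} (for i < j) already holds a 1 vs. 1 SVM trained on the examples of classes i and j, and that svmclassify returns the winning class label (i or j); pairSVM and x are illustrative names, not identifiers from my code.

    % Max Wins sketch: every pair of classes is compared, the winner of each
    % comparison gets a vote, and the class (or classes, in a tie) with the
    % most votes is returned.
    votes = zeros(N, 1);
    for i = 1:N-1
        for j = i+1:N
            winner = svmclassify(pairSVM{i,j}, x);
            votes(winner) = votes(winner) + 1;
        end
    end
    best = find(votes == max(votes));   % can contain more than one class (tie)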

3) Directed Acyclic Graph (DAG): This approach was presented by Platt et al. in "Large Margin DAGs for Multiclass Classification" [6]. DAG performs multiclass classification via an efficient series of 1 vs. 1 comparisons organized as a directed acyclic graph--a graph with directional edges and no cycles [6]. Each node represents a binary decision between the first and last classes in the current list of candidate classes; the losing class is eliminated, and the model proceeds to the next pairing.

Performance: scales linearly with N; for N possible classes, only N-1 of the 1 vs. 1 classifications are performed, and the smaller 1 vs. 1 training sets are retained.

 

 

Figure 2: A DAG showing the process to find the best of four different classes [6].
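Under the same pairSVM assumption as above, the DAG evaluation can be sketched as follows; the candidate list shrinks by one class per comparison, so only N-1 of the 1 vs. 1 SVMs are consulted for any one example.

    % DAG sketch: repeatedly compare the first and last classes in the candidate
    % list and drop the loser, until a single class remains.
    candidates = 1:N;
    while numel(candidates) > 1
        i = candidates(1);
        j = candidates(end);
        winner = svmclassify(pairSVM{min(i,j), max(i,j)}, x);
        if winner == i
            candidates(end) = [];   % class j loses and is eliminated
        else
            candidates(1) = [];     % class i loses and is eliminated
        end
    end
    prediction = candidates;        % exactly one class remains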

 

4) Voted Directed Acyclic Graph (VDAG): One shortcoming of the DAG approach is that the outcome can depend on the order of comparisons. To address this issue, I modified the DAG algorithm to perform a 3-fold vote. Three repetitions of DAG are executed with a randomly permuted class list, and the class with the most votes wins.

Performance: scales linearly with N, as in the original DAG algorithm, but with a 3x penalty for the repetitions.
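The VDAG modification is just a small voting wrapper around the DAG pass above. In this sketch, dagClassify is a hypothetical helper assumed to run the DAG loop shown earlier over an arbitrary class ordering; it is not a function from my actual code.

    % VDAG sketch: run the DAG three times over randomly permuted class lists
    % and take the class with the most votes.
    reps = 3;
    votes = zeros(N, 1);
    for r = 1:reps
        order = randperm(N);
        winner = dagClassify(pairSVM, order, x);
        votes(winner) = votes(winner) + 1;
    end
    [~, prediction] = max(votes);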

2.3 Test Procedure

For each Group, the 20 examples collected from each user are divided into a training set and a test set. Examples 1-5 are discarded because the user is still getting familiar with the phrase and probably hasn't built up a rhythm yet. Examples 6-10 and 16-20 make up the training set, and examples 11-15 are used as the test set.

The training and test examples from every user are concatenated into one large training set and one large test set for each Group. In addition, a fifth group (Group 5) combines all the test and training data from Groups 1-4. A classification is performed by passing a set of trained SVMs to an algorithm along with an example x to classify.
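For concreteness, the split for a single user's 20 repetitions of one password looks like the following sketch, where examples is assumed to be a 20-by-14 matrix holding one feature vector per repetition (an illustrative name, not the actual variable in my code).

    % Rows 1-5 are discarded (warm-up), 6-10 and 16-20 become training data,
    % and 11-15 become test data.
    trainRows = [6:10, 16:20];
    testRows  = 11:15;
    Xtrain = examples(trainRows, :);
    Xtest  = examples(testRows, :);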

As each example is classified, the result can be one of the following:

1) Correct: the prediction given by the algorithm is the correct class for the example

2) Wrong: the wrong class was predicted, or, if multiple possible classes were returned, the correct class was not among them

3) Multi-error: multiple possible classes were returned, and the correct class is among them (possible with One vs. Rest and Max Wins)*

4) No match: an empty set was returned--no class was predicted (only possible with One vs. Rest)*

* Note: There is a technique to find the most likely class in the One vs. Rest model, as discussed in lecture; however, my implementation using the Matlab built-in SVM does not accomplish this step. I chose not to spend time on that aspect since my focus is on the DAG and VDAG algorithms.


Along with the results listed above, the time required to evaluate the test set and the training set with each algorithm is recorded.

2.4 Test System

All tests are run on a Windows 7 desktop PC with AMD Phenom II X3 3.0GHz Triple-Core Processor and 4GB of RAM.

3. Results

The One vs. Rest model was quite susceptible to Multi-errors against both the test set and the training set. The Group 1 password caused the most trouble, with almost 58% of the examples resulting in Multi-errors. This is most likely due to the short length and the odd cadence of complex alphanumeric and symbolic passwords. The model fared even worse in Group 5, which combines all the test and training data from Groups 1-4, indicating that the model doesn't generalize well across different typing situations. In the Multi-error cases, the model was able to narrow down the set of possible classes (usually to 2-3) but could not distinguish one specific class as more probable than the others (see the note in Section 2.3). On the plus side, this model made very few Wrong predictions. Figure 3 shows the classification results for all 5 Groups against the training examples, and Figure 4 shows the same against the test examples. The One vs. Rest model also had the second shortest classification times (see Figure 5), although its training time, shown in Figure 6, was much worse than the other models' when the training/test datasets grew larger, as in Group 5.

The Max Wins model provided a much better Correct prediction rate, at the expense of a slightly higher Wrong prediction rate. Against the training set, the Max Wins model predicted the correct class with 100% accuracy in Groups 2-4 and ~90% accuracy in Groups 1 and 5. However, the Max Wins model had the slowest classification times in every Group--significantly slower than the One vs. Rest and DAG models.

The DAG model provided Correct prediction rates comparable to the Max Wins model, but at a greatly reduced running time. As the number of classes (N) increases, the margin in performance will only get larger, making the DAG a very attractive option.

The VDAG model, while providing no improvement on the training set, improved the Correct prediction rate on the test set by 10-26%. In fact, it correctly classified 100% of the test examples in Groups 2-4, and 98% and 91% in Groups 1 and 5, respectively. As expected, its runtime exceeds that of a single DAG pass by ~3.5x, but it is still faster than the Max Wins model while delivering greatly improved classification results. As the number of classes increases, this margin will likewise increase.

Overall, it appears the VDAG model is an excellent choice for this application, even with the runtime penalty.

Figure 3: Classification results for the One vs. Rest, Max Wins, DAG, and VDAG against the examples in the Group 1-5 training sets.

 

Figure 4: Classification results for the One vs. Rest, Max Wins, DAG, and VDAG against the examples in the Group 1-5 test sets.

 

 

Figure 5: Runtimes represent the time for a Group's entire test set to be classified by each model (not including training time). There are 45 test examples in Groups 1-4 and 180 in Group 5. Note that the training set has twice as many examples as the test set, so the runtimes to classify the training examples are approximately double what is shown here.

 

 

Figure 6: Model training times for the One vs. Rest model and the One vs. One based models (Max Wins, DAG, and VDAG). In the One vs. Rest model, N = 9 SVMs are trained, while in the One vs. One based models, N(N-1)/2 = 36 smaller SVMs are trained. 90 examples are used to train Groups 1-4, while 360 examples are used to train Group 5.

4. Issues, Future Goals & Timeline

I'm fairly satisfied with the VDAG algorithm, so I hope to make only minor changes and spend the remainder of the term enhancing my overall typing dynamics system.

I ran into one issue as a result of my decision to use Matlab's built-in SVM for 1 vs. 1 comparisons. The SVM-related functionality is part of a toolbox for which Dartmouth owns a limited number of licenses. Several times while working on my code, I was unable to use the toolbox because all licenses were in use. This is good motivation not to leave things to the last minute.

By the end of the term, I plan to:

1) Collect more data sets. At this point, I've gathered 9 datasets. I would like to get closer to 20 and see how the larger pool affects the models' accuracy. I will make the data collection program available to the class, and hopefully some classmates will take the ~10 minutes required to complete it.

2) Provide a "live" classification. Create an interface which allows a known user to type a sentence and have the model attempt to identify the user immediately. I will start with a sentence the model was trained on, but I would also like to allow an arbitrary sentence. Based on the Group 5 performance of the DAG and VDAG models, I believe classification would still be good in this situation. The interface will be written in C# and will call a back-end Matlab process to do the classification.

3) Experiment with different features as required. If the larger dataset causes problems for my classification algorithms, I may be able to derive features which provide better separation between classes (users). I may also be able to achieve improved classification with DAG alone, in which case I could abandon VDAG and recover the runtime penalty it imposes.

Upcoming Milestones:

18 May: Up to 20 data sets collected

25 May: Live classification and final algorithm tweaks complete

29 May: Project Poster Session

30 May: Project Write-up Due

References

[1] Shepherd, S.J., "Continuous authentication by analysis of keyboard typing characteristics," European Convention on Security and Detection, pp. 111-114, 16-18 May 1995.

[2] Crawford, H., "Keystroke dynamics: Characteristics and opportunities," Eighth Annual International Conference on Privacy, Security and Trust (PST), pp. 205-212, 17-19 Aug. 2010.

[3] Ilonen, J., "Keystroke dynamics," Advanced Topics in Information Processing - Lecture, 2003.

[4] Bergadano, F., Gunetti, D., and Picardi, C., "User authentication through keystroke dynamics," ACM Trans. Inf. Syst. Secur., vol. 5, no. 4, Nov. 2002.

[5] Friedman, J.H., "Another approach to polychotomous classification," Technical report, Stanford University Department of Statistics, 1996.

[6] Platt, J., Cristianini, N., Shawe-Taylor, J., "Large margin DAGs for multiclass classification," Advances in Neural Information Processing Systems 12, MIT Press, pp. 543-557, 2000.

Appendix

Training Set Classification Results

Group 1
              One vs. Rest   Max Wins   DAG       VDAG
Correct       41.11%         92.22%     92.22%    92.22%
Multi         57.78%          0.00%      0.00%     0.00%
Wrong          1.11%          7.78%      7.78%     7.78%
No Match       0.00%          0.00%      0.00%     0.00%

Group 2
              One vs. Rest   Max Wins   DAG       VDAG
Correct       54.44%        100.00%    100.00%   100.00%
Multi         45.56%          0.00%      0.00%     0.00%
Wrong          0.00%          0.00%      0.00%     0.00%
No Match       0.00%          0.00%      0.00%     0.00%

Group 3
              One vs. Rest   Max Wins   DAG       VDAG
Correct       80.00%        100.00%    100.00%   100.00%
Multi         20.00%          0.00%      0.00%     0.00%
Wrong          0.00%          0.00%      0.00%     0.00%
No Match       0.00%          0.00%      0.00%     0.00%

Group 4
              One vs. Rest   Max Wins   DAG       VDAG
Correct       81.11%        100.00%    100.00%   100.00%
Multi         18.89%          0.00%      0.00%     0.00%
Wrong          0.00%          0.00%      0.00%     0.00%
No Match       0.00%          0.00%      0.00%     0.00%

Group 5
              One vs. Rest   Max Wins   DAG       VDAG
Correct       28.61%         88.61%     89.17%    89.72%
Multi         68.33%          1.94%      0.00%     0.00%
Wrong          2.22%          9.44%     10.83%    10.28%
No Match       0.83%          0.00%      0.00%     0.00%

 

Test Set Classification Results

Group 1
              One vs. Rest   Max Wins   DAG       VDAG
Correct       33.33%         73.33%     71.11%    97.78%
Multi         48.89%          4.44%      0.00%     0.00%
Wrong          8.89%         22.22%     28.89%     2.22%
No Match       8.89%          0.00%      0.00%     0.00%

Group 2
              One vs. Rest   Max Wins   DAG       VDAG
Correct       40.00%         80.00%     82.22%   100.00%
Multi         44.44%          2.22%      0.00%     0.00%
Wrong         11.11%         17.78%     17.78%     0.00%
No Match       4.44%          0.00%      0.00%     0.00%

Group 3
              One vs. Rest   Max Wins   DAG       VDAG
Correct       62.22%         86.67%     86.67%   100.00%
Multi         24.44%          0.00%      0.00%     0.00%
Wrong          4.44%         13.33%     13.33%     0.00%
No Match       8.89%          0.00%      0.00%     0.00%

Group 4
              One vs. Rest   Max Wins   DAG       VDAG
Correct       53.33%         84.44%     84.44%   100.00%
Multi         28.89%          6.67%      0.00%     0.00%
Wrong          8.89%          8.89%     15.56%     0.00%
No Match       8.89%          0.00%      0.00%     0.00%

Group 5
              One vs. Rest   Max Wins   DAG       VDAG
Correct       23.33%         79.44%     81.11%    91.11%
Multi         65.56%          3.33%      0.00%     0.00%
Wrong          7.22%         17.22%     18.89%     8.89%
No Match       3.89%          0.00%      0.00%     0.00%

 

Model Training Times

 

              Group 1   Group 2   Group 3   Group 4   Group 5
One vs. One   0.2187    0.1976    0.1895    0.1947    0.7923
One vs. Rest  0.2489    0.3046    0.1272    0.2048    1.831

 

Training Set Classification Runtimes

Runtimes      One vs. Rest   Max Wins   DAG    VDAG
Group 1       0.18           0.73       0.17   0.56
Group 2       0.18           0.74       0.16   0.56
Group 3       0.18           0.73       0.16   0.56
Group 4       0.18           0.72       0.16   0.56
Group 5       0.81           3.04       0.67   2.29

 

Test Set Classification Runtimes

Runtimes      One vs. Rest   Max Wins   DAG    VDAG
Group 1       0.09           0.36       0.08   0.27
Group 2       0.09           0.36       0.08   0.28
Group 3       0.09           0.36       0.08   0.28
Group 4       0.09           0.36       0.08   0.27
Group 5       0.40           1.50       0.33   1.12