Problem Statement

Authenticating users on computer systems is an oft-studied but still open problem. There is a generally accepted hierarchy of authentication mechanisms, ranging from least secure to most secure [1, 2]:

1) Authenticating via something you know.

2) Authenticating via something you have.

3) Authenticating via something you are.

The something you know is often simply a password, though on systems where security matters more there may also be security questions, a PIN, a secret image, or some other means of confirming your identity. This technique is inexpensive and has no adoption problems, because people have been conditioned to expect that most systems holding anything even slightly personal will require a password.

Adding something you have raises the bar somewhat, because in order to access a system with this type of restriction you must possess a physical token, and that token presumably would not be passed along to a non-legitimate user. Smart cards and SecurID cards are two common implementations of the something you have authentication technique. It should also be noted that this is typically used in conjunction with something you know.

The obvious problem with the first two techniques is that they depend on the integrity of the users. They trust that legitimate users won't accidentally or purposefully provide their credentials to a non-legitimate person. In other words, they provide a means to authenticate legitimate users, but can't be guaranteed to deny access to non-legitimate users.

To cover this hole and provide superior access control in critical situations, some systems employ authentication via something you are. This is more commonly known as biometrics: a measurement of biological characteristics that uniquely identify you.

Unfortunately, biometric authentication systems typically incur both monetary and intangible costs. Retinal scanners, for example, are quite expensive and difficult to set up; they are far from mainstream and hard to locate. As of this writing, there is one listed on eBay for $650. Fingerprint scanners are more reasonable; many USB-connected models can be found on Amazon.com for less than $100. Fingerprint readers currently shipping in IBM/Lenovo laptops claim a False Reject Rate (denying access to a legitimate user) of less than 0.5% (in 3 attempts), which is not much different from how often I accidentally lock myself out of an account by entering my password incorrectly. [3]

All of these systems require some additional hardware and require the user to learn an additional process, however non-invasive it may be. Ideally, we would like to add biometric authentication capability to existing computers with no additional hardware, while requiring the user to do only things they already do. This is where keystroke dynamics comes in.

Keystroke dynamics is a technique of identifying users by the way they type on a keyboard. It has the potential to provide strong biometric authentication--something you are--for very low cost.

Methods

Keystroke dynamics is not a new area of research, having first been presented in a 1975 IBM technical bulletin entitled "Keyboard Apparatus for Personal Identification." In fact, its roots can be traced back further, to identifying telegraph operators by their cadence. In the IBM study, a keyboard was created with the capability to measure "time pattern and key pressure" for a user on the system [4]. After a training period, the measurements were used to authenticate a user's identity.

Research efforts have since been classified into either static authentication or dynamic authentication [2]. In static authentication, the user authenticates one time, such as when first logging into a system. In dynamic authentication, a background process continually monitors the user's typing to detect a change in users. Dynamic systems are preferable because they prevent a legitimate user from logging on and then allowing a non-legitimate user to "take over." In this research, I will focus on static techniques to reduce complexity. I believe this is a justified concession; it simply places some onus on the legitimate user to re-secure a system when they are done using it.

In order to differentiate users, we first need to look at what data is available to measure. Essentially, the only things we can measure with a standard keyboard are when a specific key is pressed and when it is released. Researchers have combined these timestamps in many ways to form features for different classifiers [5, 6, 7]:

- Down-Up time: the time between pressing a key and releasing it

- Up-Down time: the time between releasing one key and pressing the next

- Down-Down time: the time between subsequent key presses

- Total typing speed

- Frequency of errors

- Overlapping keys

- Key press/release order for special characters and capital letters with shift key

- Digraphs and trigraphs (strings) of keystrokes as groups
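As a rough illustration of how the first three timing features in this list could be derived from raw press and release timestamps, here is a minimal Python sketch. The `KeyEvent` record, its field names, and the function names are hypothetical choices for illustration; they are not part of any existing tool, and the final data collection program may record events differently.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class KeyEvent:
    """One keystroke (hypothetical record layout)."""
    key: str      # key label, e.g. "a" or "Shift"
    down: float   # press timestamp, in seconds
    up: float     # release timestamp, in seconds


def timing_features(events: List[KeyEvent]) -> Dict[str, List[float]]:
    """Compute timing features from events ordered by press time."""
    pairs = list(zip(events, events[1:]))  # consecutive keystrokes
    return {
        # Down-Up: how long each key was held
        "down_up": [e.up - e.down for e in events],
        # Up-Down: release of one key to press of the next
        # (negative when the two keys overlap)
        "up_down": [b.down - a.up for a, b in pairs],
        # Down-Down: press of one key to press of the next
        "down_down": [b.down - a.down for a, b in pairs],
    }


def overlap_count(events: List[KeyEvent]) -> int:
    """Count consecutive pairs where the next key is pressed before the previous is released."""
    return sum(1 for a, b in zip(events, events[1:]) if b.down < a.up)
```

Note that a negative Up-Down time and an overlapping key pair describe the same event, so a classifier could use either representation.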

Various studies have used statistical, support vector machine, and neural network techniques to process the data and determine the legitimacy of a user attempting to access a system. In comparing these schemes, two metrics are frequently used: False Reject Rate and False Accept Rate [2].

False Reject Rate (FRR): denying access to a legitimate user

False Accept Rate (FAR): allowing access to a non-legitimate user
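To make the two definitions concrete, the following Python sketch computes both rates from a set of scored login attempts. It assumes, purely for illustration, that the classifier emits an anomaly score where higher means "less like the enrolled user" and that an attempt is accepted when its score is at or below a threshold; the function name and score convention are my own assumptions.

```python
from typing import Sequence, Tuple


def far_frr(genuine_scores: Sequence[float],
            impostor_scores: Sequence[float],
            threshold: float) -> Tuple[float, float]:
    """Return (FAR, FRR) for a given acceptance threshold.

    Assumed convention: an attempt is accepted when score <= threshold.
    """
    # Legitimate attempts that were wrongly denied
    false_rejects = sum(1 for s in genuine_scores if s > threshold)
    # Non-legitimate attempts that were wrongly allowed
    false_accepts = sum(1 for s in impostor_scores if s <= threshold)
    frr = false_rejects / len(genuine_scores)
    far = false_accepts / len(impostor_scores)
    return far, frr
```

Raising the threshold accepts more attempts, which lowers the FRR but raises the FAR; this trade-off is why the two rates are normally reported together.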

Crawford provides a fairly comprehensive summary of the FAR and FRR results reported in keystroke dynamics research [2]. It appears that statistical methods provide the lowest FAR, while neural-network-based approaches provide the lowest FRR (although Crawford questions the validity of the best result, 0% FRR and 0% FAR, presented by Obaidat & Sadoun in [8]).
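For a sense of what a simple statistical method can look like, below is a minimal Python sketch of a distance-based verifier: it averages a user's training samples into a template and scores a new attempt by its scaled Manhattan distance from that template. This is only one illustrative statistical technique, not the support vector machine or clustering approach I intend to implement, and the class name, method names, and threshold value are assumptions made for the example.

```python
from statistics import mean
from typing import List, Sequence


class ManhattanVerifier:
    """Scaled Manhattan distance to a per-user timing template (illustrative sketch)."""

    def fit(self, samples: Sequence[Sequence[float]]) -> "ManhattanVerifier":
        # Each sample is one typing of the password as a fixed-length timing vector.
        columns = list(zip(*samples))
        self.means: List[float] = [mean(c) for c in columns]
        # Mean absolute deviation per feature; the floor avoids division by zero.
        self.scales: List[float] = [
            max(mean(abs(x - m) for x in c), 1e-6)
            for c, m in zip(columns, self.means)
        ]
        return self

    def score(self, sample: Sequence[float]) -> float:
        # Higher score = less like the enrolled user.
        return sum(abs(x - m) / s
                   for x, m, s in zip(sample, self.means, self.scales))

    def accept(self, sample: Sequence[float], threshold: float) -> bool:
        return self.score(sample) <= threshold


# Example with made-up timing vectors (seconds); the threshold 5.0 is arbitrary.
training = [[0.10, 0.20], [0.12, 0.18], [0.08, 0.22]]
verifier = ManhattanVerifier().fit(training)
print(verifier.accept([0.11, 0.19], threshold=5.0))
print(verifier.accept([0.30, 0.05], threshold=5.0))
```

Scaling each feature by its typical deviation keeps a naturally variable timing (for example, a long pause before a special character) from dominating the score.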

I intend to use support vector machines and/or cluster analysis in my implementation. I will implement a machine learning algorithm and experiment with different types of classifiers. Once I have a working algorithm, I will compare how password types affect the FAR and FRR results. I will try the following password types:

- Complex passwords of shorter length (8-12 characters) including letters, numbers, and special characters

- Complex passwords of moderate length (12-20 characters)

- Longer, text only, passphrases which should exhibit more "steady state" or "natural" typing dynamics

- Longer passphrases which also contain some special characters.

I would like to see how the password type affects the ability of the machine learning algorithm to distinguish between users.

Data

I plan to collect my own data set from volunteers. This carries some risk, as I may not be able to muster a large enough sample set. I believe I should be able to get 10-20 users to participate.

My objectives hinge on being able to collect my own data and I'm confident I will be able to do that. However, there are some alternate data sources available if necessary.

1) Keystroke Dynamics - Benchmark Data Set: this data set, collected for 51 typists, was used by researchers Killourhy & Maxion in [9] and is freely available via their website.

2) BeiHang Keystroke Database: Described by Li et al. in [10], this database is freely available to researchers. I have not requested a copy of the database yet but suspect it may be in Chinese, and this would make it difficult (though not impossible) to use.

Goals & Timeline

Goals for First Milestone (8 May)

1) Implement data collection program and collect training and test data from a subset of users

2) Complete an initial implementation of the machine learning algorithm, though it may not be optimized at this time

After the milestone, I will complete any remaining data collection and focus on reducing the FAR and FRR by experimenting with different classifiers.

References

[1]	Shepherd, S.J., "Continuous authentication by analysis of keyboard typing characteristics," European Convention on Security and Detection, 1995, pp. 111-114, 16-18 May 1995
[2]	Crawford, H., "Keystroke dynamics: Characteristics and opportunities," 2010 Eighth Annual International Conference on Privacy Security and Trust (PST), pp. 205-212, 17-19 Aug. 2010
[3]	Fingerprint Reader. IBM/Lenovo, 6 Apr. 2012, <http://www.pc.ibm.com/us/security/fpr_faq.html>
[4]	Spillane, R., "Keyboard Apparatus for Personal Identification," IBM Technical Disclosure Bulletin, Tech. Rep. 17, 1975
[5]	Ilonen, J., "Keystroke dynamics," Advanced Topics in Information Processing-Lecture, 2003
[6]	Bergadano, F., Gunetti, D., and Picardi, C., "User authentication through keystroke dynamics," ACM Trans. Inf. Syst. Secur., vol. 5, no. 4, Nov. 2002
[7]	Rybnik, M., Tabedzki, M., and Saeed, K., "A Keystroke Dynamics Based System for User Identification," 7th Computer Information Systems and Industrial Management Applications (CISIM '08), pp. 225-230, 26-28 June 2008
[8]	Obaidat, M., and Sadoun, B., "Verification of Computer Users Using Keystroke Dynamics," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 27, no. 2, pp. 261-269, 1997
[9]	Killourhy, K.S., and Maxion, R.A., "Comparing anomaly-detection algorithms for keystroke dynamics," IEEE/IFIP International Conference on Dependable Systems & Networks (DSN '09), pp. 125-134, 29 June-2 July 2009
[10]	Li, Y., Zhang, B., Cao, Y., Zhao, S., Gao, Y., and Liu, J., "Study on the BeiHang Keystroke Dynamics Database," 2011 International Joint Conference on Biometrics (IJCB), pp. 1-5, 11-13 Oct. 2011