BIB-VERSION:: CS-TR-v2.0
ID:: ncstrl.dartmouthcs//TR99-357
ENTRY:: June 10, 1999
ORGANIZATION:: Dartmouth College, Computer Science
TITLE:: Investigating Measures for Pairwise Document Similarity
TYPE:: Technical Report (paper)
REVISION:: 1
AUTHOR:: Isaacs, Jeffrey D.
AUTHOR:: Aslam, Javed A.
DATE:: June 1999
RETRIEVAL:: For a paper copy, email
RETRIEVAL:: For a paper copy, write to
    Technical Report Librarian
    Department of Computer Science
    Dartmouth College
    6211 Sudikoff Laboratory
    Hanover, NH 03755-3510 USA
RETRIEVAL:: Compressed Postscript at http://www.cs.dartmouth.edu/reports/TR99-357.ps.Z
RETRIEVAL:: PDF at http://www.cs.dartmouth.edu/reports/TR99-357.pdf
ABSTRACT:: The need for a more effective similarity measure is growing as a result of the astonishing amount of information being placed online. Most existing similarity measures are defined by empirically derived formulas and cannot easily be extended to new applications. We present a pairwise document similarity measure based on information theory, and present corpus-dependent and corpus-independent applications of this measure. When compared against existing similarity measures on TREC FBIS data, our corpus-dependent information-theoretic similarity measure ranked first.
NOTE:: Undergraduate Honors Thesis. Advisor: Jay Aslam.
END:: ncstrl.dartmouthcs//TR99-357