Mike D'Andrea - CS 34 - 11S - Dartmouth College

Enabling the Search for Exoplanets

Motivation

The discovery of exoplanets, planets orbiting stars outside our own solar system, is a fairly new enterprise in astronomy: the first confirmed exoplanet was found in 1992, and to date only 539 have been documented (check for updates: http://exoplanet.eu/). One reason for this small number is that the search has not been as empirically guided as it could be. Most astronomers have simply tested newly developed detection methods on familiar star clusters rather than exploring novel areas of the sky, because the theory of planet formation is not yet developed enough to tell astronomers “where to look” for exoplanets, that is, which stars are likely to host planets. An example of a well-established guiding parameter is stellar size (mass and radius): only stars within a certain range are expected to host planets. However, many other stellar properties, such as metallicity, luminosity, galactic location, and age, are not yet established as guiding parameters. Thus I would like to program a learning algorithm to distinguish between established host stars and non-host stars based on these and similar properties.

Methods

In particular, I would like to design a learning algorithm to conduct a multi-dimensional analysis of stellar properties. Ultimately I would like the program to discriminate between stars that do and do not host exoplanets. Because every star in the data carries a known label (host or non-host), a supervised learning algorithm is the most appropriate choice.
As astronomical data sets tend to be too large to sort through manually, data mining has been used extensively in the field to address similar classification problems. A majority of such projects have used artificial neural networks (ANN) for this purpose, so I am strongly considering following this precedent. However, decision trees and k-nearest neighbor algorithms (kNN) have also been used frequently, and may be more appropriate for my project given its scope and timeline. Preliminarily, I plan to use data from roughly 300 of the recorded stars with exoplanets and considerably more stars without exoplanets (on the order of 1000) as my training sample, and the remaining stars with exoplanets, together with other stars without exoplanets, as the test sample.
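A minimal sketch of the kNN option with the train/test split described above, using only the Python standard library. The two features (metallicity and mass), the Gaussian parameters, and all function names are hypothetical stand-ins chosen for illustration; the stars are synthetic, not catalog data, and nothing here is a result.

```python
import math
import random


def standardize(train_x, test_x):
    """Scale each feature to zero mean and unit variance using training statistics only."""
    n_features = len(train_x[0])
    means = [sum(row[j] for row in train_x) / len(train_x) for j in range(n_features)]
    stds = []
    for j in range(n_features):
        var = sum((row[j] - means[j]) ** 2 for row in train_x) / len(train_x)
        stds.append(math.sqrt(var) or 1.0)  # avoid dividing by zero for constant features

    def scale(rows):
        return [[(row[j] - means[j]) / stds[j] for j in range(n_features)] for row in rows]

    return scale(train_x), scale(test_x)


def knn_predict(train_x, train_y, query, k=5):
    """Majority vote (1 = host, 0 = non-host) among the k nearest training stars."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0  # use an odd k so a binary vote cannot tie


def accuracy(train, test, k=5):
    """Fraction of test stars whose predicted label matches the recorded label."""
    train_x, test_x = standardize([x for x, _ in train], [x for x, _ in test])
    train_y = [y for _, y in train]
    hits = sum(knn_predict(train_x, train_y, q, k) == y for q, (_, y) in zip(test_x, test))
    return hits / len(test)


def make_synthetic_stars(n, metallicity_mean, label, rng):
    """Hypothetical stand-in for catalog rows: ([metallicity, mass], label)."""
    return [([rng.gauss(metallicity_mean, 0.2), rng.gauss(1.0, 0.2)], label) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    hosts = make_synthetic_stars(400, 0.15, 1, rng)        # assumed numbers, not real data
    non_hosts = make_synthetic_stars(1300, -0.10, 0, rng)
    train = hosts[:300] + non_hosts[:1000]                 # roughly 300 hosts, 1000 non-hosts
    test = hosts[300:] + non_hosts[1000:]                  # the remaining stars
    print(f"test accuracy: {accuracy(train, test):.2f}")
```

Because non-host stars outnumber host stars by about three to one in this split, a plain majority vote will tend to favor the non-host label, so accuracy alone may overstate how well hosts are recovered.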
I foresee a few potential issues in the methodology. For one, there may not yet be enough data to accurately represent the underlying astrophysical phenomena at play, specifically how certain stellar properties correlate with planet-hosting. A related non-trivial task is to determine which stellar properties the program should read. I must also consider that astronomical data is not as empirically sound as laboratory data: while the measurements of stellar properties are likely reliable, the classification of a star as hosting planets may be open to scrutiny. For now I will be content to follow NASA’s guidelines in identifying exoplanets, though as a result the final product of my project may not reflect the actual astrophysics to a high degree of accuracy.
Finally, a secondary project goal I’ve considered is using a classification other than host star vs. non-host star. Two likely examples are distinguishing stars by the size of their exoplanets and by the number of their exoplanets, both important questions with regard to the search for stellar systems and planets similar to ours.
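One simple way to approach the choice of stellar properties is to rank each candidate by how far apart the host and non-host groups sit on it. The sketch below uses a standardized mean difference (the difference of class means divided by the pooled standard deviation). This is an illustrative technique I am assuming here, not a method taken from the references, and the function names and the dictionary layout are hypothetical.

```python
import math
import statistics


def effect_size(host_values, non_host_values):
    """Absolute difference of the class means divided by the pooled standard deviation."""
    n1, n2 = len(host_values), len(non_host_values)
    v1 = statistics.variance(host_values)
    v2 = statistics.variance(non_host_values)
    pooled = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    if pooled == 0:
        return 0.0  # no spread within either class, so the score is undefined; report 0
    return abs(statistics.mean(host_values) - statistics.mean(non_host_values)) / pooled


def rank_properties(hosts, non_hosts):
    """Rank properties by effect size, largest first.

    hosts and non_hosts map a property name to that group's list of measured values.
    """
    scores = {name: effect_size(hosts[name], non_hosts[name]) for name in hosts}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A ranking like this looks at one property at a time, so it would miss properties that only matter in combination; it is a first screen rather than a substitute for evaluating the classifier on different property subsets.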

Data

Fortunately, NASA and other astronomical organizations make the information relevant to this project readily available. In particular, I will want to use specific stellar properties of stars which have explicitly been reported to either host exoplanets or not.

Data Sources:
http://en.wikipedia.org/wiki/List_of_exoplanetary_host_stars
http://exoplanet.eu/catalog-RV.php
http://exoplanets.org/table/
http://nsted.ipac.caltech.edu/

Timeline

By the milestone deadline, I will have decided which algorithm to use and written the necessary code. I will have gathered all the necessary data, both training and testing, and made it readily available to my program. I will also have chosen the most appropriate stellar properties to use in my multi-dimensional analysis. Finally, I would like to have started running the program on my training data by that point.

References

Ball, Nicholas, and Robert J. Brunner. “Data Mining and Machine Learning in Astronomy.” International Journal of Modern Physics. 11 June 2009.
Bazell, David, and Yuan Peng. “A Comparison of Neural Network Algorithms and Preprocessing Methods for Star-Galaxy Discrimination.” The Astrophysical Journal Supplement Series. 116: 47-55. May 1998.
Soto, A., A. Cansado and F. Zavala. “Detection of Rare Objects in Massive Astronomical Datasets Using Innovative Knowledge Discovery Technology.” Astronomical Data Analysis Software and Systems XIV. Eds. Shopbell, P.L., M.C. Britton and R. Ebert. ASP Conference Series. 30: 1-5. 2005.

Computer Science 34:
Machine Learning and statistical data analysis

Figure: Artist’s rendition of PSR 1257b, the first confirmed exoplanet.
Source: http://www.scientificamerican.com/slideshow.cfm?id=top-10-exoplanets