Enhancing Astrometric Methods for Exoplanet Discovery with Machine Learning

Introduction

Astrometry is a vital method for discovering non-transiting exoplanets; however, it is exceedingly hard to apply with current techniques because it requires sub-milliarcsecond precision. This research project focuses on enhancing astrometric techniques to discover non-transiting exoplanets using data from the European Space Agency's Gaia mission. Astrometry measures the subtle 'wobble' in a star's position caused by an orbiting exoplanet, which can reveal Earth-like planets in orbits that are close to face-on as seen from Earth, where the signal for other methods such as radial velocity is weak. The dataset includes false positives (detections of transit-like signals in the data that are not due to planetary transits) and confirmed exoplanets discovered and validated through astrometric and radial velocity methods. By applying machine learning algorithms to this dataset, the project aims to identify key patterns in Gaia data, train a model on these patterns, and use it to output the ten most probable exoplanet candidates from a given dataset. Such a model could improve the accuracy and reliability of exoplanet detection through astrometry, potentially increasing our understanding of exoplanetary systems and habitability.

Position of Science Star Over Time

Illustration of an Exoplanet's Gravitational Influence on its Host Star's Orbit

Methodology

Data Acquisition and Preprocessing

I began by querying the gaiadr3.gaia_source table in the Gaia Archive for false positives and for exoplanets confirmed through astrometry and radial velocity methods. One million host star candidates were also queried and kept in a separate dataset, to be used later with the trained machine learning model. Exoplanets confirmed by radial velocity are included because only three exoplanets have been confirmed using astrometry, so additional samples were needed to improve the model's accuracy. Since radial velocity and astrometry both detect the motion of a host star caused by an orbiting companion, radial velocity confirmations were deemed the best supplement. Through this query, I selected relevant features such as parallax, proper motion in RA and Dec, and G-band mean magnitude. Since the Gaia Archive does not label sources as confirmed exoplanets or false positives, the NASA Exoplanet Archive and SIMBAD were used to cross-match these objects to their Gaia source_id. Once the data were obtained, both data frames were preprocessed using SimpleImputer to replace NaN or missing values with the median of the feature column, so that the algorithm can run on the dataset without errors. StandardScaler was also used to standardize the data, rescaling each feature to zero mean and unit variance so that features with very different ranges are comparable for the machine learning model.
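The imputation and scaling steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the small array of values is made up, and chaining the two steps with make_pipeline is one possible arrangement.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for a few Gaia columns (values are made up for illustration).
# Columns: parallax, pmra, pmdec, phot_g_mean_mag
X = np.array([
    [1.52,   5.95, -15.96, 12.1],
    [np.nan, 3.10,   2.40, 10.7],
    [0.80, np.nan,  -7.25, 14.3],
    [2.35,  -1.20,   0.55, np.nan],
    [1.10,   8.40,  -3.30, 11.9],
])

# Median imputation first, then standardization to zero mean and unit variance.
preprocess = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
X_clean = preprocess.fit_transform(X)

print(np.isnan(X_clean).any())        # whether any NaN remains
print(X_clean.mean(axis=0).round(6))  # per-column means after scaling
```

Fitting the pipeline on the labeled data and reusing it with `preprocess.transform(...)` on the separate candidate dataset would keep both datasets on the same scale.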

Gaia Query Search

Queried necessary data from the Gaia Archive, including parallax, proper motion in RA and Dec, G-band mean magnitude, and other key astrometric parameters.

Data Analysis and Algorithm Development

1. Engineered Astrometry Features

The features present in the data frame are sufficient for basic analysis; however, to expose subtler patterns, new features were created: motion metric, magnitude difference, parallax-to-motion ratio, normalized radial velocity, noise-to-signal ratio, total proper motion, and color index. These engineered features make it possible to identify patterns that are not apparent in the raw data from the archives.

Engineered Astrometry Feature Functions

Includes the motion metric, magnitude difference, parallax-to-motion ratio, normalized radial velocity, noise-to-signal ratio, total proper motion, and color index. These engineered features enhance the model’s ability to distinguish between confirmed exoplanets and false positives, revealing patterns that are not immediately apparent in the raw data.
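As an illustration of how such features could be derived, here is a minimal sketch for three of them. The exact formulas used in the project are not given in the text, so the definitions below are assumptions: total proper motion as the quadrature sum of pmra and pmdec, color index as BP minus RP magnitude, and parallax-to-motion ratio as parallax divided by total proper motion. The function name and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with three illustrative engineered columns."""
    out = df.copy()
    # Total proper motion: magnitude of the (pmra, pmdec) vector, in mas/yr.
    out["total_proper_motion"] = np.hypot(out["pmra"], out["pmdec"])
    # Color index: assumed here to be the BP - RP magnitude difference.
    out["color_index"] = out["phot_bp_mean_mag"] - out["phot_rp_mean_mag"]
    # ASSUMPTION: ratio defined as parallax / total proper motion.
    out["parallax_to_motion_ratio"] = out["parallax"] / out["total_proper_motion"]
    return out


# Made-up values, using Gaia column names mentioned in the text.
stars = pd.DataFrame({
    "parallax": [2.0, 1.0],
    "pmra": [3.0, 6.0],
    "pmdec": [-4.0, 8.0],
    "phot_bp_mean_mag": [12.5, 14.0],
    "phot_rp_mean_mag": [11.5, 13.2],
})
features = add_engineered_features(stars)
print(features[["total_proper_motion", "color_index", "parallax_to_motion_ratio"]])
```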

2. Random Forest Decision Making

With the preprocessed data, labels are assigned to the data frames. A new column named 'exoplanet' is created, with a value of 0 for every false positive and 1 for every confirmed exoplanet. These labels let the model learn which characteristics separate confirmed exoplanets from false positives, so that it can later flag strong candidates. To find these patterns, a Random Forest classifier was used. The classifier evaluates the features from the Gaia query and the engineered features by constructing decision trees. Each tree makes a classification, and the final prediction is determined by the majority vote across all trees. In the decision tree below, each node represents a feature-based decision, showing:

Feature and Threshold: The feature used to split the data and the threshold value (e.g., "Proper Motion in Dec ≤ 0.202").
Gini Impurity: A measure of node purity, with lower values indicating more homogeneous classes.
Samples: The percentage of total samples that reach this node.
Value: The distribution of samples between the two classes (e.g., [0.4, 0.6] for false positives and exoplanets).
Predicted Class: The majority class at that node.
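To make the Gini impurity entry concrete, the short sketch below computes it for the example class distribution quoted above. The function name is chosen here for illustration; the formula is the standard one, 1 minus the sum of squared class fractions.

```python
def gini_impurity(class_fractions):
    """Gini impurity, 1 - sum(p_k^2), for a node's class fractions."""
    total = sum(class_fractions)
    return 1.0 - sum((p / total) ** 2 for p in class_fractions)


# Node with 40% false positives and 60% exoplanets, as in the "Value" example.
mixed_node = gini_impurity([0.4, 0.6])
# A node containing only one class is perfectly pure.
pure_node = gini_impurity([0.0, 1.0])

print(round(mixed_node, 2), pure_node)  # 0.48 0.0
```

For two classes the impurity ranges from 0 (pure node) to 0.5 (an even split), so the 0.48 of the example node means it is still close to maximally mixed.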

The classifier distinguishes exoplanets from false positives by identifying consistent patterns in these features while filtering out noisy data that may lead to false positives. Each of the four hyperparameters was given two candidate values, resulting in 16 different models to evaluate. The choices were as follows: n_estimators was either 5000 or 6000 decision trees; for each tree, max_depth was either 5 or 10 levels, min_samples_split was either 10 or 15 samples required to split a node, and min_samples_leaf was either 80 or 1000 samples required at each leaf node.
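The count of 16 follows from two choices for each of four hyperparameters. A minimal sketch using scikit-learn's ParameterGrid enumerates the combinations without fitting anything. The two min_samples_leaf values are taken from the text as 80 and 1000, which is an assumption about how that sentence should be read.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [5000, 6000],
    "max_depth": [5, 10],
    "min_samples_split": [10, 15],
    # ASSUMPTION: the two leaf-size choices are 80 and 1000.
    "min_samples_leaf": [80, 1000],
}

# Each element is one dict of hyperparameters, i.e. one model candidate.
candidates = list(ParameterGrid(param_grid))
print(len(candidates))  # 2 * 2 * 2 * 2 = 16
print(candidates[0])
```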

First decision tree from the Random Forest Classifier

Illustrates feature-based splits such as proper motion in Dec and brightness. Important features like color index and noise signal ratio help differentiate exoplanet candidates from false positives.

The figure above is the first decision tree from the random forest classifier algorithm from the chosen model. From what is shown, the initial split is based on "Proper Motion in Dec," where values less than or equal to 0.202 indicate a higher likelihood of being a false positive. Subsequent splits refine this classification by examining features like "Brightness (Blue Photometric Band)" and motion_metric, which further differentiate between exoplanet candidates and false positives. For example, nodes that consider color_index and noise_signal_ratio help isolate exoplanet candidates from the noise in the data, with lower Gini impurity values indicating more certainty in the classification. The tree identifies key patterns, such as exoplanets often having specific parallax and motion characteristics, as well as particular photometric magnitudes that distinguish them from false positives. This decision-making process is repeated across many trees in the random forest, ensuring a robust model is created for identifying exoplanets.

3. Visualization of Feature Relationships & Correlations

Next, exploratory data analysis (EDA) is performed using the Seaborn (sns) Python package to visualize feature relationships and correlations using a heatmap and pairplot.
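A minimal sketch of this step is shown below, using synthetic data in place of the Gaia tables. The helper names are hypothetical, the data are randomly generated, and the plotting options (annotations, the coolwarm colormap) are assumptions rather than the settings used for the figures. The plotting libraries are imported inside the plotting function so the correlation step can run without them.

```python
import numpy as np
import pandas as pd


def correlation_matrix(df, columns):
    """Pearson correlation matrix of the selected feature columns."""
    return df[columns].corr(method="pearson")


def plot_heatmap_and_pairplot(df, columns, label_col="exoplanet"):
    """Draw a correlation heatmap and a class-colored pairplot with Seaborn."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, ax = plt.subplots(figsize=(6, 5))
    sns.heatmap(correlation_matrix(df, columns), annot=True, fmt=".2f",
                cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
    grid = sns.pairplot(df, vars=columns, hue=label_col)
    return fig, grid


# Synthetic stand-in data: BP magnitude tracks G magnitude, pmdec is independent.
rng = np.random.default_rng(0)
g_mag = rng.normal(12.0, 1.5, 200)
df = pd.DataFrame({
    "phot_g_mean_mag": g_mag,
    "phot_bp_mean_mag": g_mag + 0.3 + rng.normal(0.0, 0.1, 200),
    "pmdec": rng.normal(0.0, 5.0, 200),
    "exoplanet": rng.integers(0, 2, 200),
})
feature_cols = ["phot_g_mean_mag", "phot_bp_mean_mag", "pmdec"]
corr = correlation_matrix(df, feature_cols)
print(corr.round(2))
```

Calling `plot_heatmap_and_pairplot(df, feature_cols)` would then produce the two figures for this toy data frame.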

Correlation Heatmap of Astrometric Features between Confirmed Exoplanets and False Positives

Visualizes the correlations between various astrometric features, providing insight into which features are most influential in distinguishing between false positives and confirmed exoplanets. Strong correlations are indicated by darker red hues (larger correlation coefficients), while lighter colors represent weaker or negative correlations (smaller correlation coefficients).

In the heatmap above, the correlations between astrometric features give insight into which features may be influential in distinguishing between false positives and confirmed exoplanets. Larger coefficients and darker red hues indicate stronger positive correlations, while lighter and blue hues represent weaker or negative correlations. This visualization helps identify key patterns the model could leverage to improve classification accuracy.

Notably, there is a strong correlation between astrometric_excess_noise and astrometric_excess_noise_sig, which suggests that these features are closely related and could play a significant role in identifying false positives. Similarly, phot_g_mean_mag, phot_bp_mean_mag, and phot_rp_mean_mag show high correlations with one another, indicating that photometric measurements are consistently linked and could be critical in distinguishing between false positives and confirmed exoplanets.

On the other hand, some features, such as parallax and pmdec, show weaker correlations with other features. A weak feature-to-feature correlation does not by itself show how much a feature contributes to classification; rather, it suggests that these features carry information not captured by the more strongly correlated ones, which can help refine predictions. By identifying which features are strongly correlated and which are not, redundant and complementary features can be told apart, which supports a model that distinguishes more effectively between false positives and confirmed exoplanets.

Pairplot of Candidates and Confirmed Exoplanets

Pairplot of False Positives and Confirmed Exoplanets

Visualizes the relationships and patterns between different astrometric features, allowing for the comparison of distributions and correlations between candidates, false positives and confirmed exoplanets.

In the pairplots shown above, I can visualize the intricate relationships between astrometric features, allowing for a comparison of distributions and correlations among candidates, confirmed exoplanets, and false positives. The pairplot on the left, which displays candidates and confirmed exoplanets, highlights the variability of candidates compared to confirmed exoplanets, showcasing distinct patterns that help differentiate the two groups. In contrast, the pairplot on the right, which compares false positives with confirmed exoplanets, reveals patterns that are nearly identical between the two, illustrating the subtle differences that make classification challenging. These distinctions are critical for the Random Forest classifiers, which effectively capture these patterns and use them to construct a robust model that can accurately classify and distinguish between candidates, false positives, and confirmed exoplanets in the testing phase.

Upon closer examination, some features separate the groups more clearly than others. For instance, in the pairplot on the right, the proper motion components (pmra and pmdec) of false positives cluster much like those of confirmed exoplanets, reflecting the difficulty in distinguishing these two groups. On the other hand, in the pairplot on the left, the feature phot_g_mean_mag shows clearer separation between candidates and confirmed exoplanets, with distinct groupings observed across multiple comparisons. Meanwhile, features like parallax and motion_metric display more spread-out distributions, particularly in the candidates' plot, indicating higher variability that may influence the classification process. These patterns reinforce the importance of feature selection and of the classifier’s ability to leverage these relationships to enhance model accuracy.

4. Finalization & Evaluation of Model

The Synthetic Minority Over-sampling Technique (SMOTE) was applied to address the class imbalance between false positives and confirmed exoplanets. Initially, the target distribution was slightly imbalanced, with 843 false positives and 730 confirmed exoplanets. SMOTE generated 113 synthetic confirmed-exoplanet samples, balancing the distribution at 843 samples per class for a total of 1,686 samples.
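The idea behind SMOTE can be sketched in a few lines of NumPy: each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbors. This is a simplified stand-in written for illustration, not the imbalanced-learn implementation and not the project's code. The function name, the choice of k = 5, and the random feature values are assumptions. Only the class counts (843 and 730) come from the text.

```python
import numpy as np


def smote_like_oversample(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating toward nearest neighbors."""
    rng = np.random.default_rng(seed)
    # Pairwise squared distances within the minority class.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dist2 = np.einsum("ijk,ijk->ij", diffs, diffs)
    np.fill_diagonal(dist2, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dist2, axis=1)[:, :k]
    # Pick a base point, one of its k neighbors, and an interpolation fraction.
    base = rng.integers(0, len(X_minority), size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])


rng = np.random.default_rng(1)
X_false_pos = rng.normal(0.0, 1.0, size=(843, 4))   # majority class (label 0)
X_exoplanet = rng.normal(0.5, 1.0, size=(730, 4))   # minority class (label 1)

synthetic = smote_like_oversample(X_exoplanet, len(X_false_pos) - len(X_exoplanet))
X_exoplanet_balanced = np.vstack([X_exoplanet, synthetic])
print(len(X_false_pos), len(X_exoplanet_balanced))  # 843 843
```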

Using GridSearchCV, the Random Forest classifier's hyperparameters were tuned to be strict (limiting tree depth and requiring many samples per split and leaf) without sacrificing performance. From the hyperparameters listed in section 2, 16 model candidates were created and cross-validated with 5 folds, for a total of 80 model fits. In 5-fold cross-validation the data is split into 5 parts (folds). In each round, 4 parts are used for training the model and the remaining part is used for validation. This process is repeated 5 times for each of the 16 model candidates, so that each fold serves as the validation set once.
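The search procedure can be sketched as below. To keep the example quick to run, it uses a synthetic dataset from make_classification and much smaller hyperparameter values than those in the text. Both are assumptions made for illustration. Only the structure (four hyperparameters with two choices each, 5 folds) mirrors the description above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data standing in for the labeled Gaia table.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Scaled-down grid: same shape as the text's grid, smaller values for speed.
param_grid = {
    "n_estimators": [10, 20],
    "max_depth": [5, 10],
    "min_samples_split": [10, 15],
    "min_samples_leaf": [5, 10],
}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])
n_fits = n_candidates * search.n_splits_
print(n_candidates, search.n_splits_, n_fits)  # 16 5 80
print(search.best_params_)
```

After fitting, `search.best_estimator_` holds the best candidate refit on all of the data passed to `fit`.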

Classification Report from the Random Forest Classifier

Demonstrates the high performance of the random forest classifier, with F1-scores of 0.96 for both false positives and confirmed exoplanets classes.

The best model is evaluated with a classification report and ROC AUC score. The classification report confirms the model's strong performance, with balanced precision and recall across both classes. Class 0 (false positives) shows a precision of 0.92 and a recall of 1.00, while Class 1 (confirmed exoplanets) has a precision of 1.00 and a recall of 0.92, resulting in F1-scores of 0.96 for both. The overall accuracy of 0.96, along with consistent macro and weighted averages, further supports the model's reliability. The ROC AUC score of 0.98 indicates that the top model separates the two classes very well. Because such a high score could also be a sign of overfitting, cross-validation was used to check that the model's performance was consistent across data splits. Taken together, these metrics suggest that the model is both accurate and robust, with a reduced risk of overfitting.
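The reported F1-scores follow directly from the precision and recall values, since F1 is their harmonic mean. A short worked check, with a helper name chosen here for illustration:

```python
def f1_from(precision, recall):
    """F1-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1_false_positive = f1_from(0.92, 1.00)  # class 0
f1_exoplanet = f1_from(1.00, 0.92)       # class 1
print(round(f1_false_positive, 2), round(f1_exoplanet, 2))  # 0.96 0.96
```

Because the two classes have mirrored precision and recall, the harmonic mean is the same for both, which is why the report shows 0.96 twice.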

Finally, the model is run on the training set to assess its accuracy and reliability, outputting predictions for the most probable exoplanet candidates. The top ten candidates are then identified as confirmed exoplanets from class 1, further validating my model.
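One way to produce a top-ten list is to rank sources by the classifier's predicted probability for class 1. The text does not state how the ranking was done, so this is an assumption, and the data below are synthetic stand-ins generated with make_classification.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: first 400 rows are labeled, the rest act as candidates.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, y_train, X_candidates = X[:400], y[:400], X[400:]

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Column 1 of predict_proba is the probability of class 1 (exoplanet).
proba_exoplanet = model.predict_proba(X_candidates)[:, 1]

# Indices of the ten candidates with the highest class-1 probability.
top10 = np.argsort(proba_exoplanet)[::-1][:10]
for rank, idx in enumerate(top10, start=1):
    print(rank, idx, round(proba_exoplanet[idx], 3))
```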

Results

From the list of top candidates, the strongest candidate, source ID 53009212219874304, was selected based on its astrometric signatures, such as its parallax of 1.5195 ± 0.0198 mas and its proper motions in RA and Dec of 5.9454 ± 0.0215 mas yr⁻¹ and -15.9609 ± 0.0138 mas yr⁻¹, respectively.
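For orientation, two derived quantities can be computed from these values: a distance estimate and the total proper motion. The sketch assumes the simple inverse-parallax distance estimate (distance in parsecs equals 1000 divided by the parallax in mas), which ignores the parallax uncertainty and any zero-point correction.

```python
import math

parallax_mas = 1.5195
pmra_mas_yr = 5.9454
pmdec_mas_yr = -15.9609

# ASSUMPTION: simple inverse-parallax estimate, no zero-point correction.
distance_pc = 1000.0 / parallax_mas
# Total proper motion: magnitude of the (RA, Dec) proper motion vector.
total_pm_mas_yr = math.hypot(pmra_mas_yr, pmdec_mas_yr)

print(f"{distance_pc:.1f} pc, {total_pm_mas_yr:.2f} mas/yr")
```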

A Keplerian Model was generated by first querying the Gaia Archive for astrometric data of the source ID of our strongest candidate. The retrieved data included ra, dec, ref_epoch, pmra, pmdec, parallax.

A Keplerian orbital model function, incorporating parallax effects, was then defined and used to fit the RA and Dec residuals over a specified time range. This keplerian_orbit function models the astrometric motion of a star by simulating its orbit under gravitational forces, accounting for both orbital dynamics and parallax. The function first calculates the mean anomaly, which tracks the planet's position in its orbit over time. This is corrected for orbital eccentricity by solving for the eccentric anomaly iteratively. From this, the true anomaly is derived, providing the planet's actual position in its elliptical orbit. These positions are converted from the orbital plane to sky coordinates, accounting for the orbit's orientation relative to the observer on Earth. Finally, the parallax effect is added to simulate the apparent shift in the star's position due to Earth's movement around the Sun. This combined model generates the predicted RA and Dec offsets, enabling a detailed comparison with observed data.
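The orbital part of such a function can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's keplerian_orbit function: the names and parameterization are chosen here, angles are in radians, the semi-major axis is given directly in mas, and the parallax term is omitted because it requires Earth's position relative to the Sun at each epoch.

```python
import numpy as np


def solve_kepler(mean_anomaly, ecc, tol=1e-12, max_iter=100):
    """Solve Kepler's equation M = E - e*sin(E) for E with Newton's method."""
    M = np.mod(np.asarray(mean_anomaly, dtype=float), 2.0 * np.pi)
    E = M.copy() if ecc < 0.8 else np.full_like(M, np.pi)
    for _ in range(max_iter):
        step = (E - ecc * np.sin(E) - M) / (1.0 - ecc * np.cos(E))
        E = E - step
        if np.max(np.abs(step)) < tol:
            break
    return E


def keplerian_offsets(t, period, t_peri, ecc, a_mas, inc, node, argp):
    """Sky-plane offsets (d_ra, d_dec) in mas of a Keplerian orbit at times t."""
    # Mean anomaly: orbital phase that advances uniformly in time.
    M = 2.0 * np.pi * (np.asarray(t, dtype=float) - t_peri) / period
    # Eccentric anomaly from the iterative solution of Kepler's equation.
    E = solve_kepler(M, ecc)
    # True anomaly and orbital radius in the orbital plane.
    nu = 2.0 * np.arctan2(np.sqrt(1.0 + ecc) * np.sin(E / 2.0),
                          np.sqrt(1.0 - ecc) * np.cos(E / 2.0))
    r = a_mas * (1.0 - ecc * np.cos(E))
    # Rotate from the orbital plane to the sky (node measured from north to east).
    u = argp + nu
    d_dec = r * (np.cos(node) * np.cos(u) - np.sin(node) * np.sin(u) * np.cos(inc))
    d_ra = r * (np.sin(node) * np.cos(u) + np.cos(node) * np.sin(u) * np.cos(inc))
    return d_ra, d_dec


# Example: a 1.5-year, moderately eccentric orbit sampled over four years.
t = np.linspace(2014.0, 2018.0, 9)
d_ra, d_dec = keplerian_offsets(t, period=1.5, t_peri=2015.0, ecc=0.3,
                                a_mas=2.0, inc=0.6, node=1.0, argp=0.4)
print(np.round(d_ra, 3))
print(np.round(d_dec, 3))
```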

The fitting process involved using curve_fit to optimize the model parameters based on the residuals calculated by comparing the predicted RA and Dec values (using proper motion) against the observed values. The resulting fitted RA and Dec values were plotted alongside the residuals, showing the model's performance in capturing the astrometric orbit of the star.
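The fitting step can be illustrated with SciPy's curve_fit on synthetic residuals. To keep the sketch short it fits a circular, face-on wobble instead of a full Keplerian model, and the residuals are simulated with known parameters plus noise. The model function, the parameter values, and the initial guess are all assumptions made for illustration. RA and Dec residuals are fitted jointly by stacking them into one array.

```python
import numpy as np
from scipy.optimize import curve_fit


def circular_wobble(t_stacked, amp, period, phase):
    """RA residuals followed by Dec residuals for a circular, face-on orbit."""
    t = t_stacked[: t_stacked.size // 2]  # both halves hold the same epochs
    theta = 2.0 * np.pi * t / period + phase
    return np.concatenate([amp * np.cos(theta), amp * np.sin(theta)])


rng = np.random.default_rng(42)
t = np.linspace(-2.0, 2.0, 60)            # years relative to epoch 2016.0
t_stacked = np.concatenate([t, t])        # one copy for RA, one for Dec

true_params = (2.0, 2.5, 0.3)             # amplitude (mas), period (yr), phase (rad)
observed = circular_wobble(t_stacked, *true_params)
observed = observed + rng.normal(0.0, 0.05, t_stacked.size)  # measurement noise

# Least-squares fit starting from a rough initial guess.
popt, pcov = curve_fit(circular_wobble, t_stacked, observed, p0=[1.5, 2.3, 0.0])
amp_fit, period_fit, phase_fit = popt
print(np.round(popt, 3))
print(np.round(np.sqrt(np.diag(pcov)), 4))  # 1-sigma parameter uncertainties
```

Periodic models have many local minima, so the quality of the initial guess for the period matters; a periodogram of the residuals is one common way to obtain it.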

Keplerian Orbit Model of Chosen Candidate from Algorithm

Exhibits sinusoidal motion in both RA and Dec residuals, indicating the presence of a gravitational influence on the host star. The periodic behavior, coupled with the star’s parallax and proximity, further strengthens the case for the existence of an orbiting exoplanet.

The figure above illustrates the astrometric orbit fitting for my chosen candidate using an improved Keplerian model. It reveals a clear periodic pattern in both RA (blue points) and Dec (green points) residuals, oscillating smoothly around zero. This periodic sinusoidal behavior is a strong indicator of a gravitational influence on the host star, potentially from an orbiting exoplanet. The fitted Keplerian orbital model, represented by the blue and orange lines, aligns well with the observed residuals, particularly over the time span from 2014 to 2018. This alignment suggests that the model is effectively capturing the star's motion, including the effects of an unseen companion. The amplitude of the residuals, deviating by up to 2 milliarcseconds, further supports this hypothesis, as such deviations are consistent with the expected wobble of a star influenced by an exoplanet, especially given the star's parallax and relative proximity.

Importantly, these residuals are observed after correcting for parallax and proper motion, which isolates the remaining signal as likely being due to the gravitational pull of an orbiting body rather than artifacts from Earth's motion or inherent stellar drift. The sinusoidal shape, combined with the model's fit, strongly suggests that the star's movement deviates from simple proper motion in a way that aligns with the influence of an exoplanet.

Keplerian Orbit Model of Random Candidate in the Gaia Database

Shows linear residuals with minimal deviations. The lack of a sinusoidal pattern suggests that the star is not influenced by a significant gravitational companion. The wave motion from the Dec Residuals could indicate the presence of more than one gravitational influence, but nothing can be assumed based on this graph.

The figure above illustrates the astrometric orbit fitting for a random candidate using an improved Keplerian model. Unlike the previously discussed candidate, the RA and Dec residuals do not exhibit a clear, periodic sinusoidal pattern. Instead, the residuals show more linear or erratic behavior, particularly in the RA data (red points), which suggests that the star's motion is not significantly influenced by a gravitational companion. The fitted Keplerian curves (blue and orange lines) attempt to model the data, but the lack of periodicity in the residuals indicates that the model is not capturing a periodic orbital motion.

This absence of a regular sinusoidal pattern weakens the case for an orbiting exoplanet around this star. In astrometric terms, a companion's gravitational influence would typically cause the star to exhibit a periodic wobble, reflected in the residuals as a repeating sinusoidal curve. The lack of such a pattern in both the RA and Dec data suggests that this candidate is less likely to host an exoplanet, as the observed motion is more consistent with simple proper motion or noise rather than the gravitational pull of a companion.

While this comparison of Keplerian Models provides compelling evidence of a potential exoplanet around Source ID: 53009212219874304, further investigation, such as radial velocity measurements, would be needed to confirm the presence and mass of the orbiting body. Nonetheless, the data presented here strongly supports the hypothesis that this star may indeed host an exoplanet.

Conclusion

By leveraging the Gaia database and machine learning techniques, this study provides a proof of concept that an algorithm can be developed to identify stars whose astrometry suggests orbiting exoplanets. The algorithm was preliminarily validated by successfully identifying confirmed exoplanet host stars from the NASA Exoplanet Archive that appear in the Gaia dataset used. It was further validated through the application of a Keplerian model: the resulting candidate, obtained by applying the model to one million candidates in the Gaia Archive, exhibits strong evidence of gravitational influence on its host star. The model primarily focuses on identifying stars that exhibit significant gravitational influences due to the presence of an exoplanet. However, further investigation is needed to confirm whether the object is indeed an exoplanet or another celestial body, such as a binary star or brown dwarf. Future research could focus on expanding the algorithm’s parameters and incorporating additional features, such as the type and metallicity of the star, to further enhance the model’s complexity and reliability. This approach would allow for more tailored models to meet specific research needs and improve the precision of exoplanet detection and characterization.

MPS Research Symposium

On August 29, 2024, I had the opportunity to present my research project, "Enhancing Astrometric Methods for Exoplanet Discovery with Machine Learning," at the 2024 Math & Physical Sciences Scholars Research Symposium. I showcased my work to faculty, staff, and undergraduate and graduate students, describing the process I followed to preprocess data, create a model, and apply it to datasets to uncover strong exoplanet candidates.

Special thanks are extended to Alan Chew for his mentorship, continued support, and valuable insights into astrometry, as well as to the Math & Physical Sciences Scholars team for funding this research.

Published Research Poster

Published Research Paper

MPS_2024_Research_Paper___Enhancing_Astrometric_Methods_for_Exoplanet_Discovery_Using_the_Gaia_Database_Final_Version_2.pdf

References

"Gaia Archive." Data, ESA, https://gea.esac.esa.int/archive/
"NASA Exoplanet Archive." California Institute of Technology, https://exoplanetarchive.ipac.caltech.edu/
"SIMBAD Astronomical Database." Centre de Données astronomiques de Strasbourg, Université de Strasbourg, https://simbad.cds.unistra.fr/simbad/
"Astrometry: The Oldest Method." RIT Physics Department, http://spiff.rit.edu/classes/resceu/lectures/astrom/astrom.html
Sahlmann, J., & Gómez, P. (2023). "Machine learning-based identification of Gaia astrometric exoplanet orbits." Publications of the Astronomical Society of the Pacific. https://iopscience.iop.org/article/10.1088/1538-3873/ad59c5
Holl, B., et al. (2023). "Gaia Data Release 3 - Astrometric orbit determination with MCMC and genetic algorithms." Astronomy & Astrophysics, 674, A10. https://doi.org/10.1051/0004-6361/202244161
"Astrometry." National Schools' Observatory, https://www.schoolsobservatory.org/learn/astro/exoplanets/detection_methods/astrometry