The mission of this project is to use machine learning to determine how similar exoplanets are to Earth based on the exoplanets' characteristics. Evaluating the similarity of exoplanets to Earth is intriguing due to its potential to identify habitable worlds beyond our solar system. This approach enables the efficient allocation of resources for space missions, enhances our understanding of planetary evolution, and inspires future space exploration.


Introduction

The Planetary Habitability Laboratory (PHL) at the University of Puerto Rico has aggregated data on exoplanets discovered by telescopes over the past few decades into a catalog of potentially habitable worlds. Among the data, they include a metric called the Earth Similarity Index (ESI), a measure of Earth-likeness from 0 (no similarity) to 1 (identical to Earth). Although the ESI score is not a direct measure of habitability, it can serve as a secondary classification scheme given the lack of information about properties correlated with the potential for biological life, such as the chemical composition of the surface and atmosphere, the presence of liquid solvents, and available energy.

Feature Selection

My goal was to be as inclusive and comprehensive as possible, training my models on all the data available to me. In addition to determining which exoplanets were more likely to be habitable, I hoped to extract the key properties on which habitability depends most. However, I did perform some pre-processing to ensure I was training on a good, mostly complete dataset. From the 200+ features provided in PHL’s Exoplanet Catalog, I eliminated those that

  1. Were non-numerical (not as suitable for a regression task)

  2. Had unknown quantities for a majority of exoplanets

  3. Were not plausibly related to habitability (e.g., the year the planet was discovered, measurement errors)

import pandas as pd

df = pd.read_csv('phl_exoplanet_catalog.csv')
# Features descriptions: https://phl.upr.edu/projects/habitable-exoplanets-catalog/hec-data-of-potentially-habitable-worlds/phls-exoplanets-catalog 

# Delete error columns
error_columns = [col for col in df.columns if 'ERROR' in col]
df = df.drop(columns=error_columns)
print("Number of error columns:", len(error_columns))

# Delete columns where too many values are unknown
num_col = len(df.columns)
threshold = int(len(df) * 0.8) # hyper-parameter: keep columns that are at least 80% known
df = df.dropna(axis=1, thresh=threshold)
print("Number of mostly-unknown columns:", num_col-len(df.columns))

# Delete non-numeric columns
num_col = len(df.columns)
pname_column = df.pop("P_NAME")
df = df.select_dtypes(include='number')
df.insert(0, 'P_NAME', pname_column)
print("Number of non-numeric columns:", num_col-len(df.columns))

# Remove features not plausibly related to the task (discovery year, habitability label)
df = df.drop(columns=['P_YEAR', 'P_HABITABLE'])

Some of the features I selected were the exoplanet’s mass, radius, period, and equilibrium temperature. In total, I trained using 49 features.
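For context, the cleaned features must be wrapped into the train and validation loaders (data.train_ld and data.val_ld) referenced in the training code later on. Below is a minimal sketch of that step; the target column name P_ESI and the 80/20 split are my assumptions rather than details from the original code.

import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# 'P_ESI' is the assumed name of the target column in the PHL catalog;
# assumes missing values have already been imputed (see Key Assumptions)
features = df.drop(columns=['P_NAME', 'P_ESI'])
X = torch.tensor(features.values, dtype=torch.float32)
y = torch.tensor(df['P_ESI'].values, dtype=torch.float32).unsqueeze(1)

dataset = TensorDataset(X, y)
train_set, val_set = random_split(dataset, [0.8, 0.2])  # assumed 80/20 split

train_ld = DataLoader(train_set, batch_size=32, shuffle=True)
val_ld = DataLoader(val_set, batch_size=32)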

Learning Models

I planned to train two models, a multivariate linear regression and a feedforward neural network, and compare their performance. I evaluated success by calculating the mean squared error between the predicted ESI score (the output of the machine learning algorithms) and the true ESI score provided by the Habitable Exoplanets Catalog.
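A minimal sketch of that evaluation, computing the mean squared error (and the mean absolute error reported later) over a loader; torch.nn.functional provides both losses:

import torch
from torch.nn import functional as F

def evaluate(dataloader, model, device):
    model.eval()
    mse, mae, n = 0.0, 0.0, 0
    with torch.no_grad():  # no gradients needed for evaluation
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            mse += F.mse_loss(pred, y, reduction='sum').item()
            mae += F.l1_loss(pred, y, reduction='sum').item()
            n += len(y)
    return mse / n, mae / n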

Polynomial Regression Architecture

Multivariate Linear Regression

A multivariate linear regression model attempts to capture the relationship between the independent variables (our features) and the dependent variable (the ESI score) by fitting an nth-degree polynomial expression. I found that a polynomial of degree 2, whose architecture is graphically represented above, was best suited for this task. The model was implemented in PyTorch using a single linear layer with a bias term. The hyperparameters I focused on tuning were the learning rate, the maximum number of epochs the model is allowed to train for, and the stopping threshold (the change in the loss on the validation dataset).

from torch import cat
from torch.nn import Linear, Module

class PolynomialRegression(Module):
  def __init__(self, input_size, degree):
      super(PolynomialRegression, self).__init__()
      self.degree = degree
      # One linear layer over the stacked polynomial terms, with a bias
      self.fc = Linear(degree * input_size, 1)

  def forward(self, x):
      # Concatenate x, x^2, ..., x^degree along the feature dimension
      x_poly = cat([x.squeeze() ** i for i in range(1, self.degree + 1)], dim=1)
      return self.fc(x_poly)
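Note that with degree 2, the linear layer sees each feature alongside its square, so the 49 inputs expand to 98. A quick hypothetical usage check:

import torch

model = PolynomialRegression(input_size=49, degree=2)
x = torch.randn(8, 49)   # a dummy batch of 8 planets
print(model(x).shape)    # torch.Size([8, 1])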

Feedforward Neural Network (FNN) Architecture

Feedforward Neural Network

A feedforward neural network is an interconnected network of nodes that applies weighted transformations and activation functions to the input features to produce an output. The error in this output is used to adjust the weights, allowing the network to capture and model complex patterns. This model was also implemented using PyTorch. I used one input layer, two hidden layers, and one output layer, for a total of four layers. The input layer had 49 nodes (each corresponding to a feature), the first hidden layer had 32 nodes, the second hidden layer had 16 nodes, and the output layer had 1 node (since I am performing a regression). The hyperparameters I focused on tuning were the same as before: the learning rate, the maximum number of epochs the model is allowed to train for, and the stopping threshold (the change in the loss on the validation dataset).

from torch.nn import Linear, Module, Tanh

class FeedForwardNN(Module):
  def __init__(self):
    super(FeedForwardNN, self).__init__()
    self.linear1 = Linear(49, 32)   # input layer -> first hidden layer
    self.act1 = Tanh()
    self.linear2 = Linear(32, 16)   # first hidden layer -> second hidden layer
    self.act2 = Tanh()
    self.linear_out = Linear(16, 1) # second hidden layer -> output

  def forward(self, x):
    x = self.linear1(x)
    x = self.act1(x)
    x = self.linear2(x)
    x = self.act2(x)
    x = self.linear_out(x)
    return x.squeeze().unsqueeze(1) # normalize the output shape to (batch, 1)
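The final squeeze/unsqueeze simply normalizes the output to shape (batch, 1) so it lines up with the targets when computing the loss. A quick hypothetical sanity check:

import torch

ff = FeedForwardNN()
out = ff(torch.randn(8, 49))  # dummy batch of 8 planets, 49 features each
print(out.shape)              # torch.Size([8, 1])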

Training

A simplified snippet of code running through one epoch of training is shown below.

def train(dataloader, model, loss_func, optimizer, device):
  model.train()
  for batch, (X, y) in enumerate(dataloader):
    X, y = X.to(device), y.to(device)
    pred = model(X)              # forward pass
    loss = loss_func(pred, y)    # compute batch loss

    optimizer.zero_grad()        # clear stale gradients
    loss.backward()              # backpropagate
    optimizer.step()             # update weights

Similarly, a snippet is shown below for training the feedforward neural network, stopping after a set number of epochs. The early stopping criterion that evaluates the validation loss has been omitted for simplicity's sake (a sketch follows the snippet). Feel free to examine the full code, which shows how the loss per epoch is recorded and used to determine when the model begins to overfit.

from torch import cuda
from torch.nn import MSELoss
from torch.optim import Adam

def FFN(data, lr=1e-4, max_epoch=10, loss_thres=1e-2):
    device = "cuda" if cuda.is_available() else "cpu"
    ff = FeedForwardNN().to(device)
    loss_func = MSELoss()
    optimizer = Adam(ff.parameters(), lr=lr)

    epochs = 0
    while epochs < max_epoch:
        train(data.train_ld, ff, loss_func, optimizer, device)
        epochs += 1
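For completeness, here is a minimal sketch of the omitted early-stopping check, replacing the loop above; it assumes a validation loader data.val_ld and the evaluate helper sketched earlier, and stops once the validation loss stops improving by at least loss_thres:

    best_val = float('inf')
    while epochs < max_epoch:
        train(data.train_ld, ff, loss_func, optimizer, device)
        val_mse, _ = evaluate(data.val_ld, ff, device)
        if best_val - val_mse < loss_thres:  # validation loss has plateaued
            break
        best_val = val_mse
        epochs += 1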

Results

Training and Validation Loss: Regression Model

On the left, the training learning curve shows the model fitting the training data relatively quickly: the loss drops below 0.01 at around 15 epochs and then plateaus. The validation learning curve (on the right) shows a steep decline that levels out at around 100 to 125 epochs, with a small hump around 25 to 30 epochs likely caused by noise in the dataset.

Predicted vs Actual ESI Score: Regression Model

The graph shows a strong positive correlation between predicted and true ESI scores, suggesting that the model generalizes quite well, with a mean squared error of 0.008 and a mean absolute error of 0.06. However, there is a slight downward skew, with predicted ESI scores tending to fall slightly below the actual scores.

Training and Validation Loss: FNN

On the left, the training learning curve shows a less volatile decline in loss compared to the polynomial regression model, leveling out at a slightly later epoch of around 40. On the right, the validation learning curve is smooth and also levels out around 40 epochs, indicating a better fit to the validation data than the regression model achieved.

Predicted vs Actual ESI Score: FNN

The graph above is similar to that of the polynomial regression model, with a strong positive correlation suggesting good generalization. This model had a slightly better mean squared error of 0.005 and a slightly better mean absolute error of 0.05. There is a similar slight downward skew, suggesting that the model predicts ESI scores slightly lower than actual.


Analysis

Model Comparison

Top 5 Habitable Planets

The two models agreed on 90% of their top 5 predicted habitable planets, further establishing confidence that both models are accurately predicting ESI scores.

By examining the weights of the features in the polynomial regression model, I was able to extract the features most correlated with habitability. These include the planet’s mass, density, surface temperature, mean stellar flux, the star’s snow line, and whether the planet is in the optimistic habitable zone.
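A minimal sketch of that extraction, assuming the trained PolynomialRegression instance is named model (hypothetical), that the feature order matches the DataFrame's columns, and that the features were scaled to comparable ranges so weight magnitudes are comparable:

# Rank features by the magnitude of their degree-1 weights
feature_names = [c for c in df.columns if c not in ('P_NAME', 'P_ESI')]
weights = model.fc.weight.detach().squeeze()      # shape: (degree * 49,)
linear_w = weights[:len(feature_names)]           # degree-1 terms come first
ranked = sorted(zip(feature_names, linear_w.abs().tolist()),
                key=lambda t: t[1], reverse=True)
print(ranked[:6])  # most influential features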

Limitations

This project has its inherent limitations. The models are not fit for use on exoplanets with a multitude of unknown characteristics. Additionally, the project doesn't account for measurement uncertainty, which introduces a potential source of error in the assessments. The indeterminate feature significance for the Feedforward Neural Network (FNN) further complicates my ability to discern critical factors influencing habitability; as of now, I am basing feature significance only on the weights of the polynomial regression model. Furthermore, the continuous updates and additions to the exoplanet database pose a challenge, as the evolving dataset may affect the stability and reliability of the models as new information is gained. This issue stems from extracting data from a downloaded CSV file (done for ease); the solution would be to query NASA’s API, which is continuously updated.

Key Assumptions

The critical assumption I made is that similarity to Earth is strongly correlated with habitability, which provides the foundational basis for the task. As discussed previously, the ESI score is a crude measure that attempts to make predictions about habitability with limited available data. To handle missing data, I replaced unknown values with the mean of the respective feature; a better approach would be regression imputation using the other known features of that sample (see the sketch below). Additionally, I assumed the polynomial relationship between the features and the ESI score was of degree 2; a higher-degree polynomial may be needed to capture more complex relationships in the data. For the feedforward neural network, I assumed that the inputs are independent; however, I know this to be false, since features I train with, such as mass, radius, and density, affect one another. Lastly, I assumed that the FNN architecture, with two hidden layers (totaling 48 nodes), is adequate for capturing the intricacies of the relationships within the dataset. More extensive tests are recommended to see whether performance can be improved by adding more hidden layers or nodes.
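The mean imputation described above is a one-liner in pandas, and scikit-learn offers a regression-based alternative via IterativeImputer; both are sketched below.

# Mean imputation (the approach used in this project)
df = df.fillna(df.mean(numeric_only=True))

# Regression imputation alternative: each feature with missing values is
# predicted from the other features
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = IterativeImputer().fit_transform(df[numeric_cols])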
