Based on the imaging classification presented in Nevin+2019 and Nevin+2023.
There are three main steps to this process:
- Creating the classification from simulations of merging galaxies (I've done this for you here, although I provide some of the tools to do this for yourself),
- Measuring predictor values from observed images of galaxies,
- Classifying galaxies and obtaining probability values using the output from steps 1 and 2, and finally,
- Run the merger fraction analysis from Nevin+2023
The simplest way to use this repo is to use the pre-prepared classifications and SDSS predictor values (see the tables/ folder, which will point you to these data) to get classifications for individual or tables of SDSS galaxies. I've also set this repo up to use these tables to re-run some of the analysis steps for the paper (#4).
However, the code for steps 1, 2, and 3 is also included here in case you'd like to re-run any steps of the classification for a separate set of simulated images (#1), measure predictor values from a separate sample of galaxies (#2), and/or rerun the classification on a different set or for different threshold values (#3).
Let's dig in:
The simulations are fully described in Nevin+2019.
There are several variations of the same function within the MergerMonger.py code (load_LDA*) that load up the predictor tables from the simulated galaxies and run the LDA. They return three things:
- The LDA results, which include information to standardize the predictors, the LDA coefficients, and the LD1 values of the simulated galaxies used as input,
- The terms selected by the random forest, and
- The dataframe used to create the LDA
An example of how to run the classification is provided in classify_SDSS.py. Here's the relevant code:
from MergerMonger import load_LDA_from_simulation
run = 'major_merger'
# load_LDA_from_simulation loads in a Table of predictor values from the simulation (LDA_merged_'+str(run)+'.txt')
LDA, RFR, df = load_LDA_from_simulation(run, verbose=False, plot=False)
print('~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Output from LDA~~~~~~~~~~~~~~~~~~~~~~~~~~~')
print('selected features', LDA[2])
print('coefficients', LDA[3])
print('intercept', LDA[4])
print('accuracy, precision, and recall for simulated galaxies [5-7]', LDA[5], LDA[6], LDA[7])
print('Standardized means LDA[0]', LDA[0])
print('standardized stds LDA[1]', LDA[1])
print('~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
print('~~~~~~~~~~~~~~~~~Output from RFR~~~~~~~~~~~~~~~~~~~~~~~~~~~')
print(RFR)
print('~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
# All of the dimensions of `LDA':
# 0 = standardized means on all of the coefficients
# 1 = standardized stds
# 2 = inputs
# 3 = coefficients
# 4 = intercept
# 5 = Accuracy
# 6 = Precision
# 7 = Recall
# 8 = LDA values of all simulated galaxies
# 9 = myr
# 10 = myr_non
# 11 = covariance matrix
# 12 = means of all classes
Other versions of the load_LDA_from_simulation function are:
-load_LDA_from_simulation_sliding_time, which allows you to adjust the time of post-coalescence cut-off
-load_LDA_from_simulation_changing_priors, which allows you to input the starting prior, we used this to explore how changing the input prior alters the measured merger fraction
-load_LDA_from_simulation_changing_priors_changing_validation_set, which allows you to play around with the relative fraction of mergers in the validation set
Within util_smelter.py, there are a number of utilities for creating massive tables of predictor values. I also include some code (analysis/lookup_galaxy_classify_and_diagnostic_plotting.py) for visualizing individual galaxies and their predictor values like this:
The basic process (again, see this example code) involves using util_SDSS.py to download SDSS frame images and then using util_smelter.py to measure imaging predictor values from these images. This process utilizes a combination of Source Extractor, Galfit, and statmorph to measure Gini, M20, Concentration, Asymmetry, Clumpiness, Sersic n, Shape Asymmetry, and average S/N value for SDSS r-band images that are downloaded using wget. Within util_smelter.py there are some tools for automatically downloading and then options for deleting SDSS frame images as you go, which is nice because these files are huge.
Here's an example of how to use these utilities to download galaxies from a list of IDs, make cutout images, measure the imaging predictor values for these images, and save them to a table that is the input for step #3:
import sys
sys.path.insert(0,"..")
from utils.util_SDSS import SDSS_objid_to_values, download_galaxy
from utils.util_smelter import get_predictors
import pandas as pd
# Path to where you're storing all of the tables:
prefix = '/Users/rnevin/Documents/MergerMonger-Public/tables/'
# Load in the predictor table from SDSS with all of the flags
df_predictors = pd.io.parsers.read_csv(prefix+'SDSS_predictors_all_flags_plus_segmap.txt', sep='\t')
#df_predictors.columns = ['ID','Sep','Flux Ratio', 'Gini','M20','Concentration (C)','Asymmetry (A)','Clumpiness (S)','Sersic N','Shape Asymmetry (A_S)', 'Sersic AR', 'S/N', 'Sersic N Statmorph', 'A_S Statmorph']
if len(df_predictors.columns) ==15: #then you have to delete the first column which is an empty index
df_predictors = df_predictors.iloc[: , 1:]
# Change the type of the ID in predictors:
df_predictors = df_predictors.astype({'ID': 'int64'})#.dtypes
# Choose only clean galaxies (not flagged):
df_no_flags = df_predictors[(df_predictors['low S/N']==0) & (df_predictors['outlier predictor']==0) & (df_predictors['segmap']==0)]
print('length of no flags', len(df_no_flags))
# Select the first 200 IDs
ID_list = df_no_flags['ID'].values[0:200]
# Alternately, you could just input a list of your own IDs:
# i.e., these are GalaxyZoo spirals that are classified as non-mergers by my major merger classifier:
# ID_list = [1237655693557170541, 1237648673456455904, 1237668649315008771, 1237668649315074455, 1237668649315205346, 1237668568247370040, 1237668649852076495, 1237668649852207363, 1237668568784437849, 1237668568784437611, 1237668568784503051, 1237668568784503170, 1237668650389340761, 1237668650926342749, 1237668650926342753, 1237668650926408199, 1237668569858638234, 1237668650926539340, 1237668569858703829, 1237668569858703926]
# Go through and look up the RAs and decs, alternately enter in your own:
ra_dec_lookup = pd.read_csv(prefix + 'five_sigma_detection_saturated_mode1_beckynevin.csv')
RA_list = []
dec_list = []
for j in range(len(ID_list)):
counter = 0
for i in range(len(ra_dec_lookup)):
if ID_list[j]==ra_dec_lookup['objID'].values[i]:
RA_list.append(ra_dec_lookup['ra'].values[i])
dec_list.append(ra_dec_lookup['dec'].values[i])
counter+=1
if counter==0:
RA_list.append(0)
dec_list.append(0)
for i in range(len(ID_list)):
# First, uses wget to download the frame corresponding to this individual galaxy.
# SO YOU MUST HAVE WGET INSTALLED
# Makes a cutout in arcsec corresponding to 'size'
# Also automatically removes the frame once you're done
# Returns a 2D array that is the image cutout of this galaxy
img = download_galaxy(id, ra, dec, prefix + '../downloaded_images/', size, remove=True)
# Now measure the predictors
# The function get_predictors writes out a single line in a data table with all of the predictor values for a single galaxy image:
preds = get_predictors(id, img, prefix + '../', size)
# The output is:
# return camera_data, segmap, gini, m20, con, asy, clu, ser, n, inc, s_n
# 0 is the image, 1 is the segmentation map used to measure the imaging predictors, gini, m20, C, A, S, sersic n, inclination from galfit, and S/N from statmorph
Examples of running this for a list of SDSS IDs can be seen in analysis/lookup_galaxy_classify_and_diagnostic_plotting.py, where you have the option to measure predictor values. find_galaxy.py also has an option to download galaxies by ID but use their predictor values from the existing table.
Also in util_smelter.py, I include utilities that I used on the supercomputer to run the full SDSS dataset in parallel.
The final step is to apply the LDA classification derived in step #1 to the predictor value tables generated in step #2 to obtain LDA and the accompanying probability values for the different classifications. Don't worry, I've done this for you, and these tables are available on Zenodo (see the tables/ folder). This section gives an overview of how to re-derive LDA and probability values.
Here's a schematic of how this works:
Code example of how to generate the output probability tables (.txt) are provided in classify_SDSS.py. Here's a snippet:
from MergerMonger import classify
# After running step #1:
LDA, p_merg, CDF = classify('../Tables/','../frames/',type_gal, run, LDA, RFR, df, 10000, verbose=True, all = False, cut_flagged = True)
# The classify utility in MergerMonger.py loads up the predictor value table (SDSS_predictors_all.txt),
# standardizes all values,
# plugs them into the LD1 formula for a given classification,
# and saves an LDA table (LDA_out_all_SDSS_predictors....txt) with the LD1 and probability values.
I also include some code for interpreting these probability values (analysis/lookup_galaxy_classify_and_diagnostic_plotting.py). This plots diagnostic images (see above) of individual galaxies with their predictor values, as well as their LD1 values, probability, and CDF values for a given LDA classification. This diagnostic plot also shows the three leading (most influential) LD1 terms alongside their contribution to LD1 (standardized predictor value * coefficient). This is described in the paper and is useful for interpreting why a given galaxy is classified a certain way.
You must have this stuff installed and updated for everything to run smoothly:
- Galfit (directions for installing Galfit and Source Extractor are provided in the imaging/ folder)
- Source Extractor
- Statmorph
- wget
- python packages
Use conda and the requirements.txt file to complete #5 via setting up a conda environment -
conda create --name <env> --file requirements.txt
And then in order to run any of the code in this repo, activate your conda environment first:
conda activate <env>