Read .fits data


[ ]:
DATA_DIR = './database/dataset_exoplanets_confirmed'

Our database, storage on DATA_DIR, can be found on CoRoT IAS Archive website or in my Google Drive folder, which for now is structured as …

./database
   │
   └── exoplanets_confirmed
         │
         └── dataset_exoplanets_confirmed
               │
               ├── ..._.fits
               ├── ..._.fits
               ├── ..._.fits
               ⋮       ⋮
               └── ..._n.fits

In database/dataset_exoplanets_confirmed has all .fits files of CoRoT targets with confirmed exoplanets, obtained in CoRoT IAS Archive.

The main purpose of reading .fits data-set is to turn them into a pandas DataFrame. For this, it was decided to use Astropy==4.2 library.

[ ]:
from astropy.io import fits

image_file = fits.open(DATA_DIR + '/EN2_STAR_CHR_0101086161_20070516T060226_20071005T074409.fits')

print(type(image_file), "\n")
print(len(image_file))
<class 'astropy.io.fits.hdu.hdulist.HDUList'>

4

Now, in the variable image_file we have all the information storage into .fits file.

It is great to know how this file type works, so in order to know better the data, the type METHOD returns that image_file belongs to HDUList class.

A little breath for FITS: The Flexible Image Transport System (FITS) is a portable file that it is used for store, transmit and process data formatted as multi-dimensional arrays or tables. FITS is widely used in the astronomy community to store images and tables. We use the software `QFits View <https://www.mpe.mpg.de/~ott/QFitsView/>`__ to open this files in our machine and, as soon as it opens, it is ask to choose an extension.

So, .fits files can have several extensions, in CoRoT database case, by the len command, it only has 4. We are only interessed in the most informative extension with the minimal computacional cost, since we are not interessed in dealing with big-data issues, the third table is the most attractive one.

[ ]:
scidata = image_file[3].data

print(type(scidata), "\n")
print(scidata)
<class 'astropy.io.fits.fitsrec.FITS_rec'>

[(54236.75758185, 112626.77,   0) (54236.76350826, 112605.61,   0)
 (54236.76943468, 112771.5 ,   8) ... (54378.80910033, 112496.13, 256)
 (54378.81502574, 112344.83,   0) (54378.82095114, 112318.5 ,  80)]

The variable scidata contains the information of the third table. We can see that is belongs to the FITS_rec class and, to manipulate this type, it is decided to transform it into an array, using NumPy==1.19.5 library and them plot the raw data, using Plotly==4.4.1.

At this point, we come to a problem that is worth commenting on. The original file, when transformed into an array, according to the code: x = np.array(scidata) presents the following error when being plotted, using the Plotly library: ValueError: Big-endian buffer not supported on little-endian compiler This error occurs when the data being worked on was created on a machine with a different byte order than the one on which we are running Python. To deal with this problem, we must convert the NumPy array to the byte order of the native system before transforming it into a DataFrame or Series. Therefore, we use the methods byteswap() e newbyteorder(). Reference: Pandas Documentation - Byte-ordering issues

[ ]:
import numpy as np

data_aux = np.array(scidata).byteswap().newbyteorder()
[ ]:
import plotly.express as px

fig = px.line(data_aux, x='DATEBARTT', y='WHITEFLUXSYS', title='Raw Light Curve')
fig.show()

image.png

Knowing and fixing the header


With normalized data, stored in data_normalized, we transform them into Pandas==1.1.5 DataFrame and now we are going to understand what DATEBARRT, WHITEFLUXSYS and STATUSSYS means.

Reference: Part II.4 The “ready to use” CoRoT data from The CoRoT Legacy Book

[ ]:
import pandas as pd

data = pd.DataFrame(data_aux)
data.head()
DATEBARTT WHITEFLUXSYS STATUSSYS
0 54236.757582 112626.773438 0
1 54236.763508 112605.609375 0
2 54236.769435 112771.500000 8
3 54236.775361 113113.601562 0
4 54236.781288 112621.789062 256

Quick data analysis

We will do an data analysis in our light curves, aiming to prepare them for the filtering techniques and future Machine Learning model.

We can see by the shape method that, in this sample, there are 23951 rows and 3 columns.

[ ]:
(row, columns) = data.shape
print(row, columns)
23951 3

The isnull().values.any() checks if there’s any Not a Number (NaN) value on our data and as it returned False there is not missing values.

Note. No Machine Learning model can work with NaN values.

[ ]:
data.isnull().values.any()
False

The describe() method returns a DataFrame with the statistical summary of the data. Thus, we are aware of parameters such as Average, Standard Deviation, Minimum Value and Maximum Value of the values.

[ ]:
data.describe().T
count mean std min 25% 50% 75% max
DATEBARTT 23951.0 54307.807634 41.011608 54236.757582 54272.313222 54307.818667 54343.325513 54378.820951
WHITEFLUXSYS 23951.0 112501.031250 292.866730 111110.515625 112276.492188 112529.890625 112724.828125 113609.890625
STATUSSYS 23951.0 101.434429 204.139657 0.000000 0.000000 0.000000 256.000000 1024.000000

STATUSSYS

It represents the Flag of the status in a int type. As this information is not useful in this research, we will just discart it.

[ ]:
data.drop('STATUSSYS', axis=1, inplace=True)
data.head()
DATEBARTT WHITEFLUXSYS
0 54236.757582 112626.773438
1 54236.763508 112605.609375
2 54236.769435 112771.500000
3 54236.775361 113113.601562
4 54236.781288 112621.789062

​​WHITEFLUXSYS

It represents the amount of white light flux captured by CoRoT, after the correction of the SYSTEMATIC.

SYSTEMATIC is a data corrections procedure applied to Faint stars data. It is formed by the correction of residual systematics skews in the whole set of light curves of the run added to BARFILL.

The BARFILL is composed of correction of the jumps and replacement of the invalid and missing data using the Inpainting method added to BAR.

The BAR os the process where it occurs the correction from aliasing, offsets, backgrounds and of the jitter of the satellite and correction of the change of the temperature set point and of the loss of long-term efficiency.

DATEBARTT

It represents the date of the measurement in the solar barycentric reference frame. It is usually measured in Julian Days.

The method of counting days sequentially, starting at an arbitrary date in the past, determined at noon on January 1, 4713 BC by the Julian calendar, or November 24, 4714 BC, by the Gregorian calendar, became known as the Julian Date or Julian Day (JD), proposed by Joseph Justus Scaliger and based on the Julian calendar.

Julian Days are counted continuously, without separating them into days, weeks, months or even years. Each day starts at noon and lasts until the next noon. Thus, there is a great advantage in the night period (where Astronomical observations are made) as there are no time interruptions in this period, unlike the Gregorian Caledary in which the day starts at midnight and ends at midnight following. In addition, with the adoption of Julian Day in astronomical activities, it becomes much easier to measure the period between two events since it is only necessary to subtract one DJ from the other.

With that in mind, it is useful to turn DATEBARTT into convetional date form (day, month and year).

So, we are going to define a function that will transform Julian date to standard time and them apply to data.

Note. We are using julian==0.14.

[ ]:
import julian
from datetime import datetime

def julian_to_stdtime(old_date):
  """
  This is the function used to convert Julian
  date to Gregorian date

  :param numpy.float64 old_date: Represents Julian date of float type
  """
  aux_1 = julian.from_jd(old_date, fmt='mjd')

  try:
    aux_2 = datetime.strptime(str(aux_1), '%Y-%m-%d %H:%M:%S.%f')
  except:
    aux_2 = datetime.strptime(str(aux_1), '%Y-%m-%d %H:%M:%S')

  new_date = str(aux_2)

  return new_date
[ ]:
data.DATEBARTT = data.DATEBARTT.apply(lambda x: julian_to_stdtime(x))

data.rename(columns={'DATEBARTT': 'DATE'}, inplace = True)
data.rename(columns={'WHITEFLUXSYS': 'WHITEFLUX'}, inplace = True)

data.head()
DATE WHITEFLUX
0 2007-05-16 18:10:55.071642 112626.773438
1 2007-05-16 18:19:27.113766 112605.609375
2 2007-05-16 18:27:59.155929 112771.500000
3 2007-05-16 18:36:31.198092 113113.601562
4 2007-05-16 18:45:03.240256 112621.789062

Now, we can see our x-axis in Year/Month/Day format.

[ ]:
fig = px.line(data, x='DATE', y='WHITEFLUX', title='Light Curve in standard time format')
fig.show()

2.png

Algorithms


Here, the fits_to_csv is implemented for each light curve and it is the sumary of all the work we did until Section Sampling Light Curve but expanded to all files into our database.

Note. fits_to_csv use julian_to_stdtime function (Knowing and fixing the header - DATEBARTT)

[ ]:
# Libs already imported
from astropy.io import fits
import numpy as np
import plotly.express as px
import pandas as pd
import julian
from datetime import datetime

# Libs used for files/folder manipulation
import os
import shutil

# Lib for mesure the execution time
import time
[ ]:
def fits_to_csv(path):
  '''
  Normalize one .fits files and convert it
  into a .csv file

  :param str path: Path to .fits data-set folder
  '''
  image_file = fits.open(path)
  scidata = image_file[3].data
  data_aux = np.array(scidata).byteswap().newbyteorder()
  data = pd.DataFrame(data_aux)
  data.drop('STATUSSYS', axis=1, inplace=True)
  data.rename(columns={'DATEBARTT': 'DATE'}, inplace = True)
  data.rename(columns={'WHITEFLUXSYS': 'WHITEFLUX'}, inplace = True)

  # Creating folder with .csv files
  CSV_DIR = 'csv_files'
  if not os.path.isdir(CSV_DIR):
    os.mkdir(CSV_DIR)

  # Renaming .csv file
  name = path[path.rfind('/')+1:path.rfind('.')] + '.csv'
  name = name.split('_')[3] + "_" + name.split('_')[4] + ".csv"

  data.to_csv(name, index=False)

  # Move to .csv folder
  shutil.move(name, CSV_DIR)

Applying fits_to_csv() to ./database.

[ ]:
i = 0
my_dir = DATA_DIR
t_o = time.time()

for root_dir_path, sub_dirs, files in os.walk(my_dir):
  for i in range(0, len(files)):
    fits_to_csv( my_dir + os.path.abspath(files[i])[os.path.abspath(files[i]).rfind('/'):] )

t_f = time.time()

print("It takes:", round(t_f-t_o, 2), "seconds to apply")
It takes: 39.21 seconds to apply

Updating folder structure

Now that we created csv_files folder, our ./database folder will be..

./database
   │
   └── exoplanets_confirmed
         │
         └── dataset_exoplanets_confirmed
         │     │
         │     ├── ..._.fits
         │     ├── ..._.fits
         │     ├── ..._.fits
         │     ⋮       ⋮
         │     └── ..._n.fits
         │
         └── csv_files
               ├── ..._.csv
               ├── ..._.csv
               ⋮       ⋮
               └── ..._k.csv

Zipping csv_files folder

[ ]:
!zip -r /content/csv_files.zip /content/csv_files

Downloading zipped folder

[ ]:
from google.colab import files
files.download("csv_files.zip")

Preprocessing data


Extracting informations about dataset

[ ]:
import os
import pandas as pd
[ ]:
DATA_DIR = '/content/drive/MyDrive/01 - Iniciação Científica/02 - Datasets/exoplanets_confirmed/csv_files'

sample_rates = []
sample_size = []

for root_dir_path, sub_dirs, files in os.walk(DATA_DIR):
    for j in range(0, len(files)):
        if files[j].endswith('.csv'):
            path = root_dir_path + "/" + files[j]

            data = pd.read_csv(path)
            data.DATE = pd.to_datetime(data.DATE)

            sample_rates.append(data.DATE.diff().min())
            sample_size.append(len(data.DATE))
[ ]:
from statistics import median

print("Minimum size:", min(sample_size))
print("Median size:", median(sample_size))
print("Maximum size:", max(sample_size))
print()
print(sorted(sample_size))
Minimum size: 584
Median size: 15050
Maximum size: 25631

[584, 584, 584, 584, 4097, 4097, 4097, 5356, 9228, 9734, 12930, 12930, 12930, 13062, 14101, 14101, 14102, 14725, 15050, 15051, 18840, 19345, 19346, 21548, 22168, 22169, 22169, 23947, 23951, 23952, 24403, 24403, 24404, 24404, 24404, 24456, 25631]
[ ]:
print("Minimum size:", min(sample_rates))
print("Median size:", median(sample_rates))
print("Maximum size:", max(sample_rates))

# sample_rates
Minimum size: 0 days 00:08:31.935304
Median size: 0 days 00:08:31.958318
Maximum size: 0 days 00:08:32.036530

It’s notable that the sampling time doesn’t vary much, so let’s default all to median sample time and resampling the entire dataset

[ ]:
sample_time = median(sample_rates)
sample_size = median(sample_size)

print(f'The new sample time is {sample_time}, which will contain {sample_size} points')
The new sample time is 0 days 00:08:31.958318, which will contain 15050 points

Resampling dataset

[ ]:
print("All curves will have:" , sample_size, "points")
All curves will have: 15050 points
[ ]:
import scipy.signal as scs

DF = pd.DataFrame()

DATA_DIR = '/content/drive/MyDrive/01 - Iniciação Científica/02 - Datasets/exoplanets_confirmed/csv_files'
count = 0

for root_dir_path, sub_dirs, files in os.walk(DATA_DIR):
    for j in range(0, len(files)):
        if files[j].endswith('.csv'):
            path = root_dir_path + "/" + files[j]
            data = pd.read_csv(path)

            # If sample size is less than 584, the lightcurve will be discarded
            if (data.DATE.size != 584):
              flux = data.WHITEFLUX
              time = data.DATE

              flux_resampled, time_resampled = scs.resample(flux, sample_size, time)

              # Creating a new pd.DataFrame
              concat_dict = {
                  "DATE": pd.Series(time_resampled),
                  "WHITEFLUX": pd.Series(flux_resampled)
              }
              data_resampled = pd.concat(concat_dict, axis=1)

              # Creating folder with lightcurves resampled
              RESAMPLED_DIR = 'resampled_files'
              if not os.path.isdir(RESAMPLED_DIR):
                os.mkdir(RESAMPLED_DIR)

              # Renaming lightcurve
              file_name = 'RESAMPLED_' + files[j]

              # Saving lightcurves resampled
              FILE_DIR = file_name

              data_resampled.to_csv(file_name, index=False)
              shutil.move(FILE_DIR, RESAMPLED_DIR)

              count += 1
              print('Resampled and saved: ' + files[j])
print("\nTotal of files resampled:", count)
Resampled and salved: EN2_STAR_CHR_0102708694_20071023T223035_20080303T093502.csv
Resampled and salved: EN2_STAR_CHR_0102764809_20071023T223035_20080303T093502.csv
Resampled and salved: EN2_STAR_MON_0102725122_20071023T223035_20080303T093534.csv
Resampled and salved: EN2_STAR_MON_0102671819_20071023T223035_20080303T093502.csv
Resampled and salved: EN2_STAR_CHR_0101086161_20070516T060226_20071005T074409.csv
Resampled and salved: EN2_STAR_CHR_0101206560_20070516T060226_20071005T074409.csv
Resampled and salved: EN2_STAR_CHR_0101368192_20070516T060050_20071015T062306.csv
Resampled and salved: EN2_STAR_MON_0221686194_20081011T143035_20081112T081512.csv
Resampled and salved: EN2_STAR_MON_0105118236_20100708T204534_20100924T063628.csv
Resampled and salved: EN2_STAR_MON_0100725706_20070516T060226_20071005T074409.csv
Resampled and salved: EN2_STAR_MON_0105228856_20100408T223049_20100705T044435.csv
Resampled and salved: EN2_STAR_MON_0310247220_20090403T220030_20090702T022725.csv
Resampled and salved: EN2_STAR_CHR_0652180991_20110708T151253_20110930T044950.csv
Resampled and salved: EN2_STAR_MON_0311519570_20090403T220030_20090702T022757.csv
Resampled and salved: EN2_STAR_MON_0630831435_20110708T151253_20110930T044950.csv
Resampled and salved: EN2_STAR_MON_0652180928_20110708T151253_20110930T044950.csv
Resampled and salved: EN2_STAR_MON_0110839339_20081116T190224_20090311T103233.csv
Resampled and salved: EN2_STAR_MON_0110864907_20081116T190224_20090311T103233.csv
Resampled and salved: EN2_STAR_MON_0300001097_20081116T190224_20090308T103056.csv
Resampled and salved: EN2_STAR_IMAG_0102671819_20120112T183055_20120329T093058.csv
Resampled and salved: EN2_STAR_IMAG_0102708694_20120112T183055_20120329T093058.csv
Resampled and salved: EN2_STAR_IMAG_0102725122_20120112T183055_20120329T093058.csv
Resampled and salved: EN2_STAR_CHR_0102890318_20070206T133547_20070402T070302.csv
Resampled and salved: EN2_STAR_CHR_0102912369_20070203T130553_20070402T070126.csv
Resampled and salved: EN2_STAR_CHR_0105819653_20080415T231048_20080907T224903.csv
Resampled and salved: EN2_STAR_CHR_0105833549_20080415T231048_20080907T224903.csv
Resampled and salved: EN2_STAR_CHR_0105891283_20080415T231048_20080907T224903.csv
Resampled and salved: EN2_STAR_CHR_0106017681_20080415T231048_20080907T224903.csv
Resampled and salved: EN2_STAR_MON_0105793995_20080415T231048_20080907T224903.csv
Resampled and salved: EN2_STAR_CHR_0315198039_20100305T001525_20100329T065610.csv
Resampled and salved: EN2_STAR_CHR_0315211361_20100305T001525_20100329T065610.csv
Resampled and salved: EN2_STAR_CHR_0315239728_20100305T001525_20100329T065610.csv
Resampled and salved: EN2_STAR_MON_0105209106_20080415T231048_20080907T230359.csv
Total of files resampled: 33

Updating folder structure

Now that we created resampled_files folder, our ./database folder will be..

./database
   │
   └── exoplanets_confirmed
         │
         └── dataset_exoplanets_confirmed
         │     │
         │     ├── ..._.fits
         │     ├── ..._.fits
         │     ├── ..._.fits
         │     ⋮       ⋮
         │     └── ..._n.fits
         │
         └── csv_files
         │     ├── ..._.csv
         │     ├── ..._.csv
         │     ⋮       ⋮
         │     └── ..._k.csv
         │
         └── resampled_files
               ├── ..._.csv
               ├── ..._.csv
               ⋮       ⋮
               └── ..._v.csv

Zipping resampled_files folder

[ ]:
!zip -r /content/resampled_files.zip /content/resampled_files

Downloading zipped folder

[ ]:
files.download("resampled_files.zip")