Machine Learning Data


Now we are going to apply some ML algorithms to lightcurves from three classes:

  • 0: Confirmed Exoplanets

  • 1: Eclipsing Binaries

  • 2: Non Eclipsed

With that in mind, the data for classes (1) and (2) will be downloaded from the CoRoT Public Archive and converted to CSV files, as was done for class (0), Confirmed Exoplanets, in 01 - Manipulating fits files.

SkTime

[ ]:
import pandas as pd
import numpy as np
import os

!pip install control
from tools import *

Preprocessing data

Creating matrix of features (CoRoT targets with confirmed exoplanets)

[ ]:
# DATA_DIR = 'C:/Users/guisa/Google Drive/01 - Iniciação Científica/02 - Datasets/csv_files'
# DATA_DIR = '/content/drive/MyDrive/01 - Iniciação Científica/02 - Datasets/csv_files'
DATA_DIR = '/content/drive/MyDrive/01 - Iniciação Científica/02 - Datasets/resampled_files'
[ ]:
X = pd.DataFrame()

for root_dir_path, sub_dirs, files in os.walk(DATA_DIR):
    for file_name in files:
        # Skip non-data files; `!= ('a' and 'b')` would only compare against 'b'
        if file_name not in ('desktop.ini', 'csv_files.rar'):
            # File path
            path = os.path.join(root_dir_path, file_name)

            # Reading data
            # print(path)
            data = pd.read_csv(path)
            flux = data.WHITEFLUX

            # Add timeseries to pd.DataFrame
            X = X.append([[flux]], ignore_index=True)
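`DataFrame.append` was removed in pandas 2.0, so the loop above only runs on older pandas versions. A minimal sketch of the same idea for current pandas: collect each flux series first, then build the one-column nested frame at the end. The helper name `load_nested_lightcurves`, its `skip` argument, and the synthetic CSV files are assumptions made for illustration; only the `WHITEFLUX` column name and the skipped file names come from the notebook.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def load_nested_lightcurves(data_dir, skip=("desktop.ini", "csv_files.rar")):
    """Read every CSV under data_dir into a one-column nested DataFrame."""
    series = []
    for root_dir_path, _sub_dirs, files in os.walk(data_dir):
        for file_name in sorted(files):
            if file_name in skip:
                continue
            data = pd.read_csv(os.path.join(root_dir_path, file_name))
            series.append(data["WHITEFLUX"])

    # An object array keeps each Series as a single cell instead of
    # letting pandas expand it into columns.
    cells = np.empty(len(series), dtype=object)
    for i, flux in enumerate(series):
        cells[i] = flux
    return pd.DataFrame({"time_series": cells})


# Synthetic stand-in for the resampled CoRoT files (hypothetical data).
with tempfile.TemporaryDirectory() as tmp_dir:
    rng = np.random.default_rng(0)
    for k in range(3):
        flux = 1000.0 + rng.normal(size=50)
        pd.DataFrame({"WHITEFLUX": flux}).to_csv(
            os.path.join(tmp_dir, f"target_{k}.csv"), index=False
        )
    X_demo = load_nested_lightcurves(tmp_dir)

print(X_demo.shape)
```

Building the frame once at the end also avoids copying the whole DataFrame on every iteration.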
[ ]:
X.columns = ['time_series']
X.head()
[ ]:
X.iloc[0][0]
[ ]:
X.shape

Labeling matrix of features

  • 0: confirmed_exoplanets

  • 1: eclipsing_binaries

  • 2: none

[ ]:
# One label per lightcurve (row); every target loaded so far is class 0
labels = np.zeros(len(X), dtype='int')
labels
[ ]:
y = pd.Series(labels)
y.head()
[ ]:
y.shape

Creating dataset, X and y

[ ]:
# Creating pd.DataFrame with X data and the chosen columns
df = pd.DataFrame(X, columns=['time_series', 'label'])

# Adding labels
df.label = y

df.head()

How many labels do we have?

[ ]:
labels, counts = np.unique(y, return_counts=True)
print('Labels =', labels, '\nCounts =', counts)

Machine Learning - SkTime

https://github.com/alan-turing-institute/sktime/tree/v0.4.3

https://github.com/alan-turing-institute/sktime/blob/main/sktime/classification/compose/__init__.py

Preliminaries

[ ]:
# !pip install sktime[all_extras]

Splitting the dataset into the Training set and Test set

[ ]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=True, random_state=42)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
(24, 1) (24,) (9, 1) (9,)
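Once all three classes are loaded, a plain shuffled split of so few lightcurves can leave a class under-represented in the test set. A minimal sketch of a stratified split; the class sizes below are made up for illustration and are not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical label vector: 24 exoplanets (0), 12 binaries (1), 12 others (2).
y_demo = np.repeat([0, 1, 2], [24, 12, 12])
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, shuffle=True, random_state=42, stratify=y_demo
)

# Each class keeps its 2:1:1 proportion in both subsets.
print(np.bincount(y_tr), np.bincount(y_te))
```

Passing `stratify=y_demo` is the only change from the call used above.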

Time Series Classification

https://towardsdatascience.com/sktime-a-unified-python-library-for-time-series-machine-learning-3c103c139a55

[ ]:
from sktime.classification.all import TimeSeriesForestClassifier

classifier = TimeSeriesForestClassifier()
classifier.fit(X_train, y_train)
TimeSeriesForestClassifier()
[ ]:
from sklearn.metrics import accuracy_score

y_pred = classifier.predict(X_test)
accuracy_score(y_test, y_pred)
1.0

Feature extraction

[ ]:
import warnings
warnings.filterwarnings('ignore', 'statsmodels.tsa.ar_model.AR', FutureWarning)
[ ]:
from sktime.transformations.panel.tsfresh import TSFreshFeatureExtractor

transformer = TSFreshFeatureExtractor(default_fc_parameters="minimal")

extracted_features = transformer.fit_transform(X_train)
extracted_features.head()

Feature Extraction:   0%|          | 0/5 [00:00<?, ?it/s]
Feature Extraction:  20%|██        | 1/5 [00:00<00:00,  7.32it/s]
Feature Extraction: 100%|██████████| 5/5 [00:00<00:00, 18.29it/s]
    time_series__sum_values  time_series__median  time_series__mean  time_series__length  time_series__standard_deviation  time_series__variance  time_series__root_mean_square  time_series__maximum  time_series__minimum
 0             2.124862e+09        141218.330072      141186.836848              15050.0                       282.683706           79910.077814                  141187.119841         142021.360654         138999.945791
 4             6.136100e+08         40704.273331       40771.427763              15050.0                       560.265206          313897.101078                   40775.277055          44921.695568          39239.685767
16             6.258849e+08         41653.547836       41587.034197              15050.0                       364.389989          132780.064141                   41588.630578          42472.076073          40477.653465
 5             4.512183e+09        299686.539820      299812.826420              15050.0                       613.298072          376134.525086                  299813.453701         302311.608084         297465.565232
13             5.857504e+08         38822.561668       38920.292034              15050.0                       553.820584          306717.239608                   38924.232160          40531.501716          37855.665849
[ ]:
# If the result is 1, every series in the dataset has the same length

extracted_features.time_series__length.nunique()
1
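The nine columns produced by the "minimal" setting are simple summary statistics. A NumPy sketch that computes the same quantities for one series, assuming the standard deviation and variance are population values (`ddof=0`). The function name `minimal_features` is hypothetical and this is not tsfresh's own code.

```python
import numpy as np


def minimal_features(x):
    """Summary statistics matching the column names shown above."""
    x = np.asarray(x, dtype=float)
    return {
        "sum_values": x.sum(),
        "median": np.median(x),
        "mean": x.mean(),
        "length": float(x.size),
        "standard_deviation": x.std(),  # population std (ddof=0), assumed
        "variance": x.var(),            # population variance (ddof=0), assumed
        "root_mean_square": np.sqrt(np.mean(x ** 2)),
        "maximum": x.max(),
        "minimum": x.min(),
    }


features = minimal_features([1.0, 2.0, 3.0, 4.0])
print(features)
```

Because every lightcurve here has 15050 samples, the `length` column carries no information for the classifier.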

Time Series Classification with Feature Extraction

[ ]:
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier

classifier = make_pipeline(
    TSFreshFeatureExtractor(show_warnings=False), RandomForestClassifier()
)
classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)

Feature Extraction:   0%|          | 0/5 [00:00<?, ?it/s]
Feature Extraction:  20%|██        | 1/5 [01:28<05:53, 88.44s/it]
Feature Extraction:  40%|████      | 2/5 [03:13<04:40, 93.37s/it]
Feature Extraction:  60%|██████    | 3/5 [04:51<03:09, 94.94s/it]
Feature Extraction:  80%|████████  | 4/5 [06:30<01:36, 96.16s/it]
Feature Extraction: 100%|██████████| 5/5 [07:48<00:00, 93.60s/it]

Feature Extraction:   0%|          | 0/5 [00:00<?, ?it/s]
Feature Extraction:  20%|██        | 1/5 [00:45<03:02, 45.55s/it]
Feature Extraction:  40%|████      | 2/5 [01:17<02:03, 41.33s/it]
Feature Extraction:  60%|██████    | 3/5 [01:52<01:18, 39.44s/it]
Feature Extraction:  80%|████████  | 4/5 [02:32<00:39, 39.77s/it]
Feature Extraction: 100%|██████████| 5/5 [02:50<00:00, 34.17s/it]
1.0
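Extracting the full tsfresh feature set took several minutes above. A lighter sketch of the same pipeline shape, swapping the tsfresh extractor for a few hand-written summary statistics via scikit-learn's `FunctionTransformer`. This is a stand-in, not the notebook's method; the synthetic two-class data and the `summary_stats` helper are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def summary_stats(X):
    """Map each row (one lightcurve) to [mean, std, min, max]."""
    X = np.asarray(X, dtype=float)
    return np.column_stack(
        [X.mean(axis=1), X.std(axis=1), X.min(axis=1), X.max(axis=1)]
    )


# Synthetic data: class 1 has a much larger scatter than class 0.
rng = np.random.default_rng(42)
X_quiet = rng.normal(loc=1000.0, scale=1.0, size=(20, 200))
X_noisy = rng.normal(loc=1000.0, scale=10.0, size=(20, 200))
X_demo = np.vstack([X_quiet, X_noisy])
y_demo = np.repeat([0, 1], 20)

model = make_pipeline(
    FunctionTransformer(summary_stats),
    RandomForestClassifier(n_estimators=50, random_state=42),
)
model.fit(X_demo[::2], y_demo[::2])               # even rows for training
score = model.score(X_demo[1::2], y_demo[1::2])   # odd rows for testing
print(score)
```

The trade-off is fewer, cheaper features; whether they separate real CoRoT classes is not something this toy example can show.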

SkLearn

[ ]:
import pandas as pd
import numpy as np
import os

!pip install control
from tools import *

Preprocessing data

Creating matrix of features (CoRoT targets with confirmed exoplanets)

[ ]:
# DATA_DIR = 'C:/Users/guisa/Google Drive/01 - Iniciação Científica/02 - Datasets/csv_files'
# DATA_DIR = '/content/drive/MyDrive/01 - Iniciação Científica/02 - Datasets/csv_files'
DATA_DIR = '/content/drive/MyDrive/01 - Iniciação Científica/02 - Datasets/resampled_files'
[ ]:
X = pd.DataFrame()

for root_dir_path, sub_dirs, files in os.walk(DATA_DIR):
    for file_name in files:
        # Skip non-data files; `!= ('a' and 'b')` would only compare against 'b'
        if file_name not in ('desktop.ini', 'csv_files.rar'):
            # File path
            path = os.path.join(root_dir_path, file_name)

            # Reading data
            # print(path)
            data = pd.read_csv(path)
            flux = data.WHITEFLUX

            # Add timeseries to pd.DataFrame
            X = X.append(flux, ignore_index=True)
[ ]:
X.head()
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 ... 15010 15011 15012 15013 15014 15015 15016 15017 15018 15019 15020 15021 15022 15023 15024 15025 15026 15027 15028 15029 15030 15031 15032 15033 15034 15035 15036 15037 15038 15039 15040 15041 15042 15043 15044 15045 15046 15047 15048 15049
0 1.411572e+05 1.412424e+05 1.411326e+05 1.413731e+05 1.412133e+05 1.413579e+05 1.411959e+05 1.412497e+05 1.413829e+05 1.414096e+05 1.412703e+05 1.412327e+05 1.411492e+05 1.412873e+05 1.412144e+05 1.411442e+05 1.412001e+05 1.412225e+05 1.413235e+05 1.412076e+05 1.411856e+05 1.411647e+05 1.412128e+05 1.413366e+05 1.412635e+05 1.413088e+05 1.413307e+05 1.411438e+05 1.412988e+05 1.411502e+05 1.411364e+05 1.414083e+05 1.410876e+05 1.411729e+05 1.412924e+05 1.413673e+05 1.412330e+05 1.411206e+05 1.413310e+05 1.411721e+05 ... 1.411859e+05 1.411025e+05 1.410474e+05 1.411354e+05 1.411319e+05 1.412317e+05 1.411159e+05 1.412205e+05 1.411621e+05 1.411025e+05 1.412063e+05 1.411235e+05 1.411927e+05 1.412480e+05 1.412074e+05 1.411396e+05 1.411576e+05 1.410457e+05 1.411399e+05 1.408885e+05 1.409958e+05 1.411260e+05 1.411951e+05 1.411575e+05 1.411057e+05 1.411124e+05 1.412933e+05 1.411076e+05 1.410541e+05 1.411741e+05 1.409121e+05 1.409989e+05 1.412270e+05 1.411838e+05 1.413012e+05 1.411474e+05 1.411009e+05 1.413782e+05 1.412287e+05 1.413100e+05
1 2.605181e+04 2.611330e+04 2.601663e+04 2.614152e+04 2.587125e+04 2.587146e+04 2.602901e+04 2.604010e+04 2.611140e+04 2.607349e+04 2.612061e+04 2.601961e+04 2.608381e+04 2.615512e+04 2.605209e+04 2.615393e+04 2.595893e+04 2.613356e+04 2.608572e+04 2.604287e+04 2.609076e+04 2.603981e+04 2.604673e+04 2.605018e+04 2.605711e+04 2.603637e+04 2.602297e+04 2.610288e+04 2.599291e+04 2.599706e+04 2.596696e+04 2.610334e+04 2.616852e+04 2.615221e+04 2.600709e+04 2.604340e+04 2.602915e+04 2.623798e+04 2.596439e+04 2.610146e+04 ... 2.621469e+04 2.632878e+04 2.624308e+04 2.622704e+04 2.618278e+04 2.624397e+04 2.632099e+04 2.627644e+04 2.625360e+04 2.632872e+04 2.625611e+04 2.633979e+04 2.629440e+04 2.627620e+04 2.635345e+04 2.628232e+04 2.634446e+04 2.636739e+04 2.625771e+04 2.648313e+04 2.638753e+04 2.626968e+04 2.623826e+04 2.630487e+04 2.624348e+04 2.638560e+04 2.620561e+04 2.630678e+04 2.627786e+04 2.616603e+04 2.631367e+04 2.620188e+04 2.618542e+04 2.624512e+04 2.625785e+04 2.642315e+04 2.621238e+04 2.636027e+04 2.629231e+04 2.618336e+04
2 1.298393e+06 1.299550e+06 1.299725e+06 1.299612e+06 1.299747e+06 1.299215e+06 1.299576e+06 1.299769e+06 1.299262e+06 1.299409e+06 1.299280e+06 1.299889e+06 1.299150e+06 1.299826e+06 1.298902e+06 1.299552e+06 1.299346e+06 1.298708e+06 1.299628e+06 1.299107e+06 1.299239e+06 1.299363e+06 1.299605e+06 1.299160e+06 1.299955e+06 1.299210e+06 1.299477e+06 1.299130e+06 1.299318e+06 1.298997e+06 1.299127e+06 1.299335e+06 1.299339e+06 1.299389e+06 1.299585e+06 1.299507e+06 1.298837e+06 1.299754e+06 1.298997e+06 1.300436e+06 ... 1.295584e+06 1.296259e+06 1.295880e+06 1.296397e+06 1.295613e+06 1.295237e+06 1.295789e+06 1.295417e+06 1.295453e+06 1.295508e+06 1.295937e+06 1.294957e+06 1.295125e+06 1.294599e+06 1.294709e+06 1.295073e+06 1.295429e+06 1.295154e+06 1.295264e+06 1.295769e+06 1.295695e+06 1.295337e+06 1.295557e+06 1.295314e+06 1.295710e+06 1.295153e+06 1.295031e+06 1.295029e+06 1.295460e+06 1.295186e+06 1.294849e+06 1.295283e+06 1.294897e+06 1.294750e+06 1.294881e+06 1.294939e+06 1.295167e+06 1.295158e+06 1.295069e+06 1.294454e+06
3 1.125213e+05 1.127580e+05 1.129430e+05 1.125623e+05 1.127893e+05 1.125752e+05 1.127852e+05 1.126351e+05 1.126462e+05 1.126747e+05 1.128206e+05 1.126230e+05 1.127497e+05 1.127325e+05 1.127567e+05 1.127885e+05 1.126766e+05 1.127609e+05 1.125398e+05 1.127966e+05 1.126471e+05 1.126480e+05 1.128501e+05 1.128040e+05 1.127078e+05 1.128669e+05 1.126771e+05 1.127147e+05 1.127916e+05 1.126816e+05 1.127761e+05 1.126781e+05 1.127678e+05 1.127868e+05 1.125911e+05 1.127481e+05 1.127409e+05 1.126717e+05 1.126739e+05 1.125557e+05 ... 1.124980e+05 1.125445e+05 1.125411e+05 1.125625e+05 1.124583e+05 1.125062e+05 1.123094e+05 1.126122e+05 1.126024e+05 1.123122e+05 1.125150e+05 1.124099e+05 1.124539e+05 1.123920e+05 1.124828e+05 1.124690e+05 1.125900e+05 1.125468e+05 1.123983e+05 1.125112e+05 1.124692e+05 1.124070e+05 1.125379e+05 1.124257e+05 1.125522e+05 1.124184e+05 1.125103e+05 1.123842e+05 1.126233e+05 1.123789e+05 1.123924e+05 1.123847e+05 1.125097e+05 1.125469e+05 1.124996e+05 1.125005e+05 1.124207e+05 1.124281e+05 1.124471e+05 1.123491e+05
4 4.064368e+04 4.024597e+04 4.043663e+04 4.031514e+04 4.029457e+04 4.023700e+04 4.029928e+04 4.044337e+04 4.056287e+04 4.024665e+04 4.043309e+04 4.036776e+04 4.041253e+04 4.034370e+04 4.067833e+04 4.021538e+04 4.045781e+04 4.032538e+04 4.037169e+04 4.034290e+04 4.028767e+04 4.024320e+04 4.023320e+04 4.024396e+04 4.034883e+04 4.034518e+04 4.028057e+04 4.029106e+04 4.048474e+04 4.029657e+04 4.035592e+04 4.021696e+04 4.020960e+04 4.021597e+04 4.026372e+04 4.027219e+04 4.024717e+04 4.023869e+04 4.038073e+04 4.031858e+04 ... 4.300994e+04 4.303990e+04 4.306673e+04 4.301858e+04 4.267300e+04 4.247712e+04 4.262169e+04 4.277094e+04 4.303920e+04 4.267297e+04 4.267512e+04 4.263271e+04 4.274761e+04 4.247607e+04 4.258747e+04 4.217027e+04 4.206883e+04 4.236280e+04 4.214163e+04 4.226997e+04 4.200494e+04 4.193790e+04 4.209874e+04 4.221375e+04 4.254759e+04 4.279462e+04 4.288303e+04 4.289924e+04 4.287694e+04 4.282583e+04 4.266600e+04 4.277455e+04 4.272497e+04 4.275998e+04 4.296628e+04 4.300132e+04 4.278209e+04 4.265919e+04 4.269699e+04 4.274217e+04

5 rows × 15050 columns

[ ]:
X.shape
(33, 15050)

Labeling matrix of features

  • 0: confirmed_exoplanets

  • 1: eclipsing_binaries

  • 2: none

[ ]:
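# X.size is rows * columns (33 * 15050), not the number of lightcurves;
# only the first len(X) labels survive the index-aligned assignment to df below.
labels = np.zeros(X.size, dtype='int')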
labels
array([0, 0, 0, ..., 0, 0, 0])
[ ]:
y = pd.Series(labels)
y.head()
0    0
1    0
2    0
3    0
4    0
dtype: int64
[ ]:
y.shape
(496650,)
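The shape `(496650,)` comes from `X.size`, which counts every cell (33 rows × 15050 columns) and not every lightcurve. A small sketch of the difference on a toy frame; `len(X)` gives one label per row.

```python
import numpy as np
import pandas as pd

# Toy frame with the same layout: one row per lightcurve, one column per sample.
X_demo = pd.DataFrame(np.ones((33, 5)))

print(X_demo.size)  # rows * columns
print(len(X_demo))  # rows only

# One label per lightcurve, sharing the frame's index.
labels_demo = np.zeros(len(X_demo), dtype=int)
y_demo = pd.Series(labels_demo, index=X_demo.index)
print(y_demo.shape)
```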

Creating dataset, X and y

[ ]:
# Creating pd.DataFrame with X data
df = pd.DataFrame(X)

# Adding labels
df['label'] = y

df.sample(5)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 ... 15011 15012 15013 15014 15015 15016 15017 15018 15019 15020 15021 15022 15023 15024 15025 15026 15027 15028 15029 15030 15031 15032 15033 15034 15035 15036 15037 15038 15039 15040 15041 15042 15043 15044 15045 15046 15047 15048 15049 label
7 30819.826000 30766.330239 30751.556425 30749.078411 30735.472031 30705.364719 30673.455704 30662.058557 30682.811774 30726.006400 30766.044428 30779.528462 30762.480430 30733.670298 30721.967523 30747.701120 30810.995926 30892.582773 30963.134137 30994.637790 30971.285524 30899.489245 30812.158646 30757.956592 30773.076489 30850.996851 30937.516699 30965.531315 30909.779524 30815.873666 30770.754493 30830.580119 30965.570660 31074.862790 31066.062397 30933.718685 30766.639144 30675.022829 30700.963010 30791.397016 ... 31257.034885 31264.336563 31300.065245 31342.684879 31375.525377 31386.046775 31362.258516 31301.359103 31225.968013 31185.069625 31225.200043 31348.360808 31494.844160 31574.995403 31532.203536 31387.986425 31230.251847 31151.297131 31183.599756 31282.844881 31368.501950 31386.865644 31347.153080 31306.651124 31321.815477 31405.234907 31518.720306 31602.739344 31617.101315 31564.063024 31481.233700 31413.127641 31382.205800 31377.323280 31364.368751 31310.075393 31203.441067 31062.404104 30923.186993 0
0 141157.216020 141242.434636 141132.564812 141373.143346 141213.262888 141357.927056 141195.854576 141249.723060 141382.882136 141409.646127 141270.347310 141232.742194 141149.203020 141287.347367 141214.443625 141144.196655 141200.096238 141222.533249 141323.471805 141207.648435 141185.552578 141164.668742 141212.764561 141336.627107 141263.521595 141308.810296 141330.700205 141143.809870 141298.757841 141150.205593 141136.397407 141408.260213 141087.606264 141172.925450 141292.442324 141367.329433 141232.994283 141120.579153 141330.990802 141172.077723 ... 141102.476944 141047.400969 141135.385006 141131.922378 141231.656466 141115.855797 141220.542768 141162.097924 141102.469348 141206.299004 141123.514021 141192.723871 141248.047431 141207.440681 141139.602971 141157.554097 141045.650680 141139.907409 140888.479585 140995.832130 141126.000693 141195.104186 141157.523028 141105.662748 141112.441533 141293.285098 141107.636577 141054.134890 141174.120525 140912.054405 140998.875288 141227.024526 141183.770462 141301.178908 141147.423729 141100.943892 141378.202818 141228.656766 141309.960163 0
28 62789.448650 63084.529078 62888.248116 62879.160690 62902.299203 62856.785176 62838.438216 62890.327459 62978.468824 62915.586396 63030.280126 62972.033937 63002.387872 62917.744686 62882.950698 62963.445695 63047.080991 62936.457158 62918.843671 62903.944131 62860.251827 62981.319394 62955.393306 62950.430855 62790.031253 62864.886726 62973.441523 62832.014189 62840.955762 62894.206974 62961.589149 63016.263609 62849.370593 62860.779191 62823.774972 62836.445243 62939.162851 63026.291359 63070.556424 62920.886694 ... 62681.452641 62818.463320 62658.489891 62751.509989 62673.658088 62645.568598 62681.461059 62508.688815 62410.438476 62357.747106 62424.788206 62287.583036 62463.003606 62461.858904 62661.955794 62695.748268 62763.846757 62617.561068 62713.372410 62667.105538 62616.660149 62616.715432 62717.656437 62704.556538 62715.951555 62749.903309 62744.565477 62690.301100 62640.328884 62562.900647 62681.503019 62595.074849 62785.115156 62726.137782 62662.004908 62639.195962 62595.452252 62665.525918 62690.144817 0
16 41432.330660 41693.673194 41615.060995 41733.206360 41407.073111 41575.178851 41481.167264 41461.696150 41574.132506 41545.980086 41589.135266 41631.806337 41678.861443 41666.674866 41573.258211 41595.650027 41704.488771 41548.087189 41514.108672 41592.132381 41587.728893 41642.447618 41577.163200 41561.930921 41516.422593 41608.456849 41626.788574 41504.738696 41597.093006 41628.753414 41551.683274 41666.509823 41582.059959 41581.388399 41583.151091 41794.523488 41602.794663 41608.601705 41609.603589 41571.278191 ... 40881.435582 40843.288013 40754.713626 40845.671971 40768.896734 40795.521882 40733.957968 40801.082108 40758.474369 40777.197609 40654.588433 40805.613933 40743.934771 40834.311881 40829.452126 40898.833956 40790.762090 40842.380033 40691.121858 40828.179011 40902.506017 40846.943964 40717.051200 40755.964071 40761.241218 40749.390892 40764.210918 40783.309633 40688.716724 40899.485940 40846.857493 40836.554608 40791.558598 40873.087750 40697.934377 40749.979366 40747.427558 40981.845986 40791.714704 0
8 75697.102041 75521.999008 75698.416356 75705.738417 75615.635709 75589.288889 75636.802773 75605.037888 75643.661031 75676.159731 75686.112903 75529.686928 75619.574181 75598.209878 75671.834220 75653.166464 75668.853037 75677.115840 75667.335173 75869.542965 75494.598934 75647.836979 75566.587210 75589.594379 75699.678912 75702.470303 75733.356330 75581.316242 75790.350131 75540.179441 75675.170180 75585.319226 75586.911065 75641.780776 75606.990303 75701.682365 75516.024677 75547.293254 75563.844719 75755.697044 ... 76268.041648 76281.419232 76284.440650 76394.796219 76213.197332 76266.147764 76103.233510 76216.735548 76239.222373 76242.488306 76374.396560 76234.833050 76231.972588 76326.670377 76266.439862 76203.261758 76211.216236 76228.558971 76245.008354 76174.797288 76321.157824 76245.700392 76255.793168 76386.641211 76193.530045 76262.074742 76154.024313 76303.203262 76377.507464 76282.547285 76156.168117 76283.238530 76317.845976 76357.921520 76262.509002 76403.950151 76152.819830 76197.209589 76313.579812 0

5 rows × 15051 columns

How many labels do we have?

[ ]:
labels, counts = np.unique(y, return_counts=True)
print('Labels =', labels, '\nCounts =', counts)
Labels = [0]
Counts = [496650]

Machine Learning

Preliminaries

Splitting the dataset into the Training set and Test set

[ ]:
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
[ ]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=True, random_state=42)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
(24, 15050) (24,) (9, 15050) (9,)

Time Series Classification

[ ]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)

# from sklearn import svm
# classifier = svm.SVC()

classifier.fit(X_train, y_train)
KNeighborsClassifier()
[ ]:
y_pred = classifier.predict(X_test)
[ ]:
from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)
[[9]]
1.0
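With a single class in `y` (see the label counts above), any classifier scores 1.0, so this accuracy says nothing about the model yet. A sketch that makes the point with scikit-learn's `DummyClassifier` on made-up data with the same split sizes:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up features with the notebook's split sizes; every label is class 0.
rng = np.random.default_rng(0)
X_train_demo = rng.normal(size=(24, 10))
X_test_demo = rng.normal(size=(9, 10))
y_train_demo = np.zeros(24, dtype=int)
y_test_demo = np.zeros(9, dtype=int)

# A baseline that ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_demo, y_train_demo)
y_pred_demo = baseline.predict(X_test_demo)

cm_demo = confusion_matrix(y_test_demo, y_pred_demo)
acc_demo = accuracy_score(y_test_demo, y_pred_demo)
print(cm_demo)
print(acc_demo)
```

The score becomes meaningful only once the eclipsing-binary and non-eclipsing lightcurves are added to the dataset.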

Decision Trees - 0.57


https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

Feature: Periodograms

[ ]:
import pandas as pd

FEATURES_DIR = '/content/drive/MyDrive/01 - Iniciação Científica/02 - Datasets/features'
PERIODOGRAMS_DIR = FEATURES_DIR + '/feature_periodograms.csv'

data = pd.read_csv(PERIODOGRAMS_DIR)
data.sample(5)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 ... 7487 7488 7489 7490 7491 7492 7493 7494 7495 7496 7497 7498 7499 7500 7501 7502 7503 7504 7505 7506 7507 7508 7509 7510 7511 7512 7513 7514 7515 7516 7517 7518 7519 7520 7521 7522 7523 7524 7525 label
127 1.771419e-21 3.118507e+10 1.792958e+10 1.571999e+10 1.392014e+10 3.661464e+10 1.002512e+10 1.852784e+10 1.907169e+10 1.278961e+10 2.216953e+09 9.314578e+09 6.515043e+09 3.546007e+09 4.409342e+09 9.162630e+09 8.302242e+09 8.718788e+09 6.957977e+09 5.611064e+09 1.877066e+09 2.836627e+09 2.936736e+09 3.912005e+09 3.473252e+09 4.388687e+09 6.221741e+09 7.087409e+09 5.035225e+09 4.960782e+09 4.799346e+09 2.262591e+09 2.778995e+09 4.723886e+09 4.334889e+09 1.247480e+09 1.892225e+09 1.962624e+09 1.284465e+09 9.996783e+08 ... 1.383741e+03 1.383737e+03 1.383732e+03 1.383728e+03 1.383724e+03 1.383720e+03 1.383716e+03 1.383712e+03 1.383708e+03 1.383705e+03 1.383701e+03 1.383698e+03 1.383695e+03 1.383692e+03 1.383689e+03 1.383686e+03 1.383683e+03 1.383681e+03 1.383678e+03 1.383676e+03 1.383674e+03 1.383671e+03 1.383670e+03 1.383668e+03 1.383666e+03 1.383664e+03 1.383663e+03 1.383661e+03 1.383660e+03 1.383659e+03 1.383658e+03 1.383657e+03 1.383656e+03 1.383656e+03 1.383655e+03 1.383655e+03 1.383654e+03 1.383654e+03 6.918270e+02 1
92 3.003987e-22 2.193934e+10 3.476167e+10 5.756229e+09 3.338234e+09 1.596978e+09 3.901232e+09 3.096381e+09 1.385702e+08 2.057867e+08 1.498203e+09 2.778013e+08 1.972097e+09 1.002296e+08 4.203961e+08 5.003830e+08 9.213325e+08 3.639927e+07 1.532101e+08 7.530989e+08 5.727591e+07 3.229936e+08 1.240524e+08 1.682591e+08 4.352538e+08 1.146596e+08 2.886079e+07 1.750333e+07 2.888187e+07 1.230348e+08 5.430957e+06 1.471160e+08 1.814644e+08 1.891106e+07 1.950586e+08 1.176811e+08 3.555871e+07 9.392426e+07 7.405261e+09 2.248148e+08 ... 6.895641e+00 6.895618e+00 6.895596e+00 6.895575e+00 6.895554e+00 6.895534e+00 6.895515e+00 6.895496e+00 6.895477e+00 6.895460e+00 6.895443e+00 6.895426e+00 6.895410e+00 6.895395e+00 6.895380e+00 6.895366e+00 6.895352e+00 6.895340e+00 6.895327e+00 6.895315e+00 6.895304e+00 6.895294e+00 6.895284e+00 6.895275e+00 6.895266e+00 6.895258e+00 6.895250e+00 6.895243e+00 6.895237e+00 6.895231e+00 6.895226e+00 6.895222e+00 6.895218e+00 6.895215e+00 6.895212e+00 6.895210e+00 6.895208e+00 6.895207e+00 3.447603e+00 1
112 6.085895e-23 2.508199e+10 2.671433e+10 3.458125e+10 2.183046e+09 2.913235e+10 1.529532e+10 2.041126e+10 9.492592e+09 4.173585e+09 8.286883e+09 3.807406e+09 1.621156e+09 3.544895e+09 1.688447e+09 2.456768e+09 9.215532e+08 3.701348e+09 7.629576e+08 8.599160e+08 9.432845e+07 2.130174e+09 8.154774e+08 1.670346e+09 4.049332e+08 2.506906e+08 1.076488e+09 7.620423e+07 3.262416e+08 4.253027e+08 1.258279e+08 4.688591e+08 1.020505e+09 2.253089e+07 1.977092e+08 1.187127e+09 1.304308e+08 7.328006e+08 1.594333e+09 3.496810e+08 ... 1.450557e+03 1.450552e+03 1.450547e+03 1.450543e+03 1.450538e+03 1.450534e+03 1.450530e+03 1.450526e+03 1.450522e+03 1.450519e+03 1.450515e+03 1.450511e+03 1.450508e+03 1.450505e+03 1.450502e+03 1.450499e+03 1.450496e+03 1.450493e+03 1.450491e+03 1.450488e+03 1.450486e+03 1.450484e+03 1.450482e+03 1.450480e+03 1.450478e+03 1.450476e+03 1.450474e+03 1.450473e+03 1.450472e+03 1.450470e+03 1.450469e+03 1.450468e+03 1.450468e+03 1.450467e+03 1.450466e+03 1.450466e+03 1.450466e+03 1.450465e+03 7.252327e+02 1
18 7.520660e-23 6.582516e+10 1.176575e+09 5.726639e+10 5.566852e+09 2.512593e+10 6.645201e+09 2.633803e+09 2.373441e+09 2.738804e+09 1.108361e+10 1.011122e+09 1.796283e+10 4.548535e+09 4.153863e+09 5.986967e+09 7.591259e+08 4.204495e+09 2.445218e+09 2.999276e+09 2.288641e+08 5.048715e+09 3.212317e+09 1.767702e+08 3.379057e+09 5.470107e+08 1.534445e+06 8.561774e+07 2.678881e+08 5.831388e+08 2.250089e+09 2.945777e+09 2.354936e+09 4.394168e+08 5.295828e+08 1.999850e+09 3.491288e+08 9.152734e+07 6.722281e+08 2.420303e+09 ... 1.157560e+07 4.970938e+06 1.044088e+07 2.076864e+07 5.652504e+06 3.240631e+06 1.397898e+07 1.008912e+06 2.327728e+07 2.909226e+07 3.869044e+06 3.251341e+06 2.399123e+07 5.366716e+07 1.863748e+07 4.281250e+06 1.950633e+07 3.989715e+07 3.906992e+07 1.556306e+07 1.263990e+07 2.418054e+07 1.361411e+07 8.309645e+06 2.176511e+07 8.594606e+06 5.784685e+04 8.229419e+04 2.710502e+07 4.228367e+07 1.220436e+07 1.449948e+06 1.991303e+07 1.383334e+06 3.488514e+06 2.693400e+07 6.823640e+06 5.278711e+06 4.999322e+06 0
35 2.005767e-23 2.483386e+11 3.034385e+11 5.397407e+09 9.714590e+10 3.074372e+09 4.607364e+10 6.113658e+09 2.487312e+10 3.281864e+09 3.171297e+09 4.322904e+08 7.593255e+08 5.041733e+09 2.014204e+10 8.716218e+07 1.438071e+10 2.324841e+09 5.329362e+09 2.122016e+09 2.336579e+08 3.276469e+08 3.614434e+08 4.081245e+09 2.247445e+09 5.308821e+08 7.717831e+09 7.687563e+08 3.279740e+09 2.631919e+08 6.609370e+09 2.438674e+09 1.032339e+09 6.609401e+09 2.769244e+09 2.386537e+09 2.067472e+10 2.984367e+10 7.234662e+10 5.054561e+10 ... 3.646149e+06 3.104552e+06 6.063856e+07 7.379961e+06 4.232041e+06 8.361054e+06 1.008574e+07 2.675340e+07 2.353878e+06 3.933722e+07 1.294616e+07 6.858054e+07 1.564118e+07 1.564992e+07 1.929105e+07 1.895119e+07 2.445280e+06 1.220851e+07 1.957260e+07 3.177603e+07 9.153447e+06 3.882828e+07 6.555127e+07 3.676038e+06 5.286998e+07 2.373915e+07 3.191280e+06 1.308186e+08 1.445847e+07 2.031333e+07 2.910264e+07 3.628487e+07 8.872919e+07 1.077391e+07 5.264475e+06 2.564387e+07 1.219609e+07 2.248579e+06 4.098014e+06 1

5 rows × 7527 columns

[ ]:
X = data.iloc[:, :-1].values # Matrix of features (Independent variable), X: numpy.ndarray
y = data.iloc[:, -1].values  # Dependent variable vector, y: numpy.ndarray

Preprocessing

1. Normalization

[ ]:
from sklearn import preprocessing

normalized_data = preprocessing.normalize(X)
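Note that `preprocessing.normalize` rescales each sample (row) to unit L2 norm by default; it does not standardize features column by column. A quick check of that behavior against plain NumPy on a toy matrix:

```python
import numpy as np
from sklearn import preprocessing

X_demo = np.array([[3.0, 4.0], [1.0, 0.0], [10.0, 10.0]])

normalized_demo = preprocessing.normalize(X_demo)  # norm="l2", axis=1 by default

# Same result by hand: divide every row by its Euclidean length.
row_norms = np.linalg.norm(X_demo, axis=1, keepdims=True)
by_hand = X_demo / row_norms

print(normalized_demo)
print(np.allclose(normalized_demo, by_hand))
```

For periodograms this removes overall power differences between targets while keeping the shape of each spectrum.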

2. PCA - Dimensionality Reduction

[ ]:
import numpy as np
from sklearn.decomposition import PCA

# What is the minimum value of `n_components` that keeps 95% of the variance in the data?

pca = PCA()
pca.fit(normalized_data)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1

print("The minimum value is:", d)
The minimum value is: 49
[ ]:
pca = PCA(n_components=d)
pca.fit(normalized_data)
X_reduced = pca.transform(normalized_data)

A better alternative: set n_components to a float between 0.0 and 1.0, indicating the fraction of variance you want to preserve.

[ ]:
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(normalized_data)
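Both routes should select the same number of components. A sketch that checks this on synthetic data (the random matrix below is an assumption, not the periodogram features):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 60 samples, 20 features with decreasing variance.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(60, 20)) * np.linspace(10.0, 0.1, 20)

# Manual route: fit all components, then find where the cumulative
# explained-variance ratio first reaches 0.95.
pca_full = PCA().fit(X_demo)
cumsum_demo = np.cumsum(pca_full.explained_variance_ratio_)
d_demo = int(np.argmax(cumsum_demo >= 0.95)) + 1

# Direct route: let PCA pick the number of components for 95% variance.
pca_95 = PCA(n_components=0.95).fit(X_demo)

print(d_demo, pca_95.n_components_)
```

After fitting, `pca_95.n_components_` holds the number of components that were kept.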

Splitting the dataset into the Training set and Test set

[ ]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, test_size=0.25, shuffle=True, random_state=42)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
(98, 49) (98,) (33, 49) (33,)
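In the cells above, PCA is fitted on all samples before the split, so the test rows influence the projection. One possible alternative, sketched here on synthetic data, is to split first and wrap the steps in a scikit-learn pipeline so that PCA is fitted on the training rows only. The data and variable names are assumptions for illustration; only the sample count (98 + 33 = 131) mirrors the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the periodogram matrix: 131 samples, 300 features.
rng = np.random.default_rng(42)
X_demo = rng.random(size=(131, 300))
y_demo = rng.integers(0, 2, size=131)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, shuffle=True, random_state=42
)

model = make_pipeline(
    Normalizer(),             # row-wise L2 scaling, like preprocessing.normalize
    PCA(n_components=0.95),   # fitted on the training rows only
    DecisionTreeClassifier(random_state=42),
)
model.fit(X_tr, y_tr)
y_pred_demo = model.predict(X_te)

print(X_tr.shape, X_te.shape)
print(model.named_steps["pca"].n_components_)
```

With this layout the number of retained components is decided from the training data alone and may differ from the 49 found above.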

Train model

[ ]:
from sklearn.tree import DecisionTreeClassifier

class_trees = DecisionTreeClassifier(random_state=42)
class_trees.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=42, splitter='best')

Results

[ ]:
import matplotlib.pyplot as plt
from sklearn.metrics import plot_confusion_matrix

labels_formated = ['confirmed exoplanets', 'eclipsing binaries']

fig = plot_confusion_matrix(class_trees, X_test, y_test,
                             display_labels=labels_formated,
                             cmap=plt.cm.Blues,
                             normalize='true')

fig.ax_.set_title('Decision Trees Classifier - Confusion matrix')
plt.show()
[Figure: Decision Trees Classifier - Confusion matrix, normalized by true class]
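`plot_confusion_matrix` was removed in scikit-learn 1.2; on newer versions `ConfusionMatrixDisplay` takes its place. A hedged sketch on toy predictions that computes the same row-normalized matrix (`normalize='true'`) and builds the display object; the ground truth and predictions are made up.

```python
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

labels_demo = ["confirmed exoplanets", "eclipsing binaries"]

# Made-up ground truth and predictions for the two classes.
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 1, 0, 1, 1, 1])

# normalize="true" divides each row by the number of true samples in that class.
cm_demo = confusion_matrix(y_true_demo, y_pred_demo, normalize="true")
print(cm_demo)

display = ConfusionMatrixDisplay(confusion_matrix=cm_demo, display_labels=labels_demo)
# display.plot(cmap="Blues") draws the figure when matplotlib is available.
```

With a fitted estimator, `ConfusionMatrixDisplay.from_estimator` accepts the same `display_labels`, `cmap` and `normalize` arguments as the older function.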

XGBoost - 0.585


https://xgboost.readthedocs.io/en/latest/parameter.html

Feature: Periodograms

[ ]:
import pandas as pd

FEATURES_DIR = '/content/drive/MyDrive/01 - Iniciação Científica/02 - Datasets/features'
PERIODOGRAMS_DIR = FEATURES_DIR + '/feature_periodograms.csv'

data = pd.read_csv(PERIODOGRAMS_DIR)
data.sample(5)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 ... 7487 7488 7489 7490 7491 7492 7493 7494 7495 7496 7497 7498 7499 7500 7501 7502 7503 7504 7505 7506 7507 7508 7509 7510 7511 7512 7513 7514 7515 7516 7517 7518 7519 7520 7521 7522 7523 7524 7525 label
97 6.567484e-24 1.178041e+10 5.298207e+09 9.132480e+09 1.002822e+10 3.540825e+09 4.228655e+09 4.095593e+09 2.244863e+08 1.012111e+09 2.164379e+09 1.789251e+09 1.365437e+09 1.876186e+09 2.027531e+09 7.245684e+12 1.720589e+09 7.179984e+08 1.050979e+09 9.693505e+08 1.354755e+09 1.007977e+09 3.747895e+08 2.814626e+08 4.937656e+08 2.295634e+09 3.026242e+09 3.029127e+09 1.860285e+09 2.587133e+09 1.779195e+13 8.697306e+08 1.296085e+09 1.402465e+08 1.454082e+09 4.639711e+08 1.589614e+08 1.291053e+09 1.517971e+08 7.411420e+08 ... 1.076525e+03 1.076522e+03 1.076519e+03 1.076515e+03 1.076512e+03 1.076509e+03 1.076506e+03 1.076503e+03 1.076500e+03 1.076497e+03 1.076495e+03 1.076492e+03 1.076489e+03 1.076487e+03 1.076485e+03 1.076483e+03 1.076480e+03 1.076478e+03 1076.476488 1.076475e+03 1.076473e+03 1.076471e+03 1076.469734 1.076468e+03 1.076467e+03 1.076466e+03 1.076464e+03 1.076463e+03 1.076462e+03 1.076462e+03 1.076461e+03 1.076460e+03 1.076459e+03 1076.458898 1.076458e+03 1.076458e+03 1.076458e+03 1.076458e+03 5.382289e+02 1
59 5.748580e-20 6.211685e+10 2.216964e+09 1.435994e+10 1.719760e+10 8.704939e+09 4.907214e+09 2.450727e+09 6.121095e+09 4.615305e+08 4.269709e+08 2.070437e+09 1.608734e+08 1.740199e+09 2.628103e+08 1.164062e+09 3.439496e+09 3.440799e+09 3.752920e+09 2.090612e+09 1.269435e+09 2.729621e+09 3.853905e+09 8.723574e+09 2.646283e+09 8.650304e+09 6.046799e+09 1.536936e+10 1.372588e+10 3.027827e+10 2.291537e+10 1.746399e+10 2.597201e+11 6.334982e+12 1.774388e+12 2.501299e+11 4.467781e+10 8.605329e+10 3.695356e+10 3.505263e+10 ... 2.414816e+04 2.414808e+04 2.414800e+04 2.414793e+04 2.414785e+04 2.414778e+04 2.414772e+04 2.414765e+04 2.414759e+04 2.414752e+04 2.414746e+04 2.414741e+04 2.414735e+04 2.414730e+04 2.414724e+04 2.414719e+04 2.414715e+04 2.414710e+04 24147.059038 2.414702e+04 2.414698e+04 2.414694e+04 24146.907523 2.414687e+04 2.414684e+04 2.414682e+04 2.414679e+04 2.414677e+04 2.414674e+04 2.414672e+04 2.414671e+04 2.414669e+04 2.414668e+04 24146.664472 2.414666e+04 2.414665e+04 2.414664e+04 2.414664e+04 1.207332e+04 1
24 1.752484e-21 9.756491e+10 3.995481e+10 1.105439e+10 3.336733e+09 3.161758e+09 3.782595e+09 5.017278e+09 1.508046e+09 1.459446e+09 5.922758e+08 4.391054e+08 1.723186e+09 1.404142e+09 9.070158e+08 4.035657e+08 8.472543e+08 2.859164e+08 2.233725e+08 4.385475e+08 2.653935e+07 1.300230e+09 1.120066e+09 1.688150e+08 1.741311e+07 5.441521e+08 2.447330e+07 6.426762e+07 2.427007e+08 6.739195e+07 6.132567e+08 1.442627e+08 9.345792e+07 1.235102e+08 4.746666e+08 1.192272e+08 2.053329e+08 2.421255e+08 9.038041e+07 4.684413e+07 ... 5.069784e+06 1.481839e+06 6.506158e+06 4.203149e+06 1.004154e+07 7.254977e+06 1.675987e+07 1.362856e+07 3.276551e+06 3.198822e+07 3.470285e+06 2.147650e+07 1.048925e+07 2.111648e+06 7.221949e+06 1.317321e+07 2.028878e+07 2.733158e+06 373819.128620 1.328582e+07 9.991568e+06 4.730944e+06 474384.659373 5.020889e+06 9.238025e+06 2.269643e+07 2.639189e+07 1.203043e+07 6.267373e+06 4.170246e+06 4.640980e+06 6.198973e+06 1.228420e+07 223599.331136 4.635344e+06 2.317602e+07 4.720558e+06 1.783050e+07 7.135303e+07 0
71 4.444052e-21 1.225255e+11 8.684239e+11 3.046022e+11 3.763060e+11 3.998026e+10 2.059423e+07 1.908615e+09 1.245052e+10 1.664964e+10 1.750149e+09 1.974743e+09 1.564496e+10 9.473577e+09 4.174117e+09 3.589926e+08 3.528037e+08 4.494764e+08 1.807520e+09 1.496267e+09 9.872100e+07 1.281638e+09 9.822630e+07 2.448930e+09 1.786816e+07 8.014748e+08 9.356294e+08 8.695086e+08 1.502131e+09 8.623401e+08 2.380475e+09 6.490083e+08 1.295033e+09 5.574792e+08 7.684658e+08 2.706459e+09 1.572336e+08 8.620298e+08 2.541334e+08 5.012295e+07 ... 3.042985e+04 3.042975e+04 3.042965e+04 3.042956e+04 3.042947e+04 3.042938e+04 3.042929e+04 3.042921e+04 3.042913e+04 3.042905e+04 3.042897e+04 3.042890e+04 3.042883e+04 3.042876e+04 3.042870e+04 3.042864e+04 3.042858e+04 3.042852e+04 30428.464346 3.042841e+04 3.042836e+04 3.042832e+04 30428.273418 3.042823e+04 3.042819e+04 3.042816e+04 3.042812e+04 3.042809e+04 3.042807e+04 3.042804e+04 3.042802e+04 3.042800e+04 3.042798e+04 30427.967140 3.042796e+04 3.042795e+04 3.042794e+04 3.042794e+04 1.521397e+04 1
32 4.534147e-21 4.810687e+11 7.156296e+11 7.907086e+10 1.827524e+11 3.066684e+10 2.281262e+10 2.472757e+10 2.943501e+10 3.255523e+10 3.103233e+10 2.130254e+10 3.009069e+10 7.761600e+09 1.505235e+10 6.974905e+09 1.466913e+10 5.582512e+08 1.547837e+10 8.222439e+09 8.399665e+08 1.454228e+09 3.245127e+09 4.154653e+09 1.935284e+09 5.700948e+09 3.176125e+09 3.062322e+09 3.942368e+08 1.405007e+09 5.606020e+08 8.074106e+08 8.842706e+08 1.022849e+08 2.209409e+09 3.976962e+09 3.228091e+08 1.296049e+09 1.387544e+09 3.794390e+08 ... 2.085999e+05 2.085992e+05 2.085985e+05 2.085979e+05 2.085973e+05 2.085966e+05 2.085961e+05 2.085955e+05 2.085949e+05 2.085944e+05 2.085939e+05 2.085934e+05 2.085929e+05 2.085924e+05 2.085920e+05 2.085916e+05 2.085911e+05 2.085908e+05 208590.383540 2.085900e+05 2.085897e+05 2.085894e+05 208589.074708 2.085888e+05 2.085885e+05 2.085883e+05 2.085881e+05 2.085878e+05 2.085877e+05 2.085875e+05 2.085873e+05 2.085872e+05 2.085871e+05 208586.975144 2.085869e+05 2.085868e+05 2.085868e+05 2.085868e+05 1.042934e+05 0

5 rows × 7527 columns

[ ]:
X = data.iloc[:, :-1].values # Matrix of features (Independent variable), X: numpy.ndarray
y = data.iloc[:, -1].values  # Dependent variable vector, y: numpy.ndarray

Preprocessing

1. Normalization

[ ]:
from sklearn import preprocessing

normalized_data = preprocessing.normalize(X)
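
As a small illustration of what this step does: with its default arguments, `preprocessing.normalize` rescales each row (each periodogram) to unit L2 norm, rather than standardizing each column. The sketch below uses a made-up toy matrix (`X_toy` is hypothetical, not the periodogram data) to show that rows on very different scales all end up with length 1.

```python
import numpy as np
from sklearn import preprocessing

# Toy stand-in for the periodogram matrix (hypothetical values): 3 samples, 4 features
X_toy = np.array([[3.0, 4.0, 0.0, 0.0],
                  [1.0e10, 0.0, 0.0, 0.0],
                  [2.0, 2.0, 2.0, 2.0]])

# Defaults are norm='l2' and axis=1, i.e. each row is divided by its own Euclidean length
normalized_toy = preprocessing.normalize(X_toy)

# Every row now has unit length, whatever its original scale
row_norms = np.linalg.norm(normalized_toy, axis=1)
print(normalized_toy)
print(row_norms)
```

Because each row is scaled using only its own values, this step carries no information from one sample to another.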

2. PCA - Dimensionality Reduction

[ ]:
import numpy as np
from sklearn.decomposition import PCA

# What is the minimum value of `n_components` that keeps 95% of the variance in the data?

pca = PCA()
pca.fit(normalized_data)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1

print("The minimum value is:", d)
The minimum value is: 49
[ ]:
pca = PCA(n_components=d)
pca.fit(normalized_data)
X_reduced = pca.transform(normalized_data)

A better alternative: set `n_components` to a float between 0.0 and 1.0, indicating the fraction of variance you want to preserve. PCA then chooses the number of components itself.

[ ]:
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(normalized_data)
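
To check that the two approaches agree, here is a minimal sketch on synthetic data (`X_demo` is a random stand-in, not the periodograms). It compares the manually computed count with the `n_components_` attribute that PCA sets when it is given a float. The two can differ by one only in the edge case where the cumulative variance lands exactly on the threshold.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Synthetic stand-in: 131 samples, 60 features with decaying variance
X_demo = rng.normal(size=(131, 60)) * np.linspace(10.0, 0.1, 60)

# Manual route: full PCA, then the first index where the cumulative ratio reaches 0.95
pca_full = PCA().fit(X_demo)
cumsum_demo = np.cumsum(pca_full.explained_variance_ratio_)
d_manual = np.argmax(cumsum_demo >= 0.95) + 1

# Automatic route: PCA picks the number of components for the requested variance
pca_auto = PCA(n_components=0.95).fit(X_demo)

print(d_manual, pca_auto.n_components_)
```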

Splitting the dataset into the Training set and Test set

[ ]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, test_size=0.25, shuffle=True, random_state=42)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
(98, 49) (98,) (33, 49) (33,)

Train model

[ ]:
import xgboost as xgb

class_xgb = xgb.XGBClassifier(learning_rate=0.3, max_depth=6, verbosity=0, random_state=42)

class_xgb.fit(X_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.3, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=42,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=0)

Results

[ ]:
import matplotlib.pyplot as plt
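from sklearn.metrics import plot_confusion_matrix  # removed in scikit-learn 1.2; newer versions use ConfusionMatrixDisplay.from_estimator

labels_formatted = ['confirmed exoplanets', 'eclipsing binaries']

# plot_confusion_matrix returns a ConfusionMatrixDisplay, not a Figure
disp = plot_confusion_matrix(class_xgb, X_test, y_test,
                             display_labels=labels_formatted,
                             cmap=plt.cm.Blues,
                             normalize='true')

disp.ax_.set_title('XGBoost Classifier - Confusion matrix')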
plt.show()
[Figure: XGBoost Classifier - Confusion matrix, normalized over the true labels]
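
To make the `normalize='true'` option concrete, the sketch below computes a confusion matrix numerically on made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical, not the model's predictions). With this option each row is divided by the number of samples of that true class, so every row sums to 1 and the diagonal holds the per-class recall.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 4 samples of class 0, 6 samples of class 1
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 0])

# Raw counts: rows are true classes, columns are predicted classes
cm_counts = confusion_matrix(y_true_toy, y_pred_toy)

# Same matrix with each row divided by its total (per-class rates)
cm_rates = confusion_matrix(y_true_toy, y_pred_toy, normalize='true')

print(cm_counts)
print(cm_rates)
```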

Gaussian Mixture Models


https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html

Feature: Periodograms

[ ]:
import pandas as pd

FEATURES_DIR = '/content/drive/MyDrive/01 - Iniciação Científica/02 - Datasets/features'
PERIODOGRAMS_DIR = FEATURES_DIR + '/feature_periodograms.csv'

data = pd.read_csv(PERIODOGRAMS_DIR)
data.sample(5)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 ... 7487 7488 7489 7490 7491 7492 7493 7494 7495 7496 7497 7498 7499 7500 7501 7502 7503 7504 7505 7506 7507 7508 7509 7510 7511 7512 7513 7514 7515 7516 7517 7518 7519 7520 7521 7522 7523 7524 7525 label
121 3.099730e-21 1.450455e+11 5.851971e+10 1.371264e+10 6.332105e+10 7.659462e+10 2.038631e+10 1.701385e+10 3.411757e+10 1.877369e+10 8.258293e+10 4.761234e+09 1.889632e+10 4.737847e+10 6.443619e+10 2.390949e+10 3.833547e+10 4.000678e+10 1.699051e+10 1.877011e+10 2.347108e+10 1.919521e+10 1.872913e+10 5.364253e+10 9.076523e+09 4.618874e+10 2.637520e+10 1.318005e+10 3.798276e+10 2.442435e+10 4.000654e+10 4.623934e+10 3.615225e+10 2.993127e+10 3.303417e+10 3.022440e+10 2.011551e+10 3.900373e+10 4.138960e+10 2.368466e+10 ... 9970.968395 9970.935809 9970.904092 9970.873244 9970.843265 9970.814155 9970.785914 9970.758543 9970.732040 9970.706406 9970.681642 9970.657746 9970.634719 9970.612562 9970.591273 9970.570853 9970.551303 9970.532621 9970.514808 9970.497864 9970.481790 9970.466584 9970.452247 9970.438779 9970.426180 9970.414449 9970.403588 9970.393596 9970.384472 9970.376218 9970.368832 9970.362315 9970.356668 9970.351889 9970.347979 9970.344938 9970.342765 9970.341462 4985.170514 1
104 2.544691e-21 9.069646e+10 6.643117e+10 5.117582e+10 3.413736e+10 4.787929e+10 1.866267e+10 1.943006e+10 2.532991e+09 9.193865e+09 2.942133e+08 7.123645e+09 1.065354e+10 2.174687e+10 2.695494e+11 2.432953e+11 1.574357e+10 3.968706e+09 2.984037e+09 4.355211e+09 4.921942e+09 6.969427e+09 9.030065e+09 6.355329e+09 9.656856e+09 4.718208e+09 3.562666e+09 1.698194e+09 8.258263e+09 7.824319e+11 8.895930e+08 1.499607e+09 1.739245e+10 2.566922e+09 9.158103e+09 6.630252e+10 1.937514e+10 1.056860e+11 1.433715e+10 1.082412e+10 ... 2453.104169 2453.096152 2453.088348 2453.080759 2453.073384 2453.066222 2453.059274 2453.052540 2453.046019 2453.039713 2453.033620 2453.027741 2453.022076 2453.016625 2453.011387 2453.006364 2453.001554 2452.996957 2452.992575 2452.988406 2452.984452 2452.980711 2452.977183 2452.973870 2452.970770 2452.967884 2452.965212 2452.962754 2452.960509 2452.958478 2452.956661 2452.955058 2452.953669 2452.952493 2452.951531 2452.950783 2452.950248 2452.949928 1226.474910 1
83 7.221651e-21 4.505717e+08 7.566718e+06 9.076482e+08 5.055431e+08 5.201719e+08 3.460490e+08 2.954526e+08 3.242125e+08 4.557197e+07 3.167933e+07 2.440220e+07 1.403096e+08 9.453829e+07 1.654833e+07 1.208286e+06 3.350352e+07 5.316386e+07 3.449430e+07 1.229629e+08 1.553346e+08 2.783925e+08 8.946271e+07 7.675415e+06 1.505378e+07 6.678298e+05 1.965005e+08 4.079288e+08 9.018535e+07 3.032997e+07 5.564186e+07 1.726893e+08 2.075812e+08 7.274091e+08 1.861569e+08 2.016792e+08 4.417020e+08 2.089950e+08 3.106678e+08 9.919087e+08 ... 121.901107 121.900708 121.900321 121.899943 121.899577 121.899221 121.898876 121.898541 121.898217 121.897904 121.897601 121.897309 121.897027 121.896756 121.896496 121.896247 121.896008 121.895779 121.895561 121.895354 121.895158 121.894972 121.894797 121.894632 121.894478 121.894334 121.894202 121.894079 121.893968 121.893867 121.893777 121.893697 121.893628 121.893570 121.893522 121.893485 121.893458 121.893442 60.946718 1
126 2.472828e-21 8.480713e+10 3.326409e+10 1.004276e+10 1.105842e+10 9.378111e+09 5.750926e+08 2.931827e+09 7.503412e+09 2.734914e+09 7.226858e+09 8.336502e+08 3.357986e+09 9.016144e+08 1.089214e+09 2.594842e+09 2.696162e+09 1.423190e+09 3.079399e+09 1.936290e+09 4.090007e+06 3.107745e+08 4.000904e+08 3.199724e+07 3.106817e+08 1.349838e+09 1.368688e+07 2.230739e+08 9.444796e+08 5.780070e+08 3.091359e+08 1.091311e+08 7.396224e+07 1.655690e+09 2.581546e+09 8.553230e+06 3.868140e+08 1.264882e+09 1.846532e+08 4.150413e+08 ... 8732.550481 8732.521943 8732.494165 8732.467148 8732.440893 8732.415398 8732.390665 8732.366693 8732.343482 8732.321032 8732.299343 8732.278416 8732.258249 8732.238843 8732.220199 8732.202315 8732.185193 8732.168832 8732.153231 8732.138392 8732.124313 8732.110996 8732.098440 8732.086645 8732.075610 8732.065337 8732.055825 8732.047073 8732.039083 8732.031854 8732.025386 8732.019678 8732.014732 8732.010546 8732.007122 8732.004459 8732.002556 8732.001415 4366.000517 1
13 1.750600e-22 7.424050e+09 1.358892e+09 3.309422e+08 5.295420e+08 2.036254e+09 8.268143e+08 1.861110e+09 9.343690e+07 1.156138e+08 2.263122e+07 2.557775e+08 4.054398e+07 3.574917e+08 1.151586e+08 1.266685e+08 2.280232e+08 1.757015e+08 6.112853e+07 4.058141e+06 3.930772e+07 2.389294e+08 1.018442e+07 2.026846e+08 8.033815e+07 5.084299e+07 2.686861e+08 2.468639e+07 7.259111e+07 6.207652e+07 1.287359e+06 1.152442e+07 1.407730e+08 1.035752e+08 4.983365e+07 7.823538e+07 3.741193e+06 2.443456e+07 5.221536e+07 5.451975e+07 ... 82.035061 82.034793 82.034532 82.034278 82.034032 82.033792 82.033560 82.033335 82.033117 82.032906 82.032702 82.032505 82.032316 82.032134 82.031958 82.031790 82.031630 82.031476 82.031329 82.031190 82.031058 82.030932 82.030815 82.030704 82.030600 82.030504 82.030414 82.030332 82.030257 82.030189 82.030128 82.030075 82.030028 82.029989 82.029957 82.029932 82.029914 82.029903 41.014950 0

5 rows × 7527 columns

[ ]:
X = data.iloc[:, :-1].values # Matrix of features (Independent variable), X: numpy.ndarray
y = data.iloc[:, -1].values  # Dependent variable vector, y: numpy.ndarray

Preprocessing

1. Normalization

[ ]:
from sklearn import preprocessing

normalized_data = preprocessing.normalize(X)

2. PCA - Dimensionality Reduction

[ ]:
import numpy as np
from sklearn.decomposition import PCA

# What is the minimum value of `n_components` that keeps 95% of the variance in the data?

pca = PCA()
pca.fit(normalized_data)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1

print("The minimum value is:", d)
The minimum value is: 49
[ ]:
pca = PCA(n_components=d)
pca.fit(normalized_data)
X_reduced = pca.transform(normalized_data)

A better alternative: set `n_components` to a float between 0.0 and 1.0, indicating the fraction of variance you want to preserve. PCA then chooses the number of components itself.

[ ]:
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(normalized_data)

Splitting the dataset into the Training set and Test set

[ ]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, test_size=0.25, shuffle=True, random_state=42)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
(98, 49) (98,) (33, 49) (33,)
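
One caveat about the order of operations: the PCA above is fitted on all 131 rows before the split, so the 33 test rows influence the projection. A possible way to avoid this is sketched below, assuming the split is done first and the reduction is fitted on the training rows only. `X_demo` and `y_demo` are random stand-ins for the real data, and `Normalizer` is the estimator form of `preprocessing.normalize`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(131, 200))   # synthetic stand-in for the periodogram matrix
y_demo = rng.integers(0, 2, size=131)  # synthetic binary labels

# Split the raw features first
X_train_raw, X_test_raw, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, shuffle=True, random_state=42)

# Fit normalization + PCA on the training rows only, then apply to the test rows
reducer = make_pipeline(Normalizer(), PCA(n_components=0.95))
X_train_red = reducer.fit_transform(X_train_raw)
X_test_red = reducer.transform(X_test_raw)

print(X_train_red.shape, X_test_red.shape)
```

The number of retained components may differ from 49 on the real data, since it is now determined by the training rows alone.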

Train model

[ ]:
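# Note: GaussianMixture is an unsupervised estimator; fit() ignores y_train,
# so the component indices it predicts are not guaranteed to match the labels.

# from sklearn.mixture import GaussianMixture

# classifier = GaussianMixture(n_components=2, random_state=42)

# classifier.fit(X_train, y_train)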
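
`GaussianMixture` is a density estimator, so it does not use the labels when fitted. One common workaround, shown here only as a sketch and not as the method used in this notebook, is to fit one mixture per class and assign each sample to the class whose mixture gives it the highest log-likelihood. The data below is synthetic, the name `class_models` is made up for illustration, and equal class priors are assumed.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Two well-separated synthetic classes in 2-D (stand-in for the reduced features)
X0 = rng.normal(loc=-3.0, scale=1.0, size=(50, 2))
X1 = rng.normal(loc=3.0, scale=1.0, size=(50, 2))
X_demo = np.vstack([X0, X1])
y_demo = np.array([0] * 50 + [1] * 50)

# One mixture per class, each trained only on that class's samples
classes = np.unique(y_demo)
class_models = {
    label: GaussianMixture(n_components=1, random_state=42).fit(X_demo[y_demo == label])
    for label in classes
}

# score_samples gives the per-sample log-likelihood under each class model
log_lik = np.column_stack([class_models[c].score_samples(X_demo) for c in classes])

# Predict the class with the highest log-likelihood (equal priors assumed)
y_pred_demo = classes[np.argmax(log_lik, axis=1)]
accuracy = np.mean(y_pred_demo == y_demo)
print(accuracy)
```

With unbalanced classes, adding the log of each class prior to the corresponding column of `log_lik` would be the natural refinement.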

Results

[ ]:
# import matplotlib.pyplot as plt
# from sklearn.metrics import plot_confusion_matrix

# labels_formatted = ['confirmed exoplanets', 'eclipsing binaries']

# disp = plot_confusion_matrix(classifier, X_test, y_test,
#                              display_labels=labels_formatted,
#                              cmap=plt.cm.Blues,
#                              normalize='true')

# disp.ax_.set_title('Gaussian Mixture Classifier - Confusion matrix')
# plt.show()

Lazy Predict


https://lazypredict.readthedocs.io/en/latest/usage.html#classification

Feature: Periodograms

[1]:
import pandas as pd

FEATURES_DIR = '/content/drive/MyDrive/01 - Iniciação Científica/02 - Datasets/features'
PERIODOGRAMS_DIR = FEATURES_DIR + '/feature_periodograms.csv'

data = pd.read_csv(PERIODOGRAMS_DIR)
data.sample(5)
[1]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 ... 7487 7488 7489 7490 7491 7492 7493 7494 7495 7496 7497 7498 7499 7500 7501 7502 7503 7504 7505 7506 7507 7508 7509 7510 7511 7512 7513 7514 7515 7516 7517 7518 7519 7520 7521 7522 7523 7524 7525 label
20 5.244937e-22 1.656106e+11 4.241954e+10 1.295592e+10 1.441850e+10 7.333357e+09 3.654204e+09 8.038780e+08 2.345564e+09 3.428295e+08 1.486851e+08 4.355384e+07 1.922543e+09 1.190937e+09 1.042278e+09 2.177381e+09 2.186346e+08 2.266618e+08 1.933830e+08 1.024137e+09 6.308557e+07 3.175023e+08 3.831862e+08 1.347591e+08 4.991692e+07 6.616579e+07 3.089654e+08 2.338376e+08 1.757878e+08 1.022475e+08 1.619658e+08 2.161862e+07 7.733964e+07 1.881846e+08 2.409856e+07 1.216204e+08 5.106637e+07 2.844362e+07 1.700110e+08 1.287244e+07 ... 2.901623e+06 9.709432e+06 6.766949e+06 3.311118e+07 3.102628e+06 6.014224e+06 1.052508e+07 1.792224e+06 5.597886e+06 8.938615e+06 2.654949e+06 3.223749e+05 1.365606e+07 5.680955e+06 1.315171e+07 1.306965e+07 1.334836e+06 2.123376e+07 3.334838e+04 4.742393e+06 6.822047e+05 4.105110e+07 9.704217e+06 1.313942e+06 7.024191e+06 3.920979e+05 1.611153e+07 8.167207e+06 1.913767e+07 3.936906e+06 4.087972e+06 2.124753e+07 1.382830e+07 3.013891e+06 1.112224e+07 1.568576e+07 6.561321e+06 2.489063e+06 2.142671e+05 0
109 5.222241e-23 7.189921e+10 9.901414e+10 1.476100e+10 1.398462e+10 1.147834e+10 6.674100e+09 4.094150e+09 3.790864e+10 4.962443e+09 9.110133e+08 1.529316e+09 2.659406e+08 5.565563e+09 2.297864e+06 2.821536e+09 2.618699e+09 2.476775e+09 2.353264e+08 3.537825e+09 4.176115e+07 1.394874e+08 2.349405e+09 4.841487e+07 4.160097e+09 3.921834e+08 1.123126e+09 2.430828e+08 1.808513e+09 4.106579e+08 5.394205e+07 1.265701e+09 2.936783e+08 4.351451e+06 5.507590e+08 2.131361e+08 1.194628e+09 4.435846e+08 6.775218e+07 6.498562e+07 ... 1.676719e+07 1.069171e+06 1.431339e+07 7.417587e+06 2.241231e+07 4.796347e+06 7.506176e+05 2.630452e+06 9.664950e+05 2.038954e+07 1.462608e+07 2.389343e+07 1.720616e+06 4.120930e+05 3.470017e+06 3.228340e+06 5.715047e+05 1.583645e+07 2.447099e+07 2.556097e+07 4.713815e+06 1.216467e+07 1.930779e+07 1.756057e+07 2.729222e+07 1.440361e+07 1.199978e+07 5.090961e+06 4.848588e+06 1.571179e+07 3.844894e+06 2.743101e+06 4.370159e+05 5.439133e+06 7.427570e+04 1.760637e+06 2.320080e+06 6.450329e+06 5.866100e+06 1
12 1.741909e-23 7.970192e+09 3.650109e+09 1.582624e+09 3.266343e+10 3.377317e+10 1.471925e+09 1.665996e+08 8.887137e+09 4.330463e+10 2.479880e+09 2.478121e+08 7.060146e+08 1.036437e+10 3.542269e+08 4.027595e+06 3.590214e+08 1.353871e+09 4.792989e+08 5.705030e+08 8.400022e+08 7.151057e+08 2.327222e+08 2.219457e+08 1.349461e+08 2.387996e+09 5.975332e+09 3.989845e+08 1.002950e+08 3.872512e+07 9.498571e+07 2.608256e+08 1.725462e+08 5.317832e+08 1.437279e+08 1.145169e+08 8.002391e+07 2.828761e+08 6.701132e+09 1.186455e+09 ... 1.947367e+03 1.947360e+03 1.947354e+03 1.947348e+03 1.947342e+03 1.947337e+03 1.947331e+03 1.947326e+03 1.947320e+03 1.947315e+03 1.947311e+03 1.947306e+03 1.947301e+03 1.947297e+03 1.947293e+03 1.947289e+03 1.947285e+03 1.947282e+03 1.947278e+03 1.947275e+03 1.947272e+03 1.947269e+03 1.947266e+03 1.947263e+03 1.947261e+03 1.947258e+03 1.947256e+03 1.947254e+03 1.947253e+03 1.947251e+03 1.947250e+03 1.947248e+03 1.947247e+03 1.947246e+03 1.947245e+03 1.947245e+03 1.947244e+03 1.947244e+03 9.736221e+02 0
107 1.537079e-21 2.334249e+10 1.246708e+11 7.693346e+10 3.661008e+10 9.257208e+08 1.661148e+09 4.462807e+09 2.758899e+09 2.238848e+09 8.078527e+09 3.787129e+09 5.939399e+09 3.755190e+09 3.769742e+09 4.052975e+08 5.671221e+09 6.723360e+09 1.641852e+09 5.226829e+09 6.062483e+08 1.519938e+09 1.789390e+09 2.119600e+07 1.049363e+09 1.143246e+08 3.384769e+08 5.382716e+08 7.376932e+08 8.038340e+08 8.351175e+08 2.170651e+08 6.091711e+06 1.795124e+09 8.766678e+07 6.037906e+07 9.266256e+08 3.747338e+08 1.156053e+09 5.333725e+07 ... 7.612817e+00 7.612792e+00 7.612768e+00 7.612744e+00 7.612721e+00 7.612699e+00 7.612677e+00 7.612656e+00 7.612636e+00 7.612617e+00 7.612598e+00 7.612580e+00 7.612562e+00 7.612545e+00 7.612529e+00 7.612513e+00 7.612498e+00 7.612484e+00 7.612470e+00 7.612457e+00 7.612445e+00 7.612434e+00 7.612423e+00 7.612412e+00 7.612403e+00 7.612394e+00 7.612385e+00 7.612378e+00 7.612371e+00 7.612365e+00 7.612359e+00 7.612354e+00 7.612350e+00 7.612346e+00 7.612343e+00 7.612341e+00 7.612339e+00 7.612338e+00 3.806169e+00 1
90 1.676619e-22 6.268024e+10 1.892612e+09 4.589268e+09 2.553798e+09 1.436860e+09 1.871399e+09 6.763201e+08 8.569126e+08 4.346068e+08 3.224972e+09 6.266573e+08 5.095192e+08 2.118835e+07 2.047743e+09 1.136897e+10 9.023897e+10 1.642913e+10 1.140840e+10 8.220013e+09 7.671658e+09 3.662531e+09 7.649781e+09 7.552201e+09 9.279831e+09 6.586815e+09 8.289219e+09 1.490018e+10 2.078385e+10 3.645875e+10 4.881829e+10 1.412379e+11 1.441203e+12 9.568005e+11 1.428168e+11 5.584073e+10 2.796076e+10 1.304114e+10 1.200738e+10 8.144469e+09 ... 1.037832e+03 1.037829e+03 1.037825e+03 1.037822e+03 1.037819e+03 1.037816e+03 1.037813e+03 1.037810e+03 1.037807e+03 1.037805e+03 1.037802e+03 1.037800e+03 1.037797e+03 1.037795e+03 1.037793e+03 1.037791e+03 1.037789e+03 1.037787e+03 1.037785e+03 1.037783e+03 1.037781e+03 1.037780e+03 1.037778e+03 1.037777e+03 1.037776e+03 1.037774e+03 1.037773e+03 1.037772e+03 1.037771e+03 1.037770e+03 1.037770e+03 1.037769e+03 1.037768e+03 1.037768e+03 1.037767e+03 1.037767e+03 1.037767e+03 1.037767e+03 5.188833e+02 1

5 rows × 7527 columns

[2]:
X = data.iloc[:, :-1].values # Matrix of features (Independent variable), X: numpy.ndarray
y = data.iloc[:, -1].values  # Dependent variable vector, y: numpy.ndarray

Preprocessing

1. Normalization

[3]:
from sklearn import preprocessing

normalized_data = preprocessing.normalize(X)

2. PCA - Dimensionality Reduction

[4]:
import numpy as np
from sklearn.decomposition import PCA

# What is the minimum value of `n_components` that keeps 95% of the variance in the data?

pca = PCA()
pca.fit(normalized_data)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1

print("The minimum value is:", d)
The minimum value is: 49
[23]:
pca = PCA(n_components=d)
pca.fit(normalized_data)
X_reduced = pca.transform(normalized_data)

A better alternative: set `n_components` to a float between 0.0 and 1.0, indicating the fraction of variance you want to preserve. PCA then chooses the number of components itself.

[25]:
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(normalized_data)

Splitting the dataset into the Training set and Test set

[26]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, test_size=0.25, shuffle=True, random_state=42)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
(98, 49) (98,) (33, 49) (33,)
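
With only 33 test samples, a purely random split can leave the two classes in different proportions in the training and test sets. The sketch below shows the `stratify` argument of `train_test_split`, which keeps the class ratio roughly equal in both parts. The class counts used here (51 and 80) are hypothetical, chosen only so that the total matches the 131 rows above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 51 of class 0 and 80 of class 1
y_demo = np.array([0] * 51 + [1] * 80)
X_demo = np.arange(131).reshape(-1, 1)  # placeholder features

# stratify=y_demo preserves the class ratio in both parts of the split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, shuffle=True,
    random_state=42, stratify=y_demo)

# Fraction of each class in the training and test sets
print(np.bincount(y_tr) / len(y_tr))
print(np.bincount(y_te) / len(y_te))
```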

Train model

[ ]:
!pip install lazypredict
[18]:
from lazypredict.Supervised import LazyClassifier

clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)

100%|██████████| 29/29 [00:01<00:00, 24.28it/s]

Results

[19]:
print(models.Accuracy.sort_values(ascending=False))
Model
QuadraticDiscriminantAnalysis   0.70
SVC                             0.70
AdaBoostClassifier              0.70
XGBClassifier                   0.70
ExtraTreesClassifier            0.70
RandomForestClassifier          0.70
CalibratedClassifierCV          0.70
KNeighborsClassifier            0.67
LGBMClassifier                  0.67
BaggingClassifier               0.67
NearestCentroid                 0.67
DecisionTreeClassifier          0.64
BernoulliNB                     0.64
GaussianNB                      0.64
DummyClassifier                 0.61
SGDClassifier                   0.58
LinearDiscriminantAnalysis      0.58
RidgeClassifier                 0.55
ExtraTreeClassifier             0.55
LogisticRegression              0.55
Perceptron                      0.52
PassiveAggressiveClassifier     0.52
LinearSVC                       0.52
RidgeClassifierCV               0.52
LabelSpreading                  0.33
LabelPropagation                0.33
Name: Accuracy, dtype: float64
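
These numbers should be read with the test-set size in mind: with 33 test samples, one sample is worth about 0.03 of accuracy, so the top score of 0.70 and the `DummyClassifier` score of 0.61 differ by roughly three light curves. A sketch of one way to get a steadier comparison is given below, using stratified 5-fold cross-validation against a majority-class baseline. The data is synthetic (`X_demo`, `y_demo` and the class counts are assumptions), and `SVC` is used only as an example model.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(42)
# Synthetic stand-in: 131 samples, 49 features, hypothetical class counts
y_demo = np.array([0] * 51 + [1] * 80)
X_demo = rng.normal(size=(131, 49))
X_demo[y_demo == 1] += 1.0  # shift class 1 so the classes are separable

# Every sample is used for testing exactly once across the 5 folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Baseline that always predicts the most frequent class
baseline_scores = cross_val_score(
    DummyClassifier(strategy='most_frequent'), X_demo, y_demo, cv=cv)

# Example model evaluated on the same folds
svc_scores = cross_val_score(SVC(), X_demo, y_demo, cv=cv)

print('baseline: %.2f +/- %.2f' % (baseline_scores.mean(), baseline_scores.std()))
print('SVC:      %.2f +/- %.2f' % (svc_scores.mean(), svc_scores.std()))
```

Reporting the spread across folds alongside the mean makes it easier to judge whether a model is really ahead of the baseline.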