gngdb edited this page Nov 13, 2014 · 18 revisions

2014-11-08 Investigating Raw vs ICA vs CSP for MVAR GPDC

Suspected a priori that this would be a very useful feature. Found exceptionally high separation on iso-plots. Expected ICA to outperform CSP, and both to outperform RAW. On the public leaderboard, CSP outperformed ICA (though very narrowly)!

mvar_raw
Public AUROC:    0.68743
Predicted AUROC: 0.76961
Expected values:
Dog_1           Dog_2           Dog_3           Dog_4           Dog_5           Patient_1       Patient_2           Overall
0.450020833333  0.777798564477  0.769036265432  0.753514200268  0.881382632633  0.814868731309  0.360648148148  0.769611833552

mvar_ica
Public AUROC:    0.74684
Predicted AUROC: 0.83023
0.476815972222  0.953311776156  0.809512731481  0.779338422456  0.929014014014  0.869317201518  0.425092592593  0.830230271653

mvar_csp
Public AUROC:    0.74837
Predicted AUROC: 0.81972
0.426253472222  0.919725826918  0.825064236111  0.776440593067  0.930520520521  0.853518907563  0.432222222222  0.819721454993

emvar_ica
Public AUROC:    0.74074
Predicted AUROC: 0.85650
0.557541666667	0.957400035901	0.871861111111	0.750087083459	0.965962962963	0.895380813144	0.595833333333	0.856497773628

emvar_csp
Public AUROC:    0.75006
Predicted AUROC: 0.84330
0.532333333333	0.954033433758	0.830027777778	0.740407731803	0.941833333333	0.877332219437	0.6125	0.843300878237

emvar_cspdr
Public AUROC:    0.71194
Predicted AUROC: 0.84762
0.607055555556	0.926507223807	0.852361111111	0.683154132689	0.96625	0.846177944862	0.552222222222	0.847620757582

emvar_cspdr,ica
Public AUROC:    0.70129
Predicted AUROC: 0.84603
0.591930555556	0.947151957074	0.822472222222	0.673227121715	0.97087037037	0.882466583124	0.514444444444	0.84602997907

probablygoodplusraw

Thought that combining the features used in our previous best classifier along with those we used for our current best submission would probably work well. The following features were used:

"FEATURES": ["cln,csp,dwn_feat_lmom-3_",
        "cln,ica,dwn_feat_xcorr-ypeak_",
        "cln,csp,dwn_feat_pib_ratioBB_",
        "cln,ica,dwn_feat_mvar-GPDC_",
        "cln,ica,dwn_feat_PSDlogfcorrcoef_",
        "cln,ica,dwn_feat_pwling1_",
        "raw_feat_corrcoef_",
        "raw_feat_cov_",
        "raw_feat_pib_",
        "raw_feat_var_",
        "raw_feat_xcorr_"],

And the predicted performance was:

predicted AUC score for Dog_1: 0.53
predicted AUC score for Dog_2: 0.95
predicted AUC score for Dog_3: 0.82
predicted AUC score for Dog_4: 0.77
predicted AUC score for Dog_5: 0.94
predicted AUC score for Patient_1: 0.77
predicted AUC score for Patient_2: 0.45
predicted AUC score over all subjects: 0.83

Then, submitted and got 0.76012.

Only slightly worse than our current best. The classifier probably relies on the features from the current best submission, and these raw features aren't adding anything useful.

Feature selection

Thought this was what I was doing, but looking at the code I wasn't actually including any feature selection.

Now using a simple variance threshold and then also filtering by F-scores. With both enabled the predicted AUC was 0.86, but I had been fiddling with the cross-validation code, so that number won't map onto other results. Full AUC results can be found here.

Submitted and got 0.77171, moving up the leaderboard 5 places.
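A minimal sketch of a variance threshold followed by F-score filtering, assuming scikit-learn and synthetic data. The project's own settings and cross-validation code are not shown on this page, so the data, `k=10`, the scaler and the `SVC` defaults here are illustrative assumptions only. Putting the selectors inside the pipeline means they are refit on each training fold, which keeps the cross-validated AUC from being biased by the selection step.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for one subject's feature matrix (assumption, not real data).
rng = np.random.RandomState(0)
y = np.repeat([0, 1], 100)
X = rng.randn(200, 50)
X[:, :5] = 0.0                    # constant columns: removed by the variance threshold
X[:, 5:8] += 1.5 * y[:, None]     # three informative columns

model = make_pipeline(
    VarianceThreshold(threshold=0.0),   # drop zero-variance features
    SelectKBest(f_classif, k=10),       # keep the 10 highest ANOVA F-scores
    StandardScaler(),
    SVC(),
)

# Selection is refit inside every fold because it is part of the pipeline.
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(scores.mean())
```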

Now running with just the variance threshold to see what its contribution is on its own.

Also got a predicted AUC of 0.86. Submitted and got 0.77171, exactly the same, which doesn't make a lot of sense. Going to check that it actually took out the F-score selector.

Reran it with the VarianceThreshold actually enabled and got the following results:

probablygood.gavin	0.589681818182	0.98621623775	0.890074380165	0.772748201247	0.979468319559	0.87131210293	0.537975206612	0.863925818882

Submitted and scored 0.77171 again.

Looking through commit logs to see if I did actually enable variance threshold in the last submission. Looks like I didn't have it enabled, unless it was enabled between commits.

Enabled F-score SelectKBest with defaults, which are pretty aggressive: it removes everything but the 10 best features. Training scores were the following:

f1selectionpg_gavin	0.596227272727	0.955673747151	0.622037190083	0.700814745942	0.740404958678	0.874980003656	0.59132231405	0.779228904792

Switched from KBest to Percentile 10%:

f1selectionpg_gavin	0.594309917355	0.993204113356	0.875190082645	0.807148622483	0.947581267218	0.913800047991	0.537231404959	0.865991069517

So this should in theory perform better on the leaderboard. Submitted and got 0.76693.
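To illustrate why the SelectKBest defaults are so much more aggressive than a 10% percentile, here is a small sketch assuming scikit-learn's `SelectKBest` and `SelectPercentile` with the `f_classif` score. The 500-column random matrix is an arbitrary stand-in; the real feature count is not given on this page.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

rng = np.random.RandomState(0)
y = np.repeat([0, 1], 100)
X = rng.randn(200, 500)   # hypothetical feature count, for illustration only

# Default SelectKBest keeps a fixed k=10 features regardless of input width.
n_kbest = SelectKBest(f_classif).fit(X, y).get_support().sum()

# SelectPercentile keeps a fraction, so it scales with the number of features.
n_pct = SelectPercentile(f_classif, percentile=10).fit(X, y).get_support().sum()

print(n_kbest, n_pct)
```

With a wide feature matrix the fixed `k=10` throws away far more than the percentile selector does, which may be part of why the KBest training scores above dropped so sharply on some subjects.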

Then decided to try random forest feature selection. Now implemented in settings.

Used 1000 estimators and got the following training results:

forestselection_gavin	0.607252066116	0.991547999016	0.885080578512	0.769613255618	0.978782369146	0.910679190091	0.588863636364	0.86575473285

Submitted and got 0.78329, moving up the leaderboard 9 positions. It appears that the improved performance on Patient_2 and Dog_1 makes a difference.
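A rough sketch of random forest feature selection with 1000 estimators. This page does not say how the importances were thresholded, so the use of scikit-learn's `SelectFromModel` with its default mean-importance cut-off is an assumption, as is the synthetic data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data (assumption): 3 informative columns out of 40.
rng = np.random.RandomState(0)
y = np.repeat([0, 1], 100)
X = rng.randn(200, 40)
X[:, :3] += 1.5 * y[:, None]

forest = RandomForestClassifier(n_estimators=1000, random_state=0)

# Keep features whose importance exceeds the mean importance
# (the SelectFromModel default for tree ensembles).
selector = SelectFromModel(forest)
X_sel = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)
print(kept)
```

Unlike the univariate F-score, forest importances are computed with all features present, so a feature that is only useful in combination with others can still survive the cut.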

Single channel statistics

Investigating which modtyp is best for the single channel time domain statistics. Intuitively, CSP should be best. We need to check this against the leaderboard because CSP takes knowledge of all features into account.

singlech_timestats_raw
Public AUROC:    0.69832
Predicted AUROC: 0.73461
Dog_1           Dog_2           Dog_3           Dog_4           Dog_5           Patient_1       Patient_2           Overall
0.479159722222	0.779531588137	0.734937885802	0.687944059975	0.781799299299	0.74687284334	0.551018518519	0.734610190333

singlech_timestats_ica
Public AUROC:    0.68967
Predicted AUROC: 0.71428
0.494840277778	0.770932191196	0.677896219136	0.678842378697	0.772084584585	0.666515700483	0.555925925926	0.714281427218

singlech_timestats_csp
Public AUROC:    0.70466
Predicted AUROC: 0.72564
0.506454861111	0.781471208263	0.72861246142	0.665038026046	0.803454704705	0.578890614217	0.578240740741	0.725640056988

We predicted raw > csp > ica. On the leaderboard we found csp > raw > ica. Only narrow margins between them, though.
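The two orderings and the size of the margins can be checked directly from the numbers reported in this section. A small sketch (the helper name `ranking` is just for illustration):

```python
# Scores copied from the singlech_timestats results above.
predicted = {"raw": 0.73461, "ica": 0.71428, "csp": 0.72564}
public = {"raw": 0.69832, "ica": 0.68967, "csp": 0.70466}


def ranking(scores):
    """Return the keys ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


print(" > ".join(ranking(predicted)))   # raw > csp > ica
print(" > ".join(ranking(public)))      # csp > raw > ica

# Spread between best and worst modtyp on each measure.
predicted_spread = max(predicted.values()) - min(predicted.values())
public_spread = max(public.values()) - min(public.values())
print(round(predicted_spread, 5), round(public_spread, 5))
```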

Individual features which are supposed to be good overall

Ran the batch train and predict script on all single features. Sorted the list by overall ROC prediction.

A lot of the top features are MVAR flavours, so I just used the best overall and no others.

NB: these are 20-times-CV predictions, not 10.

SVC_ica_mvar-arf
Public AUROC:    0.68460
Predicted AUROC: 0.8427
Dog_1   Dog_2   Dog_3   Dog_4   Dog_5   Patnt1  Patnt2  Overall
0.4651	0.9571	0.8439	0.7830	0.9539	0.7216	0.3756	0.8427

SVC_csp_coher_logf
Public AUROC:    0.75269
Predicted AUROC: 0.8301
0.5468	0.9579	0.8264	0.7656	0.9296	0.8225	0.5712	0.8301

SVC_ica_phase-high_gamma-sync
Public AUROC:    0.68427
Predicted AUROC: 0.8247
0.6218	0.9789	0.7523	0.7990	0.9082	0.8884	0.5332	0.8247

SVC_ica_pib_ratioBB
Public AUROC:    0.77110
Predicted AUROC: 0.8012
0.6928	0.8832	0.7556	0.7968	0.8065	0.8332	0.5598	0.8012

MVAR-ARF is supposed to be better than GPDC according to CV, but is not on the public leaderboard (see top of this page). Not sure how we should pick the best of the MVARs.

ica_phase-high_gamma-sync does reasonably well with CV/public = 0.8247/0.68427

csp_coher_logf does surprisingly well with CV/public = 0.8301/0.75269

ica_pib_ratioBB does incredibly well on the public leaderboard, with 0.8012/0.77110. This is basically as good as the best current submission, which is Gavin's probablygood with automatic dropping of worst elements.

At the moment, it seems like the public score is most correlated with the Patient_2 score. This might be because it is the worst performing subject. We probably overestimate the number of Patient_2 preictals. It might be that we could improve the overall score by improving the Patient_2 prior. Open to suggestions on why the worst subject would be linked to overall performance, and open to other suggestions about the relationship between prediction and public leaderboard.
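As a quick check of the Patient_2 observation, the sketch below computes Pearson correlations over the four single-feature submissions listed in this section. With only four points this is weak evidence at best, and the choice of Pearson correlation is an assumption rather than the analysis actually used.

```python
import numpy as np

# Rows: SVC_ica_mvar-arf, SVC_csp_coher_logf,
#       SVC_ica_phase-high_gamma-sync, SVC_ica_pib_ratioBB (values from above).
public = np.array([0.68460, 0.75269, 0.68427, 0.77110])
patient_2 = np.array([0.3756, 0.5712, 0.5332, 0.5598])
overall = np.array([0.8427, 0.8301, 0.8247, 0.8012])

# Correlation of the public score with the Patient_2 CV score
# and with the overall CV score.
r_patient_2 = np.corrcoef(public, patient_2)[0, 1]
r_overall = np.corrcoef(public, overall)[0, 1]
print(r_patient_2, r_overall)
```

On these four rows the Patient_2 correlation comes out positive while the overall CV score is negatively correlated with the public score, which matches the impression that the overall CV figure is a poor guide here.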

Best by subject combined

Made up of the features which have highest cross-validation for each subject (approximately). Includes the 4 above which were the best overall.

Prediction is with 10 CV cross folds.

Also tested this with/without pseudodata, and it did worse without pseudodata.

bestbysubj
Public AUROC:    0.74468
Predicted AUROC: 0.84500
Expected values:
Dog_1           Dog_2           Dog_3           Dog_4           Dog_5           Patient_1       Patient_2           Overall
0.474360020661	0.970092652542	0.852643652433	0.778758206978	0.972751098206	0.741838942507	0.475785123967	0.845005605453

bestbysubject, no pseudo data
Public AUROC:    0.72625
Predicted AUROC: 0.83970
0.460887152778	0.961280343093	0.841602430556	0.765463976091	0.971615365365	0.761653151599	0.491018518519	0.839706036032

"cln,ica,dwn_feat_mvar-ARF_",
"cln,csp,dwn_feat_coher_logf_",
"cln,ica,dwn_feat_phase-high_gamma-sync_",
"cln,ica,dwn_feat_pib_ratioBB_",
"cln,raw,dwn_feat_PSDlogfcorrcoef_",
"cln,csp,dwn_feat_psd_logf_",
"cln,raw,dwn_feat_spearman_",
"cln,csp,dwn_feat_corrcoefeig_",
"cln,ica,dwn_feat_ampcorrcoef-theta-eig_",
"cln,raw,dwn_feat_pwling4_"

Does worse than SVC_ica_pib_ratioBB and SVC_csp_coher_logf, despite including these. Clearly there is some kind of penalty for including extra (noisier) elements when doing the classification. Should submit again with Gavin's F-score and gaussian thresholding.

Does better than SVC_ica_phase-high_gamma-sync alone, despite having worse Dog_1 and Patient_2 predicted scores, which goes against my suggestion that Patient_2 is weighing more heavily than it should for leaderboard scores.

Dirty test

Since Patient_2 is doing badly and I cleaned and downsampled it, I wanted to check it is not just because of the cleaning that it is worse.

Could have just looked at CV scores, but we had spare submission slots.

I selected the features which were used in "Individual features which are supposed to be good overall", but only two of those are available dirty (no dirty phase, ampcorr or mvar processed).

Some of these came out worse after cleaning, so it seems that whilst cleaning out the line noise in Patient_1 improved its scores, cleaning out non-existent line noise for Patient_2 for the purpose of "having a consistent model and not switching on subject names" has made it worse.

I think it might be best if I re-do all of Patient_2 without line noise removal... :(

SVC_csp_coher_logf CLEAN
Public AUROC:    0.75269
Predicted AUROC: 0.8301
0.5468	0.9579	0.8264	0.7656	0.9296	0.8225	0.5712	0.8301

SVC_dirtycsp_coher_logf DIRTY
Public AUROC:    0.75400
Predicted AUROC: 0.8304
0.5468	0.9579	0.8264	0.7656	0.9296	0.8280	0.6394	0.8304


SVC_ica_pib_ratioBB CLEAN
Public AUROC:    0.77110
Predicted AUROC: 0.8012
0.6928	0.8832	0.7556	0.7968	0.8065	0.8332	0.5598	0.8012

SVC_dirtyica_pib_ratioBB DIRTY
Public AUROC:    0.77161
Predicted AUROC: 0.8007
0.6928	0.8832	0.7556	0.7968	0.8065	0.7338	0.5886	0.8007


SVC_csp_psd_logf CLEAN
Public AUROC:    0.74494
Predicted AUROC: 0.8182
0.6226	0.8966	0.8246	0.7577	0.9060	0.8481	0.5824	0.8182

SVC_dirtycsp_psd_logf DIRTY
Public AUROC:    0.74458
Predicted AUROC: 0.8189
0.6226	0.8966	0.8246	0.7577	0.9060	0.7947	0.7046	0.8189


SVC_raw_pwling4 CLEAN
Public AUROC:    0.65123
Predicted AUROC: 0.6756
0.4787	0.7702	0.6441	0.6748	0.7051	0.7326	0.7392	0.6756

SVC_dirtyraw_pwling4 DIRTY
Public AUROC:    0.64573
Predicted AUROC: 0.6656
0.4787	0.7702	0.6441	0.6748	0.7051	0.3036	0.7128	0.6656
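This page does not say how the line noise was removed during cleaning. One common approach is a narrow notch filter, sketched below with SciPy. The 60 Hz mains frequency, the 5000 Hz sampling rate, the quality factor and the synthetic signal are all assumptions for illustration, not the project's actual cleaning settings.

```python
import numpy as np
from scipy.signal import filtfilt, iirnotch

fs = 5000.0         # assumed sampling rate in Hz
line_freq = 60.0    # assumed mains frequency in Hz

# Synthetic signal: a 10 Hz component plus line noise at 60 Hz.
t = np.arange(20000) / fs
signal = np.sin(2 * np.pi * 10.0 * t) + 0.5 * np.sin(2 * np.pi * line_freq * t)

# Narrow notch centred on the line frequency, applied forwards and
# backwards so the filter adds no phase shift.
b, a = iirnotch(line_freq, Q=30.0, fs=fs)
cleaned = filtfilt(b, a, signal)


def amplitude_at(x, freq):
    """Amplitude of the FFT bin closest to freq."""
    spectrum = 2.0 * np.abs(np.fft.rfft(x)) / len(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return spectrum[np.argmin(np.abs(freqs - freq))]


print(amplitude_at(signal, line_freq), amplitude_at(cleaned, line_freq))
```

A notch like this removes whatever sits in that band, line noise or not. If a recording has no line noise, the filter can only take away genuine signal, which fits the worse Patient_2 scores after cleaning.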

Best by subject with RF feature selection

Thought that with feature selection the idea of combining features that perform well on each subject made sense. Tried the settings used by Scott above for the same thing, but added RF feature selection. Training score:

forestselection_bestbysubj 0.630185950413 0.993507726101 0.863694214876 0.763790561504 0.970876033058 0.904247509027 0.63541322314 0.868102807597

Submitted and got 0.76471, no improvement.

Best by subject with RF feature selection

Taking features reported to perform well on the subjects we find difficult and adding them to the probablygood features along with the RF feature selection. Training scores were:

bbsubj_pg.json
0.618776859504 0.991768830696 0.885159090909 0.775496269191 0.979101928375 0.906237145208 0.592644628099 0.868513930942

Submitted and score was 0.77768, not an improvement.

New features

Scott finished making the version 3 features so rerunning the highest performing classifier on this list. Random forest feature selection with the probably good features. Scores on training were:

forestselection_gavin	0.614710743802	0.991181376025	0.879867768595	0.772283341498	0.980170798898	0.911199095023	0.635041322314	0.867716533627

Submitted and the score was: 0.78169

Stochastic optimisation initial results

Watching the results come in, I saw a cross-validation AUC higher than any I'd seen before. Results were:

stochopthighscoringearly	0.765004132231	0.960021774526	0.897760330579	0.776119092609	0.945316804408	0.877876616847	0.686776859504	0.8784984861

Submitted and score was: 0.77176

At least the cross-val scores on Patient_2 and Dog_1 were good.

Platt scaling

Submitted the version 3 forest selection results with simple Platt scaling applied. Predicted score was: 0.84980742318223301

On the leaderboard it scored: 0.78169. Before scaling the same model scored exactly the same.

Tried using an extra 1-of-k feature for the Platt scaling to see if that might be useful to tune for each subject. Predicted AUC score was: 0.8638437. On the leaderboard scored 0.69822. Potentially damaging then.
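AUROC depends only on the rank order of the predictions. A single Platt scaling applied to all predictions is a monotonic map whenever the fitted slope is positive, so it cannot change the score, which is consistent with the identical 0.78169 above. Adding a 1-of-k subject indicator gives each subject its own offset, which can reorder predictions across subjects and so change the pooled AUROC. The sketch below illustrates this on synthetic data. The use of `LogisticRegression` as the Platt scaler, the three hypothetical subjects, and fitting on the same data that is scored are simplifying assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
y = np.repeat([0, 1], 150)
scores = rng.randn(300) + 1.0 * y        # synthetic classifier decision values
subject = rng.randint(0, 3, size=300)    # three hypothetical subjects

# Global Platt scaling: one sigmoid fitted to the raw decision values.
platt = LogisticRegression().fit(scores.reshape(-1, 1), y)
p_global = platt.predict_proba(scores.reshape(-1, 1))[:, 1]

# Platt scaling with an extra 1-of-k subject indicator: each subject
# gets its own offset, so cross-subject ordering may change.
X_subj = np.column_stack([scores, np.eye(3)[subject]])
platt_subj = LogisticRegression().fit(X_subj, y)
p_subj = platt_subj.predict_proba(X_subj)[:, 1]

auc_raw = roc_auc_score(y, scores)
auc_global = roc_auc_score(y, p_global)
auc_subj = roc_auc_score(y, p_subj)
print(auc_raw, auc_global, auc_subj)
```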

1-minute features

Tried 1-minute features, using only ratio BB PIB, as I had just got these working with simple averaging. Predicted scores were:

testing_10features	0.804301823638	0.943478642754	0.819731186519	0.736204835556	0.82283128142	0.891375291106	0.318511465066	0.790385411813

Submitted and scored 0.74661.