Compare predicted results from deep learning to actual values in the validation set
I am a beginner so I apologize in advance if this is obvious, but the online chat folks suggested I post here!
I am trying to train a deep neural network to make a binary prediction ("hard" vs "easy") from a set of real-valued parameters and a couple of nominal parameters. I read the labelled training set in from Excel and used a Set Role block to mark the "answer" attribute, called "class", as the label. Then I passed the data to the Deep Learning block. I took the trained model and used an Apply Model block, giving it an unlabelled validation set as input, and wired both outputs to the results on the far right. What I get is the assigned predictions in a new column ("Prediction(class)", where "class" was the label).

What I need to do now is see how well the model did by comparing the actual values to the predictions. Because the validation set is unlabelled, the actual values are not present in that Excel sheet. I do have them, of course, in the original data, but I had removed them to make the validation set unlabelled. So basically I want to evaluate the performance of the predictions.
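In code terms, what I'm after is roughly this (a Python/pandas sketch outside RapidMiner; the file, sheet, and column names here are just placeholders for my workbook, not the real thing):

```python
# Rough sketch of the comparison I want: line the held-back labels up
# against the "Prediction(class)" column from the Apply Model output.
# File/sheet names below are placeholders.
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

# Labels I removed from the validation rows (same row order as the export).
actual = pd.read_excel("data.xlsx", sheet_name="validation_labels")["class"]
# Export of the Apply Model results, which adds the predicted column.
predicted = pd.read_excel("apply_model_results.xlsx")["Prediction(class)"]

print("accuracy:", accuracy_score(actual, predicted))
print(confusion_matrix(actual, predicted, labels=["easy", "hard"]))
```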
My wiring and output data are appended.
Thanks so much!
Best Answer
bsegal
OK thanks, I will run these for now. We do have a bunch more data, though it's not "enriched" in difficult (vs easy) cases like these original sets, which were derived after the fact to yield exactly 50/50. The new data set is prospective and has only ~10% difficult cases, but it does have several hundred rows and growing. I'll likely be back for help with the DL!
Answers
Hello @bsegal, welcome to the community! I'd recommend posting your XML process here (see "Read Before Posting" on the right when you reply) and attaching your dataset. That way we can replicate what you're doing and help you better.
Scott
Thanks. Enclosed are the XML and the Excel file with the data. The labelled training set is on tab 2, and the unlabelled validation set is on tab 3. All of the data together is on tab 1.
Hi @bsegal - OK, I think I understand. Normally we prefer to use cross-validation when building models to prevent overfitting. We measure the model's performance during that cross-validation, and then apply the trained model to the unlabeled data.
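If it helps to see that flow outside of RapidMiner, here's a rough Python/scikit-learn sketch of the same idea (the data loading, sheet names, and choice of learner are placeholders, not your actual process):

```python
# Illustrative sketch of the usual flow: cross-validate on the labelled
# data, then fit on all of it and score the unlabeled rows.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier  # stand-in for the DL operator

labelled = pd.read_excel("data.xlsx", sheet_name="training")
unlabeled = pd.read_excel("data.xlsx", sheet_name="validation")

X, y = labelled.drop(columns=["class"]), labelled["class"]
model = MLPClassifier(max_iter=2000, random_state=0)

# 10-fold cross-validation estimates how the model generalizes.
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print("cross-validated accuracy:", scores.mean())

# Train on all labelled rows, then predict the unlabeled set.
model.fit(X, y)
unlabeled["Prediction(class)"] = model.predict(unlabeled[X.columns])
```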
Scott
Thank you again. Sorry to persist, but I'm really trying to learn how to do this correctly. When I look at the results under Performance Vector, this appears to be how well the model fit the training set in the cross-validation step, rather than how it performed on the separate validation data (tab 3 in the Excel file). What I am trying to do is train the model on one set of data and then validate it (instead of an n-fold cross-validation) on another set of data that was not used to train the model.
A little background, if this helps: the first 128 columns are the output of an open-source deep neural net that analyzes facial photographs and outputs these embeddings. We are adding a few fields of data describing the subject which were part of a medical research study. The outcome is difficulty in performing a medical procedure known as endotracheal intubation. We're trying to predict difficulty based on facial appearance. So I'm hoping to train the model on a set of easy and hard cases, and then test the model on a different set of data.
Hi @bsegal - no problem at all. Thanks for the background; it always helps to understand the use case. The reason I'm not showing the performance of the validation set is exactly because it's unlabeled (hard/easy). How would we be able to measure the performance of a model if we don't know what we're comparing the predictions against?
So another way to skin this cat would be to split the labelled data (usually we use 80/20), create a model with the larger piece (with cross-validation), and then test its performance on the remaining smaller piece. Then, once we're satisfied with the model's performance, we can apply it to unlabeled data to make informed, probabilistic predictions - the "validation set" tab in your case.
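In code terms, that split-and-test idea looks roughly like this (a scikit-learn sketch rather than a RapidMiner process; file names and the learner are placeholders):

```python
# Rough sketch of the 80/20 idea: build with cross-validation on the 80%
# piece, test on the held-out 20%, and only then apply the final model
# to truly unlabeled data.
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neural_network import MLPClassifier  # stand-in learner
from sklearn.metrics import accuracy_score

labelled = pd.read_excel("data.xlsx", sheet_name="training")
X, y = labelled.drop(columns=["class"]), labelled["class"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = MLPClassifier(max_iter=2000, random_state=0)
print("CV accuracy on the 80%:",
      cross_val_score(model, X_train, y_train, cv=10).mean())

model.fit(X_train, y_train)
print("hold-out accuracy on the 20%:",
      accuracy_score(y_test, model.predict(X_test)))
```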
Does this help at all?
Scott
[EDIT: on a side note, you have < 100 rows of data in your training set, which makes it almost impossible to train any kind of decent model. Hopefully you have more data hiding somewhere?]
Scott, thanks for your patience. So yes, of course, we do have the actual results for the validation step. If you look at tab 1 of the Excel file, I have all 80 cases, but I had manually stripped out the actual labels before sending the data to the Apply Model block. In our original study we arbitrarily divided them 40/40. In that study we used a supervised facial analysis model that required human intervention to jump-start the fitting; here we are trying to skip the human intervention.
So it sounds like you would not recommend what I'm doing (a 50/50 split of the cases, one half for training and one for testing), but would rather combine all 80 cases and use a 10-fold cross-validation step instead?
But even with the risk of overfitting, is it possible to program RapidMiner to do what we were trying (the half and half split)?
Ah, OK! Silly me - I should have looked at the first tab and read your query more carefully. My apologies. Sometimes I jump before looking...
So yes, my feeling is you're at risk of overfitting with so little data - particularly with an algorithm like Deep Learning, which is prone to overfitting in general. I like the choice of DL for data sets like yours because of its inherent feature selection properties, but I think you're using a tool that doesn't fit your current data resources. For initial model selection, I always recommend Ingo's amazing mod.rapidminer.com. If I enter your information there, I get Decision Tree and Naive Bayes as models that will most likely serve your purposes better.
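If you want a quick sanity check on those two learners outside RapidMiner, something like this scikit-learn sketch (data loading and parameters are placeholders) compares them with cross-validation:

```python
# Quick sanity check: compare a decision tree and Naive Bayes with
# 10-fold cross-validation on the labelled data.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

labelled = pd.read_excel("data.xlsx", sheet_name="training")
X, y = labelled.drop(columns=["class"]), labelled["class"]

for name, model in [("Decision Tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
                    ("Naive Bayes", GaussianNB())]:
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print(name, round(scores.mean(), 3), "+/-", round(scores.std(), 3))
```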
Here's what I would try:
Scott
EDIT - just to be clear, what I'm showing you is a process to use once you have more data. With only your 80 rows, yes, you can try putting everything into a 10-fold cross-validation, but again you're not going to get very good results. 49% accuracy on binary classes is worse than flipping a coin.
Scott
Sounds good. FYI, you don't need a completely balanced data set to perform the analysis. We have some nice tools to help in that regard. I would much rather have an unbalanced data set of a few thousand rows than a balanced data set of fewer than 100 rows.
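For reference, one common way to handle that kind of imbalance outside RapidMiner is class weighting rather than forcing a 50/50 data set; here's a scikit-learn sketch (file name, learner, and parameters are placeholders):

```python
# Sketch: handle a ~10% positive class with class weighting instead of
# throwing rows away to force a 50/50 balance.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

data = pd.read_excel("prospective_data.xlsx")  # e.g. ~10% "hard" cases
X, y = data.drop(columns=["class"]), data["class"]

# class_weight="balanced" re-weights errors inversely to class frequency,
# so the rare "hard" class is not ignored by the learner.
model = DecisionTreeClassifier(class_weight="balanced", random_state=0)

# Use a metric that respects imbalance (balanced accuracy here), not plain
# accuracy, which an "always predict easy" model can game.
print(cross_val_score(model, X, y, cv=10, scoring="balanced_accuracy").mean())
```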
Scott