Issue with Supervised Learning
Experts,
Can you please help me get this to work? I'm trying to perform supervised learning on a known data set and predict the unknowns. I'm attaching the sample data and my model. As you can see, I have quite a few missing values, and I have handled the missing data as well, yet my model predicts everything as positive. If you look at the sample data, I'm training the model on the rows labelled 1 and predicting how many of the rows labelled 0 show similar behavior. I tried different ensembles with no luck.
How do I get this to produce a reasonable output? Do I have too many variables (inputs), or is the missing data causing this issue? Why can't handling the missing data take care of it?
Appreciate your valuable guidance
Thx
Answers
Sir, can you please help?
How do I get this to produce a reasonable output...
- Do whatever is in your hands to eliminate the missing values as much as possible. If that means replacing them with zeroes, all good; if it means replacing them with the minimum value, all good too.
- You have too many columns, so you are trying what I call "brute force scoring". Nope, not a good thing to do: sit down and understand how your data works before getting to the modeling. Your data has a lot of potential when handled correctly, but you have to be careful.
I can't open your RapidMiner process because I don't have a machine that can properly open it - my main computer is dead. Let me start by saying that I may not have correctly understood your RapidMiner process or what you're trying to accomplish, but perhaps the following will be helpful.
Your "Supervised" data set has a column "sold_notsold" which, if I understand you correctly, is what you want to predict the value of - this would be what RapidMiner calls the Label field.
In your "Supervised" Excel file, all rows have "sold_notsold" coded with the value 1 or 0. As your file name suggests, this looks like a case of supervised learning, where data rows in an example file are coded with values of the field you want to predict. One would then build a model based on this data and use the model to generate predictions on data it hasn't seen before - data that would come from another input source and be fed to your trained model. Again, I'm assuming a value of 1 = sold, a value of 0 = not sold, and that "sold_notsold" is the attribute (field) you want to predict.
In your process, the data is filtered - which results in the model being built only on data where the "sold_notsold" field (what you want to predict) equals 1. This produces a model that has no basis or experience to classify new prediction examples as anything other than 1 (positive) - because it has never seen any negative (0, or false) examples. Your process then feeds data rows where "sold_notsold" equals 0 to this model - but again, since the model was trained on data where "sold_notsold" only equals 1, the model doesn't know how to distinguish between rows where "sold_notsold" = 1 and rows where "sold_notsold" = 0. It only knows that "sold_notsold" = 1.
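To make the point concrete, here is a tiny Python sketch (made-up toy data, nothing to do with your actual file) of what happens when a classifier only ever sees one class during training:

```python
# Minimal sketch of why the filtered process predicts everything as
# positive: a nearest-neighbour model trained only on rows where
# sold_notsold = 1 can never answer anything but 1. Data is invented.

def nearest_neighbour_predict(train, query):
    """Return the label of the training row closest to query."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    features, label = min(train, key=lambda row: sq_dist(row[0], query))
    return label

# Training data filtered to sold_notsold == 1 only, as in your process:
train_only_positive = [
    ((1.0, 2.0), 1),
    ((3.0, 1.0), 1),
    ((2.0, 4.0), 1),
]

# Even rows that should clearly be "not sold" come back as 1:
for query in [(0.0, 0.0), (10.0, 10.0), (2.5, 2.5)]:
    print(nearest_neighbour_predict(train_only_positive, query))  # always 1
```

No matter how far a query row is from the training data, the only labels available to return are 1 - which matches the behavior you are seeing.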
I worked up three simple processes named LS_EXIT_KNN_RMSupportAlternate_Nr_1, LS_EXIT_KNN_RMSupportAlternate_Nr_2, and LS_EXIT_KNN_RMSupportAlternate_Nr_3.
I copied some examples from your "Supervised" Excel file and pasted them into a new Excel file that I named "Prediction_Data.xlsx". Data rows in this file do not include the "sold_notsold" field - this is the value that will be predicted based on the values in the new data rows. In real use, the "Prediction" Excel file would of course contain completely new data.
LS_EXIT_KNN_RMSupportAlternate_Nr_1 is a very stripped-down version of your use case - all of the data in the "Supervised" file is used to train a model that is then applied to the data rows in the "Prediction_Data" file. Because the training and testing data contain examples where "sold_notsold" equals 1 and others where it equals 0, the model has some basis for predicting either outcome from the input data.
If you run this process, you'll see that varying predictions are made against the data in the "Prediction_Data" Excel file, as per a normal classification use case - because the model making the predictions has seen values of both 0 and 1 for the "sold_notsold" field. I also added "Weighted Voting" and the "k-NN Kernel Type" to the list of parameters to be optimised for the k-NN operator, which can sometimes boost accuracy depending on your data. I also used the "Numerical to Binomial" operator to create a binomial label for your field to predict, which then allows the use of the Binominal Performance operator and its additional accuracy metrics such as the AUC (Area Under the Curve).
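For what it's worth, the idea behind distance-weighted voting can be sketched in a few lines of Python (toy data and an arbitrary k, purely for illustration):

```python
# Sketch of k-NN with distance-weighted voting - the idea behind the
# "Weighted Voting" parameter: the k nearest rows vote, and closer
# rows get larger votes. Data and k are invented for illustration.
import math
from collections import defaultdict

def knn_weighted(train, query, k=3):
    """Predict a label via inverse-distance-weighted voting of the
    k nearest training rows."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = defaultdict(float)
    for features, label in nearest:
        d = math.dist(features, query)
        votes[label] += 1.0 / (d + 1e-9)  # closer rows count more
    return max(votes, key=votes.get)

train = [((1, 1), 1), ((1, 2), 1), ((8, 8), 0), ((9, 8), 0), ((8, 9), 0)]
print(knn_weighted(train, (2, 1)))  # -> 1 (near the "sold" cluster)
print(knn_weighted(train, (8, 8)))  # -> 0 (near the "not sold" cluster)
```

With plain (unweighted) voting, a query sitting between clusters can be swung by a couple of far-away neighbours; weighting by inverse distance lets the truly close rows dominate, which is why toggling this parameter sometimes improves accuracy.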
The two other process files (Nr2 and Nr3) do pretty much the same things the Nr1 file does, but also add discrete binning of numeric attributes - binning helps minimise the impact of outliers, and the k-NN algorithm is sensitive to outliers.
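As a rough sketch of what binning does (equal-width bins, invented numbers): after binning, an extreme value can be at most one bin index away from everything else, so it can no longer dominate a distance calculation.

```python
# Sketch of discrete (equal-width) binning: map raw numeric values
# to a small set of bin indices so outliers stop dominating the
# distances k-NN computes. The bin count here is arbitrary.

def equal_width_bins(values, n_bins=4):
    """Map each numeric value to a bin index in 0..n_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant column
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

prices = [10, 12, 11, 13, 500]      # 500 is an extreme outlier
print(equal_width_bins(prices))     # -> [0, 0, 0, 0, 3]
```

RapidMiner's Discretize operators offer several strategies (by binning, by frequency, by size, etc.); equal-width is just the simplest to show.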
The "Nr3" file uses the "Impute Missing Values" operator to fill in missing values using Gradient Boosted Trees. I also used the "Generate Attributes" operator to map the values for "sold_notsold" to text values of "Sold" or "Not Sold" to a new field named "SoldNotSold" and defined this new field as the Label.
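The "Impute Missing Values" operator uses a learner (here Gradient Boosted Trees) to predict each missing value from the other columns. As a much simpler illustration of the same idea, this sketch (invented column names and data) fills a missing value by borrowing it from the nearest complete row:

```python
# Simplified stand-in for the "Impute Missing Values" idea: instead of
# Gradient Boosted Trees, just copy the value from the nearest row that
# has the field filled in (nearest-neighbour imputation). The column
# names and numbers below are made up.

def impute_from_nearest(rows, col):
    """Fill None entries in `col` using the same column of the closest
    complete row (squared distance over the other filled-in columns)."""
    complete = [r for r in rows if r[col] is not None]
    for r in rows:
        if r[col] is None:
            other = [k for k in r if k != col and r[k] is not None]
            def dist(c_row):
                return sum((c_row[k] - r[k]) ** 2 for k in other)
            r[col] = min(complete, key=dist)[col]
    return rows

rows = [
    {"price": 10.0, "days_listed": 5.0},
    {"price": 11.0, "days_listed": 6.0},
    {"price": None, "days_listed": 5.5},   # missing price
]
impute_from_nearest(rows, "price")
print(rows[2]["price"])
```

The operator's learner-based approach is usually stronger than this, because a model can combine several columns non-linearly - which is why imputing with boosted trees helped the k-NN accuracy in my runs.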
Using binning and boosted trees to impute missing values upped the accuracy of the k-NN model by between 7 and 8 percent over the accuracy of the k-NN without binning or imputing missing values on my system.
You might try some Evolutionary Feature Selection to find the best data fields to use, but with so many missing values it may not be too helpful. The "Explain Predictions" operator might help you understand which attributes drove the prediction in each data row - and the differences in the data between rows where "sold_notsold" = 1 and "sold_notsold" = 0.
You may also want to try clustering your data, which could help you see which differences in which data fields lead to different outcomes. You might also try some of RapidMiner's built-in chart types (especially a Parallel Chart or Andrews Curves) to further analyse the data in your "Supervised" file - you can color the data fields in a chart by the value of the "sold_notsold" field.
Last but not least, having fewer missing values would be helpful, but of course this is something that could be out of your control.
Again - perhaps I have misunderstood your RapidMiner process and/or what you want to do, so the above may not be at all helpful or relevant - but hopefully it is.
Best wishes, Michael Martin