
Issue with Supervised Learning

msacs09 Member Posts: 55 Contributor II
Experts,

Can you please help me get this to work? I'm trying to perform supervised learning on a known data set and predict the unknown. I'm attaching the sample data and my model. As you can see, I have quite a few missing values, and I have handled the missing data as well, yet my model predicts everything as positive. If you look at the sample data, I'm training the model on the 1s and predicting how many of the 0s show similar behavior. I tried different ensembles with no luck.

How do I get this to produce a reasonable output? Do I have too many variables (inputs) with missing data causing this issue, and why can't handling the missing data take care of it?

Appreciate your valuable guidance 


Thx

Answers

  • msacs09 Member Posts: 55 Contributor II
    edited November 2018
  • rfuentealba RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
    Hello @msacs09,

    What can we do for you?
    Can you please help me get this to work?
    Let's see.
    I'm trying to perform supervised learning on a known data set and predict the unknown. I'm attaching the sample data and my model.
    Ok, I am opening your spreadsheet, and it looks like it has a lot of missing information. Would it be safe if you fill the missing data with zeroes?
    As you can see, I have quite a few missing values, and I have handled the missing data as well, yet my model predicts everything as positive.
    Haven't opened your process (see below) but this thing is screaming for a time series analysis rather than a supervised algorithm. Or a collection of TS analyses.
    If you look at the sample data, I'm training the model on the 1s and predicting how many of the 0s show similar behavior. I tried different ensembles with no luck.

    How do I get this to produce a reasonable output...
    I am not familiar with what each column means. Do you mind explaining them? I know you have actual revenues for Q1, Q2, Q3 and Q4 in 2016, 2017 and 2018, but the rest is too cryptic for me :(
    Do I have too many variables (inputs) with missing data causing this issue, and why can't handling the missing data take care of it?
    I believe you have a bit of both. You might want a few processes:
    • Do whatever is in your hands to eliminate the missing values as much as possible. If that means replacing them with zeroes, fine; if it means replacing them with the minimum value, that is fine too (see the sketch after this list).
    • You have too many columns, so you are trying what I call "brute force scoring". Nope, not a good thing to do: sit down and understand how your data works before getting to model stuff. Your data has a lot of potential when handled correctly, but you have to be careful.
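
    In case it helps to see the idea outside RapidMiner, here is a minimal pandas sketch of both options; the file name and the assumption that the missing values sit in numeric columns are mine, not from the attached process:

    ```python
    import pandas as pd

    # Hypothetical file name; use your own "Supervised" spreadsheet.
    df = pd.read_excel("Supervised.xlsx")

    numeric_cols = df.select_dtypes(include="number").columns

    # Option 1: replace every missing numeric value with zero.
    df_zero = df.copy()
    df_zero[numeric_cols] = df_zero[numeric_cols].fillna(0)

    # Option 2: replace missing values with each column's minimum.
    df_min = df.copy()
    df_min[numeric_cols] = df_min[numeric_cols].fillna(df_min[numeric_cols].min())
    ```
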
    I can't open your RapidMiner process because I don't have a machine that can properly open these. My main computer is dead. :(

    All the best,

    Rodrigo.
  • M_Martin RapidMiner Certified Analyst, Member Posts: 125 Unicorn
    Hi 

    Let me start by saying that I may not have correctly understood your RapidMiner process or what you're trying to accomplish, but perhaps the following will be helpful.

    Your "Supervised" data set has a column "sold_notsold" which if I understand you correctly, you want to predict the value of - this would be what RapidMiner calls the Label field.

    In your "Supervised" Excel file, all rows for "sold_notsold" are coded with a value of 1 or 0.  As your file name suggests, your use case looks like supervised learning, where data rows in an example file are coded with values of the field you want to predict.  One would then build a model based on this data and use the model to generate predictions on data it hasn't seen before - data that would come from another input source and be fed to your trained model.  Again, I'm assuming a value of 1 = sold, a value of 0 = not sold, and that "sold_notsold" is the attribute (field) you want to predict.

    In your process, the data is filtered - which means the model is built only on data where the "sold_notsold" field (what you want to predict) equals 1.  This results in a model that has no basis or experience to classify new examples as anything other than 1 (positive) - because it has never seen any negative (0 or false) examples.  Your process then feeds data rows where "sold_notsold" equals 0 to this model - but again, since the model was trained on data where "sold_notsold" only equals 1, it cannot distinguish between rows where "sold_notsold" = 1 and rows where "sold_notsold" = 0.  It only knows that "sold_notsold" = 1.
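
    To make the effect of that filter concrete, here is a small scikit-learn sketch; the file name, column name and the use of k-NN are assumptions taken from the description above, not from the actual process:

    ```python
    import pandas as pd
    from sklearn.neighbors import KNeighborsClassifier

    # Assumed file and column names, based on the description above.
    df = pd.read_excel("Supervised.xlsx").fillna(0)
    X = df.drop(columns="sold_notsold").select_dtypes(include="number")
    y = df["sold_notsold"]

    # What the filtered process effectively does: train only on rows with label = 1.
    mask = y == 1
    model_one_class = KNeighborsClassifier().fit(X[mask], y[mask])
    # This model has never seen a 0, so every prediction it makes is 1.

    # Training on both classes gives the model a basis to predict either outcome.
    model_both = KNeighborsClassifier().fit(X, y)
    ```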
      
    I worked up three simple processes named LS_EXIT_KNN_RMSupportAlternate_Nr_1, LS_EXIT_KNN_RMSupportAlternate_Nr_2, and LS_EXIT_KNN_RMSupportAlternate_Nr_3.

    I copied some examples from your "Supervised" Excel file and pasted them into a new Excel file that I named "Prediction_Data.xlsx".  Data rows in this file do not include the "sold_notsold" field - this is the value that will be predicted from the values in the new data rows.  In practice, the "Prediction_Data" Excel file would contain completely new data.

    LS_EXIT_KNN_RMSupportAlternate_Nr_1 is a very stripped-down version of your use case, where all of the data in the "Supervised" file is used to train a model that is then applied to the data rows in the "Prediction_Data" file.  Because the training and testing data contains examples where "sold_notsold" equals 1 and examples where it equals 0, the model has some basis for predicting either outcome from the input data.

    If you run this process, you'll see that varying predictions are made against the data in the "Prediction_Data" Excel file, as per a normal classification use case - because the model making the predictions has seen values of 0 and 1 for the "sold_notsold" field.  I also added "Weighted Voting" and the "k-NN Kernel Type" to the list of parameters to be optimised for the k-NN operator, which can sometimes boost accuracy depending on your data.  I also used the "Numerical to Binomial" operator to create the binomial label for the field to predict, which then allows the use of the binomial performance operator and the other useful accuracy metrics it provides, such as the AUC (Area Under the Curve).
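
    If it helps to see the same idea outside RapidMiner, here is a rough scikit-learn equivalent of tuning the k-NN, checking the AUC and then scoring the unlabeled rows; the file names, column names and parameter ranges are my own guesses:

    ```python
    import pandas as pd
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import roc_auc_score

    # Assumed file and column names.
    df = pd.read_excel("Supervised.xlsx").fillna(0)
    X = df.drop(columns="sold_notsold").select_dtypes(include="number")
    y = df["sold_notsold"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    # Roughly analogous to optimising k and the voting scheme on the k-NN operator.
    param_grid = {"n_neighbors": list(range(1, 21)), "weights": ["uniform", "distance"]}
    search = GridSearchCV(KNeighborsClassifier(), param_grid, scoring="roc_auc", cv=5)
    search.fit(X_train, y_train)

    # AUC on held-out rows, comparable to the binomial performance output.
    print("AUC:", roc_auc_score(y_test, search.predict_proba(X_test)[:, 1]))

    # Apply the tuned model to the unlabeled rows (the "Prediction_Data" file).
    new_rows = pd.read_excel("Prediction_Data.xlsx").fillna(0)[X.columns]
    print(search.predict(new_rows))
    ```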

    The two other process files (Nr2 and Nr3) do pretty much the same things the Nr1 file does, but also add discrete binning of the numeric metrics, as binning helps minimise the impact of outliers - and the k-NN algorithm is sensitive to outliers.
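
    A rough equivalent of the binning step outside RapidMiner, with an arbitrary bin count and assumed names, might look like this:

    ```python
    import pandas as pd
    from sklearn.preprocessing import KBinsDiscretizer

    # Assumed file and column names.
    df = pd.read_excel("Supervised.xlsx").fillna(0)
    X = df.drop(columns="sold_notsold").select_dtypes(include="number")

    # Quantile binning flattens extreme values into the top/bottom bins,
    # which blunts the effect of outliers before a distance-based learner like k-NN.
    binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
    X_binned = binner.fit_transform(X)
    ```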

    The "Nr3" file uses the "Impute Missing Values" operator to fill in missing values using Gradient Boosted Trees.  I also used the "Generate Attributes" operator to map the values for "sold_notsold" to text values of "Sold" or "Not Sold" to a new field named "SoldNotSold" and defined this new field as the Label. 

    Using binning and boosted trees to impute missing values raised the accuracy of the k-NN model by between 7 and 8 percent over the accuracy of the k-NN without binning or imputing missing values on my system.

    You might try some Evolutionary Feature Selection to find the best data fields to use, but as there are so many missing values it may not be too helpful.  The "Explain Predictions" operator might help you understand which data attributes drove the predictions in each data row - and the differences in the data between rows where "sold_notsold" = 1 and rows where "sold_notsold" = 0.
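
    scikit-learn has no evolutionary selector or "Explain Predictions" operator, but the same two questions can be approximated, for example with forward feature selection and permutation importance (a sketch with assumed file and column names):

    ```python
    import pandas as pd
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.inspection import permutation_importance
    from sklearn.neighbors import KNeighborsClassifier

    df = pd.read_excel("Supervised.xlsx").fillna(0)  # assumed file name
    X = df.drop(columns="sold_notsold").select_dtypes(include="number")
    y = df["sold_notsold"]

    # Forward selection: greedily pick a small subset of informative attributes.
    selector = SequentialFeatureSelector(
        KNeighborsClassifier(), n_features_to_select=5, scoring="accuracy", cv=5)
    selector.fit(X, y)
    print("Selected:", list(X.columns[selector.get_support()]))

    # Permutation importance: which attributes the fitted model actually relies on.
    model = KNeighborsClassifier().fit(X, y)
    result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
    print(pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False))
    ```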

    You may also want to try clustering your data, which may help you see which differences in which data fields lead to different outcomes.  You might also try using some of RapidMiner's own built-in chart types (especially a Parallel Chart or Andrews Curves) to further analyse the data in your "Supervised" file - you can color the data fields in a chart by the value of the "sold_notsold" field.
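
    A quick clustering cross-check along the same lines, with an arbitrary cluster count and assumed names, could be:

    ```python
    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    df = pd.read_excel("Supervised.xlsx").fillna(0)  # assumed file name
    X = df.drop(columns="sold_notsold").select_dtypes(include="number")
    y = df["sold_notsold"]

    # Cluster the scaled inputs, then compare cluster membership with the label
    # to see whether sold and not-sold rows fall into different groups.
    clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(
        StandardScaler().fit_transform(X))
    print(pd.crosstab(clusters, y))
    ```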

    Last but not least, having fewer missing values would be helpful, but of course this is something that could be out of your control.

    Again - perhaps I have misunderstood your RapidMiner process and / or what you want to do, so the above may not be at all helpful or relevant - but hopefully it is.

    Best wishes,  Michael Martin
  • M_Martin RapidMiner Certified Analyst, Member Posts: 125 Unicorn
    Hi again: Here are the attachments mentioned in my previous post - somehow they were not included.  Best wishes, Michael Martin
  • msacs09 Member Posts: 55 Contributor II
    Thank you @rfuentealba and @M_Martin - very much appreciated. I just came back to this topic after a bit of a hiatus. Thank you once again.