How to find the most important features in a dataset?

Christos_Karapapas Member Posts: 25 Contributor II
I have a dataset in CSV format with more than 500 columns. I imported it into a database and marked every column as polynominal, since the columns hold different types of information. Now I want to find which of them are the most important.

So far, I have managed to get a table of features and their weights using the Weight by "X" operators, but in the results every feature-value pair appears on its own row. What I want instead is to aggregate by feature and get a single weight for each one. I tried the Aggregate operator, but with no luck.

As an example, this is what I get:
feature01-value05, weight:0,71
feature01-value13, weight:0,69
feature09-value03, weight:0,55

Instead I want something like this:
feature01, weight:0,7
feature09, weight:0,55
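For illustration only, the aggregation described above can be sketched outside RapidMiner in a few lines of Python. This is a minimal sketch under stated assumptions: the row names follow the `feature-value` pattern from the example, feature names themselves contain no `-`, and the per-feature weight is taken as the mean of its value weights (which matches the 0,71 and 0,69 → 0,7 example). The function name `aggregate_weights` is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def aggregate_weights(rows, sep="-"):
    """Collapse (feature-value, weight) rows into one mean weight per feature.

    Assumes the feature name is everything before the first separator.
    """
    grouped = defaultdict(list)
    for name, weight in rows:
        feature = name.split(sep, 1)[0]  # "feature01-value05" -> "feature01"
        grouped[feature].append(weight)
    return {feature: mean(weights) for feature, weights in grouped.items()}


# Rows copied from the example above (decimal commas written as points).
rows = [
    ("feature01-value05", 0.71),
    ("feature01-value13", 0.69),
    ("feature09-value03", 0.55),
]
per_feature = aggregate_weights(rows)
for feature, weight in sorted(per_feature.items()):
    print(f"{feature}, weight:{weight:.2f}")
```

Using the mean is a design choice; taking the maximum per feature would be an equally defensible alternative if a single strong value should dominate.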

Best Answer

  • Christos_Karapapas Member Posts: 25 Contributor II
    Solution Accepted
    Thank you so much Lionel! 

    I finally managed to figure it out. I was getting an ArrayIndexOutOfBoundsException from the Weight by Information Gain operator because of some missing values in my dataset, so I tried various (wrong) operators to work around the problem. One of those was Nominal to Numerical, which apparently caused the one-row-per-feature-value behavior. Once I replaced it with the Replace Missing Values operator (obviously the right one for this job), everything worked as expected.
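To make the fix concrete, here is a small Python sketch of the same idea: fill the missing values first, then compute information gain on the original nominal column, which yields one weight per feature. This is not RapidMiner's implementation. Filling with the most frequent value is an assumption, the toy data is made up, and the helper names (`fill_missing_with_mode`, `information_gain`) are hypothetical.

```python
from collections import Counter
from math import log2


def fill_missing_with_mode(values, missing=None):
    """Replace missing entries with the most frequent non-missing value."""
    present = [v for v in values if v is not missing]
    mode = Counter(present).most_common(1)[0][0]
    return [mode if v is missing else v for v in values]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Label entropy minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Made-up nominal column with one missing value, plus a binary label.
outlook = ["sunny", "sunny", None, "rain", "rain", "rain"]
play = ["no", "no", "no", "yes", "yes", "yes"]

cleaned = fill_missing_with_mode(outlook)
gain = information_gain(cleaned, play)
print(f"outlook, weight:{gain:.3f}")  # one weight for the whole feature
```

Because the column stays nominal, the result is a single weight for `outlook` rather than one weight per dummy-coded value.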

Answers

  • lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi @chris_skg,

    I'm not able to reproduce the results you obtained.
    Here are the results I get by applying the Weight by Information Gain operator to the Golf dataset:

    [screenshot of the results]

    So that we can reproduce what you observe and understand what's going on, can you please share:
     - your XML process or your file process (.rmp file)
     - your data

    Regards,

    Lionel


  • lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    OK, @chris_skg,

    Glad that you finally found a solution!

    Regards,

    Lionel