How to find the most important features in a dataset?
Christos_Karapapas
Member Posts: 25 Contributor II
in Help
I have a dataset in CSV format with more than 500 columns. I imported it into a database, marking every column as polynominal since they all hold different types of information, and now I want to find which of them are the most important.
So far, I have managed to get a table with each feature and its weight using a "Weight by X" operator, but the problem is that in the results every feature-value pair appears separately on its own row. What I want instead is to aggregate by feature and have a single weight for each one. I tried using the Aggregate operator, but with no luck.
As an example, this is what I get:
feature01-value05, weight:0,71
feature01-value13, weight:0,69
feature09-value03, weight:0,55
Instead I want something like this:
feature01, weight:0,7
feature09, weight:0,55
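For readers who want to see the requested aggregation spelled out, here is a minimal sketch in Python rather than RapidMiner. It assumes the row names follow the "feature-value" pattern shown above, that weights use a decimal comma, and that the per-feature weight is the mean of its per-value weights (which matches the 0,71 and 0,69 to 0,7 example). The names `rows` and `aggregate_weights` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows mirroring the example in the post:
# "<feature>-<value>" names and weights written with a decimal comma.
rows = [
    ("feature01-value05", "0,71"),
    ("feature01-value13", "0,69"),
    ("feature09-value03", "0,55"),
]

def aggregate_weights(rows):
    """Average the per-value weights of each feature (assumed aggregation)."""
    grouped = defaultdict(list)
    for name, weight in rows:
        # Drop the trailing "-valueNN" part to recover the feature name.
        feature = name.rsplit("-", 1)[0]
        grouped[feature].append(float(weight.replace(",", ".")))
    return {feature: round(mean(weights), 2) for feature, weights in grouped.items()}

feature_weights = aggregate_weights(rows)

# Print features from most to least important.
for feature, weight in sorted(feature_weights.items(), key=lambda kv: kv[1], reverse=True):
    print(feature, weight)
```

Taking the maximum instead of the mean would be an equally defensible choice; the post does not say which one is intended.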
Best Answer
Christos_Karapapas Member Posts: 25 Contributor II
Thank you so much Lionel!
I finally managed to figure it out. I was getting an ArrayIndexOutOfBoundsException from the Weight by Information Gain operator because of some missing values in my dataset, so I tried various (wrong) operators to work around the problem. One of those was Nominal to Numerical, which apparently caused this behavior. Once I replaced it with the Replace Missing Values operator (obviously the right one for this job), everything worked as expected.
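The fix described above can be illustrated outside RapidMiner with a small Python sketch: fill the missing values first, then compute information gain on each whole nominal column, which gives one weight per feature. This is not RapidMiner's implementation. Filling with the most frequent value is an assumption, and the toy data, `fill_missing_with_mode` and `information_gain` are hypothetical names made up for the example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def fill_missing_with_mode(column):
    """Replace None with the most frequent value (assumes at least one value is present)."""
    present = [v for v in column if v is not None]
    mode = Counter(present).most_common(1)[0][0]
    return [mode if v is None else v for v in column]

def information_gain(column, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the column."""
    total = len(labels)
    remainder = 0.0
    for value in set(column):
        subset = [lab for v, lab in zip(column, labels) if v == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical toy data: two nominal features with missing values and a label.
data = {
    "outlook": ["sunny", "sunny", "overcast", "rain", None, "rain", "overcast", "sunny"],
    "windy": ["no", "yes", "no", "no", "no", "yes", None, "no"],
}
labels = ["no", "no", "yes", "yes", "yes", "no", "yes", "no"]

# One weight per feature, because each nominal column is scored as a whole.
weights = {
    name: information_gain(fill_missing_with_mode(column), labels)
    for name, column in data.items()
}
for name, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(name, round(weight, 3))
```

Converting nominal columns to one numeric column per value before weighting would instead yield one score per feature-value pair, which resembles the output shown in the question.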
Answers
I'm not able to get the results you obtained...
Here are the results I get by applying the Weight by Information Gain operator to the Golf dataset:
So that we can reproduce what you observe and understand what's going on, can you please share:
- your XML process or your file process (.rmp file)
- your data
Regards,
Lionel
Glad that you finally found a solution!
Regards,
Lionel