Related attributes
Hello everyone,
For the following task I could really use some tips or hints from other miners.
My dataset does not satisfy the rule of thumb that the number of samples should be over 10 times the number of variables per sample and the number of choices of the label variable. In other words...
I have only very few samples and a lot of variables per sample, but I would still like to do something useful with the data set, namely the following:
- Try to find which variables (attributes) carry related information and, if possible, show this in some kind of graphical manner
- From the remaining, unrelated variables, try to find a few combinations that best describe a model for the label variable
Of course, since the number of samples is so low, there could be many such combinations for the latter. The attributes are not all of the same type (some are binary, some are numeric, some are text).
A little about the data set:
Suppose I am a car maker with 6 car models, some of which have design flaws that I would like to find. I try to parametrize each design as a set of variables (attributes; here there are only 6 (plus CarModel, which shouldn't be used for mining), but imagine there were 300 attributes).
CarModel | WheelSize | WheelBrand | EngineType | EnginePower | EngineBrand | Failure (label)
Corvega | 18 | Brimstone | Nitro | 6GigaWatt | RollsDavidson | CarExploded
A little about the ways I tried to do this before:
- To see which attributes are directly related to the label attribute, I used a correlation matrix. I then looked at all variables whose correlation is (in an absolute sense) closest to 1 and considered those the important attributes. The drawback is that I could not look at combinations of attributes (see the sketch after this list).
- In parallel I tried to build a decision tree. The problem with this approach was that there were many possibilities, and the program simply took the first attribute in the data set on which it could classify well. So I removed that attribute to see which attribute came next, and looked at the model again.
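For readers who think in code rather than in RapidMiner operators, here is a minimal Python/pandas sketch of the correlation idea described in the first bullet. The data and column names are invented to match the toy car example, and only numeric attributes are included, since Pearson correlation needs numbers:

```python
import pandas as pd

# Invented toy data mirroring the car example above (numeric columns only)
df = pd.DataFrame({
    "WheelSize":   [18, 16, 18, 20, 16, 18],
    "EnginePower": [6.0, 4.5, 6.0, 8.0, 4.5, 5.0],
    "Failure":     [1, 0, 1, 1, 0, 0],   # 1 = CarExploded, 0 = ok
})

# Correlate every attribute with the label and rank by absolute value;
# attributes with |r| closest to 1 are the "important" ones in this view.
corr_with_label = df.corr()["Failure"].drop("Failure")
print(corr_with_label.abs().sort_values(ascending=False))
```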
Could anyone please give hints on how to approach this problem better than I did before?
Thank you!
Answers
This is a common but still complex problem. I doubt we can discuss all possible ways here. In RapidMiner there are many useful tools for solving this problem:
You could use one of the more heuristic attribute weighting schemes, or a wrapper-based approach using forward or backward selection with an integrated learning algorithm (see the sketch below).
You could even select subsets at random inside a loop and try to draw conclusions from the random results.
You will have to experiment with combinations of these possibilities to find the one most suitable for your (very) specific task. For orientation, there are many samples for weighting and selecting attributes in the Sample repository.
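As a rough code analogue of the wrapper-based forward/backward selection mentioned above (not the RapidMiner process itself), here is a scikit-learn sketch; the breast-cancer dataset is just a placeholder:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Greedily add (or, with direction="backward", remove) the attribute that
# most improves a cross-validated decision tree -- the "wrapper" idea.
selector = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=0),
    n_features_to_select=5,   # keep a small subset
    direction="forward",      # or "backward" for backward elimination
    cv=5,                     # inner cross-validation inside the wrapper
)
selector.fit(X, y)
print(selector.get_support(indices=True))  # indices of selected attributes
```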
Greetings,
Sebastian
With "Samples", do you mean the community expansion, or is there another location with samples that I am not yet aware of?
Which of the many possible approaches would you recommend yourself? (If possible, a web link to an example or a search term I could use in the community expansion would make things much easier for me.)
Take a look at the Sample Repository that is delivered with RapidMiner. There are several examples for this!
Greetings,
Sebastian
Thank you for the helpful advice; I found the samples (and browsed the community processes). Now I have the following question:
I managed to assign a weight to the attributes (Weight by Information Gain), and I can cap off the data set with Select by Weights and then send the result to a decision tree. Now I wonder whether there is also an operator that can order the data set by weight instead of capping it off (like Select by Weights does), so that my decision tree encounters the most relevant attribute first, then the next, etc.
Does such an operator ("re-order by weight") exist, and if so, what is its name? (I tried to find one by typing "weight" into the operator search, among other things.)
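For reference, the "weight by information gain, then select by weight, then decision tree" chain described above looks roughly like this in scikit-learn; mutual information stands in for information gain here, and the dataset is a placeholder:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(
    # Keep only the 5 attributes with the highest mutual-information weight
    SelectKBest(mutual_info_classif, k=5),
    DecisionTreeClassifier(random_state=0),
)
pipe.fit(X, y)
print(pipe.score(X, y))  # training accuracy, just to show the chain runs
```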
What you suggest does not make sense: the Decision Tree will look at ALL attributes, regardless of their ordering.
Greetings,
Sebastian
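A quick way to convince yourself of Sebastian's point: train the same tree on the same data with the columns shuffled. The learner evaluates every attribute at every split, so the result does not depend on column order. The iris dataset here is a placeholder:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
perm = rng.permutation(X.shape[1])  # shuffle the attribute order

tree_a = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_b = DecisionTreeClassifier(random_state=0).fit(X[:, perm], y)

# Both full-depth trees fit the training data exactly, so their training
# predictions coincide: the column order changed nothing.
print((tree_a.predict(X) == tree_b.predict(X[:, perm])).all())
```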