"Beginner Machine Laerning Question"
Ghostrider
Member Posts: 60 Contributor II
Say I want to predict the price of an automobile based on attributes of the automobile. Assume that I know things such as tire size, date of manufacture, number of doors, etc. I could throw all these attributes into a decision tree learner and hope to find some relation to the cost of the car. But can I get a better result by using relations among the attributes that I already know about? For example, assume that I don't know how much horsepower the engine produces, but I do know attributes that correlate with the engine's horsepower, such as the engine displacement, number of cylinders, and number of gears in the transmission. Although I don't know the horsepower, assume that I can roughly calculate it from these parameters.

Doesn't it make more sense to isolate these attributes from the other attributes and use them exclusively to build a model for engine horsepower, which can then be supplied to a higher-level learner that tries to figure out how horsepower and other factors affect an automobile's price? Obviously, if I have no idea how the attributes relate, it's probably better to just supply them all to one learning algorithm. But if I do know something about the relations among certain attributes, it seems like a better approach to isolate the attributes into groups, build a model for what each group represents, and then use these sub-models to train another model. This would be like a hierarchy of learning: from detailed attributes (number of cylinders, engine displacement, gears in the transmission) to higher-level attributes (horsepower, torque), and finally to the price of the auto from these higher-level attributes (horsepower, quality of interior, car maker's reputation, etc.).

First question: is this a good approach? The idea is to use relationships that I already know about to direct the learning process. Second question: what if I don't know how to calculate horsepower from those low-level attributes, and only know that those attributes are related?
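Here's roughly what I mean in code (a Python/scikit-learn sketch for illustration only, since this is really a RapidMiner question; every column name is made up, and it assumes horsepower labels are available to train the sub-model):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
# Hypothetical data: every column name here is made up for illustration.
cars = pd.DataFrame({
    "displacement": rng.uniform(1.0, 6.0, n),
    "cylinders": rng.integers(3, 13, n),
    "transmission_gears": rng.integers(4, 9, n),
    "interior_quality": rng.uniform(0, 10, n),
    "maker_reputation": rng.uniform(0, 10, n),
})
# Pretend horsepower and price labels exist (synthetic stand-ins).
cars["horsepower"] = (60 * cars["displacement"] + 5 * cars["cylinders"]
                      + rng.normal(0, 10, n))
cars["price"] = (80 * cars["horsepower"] + 900 * cars["interior_quality"]
                 + 700 * cars["maker_reputation"] + rng.normal(0, 2000, n))

# Level 1: the sub-model predicts horsepower from engine attributes only.
engine_cols = ["displacement", "cylinders", "transmission_gears"]
hp_model = LinearRegression().fit(cars[engine_cols], cars["horsepower"])
cars["horsepower_est"] = hp_model.predict(cars[engine_cols])

# Level 2: the price model uses the sub-model's output plus the other qualities.
price_cols = ["horsepower_est", "interior_quality", "maker_reputation"]
price_model = RandomForestRegressor(random_state=0).fit(cars[price_cols],
                                                        cars["price"])
```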
Answers
If you are predicting the price as a numeric amount, that is a regression problem, so you can only use regression learners, not e.g. (classification) decision trees. Of course, you could put your numeric target variable into classes like "< 15,000", "15,000 - 30,000", etc., so that you have a classification problem and can use most learners.
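That binning could be sketched like this (pandas, with made-up bin edges, continuing the hypothetical cars frame from the sketch above):

```python
import pandas as pd

# Made-up bin edges; 'cars' is the hypothetical frame from the first sketch.
bins = [0, 15_000, 30_000, float("inf")]
labels = ["< 15,000", "15,000 - 30,000", "> 30,000"]
cars["price_class"] = pd.cut(cars["price"], bins=bins, labels=labels)
# With 'price_class' as the label, most learners (decision trees included) apply.
```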
Your idea of predicting additional attributes with sub-models is interesting. Would you need those attributes later? If not, you can always experiment with variable selection or use a learner that selects the best variables itself.
It is never certain that additional attributes will help; one can only try. In your example, if you have attributes that correlate with higher horsepower, and cars with more horsepower are pricier, those attributes will have a positive effect anyway. You would be predicting one helper attribute and possibly introducing noise into the model, or just redundancy.
With the "Select Subprocess" operator, you can always create alternative paths in your process and put that e.g. into a parameter optimization in order to see how your submodels perform versus no submodels.
Welcome to the forums!
If I know that two or more attributes are related to each other, but are completely independent of the rest of the attributes, I'd like to isolate those attributes from the others to help guide the learning algorithm (and reduce the complexity of the problem through a divide-and-conquer approach). I think it might be possible to do this by training a model that takes input from a sub-model, but it's not clear how the sub-model(s) would be trained without having labels for the sub-models.
Another example that I thought of while watching the Neural Market Trend tutorials linked from the RM homepage: often in predicting time series, preceding days are simply treated as additional attributes using the time series window operator -- one attribute will be the current value, another attribute will be the value from the previous example, and another attribute will be the value from two examples ago. But doing so seems like such a waste. If I were asking another human to look for trends in the data, it would certainly be useful to know that attribute 1 was taken on Wednesday, attribute 2 on Tuesday, and attribute 3 on Monday, rather than essentially telling the learning algorithm, "here are 3 values from this example, look for a pattern".
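For reference, that windowing plus the day-of-week knowledge could be expressed like this (a pandas sketch with a synthetic daily series; in RapidMiner the windowing operator produces the lag attributes):

```python
import numpy as np
import pandas as pd

# Synthetic daily series, just for illustration.
dates = pd.date_range("2010-01-04", periods=30, freq="D")
ts = pd.DataFrame({"value": np.random.default_rng(0).normal(size=30).cumsum()},
                  index=dates)

# Windowing: previous values become extra attributes of the current example.
ts["lag_1"] = ts["value"].shift(1)  # value from the previous example
ts["lag_2"] = ts["value"].shift(2)  # value from two examples ago

# Telling the learner *which* day each value came from.
ts["weekday"] = ts.index.day_name()
```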
Point is, is there some way we can use knowledge about the problem to guide and improve the efficiency of the learning process? If so, are there books or good references describing such techniques? As a newbie to data mining, I think I'd really benefit.
In RapidMiner, just create a copy of your date attribute with "Generate Copy" and then extract the desired property by converting the new attribute with "Date to Nominal". Of course, this only works if you can express this knowledge in the data or in the process.
Examples:
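For instance, a pandas analog of those two operators (assuming a hypothetical purchase_date column) would be:

```python
import pandas as pd

# Analog of Generate Copy + Date to Nominal: keep the raw date and derive
# nominal attributes from a copy ('purchase_date' is a hypothetical column).
dates = pd.to_datetime(cars["purchase_date"])
cars["purchase_weekday"] = dates.dt.day_name()
cars["purchase_month"] = dates.dt.month_name()
```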
Above is an image of my idea. Horsepower, manufacturer's image, and interior quality are all qualities that determine the cost of a car, and each has attributes which determine the magnitude of each of these 3 qualities. The question I have is: is there any advantage to separating the 3 groups of attributes (assume that I know Att1, Att2, and Att3 are only good for predicting horsepower and have no correlation with the other two categories, mfg. image and interior quality), or would it be just as well to feed them all into the Price of Car model directly? It seems like the learning algorithm would have an easier time in the first case.
If you work with a two-level model like you described, you either need to find an algorithmic approach (generating attributes as described earlier) or gather "label" data for the sub-models, train those sub-models, and then integrate their predictions as additional attributes for the "big" model. But there is always the danger of introducing more noise into the model with this approach.
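Wired up in scikit-learn, that could look as follows (building on the first sketch; using out-of-fold predictions for the sub-model is my addition, as a precaution against the sub-model leaking its own training rows into the big model):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

# Out-of-fold horsepower estimates from the sub-model ('cars' and
# 'engine_cols' as in the first sketch; the out-of-fold step is an
# extra precaution, not part of the original suggestion).
cars["horsepower_oof"] = cross_val_predict(
    LinearRegression(), cars[engine_cols], cars["horsepower"], cv=5)

# The "big" model gets the sub-model's prediction as an additional attribute.
big_cols = ["horsepower_oof", "interior_quality", "maker_reputation"]
big_model = RandomForestRegressor(random_state=0).fit(cars[big_cols],
                                                      cars["price"])
```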
Try visualising your data with the target attribute (price) against the different attributes. If the attributes neatly separate the cases into clusters of differently-coloured objects in the graph, the learning algorithms should be able to do that, too.
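A minimal matplotlib version of that check, using the hypothetical columns and price classes from the sketches above:

```python
import matplotlib.pyplot as plt

# Price against one attribute, coloured by the price class created earlier;
# visible clusters suggest a learner can separate them, too.
for cls, grp in cars.groupby("price_class", observed=True):
    plt.scatter(grp["displacement"], grp["price"], label=str(cls), s=12)
plt.xlabel("displacement")
plt.ylabel("price")
plt.legend()
plt.show()
```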
Just try a few learners in a cross-validation (the X-Validation operator) and see how they perform. If their performance is too bad, you can start building more complex models until you get the desired accuracy (if that is possible at all).
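A sketch of that learner comparison (same hypothetical data; X-Validation in RapidMiner corresponds to cross_val_score here):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Compare a few regression learners by cross-validated mean absolute error.
for learner in (LinearRegression(),
                KNeighborsRegressor(),
                RandomForestRegressor(random_state=0)):
    scores = cross_val_score(learner, cars[price_cols], cars["price"],
                             scoring="neg_mean_absolute_error", cv=5)
    print(f"{type(learner).__name__}: MAE {-scores.mean():.0f}")
```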