Strategy for analysis of multivariate numerical data (novice)
How could I estimate the value of a ("class") variable based on the values of about 8-10 other related variables? I have some missing data in each of the 8 variables (from as little as 1% up to 15%), and the class variable itself has a value in only about 10 of every 8000 rows.
The data are numeric, well-log data from an antique geophysical survey down a series of boreholes (many boreholes). A peek at the data from one well looks like this (TC is the class variable):
[tt]DEPTH CALI DEN GR LAT LN NEUT SN SP TC
94.7927 79.8064 109.3991 40.0754 125.7779 58.2112 628.632 36.54 33.4619 1.60[/tt]
I have a niche piece of software for mineral exploration data analysis (using the SOM technique as its core 'clustering' method), and have tried to learn the fundamentals of the underlying methodology, though I am no statistician. The implementation is something of a black box to me and I am reliant on a single point of contact regarding its use, so I would like to have some other way of looking at the data and the problem. I am a complete novice with RapidMiner and am looking for some help to get started with it (and, by proxy, with some of the algorithms it uses).
More detail (can skip this next bit):
This is part of a larger research project I am undertaking. The essential method of the software I have is imputation of the class variable following grouping/clustering of the data. The well logs are, of course, responding to physical features of the rocks in the borehole, so I also wish to use this feature of the data to explore other means of estimating the TC variable. For example, unsupervised clustering should identify rock types based on related physical responses recorded in the well logs. If I match these with qualitative descriptions, I can estimate unknown variables from global or regional observations. Though the more I say about it, the more I might be influencing your thoughts.
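To make the cluster-then-impute idea concrete, here is a minimal Python sketch. It is not the SOM method the software uses: plain k-means stands in for it as an assumption, the function names `kmeans` and `impute_by_cluster` are made up for illustration, and the numbers are invented rather than taken from the well logs.

```python
def kmeans(points, k, iters=20):
    """Very small k-means; returns the cluster index assigned to each point.

    Assumption: centres start at the first k points so the result is repeatable.
    """
    centers = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centre (squared Euclidean distance).
        assign = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each centre to the mean of its members (empty clusters stay put).
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def impute_by_cluster(assign, tc):
    """Fill missing TC values (None) with the mean TC of the same cluster."""
    known = {}
    for a, t in zip(assign, tc):
        if t is not None:
            known.setdefault(a, []).append(t)
    means = {a: sum(v) / len(v) for a, v in known.items()}
    # A cluster with no known TC at all cannot be filled and stays None.
    return [t if t is not None else means.get(a) for a, t in zip(assign, tc)]


# Invented two-log readings forming two obvious groups, with sparse TC values.
logs = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
tc = [1.0, 3.0, None, 1.2, None, 3.2]
filled = impute_by_cluster(kmeans(logs, k=2), tc)
```

With so few known TC values, many clusters may contain none at all, so in practice the number of clusters would have to be small enough that most of them include at least one measured value.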
Answers
1. Handle missing values: replace them with the minimum, maximum, or average of the attribute (or with 0).
2. Then apply linear regression and see how it performs.
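Outside RapidMiner, the two steps above can be sketched in plain Python as follows. This is only an illustration under assumptions: the helper names (`impute_mean`, `fit_linear`, `predict`) are hypothetical, missing readings are represented as `None`, the average is used as the replacement value, and the data are made up.

```python
from statistics import mean


def impute_mean(rows):
    """Replace None in each column with the mean of that column's observed values.

    Raises statistics.StatisticsError if a column has no observed values at all.
    """
    cols = list(zip(*rows))
    means = [mean(v for v in col if v is not None) for col in cols]
    return [[m if v is None else v for v, m in zip(row, means)] for row in rows]


def fit_linear(X, y):
    """Ordinary least squares with an intercept, via the normal equations.

    Returns [intercept, coef_1, ..., coef_n]. Assumes the columns are not
    perfectly collinear (otherwise a pivot is zero and the division fails).
    """
    A = [[1.0] + list(row) for row in X]
    n = len(A[0])
    # Augmented matrix [A'A | A'y].
    M = [
        [sum(a[i] * a[j] for a in A) for j in range(n)]
        + [sum(a[i] * t for a, t in zip(A, y))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [M[i][n] for i in range(n)]


def predict(coef, row):
    """Apply the fitted coefficients to one row of attribute values."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))


# Made-up rows: two log readings each, None marks a missing reading.
X_raw = [[0.0, 0.0], [1.0, None], [0.0, 1.0], [None, 1.0], [2.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0, 8.0]
coef = fit_linear(impute_mean(X_raw), y)
```

To see how it performs, compare `predict(coef, row)` with known values on rows that were held back from the fit rather than on the rows used for training.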
Other ways: discretize your dataset, then:
1. Try Naive Bayes.
2. Try Decision Trees.
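As a rough sketch of the discretize-then-classify route, the Python below bins numeric values into equal-width intervals and fits a small categorical Naive Bayes. It is an illustration under assumptions: equal-width binning is only one possible discretization, the function names are hypothetical, the target is assumed to have been binned into labels such as "low" and "high", and the data are invented.

```python
import math
from collections import Counter, defaultdict


def equal_width_bins(values, k):
    """Map each value to a bin index 0..k-1 using k equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # avoid a zero width when all values are equal
    return [min(int((v - lo) / width), k - 1) for v in values]


def train_nb(X, y, alpha=1.0):
    """Count class and per-attribute bin frequencies; alpha is Laplace smoothing."""
    class_counts = Counter(y)
    feat_counts = defaultdict(Counter)  # (class, attribute index) -> bin counts
    n_values = [len(set(col)) for col in zip(*X)]
    for row, label in zip(X, y):
        for j, v in enumerate(row):
            feat_counts[(label, j)][v] += 1
    return class_counts, feat_counts, n_values, alpha


def predict_nb(model, row):
    """Return the class with the highest log-posterior for one binned row."""
    class_counts, feat_counts, n_values, alpha = model
    total = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for label, n_c in class_counts.items():
        lp = math.log(n_c / total)
        for j, v in enumerate(row):
            count = feat_counts[(label, j)][v]
            lp += math.log((count + alpha) / (n_c + alpha * n_values[j]))
        if lp > best_lp:
            best, best_lp = label, lp
    return best


# Invented example: two already-binned attributes and a binned target.
X_binned = [[0, 0], [0, 1], [1, 1], [1, 0]]
y_binned = ["low", "low", "high", "high"]
model = train_nb(X_binned, y_binned)
```

Binning the target turns the estimation problem into classification, so the prediction is a range of TC rather than a single number.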
Good luck.
Cheers,
Venki
I have more questions... very basic...
1. I think I apply the model by connecting the model output of the learner operator (e.g. 'Linear Regression') to the model input of the 'Apply Model' operator, and the 'exampleset' output of the Linear Regression operator to the unlabelled data input of Apply Model. Is this correct? I have added XML below for clarity.
2. I am confused as to how to handle missing values in my target attribute.
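One common way to think about both questions is to train only on rows where the target is present and then apply the fitted model to the rows where it is missing. The Python sketch below illustrates that split outside RapidMiner. It makes several assumptions: a missing target is stored as `None`, a single attribute with a straight-line fit stands in for whatever learner is used, the helper names are hypothetical, and the numbers are made up.

```python
def split_by_label(rows, label_index):
    """Separate rows with a known target from rows where it is missing (None)."""
    labelled = [r for r in rows if r[label_index] is not None]
    unlabelled = [r for r in rows if r[label_index] is None]
    return labelled, unlabelled


def fit_line(xs, ys):
    """Least-squares straight line through (xs, ys); returns (intercept, slope)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - slope * mx, slope


# Made-up rows of (log reading, TC); TC is missing in two of them.
rows = [(1.0, 3.0), (2.0, None), (3.0, 7.0), (4.0, None), (5.0, 11.0)]

# Train on the labelled rows only.
labelled, unlabelled = split_by_label(rows, label_index=1)
intercept, slope = fit_line([r[0] for r in labelled], [r[1] for r in labelled])

# Apply the fitted model to the rows whose TC is missing.
predictions = [(r[0], intercept + slope * r[0]) for r in unlabelled]
```

The point of the split is that the model is fitted on one set of rows and applied to a different set, namely the rows with no target value.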