Data distribution
Legacy User
Hi,
I have a general question about data mining.
It is well known that to find a suitable learning algorithm, the distribution
of the data must be known in advance. How is this done in practice? Let's
say I have a dataset consisting of numerical and nominal features and
binary labels. How can I determine its distribution? Can RapidMiner help me
here? :-)
Otherwise, if it is not possible to determine the distribution, how do I find a
good learning algorithm for my data that minimizes the classification error?
By trial and error?
Regards,
Tim
Answers
The problem with real-world data is that you don't know the underlying distribution. If you did, you wouldn't need to apply a learning algorithm at all.
The task of a learning algorithm is always to try to model this distribution. Naive Bayes does so directly by building independent normal distributions per attribute (for each class). A decision tree learner does it by partitioning the space into subspaces with orthogonal (axis-parallel) cuts and giving every subspace one uniform distribution. And so on...
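To make the Naive Bayes case concrete, here is a minimal sketch in plain Python (not RapidMiner's implementation) for numerical attributes only. It estimates one mean and variance per class and attribute and then picks the class with the highest log posterior. The function names, the toy data, and the small variance floor are assumptions made up for illustration.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate a class prior plus one (mean, variance) pair per class and attribute."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        stats = []
        for column in zip(*members):  # iterate attribute by attribute
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, var + 1e-9))  # assumed small floor, avoids division by zero
        model[label] = (len(members) / len(rows), stats)
    return model

def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior, treating attributes as independent."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for value, (mean, var) in zip(row, stats):
            # log density of a normal distribution with this mean and variance
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data, invented for this example: two numerical attributes, binary label.
rows = [(1.0, 2.1), (1.2, 1.9), (0.8, 2.0), (3.0, 4.1), (3.2, 3.9), (2.9, 4.0)]
labels = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(rows, labels)
print(predict_gaussian_nb(model, (1.1, 2.0)), predict_gaussian_nb(model, (3.1, 4.0)))
```

If the attributes are far from normally distributed or strongly dependent on each other, this modeling assumption is violated and another learner may fit the data better.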
So your task on real data is to find the learning algorithm that approximates the real distribution best. This could be done by trial and error, but each learner has its own assumptions. These assumptions are often related and might guide the search for the right algorithm. For example, Linear Regression and SVMs with a linear kernel are both linear models. Rule Learners and Decision Trees both use orthogonal cuts...
But you need deep insight into the statistical methods behind the learners to have this knowledge. Trial and error might be handier.
And there are many methods within RapidMiner to run the trials of trial and error automatically. XValidation allows you to estimate how well a learner models the underlying distribution. With the OperatorSelector and a ParameterIterator, several learning algorithms can be applied to the same dataset to compare their performance. The ParameterOptimizations are tools for finding the best parameters for the learners.
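Outside of RapidMiner, the same idea (cross-validate several learners and parameter settings on one dataset, then keep the one with the lowest estimated error) can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the learners (a majority-class baseline and k-nearest-neighbour), the candidate values of k, and the synthetic data are all made up for the example and do not correspond to RapidMiner operators.

```python
import random
from collections import Counter

def majority_learner(train_rows, train_labels):
    """Baseline: always predict the most frequent training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda row: most_common

def knn_learner(k):
    """Build a k-nearest-neighbour learner (squared Euclidean distance, majority vote)."""
    def learner(train_rows, train_labels):
        def predict(row):
            ranked = sorted(
                range(len(train_rows)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(row, train_rows[i])),
            )
            votes = Counter(train_labels[i] for i in ranked[:k])
            return votes.most_common(1)[0][0]
        return predict
    return learner

def cross_validated_error(learner, rows, labels, folds=5, seed=0):
    """Average classification error over `folds` held-out folds."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    errors = []
    for fold in range(folds):
        test_idx = indices[fold::folds]  # every folds-th shuffled index
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        predict = learner([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        wrong = sum(predict(rows[i]) != labels[i] for i in test_idx)
        errors.append(wrong / len(test_idx))
    return sum(errors) / folds

# Synthetic data, invented for this example: two well-separated clusters.
rng = random.Random(42)
rows = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
rows += [(rng.gauss(6, 1), rng.gauss(6, 1)) for _ in range(50)]
labels = [0] * 50 + [1] * 50

# Candidate learners and parameter settings to try on the same dataset.
candidates = {"majority": majority_learner}
for k in (1, 3, 5):
    candidates[f"knn(k={k})"] = knn_learner(k)

results = {name: cross_validated_error(l, rows, labels) for name, l in candidates.items()}
best = min(results, key=results.get)
print(results, "best:", best)
```

Note that the error of the winning candidate is slightly optimistic, because the same folds were used both to choose it and to score it. For an honest final estimate, evaluate the chosen learner on data that played no part in the selection.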
I hope this helps.
Greetings,
Sebastian