Statistical Significance
Hi all,
I am doing regular classification validation, as shown below:
<operator name="Root" class="Process" expanded="yes">
    <description text="#ylt#p#ygt#This process is very similar to the process #yquot#03_XValidation_Numerical.xml#yquot#. The basic process setup is exactly the same, i.e. the first inner operator must produce a model from the given training data set and the second inner operator must be able to handle this model and the test data and must provide a PerformanceVector. #ylt#/p#ygt# In contrast to the previous process we now use a classification learner (J48) which is evaluated by several nominal performance criteria.#ylt#/p#ygt# #ylt#p#ygt# The cross validation building block is very common for many (more complex) RapidMiner processes. However, there are several more validation schemes available in RapidMiner which will be discussed in the next sample processes. #ylt#/p#ygt#"/>
    <parameter key="logfile" value="C:\knn.txt"/>
    <operator name="TextInput (4)" class="TextInput" expanded="no">
        <list key="texts">
            <parameter key="b" value=".."/>
            <parameter key="P" value=".."/>
        </list>
        <parameter key="default_content_encoding" value="utf8"/>
        <parameter key="default_content_language" value="utf8"/>
        <parameter key="prune_below" value="3"/>
        <list key="namespaces">
        </list>
        <parameter key="create_text_visualizer" value="true"/>
        <operator name="StringTokenizer (4)" class="StringTokenizer">
        </operator>
        <operator name="TokenLengthFilter (4)" class="TokenLengthFilter">
            <parameter key="min_chars" value="3"/>
        </operator>
    </operator>
    <operator name="XValidation (3)" class="XValidation" expanded="yes">
        <operator name="NearestNeighbors" class="NearestNeighbors">
            <parameter key="k" value="3"/>
            <parameter key="measure_types" value="NumericalMeasures"/>
            <parameter key="numerical_measure" value="CosineSimilarity"/>
        </operator>
        <operator name="OperatorChain (3)" class="OperatorChain" expanded="yes">
            <operator name="ModelApplier (3)" class="ModelApplier">
                <list key="application_parameters">
                </list>
            </operator>
            <operator name="ClassificationPerformance (3)" class="ClassificationPerformance">
                <parameter key="accuracy" value="true"/>
                <parameter key="classification_error" value="true"/>
                <parameter key="kappa" value="true"/>
                <parameter key="weighted_mean_recall" value="true"/>
                <parameter key="weighted_mean_precision" value="true"/>
                <parameter key="spearman_rho" value="true"/>
                <parameter key="kendall_tau" value="true"/>
                <parameter key="absolute_error" value="true"/>
                <parameter key="relative_error" value="true"/>
                <parameter key="relative_error_lenient" value="true"/>
                <parameter key="relative_error_strict" value="true"/>
                <parameter key="normalized_absolute_error" value="true"/>
                <parameter key="root_mean_squared_error" value="true"/>
                <parameter key="root_relative_squared_error" value="true"/>
                <parameter key="squared_error" value="true"/>
                <parameter key="correlation" value="true"/>
                <parameter key="squared_correlation" value="true"/>
                <parameter key="cross-entropy" value="true"/>
                <parameter key="margin" value="true"/>
                <parameter key="soft_margin_loss" value="true"/>
                <parameter key="logistic_loss" value="true"/>
                <list key="class_weights">
                </list>
            </operator>
        </operator>
    </operator>
</operator>
My question is: other than XValidation, does RapidMiner have any ability to calculate "statistical significance"?
Thank you
Answers
RapidMiner provides operators for checking whether one result is statistically significantly better than another: see the operators in the Validation / Significance group. Specifically, it offers an ANOVA and a T-Test operator for comparing performance vectors.
Is that what you were looking for?
Greetings,
Sebastian
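As a rough illustration of the idea behind comparing performance vectors with a t-test, here is a minimal sketch in Python rather than RapidMiner; it assumes you have per-fold accuracies from two models evaluated on the same 10 cross-validation folds, and the accuracy values below are invented for illustration only.

# Minimal sketch (Python, not RapidMiner): a paired t-test on per-fold performance
# values from two cross-validated models evaluated on the same folds.
from scipy import stats

knn_accuracy  = [0.81, 0.79, 0.84, 0.80, 0.82, 0.78, 0.83, 0.81, 0.80, 0.82]  # model A, 10 folds (made-up values)
tree_accuracy = [0.77, 0.78, 0.80, 0.76, 0.79, 0.75, 0.81, 0.78, 0.77, 0.79]  # model B, same folds (made-up values)

t_stat, p_value = stats.ttest_rel(knn_accuracy, tree_accuracy)  # paired test over folds
print("t = %.3f, p = %.4f" % (t_stat, p_value))
# A p-value below the chosen alpha (e.g. 0.05) indicates that the difference in mean
# accuracy between the two performance vectors is statistically significant.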
I just downloaded RapidMiner and was impressed by all the data mining methods in it. However, is there another way to test significance, like Fisher's test? For example, consider a rule:
A1 => A0, i.e., prob(A0|A1) > prob(A0)
we can rewrite it (multiplying both sides by prob(A1)) as
prob(A0|A1) * prob(A1) > prob(A0) * prob(A1)
prob(A0&A1) > prob(A0) * prob(A1)
Therefore, we can test the null hypothesis H0
H0: prob(A0&A1) = prob(A0) * prob(A1)
against the alternative hypothesis H1
H1: prob(A0&A1) != prob(A0) * prob(A1)
If H0 cannot be rejected, then A1 => A0 is not a statistically significant rule.
Is there any functionality for this test?
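For illustration, here is a minimal sketch of such an independence test in Python rather than RapidMiner, using SciPy's implementation of Fisher's exact test on a 2x2 contingency table of A1/A0 co-occurrence counts; the counts are invented for illustration only.

# Minimal sketch (Python, not RapidMiner): Fisher's exact test of independence
# between two binary attributes A1 and A0, built from a 2x2 contingency table.
from scipy import stats

#                A0 true  A0 false   (made-up counts)
table = [[40, 10],    # A1 true
         [30, 70]]    # A1 false

# One-sided alternative: positive association, i.e. prob(A0&A1) > prob(A0) * prob(A1)
odds_ratio, p_value = stats.fisher_exact(table, alternative="greater")
print("odds ratio = %.2f, p = %.4f" % (odds_ratio, p_value))
# A small p-value rejects H0 (independence), supporting A1 => A0 as a significant rule.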
Where would you like to add this feature? Should it apply to Association Rules or to the Rule model? Testing general data mining models with that could be a little difficult, since we don't have a probability there. Or am I misunderstanding something?
Greetings,
Sebastian
I understand what you are implying. Speaking as a Bayesian, you want to test whether the occurrence of an attribute (or of a specific attribute value) is independent of the occurrence of another attribute (or of another attribute's specific value). This is in general a good idea, however...
- most learners are constructed in such a way that only significant combinations are weighted more heavily than insignificant ones, which improves overall quality and reduces overfitting
- I would not mind a model containing only insignificant rules (in the sense of a statistical hypothesis test) as long as it delivers well-tested (!) low error rates
So ... if you think that the quality of rule models can be improved significantly, why don't you try it out by coding it yourself? You see, the RapidMiner guys have a lot of stuff to do, so the best way to persuade them to include "new" approaches is to provide code and an example demonstrating the power of the idea.
Happy mining,
steffen