Show prevalence of largest class in Performance (Classification) and similar operators
When doing classification tasks, I normally use the prevalence (frequency) of the largest (modal) class as the naïve benchmark for judging whether a single model is useful. For example, if my label is binary (yes and no), with yes comprising 9% of the dataset and no comprising 91%, then I would expect the accuracy of a model to be at least 91%. If not, the model is no better than naively assigning all predictions to the larger class. The same logic applies to multiple categories (e.g. three or four classes for prediction). For example, if there were three classes A, B and C distributed 30%, 40% and 30%, then the prevalence of the largest class (B) would be 40%.
My request is that the Performance (Classification) and Performance (Binominal Classification) operators add this as an option among the criteria they output. I am not sure, but I think the formal name for this measure is "prevalence of largest class" (cf. https://en.wikipedia.org/wiki/Prevalence and https://en.wikipedia.org/wiki/Confusion_matrix#Table_of_confusion). Because the calculation is so simple, I hope it would be easy to implement. Having this handy as an output option would be more convenient than pulling out a calculator each time, which is what I have to do now.
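To make the calculation concrete, here is a minimal sketch in plain Python, outside RapidMiner. The function name `majority_class_prevalence` and the toy label lists are made up for illustration; they only reproduce the 9%/91% and 30%/40%/30% examples above.

```python
from collections import Counter


def majority_class_prevalence(labels):
    """Return (modal_class, share_of_modal_class) for a list of class labels.

    Hypothetical helper for illustration; not part of RapidMiner.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    # most_common(1) gives the single most frequent label with its count
    modal_class, modal_count = counts.most_common(1)[0]
    return modal_class, modal_count / len(labels)


# Toy data matching the examples in the post
binary = ["yes"] * 9 + ["no"] * 91
three_way = ["A"] * 30 + ["B"] * 40 + ["C"] * 30

print(majority_class_prevalence(binary))     # ('no', 0.91)
print(majority_class_prevalence(three_way))  # ('B', 0.4)
```

Any model whose accuracy falls below the returned share does no better than always predicting the modal class.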
Best Answer
MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
Hi Chitu,
It is of course possible to add new performance measures to the operators. I can open a ticket for this feature request, but please do not expect it to be resolved in the next few weeks. As you know, RapidMiner has release schedules, and this is unlikely to be a top priority for us.
I also have a question: how is this a performance measure? Isn't it a constant value for each data set? Wouldn't you rather have something like accuracy minus prevalence, that is, how many percentage points you are above the prevalence?
In any case, you can easily use custom operators to build your own operator that calculates prevalence [without any coding].
Best,
Martin
- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany
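The "accuracy minus prevalence" idea from the answer above can be sketched in a few lines of plain Python. This is an illustration only, not a RapidMiner criterion: the function name `accuracy_minus_prevalence` and the toy predictions are assumptions made up for this example.

```python
from collections import Counter


def accuracy_minus_prevalence(y_true, y_pred):
    """Return accuracy minus the share of the largest class in y_true.

    Hypothetical helper for illustration. A positive result means the
    model beats the naive "always predict the modal class" benchmark.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    prevalence = Counter(y_true).most_common(1)[0][1] / n
    return accuracy - prevalence


# Toy data: 9 "yes" and 91 "no"; the invented model gets 5 "yes" and 90 "no" right
y_true = ["yes"] * 9 + ["no"] * 91
y_pred = ["yes"] * 5 + ["no"] * 4 + ["no"] * 90 + ["yes"] * 1

margin = accuracy_minus_prevalence(y_true, y_pred)
print(round(margin * 100, 1))  # 4.0 percentage points above the benchmark
```

Unlike the prevalence itself, which is constant for a given data set, this difference changes with the model and so behaves like a performance measure.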
Answers
The other thing I frequently do is calculate the accuracy/ROI of a default model. The default model may be the 'naive' prediction of always predicting the majority class. Have a look at the Default Model operator for this.
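As a rough illustration of what a majority-class default model does, here is a small plain-Python stand-in. It is not the Default Model operator's implementation; the class `MajorityClassModel` and the helper `accuracy` are hypothetical names used only for this sketch.

```python
from collections import Counter


class MajorityClassModel:
    """Toy stand-in for a default model that always predicts the modal class."""

    def fit(self, labels):
        if not labels:
            raise ValueError("labels must not be empty")
        # Remember the most frequent label seen during training
        self.prediction_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, n_examples):
        # Every example receives the same prediction
        return [self.prediction_] * n_examples


def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and label agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


labels = ["yes"] * 9 + ["no"] * 91
model = MajorityClassModel().fit(labels)
predictions = model.predict(len(labels))
print(model.prediction_, accuracy(labels, predictions))  # no 0.91
```

Scored on the data it was fitted on, such a model has an accuracy equal to the prevalence of the largest class, which is why applying a default model and measuring its accuracy yields the requested number.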
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.8.000" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="9.8.000" expanded="true" height="68" name="Retrieve Titanic Training" width="90" x="45" y="34">
<parameter key="repository_entry" value="//Samples/data/Titanic Training"/>
</operator>
<operator activated="true" class="default_model" compatibility="9.8.000" expanded="true" height="82" name="Default Model" width="90" x="313" y="34">
<parameter key="method" value="mode"/>
<parameter key="constant" value="0.0"/>
<parameter key="attribute_name" value=""/>
</operator>
<operator activated="true" class="apply_model" compatibility="9.8.000" expanded="true" height="82" name="Apply Model" width="90" x="447" y="34">
<list key="application_parameters"/>
<parameter key="create_view" value="false"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="9.8.000" expanded="true" height="82" name="Performance" width="90" x="581" y="34">
<parameter key="main_criterion" value="first"/>
<parameter key="accuracy" value="true"/>
<parameter key="classification_error" value="false"/>
<parameter key="kappa" value="false"/>
<parameter key="weighted_mean_recall" value="false"/>
<parameter key="weighted_mean_precision" value="false"/>
<parameter key="spearman_rho" value="false"/>
<parameter key="kendall_tau" value="false"/>
<parameter key="absolute_error" value="false"/>
<parameter key="relative_error" value="false"/>
<parameter key="relative_error_lenient" value="false"/>
<parameter key="relative_error_strict" value="false"/>
<parameter key="normalized_absolute_error" value="false"/>
<parameter key="root_mean_squared_error" value="false"/>
<parameter key="root_relative_squared_error" value="false"/>
<parameter key="squared_error" value="false"/>
<parameter key="correlation" value="false"/>
<parameter key="squared_correlation" value="false"/>
<parameter key="cross-entropy" value="false"/>
<parameter key="margin" value="false"/>
<parameter key="soft_margin_loss" value="false"/>
<parameter key="logistic_loss" value="false"/>
<parameter key="skip_undefined_labels" value="true"/>
<parameter key="use_example_weights" value="true"/>
<list key="class_weights"/>
</operator>
<operator activated="true" class="execute_script" compatibility="9.8.000" expanded="true" height="82" name="Execute Script" width="90" x="715" y="34">
<parameter key="script" value="import com.rapidminer.operator.performance.*;&#10;&#10;PerformanceVector perf = input[0];&#10;PerformanceCriterion c = perf.getCriterion(0);&#10;//operator.log(c),5)&#10;c.NAMES[0] = &quot;prevalence&quot;;&#10;&#10;// You can add any code here&#10;&#10;// This line returns the first input as the first output&#10;return perf;"/>
<parameter key="standard_imports" value="true"/>
</operator>
<operator activated="false" class="h2o:deep_learning" compatibility="9.8.000" expanded="true" height="103" name="Deep Learning" width="90" x="313" y="238">
<parameter key="activation" value="Rectifier"/>
<enumeration key="hidden_layer_sizes">
<parameter key="hidden_layer_sizes" value="50"/>
<parameter key="hidden_layer_sizes" value="50"/>
</enumeration>
<enumeration key="hidden_dropout_ratios"/>
<parameter key="reproducible_(uses_1_thread)" value="false"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
<parameter key="epochs" value="10.0"/>
<parameter key="compute_variable_importances" value="false"/>
<parameter key="train_samples_per_iteration" value="-2"/>
<parameter key="adaptive_rate" value="true"/>
<parameter key="epsilon" value="1.0E-8"/>
<parameter key="rho" value="0.99"/>
<parameter key="learning_rate" value="0.005"/>
<parameter key="learning_rate_annealing" value="1.0E-6"/>
<parameter key="learning_rate_decay" value="1.0"/>
<parameter key="momentum_start" value="0.0"/>
<parameter key="momentum_ramp" value="1000000.0"/>
<parameter key="momentum_stable" value="0.0"/>
<parameter key="nesterov_accelerated_gradient" value="true"/>
<parameter key="standardize" value="true"/>
<parameter key="L1" value="1.0E-5"/>
<parameter key="L2" value="0.0"/>
<parameter key="max_w2" value="10.0"/>
<parameter key="loss_function" value="Automatic"/>
<parameter key="distribution_function" value="AUTO"/>
<parameter key="early_stopping" value="false"/>
<parameter key="stopping_rounds" value="1"/>
<parameter key="stopping_metric" value="AUTO"/>
<parameter key="stopping_tolerance" value="0.001"/>
<parameter key="missing_values_handling" value="MeanImputation"/>
<parameter key="max_runtime_seconds" value="0"/>
<list key="expert_parameters"/>
<list key="expert_parameters_"/>
</operator>
<connect from_op="Retrieve Titanic Training" from_port="output" to_op="Default Model" to_port="training set"/>
<connect from_op="Default Model" from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_op="Default Model" from_port="exampleSet" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_op="Execute Script" to_port="input 1"/>
<connect from_op="Execute Script" from_port="output 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>