Selecting samples for attributes whose values contribute the most
I have an attribute Job, which is a label with 15 different values.
Out of 1000 samples, 7 values account for 950 samples and the remaining 8 values account for 50 samples.
I want to use only the 950 samples (i.e. the 7 values only) and ignore the rest.
How do I select the label values that contribute the most samples?
This chosen/not-chosen split may change (8-7, 10-5, 12-3, etc.) depending on the data.
I tried the following approach:
1) Count the number of occurrences of each value in the whole table (stuck at this point)
2) Rank the values (have no idea)
3) Filter out the not-chosen values (have no idea)
If a better approach can be suggested, I will be very grateful.
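The three steps above (count, rank, filter) can be sketched in plain Python. This only illustrates the logic and is not a RapidMiner process. The helper name `select_dominant_values` and the coverage target are assumptions made for the example: instead of fixing how many values to keep, it keeps the most frequent values until they cover a chosen share of the samples, so the split adapts to the data.

```python
from collections import Counter


def select_dominant_values(labels, coverage=0.95):
    """Return the most frequent label values that together cover at least
    `coverage` of all samples (hypothetical helper, for illustration only)."""
    counts = Counter(labels)           # step 1: count occurrences per value
    ranked = counts.most_common()      # step 2: rank values by frequency
    total = sum(counts.values())
    chosen, covered = [], 0
    for value, n in ranked:            # step 3: keep values until the
        if covered >= coverage * total:    # target share is reached
            break
        chosen.append(value)
        covered += n
    return chosen


jobs = ["Painting", "Washing", "Carpentry", "Carpentry", "Washing",
        "Painting", "Painting", "Painting", "Washing"]
keep = select_dominant_values(jobs, coverage=0.7)
filtered = [job for job in jobs if job in keep]
```

On the nine-row sample table with a 70% target, this keeps Painting and Washing and drops the two Carpentry rows.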
I have the following table
Name | Job |
John | Painting |
Kelly | Washing |
Diamond | Carpentry |
Clarice | Carpentry |
Kennedy | Washing |
Kevin | Painting |
Hart | Painting |
Budsey | Painting |
David | Washing |
I tried to count the number of occurrences of each value in the whole table; the result should look like this:
Name | Job | Total Job |
John | Painting | 4 |
Kelly | Washing | 3 |
Diamond | Carpentry | 2 |
Clarice | Carpentry | 2 |
Kennedy | Washing | 3 |
Kevin | Painting | 4 |
Hart | Painting | 4 |
Budsey | Painting | 4 |
David | Washing | 3 |
I tried Generate Aggregation, but it gives the wrong result:
<?xml version="1.0" encoding="UTF-8"?><process version="9.6.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.6.000" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="9.6.000" expanded="true" height="68" name="Retrieve job" width="90" x="45" y="34">
<parameter key="repository_entry" value="../data/job"/>
</operator>
<operator activated="true" class="generate_aggregation" compatibility="9.6.000" expanded="true" height="82" name="Generate Aggregation" width="90" x="246" y="34">
<parameter key="attribute_name" value="TotalJob"/>
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Job"/>
<parameter key="attributes" value="Job"/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="attribute_value"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="time"/>
<parameter key="block_type" value="attribute_block"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_matrix_row_start"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="true"/>
<parameter key="aggregation_function" value="count"/>
<parameter key="concatenation_separator" value="|"/>
<parameter key="keep_all" value="true"/>
<parameter key="ignore_missings" value="true"/>
<parameter key="ignore_missing_attributes" value="false"/>
</operator>
<connect from_op="Retrieve job" from_port="output" to_op="Generate Aggregation" to_port="example set input"/>
<connect from_op="Generate Aggregation" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
The output I am getting is
RowNo | Name | Job | TotalJob |
1 | John | Painting | 1.0 |
2 | Kelly | Washing | 1.0 |
3 | Diamond | Carpentry | 1.0 |
4 | Clarice | Carpentry | 1.0 |
5 | Kennedy | Washing | 1.0 |
6 | Kevin | Painting | 1.0 |
7 | Hart | Painting | 1.0 |
8 | Budsey | Painting | 1.0 |
9 | David | Washing | 1.0 |
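A likely reason for the constant 1.0: Generate Aggregation appears to aggregate across the selected attributes within each row, so counting the single attribute Job yields 1 for every example. What is wanted here is a count per group of equal Job values. The difference can be sketched in plain Python (an illustration only, not RapidMiner code; the variable names are made up for the example):

```python
from collections import Counter

rows = [("John", "Painting"), ("Kelly", "Washing"), ("Diamond", "Carpentry"),
        ("Clarice", "Carpentry"), ("Kennedy", "Washing"), ("Kevin", "Painting"),
        ("Hart", "Painting"), ("Budsey", "Painting"), ("David", "Washing")]

# Row-wise count over one selected attribute: always 1 per row.
row_wise = [len([job]) for _, job in rows]

# Group-wise count: how often each Job value occurs in the whole table,
# attached back to every row as a third column.
per_job = Counter(job for _, job in rows)
total_job = [(name, job, per_job[job]) for name, job in rows]
```

Here `total_job` reproduces the expected Total Job column (4 for Painting, 3 for Washing, 2 for Carpentry), while `row_wise` reproduces the unwanted column of ones.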
Best Answer
MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
Hi @Vanlal,
you were on the right track. If you want to, you can use Aggregate and calculate count(Job) for every Job. The trick is then to join this with your original table.
Maybe simpler is to use the Replace Rare Values operator, which is part of the Operator Toolbox extension. It lets you replace every value that is less frequent than X with a value like Other. You can then just filter on it.
Both options are included in the process below, which you can copy after you have installed Operator Toolbox (otherwise you cannot use Replace Rare Values).
Best,
Martin
<?xml version="1.0" encoding="UTF-8"?><process version="9.7.002">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.7.002" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="9.7.002" expanded="true" height="68" name="Read Excel" width="90" x="112" y="340">
<parameter key="excel_file" value="C:\Users\MartinSchmitz\Downloads\job.xlsx"/>
<parameter key="sheet_selection" value="sheet number"/>
<parameter key="sheet_number" value="1"/>
<parameter key="imported_cell_range" value="A1"/>
<parameter key="encoding" value="SYSTEM"/>
<parameter key="first_row_as_names" value="true"/>
<list key="annotations"/>
<parameter key="date_format" value=""/>
<parameter key="time_zone" value="SYSTEM"/>
<parameter key="locale" value="English (United States)"/>
<parameter key="read_all_values_as_polynominal" value="false"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="Name.true.polynominal.attribute"/>
<parameter key="1" value="Job.true.polynominal.attribute"/>
</list>
<parameter key="read_not_matching_values_as_missings" value="false"/>
<parameter key="datamanagement" value="double_array"/>
<parameter key="data_management" value="auto"/>
<description align="center" color="transparent" colored="false" width="126">Change the file path under 'excel file'</description>
</operator>
<operator activated="true" class="multiply" compatibility="9.7.002" expanded="true" height="103" name="Multiply" width="90" x="246" y="340"/>
<operator activated="true" class="operator_toolbox:replace_rare" compatibility="2.7.000-SNAPSHOT" expanded="true" height="103" name="Replace Rare Values" width="90" x="380" y="493">
<parameter key="return_preprocessing_model" value="false"/>
<parameter key="create_view" value="false"/>
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Job"/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="nominal"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="file_path"/>
<parameter key="block_type" value="single_value"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="single_value"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
<parameter key="use_relative_threshold" value="false"/>
<parameter key="relative_threshold_value" value="0.01"/>
<parameter key="threshold" value="3"/>
<parameter key="replacement_value" value="Other"/>
<parameter key="replace_if_unknown" value="true"/>
<description align="center" color="transparent" colored="false" width="126">Replaces all values less frequent than 3 with 'Other'</description>
</operator>
<operator activated="true" class="aggregate" compatibility="9.7.002" expanded="true" height="82" name="Aggregate" width="90" x="380" y="238">
<parameter key="use_default_aggregation" value="false"/>
<parameter key="attribute_filter_type" value="all"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="attribute_value"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="time"/>
<parameter key="block_type" value="attribute_block"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_matrix_row_start"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
<parameter key="default_aggregation_function" value="average"/>
<list key="aggregation_attributes">
<parameter key="Job" value="count"/>
</list>
<parameter key="group_by_attributes" value="Job"/>
<parameter key="count_all_combinations" value="false"/>
<parameter key="only_distinct" value="false"/>
<parameter key="ignore_missings" value="true"/>
<description align="center" color="transparent" colored="false" width="126">Count(job) group by job</description>
</operator>
<operator activated="true" class="filter_examples" compatibility="9.7.002" expanded="true" height="103" name="Filter Examples" width="90" x="514" y="85">
<parameter key="parameter_expression" value=""/>
<parameter key="condition_class" value="custom_filters"/>
<parameter key="invert_filter" value="false"/>
<list key="filters_list">
<parameter key="filters_entry_key" value="count(Job).gt.3"/>
</list>
<parameter key="filters_logic_and" value="true"/>
<parameter key="filters_check_metadata" value="true"/>
<description align="center" color="transparent" colored="false" width="126">Filter Away Less Frequent Jobs</description>
</operator>
<operator activated="true" class="concurrency:join" compatibility="9.7.002" expanded="true" height="82" name="Join" width="90" x="648" y="238">
<parameter key="remove_double_attributes" value="true"/>
<parameter key="join_type" value="outer"/>
<parameter key="use_id_attribute_as_key" value="false"/>
<list key="key_attributes">
<parameter key="Job" value="Job"/>
</list>
<parameter key="keep_both_join_attributes" value="false"/>
<description align="center" color="transparent" colored="false" width="126">If you want this to act as a filter, use inner.<br/>If you want this to add missings for rare jobs, use outer.</description>
</operator>
<connect from_op="Read Excel" from_port="output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Multiply" from_port="output 2" to_op="Replace Rare Values" to_port="example set input"/>
<connect from_op="Replace Rare Values" from_port="example set output" to_port="result 2"/>
<connect from_op="Aggregate" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Aggregate" from_port="original" to_op="Join" to_port="right"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Join" to_port="left"/>
<connect from_op="Join" from_port="join" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="420"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
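For readers who want to check the logic of the two options without opening the process, here is a rough plain-Python equivalent. It is a sketch, not a translation of the operators: `add_counts_and_filter` and `replace_rare` are hypothetical helper names, the first variant assumes an inner join, and the thresholds mirror the process (counts greater than 3 in the Aggregate/Join branch; values occurring fewer than 3 times replaced in the Replace Rare Values branch, as its description states).

```python
from collections import Counter

rows = [("John", "Painting"), ("Kelly", "Washing"), ("Diamond", "Carpentry"),
        ("Clarice", "Carpentry"), ("Kennedy", "Washing"), ("Kevin", "Painting"),
        ("Hart", "Painting"), ("Budsey", "Painting"), ("David", "Washing")]


def add_counts_and_filter(rows, min_count):
    """Option 1: count per Job, keep counts above min_count, then inner-join
    the surviving counts back onto the original rows."""
    counts = Counter(job for _, job in rows)
    frequent = {job: n for job, n in counts.items() if n > min_count}
    return [(name, job, frequent[job]) for name, job in rows if job in frequent]


def replace_rare(rows, threshold, replacement="Other"):
    """Option 2: replace Job values seen fewer than `threshold` times."""
    counts = Counter(job for _, job in rows)
    return [(name, job if counts[job] >= threshold else replacement)
            for name, job in rows]


joined = add_counts_and_filter(rows, min_count=3)
replaced = replace_rare(rows, threshold=3)
kept = [row for row in replaced if row[1] != "Other"]
```

On the sample table, the first option keeps only the four Painting rows (count 4), while the second turns the two Carpentry rows into Other and leaves seven rows after filtering.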
- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany
Answers
Suppose the counts are 218, 150, 156, 90, 80, 40, 30, 20, 1, 21, 1, 11; their mean is about 68.17.
I want to keep only the examples whose job count is above this mean. (I don't know whether this approach is good or not; any other approach is welcome.)
So I extract a macro JobCount holding the mean of the job counts and use it in Filter Examples.
However, the threshold of Replace Rare Values cannot be set to this macro value.
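The mean-based cut-off described above can be checked with a few lines of Python. This only illustrates the arithmetic and is not a RapidMiner solution; the names `job_1`, `job_2`, ... are placeholders, since the post gives only the counts.

```python
from statistics import mean

counts = [218, 150, 156, 90, 80, 40, 30, 20, 1, 21, 1, 11]
# Placeholder job names; the real label values are not given in the post.
job_counts = {f"job_{i}": n for i, n in enumerate(counts, start=1)}

threshold = mean(job_counts.values())        # 818 / 12, about 68.17
kept = {job: n for job, n in job_counts.items() if n > threshold}
share = sum(kept.values()) / sum(counts)     # fraction of samples retained
```

With these counts, five values lie above the mean and together they cover 694 of the 818 samples (about 85%), so a mean threshold can discard noticeably more than the rarest values.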