The Altair Community is migrating to a new platform to provide a better experience for you. In preparation for the migration, the Altair Community is on read-only mode from October 28 - November 6, 2024. Technical support via cases will continue to work as is. For any urgent requests from Students/Faculty members, please submit the form linked here
Aggregate operator applied to each subset
Hello everyone,
I apologize if this question has already been asked elsewhere, or if it is an obvious one. I'm still learning how to use Rapidminer
This is my problem. Let's suppose that I have a dataset looking like that (but much more different values):
Year Region Item
01 QC CCD
01 QC CCD
01 QC CS
01 ON CCD
01 ON CS
01 NB CCD
01 NB CS
--------------------------
02 QC CCD
02 QC CS
02 QC CS
02 ON CS
02 ON CS
02 NB CCD
02 NB CCD
I would like to get the relative percentage of each Item related to the Region and to the Year:
Year Region Item Proportion
01 QC CCD 66.6%
01 QC CS 33.3%
01 ON CCD 50%
01 ON CS 50%
01 NB CCD 50%
01 NB CCS 50%
-------------------------------------
02 QC CCD 33.3%
02 QC CS 66.6%
02 ON CS 100%
02 NB CCD 100%
I tried many combinations with the operators Aggregate, Loop values, Branch, etc. but I seem to constantly fail...
Do you have any suggestion?
Thanks a lot!
Sylvain
I apologize if this question has already been asked elsewhere, or if it is an obvious one. I'm still learning how to use Rapidminer
This is my problem. Let's suppose that I have a dataset looking like that (but much more different values):
Year Region Item
01 QC CCD
01 QC CCD
01 QC CS
01 ON CCD
01 ON CS
01 NB CCD
01 NB CS
--------------------------
02 QC CCD
02 QC CS
02 QC CS
02 ON CS
02 ON CS
02 NB CCD
02 NB CCD
I would like to get the relative percentage of each Item related to the Region and to the Year:
Year Region Item Proportion
01 QC CCD 66.6%
01 QC CS 33.3%
01 ON CCD 50%
01 ON CS 50%
01 NB CCD 50%
01 NB CCS 50%
-------------------------------------
02 QC CCD 33.3%
02 QC CS 66.6%
02 ON CS 100%
02 NB CCD 100%
I tried many combinations with the operators Aggregate, Loop values, Branch, etc. but I seem to constantly fail...
Do you have any suggestion?
Thanks a lot!
Sylvain
Tagged:
0
Best Answer
-
SGolbert RapidMiner Certified Analyst, Member Posts: 344 UnicornHi @SylvainMI think I managed to do what you wanted with 2 aggregations and one Join, check it out:<?xml version="1.0" encoding="UTF-8"?><process version="9.2.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="python_scripting:execute_python" compatibility="9.2.000" expanded="true" height="82" name="Execute Python" width="90" x="112" y="34">
<parameter key="script" value="import pandas # rm_main is a mandatory function, # the number of arguments has to be the number of input ports (can be none) def rm_main(): 	import pandas as pd 	import random 	dt = [] 	 	for i in range(100): 	 region = random.choice(['America', 'Europe', 'Asia']) 	 quartile = random.choice(range(1,5)) 	 item = random.choice(['A', 'B', 'C']) 	 dt.append([region, quartile, item]) 	 	dt = pd.DataFrame(dt, columns=['region', 'quarter', 'item']) 	return dt"/>
<parameter key="use_default_python" value="true"/>
<parameter key="package_manager" value="conda (anaconda)"/>
</operator>
<operator activated="true" class="numerical_to_polynominal" compatibility="9.2.001" expanded="true" height="82" name="Numerical to Polynominal" width="90" x="246" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="quarter"/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="numeric"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="real"/>
<parameter key="block_type" value="value_series"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_series_end"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
</operator>
<operator activated="true" class="order_attributes" compatibility="9.2.001" expanded="true" height="82" name="Reorder Attributes" width="90" x="380" y="34">
<parameter key="sort_mode" value="user specified"/>
<parameter key="attribute_ordering" value="region|quarter|item"/>
<parameter key="use_regular_expressions" value="false"/>
<parameter key="handle_unmatched" value="append"/>
<parameter key="sort_direction" value="ascending"/>
</operator>
<operator activated="true" class="multiply" compatibility="9.2.001" expanded="true" height="103" name="Multiply" width="90" x="514" y="34"/>
<operator activated="true" class="aggregate" compatibility="9.2.001" expanded="true" height="82" name="Aggregate" width="90" x="648" y="34">
<parameter key="use_default_aggregation" value="false"/>
<parameter key="attribute_filter_type" value="all"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="attribute_value"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="time"/>
<parameter key="block_type" value="attribute_block"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_matrix_row_start"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
<parameter key="default_aggregation_function" value="average"/>
<list key="aggregation_attributes">
<parameter key="item" value="count"/>
</list>
<parameter key="group_by_attributes" value="quarter|region|item"/>
<parameter key="count_all_combinations" value="false"/>
<parameter key="only_distinct" value="false"/>
<parameter key="ignore_missings" value="true"/>
</operator>
<operator activated="true" class="aggregate" compatibility="9.2.001" expanded="true" height="82" name="Aggregate (2)" width="90" x="648" y="187">
<parameter key="use_default_aggregation" value="false"/>
<parameter key="attribute_filter_type" value="all"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="attribute_value"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="time"/>
<parameter key="block_type" value="attribute_block"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_matrix_row_start"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
<parameter key="default_aggregation_function" value="average"/>
<list key="aggregation_attributes">
<parameter key="item" value="count"/>
</list>
<parameter key="group_by_attributes" value="quarter|region"/>
<parameter key="count_all_combinations" value="false"/>
<parameter key="only_distinct" value="false"/>
<parameter key="ignore_missings" value="true"/>
</operator>
<operator activated="true" class="rename" compatibility="9.2.001" expanded="true" height="82" name="Rename" width="90" x="782" y="187">
<parameter key="old_name" value="count(item)"/>
<parameter key="new_name" value="region_quarter_total"/>
<list key="rename_additional_attributes"/>
</operator>
<operator activated="true" class="concurrency:join" compatibility="9.2.001" expanded="true" height="82" name="Join" width="90" x="916" y="34">
<parameter key="remove_double_attributes" value="true"/>
<parameter key="join_type" value="inner"/>
<parameter key="use_id_attribute_as_key" value="false"/>
<list key="key_attributes">
<parameter key="region" value="region"/>
<parameter key="quarter" value="quarter"/>
</list>
<parameter key="keep_both_join_attributes" value="false"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="9.2.001" expanded="true" height="82" name="Generate Attributes" width="90" x="1050" y="34">
<list key="function_descriptions">
<parameter key="proportion" value="[count(item)]/region_quarter_total"/>
</list>
<parameter key="keep_all" value="true"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="9.2.001" expanded="true" height="82" name="Select Attributes" width="90" x="1184" y="34">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value="region|quarter|item|proportion"/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="attribute_value"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="time"/>
<parameter key="block_type" value="attribute_block"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_matrix_row_start"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
</operator>
<connect from_op="Execute Python" from_port="output 1" to_op="Numerical to Polynominal" to_port="example set input"/>
<connect from_op="Numerical to Polynominal" from_port="example set output" to_op="Reorder Attributes" to_port="example set input"/>
<connect from_op="Reorder Attributes" from_port="example set output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Multiply" from_port="output 2" to_op="Aggregate (2)" to_port="example set input"/>
<connect from_op="Aggregate" from_port="example set output" to_op="Join" to_port="left"/>
<connect from_op="Aggregate (2)" from_port="example set output" to_op="Rename" to_port="example set input"/>
<connect from_op="Rename" from_port="example set output" to_op="Join" to_port="right"/>
<connect from_op="Join" from_port="join" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>I generated some random data with Python.
Best regards,Sebastian7
Answers
what you're trying to achieve is called "window functions" in SQL.
You should check out this project, it's an implementation of window functions in RapidMiner.
https://github.com/bbarany/rapidminer-windowfunctions
You can calculate groupwise sums or counts, generate the ratios, and then aggregate according to your needs.
Regards,
Balazs
Your solution, Sebastian, is perfect! Thank you so much!
Best regards to both of you,
Sylvain