Apply Join operator on each Attribute
Hello everyone,
As usual, I'm sorry if the question has already been posted before - although I carefully checked...
Here is my issue. I have an Example Set_1 containing an attribute called Date, and a variable number n of attributes called People_1 to People_n. I also have another Example Set_2 containing a list of names and biographic datas.
In order to get for each date the biographic datas of each people involved that day, I would like to apply the Join Operator to each attribute People_i (i=1 to n) using their names as Key attributes.
For now, as n is still small (7), I can do it by hand, parametring 7 Join Operators for each People attribute, and joining my 7 resulting sets in the Append Operator. But, I would like to get the same result automatically, in case of n becomes too big (parametring 40 Join Operators and connecting all them to an Append seems quite heavy and inelegant to me).
I've tried during past week to find an elegant solution to that problem, and I haven't been able to figure it. I've tried Loops of all kind, Subprocesses, etc. It seems very simple though, but I've been unable to get my "hand" results automatically.
Here are some of my datas to help you.
Thanks a lot, as always :-)
Sylvain
Row | Id | Date | People_1 | People_2 | People_3 | People_4 | People_5 | People_6 | People_7 |
12 | 12.0 | Sat Dec 19 00:00:00 EST 1970 | Loretta Lynn | Owen Bradley | ? | ? | ? | ? | ? |
13 | 13.0 | Sat Jun 07 00:00:00 EDT 1980 | Sissy Spacek | Loretta Lynn | Owen Bradley | ? | ? | ? | ? |
14 | 14.0 | Sat Dec 04 00:00:00 EST 2010 | Loretta Lynn | Sheryl Crow | Miranda Lambert | ? | ? | ? | ? |
15 | 15.0 | Sat Aug 09 00:00:00 EDT 1969 | Charley Pride | Dallas Frazier | A.L. "Doodle" | Chet Atkins | Jack Clement | Bob Ferguson | Felton Jarvis |
Row | People | Birth_date | Birth_state | Sex | Ethnicity |
2 | Charles Kelley | Fri Sep 11 00:00:00 EDT 1981 | GA | Male | Caucasian |
3 | Hillary Scott | Tue Apr 01 00:00:00 EST 1986 | TN | Female | Caucasian |
4 | Reba McEntire | Mon Mar 28 00:00:00 EST 1955 | OK | Female | Caucasian |
5 | Wanda Jackson | Wed Oct 20 00:00:00 EST 1937 | OK | Female | Caucasian |
6 | Carrie Underwood | Thu Mar 10 00:00:00 EST 1983 | OK | Female | Caucasian |
7 | Toby Keith | Sat Jul 08 00:00:00 EDT 1961 | OK | Male | Caucasian |
8 | David Bellamy | Sat Sep 16 00:00:00 EDT 1950 | FL | Male | Caucasian |
9 | Howard Bellamy | Sat Feb 02 00:00:00 EST 1946 | FL | Male | Caucasian |
10 | Keith Urban | Thu Oct 26 00:00:00 EDT 1967 | Northland | Male | Caucasian |
11 | Miranda Lambert | Thu Nov 10 00:00:00 EST 1983 | TX | Female | Caucasian |
12 | Sam Hunt | Sat Dec 08 00:00:00 EST 1984 | GA | Male | Caucasian |
13 | Johnny Cash | Fri Feb 26 00:00:00 EST 1932 | AR | Male | Caucasian |
14 | June Carter | Sun Jun 23 00:00:00 EDT 1929 | VA | Female | Caucasian |
15 | Merle Haggard | Tue Apr 06 00:00:00 EST 1937 | CA | Male | Caucasian |
16 | Waylon Jennings | Tue Jun 15 00:00:00 EDT 1937 | TX | Male | Caucasian |
17 | Willie Nelson | Sat Apr 29 00:00:00 EST 1933 | TX | Male | Caucasian |
18 | Loretta Lynn | Thu Apr 14 00:00:00 EST 1932 | KY | Female | Caucasian |
19 | Sissy Spacek | Sun Dec 25 00:00:00 EST 1949 | TX | Female | Caucasian |
20 | Sheryl Crow | Sun Feb 11 00:00:00 EST 1962 | MO | Female | Caucasian |
21 | Charley Pride | Sun Mar 18 00:00:00 EST 1934 | MS | Male | African American |
22 | Rodney Clawon | ? | TX | Male | Caucasian |
23 | Nathan Chapman | ? | TN | Male | Caucasian |
Best Answer
-
JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
I'm sure I'm missing something important, but have you tried the De-Pivot operator before the join?
<?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.0.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="45" y="34">
<parameter key="generator_type" value="comma_separated_text"/>
<list key="function_descriptions"/>
<list key="numeric_series_configuration"/>
<list key="date_series_configuration"/>
<list key="date_series_configuration (interval)"/>
<parameter key="input_csv_text" value="Row 	Id 	Date 	People_1 	People_2 	People_3 	People_4 	People_5 	People_6 	People_7 12 	12.0 	Sat Dec 19 00:00:00 EST 1970 	Loretta Lynn 	Owen Bradley 	? 	? 	? 	? 	? 13 	13.0 	Sat Jun 07 00:00:00 EDT 1980 	Sissy Spacek 	Loretta Lynn 	Owen Bradley 	? 	? 	? 	? 14 	14.0 	Sat Dec 04 00:00:00 EST 2010 	Loretta Lynn 	Sheryl Crow 	Miranda Lambert 	? 	? 	? 	? 15 	15.0 	Sat Aug 09 00:00:00 EDT 1969 	Charley Pride 	Dallas Frazier 	A.L. "Doodle" 	Chet Atkins 	Jack Clement 	Bob Ferguson 	Felton Jarvis"/>
<parameter key="column_separator" value=" 	"/>
</operator>
<operator activated="true" class="de_pivot" compatibility="8.2.000" expanded="true" height="82" name="De-Pivot" width="90" x="179" y="34">
<list key="attribute_name">
<parameter key="People" value="People_.*"/>
</list>
<parameter key="index_attribute" value="NewID"/>
</operator>
<operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.0.000" expanded="true" height="68" name="Create ExampleSet (2)" width="90" x="45" y="187">
<parameter key="generator_type" value="comma_separated_text"/>
<list key="function_descriptions"/>
<list key="numeric_series_configuration"/>
<list key="date_series_configuration"/>
<list key="date_series_configuration (interval)"/>
<parameter key="input_csv_text" value="Row 	People 	Birth_date 	Birth_state 	Sex 	Ethnicity 2 	Charles Kelley 	Fri Sep 11 00:00:00 EDT 1981 	GA 	Male 	Caucasian 3 	Hillary Scott 	Tue Apr 01 00:00:00 EST 1986 	TN 	Female 	Caucasian 4 	Reba McEntire 	Mon Mar 28 00:00:00 EST 1955 	OK 	Female 	Caucasian 5 	Wanda Jackson 	Wed Oct 20 00:00:00 EST 1937 	OK 	Female 	Caucasian 6 	Carrie Underwood 	Thu Mar 10 00:00:00 EST 1983 	OK 	Female 	Caucasian 7 	Toby Keith 	Sat Jul 08 00:00:00 EDT 1961 	OK 	Male 	Caucasian 8 	David Bellamy 	Sat Sep 16 00:00:00 EDT 1950 	FL 	Male 	Caucasian 9 	Howard Bellamy 	Sat Feb 02 00:00:00 EST 1946 	FL 	Male 	Caucasian 10 	Keith Urban 	Thu Oct 26 00:00:00 EDT 1967 	Northland 	Male 	Caucasian 11 	Miranda Lambert 	Thu Nov 10 00:00:00 EST 1983 	TX 	Female 	Caucasian 12 	Sam Hunt 	Sat Dec 08 00:00:00 EST 1984 	GA 	Male 	Caucasian 13 	Johnny Cash 	Fri Feb 26 00:00:00 EST 1932 	AR 	Male 	Caucasian 14 	June Carter 	Sun Jun 23 00:00:00 EDT 1929 	VA 	Female 	Caucasian 15 	Merle Haggard 	Tue Apr 06 00:00:00 EST 1937 	CA 	Male 	Caucasian 16 	Waylon Jennings 	Tue Jun 15 00:00:00 EDT 1937 	TX 	Male 	Caucasian 17 	Willie Nelson 	Sat Apr 29 00:00:00 EST 1933 	TX 	Male 	Caucasian 18 	Loretta Lynn 	Thu Apr 14 00:00:00 EST 1932 	KY 	Female 	Caucasian 19 	Sissy Spacek 	Sun Dec 25 00:00:00 EST 1949 	TX 	Female 	Caucasian 20 	Sheryl Crow 	Sun Feb 11 00:00:00 EST 1962 	MO 	Female 	Caucasian 21 	Charley Pride 	Sun Mar 18 00:00:00 EST 1934 	MS 	Male 	African American 22 	Rodney Clawon 	? 	TX 	Male 	Caucasian 23 	Nathan Chapman 	? 	TN 	Male 	Caucasian"/>
<parameter key="column_separator" value=" 	"/>
</operator>
<operator activated="true" class="concurrency:join" compatibility="8.2.000" expanded="true" height="82" name="Join" width="90" x="380" y="85">
<parameter key="use_id_attribute_as_key" value="false"/>
<list key="key_attributes">
<parameter key="People" value="People"/>
</list>
</operator>
<connect from_op="Create ExampleSet" from_port="output" to_op="De-Pivot" to_port="example set input"/>
<connect from_op="De-Pivot" from_port="example set output" to_op="Join" to_port="left"/>
<connect from_op="Create ExampleSet (2)" from_port="output" to_op="Join" to_port="right"/>
<connect from_op="Join" from_port="join" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>2
Answers
Hi @SylvainM,
Based on the 2 examplesets of your post, could you post a resulting exampleset of what you want to obtain, to better understand.
Regards,
Lionel
Hello Lionel,
Thank you for your help :-)
This is what I get "by hand" (please, consider that my bio datas aren't complete yet, some names are missing, which explains why I don't have a row for each person indicated by my previous Example sets)
Best,
Sylvain
Hi again @SylvainM
Could you share your process ?
Regards,
Lionel
Hi Lionel,
So I don't know if I did it correctly (I'm very bad with computers), but this is the "joining" part of the process. It takes as inputs the intermediary results I gave in my previous post.
Thanks so much!
Sylvain
<?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="subprocess" compatibility="8.2.000" expanded="true" height="103" name="Subprocess" width="90" x="246" y="34">
<process expanded="true">
<operator activated="true" class="multiply" compatibility="8.2.000" expanded="true" height="208" name="Multiply (5)" width="90" x="45" y="34"/>
<operator activated="true" class="multiply" compatibility="8.2.000" expanded="true" height="208" name="Multiply (6)" width="90" x="45" y="340"/>
<operator activated="true" class="concurrency:join" compatibility="8.2.000" expanded="true" height="82" name="Join (14)" width="90" x="179" y="646">
<parameter key="remove_double_attributes" value="true"/>
<parameter key="join_type" value="inner"/>
<parameter key="use_id_attribute_as_key" value="false"/>
<list key="key_attributes">
<parameter key="People_7" value="People"/>
</list>
<parameter key="keep_both_join_attributes" value="false"/>
</operator>
<operator activated="true" class="concurrency:join" compatibility="8.2.000" expanded="true" height="82" name="Join (13)" width="90" x="179" y="544">
<parameter key="remove_double_attributes" value="true"/>
<parameter key="join_type" value="inner"/>
<parameter key="use_id_attribute_as_key" value="false"/>
<list key="key_attributes">
<parameter key="People_6" value="People"/>
</list>
<parameter key="keep_both_join_attributes" value="false"/>
</operator>
<operator activated="true" class="concurrency:join" compatibility="8.2.000" expanded="true" height="82" name="Join (12)" width="90" x="179" y="442">
<parameter key="remove_double_attributes" value="true"/>
<parameter key="join_type" value="inner"/>
<parameter key="use_id_attribute_as_key" value="false"/>
<list key="key_attributes">
<parameter key="People_5" value="People"/>
</list>
<parameter key="keep_both_join_attributes" value="false"/>
</operator>
<operator activated="true" class="concurrency:join" compatibility="8.2.000" expanded="true" height="82" name="Join (11)" width="90" x="179" y="340">
<parameter key="remove_double_attributes" value="true"/>
<parameter key="join_type" value="inner"/>
<parameter key="use_id_attribute_as_key" value="false"/>
<list key="key_attributes">
<parameter key="People_4" value="People"/>
</list>
<parameter key="keep_both_join_attributes" value="false"/>
</operator>
<operator activated="true" class="concurrency:join" compatibility="8.2.000" expanded="true" height="82" name="Join (10)" width="90" x="179" y="238">
<parameter key="remove_double_attributes" value="true"/>
<parameter key="join_type" value="inner"/>
<parameter key="use_id_attribute_as_key" value="false"/>
<list key="key_attributes">
<parameter key="People_3" value="People"/>
</list>
<parameter key="keep_both_join_attributes" value="false"/>
</operator>
<operator activated="true" class="concurrency:join" compatibility="8.2.000" expanded="true" height="82" name="Join (8)" width="90" x="179" y="34">
<parameter key="remove_double_attributes" value="true"/>
<parameter key="join_type" value="inner"/>
<parameter key="use_id_attribute_as_key" value="false"/>
<list key="key_attributes">
<parameter key="People_1" value="People"/>
</list>
<parameter key="keep_both_join_attributes" value="false"/>
</operator>
<operator activated="true" class="concurrency:join" compatibility="8.2.000" expanded="true" height="82" name="Join (9)" width="90" x="179" y="136">
<parameter key="remove_double_attributes" value="true"/>
<parameter key="join_type" value="inner"/>
<parameter key="use_id_attribute_as_key" value="false"/>
<list key="key_attributes">
<parameter key="People_2" value="People"/>
</list>
<parameter key="keep_both_join_attributes" value="false"/>
</operator>
<operator activated="true" class="append" compatibility="8.2.000" expanded="true" height="208" name="Append (2)" width="90" x="514" y="136">
<parameter key="datamanagement" value="double_array"/>
<parameter key="data_management" value="auto"/>
<parameter key="merge_type" value="all"/>
</operator>
<connect from_port="in 1" to_op="Multiply (5)" to_port="input"/>
<connect from_port="in 2" to_op="Multiply (6)" to_port="input"/>
<connect from_op="Multiply (5)" from_port="output 1" to_op="Join (8)" to_port="left"/>
<connect from_op="Multiply (5)" from_port="output 2" to_op="Join (9)" to_port="left"/>
<connect from_op="Multiply (5)" from_port="output 3" to_op="Join (10)" to_port="left"/>
<connect from_op="Multiply (5)" from_port="output 4" to_op="Join (11)" to_port="left"/>
<connect from_op="Multiply (5)" from_port="output 5" to_op="Join (12)" to_port="left"/>
<connect from_op="Multiply (5)" from_port="output 6" to_op="Join (13)" to_port="left"/>
<connect from_op="Multiply (5)" from_port="output 7" to_op="Join (14)" to_port="left"/>
<connect from_op="Multiply (6)" from_port="output 1" to_op="Join (8)" to_port="right"/>
<connect from_op="Multiply (6)" from_port="output 2" to_op="Join (9)" to_port="right"/>
<connect from_op="Multiply (6)" from_port="output 3" to_op="Join (10)" to_port="right"/>
<connect from_op="Multiply (6)" from_port="output 4" to_op="Join (11)" to_port="right"/>
<connect from_op="Multiply (6)" from_port="output 5" to_op="Join (12)" to_port="right"/>
<connect from_op="Multiply (6)" from_port="output 6" to_op="Join (13)" to_port="right"/>
<connect from_op="Multiply (6)" from_port="output 7" to_op="Join (14)" to_port="right"/>
<connect from_op="Join (14)" from_port="join" to_op="Append (2)" to_port="example set 7"/>
<connect from_op="Join (13)" from_port="join" to_op="Append (2)" to_port="example set 6"/>
<connect from_op="Join (12)" from_port="join" to_op="Append (2)" to_port="example set 5"/>
<connect from_op="Join (11)" from_port="join" to_op="Append (2)" to_port="example set 4"/>
<connect from_op="Join (10)" from_port="join" to_op="Append (2)" to_port="example set 3"/>
<connect from_op="Join (8)" from_port="join" to_op="Append (2)" to_port="example set 1"/>
<connect from_op="Join (9)" from_port="join" to_op="Append (2)" to_port="example set 2"/>
<connect from_op="Append (2)" from_port="merged set" to_port="out 1"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="source_in 2" spacing="0"/>
<portSpacing port="source_in 3" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="select_attributes" compatibility="8.2.000" expanded="true" height="82" name="Select Attributes (7)" width="90" x="380" y="34">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value="Birth State|ETHNICITY|Peak|SEX"/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="attribute_value"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="time"/>
<parameter key="block_type" value="attribute_block"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_matrix_row_start"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
</operator>
<connect from_op="Subprocess" from_port="out 1" to_op="Select Attributes (7)" to_port="example set input"/>
<connect from_op="Select Attributes (7)" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
And I know : it's super heavy and ugly :smileylol:
:smileysurprised: That's exactly that I was looking for! I had no idea that I could use De-pivot like that. Sorry for being such a beginner... :smileyembarrassed:
Thanks so much, JEdward! I deeply appreciate your help!
And the problem is solved, of course.