Sorting csv file on 3 different node lvls using parent id
Hi there!
I'm pretty new to RapidMiner and already did the tutorials and looked for a solution for my problem in this forum as well as on google and youtube but can't figure out how i should start with.
I have a .csv table with following attributes:
id - parent_id - level
The data comes from facebook. I downloaded a post (which is level 1) with all comments (level 2) and the re-comments (level 3). Now i got a lot of unstructured data and try to find a way to sort it in the right way. I think it's easier with an example:
unstructured (how it is)
id par_id lvl
15 1 1 --> all on lvl 1 are different posts
16 1 1
21 15 2 --> on lvl 2 we have the parent_id, in this case 15. So this is a comment of the post with the id 15
22 15 2 --> this one is also a comment of post with the id 15
23 15 2
27 16 2 --> this one is a comment (because level 2!) on the post with the id 16
28 16 2
29 16 2
52 21 3 --> on level 3 we have the comment of a comment. in this case it's a comment to the comment with the id 21 (parent_id)
53 22 3 --> this one is a re-comment on the comment with the parent_id 22
54 22 3
55 22 3
56 27 3
57 27 3
So what i want is to sort them like this:
id par_id lvl
15 1 1
21 15 2
52 21 3
22 15 2
53 22 3
54 22 3
So that the post is on first position, followed by the comment and the recomment of this comment. Then there should be the next comment with the recomments. Like you know it from Facebook.
At the end it would also be very helpful if there is a possibitity to get a own document (exampleSet) for every single post (lvl 1) with the comments and recomments out of it. So for the first post with id 15, the comments and the recomments. Would need that for analysing the data...
I know, it's a lot i ask for, but i really don't have a clue how to manage that.
I hope i explained it in way to understand what i mean. I'll also add my testfile.csv if you need it in any case. Would be super gradeful if somebody could tell me how i can manage that problem! Thanks in advance
Mike
Best Answer
-
FBT Member Posts: 106 Unicorn
This is the easiest solution I could come up with:
<?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_csv" compatibility="7.5.003" expanded="true" height="68" name="Read CSV" width="90" x="45" y="34">
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="UTF-8"/>
<list key="data_set_meta_data_information"/>
<description align="center" color="transparent" colored="false" width="126">Read-in source CSV</description>
</operator>
<operator activated="true" class="multiply" compatibility="7.5.003" expanded="true" height="124" name="Multiply" width="90" x="246" y="34">
<description align="center" color="transparent" colored="false" width="126">Multiply for level filter</description>
</operator>
<operator activated="true" class="filter_examples" compatibility="7.5.003" expanded="true" height="103" name="Filter Examples - Level 1" width="90" x="514" y="34">
<list key="filters_list">
<parameter key="filters_entry_key" value="level.eq.1"/>
</list>
</operator>
<operator activated="true" class="filter_examples" compatibility="7.5.003" expanded="true" height="103" name="Filter Examples - Level 3" width="90" x="514" y="442">
<list key="filters_list">
<parameter key="filters_entry_key" value="level.eq.3"/>
</list>
</operator>
<operator activated="true" class="filter_examples" compatibility="7.5.003" expanded="true" height="103" name="Filter Examples - Level 2" width="90" x="514" y="238">
<list key="filters_list">
<parameter key="filters_entry_key" value="level.eq.2"/>
</list>
</operator>
<operator activated="true" class="numerical_to_polynominal" compatibility="7.5.003" expanded="true" height="82" name="Numerical to Polynominal" width="90" x="715" y="238">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="ID"/>
</operator>
<operator activated="true" class="concurrency:loop_values" compatibility="7.5.003" expanded="true" height="124" name="Loop Values" width="90" x="849" y="34">
<parameter key="attribute" value="ID"/>
<parameter key="iteration_macro" value="parent_id"/>
<process expanded="true">
<operator activated="true" class="filter_examples" compatibility="7.5.003" expanded="true" height="103" name="Filter Examples (2)" width="90" x="246" y="187">
<list key="filters_list">
<parameter key="filters_entry_key" value="ID.equals.%{parent_id}"/>
</list>
</operator>
<operator activated="true" class="extract_macro" compatibility="7.5.003" expanded="true" height="68" name="Extract Macro" width="90" x="380" y="187">
<parameter key="macro" value="post"/>
<parameter key="macro_type" value="data_value"/>
<parameter key="attribute_name" value="parent_id"/>
<parameter key="example_index" value="1"/>
<list key="additional_macros"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="7.5.003" expanded="true" height="103" name="Filter Examples" width="90" x="246" y="340">
<list key="filters_list">
<parameter key="filters_entry_key" value="parent_id.eq.%{parent_id}"/>
</list>
</operator>
<operator activated="true" class="numerical_to_polynominal" compatibility="7.5.003" expanded="true" height="82" name="Numerical to Polynominal (2)" width="90" x="380" y="340">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="ID"/>
</operator>
<operator activated="true" class="append" compatibility="7.5.003" expanded="true" height="103" name="Append" width="90" x="514" y="187"/>
<operator activated="true" class="filter_examples" compatibility="7.5.003" expanded="true" height="103" name="Filter Examples (3)" width="90" x="246" y="34">
<list key="filters_list">
<parameter key="filters_entry_key" value="ID.eq.%{post}"/>
</list>
</operator>
<operator activated="true" class="numerical_to_polynominal" compatibility="7.5.003" expanded="true" height="82" name="Numerical to Polynominal (3)" width="90" x="380" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="ID"/>
</operator>
<operator activated="true" class="append" compatibility="7.5.003" expanded="true" height="103" name="Append (3)" width="90" x="648" y="34"/>
<operator activated="true" class="generate_attributes" compatibility="7.5.003" expanded="true" height="82" name="Generate Attributes" width="90" x="782" y="34">
<list key="function_descriptions">
<parameter key="duplicate" value="concat([ID],str([parent_id]),str([level]))"/>
</list>
</operator>
<connect from_port="input 1" to_op="Filter Examples (2)" to_port="example set input"/>
<connect from_port="input 2" to_op="Filter Examples" to_port="example set input"/>
<connect from_port="input 3" to_op="Filter Examples (3)" to_port="example set input"/>
<connect from_op="Filter Examples (2)" from_port="example set output" to_op="Extract Macro" to_port="example set"/>
<connect from_op="Extract Macro" from_port="example set" to_op="Append" to_port="example set 1"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Numerical to Polynominal (2)" to_port="example set input"/>
<connect from_op="Numerical to Polynominal (2)" from_port="example set output" to_op="Append" to_port="example set 2"/>
<connect from_op="Append" from_port="merged set" to_op="Append (3)" to_port="example set 2"/>
<connect from_op="Filter Examples (3)" from_port="example set output" to_op="Numerical to Polynominal (3)" to_port="example set input"/>
<connect from_op="Numerical to Polynominal (3)" from_port="example set output" to_op="Append (3)" to_port="example set 1"/>
<connect from_op="Append (3)" from_port="merged set" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_port="output 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="source_input 3" spacing="189"/>
<portSpacing port="source_input 4" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="append" compatibility="7.5.003" expanded="true" height="82" name="Append (2)" width="90" x="983" y="34"/>
<operator activated="true" class="remove_duplicates" compatibility="7.5.003" expanded="true" height="103" name="Remove Duplicates" width="90" x="1117" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="duplicate"/>
<parameter key="regular_expression" value="duplicate"/>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Filter Examples - Level 1" to_port="example set input"/>
<connect from_op="Multiply" from_port="output 2" to_op="Filter Examples - Level 2" to_port="example set input"/>
<connect from_op="Multiply" from_port="output 3" to_op="Filter Examples - Level 3" to_port="example set input"/>
<connect from_op="Filter Examples - Level 1" from_port="example set output" to_op="Loop Values" to_port="input 3"/>
<connect from_op="Filter Examples - Level 3" from_port="example set output" to_op="Loop Values" to_port="input 2"/>
<connect from_op="Filter Examples - Level 2" from_port="example set output" to_op="Numerical to Polynominal" to_port="example set input"/>
<connect from_op="Numerical to Polynominal" from_port="example set output" to_op="Loop Values" to_port="input 1"/>
<connect from_op="Loop Values" from_port="output 1" to_op="Append (2)" to_port="example set 1"/>
<connect from_op="Append (2)" from_port="merged set" to_op="Remove Duplicates" to_port="example set input"/>
<connect from_op="Remove Duplicates" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>I have not commented the steps with notes, but you should understand what it is doing by looking at the operators. This solution still feels a bit more complex than it should be, hence there are probably more elegant ways of solving this sorting issue.
0
Answers
I am not sure, if I fully understand the structure of your desired order, but as long as it is logically consistent, you can achieve this by generating a sorting attribute with the "Generate Attributes" operator and then sort it with the "Sort" operator. Take a look at below XML process, which should give you an idea of how your desired order can be achieved (it's not the complete solution yet).
Hey!
First i want to thank you! But my problem is, that i have to compare the ID with the parent_id in different rows. with the if_then_else function i just can compare the ID with the parent_id of the same row.
I need to compare the whole sheet (like in a matrix) to bring the comments and recomments to the right post. Maybe a loop could help me out, but i can't figure out how i could archive my goal...
But i really appreciate your help!
I just thought about splitting the data to compare them after!?
Should i split it to the different lvls (1, 2, 3) and compare them afterwards? And if so, how haha . *hidden*....
Yes, you could do that. Apply the "Multiply" operator right after you read in your source file. Then make as many threads as you need and use the "Filter Examples" operator to select the relevant datapoints. Afterwards you can "glue" everything back together with the "Append" operator, maybe within a loop. Having said that, I still believe you would need some kind of self-generated sorting attribute, otherwise you will do a lot of manual labor that is hardly scalable.
If you can define a logical construct of the sorting algorithm you want to apply, RapidMiner can built it.
Ok. The Data in the CSV-File are Facebook Posts. So like i said, the Post on its own is lvl1, the comments are lvl2 and the re-comments are lvl3. After exporting the data from Facepager, which is used to collect the data from Facebook, the structure is sorted by the ID (increased).
Like this:
ID parent_id lvl
15 2 1
16 2 1
21 15 2
22 15 2
23 16 2
30 21 3
31 21 3
32 22 3
33 23 3
And they should get sorted like the original Post. So, the Post (ID 15) at first, then the first comment and the recomments of this one. Then the next comment of the post, also with the recomments. So that is like we know the Posts from Facebook.
So i would need something that, at first, separetes the data on their lvl. That can be done easy with the Filter Example operator. So then we have 3 different Datasets. Then i would need some operator which counts the rows, on the lvl2&3 dataset, which have the same parent_id's. Then all rows with the same parent_id should get seperated to an own dataset again. Same on the lvl3 dataset. And, if thats not confusing enough, the lvl 1 Dataset should do some kind of if-then-else function to look in all different lvl2 dataset with the right parent_id and should append the comment, then the lvl 2 dataset should look for every recomment in the lvl3 datasets with the right parent_id again. And then it should be repeated, till the structure is like this:
ID parent_id lvl
15 2 1
21 15 2
30 21 3
31 21 3
22 15 2
I know, it's confusing as hell. And i'm really struggling now. So, what would i cost to build such a process? And thank you so much for trying to help a noob
Cheers
Thank you so much!! This works perfect. Just figured out how it works, no need for explainations.
Thank you again! Now i got what you ment with some kind of sorting attribute. And now i understand how a loop can be done to solve a few other problems i'm working on. Cheers for you calmness!