comparing two datasets and calculating multiple values
Hello there,
I'm doing a kind of dictionary-based emotion analysis in which I want to compare a set of text messages against a dictionary. The text message dataset consists of three attributes: "date", "sender", and the "message" itself. The dictionary marks the corresponding emotion for each word with a boolean marker, like the following:
| word | emo1 | emo2 | emo3 |
|------|------|------|------|
| w1   | 1    | 0    | 1    |
| w2   | 1    | 0    | 0    |
| w3   | 0    | 1    | 0    |
| w4   | 0    | 1    | 1    |
After preprocessing the text messages and the lexicon (transform cases, filter stopwords, tokenize, stem), I end up with a word vector for the text message dataset that shows the count of each word in every message:
| date | sender | w1 | w2 | w3 | w4 |
|------|--------|----|----|----|----|
| 1.1. | alex   | 0  | 0  | 1  | 0  |
| 2.1. | max    | 1  | 0  | 0  | 1  |
| 3.1. | lisa   | 1  | 0  | 1  | 0  |
| 3.1. | alex   | 2  | 1  | 0  | 0  |
My goal is to create a table in which a score for each emotion is calculated for every message. To do so, the corresponding words for each emotion need to be counted, and the sums should be displayed like the following:
| date | sender | emo1 | emo2 | emo3 |
|------|--------|------|------|------|
| 1.1. | alex   | 0    | 1    | 0    |
| 2.1. | max    | 1    | 1    | 2    |
| 3.1. | lisa   | 1    | 1    | 1    |
| 3.1. | alex   | 3    | 0    | 2    |
Now my question is: how can I compare the two datasets and calculate the individual scores? I tried the "Intersect" operator to merge the two datasets, as can be seen in my process, but another obstacle is that a single message can contain multiple tokens, so it will only display messages that contain a single token.
An example of the lexicon and the messages is attached. I'd be thankful for your help.
<?xml version="1.0" encoding="UTF-8"?><process version="9.9.000"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="9.9.000" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="retrieve" compatibility="9.9.000" expanded="true" height="68" name="Retrieve Testdaten_Emolexikon" width="90" x="45" y="340"> <parameter key="repository_entry" value="../data/Diplomarbeit/Testdaten_Emolexikon"/> </operator> <operator activated="true" class="select_attributes" compatibility="9.9.000" expanded="true" height="82" name="Select Attributes (2)" width="90" x="179" y="340"> <parameter key="attribute_filter_type" value="subset"/> <parameter key="attribute" value=""/> <parameter key="attributes" value="english|sadness|joy"/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="attribute_value"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="time"/> <parameter key="block_type" value="attribute_block"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="value_matrix_row_start"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="false"/> </operator> <operator activated="true" class="text:process_document_from_data" compatibility="9.3.001" expanded="true" height="82" name="Process Documents from Data (2)" width="90" x="313" y="340"> <parameter key="create_word_vector" value="false"/> <parameter key="vector_creation" value="Term Occurrences"/> <parameter key="add_meta_information" value="true"/> <parameter key="keep_text" value="true"/> <parameter key="prune_method" value="none"/> <parameter key="prune_below_percent" value="3.0"/> <parameter key="prune_above_percent" value="30.0"/> <parameter key="prune_below_absolute" value="2"/> <parameter key="prune_above_absolute" value="4"/> <parameter key="prune_below_rank" value="0.05"/> <parameter key="prune_above_rank" value="0.95"/> <parameter key="datamanagement" value="double_sparse_array"/> <parameter key="data_management" value="auto"/> <parameter key="select_attributes_and_weights" value="false"/> <list key="specify_weights"/> <process expanded="true"> <operator activated="false" class="text:transform_cases" compatibility="9.3.001" expanded="true" height="68" name="Transform Cases (2)" width="90" x="112" y="34"> <parameter key="transform_to" value="lower case"/> </operator> <operator activated="false" class="text:filter_stopwords_english" compatibility="9.3.001" expanded="true" height="68" name="Filter Stopwords (English) (2)" width="90" x="246" y="34"/> <operator activated="true" class="text:stem_porter" compatibility="9.3.001" expanded="true" height="68" name="Stem (Porter) (2)" width="90" x="380" y="34"/> <connect from_port="document" to_op="Stem (Porter) (2)" to_port="document"/> <connect from_op="Stem (Porter) (2)" from_port="document" to_port="document 1"/> <portSpacing port="source_document" spacing="0"/> <portSpacing port="sink_document 1" spacing="0"/> <portSpacing port="sink_document 2" spacing="0"/> </process> </operator> <operator activated="true" class="set_role" compatibility="9.9.000" 
expanded="true" height="82" name="Set Role" width="90" x="447" y="340"> <parameter key="attribute_name" value="text"/> <parameter key="target_role" value="id"/> <list key="set_additional_roles"/> </operator> <operator activated="true" class="retrieve" compatibility="9.9.000" expanded="true" height="68" name="Retrieve messages" width="90" x="45" y="34"> <parameter key="repository_entry" value="../data/Diplomarbeit/messages"/> </operator> <operator activated="true" class="select_attributes" compatibility="9.9.000" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="34"> <parameter key="attribute_filter_type" value="subset"/> <parameter key="attribute" value=""/> <parameter key="attributes" value="date|sender|message"/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="attribute_value"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="time"/> <parameter key="block_type" value="attribute_block"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="value_matrix_row_start"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="false"/> </operator> <operator activated="true" class="text:process_document_from_data" compatibility="9.3.001" expanded="true" height="82" name="Process Documents from Data" width="90" x="313" y="34"> <parameter key="create_word_vector" value="true"/> <parameter key="vector_creation" value="Term Occurrences"/> <parameter key="add_meta_information" value="true"/> <parameter key="keep_text" value="true"/> <parameter key="prune_method" value="none"/> <parameter key="prune_below_percent" value="3.0"/> <parameter key="prune_above_percent" value="30.0"/> <parameter key="prune_below_absolute" value="2"/> <parameter key="prune_above_absolute" value="4"/> <parameter key="prune_below_rank" value="0.05"/> <parameter key="prune_above_rank" value="0.95"/> <parameter key="datamanagement" value="double_sparse_array"/> <parameter key="data_management" value="auto"/> <parameter key="select_attributes_and_weights" value="false"/> <list key="specify_weights"/> <process expanded="true"> <operator activated="true" class="text:transform_cases" compatibility="9.3.001" expanded="true" height="68" name="Transform Cases" width="90" x="112" y="34"> <parameter key="transform_to" value="lower case"/> </operator> <operator activated="true" class="text:tokenize" compatibility="9.3.001" expanded="true" height="68" name="Tokenize (2)" width="90" x="246" y="34"> <parameter key="mode" value="non letters"/> <parameter key="characters" value=".:"/> <parameter key="language" value="English"/> <parameter key="max_token_length" value="3"/> </operator> <operator activated="true" class="text:filter_stopwords_english" compatibility="9.3.001" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="380" y="34"/> <operator activated="true" class="text:filter_by_length" compatibility="9.3.001" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="514" y="34"> <parameter key="min_chars" value="3"/> <parameter key="max_chars" value="25"/> </operator> <operator activated="true" class="text:stem_porter" compatibility="9.3.001" expanded="true" height="68" name="Stem (Porter)" width="90" x="648" y="34"/> <connect from_port="document" to_op="Transform Cases" to_port="document"/> <connect from_op="Transform Cases" from_port="document" to_op="Tokenize (2)" to_port="document"/> 
<connect from_op="Tokenize (2)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/> <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/> <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Stem (Porter)" to_port="document"/> <connect from_op="Stem (Porter)" from_port="document" to_port="document 1"/> <portSpacing port="source_document" spacing="0"/> <portSpacing port="sink_document 1" spacing="0"/> <portSpacing port="sink_document 2" spacing="0"/> </process> </operator> <operator activated="true" class="set_role" compatibility="9.9.000" expanded="true" height="82" name="Set Role (2)" width="90" x="447" y="34"> <parameter key="attribute_name" value="text"/> <parameter key="target_role" value="id"/> <list key="set_additional_roles"/> </operator> <operator activated="true" class="intersect" compatibility="9.9.000" expanded="true" height="82" name="Intersect" width="90" x="648" y="34"/> <connect from_op="Retrieve Testdaten_Emolexikon" from_port="output" to_op="Select Attributes (2)" to_port="example set input"/> <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/> <connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Set Role" to_port="example set input"/> <connect from_op="Set Role" from_port="example set output" to_op="Intersect" to_port="second"/> <connect from_op="Retrieve messages" from_port="output" to_op="Select Attributes" to_port="example set input"/> <connect from_op="Select Attributes" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/> <connect from_op="Process Documents from Data" from_port="example set" to_op="Set Role (2)" to_port="example set input"/> <connect from_op="Set Role (2)" from_port="example set output" to_op="Intersect" to_port="example set input"/> <connect from_op="Intersect" from_port="example set output" to_port="result 1"/> <connect from_op="Intersect" from_port="original" to_port="result 2"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> <portSpacing port="sink_result 3" spacing="0"/> <description align="center" color="orange" colored="true" height="425" resized="true" width="157" x="10" y="10">Input</description> <description align="center" color="red" colored="true" height="427" resized="true" width="394" x="170" y="10">Processing</description> <description align="center" color="yellow" colored="true" height="430" resized="true" width="476" x="567" y="10">Analysis</description> </process> </operator> </process>
Best Answer
yyhuang: Good points, @RaphiHD. Thanks for your follow-up.
To make sure the tokens in both matrix A and matrix B are the same keywords, you can reuse the wordlist from matrix B.
Another option is discussed in @BalazsBarany's reply: de-pivot and aggregate for the weighted total. Attached is the process for your reference. You would need to install the Operator Toolbox extension from the Marketplace.
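For anyone who wants to do the token alignment in code rather than via the wordlist port, here is a minimal pandas sketch of the same idea, assuming DataFrames `A` (messages as rows, tokens as columns) and `B` (tokens as rows, emotions as columns) as in the tables above:

```python
import pandas as pd

def align_tokens(A: pd.DataFrame, B: pd.DataFrame):
    """Keep only the tokens that occur both as columns of A (the message
    word vector) and as rows of B (the lexicon), in identical order."""
    common = [tok for tok in B.index if tok in A.columns]
    return A[common], B.loc[common]

# A, B = align_tokens(A, B)   # afterwards A.dot(B) is well-defined
```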
Answers
Thanks for sharing sample data and the process. You are essentially building an NLP word-embedding model! What you need here is a simple matrix multiplication.
However, in RapidMiner we do not have a code-free operator for multiplying two matrices. Previous discussions:
https://community.rapidminer.com/discussion/28244/solved-dot-product/p1
https://community.rapidminer.com/discussion/54553/about-matrix-multiplication
https://community.rapidminer.com/discussion/3372/matrix-multiplication
To solve your problem, you can call R/Python code to get the job done.
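To make that concrete, here is a minimal Python/pandas sketch (an illustration using the toy numbers from the question, not the attached process itself):

```python
import pandas as pd

# Matrix A: term occurrences per message (output of Process Documents from Data)
A = pd.DataFrame(
    {"w1": [0, 1, 1, 2], "w2": [0, 0, 0, 1],
     "w3": [1, 0, 1, 0], "w4": [0, 1, 0, 0]},
    index=["1.1. alex", "2.1. max", "3.1. lisa", "3.1. alex"],
)

# Matrix B: boolean emotion markers per word (the lexicon)
B = pd.DataFrame(
    {"emo1": [1, 1, 0, 0], "emo2": [0, 0, 1, 1], "emo3": [1, 0, 0, 1]},
    index=["w1", "w2", "w3", "w4"],
)

# The emotion scores are exactly the matrix product A x B
scores = A.dot(B)
print(scores)   # per message: [0, 1, 0], [1, 1, 2], [1, 1, 1], [3, 0, 2]
```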
Your input matrices A and B look like this.
Using the attached process, the multiplied result A × B gives you the emotion scores per message.
Keep in mind that the token order in matrix A (as columns) must be the same as the order in matrix B (as rows). That's why I used a "reorder" trick before the matrix multiplication...
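In pandas, that reorder trick is a one-line `reindex` before the product. Inside RapidMiner's Execute Python operator (Python Scripting extension) a script along these lines could do the whole job; the column names `date`, `sender`, and `word` are taken from the examples above and would need to match the real data:

```python
import pandas as pd

def rm_main(messages, lexicon):
    """Execute Python entry point: both example sets arrive as DataFrames."""
    B = lexicon.set_index("word")                    # tokens as rows of B
    A = messages.drop(columns=["date", "sender"])    # token counts = matrix A
    A = A.reindex(columns=B.index, fill_value=0)     # the "reorder" trick
    scores = A.dot(B)                                # matrix product A x B
    return pd.concat([messages[["date", "sender"]], scores], axis=1)
```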
Cheers,
YY
Another way to solve this could be de-pivoting the "wide" data, so that w1, w2, w3, etc. are turned into rows instead of columns. That would then be a simple join with the first table.
Then you could aggregate by user and get the sum of the sentiment values.
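A rough pandas equivalent of that De-Pivot / Join / Aggregate chain, reusing the toy DataFrames `A` and `B` from the sketch above:

```python
# De-pivot: turn the token columns w1..w4 into rows ("long" format)
long = A.rename_axis("message").reset_index().melt(
    id_vars="message", var_name="word", value_name="count")

# Join with the lexicon on the token
long = long.merge(B.rename_axis("word").reset_index(), on="word")

# Weight each emotion marker by the token count, then sum per message
for emo in ["emo1", "emo2", "emo3"]:
    long[emo] *= long["count"]
scores = long.groupby("message")[["emo1", "emo2", "emo3"]].sum()
```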
Regards,
Balázs
Thank you for the quick answer, I appreciate it very much. Also, thanks for sharing that detailed process; it helped me a lot. It took me some time to comprehend all the steps, but in the end I was able to reproduce the results of your answer.
When replacing the example data with my real data, I face the problem that the number of tokens used in the student messages no longer matches the number of entries in my lexicon. In other words: the number of columns of matrix A doesn't match the rows of matrix B anymore. Therefore I have to remove all the tokens (= attributes) which have no equivalent in the rows of the lexicon.
Does anyone have an idea on how to achieve this?
@BalazsBarany Thank you for providing the De-Pivot/Aggregate idea; I eventually chose that approach. I had originally tried to use the De-Pivot operator, but struggled to find the right regular expression. With the code that @yyhuang provided, that issue was solved as well.
@yyhuang Thank you so much for solving my problem by implementing basically the whole process for me. It ended up much more complex than I expected, and your work probably saved me dozens of hours of experimenting. Big shoutout to this community and its well-functioning teamwork!
Best regards
Regex is tough; some tools like this could help a bit...
After de-pivoting, a filter that removes the tokens with 0 term occurrences can be used before the join. It will make the join faster.
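In pandas terms that filter is a single line on the de-pivoted table from the sketch above:

```python
# Drop message/token rows with zero occurrences before the join
long = long[long["count"] > 0]
```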