Preserve rows during text processing
Hello RapidMiner friends - I'm trying to process some text for sentiment analysis and have gotten stuck. I have an Excel spreadsheet with about 3000 rows, each of which is a free-text comment expressing sentiment towards an experience. I would like to use the "Extract Sentiment" operator from the Operator Toolbox extension to allocate a sentiment value to each individual comment.
I am importing the data, changing nominal to text, and then using Process Documents from Data with the sub-operators Tokenize, Transform Cases, Filter Stopwords (English), Filter Tokens (by Length), and Stem (Porter). When I check the results at this stage, each row is associated with a single token rather than the original string of tokens that formed the row. Is there a way around this, or a way of re-stitching the discrete tokens back together after the above steps? I need to allocate a sentiment to each row of the spreadsheet, rather than to the spreadsheet as a whole.
Many thanks for your help - and apologies if this is a newbie query
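For intuition only, here is a plain-Python sketch of the same per-row preprocessing chain, written so that each comment stays a single row. This is not RapidMiner code; the stopword list and length bounds are illustrative assumptions, and Porter stemming is omitted for brevity:

```python
import re

# Tiny illustrative stopword list, not the full English list RapidMiner uses.
STOPWORDS = {"the", "a", "an", "is", "was", "it", "and", "to", "of"}

def preprocess(comment, min_len=2, max_len=25):
    tokens = [t for t in re.split(r"[^A-Za-z]+", comment) if t]  # Tokenize (non letters)
    tokens = [t.lower() for t in tokens]                         # Transform Cases
    tokens = [t for t in tokens if t not in STOPWORDS]           # Filter Stopwords
    tokens = [t for t in tokens if min_len <= len(t) <= max_len] # Filter Tokens (by Length)
    return " ".join(tokens)  # re-join, so the row remains one string

rows = ["The service was excellent and fast", "It is an awful experience"]
processed = [preprocess(r) for r in rows]  # still one entry per original row
```

The key point is the final `" ".join(tokens)`: the tokens are stitched back into one string per comment, which is what the per-token rows in Process Documents from Data lose.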
Best Answer
MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist

Hi @JohnG22,

I think the solution is to not use Process Documents from Data, but Loop Collection to preprocess your data set like this. Attached is the process for it.

Best,
Martin

<?xml version="1.0" encoding="UTF-8"?><process version="9.8.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.8.001" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="9.8.001" expanded="true" height="68" name="Retrieve JobPosts" width="90" x="45" y="34">
<parameter key="repository_entry" value="//Training Resources/Data/Job Posts/JobPosts"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="9.8.001" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="34">
<parameter key="attribute_filter_type" value="all"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="nominal"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="file_path"/>
<parameter key="block_type" value="single_value"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="single_value"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
</operator>
<operator activated="true" class="text:data_to_documents" compatibility="9.3.001" expanded="true" height="68" name="Data to Documents" width="90" x="313" y="34">
<parameter key="select_attributes_and_weights" value="false"/>
<list key="specify_weights"/>
</operator>
<operator activated="true" class="loop_collection" compatibility="9.8.001" expanded="true" height="82" name="Loop Collection" width="90" x="447" y="34">
<parameter key="set_iteration_macro" value="false"/>
<parameter key="macro_name" value="iteration"/>
<parameter key="macro_start_value" value="1"/>
<parameter key="unfold" value="false"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="9.3.001" expanded="true" height="68" name="Tokenize" width="90" x="313" y="34">
<parameter key="mode" value="non letters"/>
<parameter key="characters" value=".:"/>
<parameter key="language" value="English"/>
<parameter key="max_token_length" value="3"/>
</operator>
<operator activated="true" class="text:filter_stopwords_english" compatibility="9.3.001" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="581" y="34"/>
<connect from_port="single" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_port="output 1"/>
<portSpacing port="source_single" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:documents_to_data" compatibility="9.3.001" expanded="true" height="82" name="Documents to Data" width="90" x="581" y="34">
<parameter key="text_attribute" value="text"/>
<parameter key="add_meta_information" value="true"/>
<parameter key="datamanagement" value="double_sparse_array"/>
<parameter key="data_management" value="auto"/>
<parameter key="use_processed_text" value="false"/>
</operator>
<operator activated="true" class="operator_toolbox:extract_sentiment" compatibility="3.0.000-SNAPSHOT" expanded="true" height="103" name="Extract Sentiment" width="90" x="715" y="34">
<parameter key="model" value="vader"/>
<parameter key="text_attribute" value="text"/>
<parameter key="show_advanced_output" value="false"/>
<parameter key="use_default_tokenization_regex" value="true"/>
<list key="additional_words"/>
</operator>
<connect from_op="Retrieve JobPosts" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
<connect from_op="Data to Documents" from_port="documents" to_op="Loop Collection" to_port="collection"/>
<connect from_op="Loop Collection" from_port="output 1" to_op="Documents to Data" to_port="documents 1"/>
<connect from_op="Documents to Data" from_port="example set" to_op="Extract Sentiment" to_port="exa"/>
<connect from_op="Extract Sentiment" from_port="exa" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany
Answers
If you have an id attribute in your spreadsheet, you can use that; if not, just use Generate ID.
Then duplicate the table with Multiply: run the preprocessing on one copy, join it back to the other copy afterwards, and select the attributes you need.
Another way would be to create a copy of the text attribute but keep it with the Nominal type.
It depends on your process whether the first or the second approach is easier.
Best regards,
Balázs
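The Multiply-and-join idea above can be sketched in plain Python: keep the original rows, preprocess a copy, and use the id to line the two back up. The sample rows and the simplistic stand-in "preprocessing" here are made up purely for illustration:

```python
# One copy keeps the original text; the other gets preprocessed.
original = {1: "Great product, would buy again", 2: "Terrible support"}

# Stand-in preprocessing (just lowercase and strip commas here).
processed = {rid: text.lower().replace(",", "") for rid, text in original.items()}

# Join back on the id: each row pairs its original and processed text.
joined = {rid: (original[rid], processed[rid]) for rid in original}
```

Because the id travels with both copies, no row alignment is lost no matter what the preprocessing does to the text, which is exactly why Generate ID before Multiply is the safe first step.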