"Text Mining - Document Similarity/Clustering"
Hello All,
I am trying to perform document similarity/clustering in RapidMiner on a survey text field and have had problems so far. The data is saved in an Excel file (.xlsx), and I need to process the documents so that the text is lowercased, tokenized, and stemmed, and the stopwords are filtered out. Could you please run me through the nodes that I need to apply to the data so that I can perform document similarity and clustering? I have watched the 'el chief' tutorials on YouTube and unfortunately they haven't worked for me. I have tried the following nodes (in order) and I get a blank output:
1. Read Excel
2. Data to Documents
3. Process Documents (+ Tokenize, Filter Stopwords (English), Transform Cases, Stem (Porter))
4. Data to Similarity
Best Answers
MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
Hi,
Is your text attribute of type text or nominal? Data to Documents needs an attribute of type text, so a nominal attribute has to be converted first (the sample process uses Nominal to Text for this). I would also recommend using Cross Distances instead of Data to Similarity.
Attached is a sample process.
Best,
Martin
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.4.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.4.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="false" class="read_excel" compatibility="6.4.000" expanded="true" height="60" name="Read Excel" width="90" x="45" y="30">
<parameter key="excel_file" value="C:\Users\elie.rahi\Desktop\############\###############\###########.xlsx"/>
<list key="annotations"/>
<list key="data_set_meta_data_information"/>
</operator>
<operator activated="true" class="subprocess" compatibility="6.4.000" expanded="true" height="76" name="Get Data" width="90" x="45" y="120">
<process expanded="true">
<operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="60" name="Generate Data by User Specification" width="90" x="179" y="75">
<list key="attribute_values">
<parameter key="Text" value="&quot;Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.&quot;"/>
</list>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="60" name="Generate Data by User Specification (2)" width="90" x="179" y="165">
<list key="attribute_values">
<parameter key="Text" value="&quot;Lorem ipsum&quot;"/>
</list>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="append" compatibility="6.4.000" expanded="true" height="94" name="Append" width="90" x="313" y="75"/>
<connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 1"/>
<connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 2"/>
<connect from_op="Append" from_port="merged set" to_port="out 1"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
</process>
<description align="center" color="transparent" colored="false" width="126">Simply generate some test data</description>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="6.4.000" expanded="true" height="76" name="Nominal to Text" width="90" x="179" y="120"/>
<operator activated="true" class="text:data_to_documents" compatibility="6.4.001" expanded="true" height="60" name="Data to Documents" width="90" x="313" y="120">
<list key="specify_weights"/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="6.4.001" expanded="true" height="94" name="Process Documents" width="90" x="447" y="120">
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="6.4.001" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
<operator activated="true" class="text:transform_cases" compatibility="6.4.001" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="30"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="6.4.001" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="313" y="30"/>
<operator activated="true" class="text:stem_porter" compatibility="6.4.001" expanded="true" height="60" name="Stem (Porter)" width="90" x="447" y="30"/>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
<connect from_op="Stem (Porter)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="multiply" compatibility="6.4.000" expanded="true" height="94" name="Multiply" width="90" x="648" y="120"/>
<operator activated="true" class="cross_distances" compatibility="6.4.000" expanded="true" height="94" name="Cross Distances" width="90" x="782" y="120">
<parameter key="measure_types" value="NumericalMeasures"/>
</operator>
<connect from_op="Get Data" from_port="out 1" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
<connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Cross Distances" to_port="request set"/>
<connect from_op="Multiply" from_port="output 2" to_op="Cross Distances" to_port="reference set"/>
<connect from_op="Cross Distances" from_port="result set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany
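For readers who want to see what the Process Documents chain in the sample process does to each text, here is a minimal plain-Python sketch of the same four steps (tokenize, lowercase, filter stopwords, stem). It is not RapidMiner code: the stopword list is a tiny made-up sample, `crude_stem` is a rough stand-in for Porter stemming rather than the real algorithm, and the two example sentences are invented for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not RapidMiner's English list).
STOPWORDS = {"a", "an", "and", "the", "is", "was", "of", "to", "in", "it"}

def crude_stem(token):
    """Rough suffix stripping; a stand-in for Porter stemming, not the real algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = re.findall(r"[A-Za-z]+", text)              # Tokenize: keep runs of letters
    tokens = [t.lower() for t in tokens]                 # Transform Cases: lowercase
    tokens = [t for t in tokens if t not in STOPWORDS]   # Filter Stopwords
    return [crude_stem(t) for t in tokens]               # Stem (crude approximation)

def term_counts(texts):
    """One term-frequency vector (a Counter) per document."""
    return [Counter(preprocess(t)) for t in texts]

# Invented survey answers, for illustration only.
docs = [
    "The service was slow and the staff ignored complaints.",
    "Slow service, staff ignoring a complaint.",
]
vectors = term_counts(docs)
for doc, vec in zip(docs, vectors):
    print(doc, "->", dict(vec))
```

After preprocessing, both example sentences reduce to the same set of stems, which is why documents worded differently can still end up close together in a later distance or similarity step.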
rahi84 Member Posts: 3 Contributor I
Thank you, I've solved this. The issue was that the data was not of type text. The Nominal to Text node fixed that.
Answers
Could you post the XML of your process? Then I could check for the mistake much more easily.
Cheers,
Martin
Dortmund, Germany
I have 'blacked out' the directory for privacy.
Please see below the XML code:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.4.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.4.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="6.4.000" expanded="true" height="60" name="Read Excel" width="90" x="112" y="120">
<parameter key="excel_file" value="C:\Users\elie.rahi\Desktop\############\###############\###########.xlsx"/>
<list key="annotations"/>
<list key="data_set_meta_data_information"/>
</operator>
<operator activated="true" class="text:data_to_documents" compatibility="6.4.001" expanded="true" height="60" name="Data to Documents" width="90" x="246" y="120">
<list key="specify_weights"/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="6.4.001" expanded="true" height="94" name="Process Documents" width="90" x="380" y="255">
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="6.4.001" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
<operator activated="true" class="text:transform_cases" compatibility="6.4.001" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="30"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="6.4.001" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="313" y="30"/>
<operator activated="true" class="text:stem_porter" compatibility="6.4.001" expanded="true" height="60" name="Stem (Porter)" width="90" x="447" y="30"/>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
<connect from_op="Stem (Porter)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="data_to_similarity" compatibility="6.4.000" expanded="true" height="76" name="Data to Similarity" width="90" x="581" y="255">
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="numerical_measure" value="CosineSimilarity"/>
</operator>
<connect from_op="Read Excel" from_port="output" to_op="Data to Documents" to_port="example set"/>
<connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_op="Data to Similarity" to_port="example set"/>
<connect from_op="Data to Similarity" from_port="similarity" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
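The Data to Similarity operator in the process above is set to the CosineSimilarity measure. As a rough illustration of what that measure computes on word vectors, here is a plain-Python sketch; it is not RapidMiner's implementation, and the three example vectors are invented term counts.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

def similarity_matrix(vectors):
    """Pairwise similarities between all documents."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

# Invented term counts for three short documents.
vectors = [
    Counter({"slow": 1, "service": 1}),
    Counter({"slow": 1, "staff": 1}),
    Counter({"price": 2}),
]
matrix = similarity_matrix(vectors)
for row in matrix:
    print([round(x, 2) for x in row])
```

Documents that share no terms score 0, and a document compared with itself scores 1 (up to floating-point rounding), so an empty or all-zero word vector is one thing to check for when a similarity result looks blank.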
Hello... Please help me cluster words with similar meanings in a document. It's really urgent.
There are a number of different ways that you might approach that, but if you have a relatively short list of synonymous words/tokens, then you can use the "Replace Tokens" operator inside the "Process Documents" operator. It allows you to map a set of related tokens to a single token that represents the set. You can create as many entries as you want.
If you need something more complicated, there is a synonym-finding operator in the WordNet extension, which is available for free in the RapidMiner Marketplace.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
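The token-mapping idea described above can be sketched in a few lines of plain Python. This is only an illustration of the concept, not the RapidMiner operator itself; the `SYNONYMS` table and the example tokens are hypothetical.

```python
# Hypothetical mapping: each related token points to one representative token.
SYNONYMS = {
    "cheap": "inexpensive",
    "affordable": "inexpensive",
    "quick": "fast",
    "rapid": "fast",
}

def replace_tokens(tokens, mapping):
    """Replace every token found in the mapping; leave all others unchanged."""
    return [mapping.get(token, token) for token in tokens]

tokens = ["quick", "cheap", "meal", "rapid", "delivery"]
print(replace_tokens(tokens, SYNONYMS))
```

Once synonyms collapse to a single token, documents that use different words for the same idea share a term in the word vector, so any similarity or clustering step that follows treats them as closer.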
Thank you so much for your response. Can you please tell me how to cluster all of them?