"Comparing a white list of words to a word list from a data-mined webpage"
Currently, I need to compare a word list that I have created (the 'White List') with a word list that RapidMiner has generated from a webpage. The word list from the webpage is already tokenized and filtered. What I want to do is import my own word list into the process and compare it with the output of the existing process, so that I can build a matching scheme. For example, if the 'White List' contains "imaging" and the word list from the output also contains "imaging", that counts as a match and the word is moved into a new output file.
If you need more information, let me know.
Ian
Best Answer
MartinLiebig, RM Data Scientist
Hi Idunn,
Have a look at the attached process; it works well for me.
~Martin
<?xml version="1.0" encoding="UTF-8"?><process version="7.3.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.3.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="generate_data_user_specification" compatibility="7.3.001" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="45" y="85">
<list key="attribute_values">
<parameter key="text" value="&quot;this is a text&quot;"/>
</list>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="7.3.001" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="85"/>
<operator activated="true" class="text:process_document_from_data" compatibility="7.2.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="380" y="85">
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="7.2.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="85"/>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:wordlist_to_data" compatibility="7.2.000" expanded="true" height="82" name="WordList to Data" width="90" x="514" y="85"/>
<operator activated="true" class="generate_data_user_specification" compatibility="7.3.001" expanded="true" height="68" name="Generate Data by User Specification (2)" width="90" x="45" y="238">
<list key="attribute_values">
<parameter key="text" value="&quot;this is another text&quot;"/>
</list>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="7.3.001" expanded="true" height="82" name="Nominal to Text (2)" width="90" x="179" y="238"/>
<operator activated="true" class="text:process_document_from_data" compatibility="7.2.000" expanded="true" height="82" name="Process Documents from Data (2)" width="90" x="380" y="238">
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="7.2.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="179" y="85"/>
<connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:wordlist_to_data" compatibility="7.2.000" expanded="true" height="82" name="WordList to Data (2)" width="90" x="514" y="238"/>
<operator activated="true" class="join" compatibility="7.3.001" expanded="true" height="82" name="Join" width="90" x="648" y="136">
<parameter key="remove_double_attributes" value="false"/>
<parameter key="join_type" value="outer"/>
<parameter key="use_id_attribute_as_key" value="false"/>
<list key="key_attributes">
<parameter key="word" value="word"/>
</list>
</operator>
<connect from_op="Generate Data by User Specification" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="word list" to_op="WordList to Data" to_port="word list"/>
<connect from_op="WordList to Data" from_port="example set" to_op="Join" to_port="left"/>
<connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Nominal to Text (2)" to_port="example set input"/>
<connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
<connect from_op="Process Documents from Data (2)" from_port="word list" to_op="WordList to Data (2)" to_port="word list"/>
<connect from_op="WordList to Data (2)" from_port="example set" to_op="Join" to_port="right"/>
<connect from_op="Join" from_port="join" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany
Answers
Hi Ian,
You can use WordList to Data twice, join the resulting tables, and then apply standard filtering operations.
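Outside RapidMiner, the same idea (two word lists, an outer join on the word, then a filter for words present on both sides) can be sketched roughly in plain Python. This is only an illustration of the logic, not what the operators do internally: the function names `word_counts`, `outer_join` and `matches` are made up for this sketch, the sample texts are hypothetical apart from "imaging", and lowercasing the tokens is an assumption added here.

```python
import re
from collections import Counter


def word_counts(text):
    """Split on non-letter characters and count each token.

    Lowercasing is an assumption of this sketch so that "Imaging" and
    "imaging" are treated as the same word.
    """
    return Counter(t.lower() for t in re.split(r"[^A-Za-z]+", text) if t)


def outer_join(left, right):
    """Full outer join on the word; a side that lacks the word gets None."""
    all_words = sorted(set(left) | set(right))
    return {w: (left.get(w), right.get(w)) for w in all_words}


def matches(joined):
    """Keep only the words that occur in both lists (neither side is None)."""
    return [w for w, (l, r) in joined.items() if l is not None and r is not None]


# Hypothetical sample data: a small white list and some page text.
white_list = word_counts("imaging radiology scanner")
page_words = word_counts("New imaging hardware and a faster scanner")

joined = outer_join(white_list, page_words)
print(matches(joined))  # ['imaging', 'scanner']
```

The outer join keeps every word from both lists, so the filter step decides what counts as a match. Filtering for a missing value on one side instead would give the words that appear in only one of the lists.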
Best,
Martin
Ok, so I've created both word lists, but I am confused about how to compare them now. Every time I join them or create a union, it does not compare the lists. That is, it deletes the words they have in common and neither counts them as a term occurrence nor creates a comparison word list.
Did you use WordList to Data and do an outer join on the word attribute?
Best,
Martin
Oh man, can you tell I'm new?
Ok, so I see what you're saying about the outer join operation, but my two word lists do not have attribute IDs and I do not know how to set them. Every time I go from WordList to Data into a Join operator, no attribute list is set, and when I turn off "use id attribute as key" on the Join operator it just outputs useless information. So, how do I give my data attributes within the tables during the process? If I can do this, I do believe the Join operator would work.
Ian
You've solved it! I had practically everything except Nominal to Text.
Could you explain this block to me in this form? Why did the program not work until I had that exact operation on? Does it just give attributes to tables?
Idunn
Hi,
You mean Nominal to Text? It converts attributes from polynominal types (red, green, blue / yes, no) to the text type (= unique strings). This type is needed for text mining: Process Documents from Data only works on text attributes.
~Martin