Use a Web Mining model on a new page
I'm using a process from the book Predictive Analytics and Data Mining using RapidMiner; specifically, one from Chapter 9.1, which I will post below. After tweaking the process so that the input file uses https: pages instead of http:, I get a two-cluster K-Medoids model out of it.
How can I use this model to evaluate a brand new webpage to see which cluster it belongs to?
Thanks,
Tom
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.1.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="read_csv" compatibility="6.0.003" expanded="true" height="60" name="Read URL list (text file)" width="90" x="45" y="30">
<parameter key="csv_file" value="pages.txt"/>
<parameter key="column_separators" value=","/>
<parameter key="trim_lines" value="false"/>
<parameter key="use_quotes" value="true"/>
<parameter key="quotes_character" value="""/>
<parameter key="escape_character" value="\"/>
<parameter key="skip_comments" value="false"/>
<parameter key="comment_characters" value="#"/>
<parameter key="parse_numbers" value="true"/>
<parameter key="decimal_character" value="."/>
<parameter key="grouped_digits" value="false"/>
<parameter key="grouping_character" value=","/>
<parameter key="date_format" value=""/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="time_zone" value="SYSTEM"/>
<parameter key="locale" value="English (United States)"/>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="links_to_scan.true.text.attribute"/>
</list>
<parameter key="read_not_matching_values_as_missings" value="true"/>
<parameter key="datamanagement" value="double_array"/>
</operator>
<operator activated="true" class="web:retrieve_webpages" compatibility="5.3.002" expanded="true" height="60" name="Get Pages" width="90" x="45" y="120">
<parameter key="link_attribute" value="links_to_scan"/>
<parameter key="page_attribute" value="page_content"/>
<parameter key="random_user_agent" value="false"/>
<parameter key="connection_timeout" value="10000"/>
<parameter key="read_timeout" value="10000"/>
<parameter key="follow_redirects" value="true"/>
<parameter key="accept_cookies" value="none"/>
<parameter key="cookie_scope" value="global"/>
<parameter key="request_method" value="GET"/>
<parameter key="delay" value="none"/>
<parameter key="delay_amount" value="1000"/>
<parameter key="min_delay_amount" value="0"/>
<parameter key="max_delay_amount" value="1000"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="6.1.000" expanded="true" height="76" name="Select Attributes - remove meta" width="90" x="45" y="210">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value="|page_content|links_to_scan"/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="attribute_value"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="time"/>
<parameter key="block_type" value="attribute_block"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_matrix_row_start"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data" width="90" x="179" y="255">
<parameter key="create_word_vector" value="true"/>
<parameter key="vector_creation" value="Term Occurrences"/>
<parameter key="add_meta_information" value="false"/>
<parameter key="keep_text" value="false"/>
<parameter key="prune_method" value="absolute"/>
<parameter key="prune_below_percent" value="3.0"/>
<parameter key="prune_above_percent" value="30.0"/>
<parameter key="prune_below_absolute" value="2"/>
<parameter key="prune_above_absolute" value="3"/>
<parameter key="prune_below_rank" value="0.05"/>
<parameter key="prune_above_rank" value="0.95"/>
<parameter key="datamanagement" value="double_sparse_array"/>
<parameter key="select_attributes_and_weights" value="false"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="web:extract_html_text_content" compatibility="5.3.002" expanded="true" height="60" name="Extract Content (2)" width="90" x="45" y="30">
<parameter key="extract_content" value="true"/>
<parameter key="minimum_text_block_length" value="5"/>
<parameter key="override_content_type_information" value="true"/>
<parameter key="neglegt_span_tags" value="true"/>
<parameter key="neglect_p_tags" value="true"/>
<parameter key="neglect_b_tags" value="true"/>
<parameter key="neglect_i_tags" value="true"/>
<parameter key="neglect_br_tags" value="true"/>
<parameter key="ignore_non_html_tags" value="true"/>
</operator>
<operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="45" y="120">
<parameter key="mode" value="non letters"/>
<parameter key="characters" value=".:"/>
<parameter key="language" value="English"/>
<parameter key="max_token_length" value="3"/>
</operator>
<operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords" width="90" x="45" y="210"/>
<operator activated="true" class="text:filter_by_length" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="45" y="300">
<parameter key="min_chars" value="4"/>
<parameter key="max_chars" value="25"/>
</operator>
<operator activated="true" class="text:filter_tokens_by_content" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (by Content)" width="90" x="179" y="30">
<parameter key="condition" value="contains"/>
<parameter key="string" value="detroit"/>
<parameter key="case_sensitive" value="false"/>
<parameter key="invert condition" value="true"/>
</operator>
<operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="120">
<parameter key="transform_to" value="lower case"/>
</operator>
<connect from_port="document" to_op="Extract Content (2)" to_port="document"/>
<connect from_op="Extract Content (2)" from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords" to_port="document"/>
<connect from_op="Filter Stopwords" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
<connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
<connect from_op="Filter Tokens (by Content)" from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="select_attributes" compatibility="6.1.000" expanded="true" height="76" name="Select Attributes" width="90" x="313" y="120">
<parameter key="attribute_filter_type" value="numeric_value_filter"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value=""/>
<parameter key="regular_expression" value="value<=5"/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="attribute_value"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="time"/>
<parameter key="block_type" value="attribute_block"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_matrix_row_start"/>
<parameter key="numeric_condition" value="<= 2"/>
<parameter key="invert_selection" value="true"/>
<parameter key="include_special_attributes" value="false"/>
</operator>
<operator activated="true" class="k_medoids" compatibility="6.1.000" expanded="true" height="76" name="Clustering" width="90" x="447" y="120">
<parameter key="add_cluster_attribute" value="true"/>
<parameter key="add_as_label" value="true"/>
<parameter key="remove_unlabeled" value="true"/>
<parameter key="k" value="2"/>
<parameter key="max_runs" value="10"/>
<parameter key="max_optimization_steps" value="100"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="mixed_measure" value="MixedEuclideanDistance"/>
<parameter key="nominal_measure" value="NominalDistance"/>
<parameter key="numerical_measure" value="EuclideanDistance"/>
<parameter key="divergence" value="GeneralizedIDivergence"/>
<parameter key="kernel_type" value="radial"/>
<parameter key="kernel_gamma" value="1.0"/>
<parameter key="kernel_sigma1" value="1.0"/>
<parameter key="kernel_sigma2" value="0.0"/>
<parameter key="kernel_sigma3" value="2.0"/>
<parameter key="kernel_degree" value="3.0"/>
<parameter key="kernel_shift" value="1.0"/>
<parameter key="kernel_a" value="1.0"/>
<parameter key="kernel_b" value="0.0"/>
</operator>
<connect from_op="Read URL list (text file)" from_port="output" to_op="Get Pages" to_port="Example Set"/>
<connect from_op="Get Pages" from_port="Example Set" to_op="Select Attributes - remove meta" to_port="example set input"/>
<connect from_op="Select Attributes - remove meta" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Process Documents from Data" from_port="word list" to_port="result 1"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_port="result 2"/>
<connect from_op="Clustering" from_port="clustered set" to_port="result 3"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
</process>
</operator>
</process>
Best Answer
lionelderkrikor (RapidMiner Certified Analyst)
I see in your process that you are running RapidMiner 6.1 (process version="6.1.000").
I advise you to update RapidMiner (currently version 8.2).
Here is a screenshot of the process:
Regards,
Lionel
Answers
The data source file is called pages.txt. It is a 3-line file as follows:
links_to_scan
https://www.detroitperforms.org/category/dance/
https://www.detroitperforms.org/category/film/
Hi @tschmidt,
You have to:
- Perform the same preprocessing steps on your score dataset (Get Pages, Select Attributes, Process Documents, etc.).
- Connect the wor output port of the Process Documents operator of the "train branch" of your process to the wor input port of the Process Documents operator of the "score branch".
- Use the Apply Model operator.
Here is the process:
I hope it helps,
Regards,
Lionel
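For anyone who wants to see the same idea outside RapidMiner, here is a rough Python sketch of the flow Lionel describes. It is only an illustration, not the book's process: it assumes scikit-learn and scikit-learn-extra are installed, uses a crude fetch_text helper instead of Get Pages / Extract Content, skips the pruning settings, and the "new page" URL is a placeholder. The point is that the new page is vectorized with the same vocabulary as the training pages and then assigned to the nearest medoid, which is what Apply Model does with the cluster model.

import re
import urllib.request
from sklearn.feature_extraction.text import CountVectorizer
from sklearn_extra.cluster import KMedoids  # assumption: scikit-learn-extra is available

def fetch_text(url):
    # Crude stand-in for Get Pages + Extract Content: download the page and strip HTML tags.
    html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
    return re.sub(r"<[^>]+>", " ", html)

# "Train branch": the pages that were clustered originally (from pages.txt).
train_urls = [
    "https://www.detroitperforms.org/category/dance/",
    "https://www.detroitperforms.org/category/film/",
]
train_docs = [fetch_text(u) for u in train_urls]

# Term-occurrence vectors: lowercased, English stopwords removed, tokens of 4-25 letters,
# roughly mimicking the Tokenize / Filter Stopwords / Filter Tokens (by Length) / Transform Cases chain.
vectorizer = CountVectorizer(lowercase=True, stop_words="english", token_pattern=r"[A-Za-z]{4,25}")
X_train = vectorizer.fit_transform(train_docs).toarray()

# Two-cluster k-medoids, like the Clustering operator with k = 2.
model = KMedoids(n_clusters=2, metric="euclidean", random_state=2001).fit(X_train)

# "Score branch": a brand new page (placeholder URL), vectorized with the SAME vocabulary
# (this is what the wor-wor link between the two Process Documents operators guarantees).
new_url = "https://example.org/some-new-page"
X_new = vectorizer.transform([fetch_text(new_url)]).toarray()

# Nearest medoid = the cluster Apply Model would report.
print("new page goes to cluster_%d" % model.predict(X_new)[0])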
Will give it a try, Lionel. I note that there are items in your process XML that keep me from loading it, like this one:
\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Predictive_Analytics_and_Data_Mining
I'm going to try to follow your directions. I don't want to test on the same data, so I guess what I will want to do is add another Read Pages step and then see if I can classify.
Thanks for the suggestion,
Tom
Note also that there is no "mod" output anywhere in the process I sent, so there's nothing to feed into an Apply Model operator's mod input.
@tschmidt,
You have to set the path of the Read URL list (Read CSV) operator to your own path and file (as in the process you shared).
Secondly, you have to do the same on the Read URL list of the score branch, pointing it to the new file containing the link you want to classify.
Regards,
Lionel
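For example, the score-branch file can simply mirror pages.txt: the same links_to_scan header line followed by whichever page you want to classify. The URL below is only a placeholder:
links_to_scan
https://example.org/some-new-page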
@tschmidt,
To better understand, can you import the process I shared?
You have to connect the first clu output port of the Clustering operator to the mod input port of the Apply Model operator.
Regards,
Lionel
No, your process as posted does not import. Can you perhaps post a screenshot of what you are doing?
Importing your process gives the attached result.
I don't have a Score Branch in the process I loaded here, so your instructions confuse me.
OK, Clu added to Mod input. Now I need to get some input data. Let's see what I get.
OK, I got something working with it. Used input from American Ballet Theater. I'll take it for now, and thanks for the help.
Tom
You're welcome, @tschmidt
Regards,
Lionel
I wound up with a very similar process. I have one question. You take the output from Wor on the top Process Documents and feed it into Wor on the bottom Process Documents. You take the output from Exa on the bottom Select Attributes and feed it into the bottom Process Documents.
I did exactly as your picture has it, except for the Wor link between the two Process Documents. Applying the model to a different web page gave me a decision showing that my new web page belonged to the "correct" cluster.
What does the Wor-Wor link give me?
Thanks,
Tom
I'm running RapidMiner Studio 8.2. The script I posted is from a book published in 2016, in the dark days of RapidMiner 6.x, I guess.
Does it make a difference if the script uses the older version?
Hi @tschmidt,
Sometimes there are compatibility problems between different versions of RapidMiner.
More generally, it is always worthwhile to run the latest version in order to benefit from the latest features.
Regards,
Lionel
@tschmidt,
You have set k = 2 in your cluster model.
I'll make a bet: I'm sure that your new link is classified as "cluster_0".
Am I right?
...(to be continued)...
Regards,
Lionel
Yes, we got the Ballet page to go to cluster_0.
This is actually a useful thing for a number of purposes. We could build a simple two-way classifier and apply it to many web pages.
@tschmidt,
Your question about the wor connection between the two Process Documents operators prompted me to execute this process with and without the connection, and I discovered a result which, to me, is weird:
I used the following websites to train the cluster model with k = 2:
https://www.laprovence.com/Edition-marseille
https://www.laprovence.com/faits-divers-justice
I used the following websites for scoring:
https://fr.news.yahoo.com/
https://fr.news.yahoo.com/france/
https://news.google.com/?taa=1&hl=fr&gl=FR&ceid=FR:fr
https://news.google.com/topics/CAAqIggKIhxDQkFTRHdvSkwyMHZNR1k0YkRsakVnSm1jaWdBUAE?hl=fr&gl=FR&ceid=FR%3Afr
https://news.google.com/topics/CAAqKAgKIiJDQkFTRXdvSkwyMHZNR1ptZHpWbUVnSm1jaG9DUmxJb0FBUAE?hl=fr&gl=FR&ceid=FR%3Afr
So intuitively, we would expect the first 2 websites to be classified in one cluster and the last 3 websites in the other cluster.
1/ Process executed without the link between the wor ports:
2/ Process executed with the link between the wor ports:
We see that for the first 2 websites all attributes = 0, while for the last 3 websites most attributes have a non-zero value, so looking at these results we would intuitively expect the first 2 websites to belong to one cluster and the last 3 websites to the other cluster.
However, RapidMiner classifies all the websites into one single cluster.
- Is there a rational explanation for this result?
- Why don't we find the same results as in the first case, results that, to me, are more in line with reality?
- What should we conclude: link or no link between the wor ports of Process Documents?
Attached are the text files with the training/score websites.
The process:
Thank you,
Regards,
Lionel
Linking the wordlist ports forces RapidMiner to use the wordlist generated from the first set when analyzing the second set. It will still tokenize and otherwise process the documents with the inner operators, but it will then only perform word-vector calculations for the tokens that exist in the first wordlist. If you are building a predictive model, this is absolutely necessary, since the words from that list will be attributes in the model; if you generate a completely new wordlist on the fly, the required attributes can be missing, and you will get an error message when you Apply Model.
For clustering, it's not strictly necessary because the clustering "model" will still work even if there are new terms included and old terms missing. However, it's not really going to be the same clustering model as before. In essence, you should be using the same wordlist, but you also should have representatives of all future "types" of documents in your original set. When you don't, the behavior of the clustering model is not going to be stable or reliable, which is what you are observing with your "experiment" :-)
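To make the attribute-mismatch point concrete, here is a small scikit-learn sketch (an illustration only, not RapidMiner's internals; the documents are made-up stand-ins for the French news pages). Reusing the fitted vectorizer is the analogue of the wor-wor link: the score page is counted against the training word list, so the columns line up with the model's attributes even if most counts are zero. Fitting a fresh vectorizer on the score documents produces a different set of columns, which is why a predictive Apply Model would complain about missing attributes and why the clustering distances are no longer comparable.

from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["marseille provence faits divers justice",   # stand-ins for the training pages
              "marseille provence edition actualites"]
score_docs = ["france monde politique yahoo google"]        # stand-in for a scoring page

# With the wor-wor link: score docs are counted against the TRAINING word list.
trained = CountVectorizer().fit(train_docs)
print(trained.get_feature_names_out())          # columns = training terms only
print(trained.transform(score_docs).toarray())  # mostly zeros when the page shares few terms

# Without the link: a brand new word list is built from the score docs alone.
fresh = CountVectorizer().fit(score_docs)
print(fresh.get_feature_names_out())            # different columns -> attributes no longer match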
Hi @Telcontar120,
Ok thank you, I understand better now.
Regards,
Lionel