Text Mining Create Association Rules
TobiasNehrig
Member Posts: 41 Maven
Hi experts,
I have X web pages, and each web page has an ID. Now I'd like to compute association rules for each single web page with my subprocess "Word Association", so that I can get an association rule graph for each page.
At the moment I can only compute association rules over all X web pages at once.
I've tried to loop my subprocess with Loop Collection, with Loop Clusters (on the ID), and with a normal Loop plus a macro (ID). Does anyone have a hint for me?
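To illustrate what I mean, here is a minimal sketch of the per-page logic in R (illustration only, assuming the arules package on top of tidytext; the function name rules_per_page is just made up, and this is not part of the process below):

library(dplyr)
library(tidytext)
library(arules)

# Sketch: mine one rule set per page. The sentences of a page act as
# the transactions, its words as the items.
rules_per_page <- function(korpus) {  # korpus: columns id, text
  lapply(split(korpus, korpus$id), function(page) {
    tokens <- page %>%
      unnest_tokens(sentence, text, token = "sentences") %>%
      mutate(sentence_id = row_number()) %>%
      unnest_tokens(word, sentence)
    # one transaction per sentence; unique() avoids duplicated items
    trans <- as(lapply(split(tokens$word, tokens$sentence_id), unique),
                "transactions")
    apriori(trans, parameter = list(support = 0.2, confidence = 0.01,
                                    maxlen = 2))
  })
}

My current process: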
<?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="subprocess" compatibility="8.1.003" expanded="true" height="82" name="Crawler Spon 10 pages" width="90" x="45" y="544">
<process expanded="true">
<operator activated="true" class="web:crawl_web_modern" compatibility="7.3.000" expanded="true" height="68" name="Crawl Web (2)" width="90" x="112" y="34">
<parameter key="url" value="http://www.spiegel.de"/>
<list key="crawling_rules">
<parameter key="store_with_matching_url" value=".+www.spiegel.+"/>
<parameter key="follow_link_with_matching_url" value=".+spiegel.+|.+de.+"/>
</list>
<parameter key="max_crawl_depth" value="10"/>
<parameter key="retrieve_as_html" value="true"/>
<parameter key="add_content_as_attribute" value="true"/>
<parameter key="max_pages" value="10"/>
<parameter key="delay" value="100"/>
<parameter key="max_concurrent_connections" value="200"/>
<parameter key="max_connections_per_host" value="100"/>
<parameter key="user_agent" value="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0"/>
</operator>
<operator activated="true" class="web:retrieve_webpages" compatibility="7.3.000" expanded="true" height="68" name="Get Pages (2)" width="90" x="246" y="34">
<parameter key="link_attribute" value="Link"/>
<parameter key="page_attribute" value="link"/>
<parameter key="random_user_agent" value="true"/>
</operator>
<connect from_op="Crawl Web (2)" from_port="example set" to_op="Get Pages (2)" to_port="Example Set"/>
<connect from_op="Get Pages (2)" from_port="Example Set" to_port="out 1"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data (2)" width="90" x="246" y="544">
<parameter key="create_word_vector" value="false"/>
<parameter key="keep_text" value="true"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="web:extract_html_text_content" compatibility="7.3.000" expanded="true" height="68" name="Extract Content" width="90" x="179" y="34">
<parameter key="ignore_non_html_tags" value="false"/>
</operator>
<connect from_port="document" to_op="Extract Content" to_port="document"/>
<connect from_op="Extract Content" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="false" class="r_scripting:execute_r" compatibility="8.1.000" expanded="true" height="68" name="R-Script-Pairwise-Count" width="90" x="514" y="646">
<parameter key="script" value="library(dplyr) library(tidytext) library(widyr) rm_main = function(data) { korpus <- data_frame(id =data$id, text = data$text) print(korpus) woerter <- korpus %>% unnest_tokens(word, text)%>% group_by(id)%>% count(word, sort =TRUE)%>% filter(n>=10) print(woerter) woerter <- as.data.table(woerter) cooccurre <- korpus %>% unnest_tokens(word, text)%>% pairwise_count(word, id, sort = TRUE)%>% # filter(n>=10) print(cooccurre) cooccurre <- as.data.frame(cooccurre) return(list(woerter, cooccurre)) } "/>
</operator>
<operator activated="false" class="r_scripting:execute_r" compatibility="8.1.000" expanded="true" height="68" name="R-Script-Bigram" width="90" x="514" y="544">
<parameter key="script" value="library(dplyr) library(tidytext) library(widyr) rm_main = function(data) { korpus <- data_frame(id =data$id, text = data$text) print(korpus) woerter <- korpus %>% unnest_tokens(word, text)%>% group_by(id)%>% count(word, sort =TRUE)%>% filter(n>=10) print(woerter) woerter <- as.data.table(woerter) cooccurre <- korpus %>% unnest_tokens(bigram, text, token= "ngrams", n= 2)%>% count(bigram, sort = TRUE) #pairwise_count(word, id, sort = TRUE)%>% # filter(n>=10) print(cooccurre) cooccurre <- as.data.frame(cooccurre) return(list(woerter, cooccurre)) } "/>
</operator>
<operator activated="false" class="retrieve" compatibility="8.1.003" expanded="true" height="68" name="Retrieve 10-Rohseiten-Spiegel" width="90" x="45" y="34">
<parameter key="repository_entry" value="../data/10-Rohseiten-Spiegel"/>
</operator>
<operator activated="true" class="subprocess" compatibility="8.1.003" expanded="true" height="124" name="Prepare Data" width="90" x="246" y="34">
<process expanded="true">
<operator activated="true" class="set_role" compatibility="8.1.003" expanded="true" height="82" name="Set Role (2)" width="90" x="45" y="34">
<parameter key="attribute_name" value="text"/>
<list key="set_additional_roles">
<parameter key="Title" value="regular"/>
</list>
</operator>
<operator activated="true" class="generate_id" compatibility="8.1.003" expanded="true" height="82" name="Generate ID" width="90" x="45" y="187"/>
<operator activated="true" class="order_attributes" compatibility="8.1.003" expanded="true" height="82" name="Reorder Attributes" width="90" x="45" y="340">
<parameter key="attribute_ordering" value="Title|text"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="8.1.003" expanded="true" height="82" name="Select Attributes" width="90" x="45" y="493">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="Title|text"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="8.1.003" expanded="true" height="103" name="Filter Examples" width="90" x="246" y="34">
<list key="filters_list">
<parameter key="filters_entry_key" value="Title.is_not_missing."/>
</list>
<parameter key="filters_logic_and" value="false"/>
<parameter key="filters_check_metadata" value="false"/>
</operator>
<operator activated="true" class="set_macros" compatibility="8.1.003" expanded="true" height="82" name="Set Macros" width="90" x="246" y="187">
<list key="macros">
<parameter key="attribute_id" value="id"/>
</list>
</operator>
<operator activated="true" class="multiply" compatibility="8.1.003" expanded="true" height="103" name="Multiply uncut" width="90" x="380" y="187"/>
<operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="cut in sentences" width="90" x="581" y="34">
<parameter key="create_word_vector" value="false"/>
<parameter key="keep_text" value="true"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:cut_document" compatibility="8.1.000" expanded="true" height="68" name="Cut Document" width="90" x="112" y="34">
<parameter key="query_type" value="Regular Region"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries">
<parameter key="sentences" value="\\\.\\s[A-Z]| \\!\\s[A-Z]|\\?\\s[A-Z].\\\.|\\!|\\?"/>
</list>
<list key="xpath_queries"/>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
<process expanded="true">
<connect from_port="segment" to_port="document 1"/>
<portSpacing port="source_segment" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_port="document" to_op="Cut Document" to_port="document"/>
<connect from_op="Cut Document" from_port="documents" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
<description align="center" color="transparent" colored="false" width="126">for r-scripts<br>tidy text<br/>bigram<br/>pairwise count</description>
</operator>
<operator activated="true" class="multiply" compatibility="8.1.003" expanded="true" height="103" name="Multiply" width="90" x="782" y="34"/>
<connect from_port="in 1" to_op="Set Role (2)" to_port="example set input"/>
<connect from_op="Set Role (2)" from_port="example set output" to_op="Generate ID" to_port="example set input"/>
<connect from_op="Generate ID" from_port="example set output" to_op="Reorder Attributes" to_port="example set input"/>
<connect from_op="Reorder Attributes" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Set Macros" to_port="through 1"/>
<connect from_op="Set Macros" from_port="through 1" to_op="Multiply uncut" to_port="input"/>
<connect from_op="Multiply uncut" from_port="output 1" to_op="cut in sentences" to_port="example set"/>
<connect from_op="Multiply uncut" from_port="output 2" to_port="out 2"/>
<connect from_op="cut in sentences" from_port="example set" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_port="out 1"/>
<connect from_op="Multiply" from_port="output 2" to_port="out 3"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="source_in 2" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
<portSpacing port="sink_out 3" spacing="0"/>
<portSpacing port="sink_out 4" spacing="0"/>
</process>
</operator>
<operator activated="true" class="subprocess" compatibility="8.1.003" expanded="true" height="124" name="RM Co-occurrence (3)" width="90" x="715" y="85">
<process expanded="true">
<operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data (4)" width="90" x="112" y="136">
<parameter key="prune_method" value="percentual"/>
<parameter key="prune_below_percent" value="0.01"/>
<parameter key="prune_above_percent" value="100.0"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize Non-letters (3)" width="90" x="112" y="34"/>
<operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize Linguistic (3)" width="90" x="246" y="34">
<parameter key="mode" value="linguistic sentences"/>
<parameter key="language" value="German"/>
</operator>
<operator activated="true" class="text:filter_by_length" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (3)" width="90" x="514" y="34">
<parameter key="min_chars" value="2"/>
</operator>
<operator activated="false" class="text:filter_stopwords_german" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (3)" width="90" x="380" y="34"/>
<operator activated="false" class="text:stem_porter" compatibility="8.1.000" expanded="true" height="68" name="Stem (3)" width="90" x="648" y="34"/>
<operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases (3)" width="90" x="782" y="34"/>
<connect from_port="document" to_op="Tokenize Non-letters (3)" to_port="document"/>
<connect from_op="Tokenize Non-letters (3)" from_port="document" to_op="Tokenize Linguistic (3)" to_port="document"/>
<connect from_op="Tokenize Linguistic (3)" from_port="document" to_op="Filter Tokens (3)" to_port="document"/>
<connect from_op="Filter Tokens (3)" from_port="document" to_op="Transform Cases (3)" to_port="document"/>
<connect from_op="Transform Cases (3)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text_to_nominal" compatibility="8.1.003" expanded="true" height="82" name="Text to Nominal (3)" width="90" x="246" y="34"/>
<operator activated="true" class="numerical_to_binominal" compatibility="8.1.003" expanded="true" height="82" name="Numerical to Binominal (3)" width="90" x="380" y="34"/>
<operator activated="true" class="fp_growth" compatibility="8.1.003" expanded="true" height="82" name="FP-Growth (3)" width="90" x="514" y="34">
<parameter key="find_min_number_of_itemsets" value="false"/>
<parameter key="min_support" value="0.2"/>
<parameter key="max_items" value="2"/>
</operator>
<operator activated="true" class="create_association_rules" compatibility="8.1.003" expanded="true" height="82" name="Create Association Rules (3)" width="90" x="715" y="136">
<parameter key="min_confidence" value="0.01"/>
<parameter key="gain_theta" value="1.0"/>
</operator>
<connect from_port="in 1" to_op="Process Documents from Data (4)" to_port="example set"/>
<connect from_op="Process Documents from Data (4)" from_port="example set" to_op="Text to Nominal (3)" to_port="example set input"/>
<connect from_op="Process Documents from Data (4)" from_port="word list" to_port="out 3"/>
<connect from_op="Text to Nominal (3)" from_port="example set output" to_op="Numerical to Binominal (3)" to_port="example set input"/>
<connect from_op="Numerical to Binominal (3)" from_port="example set output" to_op="FP-Growth (3)" to_port="example set"/>
<connect from_op="FP-Growth (3)" from_port="example set" to_port="out 1"/>
<connect from_op="FP-Growth (3)" from_port="frequent sets" to_op="Create Association Rules (3)" to_port="item sets"/>
<connect from_op="Create Association Rules (3)" from_port="rules" to_port="out 2"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="source_in 2" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
<portSpacing port="sink_out 3" spacing="0"/>
<portSpacing port="sink_out 4" spacing="0"/>
</process>
</operator>
<operator activated="false" class="concurrency:loop" compatibility="8.1.003" expanded="true" height="124" name="Loop" width="90" x="715" y="391">
<parameter key="number_of_iterations" value="1"/>
<parameter key="iteration_macro" value="%{attribute_id}"/>
<parameter key="enable_parallel_execution" value="false"/>
<process expanded="true">
<operator activated="true" class="subprocess" compatibility="8.1.003" expanded="true" height="124" name="RM Co-occurrence (2)" width="90" x="179" y="34">
<process expanded="true">
<operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data (3)" width="90" x="112" y="136">
<parameter key="prune_method" value="percentual"/>
<parameter key="prune_below_percent" value="0.01"/>
<parameter key="prune_above_percent" value="100.0"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize Non-letters (2)" width="90" x="112" y="34"/>
<operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize Linguistic (2)" width="90" x="246" y="34">
<parameter key="mode" value="linguistic sentences"/>
<parameter key="language" value="German"/>
</operator>
<operator activated="true" class="text:filter_by_length" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (2)" width="90" x="514" y="34">
<parameter key="min_chars" value="2"/>
</operator>
<operator activated="false" class="text:filter_stopwords_german" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (2)" width="90" x="380" y="34"/>
<operator activated="false" class="text:stem_porter" compatibility="8.1.000" expanded="true" height="68" name="Stem (2)" width="90" x="648" y="34"/>
<operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases (2)" width="90" x="782" y="34"/>
<connect from_port="document" to_op="Tokenize Non-letters (2)" to_port="document"/>
<connect from_op="Tokenize Non-letters (2)" from_port="document" to_op="Tokenize Linguistic (2)" to_port="document"/>
<connect from_op="Tokenize Linguistic (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
<connect from_op="Filter Tokens (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
<connect from_op="Transform Cases (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text_to_nominal" compatibility="8.1.003" expanded="true" height="82" name="Text to Nominal (2)" width="90" x="246" y="34"/>
<operator activated="true" class="numerical_to_binominal" compatibility="8.1.003" expanded="true" height="82" name="Numerical to Binominal (2)" width="90" x="380" y="34"/>
<operator activated="true" class="fp_growth" compatibility="8.1.003" expanded="true" height="82" name="FP-Growth (2)" width="90" x="514" y="34">
<parameter key="find_min_number_of_itemsets" value="false"/>
<parameter key="min_support" value="0.2"/>
<parameter key="max_items" value="2"/>
</operator>
<operator activated="true" class="create_association_rules" compatibility="8.1.003" expanded="true" height="82" name="Create Association Rules (2)" width="90" x="715" y="85">
<parameter key="min_confidence" value="0.01"/>
<parameter key="gain_theta" value="1.0"/>
</operator>
<connect from_port="in 1" to_op="Process Documents from Data (3)" to_port="example set"/>
<connect from_op="Process Documents from Data (3)" from_port="example set" to_op="Text to Nominal (2)" to_port="example set input"/>
<connect from_op="Process Documents from Data (3)" from_port="word list" to_port="out 3"/>
<connect from_op="Text to Nominal (2)" from_port="example set output" to_op="Numerical to Binominal (2)" to_port="example set input"/>
<connect from_op="Numerical to Binominal (2)" from_port="example set output" to_op="FP-Growth (2)" to_port="example set"/>
<connect from_op="FP-Growth (2)" from_port="example set" to_port="out 1"/>
<connect from_op="FP-Growth (2)" from_port="frequent sets" to_op="Create Association Rules (2)" to_port="item sets"/>
<connect from_op="Create Association Rules (2)" from_port="rules" to_port="out 2"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="source_in 2" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
<portSpacing port="sink_out 3" spacing="0"/>
<portSpacing port="sink_out 4" spacing="0"/>
</process>
</operator>
<connect from_port="input 1" to_op="RM Co-occurrence (2)" to_port="in 1"/>
<connect from_op="RM Co-occurrence (2)" from_port="out 1" to_port="output 1"/>
<connect from_op="RM Co-occurrence (2)" from_port="out 2" to_port="output 2"/>
<connect from_op="RM Co-occurrence (2)" from_port="out 3" to_port="output 3"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
<portSpacing port="sink_output 3" spacing="0"/>
<portSpacing port="sink_output 4" spacing="0"/>
</process>
</operator>
<operator activated="false" class="collect" compatibility="8.1.003" expanded="true" height="68" name="Collect" width="90" x="514" y="238"/>
<operator activated="false" class="loop_collection" compatibility="8.1.003" expanded="true" height="124" name="Loop Collection" width="90" x="715" y="238">
<process expanded="true">
<operator activated="true" class="subprocess" compatibility="8.1.003" expanded="true" height="124" name="RM Co-occurrence (4)" width="90" x="179" y="34">
<process expanded="true">
<operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data (6)" width="90" x="112" y="136">
<parameter key="prune_method" value="percentual"/>
<parameter key="prune_below_percent" value="0.01"/>
<parameter key="prune_above_percent" value="100.0"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize Non-letters (4)" width="90" x="112" y="34"/>
<operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize Linguistic (4)" width="90" x="246" y="34">
<parameter key="mode" value="linguistic sentences"/>
<parameter key="language" value="German"/>
</operator>
<operator activated="true" class="text:filter_by_length" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (4)" width="90" x="514" y="34">
<parameter key="min_chars" value="2"/>
</operator>
<operator activated="false" class="text:filter_stopwords_german" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (4)" width="90" x="380" y="34"/>
<operator activated="false" class="text:stem_porter" compatibility="8.1.000" expanded="true" height="68" name="Stem (4)" width="90" x="648" y="34"/>
<operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases (4)" width="90" x="782" y="34"/>
<connect from_port="document" to_op="Tokenize Non-letters (4)" to_port="document"/>
<connect from_op="Tokenize Non-letters (4)" from_port="document" to_op="Tokenize Linguistic (4)" to_port="document"/>
<connect from_op="Tokenize Linguistic (4)" from_port="document" to_op="Filter Tokens (4)" to_port="document"/>
<connect from_op="Filter Tokens (4)" from_port="document" to_op="Transform Cases (4)" to_port="document"/>
<connect from_op="Transform Cases (4)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text_to_nominal" compatibility="8.1.003" expanded="true" height="82" name="Text to Nominal (5)" width="90" x="246" y="34"/>
<operator activated="true" class="numerical_to_binominal" compatibility="8.1.003" expanded="true" height="82" name="Numerical to Binominal (5)" width="90" x="380" y="34"/>
<operator activated="true" class="fp_growth" compatibility="8.1.003" expanded="true" height="82" name="FP-Growth (5)" width="90" x="514" y="34">
<parameter key="find_min_number_of_itemsets" value="false"/>
<parameter key="min_support" value="0.2"/>
<parameter key="max_items" value="2"/>
</operator>
<operator activated="true" class="create_association_rules" compatibility="8.1.003" expanded="true" height="82" name="Create Association Rules (5)" width="90" x="715" y="136">
<parameter key="min_confidence" value="0.01"/>
<parameter key="gain_theta" value="1.0"/>
</operator>
<connect from_port="in 1" to_op="Process Documents from Data (6)" to_port="example set"/>
<connect from_op="Process Documents from Data (6)" from_port="example set" to_op="Text to Nominal (5)" to_port="example set input"/>
<connect from_op="Process Documents from Data (6)" from_port="word list" to_port="out 3"/>
<connect from_op="Text to Nominal (5)" from_port="example set output" to_op="Numerical to Binominal (5)" to_port="example set input"/>
<connect from_op="Numerical to Binominal (5)" from_port="example set output" to_op="FP-Growth (5)" to_port="example set"/>
<connect from_op="FP-Growth (5)" from_port="example set" to_port="out 1"/>
<connect from_op="FP-Growth (5)" from_port="frequent sets" to_op="Create Association Rules (5)" to_port="item sets"/>
<connect from_op="Create Association Rules (5)" from_port="rules" to_port="out 2"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="source_in 2" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
<portSpacing port="sink_out 3" spacing="0"/>
<portSpacing port="sink_out 4" spacing="0"/>
</process>
</operator>
<connect from_port="single" to_op="RM Co-occurrence (4)" to_port="in 1"/>
<connect from_op="RM Co-occurrence (4)" from_port="out 1" to_port="output 1"/>
<connect from_op="RM Co-occurrence (4)" from_port="out 2" to_port="output 2"/>
<connect from_op="RM Co-occurrence (4)" from_port="out 3" to_port="output 3"/>
<portSpacing port="source_single" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
<portSpacing port="sink_output 3" spacing="0"/>
<portSpacing port="sink_output 4" spacing="0"/>
</process>
</operator>
<connect from_op="Crawler Spon 10 pages" from_port="out 1" to_op="Process Documents from Data (2)" to_port="example set"/>
<connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Prepare Data" to_port="in 1"/>
<connect from_op="Prepare Data" from_port="out 1" to_port="result 1"/>
<connect from_op="Prepare Data" from_port="out 2" to_op="RM Co-occurrence (3)" to_port="in 1"/>
<connect from_op="RM Co-occurrence (3)" from_port="out 1" to_port="result 2"/>
<connect from_op="RM Co-occurrence (3)" from_port="out 2" to_port="result 3"/>
<connect from_op="RM Co-occurrence (3)" from_port="out 3" to_port="result 4"/>
<connect from_op="Collect" from_port="collection" to_op="Loop Collection" to_port="collection"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
<description align="center" color="yellow" colored="false" height="286" resized="true" width="434" x="10" y="480">Crawler <br/></description>
<description align="center" color="yellow" colored="false" height="278" resized="true" width="173" x="477" y="488">R-Scripts<br/></description>
</process>
</operator>
</process>
Kind regards
Tobias
Answers
In theory, you should be able to retrieve your web pages and then store them as documents (you might need "Data to Documents", depending on how you retrieve them). After that, you should be able to use "Loop Collection" to process each one separately, but that doesn't seem to work with Process Documents: it's not returning any wordlists or word vectors at all. So I agree with you, something here isn't working properly.
Another alternative should be to store the web pages as example sets and then use "Loop Examples", but that doesn't seem to work either: it returns the same wordlist and word vector across all documents.
So I think this probably needs to be looked at by the RapidMiner developers to understand what is breaking down inside the loops with respect to processing documents. @sgenzer can you bring this to their attention?
See the example process attached (it's much simpler than the OP's, which contains a lot of unnecessary extras not needed for isolating this specific issue).
Brian
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Hi @Telcontar120
thank you. I suspected there was a failure in this routine of mine.
Tobias
Hi...so I'm not sure I completely understand the problem. You can use "Loop Collection" on a collection of documents and do whatever you want inside the Loop Collection operator. For example, I just used a piece of your process and did Transform Cases inside the Loop Collection. It works fine. Am I missing something?
Scott
Hey @sgenzer , thanks for looking at this.
I think the problem with Loop Collection is specifically with "Process Documents" and specifically with the Word Vector creation part of it. Did you try running my entire process that I posted? If you do that, the errors that I describe should be evident (mainly, no word vector output!).
With Loop Examples and a macro, there still seems to be a problem: it returns only a single word vector instead of one per example, which is what it logically should be doing.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Hi @Telcontar120 - so I have no problem seeing the full word vectors if I turn off the pruning in Process Documents:
Let me look into the other one while you take a look at this...
Scott
so if I run just the Loop Examples part, I do see a full example set...?
Scott
Yep, for Loop Collection, you're right; I should have tested it that way too! I figured out that one problem is the default word vector calculation method, TF-IDF. Because we are only processing one document at a time, it generates all zero values, since there is no document collection from which to calculate the IDF! "Term Occurrences" works OK, though. But shouldn't this setup still work if pruning is turned on? (It doesn't!)
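For reference: with the textbook definition idf(t) = log(N / n_t), where N is the number of documents and n_t is the number of documents containing term t, a lone document gives N = 1 and n_t = 1 for every term it contains, so idf(t) = log(1) = 0 and every TF-IDF weight collapses to zero. (I'm assuming RapidMiner uses this standard formula; I haven't checked the exact variant in the source.)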
In terms of the Loop Examples output, I think the problem is different. It is only returning one wordlist with the same values across all documents, but it should be returning one separate wordlist for each document, right?
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
@sgenzer Were you able to duplicate the 2nd error I described in more detail in my response? And have you already filed a bug report on the first item (no word vector generated with pruning on), or do you want me to do that?
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Hi @Telcontar120, sorry for the delay. So I poked around for a while on the Loop Collection issue, and I don't think it's a bug. You see, when you use Loop Collection with Process Documents inside, you're only running Process Documents on one document at a time. I'm not sure this really makes sense. If you want to create word vectors on a collection of documents, I would just feed the collection to Process Documents directly. And then pruning works, and so on:
When I played around with breakpoints etc., I also saw that it was not only pruning that "did not work" (did not produce word vectors); ranking failed too, while absolute worked. This also makes sense: the operator is only looking at one document, so of course anything that creates a subset via statistics across documents is going to fail. But "absolute" works because, well, there is only one document.
Does this make sense?
Scott
@sgenzer I agree that this isn't the ordinary way of doing things, but I still think the Process Documents operator is not behaving according to its intended design. Take a look at the example process now.
After some additional testing, it looks like the problem really lies with Process Documents and has nothing to do with the Loop Collection portion. That is, if you feed Process Documents a single document (no loop involved), it will produce a word vector for that document, but NOT if you select pruning with certain options.
This doesn't make sense, because there isn't anything inherently collective about pruning; it should be possible with any method (absolute, percentual, or ranking) on the word vector itself. And in fact, the process still runs if you select pruning "by ranking" as the method, although it doesn't actually do the pruning! But it fails to produce any word vector output at all if you select the "absolute" or "percentual" pruning methods. So the only configuration that works as expected is no pruning at all.
So basically putting this inside Loop Collection is irrelevant, since the strange behavior occurs if you simply feed one document to Process Documents alone. At the very least, I would expect either an error or warning message, or an unpruned word vector for an individual document, but never no word vectors or wordlists at all!
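To make the document-frequency arithmetic concrete, here is a rough R stand-in for what percentual pruning should do (my own illustration, not RapidMiner's actual implementation): keep the terms whose document frequency lies inside the configured band. With a single document, every term it contains has a document frequency of 100%, so a band like 0.01 to 100 should keep everything rather than return nothing.

# Illustration only: percentual pruning keeps terms whose document
# frequency (share of documents containing the term) lies inside a band.
prune_percentual <- function(docs, below = 0.01, above = 100) {
  df <- table(unlist(lapply(docs, unique))) / length(docs) * 100
  names(df)[df >= below & df <= above]
}

prune_percentual(list(c("foo", "bar", "foo")))  # single document: both terms kept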
And the problem still exists with the unified wordlist being created when using "Process Documents" inside the generic Loop. That definitely should not be happening since each document is supposed to be processed separately.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Hi @sgenzer,
Hi @Telcontar120,
my problem is that I'd like to analyze the downloaded web pages. For each page I have to create co-occurrence lists and find associations. I'm looking for an operator with which I can create the graphs for each page, using an R script for the co-occurrence and the associations. For both, associations and co-occurrence, I'd like to see the results for each page.
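In R terms, this is roughly what I want per page (a sketch only, assuming tidytext and widyr; korpus stands for a data frame with the columns id and text, and cooccurrence_per_page is just an illustrative name):

library(dplyr)
library(tidytext)
library(widyr)

# Sketch: sentence-level co-occurrence counts, one table per page.
cooccurrence_per_page <- function(korpus) {
  tokens <- korpus %>%
    unnest_tokens(sentence, text, token = "sentences") %>%
    group_by(id) %>%
    mutate(sentence_id = row_number()) %>%
    ungroup() %>%
    unnest_tokens(word, sentence)
  lapply(split(tokens, tokens$id),
         function(page) pairwise_count(page, word, sentence_id, sort = TRUE))
}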
When I try Scott's loop approach with this code:
All items of the collection are the same; I didn't find any difference between them.
Regards
Tobias
Agreed, that is the same behavior I referenced earlier in this thread. I believe it is a bug that the developers are going to need to look at.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Hi @sgenzer
Hi @Telcontar120
I think I have finally found a solution with the Group Into Collection operator from the Operator Toolbox. But is there a way to combine the results and compare them?
Kind regards
Tobias
@TobiasNehrig,
I think the Converters extension has an Association Rules to ExampleSet operator? That could help.
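(In plain R, a rough equivalent with the arules package would be as(rules, "data.frame"), which turns a rules object into an ordinary data frame that you can then join and compare.)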
BR,
Martin
Dortmund, Germany
Hi @sgenzer,
hi @Telcontar120,
the previously discussed point works fine insofar as I now get some results. But in the end the results are not that good, so I have to do a better job of preparing the data. My new concept looks like this: I'll crawl the web pages, prepare the data, and then cascade the text mining process inside the Loop Collection operator. First I'll split the text of each web page into sentences, with one ExampleSet per web page. After that I'll tokenize the sentences into words, for each web page and sentence, in a separate ExampleSet. My aim is to have an ExampleSet for each page on which I can calculate the tf-idf per page. So I use the Loop Collection operator again. But I'm missing something: in the results my sentences are not tokenized any further.
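As a sketch of that last step, this is the kind of per-page tf-idf I'm after in R (illustration only, using tidytext's bind_tf_idf and treating the sentences of one page as its documents; tfidf_per_page is just an illustrative name):

library(dplyr)
library(tidytext)

# Sketch: within one page, the sentences play the role of documents,
# so tf-idf can be computed separately for every page.
tfidf_per_page <- function(korpus) {  # korpus: columns id, text
  tokens <- korpus %>%
    unnest_tokens(sentence, text, token = "sentences") %>%
    group_by(id) %>%
    mutate(sentence_id = row_number()) %>%
    ungroup() %>%
    unnest_tokens(word, sentence) %>%
    count(id, sentence_id, word)
  lapply(split(tokens, tokens$id),
         function(page) bind_tf_idf(page, word, sentence_id, n))
}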
best regards
Tobias