Joining two sub-processes together, first classification and then clustering

amir_askary_sha · 2017 17

Hi there,

I have two sub-processes and I want to join them together in order to have a complete automated process.

The first process is a CLASSIFICATION task, which gets some text documents and then by Applying a previousely trained model puts the documents into 5 classes: politic, sport, science, ... . The output is a table with document IDs and the classes as label.

the second process is a CLUSTERING task, which gets some text documents and then using k-means algorithm, puts the documents in different clusters.

I want to join these two processes together, meaning first applying the classification and then applying the clustering 5 times for each group.

I don't know how to achieve this but I feel I should use the loop operator and somehow a split table operator to break the table result of first sub-process and loop over sub-tables.

Any help is appreciated. thanks

sgenzer · 2017 18

hmmm...perhaps this is something you can work from? My "amir1" is your first process and my "amir2" is your second process. I was not sure if your label was actually called "classes" but you can change it in the parameters for "Loop Values".

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="productivity:execute_process" compatibility="7.6.001" expanded="true" height="82" name="Execute amir1" width="90" x="45" y="34">
        <parameter key="process_location" value="//RapidMiner OneDrive/random community stuff/amir1"/>
        <list key="macros"/>
      </operator>
      <operator activated="true" class="concurrency:loop_values" compatibility="7.6.001" expanded="true" height="82" name="Loop Values" width="90" x="179" y="34">
        <parameter key="attribute" value="classes"/>
        <process expanded="true">
          <operator activated="true" class="filter_examples" compatibility="7.6.001" expanded="true" height="103" name="Filter Examples" width="90" x="45" y="34">
            <list key="filters_list">
              <parameter key="filters_entry_key" value="classes.equals.%{loop_value}"/>
            </list>
          </operator>
          <operator activated="true" class="productivity:execute_process" compatibility="7.6.001" expanded="true" height="68" name="Execute amir2" width="90" x="179" y="34">
            <parameter key="process_location" value="//RapidMiner OneDrive/random community stuff/amir2"/>
            <list key="macros"/>
          </operator>
          <connect from_port="input 1" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_op="Execute amir2" to_port="input 1"/>
          <connect from_op="Execute amir2" from_port="result 1" to_port="output 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Execute amir1" from_port="result 1" to_op="Loop Values" to_port="input 1"/>
      <connect from_op="Loop Values" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Scott

sgenzer · 2017 17

hello @amir_askary_sha - I'd be happy to help but it would be very helpful to see what you have so far. Please paste your XML in this thread using the </> tool.

Thanks.

Scott

amir_askary_sha · 2017 18

Hi @sgenzer

here is the xml of first part of the process, the classification:

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="read_csv" compatibility="7.6.001" expanded="true" height="68" name="Read CSV" width="90" x="45" y="238">
        <parameter key="csv_file" value="C:\gensim_tests\clustering\filtered_sz_articles_labeled.csv"/>
        <parameter key="column_separators" value=";"/>
        <parameter key="trim_lines" value="false"/>
        <parameter key="use_quotes" value="true"/>
        <parameter key="quotes_character" value="&quot;"/>
        <parameter key="escape_character" value="\"/>
        <parameter key="skip_comments" value="false"/>
        <parameter key="comment_characters" value="#"/>
        <parameter key="parse_numbers" value="true"/>
        <parameter key="decimal_character" value="."/>
        <parameter key="grouped_digits" value="false"/>
        <parameter key="grouping_character" value=","/>
        <parameter key="date_format" value=""/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="locale" value="English (United States)"/>
        <parameter key="encoding" value="windows-1252"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="title.true.text.attribute"/>
          <parameter key="1" value="description.true.text.attribute"/>
          <parameter key="2" value="category.true.polynominal.label"/>
        </list>
        <parameter key="read_not_matching_values_as_missings" value="true"/>
        <parameter key="datamanagement" value="double_array"/>
        <parameter key="data_management" value="auto"/>
      </operator>
      <operator activated="true" class="text:data_to_documents" compatibility="7.5.000" expanded="true" height="68" name="Data to Documents" width="90" x="246" y="136">
        <parameter key="select_attributes_and_weights" value="true"/>
        <list key="specify_weights">
          <parameter key="title" value="1.5"/>
          <parameter key="description" value="1.0"/>
          <parameter key="category" value="1.0"/>
        </list>
      </operator>
      <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve" width="90" x="447" y="85">
        <parameter key="repository_entry" value="//Local Repository/data/trained_model"/>
      </operator>
      <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve (2)" width="90" x="112" y="340">
        <parameter key="repository_entry" value="../data/word_list_trained_model"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="7.5.000" expanded="true" height="103" name="Process Documents" width="90" x="313" y="289">
        <parameter key="create_word_vector" value="true"/>
        <parameter key="vector_creation" value="Term Frequency"/>
        <parameter key="add_meta_information" value="true"/>
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_percent" value="3.0"/>
        <parameter key="prune_above_percent" value="30.0"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="9999"/>
        <parameter key="prune_below_rank" value="0.05"/>
        <parameter key="prune_above_rank" value="0.95"/>
        <parameter key="datamanagement" value="double_sparse_array"/>
        <parameter key="data_management" value="auto"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="112" y="187">
            <parameter key="mode" value="non letters"/>
            <parameter key="characters" value=".:"/>
            <parameter key="language" value="German"/>
            <parameter key="max_token_length" value="3"/>
          </operator>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (2)" width="90" x="246" y="289">
            <parameter key="condition" value="matches"/>
            <parameter key="regular_expression" value="(welt|übrigens)"/>
            <parameter key="case_sensitive" value="false"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <operator activated="true" class="text:filter_stopwords_german" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (2)" width="90" x="380" y="187">
            <parameter key="stop_word_list" value="Standard"/>
          </operator>
          <operator activated="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (3)" width="90" x="581" y="238">
            <parameter key="min_chars" value="3"/>
            <parameter key="max_chars" value="999"/>
          </operator>
          <operator activated="true" class="text:stem_german" compatibility="7.5.000" expanded="true" height="68" name="Stem (2)" width="90" x="782" y="187"/>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="7.5.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="983" y="136">
            <parameter key="max_length" value="2"/>
          </operator>
          <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
          <connect from_op="Filter Tokens (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
          <connect from_op="Filter Stopwords (2)" from_port="document" to_op="Filter Tokens (3)" to_port="document"/>
          <connect from_op="Filter Tokens (3)" from_port="document" to_op="Stem (2)" to_port="document"/>
          <connect from_op="Stem (2)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
          <connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role" width="90" x="514" y="340">
        <parameter key="attribute_name" value="category"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="7.6.001" expanded="true" height="82" name="Apply Model" width="90" x="715" y="238">
        <list key="application_parameters"/>
        <parameter key="create_view" value="false"/>
      </operator>
      <connect from_op="Read CSV" from_port="output" to_op="Data to Documents" to_port="example set"/>
      <connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Retrieve" from_port="output" to_op="Apply Model" to_port="model"/>
      <connect from_op="Retrieve (2)" from_port="output" to_op="Process Documents" to_port="word list"/>
      <connect from_op="Process Documents" from_port="example set" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
      <connect from_op="Apply Model" from_port="model" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

And here is the second part, the clustering:

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Root">
    <parameter key="logverbosity" value="warning"/>
    <parameter key="random_seed" value="-1"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="open_file" compatibility="7.6.001" expanded="true" height="68" name="Open File" width="90" x="45" y="493">
        <parameter key="resource_type" value="URL"/>
        <parameter key="url" value="Sample_xml_resource"/>
      </operator>
      <operator activated="true" class="advanced_file_connectors:read_xml" compatibility="7.6.001" expanded="true" height="68" name="Read XML" width="90" x="112" y="340">
        <parameter key="file" value="Sample_xml_resource"/>
        <parameter key="xpath_for_examples" value="//response/result/doc"/>
        <enumeration key="xpaths_for_attributes">
          <parameter key="xpath_for_attribute" value="str[1]/attribute::name"/>
          <parameter key="xpath_for_attribute" value="str[1]/text()"/>
          <parameter key="xpath_for_attribute" value="str[2]/attribute::name"/>
          <parameter key="xpath_for_attribute" value="str[2]/text()"/>
        </enumeration>
        <parameter key="use_namespaces" value="true"/>
        <list key="namespaces"/>
        <parameter key="use_default_namespace" value="false"/>
        <parameter key="parse_numbers" value="true"/>
        <parameter key="decimal_character" value="."/>
        <parameter key="grouped_digits" value="false"/>
        <parameter key="grouping_character" value=","/>
        <parameter key="date_format" value=""/>
        <list key="annotations"/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="locale" value="German (Germany)"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="str[1]/attribute::name.false.polynominal.attribute"/>
          <parameter key="1" value="title.true.text.attribute"/>
          <parameter key="2" value="str[2]/attribute::name.false.polynominal.attribute"/>
          <parameter key="3" value="description.true.text.attribute"/>
        </list>
        <parameter key="read_not_matching_values_as_missings" value="true"/>
        <parameter key="datamanagement" value="double_array"/>
        <parameter key="data_management" value="auto"/>
      </operator>
      <operator activated="true" class="split_data" compatibility="7.6.001" expanded="true" height="82" name="Split Data" width="90" x="246" y="442">
        <enumeration key="partitions">
          <parameter key="ratio" value="1.0"/>
        </enumeration>
        <parameter key="sampling_type" value="linear sampling"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
      </operator>
      <operator activated="true" class="generate_id" compatibility="7.6.001" expanded="true" height="82" name="Generate ID" width="90" x="313" y="289">
        <parameter key="create_nominal_ids" value="false"/>
        <parameter key="offset" value="0"/>
      </operator>
      <operator activated="true" class="text:data_to_documents" compatibility="7.5.000" expanded="true" height="68" name="Data to Documents" width="90" x="447" y="187">
        <parameter key="select_attributes_and_weights" value="true"/>
        <list key="specify_weights">
          <parameter key="title" value="4.0"/>
          <parameter key="description" value="1.0"/>
        </list>
      </operator>
      <operator activated="true" class="text:filter_documents_by_content" compatibility="7.5.000" expanded="true" height="82" name="Filter Documents (by Content)" width="90" x="581" y="340">
        <parameter key="condition" value="contains match"/>
        <parameter key="regular_expression" value="(.* nicht gefunden .*|.* unavailable .*|.* 404 .*|.* service not available .*)"/>
        <parameter key="case_sensitive" value="false"/>
        <parameter key="invert condition" value="true"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="7.5.000" expanded="true" height="103" name="Process Documents" width="90" x="715" y="187">
        <parameter key="create_word_vector" value="true"/>
        <parameter key="vector_creation" value="Term Frequency"/>
        <parameter key="add_meta_information" value="true"/>
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_percent" value="0.4"/>
        <parameter key="prune_above_percent" value="40.0"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="2000"/>
        <parameter key="prune_below_rank" value="0.05"/>
        <parameter key="prune_above_rank" value="0.95"/>
        <parameter key="datamanagement" value="double_sparse_array"/>
        <parameter key="data_management" value="auto"/>
        <process expanded="true">
          <operator activated="false" class="text:filter_tokens_by_pos" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (by POS Tags)" width="90" x="447" y="493">
            <parameter key="language" value="German"/>
            <parameter key="expression" value="NE.*|NN.*|VVFIN.*|VVINF.*|VVIZU.*|VVPP.*"/>
            <parameter key="invert_filter" value="false"/>
          </operator>
          <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="45" y="187">
            <parameter key="mode" value="non letters"/>
            <parameter key="characters" value=".:"/>
            <parameter key="language" value="German"/>
            <parameter key="max_token_length" value="3"/>
          </operator>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (2)" width="90" x="179" y="289">
            <parameter key="condition" value="matches"/>
            <parameter key="regular_expression" value="(thing)"/>
            <parameter key="case_sensitive" value="false"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <operator activated="true" class="text:filter_stopwords_german" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (2)" width="90" x="313" y="187">
            <parameter key="stop_word_list" value="Standard"/>
          </operator>
          <operator activated="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (3)" width="90" x="514" y="238">
            <parameter key="min_chars" value="3"/>
            <parameter key="max_chars" value="999"/>
          </operator>
          <operator activated="true" class="text:stem_german" compatibility="7.5.000" expanded="true" height="68" name="Stem (2)" width="90" x="782" y="187"/>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="7.5.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="983" y="136">
            <parameter key="max_length" value="2"/>
          </operator>
          <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
          <connect from_op="Filter Tokens (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
          <connect from_op="Filter Stopwords (2)" from_port="document" to_op="Filter Tokens (3)" to_port="document"/>
          <connect from_op="Filter Tokens (3)" from_port="document" to_op="Stem (2)" to_port="document"/>
          <connect from_op="Stem (2)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
          <connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="k_means" compatibility="7.6.001" expanded="true" height="82" name="Clustering" width="90" x="916" y="187">
        <parameter key="add_cluster_attribute" value="true"/>
        <parameter key="add_as_label" value="true"/>
        <parameter key="remove_unlabeled" value="false"/>
        <parameter key="k" value="20"/>
        <parameter key="max_runs" value="10"/>
        <parameter key="determine_good_start_values" value="false"/>
        <parameter key="measure_types" value="BregmanDivergences"/>
        <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
        <parameter key="nominal_measure" value="NominalDistance"/>
        <parameter key="numerical_measure" value="EuclideanDistance"/>
        <parameter key="divergence" value="SquaredEuclideanDistance"/>
        <parameter key="kernel_type" value="radial"/>
        <parameter key="kernel_gamma" value="1.0"/>
        <parameter key="kernel_sigma1" value="1.0"/>
        <parameter key="kernel_sigma2" value="0.0"/>
        <parameter key="kernel_sigma3" value="2.0"/>
        <parameter key="kernel_degree" value="3.0"/>
        <parameter key="kernel_shift" value="1.0"/>
        <parameter key="kernel_a" value="1.0"/>
        <parameter key="kernel_b" value="0.0"/>
        <parameter key="max_optimization_steps" value="100"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
      </operator>
      <connect from_op="Open File" from_port="file" to_op="Read XML" to_port="file"/>
      <connect from_op="Read XML" from_port="output" to_op="Split Data" to_port="example set"/>
      <connect from_op="Split Data" from_port="partition 1" to_op="Generate ID" to_port="example set input"/>
      <connect from_op="Generate ID" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
      <connect from_op="Data to Documents" from_port="documents" to_op="Filter Documents (by Content)" to_port="documents 1"/>
      <connect from_op="Filter Documents (by Content)" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_op="Clustering" to_port="example set"/>
      <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

I want to concatenate them together.

P.S: There are 5 groups generated by the classification process.

amir_askary_sha · 2017 18

Wow, that is almost it. Thank you very much, you saved me so much time.

There is just something left that I should find out, and that is that the second process doesn't care about its input. It always reads all the documents from a xml url source. I should make it now somehow that it fetches the document IDs from its input (that is the result of the classification part) and then it queries just for those documents, not all documents.

sgenzer · 2017 18

oh that's fine. Just use Extract Macro from your input to grab the document IDs and use that in your URLs.

Scott

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

Joining two sub-processes together, first classification and then clustering

Best Answer

Answers