Iterating through a flat list of .dcm/.tag file pairs and applying some processing on each pair

ralph_brecheise · 2018 01

Hi,

I have a flat list of files like this:

IM001.dcm

IM001.tag

IM002.dcm

IM002.tag

...

I'd like to iterate over this list and apply some processing on each *.dcm/*.tag pair, i.e., inside the "Loop Files" operator I'd like to have access to (IM001.dcm, IM001.tag), (IM002.dcm, IM002.tag), etc.

In Python this is easy, but I'd like to learn how to do this kind of file manipulation in RM.

Is this possible?

Ralph

yyhuang · 2018 01

Hi @ralph_brecheise

That is also easy with the 'Loop Files' operator in RapidMiner. Use a regex filter so that the loop only picks up the .dcm files. Inside the loop, a Generate Macro operator builds the matching '.tag' file name from the %{file_name} macro.

With that new macro you can load the .tag file and extract whatever information you need from it.

Because the loop only visits the .dcm files, it runs n/2 iterations for n files, rather than the n iterations of a Python loop over every file.

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.1.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="concurrency:loop_files" compatibility="8.1.000" expanded="true" height="82" name="Loop Files" width="90" x="916" y="34">
        <parameter key="directory" value="C:\Users\YuanyuanHuang\Documents\RMCommunity\testLoopFiles"/>
        <parameter key="filter_type" value="regex"/>
        <parameter key="filter_by_regex" value=".*\.dcm"/>
        <parameter key="enable_macros" value="true"/>
        <process expanded="true">
          <operator activated="true" class="image:read_image" compatibility="7.0.000" expanded="true" height="68" name="Read Image" width="90" x="246" y="34"/>
          <operator activated="true" class="generate_macro" compatibility="8.1.000" expanded="true" height="82" name="Generate Macro" width="90" x="380" y="34">
            <list key="function_descriptions">
              <parameter key="tag_file_name" value="concat(prefix(%{file_name},index(%{file_name},&quot;.&quot;)),&quot;.tag&quot;)"/>
            </list>
          </operator>
          <connect from_port="file object" to_op="Read Image" to_port="file"/>
          <connect from_op="Read Image" from_port="output" to_op="Generate Macro" to_port="through 1"/>
          <connect from_op="Generate Macro" from_port="through 1" to_port="output 1"/>
          <portSpacing port="source_file object" spacing="0"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Loop Files" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
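
For comparison, the Generate Macro expression in the process above keeps everything before the first dot of the file name and appends ".tag". A rough Python sketch of the same pairing logic might look like the following. The helper names tag_name_for and pair_files are hypothetical, the demo files are empty placeholders, and nothing here is RapidMiner API.

```python
from pathlib import Path
from typing import List, Tuple
import tempfile


def tag_name_for(dcm_name: str) -> str:
    """Mirror concat(prefix(name, index(name, ".")), ".tag")."""
    # Keep everything before the first dot, then append ".tag".
    return dcm_name[: dcm_name.index(".")] + ".tag"


def pair_files(directory: Path) -> List[Tuple[Path, Path]]:
    """Return (dcm, tag) pairs, skipping .dcm files that have no .tag partner."""
    pairs = []
    for dcm in sorted(directory.glob("*.dcm")):
        tag = directory / tag_name_for(dcm.name)
        if tag.exists():
            pairs.append((dcm, tag))
    return pairs


# Small self-contained demo with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ["IM001.dcm", "IM001.tag", "IM002.dcm", "IM002.tag", "IM003.dcm"]:
        (folder / name).touch()
    demo_pairs = [(d.name, t.name) for d, t in pair_files(folder)]

print(demo_pairs)
```

As in the RapidMiner process, only the .dcm files drive the loop, so each pair is visited once. IM003.dcm is skipped in the demo because it has no .tag partner.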

ralph_brecheise · 2018 01

It works!

yyhuang · 2018 02

@ralph_brecheise the error message tells you that extensions are missing. You will need at least the "Operator Toolbox" and "Converters" extensions from the Marketplace.

Any questions, please let us know.

Thomas_Ott · 2018 01

Yes, in the Loop Files operator you can filter by regex: set the filter type to regex and use something like .*\.tag|.*\.dcm as your filter. Escape the dots so that they match a literal "." rather than any character.
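
The difference between escaped and unescaped dots in such a filter can be checked quickly outside RapidMiner. The sketch below uses Python's re module with made-up file names. It assumes the filter has to match the whole file name, and that the Java regex engine behind RapidMiner treats these simple patterns the same way Python does.

```python
import re

# Made-up file names; "IM003_dcm" has no real extension.
names = ["IM001.dcm", "IM001.tag", "IM002.dcm", "notes.txt", "IM003_dcm"]

# Escaped dots match a literal "." before the extension.
strict = re.compile(r".*\.tag|.*\.dcm")
# Unescaped dots match any character, so "IM003_dcm" also gets through.
loose = re.compile(r".*.tag|.*.dcm")

strict_hits = [n for n in names if strict.fullmatch(n)]
loose_hits = [n for n in names if loose.fullmatch(n)]

print(strict_hits)
print(loose_hits)
```

With clean file names both patterns behave the same, so the unescaped form usually works in practice. The escaped form is simply safer.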

ralph_brecheise · 2018 01

Hi yyhuang,

Thanks for the quick reply! Your solution makes a lot of sense.

I'll give it a try. Unfortunately, I couldn't load your process because I don't have the "Read Image" operator, but the overall idea is clear to me.

Ralph

ralph_brecheise · 2018 01

I added a 2nd macro inside the "Generate Macro" operator called %{dcm_file_name} so now I should be able to use both in any downstream operators.

However, I'd like to process the dcm/tag pairs using the "Execute Python" operator. Can I access the macro variables from there? The documentation doesn't seem to mention macros.

Ralph

yyhuang · 2018 01

Hi @ralph_brecheise

Thanks for the follow-up! 'Read Image' is an operator from IMMI, the Image Mining extension. Since you are using .dcm files, I thought they might be images.

http://www.burgsys.com/image-analysis-software.php

That is the link to a solid Image Mining extension for RapidMiner. Burgsys released it free of charge under the AGPL license.

A tutorial document is included in the downloaded folder, and you can manually install the downloaded jar file for the IMMI extension following this link

Cheers,

YY

ralph_brecheise · 2018 01

Hi,

Thanks for the IMMI tip! I'll look into it.

Do you have any suggestions about that pair-wise processing issue I sneaked into my previous message?

Thanks!

Ralph

Thomas_Ott · 2018 01

@ralph_brecheise, the community search is your friend! Lots of pairwise related posts here: https://community.rapidminer.com/t5/forums/searchpage/tab/message?advanced=false&allow_punctuation=false&q=pairwise

ralph_brecheise · 2018 01

Hi Thomas

Thanks for the link but not every post that contains the word "pair-wise" addresses my question. I did search but could not find anything specific.

A more concrete suggestion would be appreciated.

Ralph

yyhuang · 2018 01

Hi @ralph_brecheise

The following example shows how to use macros together with your Python scripts. Credit goes to @JEdward

The process uses evolutionary optimization to search for the best hyperparameter setup for the Python Random Forest.

The whole process is computationally intensive and needs about 5 minutes to finish on my laptop.

Cheers,

YY

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.1.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="8.1.000" expanded="true" height="68" name="Retrieve Iris" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <operator activated="true" class="optimize_parameters_evolutionary" compatibility="6.0.003" expanded="true" height="124" name="Optimize Parameters (Evolutionary)" width="90" x="313" y="34">
        <list key="parameters">
          <parameter key="parameterSet1.number_of_trees" value="[1.0;100.0]"/>
          <parameter key="parameterSet1.maximal_depth" value="[-1.0;100.0]"/>
          <parameter key="parameterSet1.minimal_leaf_size" value="[1.0;100.0]"/>
          <parameter key="parameterSet1.minimal_size_for_split" value="[1.0;100.0]"/>
        </list>
        <parameter key="use_early_stopping" value="true"/>
        <parameter key="population_size" value="3"/>
        <process expanded="true">
          <operator activated="true" class="subprocess" compatibility="8.1.000" expanded="true" height="103" name="Hyperparameters" width="90" x="45" y="34">
            <process expanded="true">
              <operator activated="true" class="generate_data" compatibility="8.1.000" expanded="true" height="68" name="Generate Data" width="90" x="45" y="34">
                <parameter key="target_function" value="random classification"/>
              </operator>
              <operator activated="true" class="concurrency:parallel_random_forest" compatibility="8.1.000" expanded="true" height="103" name="parameterSet1" width="90" x="179" y="136">
                <parameter key="number_of_trees" value="63"/>
                <parameter key="maximal_depth" value="94"/>
                <parameter key="minimal_leaf_size" value="7"/>
                <parameter key="minimal_size_for_split" value="8"/>
              </operator>
              <operator activated="true" class="operator_toolbox:get_parameters" compatibility="0.9.000" expanded="true" height="103" name="Get Parameters" width="90" x="313" y="85">
                <parameter key="Operator name" value="parameterSet1"/>
              </operator>
              <operator activated="false" class="set_macro" compatibility="8.1.000" expanded="true" height="82" name="nTree" width="90" x="447" y="493">
                <parameter key="macro" value="nTree"/>
                <parameter key="value" value="200"/>
              </operator>
              <operator activated="false" class="set_macro" compatibility="8.1.000" expanded="true" height="82" name="minSizeSplit" width="90" x="581" y="493">
                <parameter key="macro" value="minSizeSplit"/>
                <parameter key="value" value="4"/>
              </operator>
              <operator activated="false" class="set_macro" compatibility="8.1.000" expanded="true" height="82" name="minLeafSize" width="90" x="715" y="493">
                <parameter key="macro" value="minLeafSize"/>
                <parameter key="value" value="2"/>
              </operator>
              <operator activated="false" class="set_macro" compatibility="8.1.000" expanded="true" height="103" name="maxDepth" width="90" x="849" y="493">
                <parameter key="macro" value="maxDepth"/>
                <parameter key="value" value="20"/>
              </operator>
              <operator activated="true" class="converters:parameter_set_2_example_set" compatibility="0.3.001" expanded="true" height="103" name="Parameter Set to ExampleSet" width="90" x="447" y="85"/>
              <operator activated="true" class="extract_macro" compatibility="8.1.000" expanded="true" height="68" name="Extract Macro" width="90" x="648" y="85">
                <parameter key="macro" value="nTree"/>
                <parameter key="macro_type" value="data_value"/>
                <parameter key="attribute_name" value="parameterSet1.number_of_trees"/>
                <parameter key="example_index" value="1"/>
                <list key="additional_macros">
                  <parameter key="minSizeSplit" value="parameterSet1.minimal_size_for_split"/>
                  <parameter key="minLeafSize" value="parameterSet1.minimal_leaf_size"/>
                  <parameter key="maxDepth" value="parameterSet1.maximal_depth"/>
                </list>
                <description align="center" color="transparent" colored="false" width="126">Extracts the parameters to macro values</description>
              </operator>
              <connect from_port="in 1" to_port="out 1"/>
              <connect from_op="Generate Data" from_port="output" to_op="parameterSet1" to_port="training set"/>
              <connect from_op="parameterSet1" from_port="model" to_op="Get Parameters" to_port="through 1"/>
              <connect from_op="Get Parameters" from_port="parameters" to_op="Parameter Set to ExampleSet" to_port="parameters"/>
              <connect from_op="nTree" from_port="through 1" to_op="minSizeSplit" to_port="through 1"/>
              <connect from_op="minSizeSplit" from_port="through 1" to_op="minLeafSize" to_port="through 1"/>
              <connect from_op="minLeafSize" from_port="through 1" to_op="maxDepth" to_port="through 1"/>
              <connect from_op="maxDepth" from_port="through 1" to_op="maxDepth" to_port="through 2"/>
              <connect from_op="Parameter Set to ExampleSet" from_port="exampleSet" to_op="Extract Macro" to_port="example set"/>
              <connect from_op="Extract Macro" from_port="example set" to_port="out 2"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="source_in 2" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
              <portSpacing port="sink_out 3" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="8.1.000" expanded="true" height="82" name="Select Attributes" width="90" x="246" y="34">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="id"/>
            <parameter key="invert_selection" value="true"/>
            <parameter key="include_special_attributes" value="true"/>
          </operator>
          <operator activated="true" class="concurrency:cross_validation" compatibility="8.1.000" expanded="true" height="145" name="Cross Validation 2" width="90" x="447" y="34">
            <parameter key="use_local_random_seed" value="true"/>
            <process expanded="true">
              <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Random Forest" width="90" x="112" y="34">
                <parameter key="script" value="&#10;import pandas as pd&#10;from sklearn.ensemble import GradientBoostingClassifier&#10;from sklearn.ensemble import RandomForestClassifier #use RandomForestRegressor for regression problem&#10;&#10;# This script creates a RandomForestClassifier from SKLearn on RM data&#10;# It can be used as a generic template for other sklearn classifiers or regressors&#10;&#10;def rm_main(data):&#10;    metadata =  data.rm_metadata&#10;&#10;    # Get the list of regular attributes and the label&#10;    df = pd.DataFrame(metadata).T&#10;    label = df[df[1]==&quot;label&quot;].index.values&#10;    regular = df[df[1] != df[1]].index.values&#10;&#10;    # === RandomForest === #&#10;    # Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset&#10;    # Create Random Forest object&#10;    model= RandomForestClassifier(n_estimators = %{nTree}&#10;                                , max_depth = %{maxDepth}&#10;                                , min_samples_split = %{minSizeSplit} # The minimum number of samples required to split an internal node&#10;                                , min_samples_leaf = %{minLeafSize} # The minimum number of samples required to be at a leaf node&#10;                                , random_state = 1992&#10;                                 )&#10;    # Train the model using the training sets and check score&#10;    # model.fit(X, y)&#10;    model.fit(data[regular], data[label])&#10;    # Predict Output&#10;    # predicted = model.predict(x_test)&#10;    return (model,regular,label[0]), data"/>
              </operator>
              <connect from_port="training set" to_op="Random Forest" to_port="input 1"/>
              <connect from_op="Random Forest" from_port="output 1" to_port="model"/>
              <portSpacing port="source_training set" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="103" name="Apply Model (2)" width="90" x="112" y="34">
                <parameter key="script" value="import pandas as pd&#10;&#10;# rm_main is a mandatory function,&#10;# the number of arguments has to be the number of input ports (can be none)&#10;&#10;def rm_main(rfinfo, data):&#10;    rf = rfinfo[0]&#10;    regular = rfinfo[1]&#10;    label = rfinfo[2]&#10;    meta = data.rm_metadata&#10;    predictions = rf.predict(data[regular])&#10;    confidences = rf.predict_proba(data[regular])&#10;&#10;&#10;    predictions = pd.DataFrame(predictions, columns=[&quot;prediction(&quot;+label+&quot;)&quot;])&#10;    confidences = pd.DataFrame(confidences,&#10;                               columns=[&quot;confidence(&quot; + str(c) + &quot;)&quot; for c in rf.classes_])&#10;&#10;    data = data.join(predictions)&#10;    data = data.join(confidences)&#10;    data.rm_metadata = meta&#10;    data.rm_metadata[&quot;prediction(&quot;+label+&quot;)&quot;] = (&quot;nominal&quot;,&quot;prediction&quot;)&#10;&#10;    for c in rf.classes_:&#10;        data.rm_metadata[&quot;confidence(&quot;+str(c)+&quot;)&quot;] = (&quot;numerical&quot;,&quot;confidence_&quot;+str(c))&#10;&#10;    return data, rf"/>
              </operator>
              <operator activated="true" class="performance_classification" compatibility="8.1.000" expanded="true" height="82" name="Python" width="90" x="246" y="34">
                <list key="class_weights"/>
              </operator>
              <connect from_port="model" to_op="Apply Model (2)" to_port="input 1"/>
              <connect from_port="test set" to_op="Apply Model (2)" to_port="input 2"/>
              <connect from_op="Apply Model (2)" from_port="output 1" to_op="Python" to_port="labelled data"/>
              <connect from_op="Python" from_port="performance" to_port="performance 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_test set results" spacing="0"/>
              <portSpacing port="sink_performance 1" spacing="0"/>
              <portSpacing port="sink_performance 2" spacing="0"/>
            </process>
            <description align="center" color="transparent" colored="false" width="126">Python</description>
          </operator>
          <operator activated="true" class="subprocess" compatibility="8.1.000" expanded="true" height="82" name="Extract Performance Log" width="90" x="648" y="187">
            <process expanded="true">
              <operator activated="true" class="provide_macro_as_log_value" compatibility="8.1.000" expanded="true" height="82" name="LognTree" width="90" x="45" y="34">
                <parameter key="macro_name" value="nTree"/>
              </operator>
              <operator activated="true" class="provide_macro_as_log_value" compatibility="8.1.000" expanded="true" height="82" name="Log maxDepth" width="90" x="179" y="34">
                <parameter key="macro_name" value="maxDepth"/>
              </operator>
              <operator activated="true" class="provide_macro_as_log_value" compatibility="8.1.000" expanded="true" height="82" name="Log minLeafSize" width="90" x="313" y="34">
                <parameter key="macro_name" value="minLeafSize"/>
              </operator>
              <operator activated="true" class="provide_macro_as_log_value" compatibility="8.1.000" expanded="true" height="82" name="Log minSizeSplit" width="90" x="447" y="34">
                <parameter key="macro_name" value="minSizeSplit"/>
              </operator>
              <operator activated="true" class="log" compatibility="8.1.000" expanded="true" height="82" name="Log" width="90" x="581" y="34">
                <list key="log">
                  <parameter key="Count" value="operator.Apply Model (2).value.applycount"/>
                  <parameter key=" Testing Error" value="operator.Cross Validation 2.value.performance 1"/>
                  <parameter key="Training StdDev" value="operator.Cross Validation 2.value.std deviation 1"/>
                  <parameter key="maxDepth" value="operator.Log maxDepth.value.macro_value"/>
                  <parameter key="minLeafSize" value="operator.Log minLeafSize.value.macro_value"/>
                  <parameter key="minSizeSplit" value="operator.Log minSizeSplit.value.macro_value"/>
                  <parameter key="Number of Trees" value="operator.LognTree.value.macro_value"/>
                </list>
              </operator>
              <connect from_port="in 1" to_op="LognTree" to_port="through 1"/>
              <connect from_op="LognTree" from_port="through 1" to_op="Log maxDepth" to_port="through 1"/>
              <connect from_op="Log maxDepth" from_port="through 1" to_op="Log minLeafSize" to_port="through 1"/>
              <connect from_op="Log minLeafSize" from_port="through 1" to_op="Log minSizeSplit" to_port="through 1"/>
              <connect from_op="Log minSizeSplit" from_port="through 1" to_op="Log" to_port="through 1"/>
              <connect from_op="Log" from_port="through 1" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="source_in 2" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="input 1" to_op="Hyperparameters" to_port="in 1"/>
          <connect from_op="Hyperparameters" from_port="out 1" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Cross Validation 2" to_port="example set"/>
          <connect from_op="Cross Validation 2" from_port="model" to_port="result 1"/>
          <connect from_op="Cross Validation 2" from_port="performance 1" to_op="Extract Performance Log" to_port="in 1"/>
          <connect from_op="Extract Performance Log" from_port="out 1" to_port="performance"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_performance" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Retrieve Iris" from_port="output" to_op="Optimize Parameters (Evolutionary)" to_port="input 1"/>
      <connect from_op="Optimize Parameters (Evolutionary)" from_port="performance" to_port="result 1"/>
      <connect from_op="Optimize Parameters (Evolutionary)" from_port="parameter" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
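
The process above relies on RapidMiner replacing %{...} macros in the script text before Python runs it (for example n_estimators = %{nTree}). The sketch below imitates that textual replacement outside RapidMiner to show what a script for one file pair could look like. substitute_macros is a hypothetical stand-in for what RapidMiner does, not its API, and dcm_file_name and tag_file_name are the macro names mentioned earlier in the thread. Since file names are strings, the macros are wrapped in quotes inside the script.

```python
import re

# Script text as it might be typed into Execute Python (macros not yet expanded).
SCRIPT_TEMPLATE = '''
def rm_main():
    dcm_path = "%{dcm_file_name}"
    tag_path = "%{tag_file_name}"
    # Real processing of the pair would go here; returning the names keeps the demo small.
    return dcm_path, tag_path
'''


def substitute_macros(text, macros):
    """Hypothetical stand-in for macro expansion: plain text replacement of %{name}."""
    return re.sub(r"%\{(\w+)\}", lambda m: macros[m.group(1)], text)


# Values the loop would provide for one iteration (assumed for illustration).
macros = {"dcm_file_name": "IM001.dcm", "tag_file_name": "IM001.tag"}
expanded = substitute_macros(SCRIPT_TEMPLATE, macros)

namespace = {}
exec(expanded, namespace)  # run the expanded script text
result = namespace["rm_main"]()
print(result)
```

If the replacement is purely textual, as assumed here, macro values containing backslashes (such as Windows paths) would need care inside a quoted Python string, for instance by using a raw string.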

ralph_brecheise · 2018 01

Great yyhuang! I'll try that approach and give you a heads up if I get it working.

Ralph

ralph_brecheise · 2018 01

Hi YY

I'm getting the following error when loading the example process XML. Looks like we're on different versions of RM (I'm using 8.1). Any chance you have an example that's more compatible? Or am I missing an extension?

Cheers, Ralph

[Attached screenshot: Screen Shot 2018-03-01 at 21.16.32.png]

sgenzer · 2018 01

There are always about six ways to do anything like this in RapidMiner! Here's another approach (N.B. you will need the Operator Toolbox from the Marketplace):

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.1.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="operator_toolbox:generate_univariate_series" compatibility="0.9.000" expanded="true" height="68" name="Generate Univariate Series" width="90" x="45" y="34"/>
      <operator activated="true" class="generate_attributes" compatibility="8.1.000" expanded="true" height="82" name="Generate Attributes" width="90" x="179" y="34">
        <list key="function_descriptions">
          <parameter key="fileNameDCM" value="if(att1&lt;10,concat(&quot;IM00&quot;,str(att1),&quot;.dcm&quot;),&#10;if(att1&lt;100,concat(&quot;IM0&quot;,str(att1),&quot;.dcm&quot;),&#10;concat(&quot;IM&quot;,str(att1),&quot;.dcm&quot;)))"/>
          <parameter key="fileNameTAG" value="if(att1&lt;10,concat(&quot;IM00&quot;,str(att1),&quot;.tag&quot;),&#10;if(att1&lt;100,concat(&quot;IM0&quot;,str(att1),&quot;.tag&quot;),&#10;concat(&quot;IM&quot;,str(att1),&quot;.tag&quot;)))"/>
        </list>
      </operator>
      <operator activated="true" class="loop_examples" compatibility="8.1.000" expanded="true" height="82" name="Loop Examples" width="90" x="313" y="34">
        <process expanded="true">
          <operator activated="true" class="extract_macro" compatibility="8.1.000" expanded="true" height="68" name="Extract Macro" width="90" x="45" y="34">
            <parameter key="macro" value="fileNameDCM"/>
            <parameter key="macro_type" value="data_value"/>
            <parameter key="attribute_name" value="fileNameDCM"/>
            <parameter key="example_index" value="%{example}"/>
            <list key="additional_macros">
              <parameter key="fileNameTAG" value="fileNameTAG"/>
            </list>
          </operator>
          <operator activated="true" class="open_file" compatibility="8.1.000" expanded="true" height="68" name="Open File" width="90" x="313" y="34">
            <parameter key="filename" value="%{fileNameDCM}"/>
          </operator>
          <operator activated="true" class="open_file" compatibility="8.1.000" expanded="true" height="68" name="Open File (2)" width="90" x="313" y="136">
            <parameter key="filename" value="%{fileNameTAG}"/>
          </operator>
          <connect from_port="example set" to_op="Extract Macro" to_port="example set"/>
          <portSpacing port="source_example set" spacing="0"/>
          <portSpacing port="sink_example set" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Generate Univariate Series" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_op="Loop Examples" to_port="example set"/>
      <connect from_op="Loop Examples" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Scott
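
The nested if(...) expressions in the process above build zero-padded names (IM001, IM012, IM123) from a running number. For readers more at home in Python, a minimal sketch of the same naming scheme follows. pair_names is a hypothetical helper, and three-digit padding is assumed from the file names in the question.

```python
def pair_names(count, start=1):
    """Build (dcm, tag) name pairs with three-digit zero padding, e.g. IM001."""
    return [
        (f"IM{i:03d}.dcm", f"IM{i:03d}.tag")
        for i in range(start, start + count)
    ]


pairs = pair_names(3)
print(pairs)
```

This generates the names without looking at the directory, so like the Loop Examples approach it assumes the files are numbered consecutively with no gaps.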
