How to Index Objects for More Convenience when Handling Collections

Jana_OWC · December 2019

Hello,

I'm back with another knowledge base article. We'd like to present one of the many useful features of Old World Computing's Jackhammer Extension: indexing collections, which makes them more convenient to handle in RapidMiner. This article discusses Indexed Collections; a separate article will cover Indexed Models.

The idea behind Indexed Collections is simple, yet powerful: building on the existing object collections functionality of RapidMiner, the Jackhammer Extension enables you to add group information to the objects, thereby indexing them. This forms a clear structure for your results, making information readily accessible without having to start a cumbersome search through folders until you find what you were looking for.

Image: https://us.v-cdn.net/6030995/uploads/editor/xu/b81yt1nymtew.jpg

Comparing the left and right results windows, you can see that the results are now available in a much more ordered and structured fashion: instead of having to click your way through many folders, all with the same name, you can access the correct folder in just one step. This not only speeds up finding information, it also makes further processing or modeling steps more efficient and more precise.

An Example

We will illustrate this feature with an example: The king of Predictia wants to know in advance how much rain will fall in the upcoming season, for crop planning purposes. Charged with the task of making predictions, you have obtained centuries' worth of precipitation reports from every corner of the kingdom (Predictia started recording the weather much earlier than other countries) and entered them into RapidMiner so that you can later build predictive models on them. Right now, however, you are drowning in a flood of values, none of which records where or when it was measured, and it is dawning on you that this data, plentiful as it may be, will not be very useful unless you add this information.

Without Indexed Collections, you can go two ways: either you add attributes like Location and Month to the data entries, or you use normal (i.e. not indexed) Collections to group the information. Both options, however, have disadvantages. If you add the supplementary information as attributes, you always have to run through the entire ExampleSet to find the information for a certain place or time, and in later analytical steps you have to apply extensive filters to make sense of the data. With collections, you can group your entries, but you have no way of knowing which folder is which: “Folder 1, Folder 2, Folder 3” does not reveal much about the content. You have to click through all of them and keep checking back with your original data if you want to use them in your later analysis.
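To make the two drawbacks concrete, here is a minimal Python sketch of both workarounds. It is only an analogy and does not use RapidMiner or Jackhammer: the records, the attribute name `Rain_mm`, and the helper `filter_rows` are made up for illustration.

```python
# Hypothetical precipitation records; all values are invented for illustration.
records = [
    {"Location": "North", "Month": "Jan", "Rain_mm": 80},
    {"Location": "North", "Month": "Feb", "Rain_mm": 65},
    {"Location": "South", "Month": "Jan", "Rain_mm": 40},
    {"Location": "South", "Month": "Feb", "Rain_mm": 55},
]


def filter_rows(rows, location, month):
    """Option 1: group information stored as attributes.

    Every lookup scans the whole data set.
    """
    return [r for r in rows if r["Location"] == location and r["Month"] == month]


# Option 2: a plain (non-indexed) collection.
# The entries are grouped, but each group is reachable only by its position.
plain_collection = [[records[0]], [records[1]], [records[2]], [records[3]]]

south_jan_by_filter = filter_rows(records, "South", "Jan")
# Works only if you remember that position 2 happens to hold South/Jan.
south_jan_by_position = plain_collection[2]
```

Both lookups return the same entry, but the first one pays for a full scan and the second one depends on knowledge that is not stored in the collection itself.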

Indexed Collections surpass both of these approaches. Your data is neatly organized into folders for easy access, providing an overview and structure. Because of their clear designations, you can efficiently find and use relevant information for your analysis. What’s more, you now have direct access to the data for your later analyses without having to filter out unwanted items or loop over the whole collection: with the operator Select by Key, you can easily and conveniently access exactly what you need.

Operators:

Combine Indexed Objects: copies the IOObjects of each input with their respective group information into a single indexedIOObjectsContainer.

Extend Indexed Objects: extends the provided indexedIOObjectsContainer by the provided IOObject and group information.

Select by Key: retrieves the IOObject that was assigned to the provided group information.
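The behavior of the three operators can be pictured with a small Python sketch that models an indexed container as a dictionary keyed by the group information. This is an assumption made for illustration only: the function names mirror the operator names, but they are not Jackhammer's API, and the extension's internal data structure may look different.

```python
def make_key(group_information):
    """Turn group information such as {"Month": "Jan"} into a hashable key."""
    return frozenset(group_information.items())


def combine_indexed_objects(*inputs):
    """Copy (object, group information) pairs into a single container."""
    container = {}
    for obj, group_information in inputs:
        container[make_key(group_information)] = obj
    return container


def extend_indexed_objects(container, obj, group_information):
    """Return a copy of the container extended by one more object."""
    extended = dict(container)
    extended[make_key(group_information)] = obj
    return extended


def select_by_key(container, group_information):
    """Retrieve the object that was assigned to the given group information."""
    return container[make_key(group_information)]


# Strings stand in for the stored objects (ExampleSets, models, ...).
container = combine_indexed_objects(
    ("rain data, North, Jan", {"Location": "North", "Month": "Jan"}),
    ("rain data, South, Jan", {"Location": "South", "Month": "Jan"}),
)
container = extend_indexed_objects(
    container, "rain data, South, Feb", {"Location": "South", "Month": "Feb"}
)

# The order in which the group information is given does not matter here.
selected = select_by_key(container, {"Month": "Feb", "Location": "South"})
```

In this sketch the lookup goes straight to the requested entry, with no filtering and no looping over the other entries, which is the convenience Select by Key is described as providing.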

sgenzer · December 2019

@Leonie_OWC thank you for this! Is there a sample process we can see? It is always much easier to learn when we can run the process ourselves.

Scott

Alice555 · December 2019

Today, I have to learn something new.

Jana_OWC · December 2019

@sgenzer Thank you for asking! You're right, it is easier when there is an example process. I've got my colleague on it, hopefully I'll be able to show you a process soon.

Jana_OWC · December 2019

We've come up with an example process, which you will be able to follow with the Jackhammer extension installed (you shouldn't need a license). Feel free to ask if there are further questions!


<?xml version="1.0" encoding="UTF-8"?><process version="9.5.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.5.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.5.000" expanded="true" height="68" name="Retrieve Monthly Milk Production" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//Samples/Time Series/data sets/Monthly Milk Production"/>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="9.5.000" expanded="true" height="82" name="Generate Attributes" width="90" x="179" y="34">
        <list key="function_descriptions">
          <parameter key="Month" value="date_str_custom(Date, &quot;MM MMM&quot;)"/>
        </list>
        <parameter key="keep_all" value="true"/>
      </operator>
      <operator activated="true" class="rmx_toolkit:loop_groups_advanced" compatibility="2.2.882" expanded="true" height="103" name="Loop Groups (Advanced)" width="90" x="313" y="34">
        <parameter key="enable_parallel_execution" value="true"/>
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Month"/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="force_data_materialization" value="false"/>
        <parameter key="define_group_macros" value="true"/>
        <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
        <process expanded="true">
          <operator activated="true" class="select_attributes" compatibility="9.5.000" expanded="true" height="82" name="Select Attributes" width="90" x="45" y="34">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Month"/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="true"/>
            <parameter key="include_special_attributes" value="false"/>
          </operator>
          <operator activated="true" class="date_to_numerical" compatibility="9.5.000" expanded="true" height="82" name="Date to Numerical" width="90" x="179" y="187">
            <parameter key="attribute_name" value="Date"/>
            <parameter key="time_unit" value="year"/>
            <parameter key="millisecond_relative_to" value="second"/>
            <parameter key="second_relative_to" value="minute"/>
            <parameter key="minute_relative_to" value="hour"/>
            <parameter key="hour_relative_to" value="day"/>
            <parameter key="day_relative_to" value="month"/>
            <parameter key="week_relative_to" value="year"/>
            <parameter key="month_relative_to" value="year"/>
            <parameter key="quarter_relative_to" value="year"/>
            <parameter key="half_year_relative_to" value="year"/>
            <parameter key="year_relative_to" value="era"/>
            <parameter key="keep_old_attribute" value="false"/>
          </operator>
          <operator activated="true" class="set_role" compatibility="9.5.000" expanded="true" height="82" name="Set Role" width="90" x="313" y="187">
            <parameter key="attribute_name" value="Monthly milk production / pounds per cow"/>
            <parameter key="target_role" value="label"/>
            <list key="set_additional_roles">
              <parameter key="Date" value="regular"/>
            </list>
          </operator>
          <operator activated="true" class="concurrency:cross_validation" compatibility="9.5.000" expanded="true" height="145" name="Validation" width="90" x="447" y="187">
            <parameter key="split_on_batch_attribute" value="false"/>
            <parameter key="leave_one_out" value="false"/>
            <parameter key="number_of_folds" value="10"/>
            <parameter key="sampling_type" value="shuffled sampling"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="enable_parallel_execution" value="true"/>
            <process expanded="true">
              <operator activated="true" class="h2o:generalized_linear_model" compatibility="9.3.001" expanded="true" height="124" name="Generalized Linear Model" width="90" x="45" y="34">
                <parameter key="family" value="AUTO"/>
                <parameter key="link" value="family_default"/>
                <parameter key="solver" value="AUTO"/>
                <parameter key="reproducible" value="false"/>
                <parameter key="maximum_number_of_threads" value="4"/>
                <parameter key="use_regularization" value="true"/>
                <parameter key="lambda_search" value="false"/>
                <parameter key="number_of_lambdas" value="0"/>
                <parameter key="lambda_min_ratio" value="0.0"/>
                <parameter key="early_stopping" value="true"/>
                <parameter key="stopping_rounds" value="3"/>
                <parameter key="stopping_tolerance" value="0.001"/>
                <parameter key="standardize" value="true"/>
                <parameter key="non-negative_coefficients" value="false"/>
                <parameter key="add_intercept" value="true"/>
                <parameter key="compute_p-values" value="false"/>
                <parameter key="remove_collinear_columns" value="false"/>
                <parameter key="missing_values_handling" value="MeanImputation"/>
                <parameter key="max_iterations" value="0"/>
                <parameter key="specify_beta_constraints" value="false"/>
                <list key="beta_constraints"/>
                <parameter key="max_runtime_seconds" value="0"/>
                <list key="expert_parameters"/>
              </operator>
              <connect from_port="training set" to_op="Generalized Linear Model" to_port="training set"/>
              <connect from_op="Generalized Linear Model" from_port="model" to_port="model"/>
              <portSpacing port="source_training set" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
              <description align="left" color="green" colored="true"
height="113" resized="true" width="284" x="33" y="190">Builds a model on
the current training data set (90 % of the data by default, 10
times).&lt;br&gt;&lt;br&gt;Make sure that you only put numerical
attributes into a linear regression!</description>
            </process>
            <process expanded="true">
              <operator activated="true" class="apply_model" compatibility="9.5.000" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
                <list key="application_parameters"/>
                <parameter key="create_view" value="false"/>
              </operator>
              <operator activated="true" class="performance_regression" compatibility="9.5.000" expanded="true" height="82" name="Performance" width="90" x="179" y="34">
                <parameter key="main_criterion" value="first"/>
                <parameter key="root_mean_squared_error" value="true"/>
                <parameter key="absolute_error" value="false"/>
                <parameter key="relative_error" value="true"/>
                <parameter key="relative_error_lenient" value="false"/>
                <parameter key="relative_error_strict" value="false"/>
                <parameter key="normalized_absolute_error" value="false"/>
                <parameter key="root_relative_squared_error" value="false"/>
                <parameter key="squared_error" value="false"/>
                <parameter key="correlation" value="true"/>
                <parameter key="squared_correlation" value="false"/>
                <parameter key="prediction_average" value="false"/>
                <parameter key="spearman_rho" value="false"/>
                <parameter key="kendall_tau" value="false"/>
                <parameter key="skip_undefined_labels" value="true"/>
                <parameter key="use_example_weights" value="true"/>
              </operator>
              <connect from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
              <connect from_op="Performance" from_port="example set" to_port="test set results"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_test set results" spacing="0"/>
              <portSpacing port="sink_performance 1" spacing="0"/>
              <portSpacing port="sink_performance 2" spacing="0"/>
              <description align="left" color="blue" colored="true"
height="107" resized="true" width="333" x="28" y="190">Applies the model
built from the training data set on the current test set (10 % by
default).&lt;br/&gt;The Performance operator calculates performance
indicators and sends them to the operator result.</description>
            </process>
            <description align="center" color="transparent"
colored="false" width="126">A cross validation including a linear
regression.</description>
          </operator>
          <operator activated="true" class="rmx_toolkit:extend_indexed_object" compatibility="2.2.882" expanded="true" height="82" name="Extend Indexed Object (2)" width="90" x="648" y="187">
            <list key="group_information">
              <parameter key="Data.nominal" value="Model"/>
            </list>
          </operator>
          <operator activated="true" class="rmx_toolkit:extend_indexed_object" compatibility="2.2.882" expanded="true" height="82" name="Extend Indexed Object (3)" width="90" x="782" y="238">
            <list key="group_information">
              <parameter key="Data.nominal" value="Performance"/>
            </list>
          </operator>
          <operator activated="true" class="rmx_toolkit:extend_indexed_object" compatibility="2.2.882" expanded="true" height="82" name="Extend Indexed Object" width="90" x="1251" y="85">
            <list key="group_information">
              <parameter key="Month.nominal" value="%{month}"/>
            </list>
          </operator>
          <connect from_port="batch of example set" to_op="Select Attributes" to_port="example set input"/>
          <connect from_port="loop 1" to_op="Extend Indexed Object" to_port="indexed_ioobject"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Date to Numerical" to_port="example set input"/>
          <connect from_op="Date to Numerical" from_port="example set output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Validation" to_port="example set"/>
          <connect from_op="Validation" from_port="model" to_op="Extend Indexed Object (2)" to_port="ioobject"/>
          <connect from_op="Validation" from_port="performance 1" to_op="Extend Indexed Object (3)" to_port="ioobject"/>
          <connect from_op="Extend Indexed Object (2)" from_port="indexedIoobject" to_op="Extend Indexed Object (3)" to_port="indexed_ioobject"/>
          <connect from_op="Extend Indexed Object (3)" from_port="indexedIoobject" to_op="Extend Indexed Object" to_port="ioobject"/>
          <connect from_op="Extend Indexed Object" from_port="indexedIoobject" to_port="loop 1"/>
          <portSpacing port="source_batch of example set" spacing="0"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_loop 1" spacing="42"/>
          <portSpacing port="source_loop 2" spacing="42"/>
          <portSpacing port="sink_output collector 1" spacing="0"/>
          <portSpacing port="sink_loop 1" spacing="0"/>
          <portSpacing port="sink_loop 2" spacing="0"/>
          <description align="center" color="yellow" colored="false"
height="264" resized="true" width="273" x="621" y="172">&lt;br&gt;
&lt;br&gt; &lt;br&gt; &lt;br&gt; &lt;br&gt; &lt;br&gt; &lt;br&gt;
&lt;br&gt; &lt;br&gt; &lt;br&gt; Combine Model
Information&lt;br&gt;&lt;br&gt;- the model itself&lt;br&gt;-
performance</description>
          <description align="center" color="yellow" colored="false"
height="271" resized="true" width="177" x="1205"
y="70">&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;Add
to outer collection with &amp;quot;Month&amp;quot; as key</description>
        </process>
      </operator>
      <connect from_op="Retrieve Monthly Milk Production" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_op="Loop Groups (Advanced)" to_port="example set"/>
      <connect from_op="Loop Groups (Advanced)" from_port="loop 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
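
For readers who want a mental model before running the process: it loops over the Month groups of the sample data, trains and validates a model per group, stores the model and its performance in an inner indexed container under the keys Data = Model and Data = Performance, and then adds that container to the outer one under the Month key. The Python sketch below imitates only this nesting. It is an analogy under loud assumptions: the rows are invented, a group mean stands in for the cross-validated Generalized Linear Model, and an RMSE computed on the same rows stands in for the Performance result.

```python
from collections import defaultdict
from statistics import mean

# Invented stand-in rows: (month, year, pounds per cow). Not the real sample data.
rows = [
    ("01 Jan", 1962, 600), ("01 Jan", 1963, 610), ("01 Jan", 1964, 620),
    ("02 Feb", 1962, 560), ("02 Feb", 1963, 570), ("02 Feb", 1964, 580),
]

# Loop Groups: split the rows by the Month value.
groups = defaultdict(list)
for month, _year, pounds in rows:
    groups[month].append(pounds)

results = {}  # outer indexed container, keyed by ("Month", <value>)
for month, values in groups.items():
    model = mean(values)  # stand-in for the trained model
    rmse = (sum((v - model) ** 2 for v in values) / len(values)) ** 0.5

    # Inner indexed container: model and performance side by side.
    inner = {("Data", "Model"): model, ("Data", "Performance"): rmse}

    # Add the inner container to the outer one with Month as key.
    results[("Month", month)] = inner

# One-step access to the January model, analogous to two Select by Key lookups.
jan_model = results[("Month", "01 Jan")][("Data", "Model")]
```

The point of the nesting is that both the month and the kind of result are addressed by name, so nothing has to be found by position.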
