The Altair Community is migrating to a new platform to provide a better experience for you. In preparation for the migration, the Altair Community is on read-only mode from October 28 - November 6, 2024. Technical support via cases will continue to work as is. For any urgent requests from Students/Faculty members, please submit the form linked here

Replace missing values for weight with average/mean of other attribute (item identifyer)

FrancisCFrancisC Member Posts: 2 Learner I
Hi,

I have a data set containing supermarket data and two of my attributes are item weight and item identifier.
A lot of examples are missing weight info, but because of the item identifier I know what they have to be (see image: DRA24 has to be 19.350 and DRA59 has to be 8.270)

How can I replace the missing values for weight based on the average or mean of the item identifier attribute?
Or is there another way how I can fix the missing values for weight?

Answers

  • MartinLiebigMartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    I would use Group Into Collection like this:

    <?xml version="1.0" encoding="UTF-8"?><process version="9.8.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.8.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="utility:create_exampleset" compatibility="9.8.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="179" y="136">
            <parameter key="generator_type" value="attribute functions"/>
            <parameter key="number_of_examples" value="100"/>
            <parameter key="use_stepsize" value="false"/>
            <list key="function_descriptions">
              <parameter key="Item_Id" value="round(10*rand())"/>
              <parameter key="Item_Weight" value="if(rand()&lt;0.1,rand()*100,MISSING_NUMERIC)"/>
            </list>
            <parameter key="add_id_attribute" value="false"/>
            <list key="numeric_series_configuration"/>
            <list key="date_series_configuration"/>
            <list key="date_series_configuration (interval)"/>
            <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="column_separator" value=","/>
            <parameter key="parse_all_as_nominal" value="false"/>
            <parameter key="decimal_point_character" value="."/>
            <parameter key="trim_attribute_names" value="true"/>
          </operator>
          <operator activated="true" class="operator_toolbox:group_into_collection" compatibility="2.8.000-SNAPSHOT" expanded="true" height="82" name="Group Into Collection" width="90" x="313" y="136">
            <parameter key="group_by_attribute" value="Item_Id"/>
            <parameter key="group_by_attribute (numerical)" value=""/>
            <parameter key="sorting_order" value="none"/>
            <description align="center" color="transparent" colored="false" width="126">Get one example set per item_id</description>
          </operator>
          <operator activated="true" class="loop_collection" compatibility="9.8.000" expanded="true" height="82" name="Loop Collection" width="90" x="447" y="136">
            <parameter key="set_iteration_macro" value="false"/>
            <parameter key="macro_name" value="iteration"/>
            <parameter key="macro_start_value" value="1"/>
            <parameter key="unfold" value="false"/>
            <process expanded="true">
              <operator activated="true" class="replace_missing_values" compatibility="9.8.000" expanded="true" height="103" name="Replace Missing Values" width="90" x="313" y="136">
                <parameter key="return_preprocessing_model" value="false"/>
                <parameter key="create_view" value="false"/>
                <parameter key="attribute_filter_type" value="all"/>
                <parameter key="attribute" value=""/>
                <parameter key="attributes" value=""/>
                <parameter key="use_except_expression" value="false"/>
                <parameter key="value_type" value="attribute_value"/>
                <parameter key="use_value_type_exception" value="false"/>
                <parameter key="except_value_type" value="time"/>
                <parameter key="block_type" value="attribute_block"/>
                <parameter key="use_block_type_exception" value="false"/>
                <parameter key="except_block_type" value="value_matrix_row_start"/>
                <parameter key="invert_selection" value="false"/>
                <parameter key="include_special_attributes" value="false"/>
                <parameter key="default" value="average"/>
                <list key="columns"/>
              </operator>
              <connect from_port="single" to_op="Replace Missing Values" to_port="example set input"/>
              <connect from_op="Replace Missing Values" from_port="example set output" to_port="output 1"/>
              <portSpacing port="source_single" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Create ExampleSet" from_port="output" to_op="Group Into Collection" to_port="exa"/>
          <connect from_op="Group Into Collection" from_port="col" to_op="Loop Collection" to_port="collection"/>
          <connect from_op="Loop Collection" from_port="output 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>


    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • FrancisCFrancisC Member Posts: 2 Learner I
    Thank you so much! Unfortunately, I don't know how to write code.. Is there an operator that could do the same?
  • MartinLiebigMartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    this is a process. Please check https://community.rapidminer.com/discussion/32606/import-xml-code-to-process on how to get the XML into your RapidMiner.

    Best,
    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
Sign In or Register to comment.