
which operator?

MinerGeorge Member Posts: 1 Learner III
edited November 2018 in Help
Hi,

I have a dataset containing more than 1 million rows, which I wish to group based on the relationship between several columns.

customer name (nominal, label)   date of contact (date)   sales rep no. (nominal)

smith                            1/1/11                   001
smith                            2/1/11                   002
jones                            3/2/11                   001
brown                            2/2/11                   003
brown                            3/2/11                   001
brown                            3/2/11                   004
black                            6/2/11                   001
jones                            4/2/11                   005
black                            5/2/11                   002

Now for the tough bit:
We need to classify the customers based on the unique group of sales reps they have dealt with, i.e.
smith and black are in group A as they have both been contacted by 001 and 002, jones is B, brown is C, and so on.

Is this possible in RM? Which operator(s) do you suggest?

Thanks in advance.

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    The best solution would probably be to pivot the data and then apply a clustering algorithm. You probably don't want a group for each unique set of sales reps, but rather for similar sets of sales reps, so clustering should work well enough.

    If you have one million rows you may want to train the clustering model only on a subset for performance reasons and then apply it to the rest of the data.

    If the dates are not important, you could replace them with 1 if present in the pivoted data, and with 0 otherwise.

    Please have a look at the attached process.

    Best, Marius
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.003">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.003" expanded="true" name="Process">
        <process expanded="true" height="296" width="748">
          <operator activated="true" class="read_csv" compatibility="5.2.003" expanded="true" height="60" name="Read CSV" width="90" x="112" y="30">
            <parameter key="csv_file" value="C:\Users\mhelf\Desktop\test.txt"/>
            <parameter key="column_separators" value=",\s*|;\s*|\s+|\t+"/>
            <parameter key="first_row_as_names" value="false"/>
            <list key="annotations"/>
            <parameter key="encoding" value="windows-1252"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="name.true.polynominal.attribute"/>
              <parameter key="1" value="date.true.polynominal.attribute"/>
              <parameter key="2" value="contact.true.integer.attribute"/>
            </list>
          </operator>
          <operator activated="true" class="pivot" compatibility="5.2.003" expanded="true" height="76" name="Pivot (2)" width="90" x="246" y="30">
            <parameter key="group_attribute" value="name"/>
            <parameter key="index_attribute" value="contact"/>
          </operator>
          <operator activated="true" class="replace_missing_values" compatibility="5.2.003" expanded="true" height="94" name="Replace Missing Values" width="90" x="380" y="30">
            <parameter key="default" value="value"/>
            <list key="columns"/>
            <parameter key="replenishment_value" value="X"/>
          </operator>
          <operator activated="true" class="k_means" compatibility="5.2.003" expanded="true" height="76" name="Clustering" width="90" x="514" y="30">
            <parameter key="measure_types" value="MixedMeasures"/>
          </operator>
          <connect from_op="Read CSV" from_port="output" to_op="Pivot (2)" to_port="example set input"/>
          <connect from_op="Pivot (2)" from_port="example set output" to_op="Replace Missing Values" to_port="example set input"/>
          <connect from_op="Replace Missing Values" from_port="example set output" to_op="Clustering" to_port="example set"/>
          <connect from_op="Clustering" from_port="clustered set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
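
If a group for each unique set of sales reps is wanted after all, as described in the question, the logic can be sketched outside RapidMiner. The following is only a minimal Python illustration on the sample rows from the question, not part of the process above. The function names (`reps_per_customer`, `pivot_presence`, `exact_groups`) are made up for this example, and group ids are plain integers instead of the letters A, B, C. It also shows the 1/0 replacement of the pivoted values mentioned above for the case where the dates are not important.

```python
from collections import defaultdict

# Sample rows from the question: (customer name, date of contact, sales rep no.)
rows = [
    ("smith", "1/1/11", "001"),
    ("smith", "2/1/11", "002"),
    ("jones", "3/2/11", "001"),
    ("brown", "2/2/11", "003"),
    ("brown", "3/2/11", "001"),
    ("brown", "3/2/11", "004"),
    ("black", "6/2/11", "001"),
    ("jones", "4/2/11", "005"),
    ("black", "5/2/11", "002"),
]


def reps_per_customer(rows):
    """Collect the set of sales reps each customer has dealt with (dates ignored)."""
    reps = defaultdict(set)
    for customer, _date, rep in rows:
        reps[customer].add(rep)
    return dict(reps)


def pivot_presence(reps_by_customer):
    """Pivot to one row per customer and one 0/1 column per sales rep."""
    all_reps = sorted({rep for reps in reps_by_customer.values() for rep in reps})
    table = {
        customer: [1 if rep in reps else 0 for rep in all_reps]
        for customer, reps in reps_by_customer.items()
    }
    return all_reps, table


def exact_groups(reps_by_customer):
    """Give customers with exactly the same set of reps the same group id."""
    group_ids = {}   # frozenset of reps -> group id
    assignment = {}  # customer -> group id
    for customer, reps in reps_by_customer.items():
        key = frozenset(reps)
        if key not in group_ids:
            group_ids[key] = len(group_ids)
        assignment[customer] = group_ids[key]
    return assignment


if __name__ == "__main__":
    reps = reps_per_customer(rows)
    columns, table = pivot_presence(reps)
    groups = exact_groups(reps)
    print(columns)
    for customer, flags in table.items():
        print(customer, flags, "group", groups[customer])
```

The 0/1 table is the kind of input a clustering step would work on, whereas the exact grouping only puts two customers together when their sets of reps are identical.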