How to divide ZIP Code into a cluster analysis?

a_trunk · June 2018

Hello,

Sorry for the simple question, but I have not worked with RapidMiner for long and I need it for my studies. I have a simple case that I cannot solve correctly: I have a dataset of 100,000 ZIP codes and customer numbers, and I want to analyse the best-selling areas in my country. I decided to use cluster analysis. ZIP codes in Germany run from 00001 to 99999, and I want to build clusters by range, for example 00001 to 00500 and 70000 to 75000.

My question: how can I tell RapidMiner to build the clusters from these ranges?

Many thanks for your help.

lionelderkrikor · June 2018

Hi @a_trunk

You can try using the Split Data operator to partition your data, as in this process:

<?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.1.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="34">
        <parameter key="generator_type" value="comma_separated_text"/>
        <parameter key="use_stepsize" value="true"/>
        <list key="function_descriptions"/>
        <list key="numeric_series_configuration">
          <parameter key="zip_code" value="linear.0\.0.1\.0"/>
        </list>
        <list key="date_series_configuration"/>
        <list key="date_series_configuration (interval)"/>
        <parameter key="input_csv_text" value="Id,att1&#10;1,&quot;0001&quot;&#10;2,&quot;0002&quot;&#10;3,&quot;0003&quot;&#10;4,&quot;0004&quot;&#10;5,&quot;0005&quot;&#10;6,&quot;0006&quot;&#10;7,&quot;0007&quot;&#10;8,&quot;0008&quot;&#10;9,&quot;0009&quot;&#10;10,&quot;0010&quot;&#10;11,&quot;0011&quot;&#10;12,&quot;0012&quot;&#10;13,&quot;0013&quot;&#10;14,&quot;0014&quot;&#10;15,&quot;0015&quot;&#10;16,&quot;0016&quot;&#10;17,&quot;0017&quot;&#10;18,&quot;0018&quot;&#10;19,&quot;0019&quot;&#10;20,&quot;0020&quot;&#10;"/>
      </operator>
      <operator activated="true" class="split_data" compatibility="8.2.000" expanded="true" height="124" name="Split Data" width="90" x="514" y="34">
        <enumeration key="partitions">
          <parameter key="ratio" value="0.1"/>
          <parameter key="ratio" value="0.1"/>
          <parameter key="ratio" value="0.8"/>
        </enumeration>
        <parameter key="sampling_type" value="linear sampling"/>
      </operator>
      <connect from_op="Create ExampleSet" from_port="output" to_op="Split Data" to_port="example set"/>
      <connect from_op="Split Data" from_port="partition 1" to_port="result 1"/>
      <connect from_op="Split Data" from_port="partition 2" to_port="result 2"/>
      <connect from_op="Split Data" from_port="partition 3" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>

I hope it helps,

Regards,

Lionel
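
For readers who want to see what the process above does to the 20 sample rows, here is a minimal Python sketch. It assumes that "linear sampling" cuts the example set into consecutive slices in the original row order, with sizes given by the ratios 0.1, 0.1 and 0.8. The function name `linear_split` is hypothetical and is not part of RapidMiner.

```python
def linear_split(rows, ratios):
    """Cut rows into consecutive partitions sized by ratios, keeping the order.

    Assumption: this mimics Split Data with sampling_type = linear sampling.
    """
    parts = []
    start = 0
    cumulative = 0.0
    for ratio in ratios:
        cumulative += ratio
        # Clamp so rounding can never run past the end of the data.
        end = min(len(rows), round(cumulative * len(rows)))
        parts.append(rows[start:end])
        start = end
    return parts


# Same 20 sample values as in the Create ExampleSet operator above.
zip_codes = [f"{i:04d}" for i in range(1, 21)]
first, second, rest = linear_split(zip_codes, [0.1, 0.1, 0.8])
```

Note that the partitions are defined by row position and ratio, not by ZIP code value. Under that assumption the data would have to be sorted by ZIP code, and the ratios chosen to match, before a partition lines up with a range such as 00001 to 00500.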

Telcontar120 · June 2018

You might also want to create a new attribute (using Generate Attributes) that corresponds to higher-level groupings of postal codes. Using the prefix function, you can create aggregated groups at the 1-digit level, the 2-digit level, etc. These can then be made available to the clustering algorithm instead of the raw ZIP code. The problem with the raw ZIP code is that RapidMiner has no idea it encodes a hierarchical relationship; it just interprets it as a set of distinct nominal values.
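
The grouping idea can be sketched outside RapidMiner in a few lines of Python. This is only an illustration of the logic, not the Generate Attributes expression itself. The helper names `zip_prefix` and `zip_range_label` and the `RANGES` list are hypothetical, and the two ranges are the ones from the original question.

```python
def zip_prefix(zip_code, digits):
    """Return the leading digits of a ZIP code as a nominal group label."""
    # Pad to 5 characters so codes stored as numbers keep their leading zeros.
    return str(zip_code).zfill(5)[:digits]


def zip_range_label(zip_code, ranges):
    """Map a ZIP code to the first inclusive (low, high) range that contains it."""
    value = int(zip_code)
    for low, high in ranges:
        if low <= value <= high:
            return f"{low:05d}-{high:05d}"
    return "other"


# Example ranges taken from the question; adjust to the areas of interest.
RANGES = [(1, 500), (70000, 75000)]

two_digit_group = zip_prefix("70173", 2)
range_group = zip_range_label("70173", RANGES)
```

Either label can then replace the raw ZIP code as a grouping attribute. Keeping the ZIP code as text, or padding it as shown, matters because a numeric column would drop the leading zeros of codes such as 00250.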
