
How to divide ZIP Code into a cluster analysis?

a_trunk Member Posts: 4 Learner III
edited December 2018 in Help

Hello,

Sorry for the simple question, but I have not worked with RapidMiner for long and I need it for my studies. I have a simple case that I cannot solve correctly: I have a dataset of 100,000 ZIP codes and customer numbers, and I want to analyse the best-selling areas in my country, so I decided to use cluster analysis. ZIP codes in Germany run from 00001 to 99999, and I want to build clusters by range, for example 00001 to 00500 and 70000 to 75000.

My question: how can I tell RapidMiner to build the clusters from these ranges?

 

Many thanks for your help.

 

Answers

  • lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @a_trunk

     

    You can try the Split Data operator to create partitions of your data, as in this process:

    <?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.1.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="34">
    <parameter key="generator_type" value="comma_separated_text"/>
    <parameter key="use_stepsize" value="true"/>
    <list key="function_descriptions"/>
    <list key="numeric_series_configuration">
    <parameter key="zip_code" value="linear.0\.0.1\.0"/>
    </list>
    <list key="date_series_configuration"/>
    <list key="date_series_configuration (interval)"/>
    <parameter key="input_csv_text" value="Id,att1&#10;1,&quot;0001&quot;&#10;2,&quot;0002&quot;&#10;3,&quot;0003&quot;&#10;4,&quot;0004&quot;&#10;5,&quot;0005&quot;&#10;6,&quot;0006&quot;&#10;7,&quot;0007&quot;&#10;8,&quot;0008&quot;&#10;9,&quot;0009&quot;&#10;10,&quot;0010&quot;&#10;11,&quot;0011&quot;&#10;12,&quot;0012&quot;&#10;13,&quot;0013&quot;&#10;14,&quot;0014&quot;&#10;15,&quot;0015&quot;&#10;16,&quot;0016&quot;&#10;17,&quot;0017&quot;&#10;18,&quot;0018&quot;&#10;19,&quot;0019&quot;&#10;20,&quot;0020&quot;&#10;"/>
    </operator>
    <operator activated="true" class="split_data" compatibility="8.2.000" expanded="true" height="124" name="Split Data" width="90" x="514" y="34">
    <enumeration key="partitions">
    <parameter key="ratio" value="0.1"/>
    <parameter key="ratio" value="0.1"/>
    <parameter key="ratio" value="0.8"/>
    </enumeration>
    <parameter key="sampling_type" value="linear sampling"/>
    </operator>
    <connect from_op="Create ExampleSet" from_port="output" to_op="Split Data" to_port="example set"/>
    <connect from_op="Split Data" from_port="partition 1" to_port="result 1"/>
    <connect from_op="Split Data" from_port="partition 2" to_port="result 2"/>
    <connect from_op="Split Data" from_port="partition 3" to_port="result 3"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    </process>
    </operator>
    </process>
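
For readers without RapidMiner at hand, here is a minimal Python sketch of what the process above appears to do: Split Data with "linear sampling" cuts the rows into consecutive chunks, here with ratios 0.1 / 0.1 / 0.8 over the 20 sample rows. The function name `linear_split` and the way chunk boundaries are rounded are assumptions made for illustration, not RapidMiner's actual implementation.

```python
def linear_split(rows, ratios):
    """Split rows into consecutive chunks whose sizes follow the given ratios.

    Illustrative stand-in for Split Data with linear sampling; the rounding
    of chunk boundaries is an assumption, not RapidMiner's exact rule.
    """
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    n = len(rows)
    parts = []
    start = 0
    cumulative = 0.0
    for ratio in ratios:
        cumulative += ratio
        end = round(cumulative * n)  # boundary of this chunk in row order
        parts.append(rows[start:end])
        start = end
    return parts


# Same 20 sample values as in the process: "0001" .. "0020"
zip_codes = [f"{i:04d}" for i in range(1, 21)]
first, second, rest = linear_split(zip_codes, [0.1, 0.1, 0.8])
print(first, second, len(rest))
```

Because the split follows row order rather than the ZIP value itself, the data would need to be sorted by ZIP code first, and the ratios only match fixed ranges such as 00001 to 00500 if the codes are evenly distributed.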

    I hope it helps,

     

    Regards,

     

    Lionel

  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    You might also want to create a new attribute (using Generate Attributes) that corresponds to higher-level groupings of postal codes. Using the prefix function, you can create aggregated groups at the 1-digit level, the 2-digit level, etc. These can then be made available to the clustering algorithm instead of the raw ZIP code. The problem with the raw ZIP code is that RapidMiner has no idea it encodes a hierarchical relationship; it just interprets it as a set of distinct nominal values.
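
The grouping idea can be sketched outside RapidMiner as well. The Python below is only an illustration under stated assumptions: the helper names `zip_prefix` and `zip_range`, the range boundaries, and the five sample ZIP codes are all hypothetical and not taken from the original dataset. It shows both the prefix-based groups and the custom ranges the question asks about (00001 to 00500, 70000 to 75000).

```python
from bisect import bisect_right
from collections import Counter


def zip_prefix(zip_code, digits):
    """Return the leading `digits` characters of a zero-padded 5-digit ZIP code."""
    return str(zip_code).zfill(5)[:digits]


# Hypothetical custom ranges: RANGE_STARTS holds the first ZIP code of each range.
RANGE_STARTS = [1, 501, 70000, 75001]
RANGE_LABELS = ["00001-00500", "00501-69999", "70000-75000", "75001-99999"]


def zip_range(zip_code):
    """Map a ZIP code to the label of the custom range that contains it."""
    index = bisect_right(RANGE_STARTS, int(zip_code)) - 1
    if index < 0:
        raise ValueError(f"ZIP code below the first range: {zip_code}")
    return RANGE_LABELS[index]


# Made-up sample data, one ZIP code per customer record.
sample_zips = ["00123", "00450", "70173", "72072", "80331"]

by_prefix = Counter(zip_prefix(z, 1) for z in sample_zips)
by_range = Counter(zip_range(z) for z in sample_zips)
print(by_prefix)
print(by_range)
```

Counting customers per group this way already answers the "best-selling areas" question for fixed ranges; a clustering algorithm is only needed if the groups themselves should be discovered from the data.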

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts