Hello everyone, how can I restrict the sample size in a clustering algorithm?

Selim · April 2019

For example, I have 3 clusters and 20 items. When I apply k-means, it gives me clusters of 11, 2 and 7 items, but I want the clusters to be of similar size, e.g. 7-7-6. How can I do that?
Kind regards,

lionelderkrikor · April 2019

@Selim

Here is some pseudo-code in Python:

Let N be the number of items, K the number of clusters and S = ceil(N/K) the maximum cluster size.

Create a list of tuples (item_id, cluster_id, distance), one for each item/centroid pair.
Sort the tuples by distance.
For each element (item_id, cluster_id, distance) in the sorted list of tuples:

- if cluster_id already contains S elements, or item_id is already assigned, do nothing
- otherwise, add item_id to cluster cluster_id

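from math import ceil, dist  # math.dist (Euclidean distance) needs Python 3.8+

# items and centroids are sequences of coordinate tuples, to be supplied by you;
# N = len(items), K = len(centroids)
S = ceil(N / K)        # maximum cluster size
dists = []
clusts = [None] * N    # cluster assigned to each item
counts = [0] * K       # current size of each cluster

for i, v in enumerate(items):
    # one (item_id, cluster_id, distance) tuple per centroid
    dists.extend((i, k, dist(c, v)) for k, c in enumerate(centroids))

dists.sort(key=lambda t: t[2])  # closest item/centroid pairs first

for item_id, cluster_id, d in dists:
    if counts[cluster_id] >= S:
        continue  # this cluster is already full
    if clusts[item_id] is None:
        clusts[item_id] = cluster_id
        counts[cluster_id] += 1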
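
To make the steps above concrete, here is a minimal self-contained sketch. The 20 sample points and the three centroids are made up to mimic the 11-2-7 split from the question, and `assign_balanced` is only a hypothetical helper name; in practice the centroids would come from a k-means run.

```python
from math import ceil, dist  # math.dist needs Python 3.8+


def assign_balanced(items, centroids):
    """Greedy, capacity-limited assignment: one cluster index per item."""
    n, k = len(items), len(centroids)
    cap = ceil(n / k)  # maximum cluster size S
    # All (distance, item_id, cluster_id) pairs, closest pairs first.
    pairs = sorted(
        (dist(p, c), i, j)
        for i, p in enumerate(items)
        for j, c in enumerate(centroids)
    )
    labels = [None] * n
    counts = [0] * k
    for _, i, j in pairs:
        # Skip items that are already assigned and clusters that are full.
        if labels[i] is None and counts[j] < cap:
            labels[i] = j
            counts[j] += 1
    return labels


# Made-up 1-D data: assigning each point to its nearest centroid gives 11-2-7.
items = [(float(x),) for x in
         (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 21, 40, 41, 42, 43, 44, 45, 46)]
centroids = [(5.0,), (20.5,), (43.0,)]

labels = assign_balanced(items, centroids)
sizes = [labels.count(j) for j in range(len(centroids))]
print(sizes)  # no cluster holds more than ceil(20 / 3) = 7 items
```

Because every cluster is capped at 7 and all 20 items get assigned, the three sizes can only be 7, 7 and 6 in some order. The price is that some items end up in a cluster that is not their nearest one.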

Regards,

Lionel

lionelderkrikor · April 2019

Hi @Selim,

Here is a process that performs your task.
First, I apply the K-means algorithm (in Python) to initialize k clusters (and k centroids).
Then the data points are reassigned to clusters according to the distance between each data point and the centroids, so that in the end the k clusters
have the same size, up to rounding (each cluster holds at most ceil(N/k) points, where N is the number of examples).

Concretely, you obtain a new column called 'cluster' which gives the cluster of each data point.

I tested the script with your dataset, which contains 30 examples, with the number of clusters k = 3.
As expected, I obtain 3 clusters of 10 examples each.
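
For reference, the size cap in the script is computed as `(Npt + C - 1) // C`, which is the integer form of ceil(N/k). The short sketch below checks this for the numbers mentioned in this thread; the function names are hypothetical and only serve the illustration.

```python
from math import ceil


def max_cluster_size(n_points: int, n_clusters: int) -> int:
    """Integer form of ceil(n_points / n_clusters), as used in the script."""
    return (n_points + n_clusters - 1) // n_clusters


def smallest_cluster_lower_bound(n_points: int, n_clusters: int) -> int:
    """Fewest points a cluster can hold when all others are filled to the cap."""
    cap = max_cluster_size(n_points, n_clusters)
    return max(0, n_points - (n_clusters - 1) * cap)


for n, k in [(30, 3), (20, 3), (10, 4)]:
    print(n, k, max_cluster_size(n, k), ceil(n / k),
          smallest_cluster_lower_bound(n, k))
```

When N is a multiple of k (30 and 3), every cluster has exactly N/k points. Otherwise the cap only bounds the sizes from above: with 20 points and 3 clusters the smallest cluster still has at least 6 points, but with 10 points and 4 clusters a split such as 3-3-3-1 is possible.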

To execute this process, you need to:
- install Python on your computer
- install the Python libraries imported by the script (pandas, NumPy, SciPy and scikit-learn)
- set the number of clusters in the Set Macros parameters

Does this process meet your needs?

Regards,

Lionel

NB : the process :

<?xml version="1.0" encoding="UTF-8"?><process version="9.2.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="read_excel" compatibility="9.2.001" expanded="true" height="68" name="Read Excel" width="90" x="45" y="85">
        <parameter key="excel_file" value="C:\Users\Lionel\Downloads\k-means.xlsx"/>
        <parameter key="sheet_selection" value="sheet number"/>
        <parameter key="sheet_number" value="1"/>
        <parameter key="imported_cell_range" value="A1"/>
        <parameter key="encoding" value="SYSTEM"/>
        <parameter key="first_row_as_names" value="true"/>
        <list key="annotations"/>
        <parameter key="date_format" value=""/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="locale" value="English (United States)"/>
        <parameter key="read_all_values_as_polynominal" value="false"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="ürün ID.true.integer.attribute"/>
          <parameter key="1" value="hacim.true.integer.attribute"/>
          <parameter key="2" value="ağırlık.true.integer.attribute"/>
          <parameter key="3" value="satış miktar.true.integer.attribute"/>
          <parameter key="4" value="kırılganlık.true.polynominal.attribute"/>
          <parameter key="5" value="F.true.polynominal.attribute"/>
        </list>
        <parameter key="read_not_matching_values_as_missings" value="false"/>
        <parameter key="datamanagement" value="double_array"/>
        <parameter key="data_management" value="auto"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.2.001" expanded="true" height="82" name="Set Role" width="90" x="179" y="85">
        <parameter key="attribute_name" value="ürün ID"/>
        <parameter key="target_role" value="id"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="nominal_to_numerical" compatibility="9.2.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="313" y="85">
        <parameter key="return_preprocessing_model" value="false"/>
        <parameter key="create_view" value="false"/>
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="kırılganlık"/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="nominal"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="file_path"/>
        <parameter key="block_type" value="single_value"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="single_value"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="coding_type" value="dummy coding"/>
        <parameter key="use_comparison_groups" value="false"/>
        <list key="comparison_groups"/>
        <parameter key="unexpected_value_handling" value="all 0 and warning"/>
        <parameter key="use_underscore_in_name" value="false"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="9.2.001" expanded="true" height="82" name="Select Attributes" width="90" x="447" y="85">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="F"/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="true"/>
        <parameter key="include_special_attributes" value="false"/>
      </operator>
      <operator activated="true" class="set_macros" compatibility="9.2.001" expanded="true" height="82" name="Set Macros" width="90" x="715" y="85">
        <list key="macros">
          <parameter key="cluster_number" value="3"/>
        </list>
      </operator>
      <operator activated="true" class="python_scripting:execute_python" compatibility="9.2.000" expanded="true" height="103" name="Execute Python" width="90" x="849" y="85">
        <parameter key="script" value="import pandas as pd&#10;from operator import itemgetter&#10;import numpy as np&#10;import random&#10;import sys&#10;from scipy.spatial import distance&#10;from sklearn.cluster import KMeans&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;&#10;C = %{cluster_number}&#10;&#10;def k_means(X) : &#10;&#10;  kmeans = KMeans(n_clusters=C, random_state=0).fit(X)&#10;  return kmeans.cluster_centers_&#10;&#10;&#10;&#10;&#10;def samesizecluster( D ):&#10;    &quot;&quot;&quot; in: point-to-cluster-centre distances D, Npt x C&#10;            &#10;        out: xtoc, X -&gt; C, equal-size clusters&#10;       &#10;    &quot;&quot;&quot;&#10;       &#10;    Npt, C = D.shape&#10;    clustersize = (Npt + C - 1) // C&#10;    xcd = list( np.ndenumerate(D) )  # ((0,0), d00), ((0,1), d01) ...&#10;    xcd.sort( key=itemgetter(1) )&#10;    xtoc = np.ones( Npt, int ) * -1&#10;    nincluster = np.zeros( C, int )&#10;    nall = 0&#10;    for (x,c), d in xcd:&#10;        if xtoc[x] &lt; 0  and  nincluster[c] &lt; clustersize:&#10;            xtoc[x] = c&#10;            nincluster[c] += 1&#10;            nall += 1&#10;            if nall &gt;= Npt:  break&#10;    return xtoc&#10;&#10;def rm_main(data):&#10; &#10;  data_2 = data.values&#10;  #centres = random.sample(list(data_2), C )&#10;  centres = k_means(data_2)&#10;  D = distance.cdist( data_2, centres )&#10;  xtoc = samesizecluster( D )&#10;  data['cluster'] = xtoc&#10;&#10;    # connect 2 output ports to see the results&#10;  return data"/>
        <parameter key="use_default_python" value="true"/>
        <parameter key="package_manager" value="conda (anaconda)"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.2.001" expanded="true" height="82" name="Set Role (2)" width="90" x="983" y="85">
        <parameter key="attribute_name" value="cluster"/>
        <parameter key="target_role" value="cluster"/>
        <list key="set_additional_roles"/>
      </operator>
      <connect from_op="Read Excel" from_port="output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Nominal to Numerical" to_port="example set input"/>
      <connect from_op="Nominal to Numerical" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Set Macros" to_port="through 1"/>
      <connect from_op="Set Macros" from_port="through 1" to_op="Execute Python" to_port="input 1"/>
      <connect from_op="Execute Python" from_port="output 1" to_op="Set Role (2)" to_port="example set input"/>
      <connect from_op="Set Role (2)" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

lionelderkrikor · April 2019

Hi @Selim

The result is not guaranteed, but you can try the DBSCAN model (another clustering algorithm) and tune its two parameters, epsilon and min points.
By tuning these parameters, I was able to split the "Iris dataset" into 3 clusters of approximately the same size.
Here is the process:

<?xml version="1.0" encoding="UTF-8"?><process version="9.2.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.2.001" expanded="true" height="68" name="Retrieve Iris" width="90" x="112" y="85">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <operator activated="true" class="dbscan" compatibility="9.2.001" expanded="true" height="82" name="Clustering" width="90" x="246" y="85">
        <parameter key="epsilon" value="0.8"/>
        <parameter key="min_points" value="40"/>
        <parameter key="add_cluster_attribute" value="true"/>
        <parameter key="add_as_label" value="false"/>
        <parameter key="remove_unlabeled" value="false"/>
        <parameter key="measure_types" value="MixedMeasures"/>
        <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
        <parameter key="nominal_measure" value="NominalDistance"/>
        <parameter key="numerical_measure" value="EuclideanDistance"/>
        <parameter key="divergence" value="GeneralizedIDivergence"/>
        <parameter key="kernel_type" value="radial"/>
        <parameter key="kernel_gamma" value="1.0"/>
        <parameter key="kernel_sigma1" value="1.0"/>
        <parameter key="kernel_sigma2" value="0.0"/>
        <parameter key="kernel_sigma3" value="2.0"/>
        <parameter key="kernel_degree" value="3.0"/>
        <parameter key="kernel_shift" value="1.0"/>
        <parameter key="kernel_a" value="1.0"/>
        <parameter key="kernel_b" value="0.0"/>
      </operator>
      <connect from_op="Retrieve Iris" from_port="output" to_op="Clustering" to_port="example set"/>
      <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
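
Since DBSCAN does not let you fix the cluster sizes, it may help to check the sizes after each parameter change. Below is a minimal sketch, assuming the cluster labels have been exported as a plain Python list; the label values and the `is_balanced` helper are hypothetical and not part of RapidMiner.

```python
from collections import Counter


def cluster_sizes(labels):
    """Number of items per cluster label."""
    return dict(Counter(labels))


def is_balanced(labels, tolerance=1):
    """True if the largest and smallest clusters differ by at most `tolerance`."""
    sizes = Counter(labels).values()
    return max(sizes) - min(sizes) <= tolerance


# Made-up label lists matching the sizes discussed in the question.
uneven = ["cluster_0"] * 11 + ["cluster_1"] * 2 + ["cluster_2"] * 7
even = ["cluster_0"] * 7 + ["cluster_1"] * 7 + ["cluster_2"] * 6

print(cluster_sizes(uneven), is_balanced(uneven))
print(cluster_sizes(even), is_balanced(even))
```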

Otherwise, here is an interesting link:

https://stackoverflow.com/questions/5452576/k-means-algorithm-variation-with-equal-cluster-size

Hope this helps,

Regards,

Lionel

Selim · April 2019

First of all, thanks for the answer, I will try it. Also, do you have an idea of how to do it with Execute Python? What code do I need to write in Python?

Selim · April 2019

Many thanks again. If I send you my Excel file and RapidMiner process, can you check it? My Python knowledge is not very good; I am just a beginner in Python. I read your answer to a question from April 2018, so I tried to do it with a Python script in the Execute Python operator. So can you check my process? And if I copy-paste this code, do you think it will work?

lionelderkrikor · April 2019

@Selim,

No, the Python code provided in my previous post will not work if you just copy-paste it (it is only simplified pseudo-code).
But yes, if you provide your Excel file and your RapidMiner process, I will work on your project to provide a process
which does what you want (obtaining clusters of the same size).

Regards,

Lionel

Selim · April 2019

Thanks a lot again.

Selim · April 2019

I am doing zoning in a warehouse. When I run this process, it gives 5 clusters of similar size, but that does not mean it will give same-size clusters when I work with 10,000 items, so I want something permanent. So I think I need to write code in Python. What do you think about this process, and how can we do this?

Selim · April 2019

When I try to send a picture of the process, it gives an error, so I will describe the process instead:
read excel --- nominal to numerical --- normalize --- weight by user --- select by weights --- clustering (k-means) --- performance (distance)

lionelderkrikor · April 2019

@Selim

To share your RapidMiner process, follow these instructions:

Note: This solution requires the "XML" panel, which can be opened from the "View" menu under "Show Panel". Activate the XML panel if you have not done so before.

Open your process in RapidMiner and open the XML panel. If you can't find it, follow the note above.

Copy the XML code from there and paste it somewhere else, for example into a forum post here on the community portal. By the way, if you post your XML here, please use the code environment, which you get by clicking on the </> icon in the toolbar of the post.

To import such an XML description of a process, e.g. to use a process someone else has posted here in the forum, follow these steps:

1. Create a new process and go to the XML panel (see above).
2. Clear the view and paste the XML code you got into that panel.
3. Press the green check mark icon at the top of the panel.
4. Switch back to the Process panel.

Don't forget step 3 above: you need to accept the changed XML code first before you will see any changes in the process!

Regards,

Lionel

Selim · April 2019

Here are the steps of the clustering in RapidMiner.

Selim · April 2019

[RapidMiner process XML, process version 8.1.003; the body of the process was not preserved in this copy of the post]

Selim · April 2019

Is it okay right now?

Selim · April 2019

@lionelderkrikor hello sir, I have been waiting for your answer. Did you take a look at the process?

Selim · April 2019

@lionelderkrikor
Thank you so much, really. I did it now. Thank you so much again.

lionelderkrikor · April 2019

@Selim

Yes, you have to copy the XML process I shared and then paste it into the XML panel of RapidMiner.
Then click on the green check mark; the process will appear in the main window.

Tell me if you have a problem...

Regards,

Lionel

Selim · April 2019

@lionelderkrikor sir, I have a problem with the Execute Python operator: it gives an error. I set the Python path in the Settings --> Preferences --> Python Scripting window and tested it, but it gave me an error.
I added a screenshot of the error to a Word file. Could you check it, please?


Altair RapidMiner Community
