Hello everyone, how can I restrict the sample size in a clustering algorithm?
Best Answers
-
lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
@Selim
Here is some pseudo-code in Python:
Let N be the number of items, K the number of clusters and S = ceil(N/K) the maximum cluster size.
- Create a list of tuples (item_id, cluster_id, distance)
- Sort the tuples with respect to distance
- For each element (item_id, cluster_id, distance) in the sorted list of tuples:
  - if the number of elements in cluster_id has already reached S, do nothing
  - otherwise, if item_id is not yet assigned, add item_id to cluster cluster_id
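    from math import ceil

    # Assumes items, centroids and a distance function dist(a, b) are already defined.
    N, K = len(items), len(centroids)
    S = ceil(N / K)          # maximum cluster size

    dists = []               # (item_id, cluster_id, distance) for every pair
    clusts = [None] * N      # cluster assigned to each item
    counts = [0] * K         # current size of each cluster
    for i, v in enumerate(items):
        for k, c in enumerate(centroids):
            dists.append((i, k, dist(c, v)))

    # Closest item-centroid pairs first
    dists.sort(key=lambda t: t[2])
    for item_id, cluster_id, d in dists:
        if counts[cluster_id] >= S:
            continue         # cluster is already full
        if clusts[item_id] is None:
            clusts[item_id] = cluster_id
            counts[cluster_id] += 1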
Regards,
Lionel
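For readers who want something they can run directly, the greedy assignment described in this answer can be sketched as a self-contained function. This is only a minimal sketch under stated assumptions: the function name `equal_size_assign`, the toy points and the fixed centroids are hypothetical, and Euclidean distance (`math.dist`, Python 3.8+) is assumed as the distance measure.

```python
from math import ceil, dist


def equal_size_assign(items, centroids):
    """Greedily assign items to centroids, at most ceil(N/K) items per cluster."""
    n, k = len(items), len(centroids)
    max_size = ceil(n / k)
    # Every (distance, item_id, cluster_id) triple, closest pairs first.
    pairs = sorted(
        (dist(item, centre), i, c)
        for i, item in enumerate(items)
        for c, centre in enumerate(centroids)
    )
    labels = [None] * n   # cluster chosen for each item
    counts = [0] * k      # current size of each cluster
    for _, i, c in pairs:
        # Skip pairs whose item is already placed or whose cluster is full.
        if labels[i] is None and counts[c] < max_size:
            labels[i] = c
            counts[c] += 1
    return labels


# Toy data (hypothetical): six 2-D points and two fixed centroids.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (9, 0), (10, 0)]
centres = [(1.5, 0), (9.5, 0)]
labels = equal_size_assign(points, centres)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Note that the point (3, 0) is closer to the first centroid but ends up in the second cluster, because the first cluster is already full when its turn comes. That is the price of the size constraint.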
-
lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
Hi @Selim,
Here is a process that performs your task.
First, I apply the K-means algorithm (in Python) to initialize k clusters (and k centroids).
Then the data points are reassigned to clusters according to their distance to the centroids, so that in the end you obtain k clusters
of (nearly) equal size (at most ceil(N/k) examples each, where N is the number of examples).
Concretely, you obtain a new column called 'cluster' which gives the cluster of each data point.
I tested the script with your dataset, which contains 30 examples, with the number of clusters k = 3.
As expected, I obtain 3 clusters of 10 examples each.
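One way to double-check such a result is to count the labels per cluster and compare them with the limit ceil(N/k). The snippet below is a small sketch of that check. The helper name `cluster_sizes_ok` and the label list are hypothetical stand-ins for the 'cluster' column, not output taken from the actual process.

```python
from collections import Counter
from math import ceil


def cluster_sizes_ok(labels, k):
    """Return (sizes, ok): per-cluster counts and whether none exceeds ceil(N/k)."""
    limit = ceil(len(labels) / k)
    sizes = Counter(labels)
    return dict(sizes), all(size <= limit for size in sizes.values())


# Hypothetical labels standing in for the 'cluster' column: 30 examples, k = 3.
labels = [0] * 10 + [1] * 10 + [2] * 10
sizes, ok = cluster_sizes_ok(labels, k=3)
print(sizes, ok)  # {0: 10, 1: 10, 2: 10} True
```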
To execute this process, you need to:
- install Python on your computer
- install the "SciPy" library (the script also imports pandas, NumPy and scikit-learn)
- set the number of clusters in the Set Macros parameters
Does this process meet your needs?
Regards,
Lionel
NB : the process :<?xml version="1.0" encoding="UTF-8"?><process version="9.2.001"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="read_excel" compatibility="9.2.001" expanded="true" height="68" name="Read Excel" width="90" x="45" y="85"> <parameter key="excel_file" value="C:\Users\Lionel\Downloads\k-means.xlsx"/> <parameter key="sheet_selection" value="sheet number"/> <parameter key="sheet_number" value="1"/> <parameter key="imported_cell_range" value="A1"/> <parameter key="encoding" value="SYSTEM"/> <parameter key="first_row_as_names" value="true"/> <list key="annotations"/> <parameter key="date_format" value=""/> <parameter key="time_zone" value="SYSTEM"/> <parameter key="locale" value="English (United States)"/> <parameter key="read_all_values_as_polynominal" value="false"/> <list key="data_set_meta_data_information"> <parameter key="0" value="ürün ID.true.integer.attribute"/> <parameter key="1" value="hacim.true.integer.attribute"/> <parameter key="2" value="ağırlık.true.integer.attribute"/> <parameter key="3" value="satış miktar.true.integer.attribute"/> <parameter key="4" value="kırılganlık.true.polynominal.attribute"/> <parameter key="5" value="F.true.polynominal.attribute"/> </list> <parameter key="read_not_matching_values_as_missings" value="false"/> <parameter key="datamanagement" value="double_array"/> <parameter key="data_management" value="auto"/> </operator> <operator activated="true" class="set_role" compatibility="9.2.001" expanded="true" height="82" name="Set Role" width="90" x="179" y="85"> <parameter 
key="attribute_name" value="ürün ID"/> <parameter key="target_role" value="id"/> <list key="set_additional_roles"/> </operator> <operator activated="true" class="nominal_to_numerical" compatibility="9.2.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="313" y="85"> <parameter key="return_preprocessing_model" value="false"/> <parameter key="create_view" value="false"/> <parameter key="attribute_filter_type" value="single"/> <parameter key="attribute" value="kırılganlık"/> <parameter key="attributes" value=""/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="nominal"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="file_path"/> <parameter key="block_type" value="single_value"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="single_value"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="false"/> <parameter key="coding_type" value="dummy coding"/> <parameter key="use_comparison_groups" value="false"/> <list key="comparison_groups"/> <parameter key="unexpected_value_handling" value="all 0 and warning"/> <parameter key="use_underscore_in_name" value="false"/> </operator> <operator activated="true" class="select_attributes" compatibility="9.2.001" expanded="true" height="82" name="Select Attributes" width="90" x="447" y="85"> <parameter key="attribute_filter_type" value="single"/> <parameter key="attribute" value="F"/> <parameter key="attributes" value=""/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="attribute_value"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="time"/> <parameter key="block_type" value="attribute_block"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="value_matrix_row_start"/> 
<parameter key="invert_selection" value="true"/> <parameter key="include_special_attributes" value="false"/> </operator> <operator activated="true" class="set_macros" compatibility="9.2.001" expanded="true" height="82" name="Set Macros" width="90" x="715" y="85"> <list key="macros"> <parameter key="cluster_number" value="3"/> </list> </operator> <operator activated="true" class="python_scripting:execute_python" compatibility="9.2.000" expanded="true" height="103" name="Execute Python" width="90" x="849" y="85"> <parameter key="script" value="import pandas as pd from operator import itemgetter import numpy as np import random import sys from scipy.spatial import distance from sklearn.cluster import KMeans # rm_main is a mandatory function, # the number of arguments has to be the number of input ports (can be none) C = %{cluster_number} def k_means(X) : kmeans = KMeans(n_clusters=C, random_state=0).fit(X) return kmeans.cluster_centers_ def samesizecluster( D ): """ in: point-to-cluster-centre distances D, Npt x C out: xtoc, X -> C, equal-size clusters """ Npt, C = D.shape clustersize = (Npt + C - 1) // C xcd = list( np.ndenumerate(D) ) # ((0,0), d00), ((0,1), d01) ... 
xcd.sort( key=itemgetter(1) ) xtoc = np.ones( Npt, int ) * -1 nincluster = np.zeros( C, int ) nall = 0 for (x,c), d in xcd: if xtoc[x] < 0 and nincluster[c] < clustersize: xtoc[x] = c nincluster[c] += 1 nall += 1 if nall >= Npt: break return xtoc def rm_main(data): data_2 = data.values #centres = random.sample(list(data_2), C ) centres = k_means(data_2) D = distance.cdist( data_2, centres ) xtoc = samesizecluster( D ) data['cluster'] = xtoc # connect 2 output ports to see the results return data"/> <parameter key="use_default_python" value="true"/> <parameter key="package_manager" value="conda (anaconda)"/> </operator> <operator activated="true" class="set_role" compatibility="9.2.001" expanded="true" height="82" name="Set Role (2)" width="90" x="983" y="85"> <parameter key="attribute_name" value="cluster"/> <parameter key="target_role" value="cluster"/> <list key="set_additional_roles"/> </operator> <connect from_op="Read Excel" from_port="output" to_op="Set Role" to_port="example set input"/> <connect from_op="Set Role" from_port="example set output" to_op="Nominal to Numerical" to_port="example set input"/> <connect from_op="Nominal to Numerical" from_port="example set output" to_op="Select Attributes" to_port="example set input"/> <connect from_op="Select Attributes" from_port="example set output" to_op="Set Macros" to_port="through 1"/> <connect from_op="Set Macros" from_port="through 1" to_op="Execute Python" to_port="input 1"/> <connect from_op="Execute Python" from_port="output 1" to_op="Set Role (2)" to_port="example set input"/> <connect from_op="Set Role (2)" from_port="example set output" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> </process> </operator> </process>
Answers
The result is not guaranteed, but you can try the DBSCAN model (another clustering algorithm) and tune its two parameters, epsilon and min points.
By tuning these parameters, I was able to split the "Iris dataset" into 3 clusters of approximately the same size.
Here is the process:
https://stackoverflow.com/questions/5452576/k-means-algorithm-variation-with-equal-cluster-size
Hope this helps,
Regards,
Lionel
No, the Python code provided in my previous post will not work if you just copy-paste it (it is only simplified pseudo-code).
But yes, if you provide your Excel file and your RapidMiner process, I will work on your project to provide a process
that does what you want (i.e. produces clusters of the same size).
Regards,
Lionel
read excel---nominal to numerical---normalize----weight by user ---select by weights---clustering(k-means)---performance(distance)
To share your RapidMiner process, follow these instructions:
Note: This solution requires the "XML" panel which can be opened in the "View" menu and then "Show Panel". Activate the XML panel if you did not do this before.
Open your process in RapidMiner and open the XML panel. If you can't find it, make sure to follow the note above.
Copy the XML code from there and paste it somewhere else, for example into a forum post here on the community portal. By the way, if you post your XML here, please use the code environment which you get by clicking on the </> icon in the toolbar of the post.
In order to import such an XML description of your process, e.g. to use a process someone else has posted here in the forum, please follow the following steps:
Don't forget step 3 above - you need to accept the changed XML code first before you will see any changes in the process!
Regards,
Lionel
Thank you so much, really. I did it now. Thank you so much again.
Yes, you have to copy the XML process I shared and then paste it into the XML panel of RapidMiner.
Then click on the green check mark; the process will appear in the main window.
Tell me if you have a problem.
Regards,
Lionel
I added a screenshot of the error to a Word file. Could you check it, please?