If statement
Best Answer
MarcoBarradas Administrator, Employee-RapidMiner, RapidMiner Certified Analyst, Member Posts: 272 Unicorn
Hi @mina_s_kh, assuming you have already solved the clustering process, you can use this logic.
I built on top of what @yyhuang said: I sorted the counts, extracted a macro with the name of the smallest cluster, and created the outlier attribute. Hope it helps.
<?xml version="1.0" encoding="UTF-8"?>
<process version="9.1.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.1.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.1.000" expanded="true" height="68" name="Retrieve Iris" origin="GENERATED_TUTORIAL" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <operator activated="true" class="concurrency:k_means" compatibility="9.0.001" expanded="true" height="82" name="Clustering" origin="GENERATED_TUTORIAL" width="90" x="246" y="34">
        <parameter key="add_cluster_attribute" value="true"/>
        <parameter key="add_as_label" value="false"/>
        <parameter key="remove_unlabeled" value="false"/>
        <parameter key="k" value="3"/>
        <parameter key="max_runs" value="10"/>
        <parameter key="determine_good_start_values" value="false"/>
        <parameter key="measure_types" value="BregmanDivergences"/>
        <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
        <parameter key="nominal_measure" value="NominalDistance"/>
        <parameter key="numerical_measure" value="EuclideanDistance"/>
        <parameter key="divergence" value="SquaredEuclideanDistance"/>
        <parameter key="kernel_type" value="radial"/>
        <parameter key="kernel_gamma" value="1.0"/>
        <parameter key="kernel_sigma1" value="1.0"/>
        <parameter key="kernel_sigma2" value="0.0"/>
        <parameter key="kernel_sigma3" value="2.0"/>
        <parameter key="kernel_degree" value="3.0"/>
        <parameter key="kernel_shift" value="1.0"/>
        <parameter key="kernel_a" value="1.0"/>
        <parameter key="kernel_b" value="0.0"/>
        <parameter key="max_optimization_steps" value="100"/>
        <parameter key="use_local_random_seed" value="true"/>
        <parameter key="local_random_seed" value="1992"/>
      </operator>
      <operator activated="true" class="aggregate" compatibility="9.1.000" expanded="true" height="82" name="Aggregate" origin="GENERATED_TUTORIAL" width="90" x="380" y="34">
        <parameter key="use_default_aggregation" value="false"/>
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="default_aggregation_function" value="average"/>
        <list key="aggregation_attributes">
          <parameter key="a1" value="count"/>
        </list>
        <parameter key="group_by_attributes" value="cluster|label"/>
        <parameter key="count_all_combinations" value="false"/>
        <parameter key="only_distinct" value="false"/>
        <parameter key="ignore_missings" value="true"/>
      </operator>
      <operator activated="true" class="sort" compatibility="9.1.000" expanded="true" height="82" name="Sort" width="90" x="514" y="34">
        <parameter key="attribute_name" value="count(a1)"/>
        <parameter key="sorting_direction" value="increasing"/>
      </operator>
      <operator activated="true" class="extract_macro" compatibility="9.1.000" expanded="true" height="68" name="Extract Macro" width="90" x="648" y="34">
        <parameter key="macro" value="MIN_CLUSTER"/>
        <parameter key="macro_type" value="data_value"/>
        <parameter key="statistics" value="average"/>
        <parameter key="attribute_name" value="cluster"/>
        <parameter key="example_index" value="1"/>
        <list key="additional_macros"/>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="9.1.000" expanded="true" height="82" name="Generate Attributes" width="90" x="514" y="187">
        <list key="function_descriptions">
          <parameter key="Outlier" value="if(cluster==%{MIN_CLUSTER},&quot;Outlier&quot;,&quot;Normal&quot;)"/>
        </list>
        <parameter key="keep_all" value="true"/>
      </operator>
      <connect from_op="Retrieve Iris" from_port="output" to_op="Clustering" to_port="example set"/>
      <connect from_op="Clustering" from_port="clustered set" to_op="Aggregate" to_port="example set input"/>
      <connect from_op="Aggregate" from_port="example set output" to_op="Sort" to_port="example set input"/>
      <connect from_op="Aggregate" from_port="original" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Sort" from_port="example set output" to_op="Extract Macro" to_port="example set"/>
      <connect from_op="Extract Macro" from_port="example set" to_port="result 1"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
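For readers who want to see the logic of the process above outside RapidMiner, here is a minimal Python sketch of the same steps (count per cluster, take the smallest, flag its members). It is an illustration only, not part of the posted process: the function name `label_smallest_cluster` and the sample cluster assignments are hypothetical, and it counts by cluster alone, whereas the Aggregate operator above groups by `cluster|label`.

```python
from collections import Counter


def label_smallest_cluster(cluster_ids):
    """Return the smallest cluster and one "Outlier"/"Normal" label per example."""
    # Aggregate step: number of examples per cluster.
    counts = Counter(cluster_ids)
    # Sort (increasing count) plus Extract Macro (first row) collapse into min().
    # Ties are broken by cluster name so the result is deterministic.
    min_cluster = min(counts, key=lambda c: (counts[c], c))
    # Generate Attributes step: if(cluster == MIN_CLUSTER, "Outlier", "Normal").
    labels = ["Outlier" if c == min_cluster else "Normal" for c in cluster_ids]
    return min_cluster, labels


# Hypothetical cluster assignments, not taken from the thread's data.
clusters = ["cluster_0"] * 5 + ["cluster_1"] * 4 + ["cluster_2"] * 2
min_cluster, labels = label_smallest_cluster(clusters)
print(min_cluster, labels.count("Outlier"))
```

Because the smallest cluster is recomputed from the counts on every run, nothing is hard-coded, which is the point of extracting the macro in the process.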
Answers
Do you want to select the cluster with the lowest number of elements?
Can you share your process and your dataset(s) so that we can understand the problem better?
Regards,
Lionel
I use DBSCAN to cluster my dataset. I want to treat the cluster with the lowest number of elements as the outlier elements. I want to label the outliers as false and the others as true, and then use them in the k-NN algorithm.
My problem is that the outlier cluster may change when I use a different dataset. I want to find a way to determine the outlier cluster dynamically and use it in an if statement.
Thanks
Can you share your dataset (the file RDG_Day(Test)) so that I can run your process?
Regards,
Lionel
Thanks very much for the information. If you have only two attributes, ip and counts, you can basically ignore ip or translate ip to country/city names with geo-location functions.
1. Example code to integrate Python scripts for geolocating IP addresses.
2. The least() function in your Aggregate will not return the name of the cluster with the lowest count. You will need to aggregate count() by cluster and then label the "minorities" as you described above. My example process uses X-means (much faster) and returns 3 clusters. From the bar charts, cluster_2 has the lowest number of examples and will be labeled as the outlier.
3. You will need the Python extension from the Marketplace to test the process from my git, https://github.com/sunnyuan/geoIP-clustering, but you can skip the geolocating step with Python by using sampleSet_country_names.csv directly with clustering and k-NN.
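As a rough illustration of the last step discussed in this thread (labeling the smallest cluster as false, everything else as true, and passing those labels to k-NN), here is a small pure-Python sketch. It is not the process from the linked repository: the helper names `label_inliers` and `knn_predict`, the 2-D points, and the cluster ids are all hypothetical, and the k-NN here is a plain majority vote over Euclidean distance.

```python
from collections import Counter
import math


def label_inliers(cluster_ids):
    """True for examples outside the smallest cluster, False for the outliers."""
    counts = Counter(cluster_ids)
    # Smallest cluster by count; ties broken by cluster name.
    smallest = min(counts, key=lambda c: (counts[c], c))
    return [c != smallest for c in cluster_ids]


def knn_predict(train_points, train_labels, query, k=3):
    """Majority vote among the k training points nearest to `query`."""
    nearest = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical points and cluster assignments, for illustration only.
points = [
    (1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3),  # cluster_0
    (5.0, 5.0), (5.2, 4.9), (5.1, 5.3),              # cluster_1
    (9.0, 0.5), (9.2, 0.4),                          # cluster_2 (smallest)
]
clusters = ["cluster_0"] * 4 + ["cluster_1"] * 3 + ["cluster_2"] * 2

labels = label_inliers(clusters)
print(knn_predict(points, labels, (9.1, 0.6)))  # query close to the small cluster
print(knn_predict(points, labels, (1.0, 1.2)))  # query close to a large cluster
```

One thing to watch with this design: if the outlier cluster has fewer than half of k members, a majority vote can never return false, so k should be chosen with the size of the smallest cluster in mind.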