K Nearest for reducing colour palette used in ML
We are applying ML to create images, and we want to apply ML to those images to try to determine which colours certain customer clusters respond to.
The problem is that there are roughly 16.7 million (256 × 256 × 256) different colour possibilities when we apply OCR to the final image that is created.
Colours in hex code are encoded as values from #000000 (black) to #ffffff (white). But RM does not know that the RM company logo has primarily four colours: #f5e44c, #e26937, #7a7d82 and #32353d. It does not even understand the distances between these colours, or that #f5e44c and #e26937 would be nearest neighbours. It would not know to recommend using #ce6033 because it is so close to #e26937.
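To make the idea of "distance between colours" concrete, here is a minimal Python sketch (not part of the original post). It assumes plain Euclidean distance in RGB space, which is only one possible choice, and the helper names `hex_to_rgb`, `rgb_distance` and `nearest_colour` are made up for illustration.

```python
import math

# The four logo colours quoted in the post above.
LOGO_COLOURS = ["#f5e44c", "#e26937", "#7a7d82", "#32353d"]

def hex_to_rgb(code):
    """Convert '#rrggbb' to an (R, G, B) tuple of ints in 0..255."""
    code = code.lstrip("#")
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))

def rgb_distance(a, b):
    """Euclidean distance between two hex colours in RGB space (an assumption)."""
    return math.dist(hex_to_rgb(a), hex_to_rgb(b))

def nearest_colour(code, palette=LOGO_COLOURS):
    """Return the palette entry closest to `code`."""
    return min(palette, key=lambda p: rgb_distance(code, p))

print(nearest_colour("#ce6033"))  # prints "#e26937"
```

With this measure, #ce6033 sits about 22.3 units from #e26937 and more than 100 units from each of the other three logo colours.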
The calculation of the colour codes follows this methodology:
White RGB Color
White RGB code = 255*65536+255*256+255 = #FFFFFF
Blue RGB Color
Blue RGB code = 0*65536+0*256+255 = #0000FF
Red RGB Color
Red RGB code = 255*65536+0*256+0 = #FF0000
Green RGB Color
Green RGB code = 0*65536+255*256+0 = #00FF00
Gray RGB Color
Gray RGB code = 128*65536+128*256+128 = #808080
Yellow RGB Color
Yellow RGB code = 255*65536+255*256+0 = #FFFF00
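The examples above all follow the same rule, code = R*65536 + G*256 + B, written as six hex digits. A small Python sketch of that rule and its inverse (the function names are hypothetical, chosen only for this illustration):

```python
def rgb_to_int(r, g, b):
    """Pack three 0..255 channels into one integer: R*65536 + G*256 + B."""
    return r * 65536 + g * 256 + b

def rgb_to_hex(r, g, b):
    """Format the packed integer as a '#RRGGBB' hex code."""
    return "#{:06X}".format(rgb_to_int(r, g, b))

def int_to_rgb(value):
    """Unpack a 24-bit colour integer back into (R, G, B)."""
    return (value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF

print(rgb_to_hex(255, 255, 0))   # prints "#FFFF00" (yellow)
print(int_to_rgb(0x808080))      # prints (128, 128, 128) (gray)
```

Note that the packed integer is only an encoding. Two colours with nearby integers are not necessarily similar, so any distance calculation should work on the separate R, G and B values.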
Has anyone used RM to reduce the number of colours used in ML by applying K Nearest Neighbour? I want to reduce that 16 million down to a much more usable number of around 117 colours.
Best Answer
yyhuang Administrator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist
Sample process to get color codes and run k-means on that:
<?xml version="1.0" encoding="UTF-8"?><process version="9.2.000"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="retrieve" compatibility="9.2.000" expanded="true" height="68" name="Retrieve RGB data" width="90" x="179" y="34"> <parameter key="repository_entry" value="RGB data"/> </operator> <operator activated="false" class="radoop:spark_kmeans" compatibility="9.1.000" expanded="true" height="82" name="K-Means" width="90" x="380" y="187"> <parameter key="use_local_random_seed" value="false"/> <parameter key="local_random_seed" value="1992"/> <parameter key="number_of_clusters" value="2"/> <parameter key="maximum_iterations" value="20"/> <parameter key="initialization_mode" value="k-means||"/> <parameter key="parallel_runs" value="1"/> <parameter key="epsilon" value="1.0E-4"/> <parameter key="file_format" value="TEXTFILE"/> </operator> <operator activated="true" class="concurrency:k_means" compatibility="9.2.000" expanded="true" height="82" name="Clustering (2)" width="90" x="380" y="34"> <parameter key="add_cluster_attribute" value="true"/> <parameter key="add_as_label" value="true"/> <parameter key="remove_unlabeled" value="false"/> <parameter key="k" value="117"/> <parameter key="max_runs" value="10"/> <parameter key="determine_good_start_values" value="true"/> <parameter key="measure_types" value="NumericalMeasures"/> <parameter key="mixed_measure" value="MixedEuclideanDistance"/> <parameter key="nominal_measure" value="NominalDistance"/> <parameter key="numerical_measure" value="EuclideanDistance"/> <parameter key="divergence" 
value="SquaredEuclideanDistance"/> <parameter key="kernel_type" value="radial"/> <parameter key="kernel_gamma" value="1.0"/> <parameter key="kernel_sigma1" value="1.0"/> <parameter key="kernel_sigma2" value="0.0"/> <parameter key="kernel_sigma3" value="2.0"/> <parameter key="kernel_degree" value="3.0"/> <parameter key="kernel_shift" value="1.0"/> <parameter key="kernel_a" value="1.0"/> <parameter key="kernel_b" value="0.0"/> <parameter key="max_optimization_steps" value="100"/> <parameter key="use_local_random_seed" value="false"/> <parameter key="local_random_seed" value="1992"/> </operator> <connect from_op="Retrieve RGB data" from_port="output" to_op="Clustering (2)" to_port="example set"/> <connect from_op="Clustering (2)" from_port="cluster model" to_port="result 1"/> <connect from_op="Clustering (2)" from_port="clustered set" to_port="result 2"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> <portSpacing port="sink_result 3" spacing="0"/> </process> </operator> </process>
<?xml version="1.0" encoding="UTF-8"?><process version="9.2.000"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="utility:create_exampleset" compatibility="9.2.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="45" y="34"> <parameter key="generator_type" value="numeric series"/> <parameter key="number_of_examples" value="256"/> <parameter key="use_stepsize" value="false"/> <list key="function_descriptions"/> <parameter 
key="add_id_attribute" value="false"/> <list key="numeric_series_configuration"> <parameter key="r" value="linear.0\.0.255\.0"/> </list> <list key="date_series_configuration"/> <list key="date_series_configuration (interval)"/> <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/> <parameter key="time_zone" value="SYSTEM"/> <parameter key="column_separator" value=","/> <parameter key="parse_all_as_nominal" value="false"/> <parameter key="decimal_point_character" value="."/> <parameter key="trim_attribute_names" value="true"/> </operator> <operator activated="true" class="numerical_to_polynominal" compatibility="9.2.000" expanded="true" height="82" name="Numerical to Polynominal" width="90" x="179" y="34"> <parameter key="attribute_filter_type" value="all"/> <parameter key="attribute" value=""/> <parameter key="attributes" value=""/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="numeric"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="real"/> <parameter key="block_type" value="value_series"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="value_series_end"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="false"/> </operator> <operator activated="true" class="concurrency:loop_values" compatibility="9.2.000" expanded="true" height="82" name="Loop Values" width="90" x="380" y="34"> <parameter key="attribute" value="r"/> <parameter key="iteration_macro" value="Rvalue"/> <parameter key="reuse_results" value="false"/> <parameter key="enable_parallel_execution" value="true"/> <process expanded="true"> <operator activated="true" class="utility:create_exampleset" compatibility="9.2.000" expanded="true" height="68" name="Create ExampleSet (2)" width="90" x="45" y="187"> <parameter key="generator_type" value="numeric series"/> <parameter key="number_of_examples" 
value="256"/> <parameter key="use_stepsize" value="false"/> <list key="function_descriptions"/> <parameter key="add_id_attribute" value="false"/> <list key="numeric_series_configuration"> <parameter key="g" value="linear.0\.0.255\.0"/> </list> <list key="date_series_configuration"/> <list key="date_series_configuration (interval)"/> <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/> <parameter key="time_zone" value="SYSTEM"/> <parameter key="column_separator" value=","/> <parameter key="parse_all_as_nominal" value="false"/> <parameter key="decimal_point_character" value="."/> <parameter key="trim_attribute_names" value="true"/> </operator> <operator activated="true" class="numerical_to_polynominal" compatibility="9.2.000" expanded="true" height="82" name="Numerical to Polynominal (2)" width="90" x="179" y="187"> <parameter key="attribute_filter_type" value="all"/> <parameter key="attribute" value=""/> <parameter key="attributes" value=""/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="numeric"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="real"/> <parameter key="block_type" value="value_series"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="value_series_end"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="false"/> </operator> <operator activated="true" class="concurrency:loop_values" compatibility="9.2.000" expanded="true" height="82" name="Loop Values (2)" width="90" x="313" y="187"> <parameter key="attribute" value="g"/> <parameter key="iteration_macro" value="Gvalue"/> <parameter key="reuse_results" value="false"/> <parameter key="enable_parallel_execution" value="false"/> <process expanded="true"> <operator activated="true" class="utility:create_exampleset" compatibility="9.2.000" expanded="true" height="68" name="Create ExampleSet (3)" width="90" 
x="112" y="136"> <parameter key="generator_type" value="numeric series"/> <parameter key="number_of_examples" value="256"/> <parameter key="use_stepsize" value="false"/> <list key="function_descriptions"/> <parameter key="add_id_attribute" value="false"/> <list key="numeric_series_configuration"> <parameter key="b" value="linear.0\.0.255\.0"/> </list> <list key="date_series_configuration"/> <list key="date_series_configuration (interval)"/> <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/> <parameter key="time_zone" value="SYSTEM"/> <parameter key="column_separator" value=","/> <parameter key="parse_all_as_nominal" value="false"/> <parameter key="decimal_point_character" value="."/> <parameter key="trim_attribute_names" value="true"/> </operator> <operator activated="true" class="numerical_to_polynominal" compatibility="9.2.000" expanded="true" height="82" name="Numerical to Polynominal (3)" width="90" x="246" y="136"> <parameter key="attribute_filter_type" value="all"/> <parameter key="attribute" value=""/> <parameter key="attributes" value=""/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="numeric"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="real"/> <parameter key="block_type" value="value_series"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="value_series_end"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="false"/> </operator> <operator activated="true" class="concurrency:loop_values" compatibility="9.2.000" expanded="true" height="82" name="Loop Values (3)" width="90" x="380" y="136"> <parameter key="attribute" value="b"/> <parameter key="iteration_macro" value="Bvalue"/> <parameter key="reuse_results" value="false"/> <parameter key="enable_parallel_execution" value="false"/> <process expanded="true"> <operator activated="true" 
class="utility:create_exampleset" compatibility="9.2.000" expanded="true" height="68" name="Create ExampleSet (4)" width="90" x="112" y="85"> <parameter key="generator_type" value="comma separated text"/> <parameter key="number_of_examples" value="100"/> <parameter key="use_stepsize" value="false"/> <list key="function_descriptions"/> <parameter key="add_id_attribute" value="false"/> <list key="numeric_series_configuration"/> <list key="date_series_configuration"/> <list key="date_series_configuration (interval)"/> <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/> <parameter key="time_zone" value="SYSTEM"/> <parameter key="input_csv_text" value="R, G, B %{Rvalue}, %{Gvalue}, %{Bvalue}"/> <parameter key="column_separator" value=","/> <parameter key="parse_all_as_nominal" value="false"/> <parameter key="decimal_point_character" value="."/> <parameter key="trim_attribute_names" value="true"/> </operator> <connect from_op="Create ExampleSet (4)" from_port="output" to_port="output 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="source_input 2" spacing="0"/> <portSpacing port="sink_output 1" spacing="0"/> <portSpacing port="sink_output 2" spacing="0"/> </process> </operator> <operator activated="true" class="append" compatibility="9.2.000" expanded="true" height="82" name="Append" width="90" x="514" y="136"> <parameter key="datamanagement" value="double_array"/> <parameter key="data_management" value="auto"/> <parameter key="merge_type" value="all"/> </operator> <connect from_op="Create ExampleSet (3)" from_port="output" to_op="Numerical to Polynominal (3)" to_port="example set input"/> <connect from_op="Numerical to Polynominal (3)" from_port="example set output" to_op="Loop Values (3)" to_port="input 1"/> <connect from_op="Loop Values (3)" from_port="output 1" to_op="Append" to_port="example set 1"/> <connect from_op="Append" from_port="merged set" to_port="output 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing 
port="source_input 2" spacing="0"/> <portSpacing port="sink_output 1" spacing="0"/> <portSpacing port="sink_output 2" spacing="0"/> </process> </operator> <operator activated="true" class="append" compatibility="9.2.000" expanded="true" height="82" name="Append (2)" width="90" x="447" y="187"> <parameter key="datamanagement" value="double_array"/> <parameter key="data_management" value="auto"/> <parameter key="merge_type" value="all"/> </operator> <connect from_op="Create ExampleSet (2)" from_port="output" to_op="Numerical to Polynominal (2)" to_port="example set input"/> <connect from_op="Numerical to Polynominal (2)" from_port="example set output" to_op="Loop Values (2)" to_port="input 1"/> <connect from_op="Loop Values (2)" from_port="output 1" to_op="Append (2)" to_port="example set 1"/> <connect from_op="Append (2)" from_port="merged set" to_port="output 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="source_input 2" spacing="0"/> <portSpacing port="sink_output 1" spacing="0"/> <portSpacing port="sink_output 2" spacing="0"/> </process> </operator> <operator activated="true" class="append" compatibility="9.2.000" expanded="true" height="82" name="Append (3)" width="90" x="581" y="34"> <parameter key="datamanagement" value="double_array"/> <parameter key="data_management" value="auto"/> <parameter key="merge_type" value="all"/> </operator> <operator activated="true" class="store" compatibility="9.2.000" expanded="true" height="68" name="Store" width="90" x="715" y="34"> <parameter key="repository_entry" value="RGB data"/> </operator> <connect from_op="Create ExampleSet" from_port="output" to_op="Numerical to Polynominal" to_port="example set input"/> <connect from_op="Numerical to Polynominal" from_port="example set output" to_op="Loop Values" to_port="input 1"/> <connect from_op="Loop Values" from_port="output 1" to_op="Append (3)" to_port="example set 1"/> <connect from_op="Append (3)" from_port="merged set" to_op="Store" 
to_port="input"/> <connect from_op="Store" from_port="through" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> </process> </operator> </process>
K-means applied to 16 million data points would take a while...
Answers
Thanks for sharing this interesting use case.
Grouping 16 million color codes into 117 clusters could be done with k-means clustering. The question is how best to represent the colors quantitatively: RGB, CMYK, or something else.
For instance, I simulated all 16 million (256*256*256) colors as RGB codes: rgb(206, 96, 51) represents #ce6033, and rgb(226, 105, 55) represents RapidMiner orange (#e26937). We could use Euclidean distance as the measure in the clustering model, k-means with k=117.
YY
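For readers who want to try the idea outside RapidMiner, here is a minimal pure-Python sketch of k-means on RGB triples with Euclidean distance. It is not the process posted above: it assumes a small random sample of 2,000 colours and k=16 so that it runs quickly, whereas the thread discusses all 16 million colours and k=117. The function names, seeds and iteration count are arbitrary choices for illustration.

```python
import random

def closest(point, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )

def kmeans(points, k, iterations=10, seed=2001):
    """Plain Lloyd's algorithm; returns k centroids as float tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[closest(p, centroids)].append(p)
        # New centroid = per-channel mean; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(ch) / len(c) for ch in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids

# Assumed demo data: 2,000 random RGB colours instead of the full 256^3 grid.
rng = random.Random(0)
sample = [tuple(rng.randrange(256) for _ in range(3)) for _ in range(2000)]

# Round the centroids to integer RGB values to get a reduced palette.
palette = [tuple(round(v) for v in c) for c in kmeans(sample, k=16)]

# Map any colour, e.g. #ce6033 = rgb(206, 96, 51), to its nearest palette entry.
print(palette[closest((206, 96, 51), palette)])
```

Setting `k=117` and feeding in the colours actually found in the images, rather than every possible colour, would give a palette tuned to the data and should run much faster than clustering all 16 million points.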