Remove Duplicate Examples

gracewei · August 2019

Hi,

I'm working on some genetics data. I have 6151 examples and 157 attributes. My attributes are patient IDs and my examples are gene names. My goal is to transpose the matrix table. Here is a sample of my data set:

Image: https://us.v-cdn.net/6030995/uploads/editor/5z/r45vl2y2qby3.png

My problem now is I can't use the "Transpose" operator because there are duplicate row/example names. In order to transpose it, the attribute name needs to be unique. I wish to find all the pairs that have the same example names and edit their names. I was thinking about doing a loop, but I don't really know where to start and what operators to use to change the row names. Can somebody give me some advises on how to achieve this?

Thank you!

cdaponte · August 2019

You can use the Remove duplicates operator, and select the output that shows you the duplicates examples. Once you get the duplicates you can rename them with the operator "Rename" or "Replace".

lionelderkrikor · August 2019

Hi @gracewei,

Nice challenge, but honestly, I don't see any solution to perform automatically what you want to do with RapidMiner's native operator(s)...
... however there is a (relativ) simple solution using a Python script to perform this task.
Basically, the script add a number to the name of the duplicate and this number is incremented according to the number of duplicate(s) of a name.
Concretely the output example set looks like that :

Image: https://us.v-cdn.net/6030995/uploads/editor/kr/hzwqit3wy15q.png

After executing this process, all the names/values of the "gene_name" attribute are unique et thus you can transpose your exampleset...

To execute this process, you need to :
- Install Python on your computer
- Install the Python Scripting extension in RapidMiner (from the Marketplace)

The process :

<?xml version="1.0" encoding="UTF-8"?><process version="9.4.000-BETA">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.4.000-BETA" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" breakpoints="after" class="read_excel" compatibility="9.4.000-BETA" expanded="true" height="68" name="Read Excel" width="90" x="45" y="34">
        <parameter key="excel_file" value="D:\Lionel\Formations_DataScience\Rapidminer\Tests_Rapidminer\Rename_Duplicates\Rename_Duplicates.xlsx"/>
        <parameter key="sheet_selection" value="sheet number"/>
        <parameter key="sheet_number" value="1"/>
        <parameter key="imported_cell_range" value="A1"/>
        <parameter key="encoding" value="SYSTEM"/>
        <parameter key="first_row_as_names" value="true"/>
        <list key="annotations"/>
        <parameter key="date_format" value=""/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="locale" value="English (United States)"/>
        <parameter key="read_all_values_as_polynominal" value="false"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="gene_name.true.polynominal.attribute"/>
          <parameter key="1" value="Target.true.integer.attribute"/>
        </list>
        <parameter key="read_not_matching_values_as_missings" value="false"/>
        <parameter key="datamanagement" value="double_array"/>
        <parameter key="data_management" value="auto"/>
      </operator>
      <operator activated="true" class="python_scripting:execute_python" compatibility="9.3.000" expanded="true" height="103" name="Execute Python" width="90" x="313" y="34">
        <parameter key="script" value="import pandas&#10;from collections import Counter # Counter counts the number of occurrences of each item&#10;from itertools import tee, count&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def uniquify(seq, suffs = count(1)):&#10;    &quot;&quot;&quot;Make all the items unique by adding a suffix (1, 2, etc).&#10;&#10;    `seq` is mutable sequence of strings.&#10;    `suffs` is an optional alternative suffix iterable.&#10;    &quot;&quot;&quot;&#10;    not_unique = [k for k,v in Counter(seq).items() if v&gt;1] # so we have: ['name', 'zip']&#10;    # suffix generator dict - e.g., {'name': &lt;my_gen&gt;, 'zip': &lt;my_gen&gt;}&#10;    suff_gens = dict(zip(not_unique, tee(suffs, len(not_unique))))  &#10;    for idx,s in enumerate(seq):&#10;        try:&#10;            suffix = str(next(suff_gens[s]))&#10;        except KeyError:&#10;            # s was unique&#10;            continue&#10;        else:&#10;            seq[idx] += suffix&#10;&#10;def rm_main(data):&#10;&#10;  mylist = data['gene_name']            &#10;  uniquify(mylist, (f'_{x!s}' for x in range(1, 100)))&#10;  data['gene_name']  = mylist&#10;&#10;    # connect 2 output ports to see the results&#10;  return data"/>
        <parameter key="notebook_cell_tag_filter" value=""/>
        <parameter key="use_default_python" value="true"/>
        <parameter key="package_manager" value="conda (anaconda)"/>
      </operator>
      <connect from_op="Read Excel" from_port="output" to_op="Execute Python" to_port="input 1"/>
      <connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Hope this will help in the future ...

Regards,

Lionel

gracewei · August 2019

I just solved this by manually renaming the 11 duplicates. However, I'm still curious how this can be achieved using the operators!

yyhuang · August 2019

Hi @gracewei,

You can use rapidminer operators to transpose and rename. But the tricky part is to create a new list of name without duplicates

<?xml version="1.0" encoding="UTF-8"?><process version="9.3.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.3.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="utility:create_exampleset" compatibility="9.3.001" expanded="true" height="68" name="Create ExampleSet" width="90" x="45" y="85">
        <parameter key="generator_type" value="comma separated text"/>
        <parameter key="number_of_examples" value="100"/>
        <parameter key="use_stepsize" value="false"/>
        <list key="function_descriptions"/>
        <parameter key="add_id_attribute" value="false"/>
        <list key="numeric_series_configuration"/>
        <list key="date_series_configuration"/>
        <list key="date_series_configuration (interval)"/>
        <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="input_csv_text" value="gene-name,target-20,target-10,target-30&#10;NAME,&#9;1.0,&#9;-2.0,&#9;-1.0&#10;STATE,&#9;2.0,&#9;3.0,&#9;-2.0&#10;NAME,&#9;3.0,&#9;8.0,&#9;-3.0&#10;CITY,&#9;4.0,&#9;13.0,&#9;-4.0&#10;NAME,&#9;5.0,&#9;18.0,&#9;-5.0&#10;ZIP,&#9;6.0,&#9;23.0,&#9;-6.0&#10;ZIP,&#9;7.0,&#9;28.0,&#9;-7.0"/>
        <parameter key="column_separator" value=","/>
        <parameter key="parse_all_as_nominal" value="false"/>
        <parameter key="decimal_point_character" value="."/>
        <parameter key="trim_attribute_names" value="true"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.3.001" expanded="true" height="82" name="Set Role" width="90" x="179" y="85">
        <parameter key="attribute_name" value="gene-name"/>
        <parameter key="target_role" value="regular"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="generate_id" compatibility="9.3.001" expanded="true" height="82" name="Generate ID" width="90" x="313" y="85">
        <parameter key="create_nominal_ids" value="true"/>
        <parameter key="offset" value="0"/>
      </operator>
      <operator activated="true" breakpoints="after" class="generate_concatenation" compatibility="9.3.001" expanded="true" height="82" name="Generate Concatenation" width="90" x="447" y="85">
        <parameter key="first_attribute" value="gene-name"/>
        <parameter key="second_attribute" value="id"/>
        <parameter key="separator" value="_"/>
        <parameter key="trim_values" value="true"/>
      </operator>
      <operator activated="true" breakpoints="after" class="transpose" compatibility="9.3.001" expanded="true" height="82" name="Transpose" width="90" x="581" y="85"/>
      <operator activated="true" breakpoints="after" class="rename_by_example_values" compatibility="9.3.001" expanded="true" height="82" name="Rename by Example Values" width="90" x="715" y="85">
        <parameter key="row_number" value="5"/>
      </operator>
      <operator activated="true" class="filter_example_range" compatibility="9.3.001" expanded="true" height="82" name="Filter Example Range" width="90" x="849" y="85">
        <parameter key="first_example" value="4"/>
        <parameter key="last_example" value="4"/>
        <parameter key="invert_filter" value="true"/>
      </operator>
      <operator activated="true" class="parse_numbers" compatibility="9.3.001" expanded="true" height="82" name="Parse Numbers" width="90" x="983" y="85">
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="nominal"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="file_path"/>
        <parameter key="block_type" value="single_value"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="single_value"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="decimal_character" value="."/>
        <parameter key="grouped_digits" value="false"/>
        <parameter key="grouping_character" value=","/>
        <parameter key="infinity_representation" value=""/>
        <parameter key="unparsable_value_handling" value="fail"/>
      </operator>
      <connect from_op="Create ExampleSet" from_port="output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Generate ID" to_port="example set input"/>
      <connect from_op="Generate ID" from_port="example set output" to_op="Generate Concatenation" to_port="example set input"/>
      <connect from_op="Generate Concatenation" from_port="example set output" to_op="Transpose" to_port="example set input"/>
      <connect from_op="Transpose" from_port="example set output" to_op="Rename by Example Values" to_port="example set input"/>
      <connect from_op="Rename by Example Values" from_port="example set output" to_op="Filter Example Range" to_port="example set input"/>
      <connect from_op="Filter Example Range" from_port="example set output" to_op="Parse Numbers" to_port="example set input"/>
      <connect from_op="Parse Numbers" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Image: https://us.v-cdn.net/6030995/uploads/editor/4y/1z03onbnykz3.png

After transformation,

Cheers,

YY

Telcontar120 · August 2019

If you don't care about extra characters in the new id, then you could also simply use Generate ID to create a numerical index, and then use Generate Attributes to concatenate that with the existing Gene Name into a Gene Name / ID hybrid, and then use Transpose with that new field serving as the id. This is similar to yy's solution.

gracewei · August 2019

Thank you everyone! I will keep all of these answers in mind in the future!

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

Remove Duplicate Examples

Best Answers

Answers