Easiest way to remove empty rows and attributes?

kayman · October 2018

Hi,

What would be the easiest way to remove empty columns / rows in a dataset?

My data is imported from source data that I can't easily modify before load so I prefer to deal with it within rapidminer.

What I'm looking at is a simple way to remove both Col1 and Row 1 in below example, but in reality I can have more than a few empty rows and columns in much more heavy tables. I've tried to use the filter missing / no missing attributes but it basically removes everything as soon as it finds a missing value so not exactly what I need.

	Col1	Col2	Col3
1	?	?	?
2	?	?	Some data
3	?	?	Some data
4	?	?	Some data
5	?	Some data	?

For now I have a fairly heavy workflow where I first create an id, then multiply my data and have one set where I replace all missing values with a 0, everything else with 1 and then use aggregates to sum and remove every row where sum = 0. Next I loop through all my attributes, do something similar and remove all columns where the aggregated sum is 0 again.

It does the trick but seems a bit overkill, so I'm wondering if I'm missing some easy way to deal with this.

tftemme · October 2018

Hi @kayman, Hi @mschmitz,

From my point of view this would be a handy addition to the Toolbox.

I will have a look, if I find time for this.

Best regards,
Fabian

Maerkli · October 2018

Hallo Kayman,

Could Select Attributes be a solution?

Maerkli

kayman · October 2018

Hi Maerkly.

I don't thinks so, the behaviour of the 'no_missing_values' seems to be that as soon as one value is missing you (rightfully) get a hit, so in this case also Col2 and Col3 would be impacted since they have at least one missing value.

I'd need kind of the opposite, as soon as there is a value it's ok, otherwise skip. Or something like 'if all values are missing' skip, otherwise keep.

Maerkli · October 2018

I am back, Kayman. It is just a suggestion: have a look at RM Studio/Repository/Training Resources/ETL-Advanced/Aggregations/Hotel App Aggregations & Pivoting Solution. With Pivot, you get another perspective on your data and you can apply another treatment like Replace Missing Values. Keep us posted.

Maerkli

Telcontar120 · October 2018

I think you may be making this more complicated than necessary. There are two simple ETL operators that can solve these issues. Remove Useless Attributes will do the trick for you to get rid of attributes (columns) that are all missing (either nominal or numerical), and Filter Examples has a pre-defined condition in the dropdown for "no missing attributes" that should get rid of all your blank examples (rows).

kayman · October 2018

Not really, as the missing attributes operator just looks if there is at least one missing value, where I need to have the option to see if ALL values are missing. I don't care for the occasional missing value, just if my whole example or column is missing.

So if I use the missing attributes operator and loop it through my collection, only the sets without any single missing attribute are returned, which is what the operator promised but not what I wanted to see...

Telcontar120 · October 2018

@kayman Got it, but I think it can still be done with the two operators I describe. In Filter Examples you just have to set the conditions to each attribute to be "is missing" with "match all" logic and then invert the operator. No looping is needed with either. I think that the the attached process does exactly what you want to do (using an exampleset similar to your original sample) with just the two operators I referenced.

<?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
 <context>
 <input/>
 <output/>
 <macros/>
 </context>
 <operator activated="true" class="process" compatibility="9.0.003" expanded="true" name="Process">
 <process expanded="true">
 <operator activated="true" breakpoints="after" class="operator_toolbox:create_exampleset" compatibility="1.5.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="34">
 <parameter key="generator_type" value="comma_separated_text"/>
 <list key="function_descriptions"/>
 <list key="numeric_series_configuration"/>
 <list key="date_series_configuration"/>
 <list key="date_series_configuration (interval)"/>
 <parameter key="input_csv_text" value="Row,Att1,Att2,Att3&#10;1,4,7,&#10;2,6,,&#10;3,5,8,&#10;,,,&#10;5,2,,&#10;"/>
 </operator>
 <operator activated="true" breakpoints="after" class="remove_useless_attributes" compatibility="9.0.003" expanded="true" height="82" name="Remove Useless Attributes" width="90" x="246" y="34"/>
 <operator activated="true" breakpoints="after" class="filter_examples" compatibility="9.0.003" expanded="true" height="103" name="Filter Examples" width="90" x="380" y="34">
 <parameter key="invert_filter" value="true"/>
 <list key="filters_list">
 <parameter key="filters_entry_key" value="Row.is_missing."/>
 <parameter key="filters_entry_key" value="Att1.is_missing."/>
 <parameter key="filters_entry_key" value="Att2.is_missing."/>
 </list>
 </operator>
 <connect from_op="Create ExampleSet" from_port="output" to_op="Remove Useless Attributes" to_port="example set input"/>
 <connect from_op="Remove Useless Attributes" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
 <connect from_op="Filter Examples" from_port="example set output" to_port="result 1"/>
 <portSpacing port="source_input 1" spacing="0"/>
 <portSpacing port="sink_result 1" spacing="0"/>
 <portSpacing port="sink_result 2" spacing="0"/>
 </process>
 </operator>
</process>

Telcontar120 · October 2018

And if you don't want to declare a separate condition for each attribute (maybe you have a lot of attributes in your real dataset) then you can get around that pretty easily by using Generate Aggregation to count the nonmissings and then filter purely based on that, as in the attached process.

<?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.0.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" breakpoints="after" class="operator_toolbox:create_exampleset" compatibility="1.5.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="34">
        <parameter key="generator_type" value="comma_separated_text"/>
        <list key="function_descriptions"/>
        <list key="numeric_series_configuration"/>
        <list key="date_series_configuration"/>
        <list key="date_series_configuration (interval)"/>
        <parameter key="input_csv_text" value="Row,Att1,Att2,Att3,Att4&#10;1,4,7,,&quot;yes&quot;&#10;2,6,,,&quot;no&quot;&#10;3,5,8,,&quot;yes&quot;&#10;,,,,&#10;5,,2,,&quot;green&quot;&#10;"/>
      </operator>
      <operator activated="true" breakpoints="after" class="trim" compatibility="9.0.003" expanded="true" height="82" name="Trim" width="90" x="246" y="34"/>
      <operator activated="true" breakpoints="after" class="remove_useless_attributes" compatibility="9.0.003" expanded="true" height="82" name="Remove Useless Attributes" width="90" x="380" y="34"/>
      <operator activated="true" breakpoints="after" class="generate_aggregation" compatibility="9.0.003" expanded="true" height="82" name="Generate Aggregation" width="90" x="514" y="34">
        <parameter key="attribute_name" value="count"/>
        <parameter key="aggregation_function" value="count"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="9.0.003" expanded="true" height="103" name="Filter Examples" width="90" x="648" y="34">
        <list key="filters_list">
          <parameter key="filters_entry_key" value="count.gt.0"/>
        </list>
      </operator>
      <connect from_op="Create ExampleSet" from_port="output" to_op="Trim" to_port="example set input"/>
      <connect from_op="Trim" from_port="example set output" to_op="Remove Useless Attributes" to_port="example set input"/>
      <connect from_op="Remove Useless Attributes" from_port="example set output" to_op="Generate Aggregation" to_port="example set input"/>
      <connect from_op="Generate Aggregation" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

kayman · October 2018

Yeah, but my data is unfortunatly not that simple :-)

I do not know up front how many attributes I have, let alone the name, so I'm back to looping through my content again.

Also, the remove useless atributes has the unwanted side effect that when my column actually contains too many times the same content it will also be removed.

So if I have something like

COL1,COL2,COL3

?,1,2

?,2,2

both COL1 and COL3 will be removed. Which is correct behaviour for the operator but again a bit too greedy for my purpose.

Seems afterall that for a simple problem it requires a complex solution...

MartinLiebig · October 2018

Hi,

i usually go for something like aggregate without missings - #examples and check if this is 0. If yes, i do data to weights and got something.

Maybe this is not too straight foward and we should put something into toolbox for it?

BR,

Martin

CC: @tftemme

kayman · October 2018

Hi Martin, that's my current strategy also (and even having some mongo in the middle :-))

I was just wondering if there were easier ways, as this looks to me like quite a common scenario.

I can help myself, but if you can add it in the toolbox it would be handy indeed.

tftemme · November 2018

Hi @kayman ,

We just released version 1.6.0 of the operator toolbox extension with the new operators
- Filter Attributes with Missing Values
- Filter Examples with Missing Values

They offer quite some possiblities to configure which Attributes/Examples should be kept (e.g. maximum absolut or relative number of missing, one non-missing etc) and should catch your problem.

Feel free to update ;-)

Here is the marketplace link: https://marketplace.rapidminer.com/UpdateServer/faces/product_details.xhtml?productId=rmx_operator_toolbox

Best regards
Fabian

kayman · November 2018

Works like a dream, thanks a lot !

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

Easiest way to remove empty rows and attributes?

Best Answer

Answers