Easiest way to remove empty rows and attributes?
Hi,
What would be the easiest way to remove empty columns / rows in a dataset?
My data is imported from source data that I can't easily modify before load so I prefer to deal with it within rapidminer.
What I'm looking at is a simple way to remove both Col1 and Row 1 in below example, but in reality I can have more than a few empty rows and columns in much more heavy tables. I've tried to use the filter missing / no missing attributes but it basically removes everything as soon as it finds a missing value so not exactly what I need.
Col1 | Col2 | Col3 | |
1 | ? | ? | ? |
2 | ? | ? | Some data |
3 | ? | ? | Some data |
4 | ? | ? | Some data |
5 | ? | Some data | ? |
For now I have a fairly heavy workflow where I first create an id, then multiply my data and have one set where I replace all missing values with a 0, everything else with 1 and then use aggregates to sum and remove every row where sum = 0. Next I loop through all my attributes, do something similar and remove all columns where the aggregated sum is 0 again.
It does the trick but seems a bit overkill, so I'm wondering if I'm missing some easy way to deal with this.
Best Answer
-
tftemme Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member Posts: 164 RM Research
Answers
Hallo Kayman,
Could Select Attributes be a solution?
Maerkli
Hi Maerkly.
I don't thinks so, the behaviour of the 'no_missing_values' seems to be that as soon as one value is missing you (rightfully) get a hit, so in this case also Col2 and Col3 would be impacted since they have at least one missing value.
I'd need kind of the opposite, as soon as there is a value it's ok, otherwise skip. Or something like 'if all values are missing' skip, otherwise keep.
I am back, Kayman. It is just a suggestion: have a look at RM Studio/Repository/Training Resources/ETL-Advanced/Aggregations/Hotel App Aggregations & Pivoting Solution. With Pivot, you get another perspective on your data and you can apply another treatment like Replace Missing Values. Keep us posted.
Maerkli
I think you may be making this more complicated than necessary. There are two simple ETL operators that can solve these issues. Remove Useless Attributes will do the trick for you to get rid of attributes (columns) that are all missing (either nominal or numerical), and Filter Examples has a pre-defined condition in the dropdown for "no missing attributes" that should get rid of all your blank examples (rows).
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Not really, as the missing attributes operator just looks if there is at least one missing value, where I need to have the option to see if ALL values are missing. I don't care for the occasional missing value, just if my whole example or column is missing.
So if I use the missing attributes operator and loop it through my collection, only the sets without any single missing attribute are returned, which is what the operator promised but not what I wanted to see...
@kayman Got it, but I think it can still be done with the two operators I describe. In Filter Examples you just have to set the conditions to each attribute to be "is missing" with "match all" logic and then invert the operator. No looping is needed with either. I think that the the attached process does exactly what you want to do (using an exampleset similar to your original sample) with just the two operators I referenced.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
And if you don't want to declare a separate condition for each attribute (maybe you have a lot of attributes in your real dataset) then you can get around that pretty easily by using Generate Aggregation to count the nonmissings and then filter purely based on that, as in the attached process.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Yeah, but my data is unfortunatly not that simple :-)
I do not know up front how many attributes I have, let alone the name, so I'm back to looping through my content again.
Also, the remove useless atributes has the unwanted side effect that when my column actually contains too many times the same content it will also be removed.
So if I have something like
COL1,COL2,COL3
?,1,2
?,2,2
both COL1 and COL3 will be removed. Which is correct behaviour for the operator but again a bit too greedy for my purpose.
Seems afterall that for a simple problem it requires a complex solution...
Hi,
i usually go for something like aggregate without missings - #examples and check if this is 0. If yes, i do data to weights and got something.
Maybe this is not too straight foward and we should put something into toolbox for it?
BR,
Martin
CC: @tftemme
Dortmund, Germany
Hi Martin, that's my current strategy also (and even having some mongo in the middle :-))
I was just wondering if there were easier ways, as this looks to me like quite a common scenario.
I can help myself, but if you can add it in the toolbox it would be handy indeed.
We just released version 1.6.0 of the operator toolbox extension with the new operators
- Filter Attributes with Missing Values
- Filter Examples with Missing Values
They offer quite some possiblities to configure which Attributes/Examples should be kept (e.g. maximum absolut or relative number of missing, one non-missing etc) and should catch your problem.
Feel free to update ;-)
Here is the marketplace link: https://marketplace.rapidminer.com/UpdateServer/faces/product_details.xhtml?productId=rmx_operator_toolbox
Best regards
Fabian