Outliers in a big dataset
Hello, I'm a total newbie with RapidMiner.
I have a big dataset with a numeric target and many (34) attributes.
I have to estimate the value of the target, and I will use a linear regression.
Now I want to detect outliers but RM freezes whenever I do this.
What is the best way to tackle this? Do I need to downsize the dataset with the Sample operator?
Or should I use the "Remove Useless Attributes" operator and maybe also downsize the dataset?
Best Answer
Telcontar120
Sampling is always a good way to start exploring a problem without running into long runtimes or out-of-memory issues.
I would highly recommend it. "Remove Useless Attributes" will only take out attributes that are constant or missing, so it probably isn't going to reduce your overall dataset size very much.
I would also explore some of the weighting operators to understand which attributes are related to your target label. Weight by Correlation is a good starting point if you are thinking of using a linear regression.
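If it helps, here is a minimal (untested) sketch of a process along those lines. The operator class names are standard RapidMiner Studio ones, but the repository path, the target attribute name "target", the sample size, and the top-k value are all placeholders you would replace with your own:

<?xml version="1.0" encoding="UTF-8"?>
<process version="9.10.001">
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true">
      <!-- Load the data; replace the repository path with your own -->
      <operator activated="true" class="retrieve" name="Retrieve">
        <parameter key="repository_entry" value="//Local Repository/data/my_data"/>
      </operator>
      <!-- Work on a small absolute sample first to keep runtimes manageable -->
      <operator activated="true" class="sample" name="Sample">
        <parameter key="sample" value="absolute"/>
        <parameter key="sample_size" value="1000"/>
      </operator>
      <!-- Mark the numeric target as the label; "target" is a placeholder name -->
      <operator activated="true" class="set_role" name="Set Role">
        <parameter key="attribute_name" value="target"/>
        <parameter key="target_role" value="label"/>
      </operator>
      <!-- Weight each attribute by its correlation with the label -->
      <operator activated="true" class="weight_by_correlation" name="Weight by Correlation"/>
      <!-- Keep only the k attributes with the highest weights -->
      <operator activated="true" class="select_by_weights" name="Select by Weights">
        <parameter key="weight_relation" value="top k"/>
        <parameter key="k" value="10"/>
      </operator>
      <!-- Train the regression on the reduced attribute set -->
      <operator activated="true" class="linear_regression" name="Linear Regression"/>
      <connect from_op="Retrieve" from_port="output" to_op="Sample" to_port="example set input"/>
      <connect from_op="Sample" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Weight by Correlation" to_port="example set"/>
      <connect from_op="Weight by Correlation" from_port="weights" to_op="Select by Weights" to_port="weights"/>
      <connect from_op="Weight by Correlation" from_port="example set" to_op="Select by Weights" to_port="example set input"/>
      <connect from_op="Select by Weights" from_port="example set output" to_op="Linear Regression" to_port="training set"/>
      <connect from_op="Linear Regression" from_port="model" to_port="result 1"/>
    </process>
  </operator>
</process>

"top k" in Select by Weights is just one convenient reduction rule; a weight threshold works too, and you can always revisit the cutoff after looking at the weights table.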
Answers
Hello @Mirte, welcome to the community! I'd recommend posting your XML process here (see "Read Before Posting" on the right when you reply) and attaching your dataset. That way we can replicate what you're doing and help you better.
Scott
This is an example of how I am doing it. My goal is to predict the target with a linear regression.
Am I doing this the right way?
I also have an additional question. If I want to estimate the value of the target attribute with linear regression, and there are so many attributes, what is the best way to identify the relevant attributes that influence the target variable, and how do I remove the other ones to make the dataset smaller?
Hi, thanks for posting. Detect Outlier is going to take much longer as the number of rows grows, because the distance-based detection compares examples pairwise, so the runtime grows much faster than linearly in the number of rows. Running your process with 1000 rows takes 4 seconds. Running with 2000 rows takes 18 seconds. Running with 3000 rows takes 76 seconds. You get the idea. It's a big-O thing.
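If it helps, here is a rough (untested) fragment showing a Sample in front of the outlier detection. I'm assuming the distance-based Detect Outlier (Distances) operator here, and the parameter values are placeholders to tune for your data:

      <!-- Reduce to a fixed number of rows before the expensive pairwise step -->
      <operator activated="true" class="sample" name="Sample">
        <parameter key="sample" value="absolute"/>
        <parameter key="sample_size" value="2000"/>
      </operator>
      <!-- Flags the examples whose distance to their k nearest neighbors is largest -->
      <operator activated="true" class="detect_outlier_distances" name="Detect Outlier (Distances)">
        <parameter key="number_of_neighbors" value="10"/>
        <parameter key="number_of_outliers" value="20"/>
      </operator>
      <connect from_op="Sample" from_port="example set output" to_op="Detect Outlier (Distances)" to_port="example set input"/>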
Scott