Outliers in a big dataset
Hello, I'm a total newbie with RapidMiner.
I have a big dataset with a numeric target and many (34) attributes.
I have to estimate the value of the target, and I will use a linear regression.
Now I want to detect outliers but RM freezes whenever I do this.
What is the best way to tackle this? Do I need to downsize the dataset with the Sample operator?
Or should I use the "Remove Useless Attributes" operator and maybe also downsize the dataset?
Best Answer
Telcontar120
Sampling is always a good way to start exploring a problem without running into long runtimes or out-of-memory issues.
I would highly recommend it. "Remove Useless Attributes" will only take out attributes that are constant or missing, so it probably isn't going to reduce your overall dataset size very much.
I would also explore some of the weighting operators to understand which attributes are related to your target label. Weight by Correlation is a good starting point if you are thinking of using a linear regression.
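If it helps, here is a minimal (untested) sketch of a process along those lines. The operator class names are standard RapidMiner Studio ones, but the repository path, the target attribute name "target", the sample size, and the top-k value are all placeholders you would replace with your own:

<?xml version="1.0" encoding="UTF-8"?>
<process version="9.10.001">
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true">
      <!-- Load the data; replace the repository path with your own -->
      <operator activated="true" class="retrieve" name="Retrieve">
        <parameter key="repository_entry" value="//Local Repository/data/my_data"/>
      </operator>
      <!-- Work on a small absolute sample first to keep runtimes manageable -->
      <operator activated="true" class="sample" name="Sample">
        <parameter key="sample" value="absolute"/>
        <parameter key="sample_size" value="1000"/>
      </operator>
      <!-- Mark the numeric target as the label; "target" is a placeholder name -->
      <operator activated="true" class="set_role" name="Set Role">
        <parameter key="attribute_name" value="target"/>
        <parameter key="target_role" value="label"/>
      </operator>
      <!-- Weight each attribute by its correlation with the label -->
      <operator activated="true" class="weight_by_correlation" name="Weight by Correlation"/>
      <!-- Keep only the k attributes with the highest weights -->
      <operator activated="true" class="select_by_weights" name="Select by Weights">
        <parameter key="weight_relation" value="top k"/>
        <parameter key="k" value="10"/>
      </operator>
      <!-- Train the regression on the reduced attribute set -->
      <operator activated="true" class="linear_regression" name="Linear Regression"/>
      <connect from_op="Retrieve" from_port="output" to_op="Sample" to_port="example set input"/>
      <connect from_op="Sample" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Weight by Correlation" to_port="example set"/>
      <connect from_op="Weight by Correlation" from_port="weights" to_op="Select by Weights" to_port="weights"/>
      <connect from_op="Weight by Correlation" from_port="example set" to_op="Select by Weights" to_port="example set input"/>
      <connect from_op="Select by Weights" from_port="example set output" to_op="Linear Regression" to_port="training set"/>
      <connect from_op="Linear Regression" from_port="model" to_port="result 1"/>
    </process>
  </operator>
</process>

"top k" in Select by Weights is just one convenient reduction rule; a weight threshold works too, and you can always revisit the cutoff after looking at the weights table.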
Answers
Hello @Mirte, welcome to the community! I'd recommend posting your XML process here (see "Read Before Posting" on the right when you reply) and attaching your dataset. That way we can replicate what you're doing and help you better.
Scott
This is an example of how I am doing it. My goal is to predict the target with a linear regression.
Am I doing this the right way?
I also have an additional question. If I want to estimate the value of the target attribute with linear regression, and there are so many attributes, what is the best way to identify the relevant attributes that influence the target variable, and how do I remove the other ones to make the dataset smaller?
Hi, thanks for posting. Detect Outlier is going to take much longer as the number of rows grows, because the distance-based detection compares examples pairwise, so the runtime grows much faster than linearly in the number of rows. Running your process with 1000 rows takes 4 seconds. Running with 2000 rows takes 18 seconds. Running with 3000 rows takes 76 seconds. You get the idea. It's a big-O thing.
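If it helps, here is a rough (untested) fragment showing a Sample in front of the outlier detection. I'm assuming the distance-based Detect Outlier (Distances) operator here, and the parameter values are placeholders to tune for your data:

      <!-- Reduce to a fixed number of rows before the expensive pairwise step -->
      <operator activated="true" class="sample" name="Sample">
        <parameter key="sample" value="absolute"/>
        <parameter key="sample_size" value="2000"/>
      </operator>
      <!-- Flags the examples whose distance to their k nearest neighbors is largest -->
      <operator activated="true" class="detect_outlier_distances" name="Detect Outlier (Distances)">
        <parameter key="number_of_neighbors" value="10"/>
        <parameter key="number_of_outliers" value="20"/>
      </operator>
      <connect from_op="Sample" from_port="example set output" to_op="Detect Outlier (Distances)" to_port="example set input"/>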
Scott