Add ability to calculate arbitrary percentile values (easily)
As any novice analyst knows, summarizing data with percentiles is part of basic exploratory data analysis. So I was very surprised to find that RapidMiner doesn't appear to have this functionality built in: I don't see any easy way to calculate the percentile values of a given numerical attribute. For example, in the quartile graph, the box is based on the 25th percentile, the median, and the 75th percentile, and the whiskers show the 5th and 95th percentiles (I believe). But there doesn't appear to be a straightforward way to get that same information numerically from the dataset. Ideally it would be done via an operator with an arbitrary percentile parameter (like in Excel), where you simply enter the percentile you want to see, from 1 to 100.
It should be set up so you can also access this percentile function from the aggregate menu, so you would have those values to compare to the average and median, which are available there now.
P.S. I know you can try to get at this by using the binning operator, but that is quite cumbersome and doesn't give you the output in a form that is easy to use. So I don't regard it as an adequate substitute.
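To make the request concrete, here is a minimal Python sketch of the kind of calculation being asked for. It is not RapidMiner code and not how any RapidMiner operator works; the function name `percentile`, the sample data, and the choice of linear interpolation between neighbouring ranks are all assumptions made for illustration.

```python
def percentile(values, p):
    """Return the p-th percentile (0 <= p <= 100) of a list of numbers.

    Assumption: linear interpolation between the two closest ranks.
    Other percentile definitions exist and can give slightly different results.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")

    ordered = sorted(values)
    # Fractional position of the requested percentile in the sorted list.
    rank = (p / 100) * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# Hypothetical attribute values, used only to show the call.
attribute = [12.0, 7.5, 3.2, 9.9, 15.4, 8.1, 6.0, 11.3]

# The values the quartile graph appears to be based on, as numbers.
summary = {p: percentile(attribute, p) for p in (5, 25, 50, 75, 95)}
for p, value in summary.items():
    print(f"{p}th percentile: {value:.3f}")
```

An operator with a single percentile parameter would correspond to one call of `percentile`; the aggregate-menu variant would apply the same calculation per group.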
Comments
Isn't this functionality in the Statistics Extension from Old World Computing?
I would like to see an operator that extracts statistics like this across the dataset in a summary table in a similar way to R's describe functions.
In addition an 'advanced statistics' tab would be very handy. @land think you'll be able to include any of this in a future update?
Hi,
John is right, that's part of the functionality offered by our Statistics Extension. Unfortunately it is not yet on the marketplace, but we plan to move it there as soon as possible.
You can get more information and a download link on our website oldworldcomputing.com
About the advanced statistics tab: I'm not sure that it would be easy to add, or that the additional benefit would outweigh the effort. I almost never use percentiles and use histograms instead (if I look at the data myself at all).
Greetings,
Sebastian
oh this is good news. Thanks, Sebastian. I have had to do percentiles in quite an archaic way. Can it do normal distribution probabilities and inverses? I keep hoping that it appears in the "calculator" for generate attributes one of these days...
Scott
Do you want to calculate the density of a normal distribution at a given point x? Or what are you referring to? I don't think that this is possible right now. You can check with a T-Test whether a population matches a given normal distribution, but not against a single point. But it shouldn't be difficult to compute. I think we can add this as a feature request for the next version.
Greetings,
Sebastian
yes exactly. The Excel equivalent would be the NORM.DIST and NORM.INV functions.
Scott
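For readers unfamiliar with the two Excel functions, the sketch below shows what they compute, using Python's standard-library `statistics.NormalDist`. It only illustrates the requested behaviour and says nothing about how the extension would implement it; the wrapper names `norm_dist` and `norm_inv` are hypothetical and simply mirror the Excel argument order.

```python
from statistics import NormalDist


def norm_dist(x, mean, standard_dev, cumulative):
    """Mirror of Excel's NORM.DIST: density or cumulative probability at x."""
    dist = NormalDist(mu=mean, sigma=standard_dev)
    # cumulative=True gives P(X <= x); cumulative=False gives the density at x.
    return dist.cdf(x) if cumulative else dist.pdf(x)


def norm_inv(probability, mean, standard_dev):
    """Mirror of Excel's NORM.INV: the x for which P(X <= x) equals probability."""
    return NormalDist(mu=mean, sigma=standard_dev).inv_cdf(probability)


# Example calls on a standard normal distribution (mean 0, standard deviation 1).
print(norm_dist(1.0, 0.0, 1.0, cumulative=True))   # probability of a value <= 1.0
print(norm_dist(1.0, 0.0, 1.0, cumulative=False))  # density at 1.0
print(norm_inv(0.975, 0.0, 1.0))                   # value below which 97.5% of the mass lies
```

The density variant is what Sebastian describes as the density at a given point x, while the inverse covers the "inverses" Scott asks about.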
Hi Scott,
Yes, that should be quite easy. We could include it in the next release of the extension. It would have been handy for me a few times as well.
Greetings,
Sebastian
see Statistics Extension