Bringing arbitrary mathematical functions to RapidMiner for generating data sets
likeasir001
Member Posts: 2 Contributor I
Hello,
I have just started writing a thesis at my university where I am supposed to make an analysis of the t-test and my first assignment is to get a bit more familiar with the t-Test Operator which is implemented in RapidMiner and how it actually works.
I should probably mention right away that my knowledge in statistics and hypothesis testing is still rather limited at this point because I am studying mechanical engineering and statistics are not really a big part of our curriculum.
So what I would like to do right now is:
1) generate a data set using a mathematical test function of my choice
2) add noise on that previously created data set
3) build estimation models using the already implemented learning functions like linear regression/polynomial regression etc. and use cross-validation for performance evaluation
3.2) also import the mathematical function that was used before to generate the data for performance measurement
4) perform a t-test using the performance results provided by the different cross-validation operators
So basically what I want to do is to generate data from a mathematical function, add some noise onto that data and then see how well the estimation performance turns out to be if I use the same function that was used to generate the data for performance evaluation.
Let me explain step by step and point out where I need help:
1) generate a data set using a mathematical test function of my choice
I know that I could also do this using Excel and then import the Excel sheet into RapidMiner, but I would like to know if there is a way to directly import/implement a mathematical function.
For example, the Rosenbrock function, F(x,y) = (a-x)² + b*(y-x²)²,
or the three-hump camel function, F(x,y) = 2x² - 1.05*x⁴ + x⁶/6 + x*y + y²
I found the operator "Generate Data by User Specification", but unfortunately this operator creates exactly one example, and looping it didn't really work because I could not find a way to combine all the examples generated by the loop operator into one big Excel sheet.
The standard "Generate Data" operator lets me choose from a range of preset functions. I thought about tweaking the Java code of that operator to replace one of the presets with the function I want, but I am not that familiar with Java either, and I don't know how I would have to change the program so that it lets me set two different value ranges for the two variables x and y. The Generate Data operator only allows setting one range for all attributes.
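To make the goal of step 1 concrete, here is a minimal sketch in plain Python (outside RapidMiner) of the data set I have in mind: each variable is sampled from its own range and the function value is stored next to it. The parameter defaults a=1 and b=100, the ranges, the sample count, and the names generate_examples and "function" are my own assumptions for illustration, not anything RapidMiner provides.

```python
import random


def rosenbrock(x, y, a=1.0, b=100.0):
    # a=1, b=100 are commonly used values; they are an assumption here.
    return (a - x) ** 2 + b * (y - x ** 2) ** 2


def three_hump_camel(x, y):
    return 2 * x ** 2 - 1.05 * x ** 4 + x ** 6 / 6 + x * y + y ** 2


def generate_examples(func, x_range, y_range, n, seed=0):
    """Sample n points with a separate value range per variable."""
    rng = random.Random(seed)  # fixed seed so the data set is reproducible
    rows = []
    for _ in range(n):
        x = rng.uniform(*x_range)
        y = rng.uniform(*y_range)
        rows.append({"x": x, "y": y, "function": func(x, y)})
    return rows


# Hypothetical ranges: x in [-2, 2], y in [-1, 3].
examples = generate_examples(rosenbrock, (-2.0, 2.0), (-1.0, 3.0), n=200)
```

Written to a CSV file, such a list could be imported like the Excel sheet mentioned above, but I would still prefer a way to do this inside RapidMiner.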
2) adding noise on the previously generated data
Here I am planning on using the "add noise" operator so that should not be a problem once I have my data set.
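What I mean by noise is roughly the following, sketched in plain Python. I am assuming zero-mean Gaussian noise on the function value only; the function name add_gaussian_noise and the value of sigma are made up for illustration, and the "add noise" operator may well work differently.

```python
import random


def add_gaussian_noise(values, sigma, seed=0):
    """Return a copy of values with zero-mean Gaussian noise added to each entry."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    return [v + rng.gauss(0.0, sigma) for v in values]


clean = [0.0, 1.0, 4.0, 9.0]                   # hypothetical noise-free function values
noisy = add_gaussian_noise(clean, sigma=0.1)   # sigma=0.1 is an arbitrary choice
```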
3.1) performance evaluation using already existing regression operators etc.
This should also cause no trouble because here I would only use operators that already exist within RapidMiner.
3.2) performance evaluation using the function that was originally used to generate the data set
This is the second part where I need some help. I know that there is a function called "import model" where I can import for example an xml file which contains my previously used function as a model, but how exactly can I generate such a model-xml file in RapidMiner? Is there some sort of tool or operator that directly "converts" a mathematical function into an equivalent model?
4) performing a t-test
Here I might also need help, but that depends on the outcome of the previous steps, so it doesn't make much sense to cover it right now.
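For orientation only, this is how I currently understand a paired t-test on per-fold performance values, sketched in plain Python. I do not know whether the t-Test operator in RapidMiner uses the paired variant or a different one, so this is an assumption; the name paired_t_statistic and the sample values are made up.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for paired samples of equal length.

    Raises ZeroDivisionError if all pairwise differences are identical.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both score lists must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)  # sample standard deviation
    return mean_diff / std_err, n - 1


# Hypothetical per-fold errors of two models on the same folds.
t_value, dof = paired_t_statistic([2.0, 4.0, 6.0, 8.0], [1.0, 2.0, 3.0, 4.0])
# t_value is about 3.873 with dof = 3
```

The t value would then be compared against a t-distribution with the given degrees of freedom to obtain a p-value.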
I would really appreciate some help and I hope my attempt to explain what I am trying to do was comprehensive enough.
Answers
The 'Generate Attributes' operator is the one you want for creating arbitrary functions. Start with an example set containing x and y attributes and create new attributes however you want. For an example, you could copy this process for some pointers.
http://rapidminernotes.blogspot.co.uk/2014/08/mandelbrot.html
You could also add noise as well as calculate other goodness of fit measures using this operator. In fact, one of the advanced videos I recently completed fits an optimum function to some real data using an evolutionary approach. It calculates a global error for a function compared to the data and minimises this by trying different parameters for the function. The heart of this process is 'generate attributes'.
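The general idea of minimising a global error by trying different parameters can be sketched outside RapidMiner in plain Python. This is not the process from the video; it is a minimal (1+1) evolutionary search under assumed settings, and the names model, global_error and evolve, the step size, and the synthetic data are all hypothetical.

```python
import random


def model(x, y, a, b):
    # Rosenbrock form with free parameters a and b.
    return (a - x) ** 2 + b * (y - x ** 2) ** 2


def global_error(params, data):
    """Mean squared error of the model over all (x, y, target) rows."""
    a, b = params
    return sum((model(x, y, a, b) - t) ** 2 for x, y, t in data) / len(data)


def evolve(data, start, steps=2000, sigma=0.5, seed=0):
    """Mutate the best parameters and keep a candidate only if it lowers the error."""
    rng = random.Random(seed)
    best = list(start)
    best_err = global_error(best, data)
    for _ in range(steps):
        candidate = [p + rng.gauss(0.0, sigma) for p in best]
        err = global_error(candidate, data)
        if err < best_err:
            best, best_err = candidate, err
    return best, best_err


# Synthetic data generated with a=1, b=100 (assumed values).
rng = random.Random(1)
data = []
for _ in range(50):
    x, y = rng.uniform(-2, 2), rng.uniform(-1, 3)
    data.append((x, y, model(x, y, 1.0, 100.0)))

start = [0.0, 1.0]
start_err = global_error(start, data)
best, best_err = evolve(data, start)
```

Because a candidate is only accepted when it improves the error, the final error can never be worse than the starting one; how close the result gets to the true parameters depends on the step size and the number of steps.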
Andrew
Edit: Well I just found out that your videos seem to be part of an online course that unfortunately is not free.
So I managed to use the "Generate Attributes" operator and now have the example set I need.
It basically has three columns now: X, Y and "function", where "function" could be any mathematical expression, for example (x+y)^2.
Now I have added noise to that data, and the next step would be to perform a cross-validation for performance evaluation. But instead of learning a new function using existing operators like "Linear Regression" etc., I would like the X-Validation operator to use the mathematical function that I previously created, for example the aforementioned (x+y)^2, applied to the values X and Y from my example set, and compare the result with the "function" attribute from my example set.
The X-Validation Operator asks for a model as the output of the Training section so I am looking for a way to transform a mathematical function of my choice into a "model" so that it can be used within the X-Validation operator.
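To show what I mean by a function acting as a "model", here is a minimal sketch in plain Python. The class FixedFunctionModel, its fit/predict methods, the RMSE measure, the fold count and the noise level are my own assumptions for illustration; none of this is RapidMiner's API.

```python
import math
import random


def true_function(x, y):
    return (x + y) ** 2


class FixedFunctionModel:
    """A 'model' that learns nothing and always applies a known formula."""

    def __init__(self, func):
        self.func = func

    def fit(self, rows):
        return self  # training data is ignored on purpose

    def predict(self, row):
        return self.func(row["x"], row["y"])


def rmse(model, rows):
    return math.sqrt(
        sum((model.predict(r) - r["function"]) ** 2 for r in rows) / len(rows)
    )


def cross_validate(model, rows, k=10, seed=0):
    """Return one RMSE per fold of a shuffled k-fold split."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i, test_fold in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(rmse(model.fit(train), test_fold))
    return scores


# Hypothetical noisy example set with columns x, y and "function".
rng = random.Random(42)
rows = []
for _ in range(100):
    x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
    rows.append({"x": x, "y": y, "function": true_function(x, y) + rng.gauss(0.0, 0.1)})

scores = cross_validate(FixedFunctionModel(true_function), rows, k=10)
```

The per-fold errors of this fixed model would then reflect only the added noise, and they could be compared with the per-fold errors of the learned regression models.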
P.S.: I am probably not going to use special values like pi or e (at least for now), if that is somehow relevant.