Answers
I would use Optimize Parameters (Grid) for this.
You connect the incoming data to Optimize Parameters. Inside the Optimize Parameters subprocess you place a Sample operator and configure Optimize Parameters to try different settings of Sample; for example, you could sample 0.05, 0.1, 0.15 and so on of the original data set. Behind the Sample you put a Multiply operator followed by the three cross validations with the different models, and you use Log to record the performance from each of them together with the sampling parameter. You will get a Log output in the Results view that you can visualize, or you can use Log to Data after Optimize Parameters to turn it into a regular data table which you can export.
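For readers who want to prototype the same idea outside RapidMiner, here is a minimal Python/scikit-learn sketch of the loop described above: vary the sample fraction, cross-validate several models, and log the results into a table. The dataset, the three models, and the fraction grid are placeholder assumptions, not part of the original process.

# Rough analogue of Optimize Parameters (Grid) + Sample + Cross Validation + Log
# (illustration only; dataset, models and fractions below are assumptions)
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = load_breast_cancer(return_X_y=True)
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "k-NN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
}

rows = []
for frac in [0.05, 0.10, 0.15, 0.20, 0.50, 1.00]:      # the "Sample" parameter grid
    n = int(frac * len(X))
    Xs, ys = resample(X, y, n_samples=n, replace=False,
                      stratify=y, random_state=0)       # like the Sample operator
    for name, model in models.items():
        scores = cross_val_score(model, Xs, ys, cv=5)   # one cross validation per model
        rows.append({"sample_fraction": frac, "model": name,
                     "mean_accuracy": scores.mean()})

log = pd.DataFrame(rows)    # comparable to Log -> Log to Data in RapidMiner
print(log)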
Regards,
Balázs
Thank you for your response.
Would the suggested approach generate training and testing curves like those illustrated in the attached picture?
The performance output of a cross validation is the performance on the test set that wasn't used for building the model. This is the correct way to calculate the performance.
If you want to calculate the training performance, you can apply the model to its own training data and get the performance from that result. But in data science we consider that cheating: models should be tested on a test set, not on the training set.
I would actually expect the validation curve to also get better with more data. Where does this illustration come from? It looks strange to me.
You can generate these curves with varying training sample sizes, but I doubt you will get similar curves.
Another important factor for model performance, especially on the training set, is model complexity. That is what is on the X axis in most similar illustrations; it describes the phenomenon of training performance improving while test performance gets worse once the point of overfitting has been reached.
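As a hedged illustration of that complexity effect (placeholder data and parameter range, not taken from this thread), a scikit-learn validation curve over tree depth shows the training score rising while the cross-validated test score levels off or drops once the tree starts to overfit:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
depths = list(range(1, 16))                 # model complexity on the X axis
train_scores, test_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5)

for d, tr, te in zip(depths, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(f"max_depth={d:2d}  train={tr:.3f}  test={te:.3f}")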
Regards,
Balázs
This is what I would do. I had to do it in two steps; probably someone here more knowledgeable than me can do it in one step. In Process 1 (not shown below) I split the famous diamonds dataset (from ggplot2) into diamonds1 (80%) and diamonds2 (20%). These are the datasets used in the process below.
Balázs: the learning curve is a tool to diagnose overfitting (Andrew Ng made it famous). It requires computing both the training error and the test error. When TestError >> TrainingError, this is taken as a sign of overfitting. You could then do two things to fix it: simplify your model or get more data. There used to be an operator in RapidMiner to graph learning curves.
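As a small sketch of that diagnostic (not the process attached below; dataset and model are placeholder assumptions), scikit-learn's learning_curve returns both scores, so the gap between training error and test error can be inspected directly:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(max_depth=4, random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

train_err = 1 - train_scores.mean(axis=1)   # training error
test_err = 1 - test_scores.mean(axis=1)     # cross-validated test error
for n, tr, te in zip(sizes, train_err, test_err):
    print(f"n={n:4d}  TrainingError={tr:.3f}  TestError={te:.3f}")
# TestError >> TrainingError is the overfitting signal described above:
# simplify the model or get more data.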
Hope this helps.
\Ernesto
P.S. The graph I get for the learning curve is in the attached image; the process XML is below:
<?xml version="1.0" encoding="UTF-8"?><process version="10.1.001">