How to do a custom split by indices? (R script is too slow)
I need to split my data into a training set and a test set. This operation should be done automatically so that I can loop it for cross-validation.
I don't want to do stratified sampling. Instead, I wrote an R script that chooses instances by their group membership, which is computed by a regular expression.
But I can't use this script with the R extension because it takes forever to execute. The main objective of the script is to ensure that if one instance of a group is selected for the test set, all remaining instances of that group are selected too.
I came up with a quick workaround: I use R (outside of RM) to precompute an example set that lists all the IDs of my known set with a 0 or 1 next to each one, to signify whether it belongs to the test set. I can join this example set with my known set and then use the inTestSet? attribute to filter the rows for the test set.
Now I wonder if there is a better way. Is there an operator that can filter rows by a given list of indices?
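To make the workaround above concrete, here is a minimal sketch of the precomputation step. It is written in Python rather than R purely for illustration, and it is not the original script. The ID format, the regular expression in `group_key`, and the `test_fraction` and `seed` parameters are all assumptions. It derives a group from each ID, picks whole groups at random for the test set, and emits the 0/1 flag per ID.

```python
import random
import re
from collections import defaultdict


def group_key(example_id, pattern=r"^([A-Za-z]+)"):
    """Derive a group name from an ID (the pattern is a hypothetical example)."""
    match = re.match(pattern, example_id)
    return match.group(1) if match else example_id


def group_split_flags(ids, test_fraction=0.3, seed=42):
    """Return {id: 0 or 1}; every ID of a group receives the same flag."""
    groups = defaultdict(list)
    for example_id in ids:
        groups[group_key(example_id)].append(example_id)

    rng = random.Random(seed)      # fixed seed keeps the split reproducible
    names = sorted(groups)         # sort first so the shuffle is deterministic
    rng.shuffle(names)

    # test_fraction applies to the number of groups, not the number of rows
    n_test = max(1, round(len(names) * test_fraction))
    test_groups = set(names[:n_test])
    return {i: int(group_key(i) in test_groups) for i in ids}


ids = ["abc_1", "abc_2", "xyz_1", "xyz_2", "xyz_3", "qrs_1"]
flags = group_split_flags(ids)
for example_id in ids:
    print(example_id, flags[example_id])
```

The resulting table of IDs and flags is what would be joined to the known set, so the in-process work reduces to a join and a row filter on the flag attribute.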
Answers
As far as I understand your idea, you want to split your data set into several partitions and select a subset of them for the training process. I attached a process that selects a partition subset in two different ways. One filters examples based on a list of values that you have to set (a, b, c); the other delivers a specified number of partitions using random selection. For demonstration purposes I set the global random seed to -1, which makes it depend on the current time. If you use the whole thing inside a loop, please set the global random seed to a fixed value so that you can reproduce your process results. Cheers,
Helge
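The attached process itself is not reproduced in this thread. As a rough illustration only, the two ideas from the answer (filtering by a fixed list of values, and randomly assigning partitions with a fixed seed) might look like the following Python sketch. The row layout, the attribute name `group`, and the helper names `filter_by_values` and `assign_folds` are hypothetical and do not correspond to any RapidMiner operator.

```python
import random


def filter_by_values(rows, attribute, wanted):
    """Split rows into those whose attribute value is in `wanted` and the rest."""
    wanted = set(wanted)
    selected = [r for r in rows if r[attribute] in wanted]
    rest = [r for r in rows if r[attribute] not in wanted]
    return selected, rest


def assign_folds(group_names, k, seed):
    """Randomly assign each distinct group to one of k partitions."""
    rng = random.Random(seed)          # fixed seed: same folds on every run
    names = sorted(set(group_names))
    rng.shuffle(names)
    return {name: i % k for i, name in enumerate(names)}


rows = [
    {"id": 1, "group": "a"},
    {"id": 2, "group": "a"},
    {"id": 3, "group": "b"},
    {"id": 4, "group": "c"},
    {"id": 5, "group": "d"},
]

# Way 1: the test set is every row whose group is in a list you set by hand.
test_rows, train_rows = filter_by_values(rows, "group", ["a", "c"])

# Way 2: random partitions per group, reproducible because the seed is fixed.
folds = assign_folds([r["group"] for r in rows], k=2, seed=2024)
```

Inside a cross-validation loop, each partition number in turn would serve as the test set, and the fixed seed keeps the assignment identical between runs.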