Copy Dataset Properties
This is an "is it possible, and if not, what is the best way to handle this" type of question.
My use case is one where I have two files: a training file and a validation set. The training file is meant to fit the model, and the validation set has the same columns minus the label. I am doing a decent amount of preprocessing and want to reuse that work.
I am hitting a roadblock because when I use Read CSV on the validation set, the detected data type for a given column differs between the files (train = polynominal, test = integer). Even though I can carry the preprocessing steps forward via Apply Model, the column is not dummy encoded by the Nominal to Numerical operator I am carrying forward. As a result, applying the model to the validation set fails because the column is not present.
I know that I could fix the file manually on load or via an operator, but I am wondering if there is a way to "copy data types" when columns share the same name. I would prefer that this type of error not happen during my in-class data competitions, and with a dataset that has 50 columns, my end goal is to avoid having the students check column types one by one.
Best Answer
BalazsBarany Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
Hi,
"copy data types" means different things in different contexts.
In the case of CSV files, it's always best to set the data types explicitly, the way you know them to be correct. Sooner or later you'll encounter a file whose content makes the automatic detection change its decision (as you already did).
To reuse this, create a process with just the Read CSV operator, set it up using the wizard, and then connect the process input (left side: "inp") to the "fil" input of the Read CSV operator. You can then use this process as a subprocess in another one, and it will read every file you send it in the same way. Use Open File in the calling process to access your file (training or validation).
This approach could also be used with Read Excel.
When reading from databases, you can specify the data set structure in the query, and so on.
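For readers who want to see the idea outside of RapidMiner, here is a minimal Python sketch of "declare the column types once, then read every file the same way". It is an analogy only, not what Read CSV does internally; the `SCHEMA` dictionary, the column names, the sample data, and the `read_with_schema` helper are all made up for illustration.

```python
import csv
import io

# Hypothetical schema: column name -> converter, declared once and reused.
# "zip_code" is deliberately kept as text even when it looks like a number.
SCHEMA = {"zip_code": str, "age": int, "income": float}


def read_with_schema(file_obj, schema):
    """Read CSV rows and convert each known column with its declared type.

    Columns that are not in the schema (such as the label) stay as text.
    """
    rows = []
    for raw in csv.DictReader(file_obj):
        rows.append({
            name: schema[name](value) if name in schema else value
            for name, value in raw.items()
        })
    return rows


# Made-up sample data: the validation file has no label column, and its
# zip codes happen to be all digits, which would mislead type guessing.
train_csv = "zip_code,age,income,label\nA1B,34,52000.5,yes\n02139,51,61000,no\n"
valid_csv = "zip_code,age,income\n90210,29,48000\n10001,45,75000.25\n"

train_rows = read_with_schema(io.StringIO(train_csv), SCHEMA)
valid_rows = read_with_schema(io.StringIO(valid_csv), SCHEMA)
```

Because both files go through the same reader with the same schema, a column cannot end up nominal in one file and integer in the other, which is the point of the reusable Read CSV subprocess described above.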
As Martin wrote, when you want to apply the same preprocessing, you use the "pre" output to get the preprocessing model, and ideally Group Models to combine the preprocessing and the actual prediction model into one integrated model.
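Conceptually, a grouped model is just a list of fitted steps that are applied in order. The Python sketch below illustrates that idea under stated assumptions: the `OneHotEncoder`, `LookupModel`, and `GroupedModel` classes are hypothetical toy stand-ins written for this example, not RapidMiner classes, and the data is made up.

```python
from collections import Counter


class OneHotEncoder:
    """Toy preprocessing model: remembers the categories seen in training."""

    def fit(self, values):
        self.categories = sorted(set(values))
        return self

    def apply(self, values):
        # One 0/1 column per training category; unseen values become all zeros.
        return [tuple(int(v == c) for c in self.categories) for v in values]


class LookupModel:
    """Toy classifier: predicts the most common training label per encoded row."""

    def fit(self, rows, labels):
        counts = {}
        for row, label in zip(rows, labels):
            counts.setdefault(row, Counter())[label] += 1
        self.table = {row: c.most_common(1)[0][0] for row, c in counts.items()}
        # Fall back to the overall majority label for rows never seen in training.
        self.default = Counter(labels).most_common(1)[0][0]
        return self

    def apply(self, rows):
        return [self.table.get(row, self.default) for row in rows]


class GroupedModel:
    """Applies the fitted steps in order, so preprocessing cannot be skipped."""

    def __init__(self, *steps):
        self.steps = steps

    def apply(self, data):
        for step in self.steps:
            data = step.apply(data)
        return data


# Made-up training data for a single nominal column.
train_values = ["red", "blue", "red", "green", "red"]
train_labels = ["yes", "no", "yes", "no", "yes"]

encoder = OneHotEncoder().fit(train_values)
model = LookupModel().fit(encoder.apply(train_values), train_labels)
grouped = GroupedModel(encoder, model)

# The validation data only ever needs the one grouped model.
predictions = grouped.apply(["blue", "red", "purple"])
```

Note that the encoder matches values against the categories it saw in training. If the validation column were read with a different type (for example the integer `1` instead of the text `"1"`), nothing would match, which is why fixing the types at read time still matters.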
Regards,
Balázs
Answers
Nominal to Numerical has a preprocessing model. You can group this with your prediction model, so that you always apply both at the same time.
Best,
Martin
Dortmund, Germany