importing data with null values
Legacy User
Member Posts: 0 Newbie
Is there a way to replace null values, or at least reject lines with nulls, during import?
I am trying to import a file with scattered missing values and I can only import up to the first omission.
The example for dealing with missing data I found in the tutorial has '?' in the data file for missing values. My data has nothing; here is an example of my data: the 1st & 3rd lines are complete, the 2nd line is missing the 1st & last columns.
N282WN,WN,978,91,91,1525,1630,65,308,2,-1
,WN,1114,91,91,1850,1955,65,308,2,
N207WN,WN,1182,91,91,1405,1510,65,308,2,-1
This is the error I get:
[Error] Data format error in line 393: the line does not provide the expected number of columns (was: 10, expected: 11)! Stop reading...
Thanks much!!
Answers
I copied your data into a plain text file and loaded it with the "SimpleExampleSource" operator (default settings) in RapidMiner 4.2. I had no problems; the operator recognized all missing values.
Idea: maybe line 393 of your data is corrupted, e.g. a comma is missing.
Hope this was helpful.
Steffen
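To check the "missing comma" idea outside RapidMiner, one option is to count the fields on every line before importing. The following is only a minimal sketch in Python (not a RapidMiner feature); the function name `find_bad_lines` is hypothetical, and the sample rows are the three lines quoted in the question.

```python
import csv
import io


def find_bad_lines(lines, expected_columns):
    """Return (row_number, field_count) for every row whose field count differs."""
    bad = []
    for row_number, row in enumerate(csv.reader(lines), start=1):
        if len(row) != expected_columns:
            bad.append((row_number, len(row)))
    return bad


# The three sample rows from the question; empty fields still count as fields.
sample = io.StringIO(
    "N282WN,WN,978,91,91,1525,1630,65,308,2,-1\n"
    ",WN,1114,91,91,1850,1955,65,308,2,\n"
    "N207WN,WN,1182,91,91,1405,1510,65,308,2,-1\n"
)
print(find_bad_lines(sample, expected_columns=11))
```

For a real file, an open file object (opened with `newline=""`) can be passed in place of `sample`. If this reports no bad rows, the file itself has the right number of separators and the problem is more likely in how the importer treats empty fields.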
Thank you for your help.
There are no missing commas. Could it have to do with the fact that one of the missing fields is at the end or beginning of the line? Is there an option I need to set?
I am using the community version 4.1.
I tried duplicating what you did. I switched from ExampleSource to SimpleExampleSource and copied the input data back off this post into a new file. I got a similar error. This is the error:
Error in: SimpleExampleSource (SimpleExampleSource) Could not read file ...\twig.txt': Number of columns in line 1 was unexpected, was: 10, expected: 11
Maybe it depends on the version. I remember something like this, but I am not sure...
Is there any specific reason you cannot switch to 4.2?
greetings
Steffen
You have the whole range of value replenishment options: replacing "unknown" values either by a constant (typically zero) or by the attribute's mean. There are also more sophisticated approaches where a learner trained on the complete values is used to guess the missing ones, but I have never been able to understand how that operator works and is organized. You can use the "Sparse array management" option in your (file/database) ExampleSource if needed.
This item could be a good wiki article in "data formats" ;D
Cheers,
Jean-Charles.
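To make the constant/mean replenishment idea above concrete, here is a minimal sketch in plain Python. It is not how the RapidMiner operators are implemented; the function name `fill_missing` and its parameters are hypothetical, and it assumes a missing value shows up as an empty string in a numeric column.

```python
import csv
import io
from statistics import mean


def fill_missing(rows, column, strategy="mean", constant=0.0):
    """Replace empty strings in one numeric column by the column mean or a constant."""
    present = [float(r[column]) for r in rows if r[column] != ""]
    # Fall back to the constant if no value is present to average.
    fill = mean(present) if strategy == "mean" and present else constant
    return [
        r[:column] + [fill if r[column] == "" else float(r[column])] + r[column + 1:]
        for r in rows
    ]


# Sample rows from the question; the last column (index 10) is empty in row 2.
sample = io.StringIO(
    "N282WN,WN,978,91,91,1525,1630,65,308,2,-1\n"
    ",WN,1114,91,91,1850,1955,65,308,2,\n"
    "N207WN,WN,1182,91,91,1405,1510,65,308,2,-1\n"
)
rows = list(csv.reader(sample))
for row in fill_missing(rows, column=10, strategy="mean"):
    print(row)
```

Passing `strategy="constant"` fills with `constant` instead. Missing nominal values (such as the empty first column) would need a different rule, for example a placeholder label.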
greetings
Steffen
Actually, there was a bug in versions < 4.2 when reading CSV-like data with missing values at the end of lines. The new version 4.2, which is now available on our web site, no longer contains this bug, and everything should work fine, as Steffen has pointed out. So I would suggest upgrading to RM 4.2.
Cheers,
Ingo
I have upgraded to 4.2 and the same error occurs. I have found that it happens when I have missing integer-type data, but not when I have missing nominal-type data. I am beginning to think this may be a follow-on to the bug in version 4.1.
Is there a way to have the import skip incomplete lines?
thank you.
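If the importer itself cannot skip incomplete lines, one workaround is to filter the file before importing it. The sketch below does that in Python, outside RapidMiner; the function name `drop_incomplete` is hypothetical, and "incomplete" is assumed to mean a wrong field count or any empty field.

```python
import csv
import io


def drop_incomplete(lines, expected_columns):
    """Yield only rows with the expected field count and no empty field."""
    for row in csv.reader(lines):
        if len(row) == expected_columns and all(field.strip() for field in row):
            yield row


# Sample rows from the question; the second row has two empty fields.
sample = io.StringIO(
    "N282WN,WN,978,91,91,1525,1630,65,308,2,-1\n"
    ",WN,1114,91,91,1850,1955,65,308,2,\n"
    "N207WN,WN,1182,91,91,1405,1510,65,308,2,-1\n"
)
cleaned = io.StringIO()
csv.writer(cleaned, lineterminator="\n").writerows(drop_incomplete(sample, 11))
print(cleaned.getvalue())
```

With real files, replace the two `StringIO` objects with file objects opened with `newline=""`, then import the cleaned file as usual.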
CSVExampleSource works fine.
Maybe it would have worked with the ExampleSource operator, too (both operators are basically the same, just with different parameter settings), so it might have something to do with the quoting, line trimming, or column separation parameters. Anyway: good to hear it works now ;D
Cheers,
Ingo