importing data with null values

Legacy User · July 2008

Is there a way to replace null values, or at least reject lines with nulls, during import?

I am trying to import a file with scattered missing values and I can only import up to the first omission.

The example for dealing with missing data that I found in the tutorial uses '?' in the data file to mark missing values. My data has nothing at all in those positions. Here is an example: the 1st and 3rd lines are complete, and the 2nd line is missing the 1st and last columns.
N282WN,WN,978,91,91,1525,1630,65,308,2,-1
,WN,1114,91,91,1850,1955,65,308,2,
N207WN,WN,1182,91,91,1405,1510,65,308,2,-1

This is the error I get:
[Error] Data format error in line 393: the line does not provide the expected number of columns (was: 10, expected: 11)! Stop reading...

Thanks much!!

steffen · July 2008

Hello b2

I copied your data into a simple text file and loaded it with the "SimpleExampleSource" operator, using its default settings, in RapidMiner 4.2. I had no problems; the operator recognized all missing values.

Idea: maybe line 393 of your data is corrupted, e.g. a comma is missing.
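One way to check that idea outside RapidMiner is to count the columns on every line before importing. The following is only a minimal Python sketch, not a RapidMiner feature. The function name `find_bad_lines` is made up for illustration, and the third sample line is a hypothetical corrupted line (its last comma and value are removed), not a line from the original file.

```python
import csv
import io


def find_bad_lines(text, expected_columns, delimiter=","):
    """Return (line_number, column_count) for each non-empty row
    whose column count differs from expected_columns."""
    bad = []
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for row in reader:
        # Blank lines come back as empty lists; skip them.
        if row and len(row) != expected_columns:
            # reader.line_num is the 1-based number of the line just read.
            bad.append((reader.line_num, len(row)))
    return bad


sample = (
    "N282WN,WN,978,91,91,1525,1630,65,308,2,-1\n"
    ",WN,1114,91,91,1850,1955,65,308,2,\n"
    # Hypothetical corrupted line: the final comma and value are missing.
    "N207WN,WN,1182,91,91,1405,1510,65,308,2\n"
)

print(find_bad_lines(sample, expected_columns=11))  # [(3, 10)]
```

Note that the second sample line, with empty first and last fields, still counts as 11 columns here, so an empty field at the start or end of a line is not by itself a column-count error.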

hope this was helpful

Steffen

Legacy User · July 2008

Steffen,

Thank you for your help.

There are no missing commas. Could it have to do with the fact that one of the missing fields is at the end or beginning of the line? Is there an option I need to set?

I am using the community version 4.1.

I tried duplicating what you did. I switched from ExampleSource to SimpleExampleSource and copied the input data back off this post into a new file. I got a similar error. This is the error:
Error in: SimpleExampleSource (SimpleExampleSource) Could not read file ...\twig.txt': Number of columns in line 1 was unexpected, was: 10, expected: 11

steffen · July 2008

Hello b2

Maybe it depends on the version. I remember something like this, but I am not sure.
Is there any specific reason you cannot switch to 4.2?

greetings

Steffen

Legacy User · July 2008

Hi Steffen,

Everything you need for value replenishment is there: you can replace "unknown" values in the metadata either with a constant (typically zero) or with the attribute's mean. There are also more sophisticated approaches, where a learner trained on complete values is used to guess the missing values, but I have never been able to understand how that operator works and is organized. You can use the "Sparse array management" option in your (file/database) ExampleSource if needed.
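To make the two simple strategies concrete, here is a minimal sketch in plain Python. It is not the RapidMiner operator itself; the function name `replace_missing` and its parameters are assumptions for illustration only, and missing values are represented as `None`.

```python
def replace_missing(values, strategy="mean", constant=0.0):
    """Return a copy of values with None entries filled in.

    strategy="constant" fills with the given constant;
    strategy="mean" fills with the mean of the known values.
    """
    if strategy == "constant":
        fill = constant
    elif strategy == "mean":
        known = [v for v in values if v is not None]
        if not known:
            raise ValueError("cannot compute a mean: all values are missing")
        fill = sum(known) / len(known)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Third column of the sample data, with the middle value treated as missing.
column = [978.0, None, 1182.0]

print(replace_missing(column, strategy="constant"))  # [978.0, 0.0, 1182.0]
print(replace_missing(column, strategy="mean"))      # [978.0, 1080.0, 1182.0]
```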

This item could be a good wiki article in "data formats" ;D

Cheers,
Jean-Charles.

steffen · July 2008

Hello Jean-Charles

jean-charles wrote:

Everything you need for value replenishment is there: you can replace "unknown" values in the metadata either with a constant (typically zero) or with the attribute's mean. There are also more sophisticated approaches, where a learner trained on complete values is used to guess the missing values, but I have never been able to understand how that operator works and is organized.

Yes, but not during import.

jean-charles wrote:

You can use the "Sparse array management" option in your (file/database) ExampleSource if needed.

Why? As far as I can see, the sparse data format is meant for data with a lot of missing values or a small number of different values (for efficient storage).

jean-charles wrote:

This item could be a good wiki article in "data formats" ;D

True, true... :-[

greetings

Steffen

IngoRM · July 2008

Hi all,

actually there was a bug in versions < 4.2 when reading CSV-like data with missing values at the end of lines. The new version 4.2, which is now available on our web site, no longer contains this bug, and everything should work fine, as Steffen has pointed out. So I would suggest upgrading to RM 4.2.

Cheers,
Ingo

Legacy User · July 2008

Thank you all very much for your help.

I have upgraded to 4.2 and the same error occurs. I have found that it happens when I have missing integer-type data, but not when I have missing nominal-type data. I am beginning to think this may be a follow-on to the bug in version 4.1.

Is there a way to have the import skip incomplete lines?
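If the import itself cannot do this, one workaround is to filter the file before loading it. The following is a minimal Python sketch of such a pre-processing step, not a RapidMiner option; the function name `drop_incomplete_rows` is made up for illustration. It keeps only the rows that have the expected number of columns and no empty fields.

```python
import csv
import io


def drop_incomplete_rows(text, expected_columns, delimiter=","):
    """Return CSV text containing only rows with the expected number
    of columns and no empty (or whitespace-only) fields."""
    kept = []
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        has_all_columns = len(row) == expected_columns
        has_no_blanks = all(field.strip() for field in row)
        if has_all_columns and has_no_blanks:
            kept.append(row)
    out = io.StringIO()
    csv.writer(out, delimiter=delimiter, lineterminator="\n").writerows(kept)
    return out.getvalue()


sample = (
    "N282WN,WN,978,91,91,1525,1630,65,308,2,-1\n"
    ",WN,1114,91,91,1850,1955,65,308,2,\n"
    "N207WN,WN,1182,91,91,1405,1510,65,308,2,-1\n"
)

# Prints the 1st and 3rd lines only; the 2nd line is dropped.
print(drop_incomplete_rows(sample, expected_columns=11), end="")
```

The cleaned text can then be written to a new file and imported as usual.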

thank you.

Legacy User · July 2008

ExampleSource was giving me trouble.

CSVExampleSource works fine.

IngoRM · July 2008

Hi again,

maybe it would have worked with the ExampleSource operator, too (both operators are basically the same but with different parameter settings), so it might have something to do with quoting, line trimming, or the column separation parameter. However: good to hear it works now ;D

Cheers,
Ingo
