The Altair Community is migrating to a new platform to provide a better experience for you. In preparation for the migration, the Altair Community is on read-only mode from October 28 - November 6, 2024. Technical support via cases will continue to work as is. For any urgent requests from Students/Faculty members, please submit the form linked here
"Read AML vs Read CSV (Data Types Differ)"
Hey Guys,
It looks like the Read CSV operator has been updated to correctly import data however I still see a discrepancy in data type between the import wizard and Read CSV. I would like to avoid having to use the import wizard for each of my many files. Is data type at all important for the subsequent computations (besides any filtering on data type etc.)? I noticed when I normalize the data all reals/integers are converted to "numerical".
Bottom line: Is Read CSV safe to use regardless if it sees some reals are labeled as integers? :P
-Gagi
It looks like the Read CSV operator has been updated to correctly import data however I still see a discrepancy in data type between the import wizard and Read CSV. I would like to avoid having to use the import wizard for each of my many files. Is data type at all important for the subsequent computations (besides any filtering on data type etc.)? I noticed when I normalize the data all reals/integers are converted to "numerical".
Bottom line: Is Read CSV safe to use regardless if it sees some reals are labeled as integers? :P
-Gagi
Tagged:
0
Answers
The problem arises when you have some real data and for whatever reason read CSV assumes the column is integer. On first inspection this ok when looking at the imported file however when you save the example set into the repository and open it up all integer type attributes that had real values are either truncated or turned to zero.
So in a nutshell Read CSV cannot work if the type is not correctly interpreted for real values.
Read CSV --> Column with reals but type is set to integer --> Save Example Set --> Open Example Set ==> All reals in integer type colums are '0'.
So this is a big problem how can I import all my tab delimited data without using read AML over and over and over manually.
-Gagi
It seems you can simply use numerical to real to convert everything to a real number then saving and loading works as expected.
This is a great find! I cant tell you how shocking it is to look back at your gigs of data and see zeros for most of it!
-Gagi ;D
Read CSV will occasionally get the data-type of an attribute (data column) wrong but RM sees the data correctly when reading from memory. Saving the example set with the wrong data type will lead you to loose your data. I believe its a simple casting issue. If you cast 0.5 as an INT it will be stored as 0. This appears to be happening here.
-Gagi Regarding the above statement, I would say it depends on how the data is used. If for any reason the data-type is used for storing intermediate results or casting prior to computations I would ensure that all data types are correct before using the data.
The easy fix is set all numerical to real numbers that way you are safe, however this will come with some memory overhead.
I'm sure you'll understand that this is quite a big warning you are issuing, and that the next step is that someone else should reproduce the error. Could you please post some data, and the XML to reproduce the error?
Many thanks.
System: Windows 7 Ultimate 64bit, RM 5.0.008 Here is the data set: I tested it both tab and space delimited. In both cases the problem persists. Copy and paste it into a text file and use read csv with the default parameters (or customized it does not make a difference) you will see the data type interpreted as integer rather than real. Then when you save and retrieve the data all your values are 0. ???
:P If you use numerical to real to force data to be saved as reals the problem is solved. Note the read AML wizard parses these numbers fine so I'm wondering why read csv does not.
0.454975009 0.455107987 0.455808014 0.455841005 0.459513009 0.459479004 0.458346009 0.457713991 0.432123005
0.459161013 0.459277004 0.459910989 0.459093988 0.454291999 0.456124991 0.455359012 0.455841988 0.427572995
0.457906991 0.457857996 0.458921999 0.457855999 0.456030011 0.459479004 0.459811985 0.459547013 0.431939989
0.463925004 0.463490993 0.463759005 0.463490993 0.452091992 0.459975004 0.459576994 0.460426003 0.430689991
0.459006011 0.459506989 0.459838003 0.459504992 0.454380989 0.45984599 0.460729003 0.460462987 0.432305992
0.459894001 0.46074301 0.460094005 0.460559994 0.453191996 0.458692014 0.457192987 0.457125992 0.430689991
0.458822995 0.458958 0.459288001 0.459138989 0.456580013 0.459479004 0.461095005 0.460462987 0.432123005
0.460994005 0.461109012 0.461010993 0.461109012 0.454474986 0.459057987 0.458110005 0.458041996 0.431055993
0.457722992 0.457857996 0.457455993 0.458222985 0.456764013 0.458563 0.459996015 0.459547013 0.430474013
0.461910009 0.461842 0.461926997 0.461475998 0.455942005 0.461441994 0.458660007 0.459508985 0.432705998
0.459188998 0.459324002 0.460388005 0.459504992 0.454746991 0.459663004 0.461461991 0.461562991 0.431757003
0.459345013 0.460009992 0.459178001 0.45964399 0.454291999 0.459057987 0.456276 0.456759006 0.430139005
0.459739 0.459140986 0.459472001 0.46023801 0.467211008 0.462229013 0.461645007 0.461196989 0.434139013
0.461360008 0.462574989 0.462292999 0.46074301 0.456674993 0.455758989 0.457376003 0.457309008 0.426288992
0.457540005 0.458041012 0.457823008 0.458039999 0.457863003 0.46112901 0.460361987 0.460280001 0.433771998
0.460810989 0.461475998 0.461010993 0.460925996 0.454109013 0.459057987 0.457926005 0.45785901 0.430323005
0.457356989 0.457307994 0.457823008 0.457489997 0.456764013 0.459479004 0.459262013 0.459363014 0.431573004
0.461176991 0.461293012 0.461010993 0.460193008 0.453191996 0.456492007 0.456458986 0.457491994 0.427938998
0.457906991 0.457857996 0.458555996 0.457855999 0.456030011 0.460029006 0.459446013 0.459363014 0.431939989
0.460260987 0.46074301 0.460460991 0.460559994 0.453925014 0.45759201 0.456458986 0.456759006 0.428489
0.45845601 0.458958 0.458921999 0.459872007 0.453280985 0.460213006 0.461277992 0.460280001 0.433405995
0.460078001 0.460193008 0.460094005 0.459827006 0.451725006 0.458324999 0.456642985 0.456209004 0.430323005
0.456440985 0.456207991 0.456907004 0.456391007 0.456396997 0.458929986 0.458528996 0.45716399 0.431939989
0.460444003 0.460925996 0.460460991 0.460009992 0.460707992 0.457042009 0.457008988 0.456759006 0.428672999
0.460810989 0.460925996 0.460644007 0.46074301 0.458142012 0.457774997 0.457192987 0.457491994 0.429589003
0.457540005 0.457307994 0.458005995 0.458039999 0.458047003 0.461311996 0.459996015 0.460096985 0.433771998
0.46062699 0.461293012 0.461194009 0.460925996 0.453191996 0.458875 0.457376003 0.457675993 0.430323005
0.458090007 0.458407998 0.458005995 0.458772004 0.457497001 0.462229013 0.460545003 0.460462987 0.435054988
0.462642998 0.463856995 0.463759005 0.462024987 0.454842001 0.457224995 0.458292991 0.45785901 0.428672999
0.459006011 0.458958 0.459654987 0.458772004 0.456946999 0.459295988 0.459628999 0.460096985 0.431023985
0.460810989 0.460925996 0.461010993 0.46074301 0.453375012 0.46034199 0.457192987 0.458225995 0.43197301
0.459739 0.46005699 0.460388005 0.459872007 0.457863003 0.462962002 0.462195009 0.461930007 0.435054988
0.461360008 0.461659014 0.461560011 0.461475998 0.452459008 0.457958996 0.457560003 0.45785901 0.429405987
0.456806988 0.457125008 0.456907004 0.457307011 0.455114007 0.459113002 0.459996015 0.459547013 0.431389987
0.461542994 0.461475998 0.461376995 0.462024987 0.456124991 0.459241986 0.458842993 0.459508985 0.431605995
0.459739 0.46005699 0.460570991 0.459872007 0.457130015 0.460945994 0.462011009 0.461012989 0.432673007
0.460260987 0.460377008 0.460094005 0.460559994 0.455392003 0.45759201 0.457008988 0.457309008 0.429039001
0.458640009 0.458407998 0.458739012 0.458406001 0.457863003 0.461311996 0.460179001 0.459729999 0.433405995
0.461176991 0.461475998 0.461010993 0.460925996 0.452459008 0.457774997 0.457376003 0.45785901 0.429223001
0.45845601 0.459140986 0.458921999 0.458588988 0.458597004 0.460395992 0.460911989 0.460280001 0.433405995
0.46062699 0.461109012 0.460644007 0.459827006 0.454291999 0.457774997 0.456642985 0.456941992 0.429589003
0.45845601 0.458774 0.459288001 0.458772004 0.457314014 0.460395992 0.461277992 0.461012989 0.432123005
0.460260987 0.460559994 0.460278004 0.460925996 0.453741997 0.458142012 0.457376003 0.457125992 0.429955989
0.458272994 0.458223999 0.458371997 0.458222985 0.457314014 0.459663004 0.460361987 0.459912986 0.431939989
0.461176991 0.461109012 0.460826993 0.461475998 0.454842001 0.457958996 0.458292991 0.458775997 0.429223001
0.460105985 0.459874004 0.460570991 0.460420996 0.456396997 0.460395992 0.462377995 0.461012989 0.432489991
0.459161013 0.459461004 0.458445013 0.458727986 0.453009009 0.457042009 0.455175012 0.455659002 0.428122997
0.456806988 0.456575006 0.45708999 0.456939995 0.453464001 0.462594986 0.458896011 0.458629996 0.435238004
0.461726993 0.462392002 0.462292999 0.461293012 0.454842001 0.457042009 0.457560003 0.457491994 0.429223001
0.455707997 0.455657989 0.456539989 0.456757009 0.463178992 0.458563 0.458346009 0.458447009 0.430474013
0.458090007 0.458958 0.458921999 0.457673013 0.461163014 0.459479004 0.460179001 0.459547013 0.431939989
0.460994005 0.461659014 0.461376995 0.460377008 0.456308991 0.456492007 0.457008988 0.457309008 0.428306013
0.459739 0.459324002 0.458921999 0.460054994 0.458413005 0.462045014 0.462011009 0.461380005 0.434139013
0.461176991 0.460925996 0.460644007 0.460925996 0.455758989 0.459607989 0.457376003 0.457675993 0.431239992
0.459921986 0.46024099 0.460388005 0.460420996 0.456764013 0.461861998 0.462195009 0.461380005 0.433405995
0.461542994 0.462024987 0.461926997 0.461475998 0.456674993 0.459791988 0.458476007 0.458409011 0.431789994
0.459188998 0.459506989 0.459838003 0.459322006 0.456580013 0.460945994 0.461645007 0.461196989 0.43303901
0.463375986 0.464040995 0.463575989 0.463856995 0.455208987 0.459057987 0.460126996 0.460610002 0.430873007
0.459006011 0.459506989 0.459105015 0.459688991 0.458779991 0.460761994 0.461461991 0.461930007 0.433223009
0.461726993 0.461842 0.461560011 0.462208986 0.455942005 0.460525006 0.458660007 0.459508985 0.432889998
0.459739 0.46005699 0.460020989 0.460788012 0.457863003 0.461311996 0.462011009 0.461380005 0.434322
0.462092996 0.462392002 0.462475985 0.462392002 0.455024987 0.459607989 0.458476007 0.458593011 0.430873007
0.458822995 0.458591014 0.459472001 0.459322006 0.456946999 0.462412 0.460729003 0.460646987 0.434139013
0.462826014 0.46312499 0.462110013 0.462940991 0.456492007 0.460707992 0.459392995 0.459876001 0.432155997
0.457356989 0.458041012 0.458555996 0.458039999 0.457679987 0.45984599 0.459262013 0.458997011 0.432855994
0.46062699 0.461293012 0.461010993 0.460377008 0.455574989 0.457042009 0.456642985 0.457309008 0.428489
0.459006011 0.459690988 0.459288001 0.459872007 0.458779991 0.46112901 0.461827993 0.461380005 0.432489991
0.460078001 0.460009992 0.459910989 0.459461004 0.454474986 0.45759201 0.456092 0.456209004 0.429039001
0.46102199 0.460606992 0.460936993 0.460970998 0.456764013 0.462412 0.463660985 0.462296009 0.434139013
0.461726993 0.461842 0.461010993 0.461293012 0.453375012 0.459975004 0.457560003 0.458041996 0.431605995
0.457906991 0.457673997 0.458005995 0.458406001 0.458413005 0.462045014 0.460911989 0.460096985 0.434139013
0.461176991 0.461475998 0.461010993 0.460377008 0.454109013 0.458142012 0.457742989 0.457125992 0.429589003
0.459739 0.459874004 0.460020989 0.459138989 0.455664009 0.460213006 0.460911989 0.461012989 0.432123005
0.460444003 0.459277004 0.460094005 0.459277004 0.452459008 0.457774997 0.456458986 0.456941992 0.429589003
0.460839003 0.46134001 0.461120009 0.460788012 0.454198003 0.462962002 0.463111013 0.462296009 0.43560499
0.459345013 0.45964399 0.458261997 0.459461004 0.450259 0.45759201 0.455541998 0.455841988 0.428855985
0.45845601 0.458223999 0.458739012 0.458588988 0.458779991 0.461495012 0.460545003 0.460096985 0.433955997
0.461542994 0.462208986 0.461926997 0.461659014 0.461441994 0.455208987 0.457376003 0.45639199 0.42573899
0.459345013 0.460193008 0.460094005 0.459093988 0.454291999 0.457224995 0.456092 0.456941992 0.428489
0.45845601 0.459140986 0.459288001 0.459138989 0.454930991 0.460761994 0.460729003 0.460096985 0.433589011
0.461360008 0.461293012 0.460826993 0.460377008 0.455942005 0.457224995 0.457192987 0.457125992 0.429223001
0.457540005 0.458223999 0.458189011 0.458039999 0.456946999 0.458012998 0.459446013 0.458813995 0.430656999
0.458061993 0.458178014 0.459178001 0.457628995 0.457774997 0.455574989 0.454075009 0.454742014 0.427938998
0.459921986 0.460424006 0.460388005 0.460054994 0.460429996 0.463328004 0.462195009 0.462112993 0.435238004
0.460810989 0.461475998 0.461560011 0.460009992 0.457042009 0.457224995 0.456276 0.45657599 0.428122997
0.459739 0.46005699 0.460388005 0.460604995 0.458413005 0.462229013 0.462927997 0.46303001 0.433405995
0.46062699 0.461659014 0.461194009 0.461109012 0.455758989 0.458508998 0.458110005 0.458225995 0.430689991
0.459188998 0.458774 0.459472001 0.458588988 0.458779991 0.460761994 0.460179001 0.459912986 0.433223009
0.462642998 0.462758005 0.462660015 0.46312499 0.456124991 0.46015799 0.459576994 0.459508985 0.431055993
0.45845601 0.458223999 0.458739012 0.459504992 0.459513009 0.462778002 0.461827993 0.461930007 0.435422003
0.463741988 0.464040995 0.464309007 0.463856995 0.456308991 0.459607989 0.460126996 0.460610002 0.430689991
0.456990987 0.457673997 0.458739012 0.457307011 0.456214011 0.459479004 0.459262013 0.458813995 0.432123005
0.461542994 0.462392002 0.462110013 0.461475998 0.457224995 0.458875 0.457742989 0.45785901 0.430689991
0.459372997 0.459140986 0.460204005 0.459504992 0.457679987 0.460029006 0.460911989 0.460280001 0.432489991
0.461360008 0.460925996 0.460644007 0.461475998 0.455758989 0.458692014 0.457008988 0.458041996 0.430689991
0.459739 0.46024099 0.460388005 0.460604995 0.456764013 0.462594986 0.461645007 0.461196989 0.434872001
0.461542994 0.461659014 0.461194009 0.461842 0.455208987 0.458142012 0.457742989 0.458225995 0.430139005
0.458272994 0.458407998 0.459105015 0.458772004 0.456030011 0.460395992 0.460729003 0.459729999 0.432489991
0.463559002 0.464406997 0.463759005 0.464590013 0.455208987 0.46162501 0.461044014 0.461710006 0.432155997
0.461360008 0.461109012 0.460826993 0.462024987 0.453741997 0.457958996 0.458110005 0.458775997 0.429773003
Thanks for the data and process, from which I can confirm that the CSV reader, as configured, screws up on your data, reporting reals as integers.
Par contre, if you accept that the number parsing is having a rough hair day and turn it off you will get a really smart result - it parses the data as nominals and immediately reveals the quite exceptional repeat frequencies in your dataset - every column of 100 "reals" contains less than 50 distinct values.
So in this case I have to disagree, I am more than happy to accept the output of a properly configured CSV reader on your data. That is not to say that I do not appreciate how meaning can slip between symbol lists, how even a face can be drawn with ASCII, http://dragoljub.com/, or how annoyingly smart RM can sometimes seem http://lh5.ggpht.com/_ZfZ7CWtSXr0/TEgIU0EhDfI/AAAAAAAAAOQ/pEAlPcF17Ak/crushgermanswithyourmind.jpg.
Thanks for confirming and unraveling the issue.
I would think that having a decimal point and only number characters would be consistently assumed to be numerical not nominal. Obviously nominal attributes can have all kinds of characters but when the operator has 'parse numbers' checked I think it is reasonable to expect that real numbers (even if they repeat) should be parsed as real numbers.
On the other hand its great that RM attempts to "think" for us but its the inconsistency that bugs me. Why does read AML assume all the numbers are reals while read CSV does not?
Either way there should be the option to turn off this predictive behavior especially for industry batch use. I would say the majority of people are working with uniform types of data in their tables. Maybe having an option to import numbers as real would be a good option to have. In reality its not really a problem since there is the numerical to real operator that actually highlights the problem of converting integers to real numbers. :-\
Oh and yes ascii art can be quite fun. ;D
we are aware of the inconsistency. And I hope that they all will be solved with the next update. One of our developers had completely revised these operators.
Greetings,
Sebastian
Same behaviour with the Excel reader.
Regards,
Pop
same holds for the excel import...
Greetings,
Sebastian
I'm using RapidMiner 5.0.005 and I have the same problem with importing CSV. I've debugged the code while importing, and I've found the following:
AbstractDataReader.java (line 288): Because of rounding of the double value the two values can be equal. So the column get the Integer type while the data is read correctly (line 292): I think it's a bug or my conception is bad. ???
Regards,
Gergely G.
I'm afraid your RapidMiner version is quite old. That bug has been fixed for quite some time
You may want to consider updating to the latest version, which is 5.1.002.
Regards,
Marco
(but I have an other problem on Ubuntu: if I refresh RM from the Help/Update it can't relaunch the application itself, so it
can't carry out the modifications. But this is an other problem...)
gg