The Altair Community is migrating to a new platform to provide a better experience for you. In preparation for the migration, the Altair Community is on read-only mode from October 28 - November 6, 2024. Technical support via cases will continue to work as is. For any urgent requests from Students/Faculty members, please submit the form linked here

"Read AML vs Read CSV (Data Types Differ)"

dragoljubdragoljub Member Posts: 241 Contributor II
edited June 2019 in Help
Hey Guys,

It looks like the Read CSV operator has been updated to correctly import data however I still see a discrepancy in data type between the import wizard and Read CSV. I would like to avoid having to use the import wizard for each of my many files. Is data type at all important for the subsequent computations (besides any filtering on data type etc.)? I noticed when I normalize the data all reals/integers are converted to "numerical".

Bottom line: Is Read CSV safe to use regardless if it sees some reals are labeled as integers?  :P

-Gagi
Tagged:

Answers

  • haddockhaddock Member Posts: 849 Maven
    Yes
  • dragoljubdragoljub Member Posts: 241 Contributor II
    Unfortunately this is not the case!  :P  :'(

    The problem arises when you have some real data and for whatever reason read CSV assumes the column is integer. On first inspection this ok when looking at the imported file however when you save the example set into the repository and open it up all integer type attributes that had real values are either truncated or turned to zero.

    So in a nutshell Read CSV cannot work if the type is not correctly interpreted for real values.

    Read CSV --> Column with reals but type is set to integer --> Save Example Set --> Open Example Set ==> All reals in integer type colums are '0'.

    So this is a big problem how can I import all my tab delimited data without using read AML over and over and over manually.

    -Gagi
  • dragoljubdragoljub Member Posts: 241 Contributor II
    A Solution!

    It seems you can simply use numerical to real to convert everything to a real number then saving and loading works as expected.

    This is a great find! I cant tell you how shocking it is to look back at your gigs of data and see zeros for most of it!

    -Gagi  ;D
  • haddockhaddock Member Posts: 849 Maven
    So in a nutshell Read CSV cannot work if the type is not correctly interpreted for real values.
    So is that a true statement?
  • dragoljubdragoljub Member Posts: 241 Contributor II
    Here is what I mean:

    Read CSV will occasionally get the data-type of an attribute (data column) wrong but RM sees the data correctly when reading from memory. Saving the example set with the wrong data type will lead you to loose your data. I believe its a simple casting issue. If you cast 0.5 as an INT it will be stored as 0. This appears to be happening here.

    -Gagi
    haddock wrote:

    So is that a true statement?
    Regarding the above statement, I would say it depends on how the data is used. If for any reason the data-type is used for storing intermediate results or casting prior to computations I would ensure that all data types are correct before using the data.

    The easy fix is set all numerical to real numbers that way you are safe, however this will come with some memory overhead.
  • haddockhaddock Member Posts: 849 Maven
    So in a nutshell Read CSV cannot work if the type is not correctly interpreted for real values.
    Hi Gagi,

    I'm sure you'll understand that this is quite a big warning you are issuing, and that the next step is that someone else should reproduce the error. Could you please post some data, and the XML to reproduce the error?

    Many thanks.
  • dragoljubdragoljub Member Posts: 241 Contributor II
    Here is an example process and data table that reproduces the problem. First of all the data is pretty generic I just removed the column names to avoid giving any information away. Also this is a very small subset of the data I work with maybe 1% but for whatever reason all these columns all have this problem.

    System: Windows 7 Ultimate 64bit, RM 5.0.008

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.0.8" expanded="true" name="Process">
       <process expanded="true" height="475" width="299">
         <operator activated="true" class="read_csv" compatibility="5.0.8" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
           <parameter key="file_name" value="C:\Users\gagi\Desktop\Test.txt"/>
           <parameter key="use_first_row_as_attribute_names" value="false"/>
         </operator>
         <operator activated="true" class="store" compatibility="5.0.8" expanded="true" height="60" name="Store" width="90" x="179" y="30">
           <parameter key="repository_entry" value="//Repository/test"/>
         </operator>
         <operator activated="true" class="retrieve" compatibility="5.0.8" expanded="true" height="60" name="Retrieve" width="90" x="179" y="120">
           <parameter key="repository_entry" value="//Repository/test"/>
         </operator>
         <connect from_op="Read CSV" from_port="output" to_op="Store" to_port="input"/>
         <connect from_op="Store" from_port="through" to_port="result 1"/>
         <connect from_op="Retrieve" from_port="output" to_port="result 2"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="72"/>
         <portSpacing port="sink_result 3" spacing="0"/>
       </process>
     </operator>
    </process>
    Here is the data set: I tested it both tab and space delimited. In both cases the problem persists. Copy and paste it into a text file and use read csv with the default parameters (or customized it does not make a difference) you will see the data type interpreted as integer rather than real. Then when you save and retrieve the data all your values are 0. ???

    :P If you use numerical to real to force data to be saved as reals the problem is solved. Note the read AML wizard parses these numbers fine so I'm wondering why read csv does not.

    0.454975009 0.455107987 0.455808014 0.455841005 0.459513009 0.459479004 0.458346009 0.457713991 0.432123005
    0.459161013 0.459277004 0.459910989 0.459093988 0.454291999 0.456124991 0.455359012 0.455841988 0.427572995
    0.457906991 0.457857996 0.458921999 0.457855999 0.456030011 0.459479004 0.459811985 0.459547013 0.431939989
    0.463925004 0.463490993 0.463759005 0.463490993 0.452091992 0.459975004 0.459576994 0.460426003 0.430689991
    0.459006011 0.459506989 0.459838003 0.459504992 0.454380989 0.45984599 0.460729003 0.460462987 0.432305992
    0.459894001 0.46074301 0.460094005 0.460559994 0.453191996 0.458692014 0.457192987 0.457125992 0.430689991
    0.458822995 0.458958 0.459288001 0.459138989 0.456580013 0.459479004 0.461095005 0.460462987 0.432123005
    0.460994005 0.461109012 0.461010993 0.461109012 0.454474986 0.459057987 0.458110005 0.458041996 0.431055993
    0.457722992 0.457857996 0.457455993 0.458222985 0.456764013 0.458563 0.459996015 0.459547013 0.430474013
    0.461910009 0.461842 0.461926997 0.461475998 0.455942005 0.461441994 0.458660007 0.459508985 0.432705998
    0.459188998 0.459324002 0.460388005 0.459504992 0.454746991 0.459663004 0.461461991 0.461562991 0.431757003
    0.459345013 0.460009992 0.459178001 0.45964399 0.454291999 0.459057987 0.456276 0.456759006 0.430139005
    0.459739 0.459140986 0.459472001 0.46023801 0.467211008 0.462229013 0.461645007 0.461196989 0.434139013
    0.461360008 0.462574989 0.462292999 0.46074301 0.456674993 0.455758989 0.457376003 0.457309008 0.426288992
    0.457540005 0.458041012 0.457823008 0.458039999 0.457863003 0.46112901 0.460361987 0.460280001 0.433771998
    0.460810989 0.461475998 0.461010993 0.460925996 0.454109013 0.459057987 0.457926005 0.45785901 0.430323005
    0.457356989 0.457307994 0.457823008 0.457489997 0.456764013 0.459479004 0.459262013 0.459363014 0.431573004
    0.461176991 0.461293012 0.461010993 0.460193008 0.453191996 0.456492007 0.456458986 0.457491994 0.427938998
    0.457906991 0.457857996 0.458555996 0.457855999 0.456030011 0.460029006 0.459446013 0.459363014 0.431939989
    0.460260987 0.46074301 0.460460991 0.460559994 0.453925014 0.45759201 0.456458986 0.456759006 0.428489
    0.45845601 0.458958 0.458921999 0.459872007 0.453280985 0.460213006 0.461277992 0.460280001 0.433405995
    0.460078001 0.460193008 0.460094005 0.459827006 0.451725006 0.458324999 0.456642985 0.456209004 0.430323005
    0.456440985 0.456207991 0.456907004 0.456391007 0.456396997 0.458929986 0.458528996 0.45716399 0.431939989
    0.460444003 0.460925996 0.460460991 0.460009992 0.460707992 0.457042009 0.457008988 0.456759006 0.428672999
    0.460810989 0.460925996 0.460644007 0.46074301 0.458142012 0.457774997 0.457192987 0.457491994 0.429589003
    0.457540005 0.457307994 0.458005995 0.458039999 0.458047003 0.461311996 0.459996015 0.460096985 0.433771998
    0.46062699 0.461293012 0.461194009 0.460925996 0.453191996 0.458875 0.457376003 0.457675993 0.430323005
    0.458090007 0.458407998 0.458005995 0.458772004 0.457497001 0.462229013 0.460545003 0.460462987 0.435054988
    0.462642998 0.463856995 0.463759005 0.462024987 0.454842001 0.457224995 0.458292991 0.45785901 0.428672999
    0.459006011 0.458958 0.459654987 0.458772004 0.456946999 0.459295988 0.459628999 0.460096985 0.431023985
    0.460810989 0.460925996 0.461010993 0.46074301 0.453375012 0.46034199 0.457192987 0.458225995 0.43197301
    0.459739 0.46005699 0.460388005 0.459872007 0.457863003 0.462962002 0.462195009 0.461930007 0.435054988
    0.461360008 0.461659014 0.461560011 0.461475998 0.452459008 0.457958996 0.457560003 0.45785901 0.429405987
    0.456806988 0.457125008 0.456907004 0.457307011 0.455114007 0.459113002 0.459996015 0.459547013 0.431389987
    0.461542994 0.461475998 0.461376995 0.462024987 0.456124991 0.459241986 0.458842993 0.459508985 0.431605995
    0.459739 0.46005699 0.460570991 0.459872007 0.457130015 0.460945994 0.462011009 0.461012989 0.432673007
    0.460260987 0.460377008 0.460094005 0.460559994 0.455392003 0.45759201 0.457008988 0.457309008 0.429039001
    0.458640009 0.458407998 0.458739012 0.458406001 0.457863003 0.461311996 0.460179001 0.459729999 0.433405995
    0.461176991 0.461475998 0.461010993 0.460925996 0.452459008 0.457774997 0.457376003 0.45785901 0.429223001
    0.45845601 0.459140986 0.458921999 0.458588988 0.458597004 0.460395992 0.460911989 0.460280001 0.433405995
    0.46062699 0.461109012 0.460644007 0.459827006 0.454291999 0.457774997 0.456642985 0.456941992 0.429589003
    0.45845601 0.458774 0.459288001 0.458772004 0.457314014 0.460395992 0.461277992 0.461012989 0.432123005
    0.460260987 0.460559994 0.460278004 0.460925996 0.453741997 0.458142012 0.457376003 0.457125992 0.429955989
    0.458272994 0.458223999 0.458371997 0.458222985 0.457314014 0.459663004 0.460361987 0.459912986 0.431939989
    0.461176991 0.461109012 0.460826993 0.461475998 0.454842001 0.457958996 0.458292991 0.458775997 0.429223001
    0.460105985 0.459874004 0.460570991 0.460420996 0.456396997 0.460395992 0.462377995 0.461012989 0.432489991
    0.459161013 0.459461004 0.458445013 0.458727986 0.453009009 0.457042009 0.455175012 0.455659002 0.428122997
    0.456806988 0.456575006 0.45708999 0.456939995 0.453464001 0.462594986 0.458896011 0.458629996 0.435238004
    0.461726993 0.462392002 0.462292999 0.461293012 0.454842001 0.457042009 0.457560003 0.457491994 0.429223001
    0.455707997 0.455657989 0.456539989 0.456757009 0.463178992 0.458563 0.458346009 0.458447009 0.430474013
    0.458090007 0.458958 0.458921999 0.457673013 0.461163014 0.459479004 0.460179001 0.459547013 0.431939989
    0.460994005 0.461659014 0.461376995 0.460377008 0.456308991 0.456492007 0.457008988 0.457309008 0.428306013
    0.459739 0.459324002 0.458921999 0.460054994 0.458413005 0.462045014 0.462011009 0.461380005 0.434139013
    0.461176991 0.460925996 0.460644007 0.460925996 0.455758989 0.459607989 0.457376003 0.457675993 0.431239992
    0.459921986 0.46024099 0.460388005 0.460420996 0.456764013 0.461861998 0.462195009 0.461380005 0.433405995
    0.461542994 0.462024987 0.461926997 0.461475998 0.456674993 0.459791988 0.458476007 0.458409011 0.431789994
    0.459188998 0.459506989 0.459838003 0.459322006 0.456580013 0.460945994 0.461645007 0.461196989 0.43303901
    0.463375986 0.464040995 0.463575989 0.463856995 0.455208987 0.459057987 0.460126996 0.460610002 0.430873007
    0.459006011 0.459506989 0.459105015 0.459688991 0.458779991 0.460761994 0.461461991 0.461930007 0.433223009
    0.461726993 0.461842 0.461560011 0.462208986 0.455942005 0.460525006 0.458660007 0.459508985 0.432889998
    0.459739 0.46005699 0.460020989 0.460788012 0.457863003 0.461311996 0.462011009 0.461380005 0.434322
    0.462092996 0.462392002 0.462475985 0.462392002 0.455024987 0.459607989 0.458476007 0.458593011 0.430873007
    0.458822995 0.458591014 0.459472001 0.459322006 0.456946999 0.462412 0.460729003 0.460646987 0.434139013
    0.462826014 0.46312499 0.462110013 0.462940991 0.456492007 0.460707992 0.459392995 0.459876001 0.432155997
    0.457356989 0.458041012 0.458555996 0.458039999 0.457679987 0.45984599 0.459262013 0.458997011 0.432855994
    0.46062699 0.461293012 0.461010993 0.460377008 0.455574989 0.457042009 0.456642985 0.457309008 0.428489
    0.459006011 0.459690988 0.459288001 0.459872007 0.458779991 0.46112901 0.461827993 0.461380005 0.432489991
    0.460078001 0.460009992 0.459910989 0.459461004 0.454474986 0.45759201 0.456092 0.456209004 0.429039001
    0.46102199 0.460606992 0.460936993 0.460970998 0.456764013 0.462412 0.463660985 0.462296009 0.434139013
    0.461726993 0.461842 0.461010993 0.461293012 0.453375012 0.459975004 0.457560003 0.458041996 0.431605995
    0.457906991 0.457673997 0.458005995 0.458406001 0.458413005 0.462045014 0.460911989 0.460096985 0.434139013
    0.461176991 0.461475998 0.461010993 0.460377008 0.454109013 0.458142012 0.457742989 0.457125992 0.429589003
    0.459739 0.459874004 0.460020989 0.459138989 0.455664009 0.460213006 0.460911989 0.461012989 0.432123005
    0.460444003 0.459277004 0.460094005 0.459277004 0.452459008 0.457774997 0.456458986 0.456941992 0.429589003
    0.460839003 0.46134001 0.461120009 0.460788012 0.454198003 0.462962002 0.463111013 0.462296009 0.43560499
    0.459345013 0.45964399 0.458261997 0.459461004 0.450259 0.45759201 0.455541998 0.455841988 0.428855985
    0.45845601 0.458223999 0.458739012 0.458588988 0.458779991 0.461495012 0.460545003 0.460096985 0.433955997
    0.461542994 0.462208986 0.461926997 0.461659014 0.461441994 0.455208987 0.457376003 0.45639199 0.42573899
    0.459345013 0.460193008 0.460094005 0.459093988 0.454291999 0.457224995 0.456092 0.456941992 0.428489
    0.45845601 0.459140986 0.459288001 0.459138989 0.454930991 0.460761994 0.460729003 0.460096985 0.433589011
    0.461360008 0.461293012 0.460826993 0.460377008 0.455942005 0.457224995 0.457192987 0.457125992 0.429223001
    0.457540005 0.458223999 0.458189011 0.458039999 0.456946999 0.458012998 0.459446013 0.458813995 0.430656999
    0.458061993 0.458178014 0.459178001 0.457628995 0.457774997 0.455574989 0.454075009 0.454742014 0.427938998
    0.459921986 0.460424006 0.460388005 0.460054994 0.460429996 0.463328004 0.462195009 0.462112993 0.435238004
    0.460810989 0.461475998 0.461560011 0.460009992 0.457042009 0.457224995 0.456276 0.45657599 0.428122997
    0.459739 0.46005699 0.460388005 0.460604995 0.458413005 0.462229013 0.462927997 0.46303001 0.433405995
    0.46062699 0.461659014 0.461194009 0.461109012 0.455758989 0.458508998 0.458110005 0.458225995 0.430689991
    0.459188998 0.458774 0.459472001 0.458588988 0.458779991 0.460761994 0.460179001 0.459912986 0.433223009
    0.462642998 0.462758005 0.462660015 0.46312499 0.456124991 0.46015799 0.459576994 0.459508985 0.431055993
    0.45845601 0.458223999 0.458739012 0.459504992 0.459513009 0.462778002 0.461827993 0.461930007 0.435422003
    0.463741988 0.464040995 0.464309007 0.463856995 0.456308991 0.459607989 0.460126996 0.460610002 0.430689991
    0.456990987 0.457673997 0.458739012 0.457307011 0.456214011 0.459479004 0.459262013 0.458813995 0.432123005
    0.461542994 0.462392002 0.462110013 0.461475998 0.457224995 0.458875 0.457742989 0.45785901 0.430689991
    0.459372997 0.459140986 0.460204005 0.459504992 0.457679987 0.460029006 0.460911989 0.460280001 0.432489991
    0.461360008 0.460925996 0.460644007 0.461475998 0.455758989 0.458692014 0.457008988 0.458041996 0.430689991
    0.459739 0.46024099 0.460388005 0.460604995 0.456764013 0.462594986 0.461645007 0.461196989 0.434872001
    0.461542994 0.461659014 0.461194009 0.461842 0.455208987 0.458142012 0.457742989 0.458225995 0.430139005
    0.458272994 0.458407998 0.459105015 0.458772004 0.456030011 0.460395992 0.460729003 0.459729999 0.432489991
    0.463559002 0.464406997 0.463759005 0.464590013 0.455208987 0.46162501 0.461044014 0.461710006 0.432155997
    0.461360008 0.461109012 0.460826993 0.462024987 0.453741997 0.457958996 0.458110005 0.458775997 0.429773003
  • haddockhaddock Member Posts: 849 Maven
    Bonjour Gagi!

    Thanks for the data and process, from which I can confirm that the CSV reader, as configured, screws up on your data, reporting reals as integers.

    Par contre, if you accept that the number parsing is having a rough hair day and turn it off you will get a really smart result - it parses the data as nominals and immediately reveals the quite exceptional repeat frequencies in your dataset - every column of 100 "reals" contains less than 50 distinct values.

    So in this case I have to disagree, I am more than happy to accept the output of a properly configured CSV reader on your data. That is not to say that I do not appreciate how meaning can slip between symbol lists, how even a face can be drawn with ASCII, http://dragoljub.com/, or how annoyingly smart RM can sometimes seem http://lh5.ggpht.com/_ZfZ7CWtSXr0/TEgIU0EhDfI/AAAAAAAAAOQ/pEAlPcF17Ak/crushgermanswithyourmind.jpg.







  • dragoljubdragoljub Member Posts: 241 Contributor II
    Haddock,

    Thanks for confirming and unraveling the issue.

    I would think that having a decimal point and only number characters would be consistently assumed to be numerical not nominal. Obviously nominal attributes can have all kinds of characters but when the operator has 'parse numbers' checked I think it is reasonable to expect that real numbers (even if they repeat) should be parsed as real numbers.

    On the other hand its great that RM attempts to "think" for us but its the inconsistency that bugs me. Why does read AML assume all the numbers are reals while read CSV does not?

    Either way there should be the option to turn off this predictive behavior especially for industry batch use. I would say the majority of people are working with uniform types of data in their tables. Maybe having an option to import numbers as real would be a good option to have. In reality its not really a problem since there is the numerical to real operator that actually highlights the problem of converting integers to real numbers.  :-\

    Oh and yes ascii art can be quite fun.  ;D
  • landland RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    we are aware of the inconsistency. And I hope that they all will be solved with the next update. One of our developers had completely revised these operators.

    Greetings,
      Sebastian
  • poppop Member Posts: 21 Maven
    Hi,

    Same behaviour with the Excel reader.

    Regards,

    Pop
  • landland RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    same holds for the excel import... :)

    Greetings,
      Sebastian
  • gulyasggulyasg Member Posts: 4 Contributor I
    Hi,

    I'm using RapidMiner 5.0.005 and I have the same problem with importing CSV. I've debugged the code while importing, and I've found the following:

    AbstractDataReader.java (line 288):

    if (!Tools.isEqual(Math.round(number.doubleValue()), number.intValue())) {
    canParseInteger = false;
    }
    Because of rounding of the double value the two values can be equal. So the column get the Integer type while the data is read correctly (line 292):
    values = number;
    I think it's a bug or my conception is bad.  ???

    Regards,
    Gergely G.

  • Marco_BoeckMarco_Boeck Administrator, Moderator, Employee-RapidMiner, Member, University Professor Posts: 1,996 RM Engineering
    Hello gulyasg,

    I'm afraid your RapidMiner version is quite old. That bug has been fixed for quite some time ;)
    You may want to consider updating to the latest version, which is 5.1.002.

    Regards,
    Marco
  • gulyasggulyasg Member Posts: 4 Contributor I
    Thanks, it works for me now!

    (but I have an other problem on Ubuntu: if I refresh RM from the Help/Update it can't relaunch the application itself, so it
    can't carry out the modifications. But this is an other problem...)

    gg
    Marco Boeck wrote:

    Hello gulyasg,

    I'm afraid your RapidMiner version is quite old. That bug has been fixed for quite some time ;)
    You may want to consider updating to the latest version, which is 5.1.002.

    Regards,
    Marco
Sign In or Register to comment.