The Altair Community is migrating to a new platform to provide a better experience for you. In preparation for the migration, the Altair Community is on read-only mode from October 28 - November 6, 2024. Technical support via cases will continue to work as is. For any urgent requests from Students/Faculty members, please submit the form linked here
Read CSV - How to use quotes as escape character for quotes
I'm reading a series of CSV data files, comma-separated and using quotes. Within a data line, if quotes are used in a field, it indicates that using double-quotes. i.e., It essentially uses quotes as the escape character for quotes. An example line could be:
"News Alert","Mon, 13 May 2019 08:29:58","""NEWS OFFICE"" <newsoffice@spamdude.com>"
which it SHOULD interpret as 3 fields as follows:
(1) News Alert (2) Mon, 13 May 2019 08:29:58 (3) "NEWS OFFICE" <newsoffice@spamdude.com>
I'm using the Read CSV operator, with "use quotes" checked and using quotes as both the quotes character and escape character. The result is that it not only doesn't read the line correctly, it completely skips reading any line that has the double-quotes in it. My operator XML is as follows:
Is there a way to do this so it reads and interprets my example line properly, or do I have to preprocess all my data files with a Python script or something similar to replace the double-quotes with some other escape character (like the default backslash), before ingesting to RapidMiner? Thanks for the help!
"News Alert","Mon, 13 May 2019 08:29:58","""NEWS OFFICE"" <newsoffice@spamdude.com>"
which it SHOULD interpret as 3 fields as follows:
(1) News Alert (2) Mon, 13 May 2019 08:29:58 (3) "NEWS OFFICE" <newsoffice@spamdude.com>
I'm using the Read CSV operator, with "use quotes" checked and using quotes as both the quotes character and escape character. The result is that it not only doesn't read the line correctly, it completely skips reading any line that has the double-quotes in it. My operator XML is as follows:
<operator activated="true" class="read_csv" compatibility="9.0.003" expanded="true" height="68" name="Read CSV" width="90" x="112" y="34">
<parameter key="column_separators" value=","/>
<parameter key="escape_character" value="""/>
<list key="annotations"/>
<list key="data_set_meta_data_information"/>
<parameter key="read_not_matching_values_as_missings" value="false"/>
</operator>
Tagged:
0
Best Answers
-
SGolbert RapidMiner Certified Analyst, Member Posts: 344 UnicornHi @LearnAW ,I have reproduced your problem by creating a csv file. It's better when you share your entire process and a test file.Where does the csv come from? Is there a possible to change the escape character in the csv generator? I have to say the problem lies in the operator, I can read the file just fine with Pandas.So a solution would be to use the read_csv function with Pandas in a Execute Python operator. You could also read without the "use quotes" option and fix the result in RapidMiner, but I don't recommend it.Kind regards,Sebastian6
-
Marco_Boeck Administrator, Moderator, Employee-RapidMiner, Member, University Professor Posts: 1,996 RM EngineeringHi,
that's why I hate CSV, you get the most weird contents and >9000 special cases which would need manual handling..
In your case, you defined " as escape character and " as quote character. That cannot work, because what escape character simply means is: the next characters (whatever it is!) is read as is into the result, it will have no semantic meaning. So if you have the following line and would like to split on commas (so 2 columns):"Hello",123
this will create only a single column with the contentHello,123
as output, because you escaped the comma. This is necessary because escaping must always work, otherwise the point of escaping becomes mood.
Obviously it gets even worse when you now have lines like"Hello World",""Hi World"",123 or "Hello World","""Hi World""",123 or "Hello World","""Hi"" World",123
This is not a well defined syntax anymore, because now you have different meanings for the occurance of escape/quote characters - but the characters are identical and thus are treated the same by the code.
Python probably solves this by having a huge segment of "if current character x and next character x/y and next character x/y/z then do something else than you normally would" statements.. We instead opted for a clear finite state machine which more or less assumes that the CSV file has a consistent syntax and thus goes through the line parsing methodically.
Just some insight into why this file will not work in Studio.
Regards,
Marco
7 -
LearnAW Member Posts: 4 Learner IThanks Sebastian and Marco, I appreciate your perspectives. And yes, I think I was assuming (or hoping for) a level of sophistication in parsing that I don't think Read CSV provides. (i.e. A precedence would have to exist, to first interpret a quote as the start of a string, and then any subsequent double quotes as an escaped quote.) If it's treating the escape character the same across the entire input, then I see Marco's point how this can't work. It looks like I have some preprocessing to do. If anyone else knows a way to avoid this and read it correctly with just RapidMiner, please illuminate for me! Thanks to all.0