Extracting wrapped cells from a CSV file
Thomas_Ott
RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn
Ok, here's one that's stumping me. I'm attaching a CSV file that has some Instagram comments. The file is generated from a Python script that I'll eventually embed into RM, but right now I'm just trying to get this to work. My goal is to use these comments for some text processing later on.
My problem is correctly parsing the entire comment from each cell. For some strange reason, I can't seem to do it correctly. The text seems to be wrapped in the cell. Any ideas on how to do this?
Answers
Hi @Thomas_Ott,
On my side, with a Nominal to Text operator placed between Read CSV and the Process Documents from Data / Tokenize operators, a TF-IDF dataset is generated without problems.
"...The text seems to be wrapped in the cell..."
What exactly is the problem you are running into?
Regards,
Lionel
Hi again @Thomas_Ott,
Indeed, RapidMiner has difficulty reading your file correctly.
So, once again, I propose a Python script as a possible solution. I used the read_csv function and filtered out
the carriage returns, which cause problems in RapidMiner.
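The script itself is not reproduced in this copy of the thread. A minimal sketch of the idea described above, assuming pandas, might look like the following. The sample data, the `comment` column name, and the helper `read_and_flatten` are hypothetical and are not taken from the original script.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the exported file: the first
# quoted cell contains an embedded carriage return and line feed.
raw = 'user,comment\r\nanna,"great shot\r\nlove it"\r\nbob,"nice"\r\n'


def read_and_flatten(source, text_columns):
    """Read a CSV and replace line breaks inside the given text columns with a space."""
    df = pd.read_csv(source)
    for col in text_columns:
        # Runs of CR/LF characters inside a cell become one space.
        df[col] = df[col].str.replace(r"[\r\n]+", " ", regex=True)
    return df


df = read_and_flatten(io.StringIO(raw), ["comment"])
```

For a real file, a path can be passed in place of the `io.StringIO` buffer.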
The process :
To execute this process, don't forget to set the path (where your file is stored) in the Python script.
I hope it helps,
Regards,
Lionel
@lionelderkrikor I see, you used iloc[0] in a dataframe. So RM loads in one row at a time, hence you can use iloc[0] instead of looping. Correct?
I wonder if the trick is to write the data out correctly in the first place and avoid all this mess. Do you happen to know of a 'remove carriage return option' when writing out the dataframe?
Before you ask why I don't skip exporting the CSV and just pass the data through the RM process: I get that, but the requirement is to spit out the CSV.
Thanks BTW!
@lionelderkrikor ok, I solved it. The data was incredibly messy. I just matched all whitespace using \s and replaced it with a single space. Kinda stupid but it works.
Now I get everything I want in the output file AND I can productionalize things.
Thanks for your help.
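The whitespace fix described above could look roughly like this. It is a sketch and not the actual script: the sample rows and the helper `normalize_whitespace` are hypothetical, and collapsing runs with `\s+` and trimming the ends are assumptions about how the replacement was done.

```python
import csv
import io
import re


def normalize_whitespace(text):
    """Replace every run of whitespace (spaces, tabs, CR, LF) with one space."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical rows standing in for the scraped comments.
rows = [
    ("anna", "great shot\r\n\r\nlove   it\t!"),
    ("bob", " nice \n"),
]

# Write to an in-memory buffer; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["user", "comment"])
for user, comment in rows:
    writer.writerow([user, normalize_whitespace(comment)])

output = buffer.getvalue()
```

Cleaning each comment before it is written means every record ends up on a single physical line, so the reader no longer has to deal with line breaks inside quoted cells.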
Hi @Thomas_Ott,
Good that you solved it. I've had similar problems before when reading CSV files generated with R or Python.
As a general note, I have found that saving your external tables as XML can be advantageous, as the Read XML operator is way more forgiving than the Read CSV one. Then you can retain special characters.
Regards,
Sebastian
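As an illustration of the XML suggestion, here is a minimal sketch using Python's standard library. The element names (`comments`, `row`, `user`, `comment`) and the sample row are hypothetical, and the layout that Read XML expects in a given process may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical row: the comment contains a line break and characters
# that must be escaped in XML.
rows = [{"user": "anna", "comment": "great shot\nlove it <3 & more"}]

root = ET.Element("comments")
for row in rows:
    item = ET.SubElement(root, "row")
    for key, value in row.items():
        # ElementTree escapes <, > and & when serializing.
        ET.SubElement(item, key).text = value

xml_text = ET.tostring(root, encoding="unicode")

# Parsing the result again returns the original text, line break included.
parsed = ET.fromstring(xml_text)
restored = parsed.find("row/comment").text
```

Because each value sits inside its own element, a line break in a comment cannot be mistaken for a record boundary the way it can in a CSV file.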
@SGolbert I haven't thought of doing that. A handy tip indeed.
Another trick I use in situations like these is to convert the text to URL-encoded text, which makes it MUCH easier to strip out unwanted characters, and then convert it back.
Scott
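A rough sketch of the URL-encoding trick in Python, assuming the unwanted characters are carriage returns, line feeds and tabs. The helper name `strip_line_breaks_via_url_encoding` is hypothetical, and replacing the characters with a space rather than deleting them is an assumption.

```python
from urllib.parse import quote, unquote


def strip_line_breaks_via_url_encoding(text):
    """Percent-encode the text, swap encoded CR/LF/tab for an encoded space, decode."""
    # safe="" encodes every reserved character, so "%" in the result
    # only ever starts a percent escape.
    encoded = quote(text, safe="")
    # Handle CRLF first so a Windows line break becomes one space, not two.
    for token in ("%0D%0A", "%0D", "%0A", "%09"):
        encoded = encoded.replace(token, "%20")
    return unquote(encoded)


cleaned = strip_line_breaks_via_url_encoding("first line\r\nsecond line\n100% sure")
```

Once encoded, the invisible characters show up as plain tokens such as `%0D` and `%0A`, which are easy to spot and replace with ordinary string operations.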