Arabic Light Stemming a CSV file
NoorKhalifa
Member Posts: 7 Learner I
I have a CSV file with around 4000 rows of text. I want to use the Arabic Light Stemmer to stem each record.
I have done the following but the text is not being stemmed. The output is the same as the input.
and inside the Process
Answers
To stem words, first you need words. Use Tokenize before Stem to split the text into words.
Regards,
Balázs
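The order Balázs describes (tokenize first, then stem) can be sketched outside RapidMiner. This is only an illustration with a deliberately tiny affix list, not the actual Arabic Light Stemmer operator; real light stemmers use much larger prefix/suffix tables and extra normalization rules.

```python
# Illustration only: a much-simplified Arabic light stemmer, mimicking the
# tokenize-then-stem order of the RapidMiner process. The affix lists here
# are a small sample, not the full set a real light stemmer uses.

PREFIXES = ("وال", "بال", "كال", "فال", "ال")   # common Arabic prefixes
SUFFIXES = ("ات", "ون", "ين", "ها", "ية", "ة")  # common Arabic suffixes

def light_stem(word: str) -> str:
    """Strip at most one prefix and one suffix, keeping >= 3 letters."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word

def stem_text(text: str) -> list:
    """Tokenize first (split on whitespace), then stem each token.
    Stemming the whole untokenized string would change nothing, which
    is exactly the symptom described above."""
    return [light_stem(token) for token in text.split()]
```

For example, `stem_text("الكتاب والمعلمون")` strips the definite-article prefixes and the plural suffix, while `light_stem("الكتاب والمعلمون")` applied to the whole string would leave it essentially unchanged.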
I did the following
inside the Process, but the output is still exactly the same as the input.
Is there a problem with reading Arabic text?
I specified the encoding to be UTF-8 when I imported the CSV file. Is there anything else I should do?
Put a breakpoint after Tokenize and play with the settings. If you see the words in different colors, the tokenization is working correctly.
I have no idea about the conventions of Arabic text; maybe a different word separator is necessary, etc.
If the text looks normal to you in RapidMiner, then the encoding is correct. You would see that it is broken with a wrong encoding.
Regards,
Balázs
I am facing this issue, what could be a possible reason?
You need to use Nominal to Text before Process Documents in order to mark your nominal attributes as text (suitable for the Text Processing operators).
Regards,
Balázs
When I put a break point after Stem, I can see the correctly stemmed sentence. But the final output in the Results is like the following. What can I do to fix this? I want the output to be rows of the stemmed sentence.
Use the "keep text" option that all "Process Documents" operators have.
The default operation mode of Process Documents is to create the wide table suitable for machine learning methods.
Tokenization can split your text into letters, words or sentences. Stemming works on words, at least in Western languages.
Regards,
Balázs
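The difference between the two output modes can be sketched in a few lines. This is a hypothetical analogue, not RapidMiner's internals: the default "wide table" is a document-term matrix with one column per token, while "keep text" carries the processed string along as its own column.

```python
from collections import Counter

# Two already-tokenized/stemmed example documents.
docs = ["كتاب جديد", "كتاب قديم"]

# Default mode: a wide document-term matrix, one column per vocabulary
# token and one row per document (what machine learning methods expect).
vocab = sorted({tok for d in docs for tok in d.split()})
rows = [[Counter(d.split())[tok] for tok in vocab] for d in docs]

# "keep text" analogue: the processed text survives as an extra column
# next to the term counts, so you can still read the stemmed sentence.
rows_with_text = [[d] + counts for d, counts in zip(docs, rows)]
```

With "keep text" enabled, the Results view keeps one readable row of stemmed text per document instead of only the numeric columns.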
Great, that solved it. But now, when I use Write CSV, I don't get Arabic text in the output CSV file.
I set the encoding to UTF-8 for Read CSV, Write CSV, and for the process itself (by clicking on the white canvas).
What can I do to solve that?
Try using software in which you can set the import encoding. Excel is not very smart when simply opening a CSV file. An explicit import should also work in Excel, where you get a dialog for selecting the encoding.
The encoding of text files is not obvious to most software. It often needs to be specified manually. You can use an advanced editor (GVim, Notepad++ etc.) to determine if the file itself is really in UTF-8.
Regards,
Balázs
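If you end up post-processing the file with your own script, one common workaround (an assumption here, not something RapidMiner does for you) is to write the CSV with a UTF-8 byte-order mark: Excel uses the BOM to autodetect UTF-8 when a CSV is opened by double-click, whereas plain UTF-8 often shows up garbled.

```python
import csv

# Example rows; the header name and content are made up for illustration.
rows = [["stemmed_text"], ["كتاب جديد"]]

# "utf-8-sig" prepends a BOM (EF BB BF). Excel reads the BOM and opens
# the file as UTF-8 directly; with plain "utf-8" it falls back to a
# legacy codepage and the Arabic text appears as mojibake.
with open("stemmed.csv", "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(rows)
```

Reading the file back with `encoding="utf-8-sig"` strips the BOM again, so round-tripping in Python stays clean.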
Excel seems to have moved the CSV import to Data → From Text/CSV.
Greetings,
Jonas
Hello!
After clicking From Text/CSV, what should I do?
For me the first dialog was the "Import Data" file selector, the second one the CSV table preview from my screenshot.
I fear the Excel autodetection completely failed for your file; is there anything in the "Open As" menu that says CSV or UTF-8?
Greetings,
Jonas
I didn't manage to do that in Excel, but importing the file in Notepad gave me the Arabic equivalent.
Thanks!
You can force CSV parsing here, but you will stay in the more cumbersome Power Query Editor flow afterwards.
Greetings,
Jonas