"Text processing operators on example set"
Hello Everyone,
I have several CSV files that all look the same: each has 2 attributes, a word list (extracted from a document) and the number of occurrences of each word. First, I have to filter them. For that, I made a Stopword Dictionary. Then I have to build one huge matrix out of them, with the remaining words in the header and one row per document.
The "Process Documents from Files" operator works almost perfectly, BUT the occurrences are lost. This operator wants to count occurrences itself, so each value ends up as 1 or 0, depending on whether the given word is present in a document or not. How can I use the previously counted numbers?
I also tried it with the "Read CSV", "Nominal to Text" and "Process Documents from Data" operators, but this way I can't even filter the words.
I'll also need the name of the files in the final matrix at the beginning of the lines. I already found out how to use an existing macro, but I do not know how to make one. I would like to make a file_name macro, but I don't know how to do that.
I am a newbie, so if you know the answer to one of the questions, please explain it in as much detail as possible, because what is obvious to you may not be obvious to me.
Thank you in advance!
Laura
Answers
Hello @laurahajnalka, welcome to the community. Could you please post your XML and your data set so we can better understand what you're trying to do? You can find instructions on how to do this here.
Happy RapidMining!
Scott
Dear @sgenzer,
I didn't do it before because I didn't have much to show, but I have now attached a sample of my dataset (there are 5000-6000 rows in one CSV) and a sample of the matrix I got.
And here is the xml:
Hi @laurahajnalka,
Your XML process is broken: it can't be loaded in RapidMiner...
Anyway, if you already have the extracted words and their occurrences, from my point of view a solution is to use the Loop Files operator (instead of the Process Documents from XXXX operators) together with the building block "Append with Union".
Here are the results (with 3 fictitious files):
NB: the "?" character means [occurrence = 0].
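For anyone who wants to check the expected matrix outside RapidMiner, the same idea (read each file, drop the stopwords, take the union of all remaining words, one row per file) can be sketched in plain Python. This is only an illustration, not the RapidMiner process: the function names `read_counts` and `build_matrix`, the file names, the column layout (word, count, with a header row) and the sample data are all assumptions made for the example. Here a missing word is written as 0 rather than "?".

```python
import csv
import io


def read_counts(lines, stopwords):
    """Read (word, count) CSV rows and return {word: count}, skipping stopwords."""
    counts = {}
    for row in csv.reader(lines):
        if len(row) < 2:
            continue
        word, n = row[0].strip(), row[1].strip()
        if not n.isdigit():
            # Not a number: treat as a header or malformed row and skip it.
            continue
        if word.lower() in stopwords:
            continue
        counts[word] = counts.get(word, 0) + int(n)
    return counts


def build_matrix(docs, stopwords):
    """docs maps a file name to its CSV lines. Returns (header, rows)."""
    per_file = {name: read_counts(lines, stopwords) for name, lines in docs.items()}
    # Union of all remaining words across files becomes the header.
    vocab = sorted({w for c in per_file.values() for w in c})
    header = ["file_name"] + vocab
    # One row per file; a word absent from a file gets 0.
    rows = [[name] + [c.get(w, 0) for w in vocab] for name, c in per_file.items()]
    return header, rows


# Hypothetical sample data standing in for two of the CSV files.
docs = {
    "doc1.csv": io.StringIO("word,count\napple,3\nthe,10\npear,1\n"),
    "doc2.csv": io.StringIO("word,count\npear,4\nthe,7\nplum,2\n"),
}
header, rows = build_matrix(docs, stopwords={"the"})
print(header)
for r in rows:
    print(r)
```

With real files, each `io.StringIO(...)` would be replaced by an opened file object, and the stopword set would be loaded from the dictionary file.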
The process:
I hope it helps,
Regards,
Lionel