CSV to N-Gram Process
I have a large CSV file that I'm trying to process in order to generate n-grams for a selected attribute. The process is currently as follows:
Read CSV => Select Attribute => Nominal to Text => Data to Documents => Process Documents => Mutual Information
=> Wordlist to Data
Now, within the Process Documents operator I have:
Tokenize => Transform Cases (to lower case) => Filter Tokens => Stem => Filter Stopwords => Generate N-grams
I can run the process without error. It seems to generate results after "Process Documents", but when I go to view the results, none are shown. The run time is also very quick; I would expect it to take at least a few minutes given the size of the file.
Can anyone shed some light on possible breaks in this process, or suggest modifications so that I can get results?
Thanks.
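For reference, the inner steps listed above can be sketched outside RapidMiner in a few lines of Python. This is only an illustration of what the chain is expected to produce, not the Text Processing extension's code: the stopword list and the minimum token length are made-up values, stemming is left out to keep the sketch dependency-free, and the function names are hypothetical. N-grams are joined with underscores, as mentioned further down in this thread.

```python
import re

# Tiny illustrative stopword list (an assumption, not RapidMiner's list)
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in"}


def tokenize(text):
    # Split on anything that is not a letter and drop empty pieces
    return [t for t in re.split(r"[^A-Za-z]+", text) if t]


def token_ngrams(tokens, max_length):
    # Collect every run of 1..max_length consecutive tokens, joined by "_"
    out = []
    for n in range(1, max_length + 1):
        for i in range(len(tokens) - n + 1):
            out.append("_".join(tokens[i:i + n]))
    return out


def preprocess(text, max_length=2):
    tokens = [t.lower() for t in tokenize(text)]        # Transform Cases
    tokens = [t for t in tokens if len(t) >= 3]         # Filter Tokens (by length, assumed minimum 3)
    tokens = [t for t in tokens if t not in STOPWORDS]  # Filter Stopwords
    return token_ngrams(tokens, max_length)             # Generate n-grams
```

Running a handful of rows through something like this gives a rough idea of how many terms the word list should contain, which helps to judge whether an empty result is plausible.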
Answers
e.g. splitting it into 2 or more processes:
Process 1
Read CSV => Select Attribute => Nominal to Text => Store Result
Process 2
Load Result => Data to Documents => Process Documents => Mutual Information => Wordlist to Data
Could this help? (It has on some of mine reading large datasets.)
Best,
JEdward
Thanks.
Is there a way that I can have the processes run one after another? I'm new to RM so pardon the perhaps obvious questions, but when I store the data that's the end of one process and when I retrieve that's the beginning of another, however, is there a way to link these so that the whole thing runs in one fluid motion?
You can easily chain processes by using "Execute Process" operators in a kind of super-process. You can also drag processes from the repository view into the process view, which will create these operators already configured for the chosen processes.
For memory settings please have a look at the installation guide, this should cover the relevant steps: http://rapid-i.com/content/view/17/211/lang,en/
Regards
Matthias
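The idea of splitting the work into two stored stages and then chaining them from one super-process can be sketched in Python as follows. This is an analogy only, not RapidMiner code: the function names, the one-text-per-line intermediate file and the simple token count in the second stage are all assumptions made for illustration.

```python
import csv
import os
import tempfile


def stage_one(csv_path, column, store_path):
    # "Process 1": read the CSV, keep one column, store it (one text per line)
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(store_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            dst.write(row[column].replace("\n", " ") + "\n")


def stage_two(store_path):
    # "Process 2": load the stored texts and count lower-cased tokens
    counts = {}
    with open(store_path, encoding="utf-8") as src:
        for line in src:
            for token in line.lower().split():
                counts[token] = counts.get(token, 0) + 1
    return counts


def run_all(csv_path, column):
    # "Super-process": run both stages back to back via a temporary store
    fd, store_path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        stage_one(csv_path, column, store_path)
        return stage_two(store_path)
    finally:
        os.remove(store_path)
```

The point of the split is that the first stage never holds the text-processing results in memory, and the second stage never holds the full CSV.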
I was about to post the same answer, but Matthias beat me to it.
He clearly gets up much earlier than I.
Regards,
JEdward
Sorry, I did not mean to hijack the topic :-X
Unless you slept until noon, I doubt that I'm getting up much earlier. It all depends on the time zone...
Regards
Matthias
Your help is greatly appreciated.
I wasn't able to figure out how to do this, though. It would be a great help if someone could provide further info on this or suggest another method.
This sounds like you are creating token n-grams, not character n-grams, right? I didn't know this was possible until now, but I just found both of the operators.
It took me 2 minutes to extend the operator, and it now provides a min-length parameter. If you are able to build the extension from source yourself, I will provide you with the modified source code. Otherwise I might send you the jar file ready for inclusion into your RapidMiner (E-Mail?).
If you want to use the regex approach instead (this won't reduce processing time, which my modification should do), try "Filter Tokens (by Content)", set the condition parameter to "matches" and use a regular expression that keeps only n-grams of at least 3 words (i.e. one that requires at least two underscores).
Regards
Matthias
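The regular expression from the original post is not preserved here, so the pattern below is only a guess at the idea: keep a token only if it consists of at least three underscore-separated parts. It is shown as a Python sketch so it can be tried in isolation; the names are hypothetical, and whether the same pattern behaves identically inside "Filter Tokens (by Content)" has not been checked.

```python
import re

# One part, followed by at least two more "_part" groups (assumed pattern)
AT_LEAST_THREE_WORDS = re.compile(r"[^_]+(?:_[^_]+){2,}")


def keep_long_ngrams(tokens):
    # Keep only tokens that match the pattern as a whole
    return [t for t in tokens if AT_LEAST_THREE_WORDS.fullmatch(t)]
```

For example, given the tokens "a", "a_b", "a_b_c" and "a_b_c_d", only the last two survive the filter.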
Yes, I'm creating token n-grams, in order to see patterns in text.
I'm not familiar with how to extend an operator, but if you can provide the source code I can give it a shot. My email is d.saraph@gmail.com, where the jar file can be sent as well. Why would your modification reduce processing time compared to the token filter operator?
For now, I will try the filter token operator, but I would be happy to improve the efficiency of this process if possible.
Thank you for your help.
I just sent the mail. Hope this will be delivered since the attachment size is above 15 MB.
Creating all n-grams and then using additional computations to remove the unwanted ones will probably take more time than creating only the desired ones. That is the reasoning.
If you are also interested in the code: the change is a slightly modified part of TermNGramGeneratorOperator plus one additional parameter. Hope this will help you somehow...
Best regards
Matthias
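To make the reasoning above concrete, here is a small Python sketch that compares the two routes. It is not the modified TermNGramGeneratorOperator code; the function names and the min_length argument are hypothetical stand-ins for the min-length parameter mentioned in this thread.

```python
def ngrams_all(tokens, max_length):
    # Build every n-gram of length 1..max_length (then filter afterwards)
    return ["_".join(tokens[i:i + n])
            for n in range(1, max_length + 1)
            for i in range(len(tokens) - n + 1)]


def ngrams_min(tokens, min_length, max_length):
    # Build only n-grams of length min_length..max_length in the first place
    return ["_".join(tokens[i:i + n])
            for n in range(min_length, max_length + 1)
            for i in range(len(tokens) - n + 1)]


tokens = ["a", "b", "c", "d"]
everything = ngrams_all(tokens, 3)                       # 9 terms created
filtered = [g for g in everything if g.count("_") >= 2]  # 7 of them thrown away
direct = ngrams_min(tokens, 3, 3)                        # only 2 terms created
```

Both routes end with the same two terms, but the first one builds and then discards all the shorter n-grams for every document.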
Thanks for this, as well as emailing me the .jar file. I'm going to try implementing this shortly and will post back on here regarding my progress.
Just wanted to report back that I was able to run the n-grams quite well, but in the end the results were not exactly what I was looking for so I'm going to be tinkering with the data for the next little bit. Thanks for all your help on this.
On another topic, I wanted to inquire if anyone was familiar with word clustering. For example, is there a way that I can cluster the text without considering the order (n-grams are formed based on the order of the words)... I was looking into some of the clustering operators but I'm not sure what would be applicable to what I'm trying to do. I was hoping there would be an operator that could just replace the n-gram operator in order to carry this out since I still wanted the pre-processing of the data, stemming, and filtering as I currently have. Any suggestions are greatly appreciated.
Thanks.
I want to extract the number "2ADFH0B121AO92" from comments.
I have used Read Excel -> Nominal to Text -> Process Documents -> n-gram -> 14
But it's not working. Can you suggest what can be done, please?
Any response on this? I have a similar use case where the order of words in the n-grams doesn't matter, and I want to treat "word1-word2-word3" the same as "word2-word1-word3".
Is that possible?
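One general way to make n-grams order-insensitive is to sort the words inside each n-gram before counting, so that every ordering maps to the same key. The Python sketch below shows the idea; it is not a RapidMiner operator, and the function names and the separator argument are assumptions for illustration.

```python
from collections import Counter


def unordered_key(ngram, sep="_"):
    # Sort the words so that every ordering yields the same key
    return sep.join(sorted(ngram.split(sep)))


def count_unordered(ngrams, sep="_"):
    # Count n-grams while ignoring the order of their words
    return Counter(unordered_key(g, sep) for g in ngrams)
```

With sep="-", both "word1-word2-word3" and "word2-word1-word3" are counted under the key "word1-word2-word3". Note that this only merges orderings of words that were adjacent in the text; it does not find words that co-occur further apart.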
The link posted above doesn't work.
Can't you use the Extract Information operator for this?
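For the ID extraction question above, n-grams are probably not needed at all; a regular expression is the more usual tool. The Python sketch below assumes the IDs are always exactly 14 upper-case letters or digits, with at least one letter and one digit, which is only a guess based on the single example "2ADFH0B121AO92". Whether the same pattern can be used unchanged in a regex-based RapidMiner operator has not been tested here.

```python
import re

# Exactly 14 upper-case letters/digits, at least one digit and one letter
# (assumed format, inferred from one example only)
ID_PATTERN = re.compile(
    r"\b(?=[A-Z0-9]*\d)(?=[A-Z0-9]*[A-Z])[A-Z0-9]{14}\b"
)


def extract_ids(comment):
    # Return all substrings of the comment that look like an ID
    return ID_PATTERN.findall(comment)
```

The two lookaheads keep ordinary 14-letter upper-case words and 14-digit numbers from being picked up as IDs.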