Working with and editing wordlist in RM
Hello people,
I have a wordlist that was generated and stored after text processing. It contains N-grams as well as single words. I'm using this wordlist as the WOR input to my next text processing operator, but I only need to keep the N-grams (tokens that contain _). There is a Wordlist to Data operator that I could use to filter it, but there is no reverse Data to Wordlist operator. Is there any other way for me to filter the wordlist?
Answers
Hi @svtorykh,
I am afraid there is none. Wouldn't it be okay in your case to filter the attributes at the end?
BR,
Martin
Dortmund, Germany
Hi Martin,
Ultimately, what I'm trying to solve is how to customize a wordlist outside RapidMiner and use it as the WOR input for the Process Documents operator. I think this is pretty important, as it helps tremendously with filtering for the proper content while processing documents.
As @mschmitz says, there is not really a way to edit wordlists today, although I agree it would be helpful so please go ahead and submit it as a product idea!
In the meantime, as long as the generated wordlist contains a superset of the words you actually need, there is no real functional problem in RapidMiner. It will simply generate attributes for words that you don't care about, which can be ignored or filtered out later once they are attributes.
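To illustrate why filtering afterwards scales fine, here is a minimal Python sketch of the idea, outside RapidMiner. It assumes the generated attribute names and the list of terms you actually want are both available as plain lists of strings; the function name `filter_attributes` and the sample data are made up for illustration.

```python
def filter_attributes(attribute_names, keep_terms):
    """Return the attribute names that appear in keep_terms, preserving order."""
    keep = set(keep_terms)  # set lookups are O(1) on average
    return [name for name in attribute_names if name in keep]


# Hypothetical example data: generated attributes vs. the terms we care about.
attributes = ["business", "process", "business_process", "optimization",
              "process_optimization", "business_process_optimization"]
wanted = ["business_process_optimization", "process"]

print(filter_attributes(attributes, wanted))
# ['process', 'business_process_optimization']
```

Because the keep-list is turned into a set, matching thousands of attributes against thousands of terms is a single pass over the attribute names.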
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
@svtorykh interesting application. I'm curious, how do you plan on using the customized wordlist? Could you use the Filter Tokens (by Region) operator to automatically filter for '_' and then keep 1 token before and after?
Yeah, I've figured out how to filter on _ using Filter Tokens (by Content). The wordlist must contain both N-grams and single words, though. It's easy for me to decide which of the single words should be included in the wordlist, and I can then merge the N-gram output with those single words outside RapidMiner.
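Roughly what I mean by merging on the outside, as a small Python sketch. It assumes the wordlist has already been exported to a plain list of tokens (for example via Wordlist to Data); `merge_wordlists` and the sample tokens are hypothetical.

```python
def merge_wordlists(tokens, chosen_single_words):
    """Combine all N-grams (tokens containing '_') with hand-picked single words."""
    ngrams = {t for t in tokens if "_" in t}
    return sorted(ngrams | set(chosen_single_words))


# Hypothetical exported wordlist and a manual selection of single words.
wordlist = ["data", "data_mining", "mining", "text", "text_processing"]
merged = merge_wordlists(wordlist, ["mining"])

print(merged)
# ['data_mining', 'mining', 'text_processing']
```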
Can you guys elaborate on post-process attribute filtering? Both the wordlist and the final attribute list will contain thousands of entries, so I'm not sure how complicated it would be to filter thousands from thousands in post-processing.
On another note: one of the benefits of being able to import a customized wordlist is the ability to generate N-grams better.
E.g. I'm looking into business skills and have a repository of skills, many of them three words or more. For "Business Process Optimization" with N-grams of length 3, the results will contain business, process, optimization, business_process, process_optimization, and business_process_optimization. If I could instead just replace the spaces in Excel and supply business_process_optimization as the wordlist input, I wouldn't see the noise of all the other N-grams generated. Makes sense? With thousands of possible skill combinations, the scalability of filtering attributes becomes a problem.
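To make the example concrete, here is a short Python sketch of the two behaviours. This is not how RapidMiner implements N-gram generation; it only mimics the output described above, and the helper names `generate_ngrams` and `match_phrases` are invented for illustration.

```python
def generate_ngrams(tokens, max_n):
    """All contiguous n-grams of length 1..max_n, joined with '_'."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append("_".join(tokens[i:i + n]))
    return grams


def match_phrases(tokens, phrases, max_n):
    """Keep only the n-grams that appear in a curated phrase list."""
    wanted = set(phrases)
    return [g for g in generate_ngrams(tokens, max_n) if g in wanted]


# Skill names as they might sit in a spreadsheet, with spaces replaced by '_'.
skills = ["Business Process Optimization"]
phrases = [s.lower().replace(" ", "_") for s in skills]

tokens = "business process optimization".split()
all_grams = generate_ngrams(tokens, 3)
curated = match_phrases(tokens, phrases, 3)

print(all_grams)  # six tokens, five of them noise for this use case
print(curated)    # ['business_process_optimization']
```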