"Matching Text with ngramms"
Hello,
I am still fairly new to RapidMiner. From what I have seen so far, it is clearly a powerful result of massive brainpower!
I have two questions and hope some of you can help me.
The situation:
I have a master list of product descriptions (large) and have to find similar entries in other lists (small to medium size). It is a 1:n matching task.
I am using basic operators to get rid of unwanted text (stemming, stop words, HTML, ...) and can generate n-grams. I do this twice, once for the master list and once for a specific description, combine the results, and then sort them.
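Outside of RapidMiner, what I am trying to do is roughly this (a simplified Python sketch with made-up sample data and character trigrams, not my actual operator chain):

import re

def clean(text):
    # rough stand-in for the cleanup operators: lowercase and strip everything but letters/digits
    return re.sub(r"[^a-z0-9 ]+", " ", text.lower())

def ngrams(text, n=3):
    # character n-grams of a cleaned description
    t = clean(text)
    return {t[i:i + n] for i in range(len(t) - n + 1)}

master = ["Apple iPhone 4 32GB black", "Samsung Galaxy S II 16GB white"]  # made-up examples
query = "iphone 4 32GB b."

# score every master entry against the query and sort, like the combine-and-sort step
q = ngrams(query)
scores = []
for m in master:
    g = ngrams(m)
    scores.append((len(q & g) / len(q | g), m))  # Jaccard overlap of the two n-gram sets
scores.sort(reverse=True)
print(scores[0])  # best match first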
First, the practical question:
Each description has to be matched against all entries of the master list (there is potential for optimization, but the dataset is too small to bother with that in a first step). Is there an operator to avoid redundant n-gram generation? What would be the best way to match, say, 5 lists of descriptions against the master list without redundant work?
And second, the theoretical question:
Can you think of a setup where I consider all lists as equal? Basically a cloud of descriptions, in which I aggregate the most similar ones?
If you could spare some time to assist me, I would be very grateful.
Thank you
Answers
Cheers,
Ingo
Thank you for your quick reply :) Let me add some more details:
On one side, you have a complete list of products you are interested in; this list includes detailed descriptions and information about these products.
On the other side, you have all kinds of lists (in my case 5) with incomplete information: missing IDs, cut-off descriptions, or slightly different wording.
The task is to match the 5 lists against the first, complete list. Not all products from the 5 lists must have a match in the master list, and you can assume that at most one entry from the master list matches.
Example:
Master list:
Apple iPhone 4 32GB black
Match:
iphone 4 32GB b.
iphone 32GB black
apple iphone 32GB black
etc.
What I can do so far:
Select one product from the 5 lists, generate n-grams of the selection and of the master list, then match them.
But how is this done for all entries...? Just looping looks kind of sequential...
Well, looping would indeed be an option. The other would be to first transform all of the matching-list data into one vectorized example set, which is then matched via similarity against the master list. If you have detailed questions about how this can be achieved with RapidMiner, I would suggest posting the processes you already have, together with some detailed questions, in the board "Data Mining / ETL / BI Processes". It is more likely that somebody there can help you with those details.
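Just to illustrate that second option outside of RapidMiner, here is a minimal Python sketch with made-up data; TF-IDF on character trigrams and cosine similarity are only one possible choice of vectorization and measure:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

master = ["Apple iPhone 4 32GB black", "Samsung Galaxy S II 16GB white"]  # made-up examples
incomplete = ["iphone 4 32GB b.", "galaxy s2 16gb wh."]                   # one of the 5 lists

# vectorize the master list once and reuse the same vocabulary for every other list,
# so the n-gram generation on the master side is not repeated per entry
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3), lowercase=True)
master_vectors = vec.fit_transform(master)
incomplete_vectors = vec.transform(incomplete)

# one similarity matrix: rows = incomplete entries, columns = master entries
sims = cosine_similarity(incomplete_vectors, master_vectors)
for row, entry in zip(sims, incomplete):
    best = row.argmax()
    print(entry, "->", master[best], "(score %.2f)" % row[best])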
Cheers,
Ingo