"Text Mining: How do I assign/create a macro to reference a group of attributes collectively?"
Hi,
Before I begin, I sincerely apologize that I cannot share my process here due to confidentiality issues with my school. This is a service learning project that I am doing for the co-op placements department of my business school. But I will try my best to describe it.
I have a process that tries to classify a job based on lexicons. I have one dictionary per job category, so about 30 dictionaries for about 30 categories. I also have one large custom stopwords dictionary. Each job posting is run against these dictionaries to count how many words from each dictionary it contains. The idea is that whichever dictionary gets the highest word count for a given job posting gives the predicted category of that job. The concept itself is simple, but to automate the whole thing and run it at scale, I'm using file and repository loops, macros, branches, subprocesses, etc.
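For readers who want to see the idea outside RapidMiner, here is a minimal Python sketch of the lexicon approach described above. It is not the poster's process: the lexicons, stopwords, and posting text are made-up placeholders, and the helper names `tokenize` and `score_posting` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def score_posting(text, lexicons, stopwords):
    """Count, per category, how many tokens of the posting are in that lexicon."""
    tokens = [t for t in tokenize(text) if t not in stopwords]
    counts = Counter()
    for category, words in lexicons.items():
        counts[category] = sum(1 for t in tokens if t in words)
    return counts


# Hypothetical lexicons: one set of words per job category.
lexicons = {
    "accounting": {"audit", "ledger", "tax"},
    "marketing": {"brand", "campaign", "seo"},
}
stopwords = {"the", "and", "a", "for", "of"}

posting = "Manage the brand campaign and support the tax audit for a campaign launch"
counts = score_posting(posting, lexicons, stopwords)

# The predicted category is the lexicon with the most hits.
predicted = max(counts, key=counts.get)
print(predicted, dict(counts))
```

Adding a new category here only means adding a new entry to `lexicons`; nothing else in the sketch names a category explicitly.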
The process works fine except that the results are very cluttered. For every job posting, word counts for all 30 dictionaries are returned. I'd like to limit it to just the highest one or the top 3. I know that I can use the Max function in Generate Attributes to select the one with the highest count, but that would mean the dictionary names are hard-coded into the process. I'd like it to handle new dictionaries on its own in the future without me having to go into the parameter settings and modify things. Also, if I used attribute names in Max(), the function would be very long, e.g. Max(dictionary_1, dictionary_2, dictionary_3, ...., dictionary_30). Is there a way to use a macro instead to refer to these dictionary attributes, so that I can write a simple function such as Max(%{dictionary}) and have it select the highest count?
I've attached a sample csv with breakpoint results for one row/document/job posting. As you can see, it has word counts for several dictionaries; however, I'm only interested in the largest one. And I need to do this for over 5k job postings. I want one or more attributes that pick the top category or the top three categories for each document/row using macros and Generate Attributes.
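To make the goal concrete, this is a small Python sketch of "top category or top three per row" without hard-coding the dictionary names. It is only an illustration outside RapidMiner: the `dictionary_` prefix, the column names, and the counts are assumptions, and `top_categories` is a hypothetical helper.

```python
def top_categories(row, prefix="dictionary_", k=3):
    """Return the k (name, count) pairs with the highest counts.

    Only columns whose name starts with `prefix` are considered, so new
    dictionary columns are picked up without changing the code.
    """
    counts = {name: value for name, value in row.items() if name.startswith(prefix)}
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# One hypothetical document row with made-up counts.
row = {
    "doc_id": 17,
    "dictionary_finance": 4,
    "dictionary_it": 9,
    "dictionary_hr": 1,
    "dictionary_sales": 6,
}

top3 = top_categories(row)       # three best categories with their counts
top1 = top_categories(row, k=1)  # only the best one
print(top3)
```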
Thank you very much and your help is greatly appreciated.
Best Answer
batstache611 Member Posts: 45 Maven
@mschmitz, never mind, I got it. Using the Aggregate operator at the end does the trick. I needed to aggregate Count using the maximum function and PredictedCat using mode, and then group by the company IDs. Thank you very much for pointing me in the right direction though!
Best regards.
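As an illustration of the aggregation described in this answer (maximum of the count, mode of the predicted category, grouped by company ID), here is a rough Python equivalent. It is a sketch with made-up rows, not the RapidMiner process itself; the column names follow the ones mentioned in the thread.

```python
from collections import defaultdict
from statistics import mode

# Made-up rows: one row per (company, category) candidate.
rows = [
    {"JobCompanyID": "C1", "HighestCount": 4, "PredictedCat": "Finance"},
    {"JobCompanyID": "C1", "HighestCount": 9, "PredictedCat": "IT"},
    {"JobCompanyID": "C1", "HighestCount": 3, "PredictedCat": "IT"},
    {"JobCompanyID": "C2", "HighestCount": 6, "PredictedCat": "Sales"},
]

# Group the rows by company ID.
groups = defaultdict(list)
for row in rows:
    groups[row["JobCompanyID"]].append(row)

# Aggregate each group: maximum count and most frequent predicted category.
summary = {
    company: {
        "max_count": max(r["HighestCount"] for r in members),
        "mode_cat": mode([r["PredictedCat"] for r in members]),
    }
    for company, members in groups.items()
}
print(summary)
```

Note that the maximum and the mode are computed independently per group, so the two values are not guaranteed to come from the same row.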
Answers
Hi,
what you can do is use Generate Aggregation with a regex. This gives you the option to take max of n attributes.
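The same idea, taking the maximum over whichever attributes match a regular expression, can be sketched in plain Python. This is only an analogy to the operator, not its implementation; the pattern `dictionary_.*`, the column names, and the counts are assumptions.

```python
import re


def max_over_matching(row, pattern):
    """Return the maximum value among columns whose full name matches the regex."""
    regex = re.compile(pattern)
    values = [value for name, value in row.items() if regex.fullmatch(name)]
    return max(values) if values else None


# Made-up rows with three dictionary count columns each.
rows = [
    {"doc_id": 1, "dictionary_1": 2, "dictionary_2": 7, "dictionary_3": 0},
    {"doc_id": 2, "dictionary_1": 5, "dictionary_2": 1, "dictionary_3": 3},
]

# Add one new column per row holding the largest matching count.
for row in rows:
    row["max_count"] = max_over_matching(row, r"dictionary_.*")

print([row["max_count"] for row in rows])
```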
Best,
Martin
Dortmund, Germany
Thank you very much @mschmitz. It works, but I'm only halfway there. I want the attribute name to be the category that was picked. So in the parameter settings of Generate Aggregation, where it asks me for the Attribute Name, I want to insert some kind of macro that will return the name of the dictionary with the highest count.
To summarise, I want to pick the dictionary with the highest tag count and return the name and count number for that dictionary. Hope I was able to explain myself clearly. Thank you for your solution.
Update: I am using a macro that grabs the file name of the dictionary used. However, if I use this macro for Generate Aggregation's attribute name, it returns the name of the last dictionary used, which is always the same.
Hey,
ok, so you do not just need the max, but also the name of the max. Is transposing the table and sorting + Filter Example Range an option?
~Martin
Dortmund, Germany
Hi @mschmitz, I'm sorry it took me a while to reply, as I went on a tangent from this process for a short while. Yes, transposing and sorting works. However, I'm not exactly sure how Filter Example Range would help me. I've attached a csv of the current process output. Cat ID lists all the job categories that we have, HighestCount is the word count for the category with the maximum number of hits for a given job posting, PredictedCat is the name of that category, and JobCompanyID is the ID of the company that posted the job.
As you can see, for each company the HighestCount number can vary widely. But I'd only like to keep the row with the greatest number. In the demo file there are 3 companies that have job postings, so I'd like RapidMiner to return only 3 rows with company ID, count, and predicted category. Hope I was able to explain myself. Thank you very much.
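The request above (keep only the row with the greatest count per company) can be sketched in Python as follows. The rows are made up, and `best_row_per_company` is a hypothetical helper; this variant keeps the count and the category from the same row.

```python
def best_row_per_company(rows, key="JobCompanyID", score="HighestCount"):
    """Keep, for each company, the single row with the greatest score."""
    best = {}
    for row in rows:
        current = best.get(row[key])
        if current is None or row[score] > current[score]:
            best[row[key]] = row
    return list(best.values())


# Made-up output rows for three companies.
rows = [
    {"JobCompanyID": 101, "HighestCount": 4, "PredictedCat": "Finance"},
    {"JobCompanyID": 101, "HighestCount": 12, "PredictedCat": "IT"},
    {"JobCompanyID": 202, "HighestCount": 7, "PredictedCat": "Sales"},
    {"JobCompanyID": 303, "HighestCount": 2, "PredictedCat": "HR"},
    {"JobCompanyID": 303, "HighestCount": 5, "PredictedCat": "Marketing"},
]

result = best_row_per_company(rows)
for row in result:
    print(row["JobCompanyID"], row["HighestCount"], row["PredictedCat"])
```

When two rows of the same company tie on the count, this sketch keeps the first one it encounters.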