Investigate customer feedback
Hello everyone,
I want to start a new project and need your help.
My knowledge of RapidMiner is rather basic, so I am not an expert.
For my project I have the following assumptions:
I am a provider of a product/service and receive regular customer feedback in text form. Customers report only negative experiences, so sentiment analysis is not required.
In the feedback, the customer reports one or more issues. All customer feedback should now be examined for the problems mentioned. Besides already known problems, new (unknown) problem types should also be identified.
I see the main difficulty in the fact that each customer may describe the same problem differently. Of course, I am sure that there will be other challenges in the implementation, especially for me.
Do you think this is possible? If so, how difficult is it, and what is the best way to start?
I appreciate your answer/support.
Answers
Hi @Nicson - I moved this thread to "Getting Started" as it seemed like a more appropriate place for your question.
So to me this is a classic text mining problem - you're trying to cluster customer feedback (natural text) into topics / categories. There is the traditional way using tokenization, n-grams, and so forth. And then there are the nifty new tools that the ever-resourceful @mschmitz has developed as part of his Operator Toolbox. I would start by getting an understanding of text mining in RapidMiner (maybe start here); then I'd move on to the tools in Operator Toolbox.
Scott
Thank you @sgenzer for moving this topic.
I looked at the text processing and now I have the following idea:
Input Feedback >[text processing: transform cases > tokenize > filter stopwords > n-grams (bigrams) > filter n-grams]
Now I get a list of the n-grams contained in the document (customer feedback). I have thought about carrying out this process with all the feedback. All collected n-grams would then be checked for similarity and grouped if necessary. I would then like to manually name these groups with my own categories. These manually labelled data would subsequently be used for machine learning, so that my categories can be assigned automatically.
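The preprocessing pipeline described above (transform cases > tokenize > filter stopwords > bigrams) can be sketched in plain Python outside RapidMiner. This is only an illustration of the idea, not RapidMiner's actual implementation; the tiny stopword list and function name are made up for the example:

```python
import re

# Toy stopword list for illustration only; real lists are much larger.
STOPWORDS = {"i", "am", "an", "the", "a", "is", "and", "when"}

def preprocess(text):
    # transform cases
    text = text.lower()
    # tokenize on non-letter characters
    tokens = [t for t in re.split(r"[^a-z]+", text) if t]
    # filter stopwords
    tokens = [t for t in tokens if t not in STOPWORDS]
    # generate bigrams joined with "_" (RapidMiner-style n-grams)
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

print(preprocess("The app crashes when I open the settings page."))
```

The output mixes single tokens and bigrams such as `app_crashes`, which is roughly the word vector the Process Documents operator would produce per feedback item.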
Would this be a possible approach or is it going in the wrong direction?
For me, it would be important to create a working basic model first and to optimize it later if necessary.
Thank you
yes @Nicson this is exactly the approach I would take. You are very welcome to post your XML processes here in this thread as you go along and we can help when you get stuck.
Good luck!
Scott
This is what I have come up with after looking at some tutorials. @sgenzer
As data input I use a two-column Excel file, it is structured as follows:
I then send this data through the "Training Model" and save the wordlist and the model.
In the last process I take real (unclassified) data, process it according to the same principle, and have the model assign categories.
What do you think of this solution?
Further questions would be:
I am very grateful for further feedback.
hi @Nicson - there's something very weird with your XML. Can you please re-post?
Scott
I hope it works this time. @sgenzer
hi @Nicson - yes that XML works and it all looks fine. To improve your model (particularly with TF-IDF), you really should do some feature selection. There are several tutorials on how to do this. And of course optimizing your model (e.g. using Optimize Parameters) and trying different models may work well. You could even try using AutoModel on the example set AFTER you've done the Process Documents from Data operator.
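To make the TF-IDF and feature-selection suggestion concrete, here is a minimal stdlib-only sketch: compute TF-IDF weights per document, then keep only the k terms with the highest total weight (a crude selection criterion, chosen just for illustration; RapidMiner's Process Documents operator and its weighting operators do this properly):

```python
import math
from collections import Counter

# Toy pre-tokenized "documents" standing in for customer feedback.
docs = [
    ["error", "login", "page"],
    ["error", "payment"],
    ["slow", "page"],
]

def tf_idf(docs):
    n = len(docs)
    # document frequency of each term
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return vectors

def select_top_terms(vectors, k):
    # feature selection: keep the k terms with the highest total weight
    totals = Counter()
    for v in vectors:
        totals.update(v)
    return {t for t, _ in totals.most_common(k)}

vectors = tf_idf(docs)
print(select_top_terms(vectors, 3))
```

Terms appearing in every document get a low IDF and tend to drop out, which is exactly the dimensionality reduction feature selection is after.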
Scott
After some time has passed, I would like to report back again. @sgenzer
In the meantime I have read through a lot of things and understood how important a good data model is.
That is why, as already mentioned, I have considered and carried out the following: I processed a variety of test documents and extracted a word list with the most important terms and n-grams. However, I noticed that there are a lot of similar terms and asked myself whether I can group them again in RapidMiner. I'm sure it's possible, but my attempts have all failed so far.
I just want to check the terms in the list (without reference to the documents) for similarities and cluster them. What is the easiest way to do this?
The next point would be the actual classification. The "basic framework" is already there (if this works at all with this setup?). I've reworked my concept a little bit. I would like to check new documents in such a way that the occurrence of a word/n-gram (either an exact match or a match above a certain degree of similarity) is indicated in a new column by True or False, for example. The problem is that I only have words/n-grams that indicate a certain category, none that count against it.
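Grouping the terms in the list by string similarity (without reference to the documents) can be sketched with `difflib` from the Python standard library. This greedy single-pass grouping and the 0.6 threshold are illustrative choices, not a RapidMiner feature:

```python
from difflib import SequenceMatcher

def group_terms(terms, threshold=0.6):
    # Greedily assign each term to the first group whose representative
    # (the group's first term) is similar enough, else start a new group.
    groups = []
    for term in terms:
        for group in groups:
            if SequenceMatcher(None, term, group[0]).ratio() >= threshold:
                group.append(term)
                break
        else:
            groups.append([term])
    return groups

print(group_terms(["error", "error_xy", "error_code", "timeout", "time_out"]))
# [['error', 'error_xy', 'error_code'], ['timeout', 'time_out']]
```

Each resulting group could then be given one manual label in a single step, as described above.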
Thank you very much
hi @Nicson glad you're making progress. So for grouping similar terms, I generally use Replace Tokens (Dictionary) and choose one token to represent each grouping. I'm not sure I understand your second question very well. The TF-IDF will give you a value; if you want to convert this to true/false, you can simply create a threshold and convert.
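The threshold idea in one line of Python; the 0.1 cut-off is an arbitrary placeholder, and the row values are made up:

```python
def to_flag(tfidf_value, threshold=0.1):
    # Convert a TF-IDF weight to a boolean presence flag.
    return tfidf_value > threshold

row = {"error": 0.42, "login": 0.03}
flags = {term: to_flag(w) for term, w in row.items()}
print(flags)  # {'error': True, 'login': False}
```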
Scott
@sgenzer
I might have expressed myself poorly. However, the method you mentioned is certainly interesting for future projects. What I meant was: I extracted a list of frequent terms and n-grams, which are to be classified manually for machine learning. Now I wanted to "cluster" this list first. Terms such as "error" or "error_xy" should be grouped together so that I can manually assign them to a certain label in one step.
Then I would like to check each new document for consistency or similarity of terms from the list.
As an example:
The Error list contains: (Error, Error_code, Error_xy,...)
A new document has the following text: "I'm getting an error."
In this case, the document gets the label "error" because there is a direct match.
You know what I mean?
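The matching rule in the example above can be sketched as a simple dictionary lookup. The term set and function name are made up for illustration; a similarity measure could replace the exact-match test:

```python
import re

# The "Error list" from the example above.
ERROR_TERMS = {"error", "error_code", "error_xy"}

def label_document(text):
    # Tokenize and label the document "error" on any direct match.
    tokens = re.split(r"[^a-z_]+", text.lower())
    return "error" if any(t in ERROR_TERMS for t in tokens) else None

print(label_document("I'm getting an error."))  # error
```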
Dear @Nicson,
for your info - we've just added a new operator called Extract Topics (LDA) to Operator Toolbox. It is able to automatically detect topics for documents and returns the n most important words per topic. The difference from Scott's clustering approach is that a document can be assigned to more than one topic.
Tell me if you need this ASAP. We are finishing other operators at the moment, so it could take a few days before this hits the marketplace.
Best,
Martin
Dortmund, Germany
Thank you @mschmitz for this information. I am not in a hurry with my project at the moment, so I can wait until then. Maybe @sgenzer has a further idea that I could try in the meantime.
nope. If I were you @Nicson, I would follow @mschmitz's lead.
Scott
I will definitely do that @sgenzer :smileyhappy:
In the meantime I have managed to process my training documents and create valid clusters. In the book "Predictive Analytics and Data Mining" I became aware of another clustering example and was able to implement it successfully.
I have now saved the output of the cluster operator (k-Medoids) as an Excel file and would like to rename the cluster labels to my specific classes. Afterwards I would like to use this labelled data to train the model and then apply the model to unknown data.
How do I have to integrate the classified data into the process? Do I have to process them again by text processing or how do I proceed?
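The relabelling step described above, mapping k-Medoids cluster ids to custom class names before training, can be sketched as a simple lookup. The mapping values and row contents are placeholders:

```python
# Hypothetical mapping from cluster ids to your own class names.
cluster_to_label = {"cluster_0": "login_problem", "cluster_1": "payment_problem"}

rows = [
    {"text": "cannot log in", "cluster": "cluster_0"},
    {"text": "payment failed", "cluster": "cluster_1"},
]

# Replace each cluster id with the manually chosen label.
for row in rows:
    row["label"] = cluster_to_label[row["cluster"]]

print([r["label"] for r in rows])  # ['login_problem', 'payment_problem']
```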
Thank you for your efforts.
hi @Nicson can you please share your XML and data set?
Here's the XML code @sgenzer:
Unfortunately, I can't publish the data set, because in this case I used internal data from my university.
hi @Nicson - ok no problem. So I cannot test your process without data but I am attaching your process with some additional notes and operators so you can get the gist of where to go. You do not need to do Process Documents again.
Scott
Thank you @sgenzer
What do you mean by feature selection? I can't really make the connection. (This is probably due to my lack of knowledge.)
hi @Nicson so feature selection is not one operator; it's a data science technique where you reduce the dimensionality of your data set in order to improve your model.
https://en.wikipedia.org/wiki/Feature_selection
There are numerous operators in RapidMiner to help you do this:
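One very simple feature-selection criterion, sketched in plain Python for intuition (the matrix values are invented, and the variance threshold is arbitrary): drop terms whose TF-IDF weight barely varies across documents, since a near-constant column carries no information for separating classes.

```python
from statistics import pvariance

# Toy term-by-document TF-IDF matrix: term -> weight per document.
matrix = {
    "error":   [0.9, 0.8, 0.0, 0.1],
    "page":    [0.2, 0.2, 0.2, 0.2],   # constant across documents -> useless
    "payment": [0.0, 0.0, 0.7, 0.9],
}

# Keep only terms whose weights actually vary between documents.
selected = [t for t, col in matrix.items() if pvariance(col) > 0.01]
print(selected)  # ['error', 'payment']
```

RapidMiner's dedicated feature-selection operators use more sophisticated criteria, but the goal is the same: fewer, more informative columns.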
Scott