Client Ideas/ Suggestion Sorting to Merge

kjkellish24 · April 2018

I was hoping to get a little guidance on building a process that will help me organize data for a work project that I have been losing sleep over... Here is the problem and what I am trying to accomplish: I work for a construction technology company that uses UserVoice to gather client suggestions and ideas, which are then voted on by our other clients. Unfortunately, many tickets are being created as separate user posts with the same content as existing tickets; essentially the same idea, worded a little differently. Because of this, we have multiple like-tickets, with client users voting on each of them. So my responsibility, under a time crunch, is to sort the tickets so that those asking for similar things are grouped together. Then I can export to Excel and have all my similar tickets together, easy to copy and paste into UserVoice to find them and bulk merge them into one ticket in our account... This way all votes are combined and we get a true picture of what our clients need, so development resources can be allocated appropriately.

Please find attached a sample of what I am trying to accomplish, for those visual folks out there. You will notice that the first tab is how I receive the tickets and the second tab is how, in a perfect world, they would be sorted by the RapidMiner process. Text color is only there to show the different client requests, each with votes on it from other users... Currently, I have about 3000 tickets that need to be sorted by the end of Q2. This has caused a lot of stress and lost sleep. You have no idea how helpful this would be!!

Please let me know if you can help! Any process suggestions or extension recommendations to accomplish this would be really really appreciated! Look forward to hearing back.

Thanks!

Telcontar120 · April 2018

In my view, this is a very domain-knowledge specific task, so it will probably require some iterative interaction on your part with different approaches to find the best solution.

I would also recommend checking out the "Extract Topics from Documents" operator, which might help with this project.

Another approach that you might consider is some form of agglomerative clustering. This is often helpful when the precise number of desired categories is not known in advance, because it starts at the most granular level (one record per cluster) and keeps aggregating based on similarities as you get larger and larger groups. Here's a simple process to get you started on that. Keep in mind you may want to try different distance metrics in the clustering operator to see how the results vary, or you may also want to do some additional text pre-processing such as stemming or token replacement for synonymous terms.

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_excel" compatibility="8.1.003" expanded="true" height="68" name="Read Excel" width="90" x="179" y="136">
        <parameter key="excel_file" value="C:\Users\brian\Downloads\Uservoice Text Mining Example  (3).xlsx"/>
        <list key="annotations"/>
        <list key="data_set_meta_data_information"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="8.1.003" expanded="true" height="82" name="Set Role" width="90" x="179" y="238">
        <parameter key="attribute_name" value="id"/>
        <parameter key="target_role" value="id"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="filter_example_range" compatibility="8.1.003" expanded="true" height="82" name="Filter Example Range" width="90" x="313" y="238">
        <parameter key="first_example" value="1"/>
        <parameter key="last_example" value="18"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="8.1.003" expanded="true" height="82" name="Nominal to Text" width="90" x="313" y="136"/>
      <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="447" y="136">
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_below_percent" value="1.0"/>
        <parameter key="prune_above_percent" value="100.0"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <!-- Tokenize first: Filter Stopwords (English) compares whole tokens, so it must run after Tokenize -->
          <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34"/>
          <operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases" width="90" x="246" y="34"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="447" y="34"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="agglomerative_clustering" compatibility="8.1.003" expanded="true" height="82" name="Clustering" width="90" x="581" y="34">
        <parameter key="measure_types" value="NumericalMeasures"/>
        <parameter key="numerical_measure" value="CosineSimilarity"/>
      </operator>
      <connect from_op="Read Excel" from_port="output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Filter Example Range" to_port="example set input"/>
      <connect from_op="Filter Example Range" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Clustering" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="word list" to_port="result 3"/>
      <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
      <connect from_op="Clustering" from_port="example set" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>
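
For readers who want to see the agglomerative idea outside RapidMiner, here is a minimal Python sketch using only the standard library. It is not a translation of the process above: it uses raw term counts rather than the weighting that Process Documents applies, average linkage, and a hypothetical similarity `threshold` to decide when to stop merging. The ticket texts and the stopword list are made up for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not RapidMiner's list)
STOPWORDS = {"a", "an", "the", "to", "in", "of", "for", "and", "on", "is", "be"}

def tokenize(text):
    """Lowercase, split on non-letters, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def agglomerate(texts, threshold=0.3):
    """Start with one cluster per ticket and repeatedly merge the most similar
    pair (average linkage) until no pair reaches `threshold`."""
    vectors = [Counter(tokenize(t)) for t in texts]
    clusters = [[i] for i in range(len(texts))]

    def linkage(c1, c2):
        sims = [cosine(vectors[i], vectors[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = linkage(clusters[x], clusters[y])
                if s > best_sim:
                    best_sim, best_pair = s, (x, y)
        if best_sim < threshold:
            break
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Hypothetical tickets, for illustration only
tickets = [
    "Add weather data to the daily log",
    "Daily log should include weather",
    "Export markups to PDF",
    "Allow PDF export of markups",
    "Dark mode for mobile app",
]
groups = agglomerate(tickets)
for group in groups:
    print([tickets[i] for i in group])
```

Raising the threshold gives more, tighter groups; lowering it merges more aggressively. On real tickets, the right value would have to be found by inspecting the groups, much like trying different distance measures in the Clustering operator.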

Thomas_Ott · April 2018

@kjkellish24 do you have a fixed number of categories? Like Daily Log, Closeout, Markup, etc.? I think some sort of initial mapping using a Generate Attributes operator on the 'title' attribute, followed by Text Processing and maybe a Bayesian algo, would help categorize your test set.
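
One way to read this suggestion, sketched in plain Python rather than RapidMiner: label the titles that contain an obvious keyword, train a small multinomial Naive Bayes model on those, and let it categorize the rest. The keyword rules, titles, and function names below are all hypothetical; in RapidMiner the rule step would be a Generate Attributes expression and the model a Naive Bayes operator.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical keyword -> category rules, standing in for a Generate Attributes expression
RULES = {"daily log": "Daily Log", "closeout": "Closeout", "markup": "Markup"}

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def rule_label(title):
    """Return the first category whose keyword occurs in the title, else None."""
    lowered = title.lower()
    for keyword, category in RULES.items():
        if keyword in lowered:
            return category
    return None

def train_nb(labeled):
    """Collect the counts a multinomial Naive Bayes model needs from (text, category) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, category in labeled:
        class_counts[category] += 1
        for tok in tokenize(text):
            word_counts[category][tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab

def predict_nb(model, text):
    """Pick the category with the highest log-posterior, using add-one smoothing.
    Tokens never seen in training are ignored."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_score = None, -math.inf
    for category, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[category].values())
        for tok in tokenize(text):
            if tok in vocab:
                score += math.log((word_counts[category][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best, best_score = category, score
    return best

# Hypothetical titles, for illustration only
titles = [
    "Daily log weather entry",
    "Add crew hours to daily log",
    "Closeout punch list report",
    "Closeout document checklist",
    "Markup color options",
    "Markup text size on drawings",
    "Record weather and crew hours each day",
]
labeled = [(t, rule_label(t)) for t in titles if rule_label(t)]
unlabeled = [t for t in titles if rule_label(t) is None]
model = train_nb(labeled)
predictions = {t: predict_nb(model, t) for t in unlabeled}
print(predictions)
```

The last title matches none of the keywords, so the rules leave it unlabeled; the model assigns it based on the words it shares with the rule-labeled titles.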

kjkellish24 · April 2018

Thanks for the quick response back @Thomas_Ott! You da man! So an internal team has already gone through the user ideas/suggestions and assigned them tags that correspond to the tool each one covers. So the categories are fixed, and I can export the ideas for just one specific tool, which makes things easier. However, once I filter down to the granular level of one specific tool, I am still left with 300-400 suggestions for just the daily log, as an example. At this point it becomes very time-consuming, as I have to go through and pick out which are similar enough to be merged and which are asking for something completely new. See the attached photo of the labeling options for each tool; I also added the model that I created, but this one doesn't seem to work like I need it to. Hope this helps and again, I really appreciate it tons! I'll pay it forward to someone else today!

uservoice example.jpg

sgenzer · April 2018

hi @kjkellish24 - so I would suggest checking out our newest text mining operator in the "Operator Toolbox" extension, called "Extract Topics from Documents". You can see an overview in the slide deck posted here and there's a tutorial in Studio:

Screen Shot 2018-04-24 at 9.14.55 AM.png

Good luck!

Scott

kjkellish24 · April 2018

Thank you, Everyone! This is all extremely helpful! I'm gonna start testing some of the various actions mentioned and I'll keep ya posted on how it goes... Fingers crossed! Btw... I got into this to find a lean approach to this task but I've got to say, I'm kind of hooked on this and look forward to learning more. It's kind of fun! Never thought I would think that about text mining documents lol

kjkellish24 · April 2018

@Telcontar120 Do you by chance have any articles on stemming or token replacement for synonymous terms? I tried to accomplish this with Word2Vec but kept getting hung up and couldn't get it to work.

Telcontar120 · April 2018

There are built-in Stemming operators for RapidMiner, so you should be able simply to add them into the Process Documents subprocess in the process that I supplied. I prefer the Porter stemming one myself.

In terms of synonym replacement, the best approach I have found is a manual one, because this is very domain specific. You can take the wordlist from the raw data output and make a note of any tokens that you would consider to be synonyms and then enter them in the Replace Tokens operator inside Process Documents. This takes a little bit of time to set up initially, but will definitely give the best results.
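
As a rough illustration of the manual approach outside RapidMiner, the Python sketch below builds a word list from some made-up tickets and then applies a hand-written synonym map to the tokens. The `SYNONYMS` entries and ticket texts are hypothetical; in RapidMiner the same mapping would be entered in the Replace Tokens operator.

```python
import re
from collections import Counter

# Hypothetical synonym map: variant token -> canonical token
SYNONYMS = {
    "pics": "photo",
    "picture": "photo",
    "pictures": "photo",
    "photos": "photo",
    "image": "photo",
    "images": "photo",
}

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def normalize(text, synonyms=SYNONYMS):
    """Tokenize, then replace each token that has a canonical form."""
    return [synonyms.get(tok, tok) for tok in tokenize(text)]

# Hypothetical tickets, for illustration only
tickets = [
    "Attach pictures to daily log",
    "Attach images to daily log",
]

# Step 1: review the raw word list to spot tokens that mean the same thing
word_list = Counter(tok for t in tickets for tok in tokenize(t))
print(word_list.most_common())

# Step 2: after replacement, the two tickets produce identical token lists
a = normalize(tickets[0])
b = normalize(tickets[1])
print(a == b)
```

Once synonymous tokens collapse to one form, tickets that differ only in wording get much closer under any token-based similarity measure.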

sgenzer · April 2018

hi @kjkellish24 so yes, Word2Vec can help here: it learns word vectors from a corpus, and words used in similar contexts end up close together, so you can use it to build a kind of "stemming dictionary" of synonymous words. The author of the Word2Vec extension, @mschmitz, wrote a nice KB article with example processes that you can look at here.

Scott

Altair RapidMiner Community