Can I create multiple models using an attribute as a loop variable?

adamf · August 2018

I have a data set that is comprised of messages from dozens of different authors. My goal is to develop a model based on multiple attributes (including TF-IDF) of each author's messages. Since each author's messages are likely to be unique in terms of their content, topics, word usage, etc., I'd like to develop one model for each author. In other words, if I have 10 authors, I want to create 10 unique models (one for each author's messages). Thus, I have several questions:

1) One of the attributes of my data is the author's name. Can I use this attribute somehow as a loop variable so that for each iteration of the loop I can analyze all of an author's messages and train and create a model unique to that author?

2) How can I name and store these models in such a way that in another RM process I can retrieve a model based on an author's name? In other words, if I train a model based on messages whose author is Jenny, then how can I retrieve and apply "Jenny's model" if I get new messages from Jenny in the future (or "Steve's model" if I get new messages from Steve, and so on)?

3) Also, is there an unsupervised model that can be used to model all of an individual author's messages as a single class, and then apply the model to future messages to detect deviations or anomalies?

MartinLiebig · August 2018

@lionelderkrikor,

you can use Extract Macro and use the first author as a macro. Then you can use this in Write Clustering

- I would recommend to use the usual loop over loop examples. Requieres you to extract numberOfExamples first but you get the loop in parallel.

- I would use a Store operator over Write Clustering.

@adamf: Maybe you want to try LDA of Toolbox on it.

BR,

Martin

lionelderkrikor · August 2018

Hi @adamf,

I will try to provide some elements of anwers :

0. Hypothesis :

I assume that your dataset is under this form :

Id	Author	Messages
1	Author_1	My taylor is rich
2	Author_2	To Alcohol ! The cause of, and solution to, all of life's problems
3	Author_3	God bless America

1. Process 1 :

Basically, it is a process which create N cluster models from your dataset (one model for each author, k = 1 / model used = DBScan) and write the cluster model in a path on oyur computer (path to set in the parameters).

Process 1 :

<?xml version="1.0" encoding="UTF-8"?><process version="9.0.000-BETA4">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.0.000-BETA4" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_excel" compatibility="9.0.000-BETA4" expanded="true" height="68" name="Read Excel" width="90" x="112" y="136">
        <parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Loop_models\Loop_model.xlsx"/>
        <list key="annotations"/>
        <parameter key="date_format" value="MMM d, yyyy h:mm:ss a z"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="Id.true.integer.attribute"/>
          <parameter key="1" value="Author.true.polynominal.attribute"/>
          <parameter key="2" value="Messages.true.polynominal.attribute"/>
        </list>
        <parameter key="read_not_matching_values_as_missings" value="false"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="9.0.000-BETA4" expanded="true" height="82" name="Nominal to Text" width="90" x="246" y="136">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Messages"/>
      </operator>
      <operator activated="true" class="loop_examples" compatibility="9.0.000-BETA4" expanded="true" height="124" name="Loop Examples" width="90" x="380" y="136">
        <process expanded="true">
          <operator activated="true" class="filter_example_range" compatibility="9.0.000-BETA4" expanded="true" height="82" name="Filter Example Range" width="90" x="179" y="34">
            <parameter key="first_example" value="%{example}"/>
            <parameter key="last_example" value="%{example}"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="380" y="34">
            <parameter key="vector_creation" value="Term Occurrences"/>
            <list key="specify_weights"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34"/>
              <operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="313" y="34"/>
              <operator activated="true" class="text:filter_by_length" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="447" y="34"/>
              <operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases" width="90" x="581" y="34"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
              <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="dbscan" compatibility="9.0.000-BETA4" expanded="true" height="82" name="Clustering (2)" width="90" x="581" y="34"/>
          <operator activated="true" class="legacy:write_clustering" compatibility="9.0.000-BETA4" expanded="true" height="68" name="Write Clustering" width="90" x="715" y="85">
            <parameter key="cluster_model_file" value="C:\Users\Lionel\Desktop\model_%{example}.clm"/>
          </operator>
          <connect from_port="example set" to_op="Filter Example Range" to_port="example set input"/>
          <connect from_op="Filter Example Range" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_op="Clustering (2)" to_port="example set"/>
          <connect from_op="Clustering (2)" from_port="cluster model" to_op="Write Clustering" to_port="input"/>
          <connect from_op="Clustering (2)" from_port="clustered set" to_port="output 1"/>
          <connect from_op="Write Clustering" from_port="through" to_port="output 2"/>
          <portSpacing port="source_example set" spacing="0"/>
          <portSpacing port="sink_example set" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
          <portSpacing port="sink_output 3" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Read Excel" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Loop Examples" to_port="example set"/>
      <connect from_op="Loop Examples" from_port="output 1" to_port="result 1"/>
      <connect from_op="Loop Examples" from_port="output 2" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

2. Process 2 :

I reflected on a process which is not exactly what you asking, but I think it can be relevant for your final use :

This process create an unique cluster model from your training dataset with k = number of authors. In this case, each author/message belongs to a cluster.

So when you have a future message from a "known author" to "score" :

- if effectivly the author uses the same wording in this second message, the model will classify it in the author's cluster.

- if the "wording" of this second message is different than in the first message (from training dataset), the model will classify it in a other author's cluster. From there you can study deviations, anomalies, like you said. Maybe you can calculate the distance between the 2 differents clusters (I don't know if it is feasable in RapidMiner).

Process 2 :

<?xml version="1.0" encoding="UTF-8"?><process version="9.0.000-BETA4">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.0.000-BETA4" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_excel" compatibility="9.0.000-BETA4" expanded="true" height="68" name="Read Excel" width="90" x="45" y="85">
        <parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Loop_models\Loop_model.xlsx"/>
        <list key="annotations"/>
        <parameter key="date_format" value="MMM d, yyyy h:mm:ss a z"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="Id.true.integer.attribute"/>
          <parameter key="1" value="Author.true.polynominal.attribute"/>
          <parameter key="2" value="Messages.true.polynominal.attribute"/>
        </list>
        <parameter key="read_not_matching_values_as_missings" value="false"/>
      </operator>
      <operator activated="true" class="extract_macro" compatibility="9.0.000-BETA4" expanded="true" height="68" name="Extract Macro" width="90" x="179" y="85">
        <parameter key="macro" value="numberCluster"/>
        <list key="additional_macros"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="9.0.000-BETA4" expanded="true" height="82" name="Nominal to Text" width="90" x="313" y="85">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Messages"/>
      </operator>
      <operator activated="true" class="work_on_subset" compatibility="9.0.000-BETA4" expanded="true" height="124" name="Work on Subset" width="90" x="447" y="85">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Messages"/>
        <process expanded="true">
          <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="246" y="85">
            <parameter key="vector_creation" value="Term Occurrences"/>
            <list key="specify_weights"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34"/>
              <operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="313" y="34"/>
              <operator activated="true" class="text:filter_by_length" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="447" y="34"/>
              <operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases" width="90" x="581" y="34"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
              <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="concurrency:k_means" compatibility="9.0.000-BETA4" expanded="true" height="82" name="Clustering" width="90" x="380" y="85">
            <parameter key="k" value="%{numberCluster}"/>
          </operator>
          <connect from_port="exampleSet" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_op="Clustering" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="word list" to_port="through 2"/>
          <connect from_op="Clustering" from_port="cluster model" to_port="through 1"/>
          <connect from_op="Clustering" from_port="clustered set" to_port="example set"/>
          <portSpacing port="source_exampleSet" spacing="0"/>
          <portSpacing port="sink_example set" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
          <portSpacing port="sink_through 2" spacing="0"/>
          <portSpacing port="sink_through 3" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="read_excel" compatibility="9.0.000-BETA4" expanded="true" height="68" name="Read Excel (2)" width="90" x="45" y="187">
        <parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Loop_models\Loop_model.xlsx"/>
        <parameter key="sheet_number" value="2"/>
        <list key="annotations"/>
        <parameter key="date_format" value="MMM d, yyyy h:mm:ss a z"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="Id.true.integer.attribute"/>
          <parameter key="1" value="Author.true.polynominal.attribute"/>
          <parameter key="2" value="Messages.true.polynominal.attribute"/>
        </list>
        <parameter key="read_not_matching_values_as_missings" value="false"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="9.0.000-BETA4" expanded="true" height="82" name="Nominal to Text (2)" width="90" x="179" y="187">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Messages"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data (2)" width="90" x="648" y="187">
        <parameter key="vector_creation" value="Term Occurrences"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="179" y="34"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (2)" width="90" x="313" y="34"/>
          <operator activated="true" class="text:filter_by_length" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (2)" width="90" x="447" y="34"/>
          <operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases (2)" width="90" x="581" y="34"/>
          <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
          <connect from_op="Filter Stopwords (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
          <connect from_op="Filter Tokens (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
          <connect from_op="Transform Cases (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="multiply" compatibility="9.0.000-BETA4" expanded="true" height="103" name="Multiply" width="90" x="648" y="34"/>
      <operator activated="true" class="apply_model" compatibility="9.0.000-BETA4" expanded="true" height="82" name="Apply Model" width="90" x="782" y="85">
        <list key="application_parameters"/>
      </operator>
      <connect from_op="Read Excel" from_port="output" to_op="Extract Macro" to_port="example set"/>
      <connect from_op="Extract Macro" from_port="example set" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Work on Subset" to_port="example set"/>
      <connect from_op="Work on Subset" from_port="example set" to_port="result 1"/>
      <connect from_op="Work on Subset" from_port="through 1" to_op="Multiply" to_port="input"/>
      <connect from_op="Work on Subset" from_port="through 2" to_op="Process Documents from Data (2)" to_port="word list"/>
      <connect from_op="Read Excel (2)" from_port="output" to_op="Nominal to Text (2)" to_port="example set input"/>
      <connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
      <connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Multiply" from_port="output 1" to_port="result 2"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Apply Model" to_port="model"/>
      <connect from_op="Apply Model" from_port="labelled data" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>

I hope It helps,

Regards,

Lionel

NB : I don't know how "rename" the model with the author's name. Basically in my first process, the model are named model_1, model_2, ..., model_N, in the order of authors.

SGolbert · August 2018

Hi Adam,

I will go a bit more to the point:

<?xml version="1.0" encoding="UTF-8"?><process version="8.2.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.2.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="8.2.001" expanded="true" height="68" name="Retrieve Golf" width="90" x="112" y="34">
        <parameter key="repository_entry" value="//Samples/data/Golf"/>
      </operator>
      <operator activated="true" class="concurrency:loop_values" compatibility="8.2.001" expanded="true" height="82" name="Loop Values" width="90" x="447" y="34">
        <parameter key="attribute" value="Outlook"/>
        <process expanded="true">
          <operator activated="true" class="filter_examples" compatibility="8.2.001" expanded="true" height="103" name="Filter Examples" width="90" x="112" y="34">
            <list key="filters_list">
              <parameter key="filters_entry_key" value="Outlook.equals.%{loop_value}"/>
            </list>
          </operator>
          <operator activated="true" class="naive_bayes" compatibility="8.2.001" expanded="true" height="82" name="Naive Bayes" width="90" x="380" y="34"/>
          <connect from_port="input 1" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_op="Naive Bayes" to_port="training set"/>
          <connect from_op="Naive Bayes" from_port="model" to_port="output 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Retrieve Golf" from_port="output" to_op="Loop Values" to_port="input 1"/>
      <connect from_op="Loop Values" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

I would use the Loop Values operator and inside filter the dataset using the macro.
You can use the same macro (probably with some modification) to name the models and store them in the repository. Then you can select the correct one in another processs with the Loop Repository operator and a regular expression.
Coming to the real question, I don't understand why do you need 10 models (what kind of models?). What are you trying to predict? Is the author the label?

Best,

Sebastian

adamf · August 2018

Thanks for the suggestions. I am a bit unclear whether I should use a distinct model for each author or one model for all authors. For the former approach, would a one-class SVM be a good option? Is it supported in RM? Other/better clustering models?

My goal is: given a new message X from author Y, predict whether the message X is really from author Y or is from an imposter pretending to be author Y.

I thought if I can train one model per author, then when I receive a new message/author tuple I would retreive the appropriate model based on author to most accurately predict whether the message is consistent with the other messages by the same author or if it is an outlier.

SGolbert · August 2018

Hi,

I have found some examples of using Autoencoders for anomaly detection in text (or other unstructured data). There are some kernels in Kaggle:

https://www.kaggle.com/imrandude/h2o-autoencoders-and-anomaly-detection-python

I think you can do it in RM with the Keras extension, but I'm not sure. Or maybe with tweaking the Deep Learning operator.

I would definitely like to see your finished process .

Regards,

Sebastian

lionelderkrikor · August 2018

Hi @adamf,

An other ressource :

I think that the Chapter 12 of "Data Mining for the Masses" dedicated to "text mining" can be helpful for you.

I hope it helps,

Regards,

Lionel

Thomas_Ott · August 2018

@adamf you could check out my tutorial on one class svm's and autolabeling a training set here: http://www.neuralmarkettrends.com/use-rapidminer-to-auto-label-twitter-training-set/

lionelderkrikor · August 2018

Hi @Thomas_Ott,

Your link ends to a "404 not found".

I also tried to access directly via inside your blog ==> same result.

Regards,

Lionel

lionelderkrikor · August 2018

Hi again @Thomas_Ott,

OK after a new test, the link works...

Regards,

Lionel

Thomas_Ott · August 2018

@lionelderkrikor yes, I borked something as I was making website updates. Should be all fixed now.

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

Can I create multiple models using an attribute as a loop variable?

Best Answer

Answers