[SOLVED] Really basic question, I think I'm applying models wrong.

johnyma22 · January 2012

My first read database gets all of the values from the documents (20k)

My second read database(1k documents) has a value isGood = 1 if the value is good, -2 if the value is bad and a bunch of other really bad ideas.. I set isGood to label. Should I actually only be passing true/false or is an integer okay?

I use nominal to text to get the "data" field as text.

I then process the document, looking for word frequencies etc.

Is my Naive bayes even in the right place?

My end goal is that I feed it 1000 known good documents and it can find very similar documents from the first read database... I want my confidence score to be based on document similarity.

I am getting an output that contains confidence but I'm not sure how to present my output, I don't come from a statistical background so I'm learning on my feet. I appreciate I have a lot to learn so in 3 weeks time I'm going to read some books/content about how to use rapidminer and ML in general. I can only apologize for my ignorance!

TLDR;
Can I use an integer as a label?
Am I using naive bayes and apply model correctly?
How can I view my data in an easy to interpret way. Ideally something like a list of document IDs with their confidence rating.

Thanks guys!

misanthropic789 · January 2012

If I understand what you are saying, you have your read's backwards. Your 1000 records is your training date set; that is what you know is correct and that you want to use to define the model. You should read that in first and feed that to the naive bayes process. Your full data set is what you want to score/apply the model to, so you should feed that in second and then apply the model to it. That will provide you a score for each record in your full data set that will tell you both whether the naive bayes model thinks your record is good or not and the level of confidence (0-1) that it has in that assessment.

As far as your variables go, I don't think there is a technical reason why you can't use integers, however the spread of your variables is odd. I would use 1 and 0 (1 is good, 0 is not good) if I were using integers. Someone else will need to say whether there needs to be a numeric to nominal process in there on your label. That is how my job is set up.

Regarding output, what you need to do is save the output of the apply model, either to a csv file or to the repository. Then you can extract the fields you need from it (ID and prediction(yes).

BTW, I'm one step less of a newbie than you are, so I hope others will jump in and correct both of us. However I am sure about your read's being backwards so you should start with fixing that.

johnyma22 · January 2012

Now when I attempt to run I get:

The learning scheme naive bayes does not have sufficient capabilities for handling an example set with only one label

But "include special attributes" is ticked, so is keep text and add meta information, any idea what I could be doing wrong?

MariusHelf · January 2012

Hi,

some additions from my side:

- are you sure that your training data contains more than one value for isGood? If it contains only examples of one class, that could cause the error message.

- For Text Processing it is very important to use the same word list for training and application. Thus you have to connect the "wor" output of the Process Documents operator in the training branch to the "wor" input in the application branch. That way it is guaranteed that training and application example sets contain the same word vectors.

- do your integer values in isGood imply an order, or are they actually categories? In the latter case you should convert the label to a nominal value, so Naive Bayes will perform a classification. If it is left to Integer, it will perform a regression.

Best,
Marius

johnyma22 · January 2012

Hey guys, so I made some progress.

I extended my DB structure to support a label field and set any that are known positive matches as true and any known negatives as false.

I use these MySQL select queries:

SELECT label, data, isGood, school_list_holiday_sources.id FROM school_list_holiday_data INNER JOIN school_list_holiday_sources ON school_list_holiday_data.id=school_list_holiday_sources.id WHERE school_list_holiday_sources.label = "true" OR isGood = -2 AND school_list_holiday_sources.label = "false" LIMIT 0,50

This select gets the items with a label true and false. Naive Bayes learnes from these.

SELECT label, data, isGood, school_list_holiday_sources.id FROM school_list_holiday_data INNER JOIN school_list_holiday_sources ON school_list_holiday_data.id=school_list_holiday_sources.id WHERE label != 1 AND isGood = 0 ORDER BY score desc LIMIT 0,10
This select gets all of the items that dont have a true or false label

This select gets all of the items that dont have a true or false label.

My output data doesn't have any confidence rating. Should it?

It looks like this:

Thanks!

PS if someone could add me on skype/other IM service I'd be happy to screen share and work on this in real time?

MariusHelf · January 2012

Hi,

if you applied the classification model: yes, your output should contain predictions and confidences. It would be helpful if you posted your process as XML here, so we can check the setup. You get the XML code via the XML tab at the top of the process view in RapidMiner. Just copy the text from there into your next answer, and please use the #-button on top of the input box for that.

Best,
Marius

johnyma22 · January 2012

Here ya go

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.014">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.014" expanded="true" name="Process">
    <process expanded="true" height="369" width="835">
      <operator activated="true" class="read_database" compatibility="5.1.014" expanded="true" height="60" name="Read Database" width="90" x="45" y="210">
        <parameter key="connection" value="slave2"/>
        <parameter key="query" value="SELECT label, data, isGood, school_list_holiday_sources.id FROM school_list_holiday_data INNER JOIN school_list_holiday_sources ON school_list_holiday_data.id=school_list_holiday_sources.id WHERE label != &quot;true&quot; AND isGood = 0 ORDER BY score desc LIMIT 0,10"/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="read_database" compatibility="5.1.014" expanded="true" height="60" name="Read Database (2)" width="90" x="45" y="75">
        <parameter key="connection" value="slave2"/>
        <parameter key="query" value="SELECT label, data, isGood, school_list_holiday_sources.id FROM school_list_holiday_data INNER JOIN school_list_holiday_sources ON school_list_holiday_data.id=school_list_holiday_sources.id WHERE school_list_holiday_sources.label = &quot;true&quot; OR isGood = -2 AND school_list_holiday_sources.label = &quot;false&quot; LIMIT 0,50"/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.1.014" expanded="true" height="76" name="Nominal to Text" width="90" x="179" y="75">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="data"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.1.004" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="75">
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="999"/>
        <list key="specify_weights"/>
        <process expanded="true" height="480" width="815">
          <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" height="60" name="Extract Content" width="90" x="45" y="120"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="120"/>
          <operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" height="60" name="Tokenize" width="90" x="313" y="120"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="447" y="120"/>
          <operator activated="true" class="text:stem_snowball" compatibility="5.1.004" expanded="true" height="60" name="Stem (Snowball)" width="90" x="581" y="120"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="715" y="120">
            <parameter key="min_chars" value="2"/>
          </operator>
          <connect from_port="document" to_op="Extract Content" to_port="document"/>
          <connect from_op="Extract Content" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
          <connect from_op="Stem (Snowball)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="90"/>
          <portSpacing port="sink_document 1" spacing="90"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.1.014" expanded="true" height="76" name="Nominal to Text (2)" width="90" x="179" y="210">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="data"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.1.004" expanded="true" height="76" name="Process Documents from Data (2)" width="90" x="313" y="210">
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="999"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" name="Extract Content (2)"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" name="Transform Cases (2)"/>
          <operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" name="Tokenize (2)"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" name="Filter Stopwords (2)"/>
          <operator activated="true" class="text:stem_snowball" compatibility="5.1.004" expanded="true" name="Stem (2)"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" name="Filter Tokens (2)">
            <parameter key="min_chars" value="2"/>
          </operator>
          <connect from_port="document" to_op="Extract Content (2)" to_port="document"/>
          <connect from_op="Extract Content (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
          <connect from_op="Transform Cases (2)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
          <connect from_op="Filter Stopwords (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
          <connect from_op="Stem (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
          <connect from_op="Filter Tokens (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.1.014" expanded="true" height="76" name="Set Role (2)" width="90" x="447" y="75">
        <parameter key="name" value="label"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="naive_bayes" compatibility="5.1.014" expanded="true" height="76" name="Naive Bayes" width="90" x="581" y="75">
        <parameter key="laplace_correction" value="false"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="5.1.014" expanded="true" height="76" name="Apply Model (2)" width="90" x="715" y="210">
        <list key="application_parameters"/>
      </operator>
      <connect from_op="Read Database" from_port="output" to_op="Nominal to Text (2)" to_port="example set input"/>
      <connect from_op="Read Database (2)" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Set Role (2)" to_port="example set input"/>
      <connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
      <connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Apply Model (2)" to_port="unlabelled data"/>
      <connect from_op="Set Role (2)" from_port="example set output" to_op="Naive Bayes" to_port="training set"/>
      <connect from_op="Naive Bayes" from_port="model" to_op="Apply Model (2)" to_port="model"/>
      <connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="90"/>
    </process>
  </operator>
</process>

Editable here
http://beta.etherpad.org/p/rapidminer

IngoRM · January 2012

Hi there,

please try this one and let me know if it works:


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.017">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.017" expanded="true" name="Process">
    <process expanded="true" height="369" width="835">
      <operator activated="true" class="read_database" compatibility="5.1.017" expanded="true" height="60" name="Read Database (2)" width="90" x="45" y="30">
        <parameter key="connection" value="slave2"/>
        <parameter key="query" value="SELECT label, data, isGood, school_list_holiday_sources.id FROM school_list_holiday_data INNER JOIN school_list_holiday_sources ON school_list_holiday_data.id=school_list_holiday_sources.id WHERE school_list_holiday_sources.label = &quot;true&quot; OR isGood = -2 AND school_list_holiday_sources.label = &quot;false&quot; LIMIT 0,50"/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.1.017" expanded="true" height="76" name="Nominal to Text" width="90" x="179" y="30">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="data"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.1.004" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="30">
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="999"/>
        <list key="specify_weights"/>
        <process expanded="true" height="480" width="815">
          <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" height="60" name="Extract Content" width="90" x="45" y="120"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="120"/>
          <operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" height="60" name="Tokenize" width="90" x="313" y="120"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="447" y="120"/>
          <operator activated="true" class="text:stem_snowball" compatibility="5.1.004" expanded="true" height="60" name="Stem (Snowball)" width="90" x="581" y="120"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="715" y="120">
            <parameter key="min_chars" value="2"/>
          </operator>
          <connect from_port="document" to_op="Extract Content" to_port="document"/>
          <connect from_op="Extract Content" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
          <connect from_op="Stem (Snowball)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="90"/>
          <portSpacing port="sink_document 1" spacing="90"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.1.017" expanded="true" height="76" name="Set Role (2)" width="90" x="447" y="30">
        <parameter key="name" value="label"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="naive_bayes" compatibility="5.1.017" expanded="true" height="76" name="Naive Bayes" width="90" x="581" y="30">
        <parameter key="laplace_correction" value="false"/>
      </operator>
      <operator activated="true" class="read_database" compatibility="5.1.017" expanded="true" height="60" name="Read Database" width="90" x="45" y="210">
        <parameter key="connection" value="slave2"/>
        <parameter key="query" value="SELECT label, data, isGood, school_list_holiday_sources.id FROM school_list_holiday_data INNER JOIN school_list_holiday_sources ON school_list_holiday_data.id=school_list_holiday_sources.id WHERE label != &quot;true&quot; AND isGood = 0 ORDER BY score desc LIMIT 0,10"/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.1.017" expanded="true" height="76" name="Nominal to Text (2)" width="90" x="179" y="210">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="data"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.1.004" expanded="true" height="76" name="Process Documents from Data (2)" width="90" x="313" y="210">
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="999"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" name="Extract Content (2)"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" name="Transform Cases (2)"/>
          <operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" name="Tokenize (2)"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" name="Filter Stopwords (2)"/>
          <operator activated="true" class="text:stem_snowball" compatibility="5.1.004" expanded="true" name="Stem (2)"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" name="Filter Tokens (2)">
            <parameter key="min_chars" value="2"/>
          </operator>
          <connect from_port="document" to_op="Extract Content (2)" to_port="document"/>
          <connect from_op="Extract Content (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
          <connect from_op="Transform Cases (2)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
          <connect from_op="Filter Stopwords (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
          <connect from_op="Stem (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
          <connect from_op="Filter Tokens (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="apply_model" compatibility="5.1.017" expanded="true" height="76" name="Apply Model (2)" width="90" x="715" y="210">
        <list key="application_parameters"/>
      </operator>
      <connect from_op="Read Database (2)" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Set Role (2)" to_port="example set input"/>
      <connect from_op="Process Documents from Data" from_port="word list" to_op="Process Documents from Data (2)" to_port="word list"/>
      <connect from_op="Set Role (2)" from_port="example set output" to_op="Naive Bayes" to_port="training set"/>
      <connect from_op="Naive Bayes" from_port="model" to_op="Apply Model (2)" to_port="model"/>
      <connect from_op="Read Database" from_port="output" to_op="Nominal to Text (2)" to_port="example set input"/>
      <connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
      <connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Apply Model (2)" to_port="unlabelled data"/>
      <connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="180"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Cheers,
Ingo

johnyma22 · January 2012

Exactly the same output, no confidence etc.

IngoRM · January 2012

Hi,

are you sure that you have pressed the green check icon after inserting the XML (I frequently forget this ;D ). The difference is really small: I just have connected the output port with the word list of the first operator for text processing with the input port for the word list of the second one. This is definitely necessary, since otherwise the resulting example sets would differ and a prediction is not possible then. This should actually also be stated in the log, by the way.

Another thing which cames into my mind is the fact that your query delivers an attribute label, which get the role "label" during training but not during testing. Remove this or also set the role to label before model application. Here is the suggested process:


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.017">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.017" expanded="true" name="Process">
    <process expanded="true" height="369" width="835">
      <operator activated="true" class="read_database" compatibility="5.1.017" expanded="true" height="60" name="Read Database (2)" width="90" x="45" y="30">
        <parameter key="connection" value="slave2"/>
        <parameter key="query" value="SELECT label, data, isGood, school_list_holiday_sources.id FROM school_list_holiday_data INNER JOIN school_list_holiday_sources ON school_list_holiday_data.id=school_list_holiday_sources.id WHERE school_list_holiday_sources.label = &quot;true&quot; OR isGood = -2 AND school_list_holiday_sources.label = &quot;false&quot; LIMIT 0,50"/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.1.017" expanded="true" height="76" name="Nominal to Text" width="90" x="179" y="30">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="data"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.1.004" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="30">
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="999"/>
        <list key="specify_weights"/>
        <process expanded="true" height="480" width="815">
          <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" height="60" name="Extract Content" width="90" x="45" y="120"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="120"/>
          <operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" height="60" name="Tokenize" width="90" x="313" y="120"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="447" y="120"/>
          <operator activated="true" class="text:stem_snowball" compatibility="5.1.004" expanded="true" height="60" name="Stem (Snowball)" width="90" x="581" y="120"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="715" y="120">
            <parameter key="min_chars" value="2"/>
          </operator>
          <connect from_port="document" to_op="Extract Content" to_port="document"/>
          <connect from_op="Extract Content" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
          <connect from_op="Stem (Snowball)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="90"/>
          <portSpacing port="sink_document 1" spacing="90"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.1.017" expanded="true" height="76" name="Set Role (2)" width="90" x="447" y="30">
        <parameter key="name" value="label"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="naive_bayes" compatibility="5.1.017" expanded="true" height="76" name="Naive Bayes" width="90" x="581" y="30">
        <parameter key="laplace_correction" value="false"/>
      </operator>
      <operator activated="true" class="read_database" compatibility="5.1.017" expanded="true" height="60" name="Read Database" width="90" x="45" y="210">
        <parameter key="connection" value="slave2"/>
        <parameter key="query" value="SELECT label, data, isGood, school_list_holiday_sources.id FROM school_list_holiday_data INNER JOIN school_list_holiday_sources ON school_list_holiday_data.id=school_list_holiday_sources.id WHERE label != &quot;true&quot; AND isGood = 0 ORDER BY score desc LIMIT 0,10"/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.1.017" expanded="true" height="76" name="Nominal to Text (2)" width="90" x="179" y="210">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="data"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.1.004" expanded="true" height="76" name="Process Documents from Data (2)" width="90" x="313" y="210">
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="999"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" name="Extract Content (2)"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" name="Transform Cases (2)"/>
          <operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" name="Tokenize (2)"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" name="Filter Stopwords (2)"/>
          <operator activated="true" class="text:stem_snowball" compatibility="5.1.004" expanded="true" name="Stem (2)"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" name="Filter Tokens (2)">
            <parameter key="min_chars" value="2"/>
          </operator>
          <connect from_port="document" to_op="Extract Content (2)" to_port="document"/>
          <connect from_op="Extract Content (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
          <connect from_op="Transform Cases (2)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
          <connect from_op="Filter Stopwords (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
          <connect from_op="Stem (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
          <connect from_op="Filter Tokens (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.1.017" expanded="true" height="76" name="Set Role (3)" width="90" x="447" y="210">
        <parameter key="name" value="label"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="5.1.017" expanded="true" height="76" name="Apply Model (2)" width="90" x="715" y="120">
        <list key="application_parameters"/>
      </operator>
      <connect from_op="Read Database (2)" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Set Role (2)" to_port="example set input"/>
      <connect from_op="Process Documents from Data" from_port="word list" to_op="Process Documents from Data (2)" to_port="word list"/>
      <connect from_op="Set Role (2)" from_port="example set output" to_op="Naive Bayes" to_port="training set"/>
      <connect from_op="Naive Bayes" from_port="model" to_op="Apply Model (2)" to_port="model"/>
      <connect from_op="Read Database" from_port="output" to_op="Nominal to Text (2)" to_port="example set input"/>
      <connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
      <connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Set Role (3)" to_port="example set input"/>
      <connect from_op="Set Role (3)" from_port="example set output" to_op="Apply Model (2)" to_port="unlabelled data"/>
      <connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="90"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

If this things are not the reason, I am afraid I would have to look into the data and the transformed data (i.e. the two example sets which are actually delivered to the learner - do they really contain regular attributes? Are those the same for training and testing?

Cheers,
Ingo

johnyma22 · January 2012

I didn't check the green check, however when I did I still get the same output.

If you want I can share my screen via skype and we can make modifications in real time?

Skype: johny_mac

IngoRM · January 2012

Hi,

I didn't check the green check, however when I did I still get the same output.

Even with the second one process with the additional Set Role operator? Weird. Now indeed we have to inspect the data delivered to the learner and apply model operators (see questions below).

If you want I can share my screen via skype and we can make modifications in real time?

Yeah, sounds like fun but I am out of office and surfing only via my mobile phone. And usually I charge 200 Euro per hour for this type of consulting (but be assured: we have some junior consultants which are less expensive

). But let's face it: although we are usually not working on an per-hour base maybe at some point of time this would indeed be the most time-efficient thing to do if you or others here do not find the reason...

If somebody else has more time and wants to dive deeper into this: the next thing I would check is what is delivered to the learner (see my questions below) and to the operator Apply Model together with the log messages. If the dimension is really high, maybe another learner would also be more appropriate. Just my 2c.

Cheers,
Ingo

johnyma22 · January 2012

I'm happy to paypal some cash over, I would expect it is only a 5 minute job as my task is so simple and I'm probably only missing a checkbox somewhere!

Would anyone be willing to just do it as a side job and not charge the 200 euros per hour but maybe 20 euros for 5 minutes of your time or maybe I can donate some money to charity or to your favorite open source project?

johnyma22 · February 2012

Just a note. I'm still stuck with this. I'm sure the process is working fine but I'm not able to interpret the results correctly

MariusHelf · February 2012

Hi,

as Ingo said above: please check your data, and also your SQL queries. To me it seems a bit odd that you said that you want to use isGood as label, but are fetching a label column from the database. Next, in your screenshot of the data the columns for label and isGood are almost empty. Please check that you are fetching correct data sets by putting a breakpoint on the Read Database operators.

Best,
Marius

johnyma22 · February 2012

As requested I extended my database schema creating a field=label that was "true" / "false"

I actually get this in my results which I think means something is working right:

Can anyone please confirm?

Thanks

MariusHelf · February 2012

At least it does not look wrong... this is the Naive Bayes model created by the Naive Bayes operator. More interesting would be the distribution table of that model (access it via the radio buttons in the results view). But you are probably more interested in the labelled result set. Thus, you have to connect the lab output of Apply Model to the result output. Anyway, the process Ingo posted should work.

If you still don't get valid results, again check the following:

Did you:
- connect the wordlist output of the Process Documents output in the training branch to the input of Process Documents in the Apply branch?
- did you double check that you read correct data from both Read Database operators?
- if you don't use isGood, don't retrieve it from the database.
- find out why the label attribute is empty after Process Documents, and try to fix it. Is already empty directly after the Read Database operators?

Best, Marius

johnyma22 · February 2012

Dist model looks like this:

In answer to your questions:
Yes.
Yes.
Removed isGood

I'm running some more tests now, will reply once they are completed.

Thanks

MariusHelf · February 2012

I would claim that you changed your SQL statement and don't fetch a "data" attribute with the text anymore, but your text attributes are now called "Title" and "Description". Thus, the Nominal to Text operators have to be adapted such that they don't operate on "data", but on the two new attributes. If you have only text attributes and the label, you could use "filter type" all and uncheck "include special attributes".
Didn't you get a warning or error in the "Problems" view at the bottom of RapidMiner saying sth like "The example set must contain at least one text attribute"?

Best, Marius

johnyma22 · February 2012

SQL statements only get Data.

Include special attributes not checked.

Didn't get any warnings..

View:

XML is this:


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.000" expanded="true" name="Process">
    <parameter key="parallelize_main_process" value="true"/>
    <process expanded="true" height="386" width="835">
      <operator activated="true" class="read_database" compatibility="5.2.000" expanded="true" height="60" name="Read Database (2)" width="90" x="45" y="30">
        <parameter key="connection" value="slave2"/>
        <parameter key="query" value="SELECT label, data, school_list_holiday_sources.id FROM school_list_holiday_data INNER JOIN school_list_holiday_sources ON school_list_holiday_data.id=school_list_holiday_sources.id WHERE school_list_holiday_sources.label = &quot;true&quot; OR isGood = -2 AND school_list_holiday_sources.label = &quot;false&quot; LIMIT 0,100"/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.2.000" expanded="true" height="76" name="Nominal to Text" width="90" x="179" y="30">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="data"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.1.004" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="30">
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="999"/>
        <list key="specify_weights"/>
        <process expanded="true" height="480" width="815">
          <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" height="60" name="Extract Content" width="90" x="45" y="120"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="120"/>
          <operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" height="60" name="Tokenize" width="90" x="313" y="120"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="447" y="120"/>
          <operator activated="true" class="text:stem_snowball" compatibility="5.1.004" expanded="true" height="60" name="Stem (Snowball)" width="90" x="581" y="120"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="715" y="120">
            <parameter key="min_chars" value="2"/>
          </operator>
          <connect from_port="document" to_op="Extract Content" to_port="document"/>
          <connect from_op="Extract Content" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
          <connect from_op="Stem (Snowball)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="90"/>
          <portSpacing port="sink_document 1" spacing="90"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="read_database" compatibility="5.2.000" expanded="true" height="60" name="Read Database" width="90" x="45" y="210">
        <parameter key="connection" value="slave2"/>
        <parameter key="query" value="SELECT data, school_list_holiday_sources.id FROM school_list_holiday_data INNER JOIN school_list_holiday_sources ON school_list_holiday_data.id=school_list_holiday_sources.id WHERE label != &quot;true&quot; AND isGood = 0 LIMIT 0,100"/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.2.000" expanded="true" height="76" name="Nominal to Text (2)" width="90" x="179" y="210">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="data"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.1.004" expanded="true" height="76" name="Process Documents from Data (2)" width="90" x="313" y="210">
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="999"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" name="Extract Content (2)"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" name="Transform Cases (2)"/>
          <operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" name="Tokenize (2)"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" name="Filter Stopwords (2)"/>
          <operator activated="true" class="text:stem_snowball" compatibility="5.1.004" expanded="true" name="Stem (2)"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" name="Filter Tokens (2)">
            <parameter key="min_chars" value="2"/>
          </operator>
          <connect from_port="document" to_op="Extract Content (2)" to_port="document"/>
          <connect from_op="Extract Content (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
          <connect from_op="Transform Cases (2)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
          <connect from_op="Filter Stopwords (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
          <connect from_op="Stem (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
          <connect from_op="Filter Tokens (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.2.000" expanded="true" height="76" name="Set Role" width="90" x="447" y="30">
        <parameter key="name" value="label"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="naive_bayes" compatibility="5.2.000" expanded="true" height="76" name="Naive Bayes" width="90" x="581" y="30">
        <parameter key="laplace_correction" value="false"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="5.2.000" expanded="true" height="76" name="Apply Model (2)" width="90" x="715" y="120">
        <list key="application_parameters"/>
      </operator>
      <connect from_op="Read Database (2)" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Process Documents from Data" from_port="word list" to_op="Process Documents from Data (2)" to_port="word list"/>
      <connect from_op="Read Database" from_port="output" to_op="Nominal to Text (2)" to_port="example set input"/>
      <connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
      <connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Apply Model (2)" to_port="unlabelled data"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Naive Bayes" to_port="training set"/>
      <connect from_op="Naive Bayes" from_port="model" to_op="Apply Model (2)" to_port="model"/>
      <connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 1"/>
      <connect from_op="Apply Model (2)" from_port="model" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="90"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

MariusHelf · February 2012

You are using Extract Content, which is for HTML documents only. Does your database contain HTML documents? What happens if you remove those operators?

johnyma22 · February 2012

It contains both HTML documents and Text extracted from PDF documents.

haddock · February 2012

Your current XML does not set the label role on the test set, but it does on the training set.

I refer you to earlier posts in this thread from Ingo and to the help for this operator...

Please pay attention to the fact, that the application of Models will need the same attributes during application on an ExampleSet that where part of the ExampleSet it was trained on. Some minor changes like adding attributes might be possible, but might cause severe calculation errors. Please make sure, that the attributes' number, order, type and role are consistent during training and application.

johnyma22 · February 2012

Okay great I have some results now

Thanks guys. I think it might be healthy for other people if I keep this thread open with my requests for how to interpret the data.

haddock · February 2012

I disagree.

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

[SOLVED] Really basic question, I think I'm applying models wrong.

Answers