Binary text classification - help with our process needed.
Hey guys,
We want to do binary classification on a text data set with a class distribution of 80% negative and 20% positive. To get statistically meaningful results, we want to evaluate with 10-fold cross-validation.
If we model this within RapidMiner, we are unsuccessful, since it doesn't output any statistical metrics (like precision, recall, etc.).
We found a workaround that works, but it doesn't make sense from an ML perspective: if we first split into training and test sets and then use 10-fold cross-validation, it works. But the train/test split should be part of the cross-validation itself (9 training folds, 1 test fold, 10 iterations). So right now the only way to get this working is to FIRST split into training and test and THEN use X-Validation. Did we model it the right way, or did we miss anything?
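For reference, this is roughly the evaluation we are expecting, sketched in Python/scikit-learn terms (the data and model here are just placeholders, not our actual RapidMiner process):

```python
# Sketch of the 10-fold cross-validation we have in mind; the data and
# model are placeholders, not our actual setup.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced toy data: roughly 80% negative, 20% positive.
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42)

# Each of the 10 iterations trains on 9 folds and tests on the remaining fold,
# so no separate train/test split is needed beforehand.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1")
print(scores.mean())
```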
If you need any more information for helping us, just comment.
Thank you very much in advance.
Best regards!
Answers
Ok, silly question, but did you set a label role in your data set?
This sounds like a strange problem, but it's very hard to troubleshoot from a screenshot of a process--can you post the process itself for review? You can export it from the file menu and attach it as a file.
Thanks,
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Hey T-Bone,
yes, I set a label role.
Regards,
Hey Brian,
thank you for your answer.
Here is the process that gives me results but makes no sense:
It would be great if you could help me. If you need any more information, I am happy to provide it.
Best regards,
Thiemo
I would double-check your process; something doesn't appear to be correct, because I can easily extract precision/recall and a confusion matrix.
See the sample XML below. This process takes Tweets, does a bit of processing up front, and generates a random label. The Process Documents from Data operator then processes them to TF-IDF (you can select Binary Occurrences) and spits out the confusion matrix.
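If it helps to see the same idea outside RapidMiner, here is a rough Python/scikit-learn sketch of that kind of pipeline (TF-IDF features, a classifier, 10-fold cross-validation, confusion matrix); the texts and labels are made up purely for illustration:

```python
# Rough scikit-learn equivalent of the described pipeline: TF-IDF features,
# a classifier, 10-fold cross-validation, and a confusion matrix.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

texts = ["great product", "terrible service", "not bad at all", "awful experience"] * 25
labels = [1, 0, 1, 0] * 25  # made-up binary labels, for illustration only

# binary=True roughly corresponds to binary term occurrences; drop it for plain TF-IDF.
pipeline = make_pipeline(TfidfVectorizer(binary=True), LogisticRegression(max_iter=1000))

# Cross-validated predictions, then the confusion matrix over all folds.
predictions = cross_val_predict(pipeline, texts, labels, cv=10)
print(confusion_matrix(labels, predictions))
```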
Hi @thiemo,
I took your original process and modified it only by inputting a simple toy example set using the identical Excel format (since I don't have your original dataset). Then I removed your outer split validation and ran it again using only the cross-validation that you had as an inner operator. And it works fine! Here's the modified process. So if you are having problems, I suspect it must be something strange related to your original dataset. There's nothing that appears to be wrong with the process or with the cross-validation operator. Sorry I couldn't be more definitive.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
And here's the Excel file I used as input in case you are interested.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Hey Brian,
thank you very much for your solution. I downloaded the process and the Excel file and tried it, and it works perfectly, but I do not get the performance parameters such as accuracy, recall, precision, and AUC.
How can I use this process and receive those 4 parameters?
Regards,
Thiemo
Hi @thiemo,
I'm not sure what you mean--those performance metrics are all available in the performance tab output from the process when it runs. See the attached screenshot. This is part of the output for the process I supplied with no changes. Of course, the values are useless with my test examples since there are only 10 of them, but you can see that AUC, accuracy, precision, and recall are all available. If you run it on a larger dataset then they should all be there.
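If you ever want to sanity-check those numbers outside RapidMiner, the same four values can be pulled from a cross-validation in Python along these lines (toy data, purely for illustration):

```python
# Accuracy, AUC, precision, and recall from a 10-fold cross-validation
# (toy data for illustration only).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)

results = cross_validate(
    LogisticRegression(max_iter=1000), X, y, cv=10,
    scoring=["accuracy", "roc_auc", "precision", "recall"],
)
for metric in ("accuracy", "roc_auc", "precision", "recall"):
    print(metric, results["test_" + metric].mean())
```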
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Hi Brian,
thanks again for the quick answer.
However, if I take the process you uploaded and use your Excel file, I get a result but not the statistical parameters such as precision and recall.
Did you do anything special while importing the data? I just set the type of the relevant data to binominal. What can I do to get the precision and recall for the data?
Thank you and best regards,
Thiemo
What you see in the Statistics tab is just some basic descriptive statistics of your data set; there will be no precision/recall or confusion matrix, because you haven't done any modeling yet. This view is similar to the summary or head commands in Python/R.
You need to attach a Cross Validation operator with a machine learning algorithm embedded inside, plus a Performance operator, to generate the precision/recall values and the confusion matrix.
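In Python terms, the difference looks roughly like this (toy data, just to show the distinction):

```python
# The Statistics tab is like basic descriptive statistics: no model involved.
import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)
df = pd.DataFrame(X).assign(label=y)

# Analogue of the Statistics tab / summary() in R: counts, means, quartiles, ...
print(df.describe())

# Nothing here says anything about precision or recall; those only appear after
# a learner has been trained and evaluated, e.g. inside a cross-validation.
```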
Hi T-Bone,
thank you for the answer.
Exactly this was my initial problem. If I add another Cross Validation operator with a Performance operator around the actual process, then it makes no sense anymore, right?
Regards,
Thiemo
From that point in your process (where you show the Statistics tab), connect a Cross Validation operator (insert your algorithm on the Training side, and an Apply Model and a Performance operator on the Testing side), THEN connect the "per" port on the Cross Validation operator to the results port. This will output the precision/recall etc. for you.