Strange Results: Confidence almost always near 100% on model

kcasey · May 2013

I built a stacking model from records read out of a database and validated it with a 0.7 relative split. The validation output reports an accuracy of 81.73%, with a precision of 91.25% for predicting "Other" and 13.25% for predicting "Success". So far, so good.

To see how this performs, I read the records in, apply the model, then store the output to a database. To my surprise, not a single record has a prediction of "Success". All are predicted to be "Other", and all but 150 of the 79,000 records have confidence(Other)=1. Those 150 have extremely small confidence values for Success (like 7.97298895275624E-13). It's almost as if the model is broken and just dumps out the same answer over and over again.

I do not understand how this can happen. The confusion matrix from the validation output shows that it predicted "Success" 2540 times (and was right in 388 of those cases). But when I apply the model to the data, there isn't a single "Success"! I would expect to see 2540 records labelled as "Success".
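
As a quick arithmetic check on the validation figures, precision for a class is the number of correct predictions of that class divided by the total number of predictions of that class. The sketch below uses only the two counts quoted in the post (2540 predicted, 388 correct); the variable names are illustrative.

```python
# Counts taken from the confusion matrix described in the post.
predicted_success = 2540   # records labelled "Success" during validation
correct_success = 388      # of those, records that really were "Success"

# Precision = correct predictions of a class / all predictions of that class.
precision_success = correct_success / predicted_success
print(f"precision(Success) = {precision_success:.2%}")
```

These two counts give roughly 15.3%, which does not match the 13.25% quoted earlier, so the figures may come from different runs or from different cells of the performance table.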

I have tried selecting smaller subsets of the data and fewer fields. I have tried applying the model to a small handful of records, and parsing XML records or a CSV file of the records instead of reading from the database. Nothing seems to alter the outcome. The model just seems fubarred once it has been stored to the repository.

Here is the code that builds the model


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_database" compatibility="5.3.008" expanded="true" height="60" name="Read Database" width="90" x="45" y="75">
        <parameter key="connection" value="SQL1"/>
        <parameter key="query" value="&#10;select &#10;[CallReceivedID],&#10;[DispositionGroup],&#10;PrimaryAreaCodeExchange, &#10;RemoteName, &#10;RateCenter, &#10;PrimaryZipCode, &#10;Company, &#10;PrimaryAreaCode, &#10;PrimaryState, &#10;DNIS, &#10;WeekDayNumber, &#10;Income, &#10;HourOfDay&#10;&#10;from warehouseanalysis.dbo.DRTVInbound012013Analysis&#10;"/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="delete_repository_entry" compatibility="5.3.008" expanded="true" height="76" name="Delete Repository Entry" width="90" x="179" y="75">
        <parameter key="entry_to_delete" value="pvACD"/>
      </operator>
      <operator activated="true" class="delete_repository_entry" compatibility="5.3.008" expanded="true" height="76" name="Delete Repository Entry (2)" width="90" x="313" y="75">
        <parameter key="entry_to_delete" value="modelACD"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.3.008" expanded="true" height="76" name="Set Role (3)" width="90" x="447" y="75">
        <parameter key="attribute_name" value="DispositionGroup"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles">
          <parameter key="CallReceivedID" value="id"/>
          <parameter key="PrimaryAreaCodeExchange" value="regular"/>
          <parameter key="RemoteName" value="regular"/>
          <parameter key="RateCenter" value="regular"/>
          <parameter key="PrimaryZipCode" value="regular"/>
          <parameter key="Company" value="regular"/>
          <parameter key="PrimaryAreaCode" value="regular"/>
          <parameter key="PrimaryState" value="regular"/>
          <parameter key="DNIS" value="regular"/>
          <parameter key="WeekDayNumber" value="regular"/>
          <parameter key="Income" value="regular"/>
          <parameter key="HourOfDay" value="regular"/>
        </list>
      </operator>
      <operator activated="true" class="split_validation" compatibility="5.3.008" expanded="true" height="112" name="Validation" width="90" x="581" y="75">
        <process expanded="true">
          <operator activated="true" class="stacking" compatibility="5.3.008" expanded="true" height="60" name="Stacking" width="90" x="179" y="75">
            <process expanded="true">
              <operator activated="true" class="k_nn" compatibility="5.3.008" expanded="true" height="76" name="k-NN" width="90" x="112" y="30"/>
              <operator activated="true" class="decision_tree" compatibility="5.3.008" expanded="true" height="76" name="Decision Tree" width="90" x="112" y="120"/>
              <operator activated="true" class="random_forest" compatibility="5.3.008" expanded="true" height="76" name="Random Forest" width="90" x="112" y="210"/>
              <operator activated="false" class="rule_induction" compatibility="5.3.008" expanded="true" height="76" name="Rule Induction" width="90" x="112" y="300"/>
              <operator activated="false" class="nominal_to_numerical" compatibility="5.3.008" expanded="true" height="94" name="Nominal to Numerical" width="90" x="45" y="390">
                <list key="comparison_groups"/>
              </operator>
              <operator activated="false" class="neural_net" compatibility="5.3.008" expanded="true" height="76" name="Neural Net" width="90" x="179" y="390">
                <list key="hidden_layers"/>
              </operator>
              <connect from_port="training set 1" to_op="k-NN" to_port="training set"/>
              <connect from_port="training set 2" to_op="Decision Tree" to_port="training set"/>
              <connect from_port="training set 3" to_op="Random Forest" to_port="training set"/>
              <connect from_op="k-NN" from_port="model" to_port="base model 1"/>
              <connect from_op="Decision Tree" from_port="model" to_port="base model 2"/>
              <connect from_op="Random Forest" from_port="model" to_port="base model 3"/>
              <connect from_op="Nominal to Numerical" from_port="example set output" to_op="Neural Net" to_port="training set"/>
              <portSpacing port="source_training set 1" spacing="0"/>
              <portSpacing port="source_training set 2" spacing="0"/>
              <portSpacing port="source_training set 3" spacing="0"/>
              <portSpacing port="source_training set 4" spacing="0"/>
              <portSpacing port="sink_base model 1" spacing="0"/>
              <portSpacing port="sink_base model 2" spacing="0"/>
              <portSpacing port="sink_base model 3" spacing="0"/>
              <portSpacing port="sink_base model 4" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="true" class="naive_bayes" compatibility="5.3.008" expanded="true" height="76" name="Naive Bayes" width="90" x="179" y="75"/>
              <connect from_port="stacking examples" to_op="Naive Bayes" to_port="training set"/>
              <connect from_op="Naive Bayes" from_port="model" to_port="stacking model"/>
              <portSpacing port="source_stacking examples" spacing="0"/>
              <portSpacing port="sink_stacking model" spacing="0"/>
            </process>
          </operator>
          <connect from_port="training" to_op="Stacking" to_port="training set"/>
          <connect from_op="Stacking" from_port="model" to_port="model"/>
          <portSpacing port="source_training" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="5.3.008" expanded="true" height="76" name="Apply Model" width="90" x="112" y="75">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance" compatibility="5.3.008" expanded="true" height="76" name="Performance" width="90" x="246" y="75"/>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_averagable 1" spacing="0"/>
          <portSpacing port="sink_averagable 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="store" compatibility="5.3.008" expanded="true" height="60" name="Store" width="90" x="849" y="30">
        <parameter key="repository_entry" value="modelACD"/>
      </operator>
      <operator activated="true" class="store" compatibility="5.3.008" expanded="true" height="60" name="Store (2)" width="90" x="849" y="165">
        <parameter key="repository_entry" value="pvACD"/>
      </operator>
      <connect from_op="Read Database" from_port="output" to_op="Delete Repository Entry" to_port="through 1"/>
      <connect from_op="Delete Repository Entry" from_port="through 1" to_op="Delete Repository Entry (2)" to_port="through 1"/>
      <connect from_op="Delete Repository Entry (2)" from_port="through 1" to_op="Set Role (3)" to_port="example set input"/>
      <connect from_op="Set Role (3)" from_port="example set output" to_op="Validation" to_port="training"/>
      <connect from_op="Validation" from_port="model" to_op="Store" to_port="input"/>
      <connect from_op="Validation" from_port="averagable 1" to_op="Store (2)" to_port="input"/>
      <connect from_op="Store" from_port="through" to_port="result 1"/>
      <connect from_op="Store (2)" from_port="through" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

Here is the code that applies the model


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_database" compatibility="5.3.008" expanded="true" height="60" name="Read Database" width="90" x="112" y="120">
        <parameter key="connection" value="SQL1"/>
        <parameter key="query" value="&#10;select &#10;[CallReceivedID],&#10;[DispositionGroup],&#10;PrimaryAreaCodeExchange, &#10;RemoteName, &#10;RateCenter, &#10;PrimaryZipCode, &#10;Company, &#10;PrimaryAreaCode, &#10;PrimaryState, &#10;DNIS, &#10;WeekDayNumber, &#10;Income, &#10;HourOfDay&#10;&#10;from warehouseanalysis.dbo.DRTVInbound012013Analysis"/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.3.008" expanded="true" height="76" name="Set Role (3)" width="90" x="246" y="120">
        <parameter key="attribute_name" value="DispositionGroup"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles">
          <parameter key="CallReceivedID" value="id"/>
          <parameter key="PrimaryAreaCodeExchange" value="regular"/>
          <parameter key="RemoteName" value="regular"/>
          <parameter key="RateCenter" value="regular"/>
          <parameter key="PrimaryZipCode" value="regular"/>
          <parameter key="Company" value="regular"/>
          <parameter key="PrimaryAreaCode" value="regular"/>
          <parameter key="PrimaryState" value="regular"/>
          <parameter key="DNIS" value="regular"/>
          <parameter key="WeekDayNumber" value="regular"/>
          <parameter key="Income" value="regular"/>
          <parameter key="HourOfDay" value="regular"/>
        </list>
      </operator>
      <operator activated="true" class="retrieve" compatibility="5.3.008" expanded="true" height="60" name="Retrieve" width="90" x="246" y="30">
        <parameter key="repository_entry" value="modelACD"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="5.3.008" expanded="true" height="76" name="Apply Model" width="90" x="447" y="75">
        <list key="application_parameters"/>
      </operator>
      <operator activated="true" class="write_database" compatibility="5.3.008" expanded="true" height="60" name="Write Database" width="90" x="715" y="75">
        <parameter key="connection" value="SQL1"/>
        <parameter key="table_name" value="DRTVInbound012013Validation"/>
        <parameter key="overwrite_mode" value="overwrite"/>
        <parameter key="set_default_varchar_length" value="true"/>
        <parameter key="default_varchar_length" value="255"/>
      </operator>
      <connect from_op="Read Database" from_port="output" to_op="Set Role (3)" to_port="example set input"/>
      <connect from_op="Set Role (3)" from_port="example set output" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Retrieve" from_port="output" to_op="Apply Model" to_port="model"/>
      <connect from_op="Apply Model" from_port="labelled data" to_op="Write Database" to_port="input"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
    </process>
  </operator>
</process>

And here is the model



  <?xml version="1.0" encoding="UTF-8" ?> 
  <StackingModel>Stacking Model (prediction model for label DispositionGroup) Stacking Model: Distribution model for label attribute DispositionGroup Class Other (0.909) 14 distributions Class Success (0.091) 14 distributions Base Models: 1-Nearest Neighbour model for classification. The model contains 79985 examples with 11 dimensions of the following classes: Other Success : Other {Other=72717, Success=7268} Model 0: --- : Other {Other=72705, Success=7280} Model 1: --- : Other {Other=72768, Success=7217} Model 2: --- : Other {Other=72669, Success=7316} Model 3: --- : Other {Other=72604, Success=7381} Model 4: --- : Other {Other=72690, Success=7295} Model 5: --- : Other {Other=72791, Success=7194} Model 6: --- : Other {Other=72817, Success=7168} Model 7: --- : Other {Other=72740, Success=7245} Model 8: --- : Other {Other=72801, Success=7184} Model 9: --- : Other {Other=72754, Success=7231}</StackingModel>

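One thing that can be verified directly from the model dump above is that the class priors in the stacking model agree with the class counts reported for the base learners. The following is a minimal sketch that uses only the numbers printed in the dump; the variable names are illustrative.

```python
# Class counts copied from the k-NN line of the model dump.
count_other = 72717
count_success = 7268
total = count_other + count_success  # the dump reports 79985 examples

# Priors as the fraction of training examples in each class.
prior_other = count_other / total
prior_success = count_success / total

print(total)                    # 79985
print(round(prior_other, 3))    # 0.909
print(round(prior_success, 3))  # 0.091
```

The priors match the "Class Other (0.909)" and "Class Success (0.091)" entries, so the stored model does appear to have seen both classes in the expected proportions.
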
I have followed these steps with other datasets and they worked, but the last two datasets produce this kind of oddness. Does anybody have an idea of what I am doing wrong?

*********UPDATE*******
I rewrote the process so that it a) reads from the database, b) builds the model, and then c) applies the model and stores the results to the database WITHOUT writing the model to the repository. Lo and behold, it works: the database has various degrees of confidence in Other and Success, as I would have expected. So the problem lies in writing the model to the repository using the Store operator, or in reading the model from the repository using the Retrieve operator. I am using RM 5.3.008.

*********UPDATE AGAIN*******
I deleted the Store and Retrieve repository operators from the processes, then re-added them. Now it works (sort of). When I run the validation portion, I get varying values for confidence(Other) and confidence(Success), as you might hope. That is, until you change the query used to read from the database. If you use SELECT TOP 1000, everything seems fine. However, if I do a SELECT where CallReceivedID=180106, confidence(Success) now always returns 1. In other words, if you apply the model to more than one row of data, it seems to work. If I apply it to only a single row, I get not only a different answer, but the same answer no matter which row I select.

It's getting stranger: First it calculates 1 for confidence(Other) on all records. Then it seems to work. Now it only works on multiple records and fails on individual records, but always returns 1 for confidence(Success).

I am baffled.

*** UPDATE ****
It appears that the stacking section causes the problem (which is that results produced over a larger dataset, more than 10 records, differ from results on a single record). If I remove the stacking and use just a single model, I get the correct result, whether the model is applied to lots of records or to a single record.
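
The batch-versus-single-row comparison described above can be turned into a small, repeatable check. The following Python sketch is only an illustration: `find_inconsistent_rows`, `row_wise`, and `batch_dependent` are hypothetical names, the two predictors are toy stand-ins rather than anything exported from RapidMiner, and nothing here is a claim about what the Stacking operator does internally.

```python
from typing import Callable, List, Sequence

Row = Sequence[float]
Predictor = Callable[[List[Row]], List[float]]


def find_inconsistent_rows(predict: Predictor, rows: List[Row],
                           tol: float = 1e-9) -> List[int]:
    """Return indices whose score differs between batch and single-row scoring."""
    batch_scores = predict(rows)
    inconsistent = []
    for i, row in enumerate(rows):
        single_score = predict([row])[0]  # score the row on its own
        if abs(single_score - batch_scores[i]) > tol:
            inconsistent.append(i)
    return inconsistent


def row_wise(rows: List[Row]) -> List[float]:
    # Hypothetical predictor: each score depends only on its own row.
    return [min(1.0, sum(r) / 10.0) for r in rows]


def batch_dependent(rows: List[Row]) -> List[float]:
    # Hypothetical predictor: scores are scaled by the largest row in the batch,
    # so a row scored alone is always scaled by itself.
    peak = max(sum(r) for r in rows)
    return [sum(r) / peak for r in rows]


rows = [[1.0, 2.0], [3.0, 4.0], [0.5, 0.5]]
print(find_inconsistent_rows(row_wise, rows))         # []
print(find_inconsistent_rows(batch_dependent, rows))  # [0, 2]
```

A model whose output for a row depends on which other rows are in the batch will show up in the second list; a model that scores rows independently will not.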

Does anyone out there know what's going on?
