Getting different results when loading a process vs coding it

behrangsa · September 2008

I want to create the example text clustering process using the Java APIs. Here's a copy of the original process that comes with the Examples bundle:


<operator name="Root" class="Process" expanded="yes">
    <description text="#ylt#h3#ygt#Clustering text documents#ylt#/h3#ygt##ylt#p#ygt#In this experiment, texts from two newsgroups are read and clustered. To make the clusters better comprehensible, three keywords are extracted for each cluster and added to the cluster description.#ylt#/p#ygt#"/>
    <parameter key="logverbosity"	value="status"/>
    <operator name="TextInput" class="TextInput" expanded="yes">
        <parameter key="default_content_language"	value="english"/>
        <list key="namespaces">
        </list>
        <parameter key="prune_above"	value="10"/>
        <parameter key="prune_below"	value="5"/>
        <list key="texts">
          <parameter key="graphics"	value="../data/newsgroup/graphics"/>
          <parameter key="hardware"	value="../data/newsgroup/hardware"/>
        </list>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
        <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
        </operator>
        <operator name="TokenLengthFilter" class="TokenLengthFilter">
            <parameter key="min_chars"	value="5"/>
        </operator>
        <operator name="PorterStemmer" class="PorterStemmer">
        </operator>
    </operator>
    <operator name="KMeans" class="KMeans">
    </operator>
    <operator name="AttributeSumClusterCharacterizer" class="AttributeSumClusterCharacterizer">
    </operator>
</operator>

When I run it with the following code:


        System.setProperty("rapidminer.home", "C:\\Java\\RapidMiner-4.2");
        RapidMiner.init();
        Process p = new Process(theProcessFile);
        p.run();

The result is:


IOContainer (2 objects):
A cluster model with the following properties:

Cluster 0 [characterization: graphic buffer model]: 11 items
Cluster 1 [characterization: appl memori crabappl]: 9 items
Total number of items: 20

If I run the process multiple times, I get the same result. So I assume that the initial centroids are not selected randomly and the outcome is always the same.

Now I want to create this process using the Java API. Here's my code:


        System.setProperty("rapidminer.home", "C:\\Java\\RapidMiner-4.2");
        RapidMiner.init();
        
        Process p = new Process();
        
        OperatorChain textInput = (OperatorChain) OperatorService.createOperator("TextInput");
        textInput.setParameter(PARAMETER_DEFAULT_CONTENT_LANGUAGE, "english");
        textInput.setParameter(PARAMETER_PRUNE_ABOVE, "15");
        textInput.setParameter(PARAMETER_PRUNE_BELOW, "5");
        
        
        List<Object[]> textList = new LinkedList<Object[]>();
        textList.add(new Object[] {"graphics","newsgroup/graphics"});
        textList.add(new Object[] {"hardware","newsgroup/hardware"});        
        textInput.setListParameter("texts", textList);
        textInput.addOperator(OperatorService.createOperator("StringTokenizer"));
        textInput.addOperator(OperatorService.createOperator("EnglishStopwordFilter"));
        
        Operator tlfOperator = OperatorService.createOperator("TokenLengthFilter");
        tlfOperator.setParameter("min_chars", "5");
        textInput.addOperator(tlfOperator);
        textInput.addOperator(OperatorService.createOperator("PorterStemmer"));
        
        p.getRootOperator().addOperator(textInput);
        p.getRootOperator().addOperator(OperatorService.createOperator("KMeans"));
        p.getRootOperator().addOperator(OperatorService.createOperator("AttributeSumClusterCharacterizer"));
        
        System.out.println(p.getRootOperator().createProcessTree(1));
        
        p.save(new File("Process.xml"));
        
        p.run();

When I save the process to a file, it looks identical to the original process that comes with the examples bundle; the only difference is that it is wrapped inside a <process> element:


<?xml version="1.0" encoding="windows-1252"?>
<process version="4.2">

  <operator name="Root" class="Process" expanded="yes">
      <operator name="TextInput" class="TextInput" expanded="yes">
          <parameter key="default_content_language"	value="english"/>
          <parameter key="prune_above"	value="15"/>
          <parameter key="prune_below"	value="5"/>
          <list key="texts">
            <parameter key="graphics"	value="newsgroup/graphics"/>
            <parameter key="hardware"	value="newsgroup/hardware"/>
          </list>
          <operator name="StringTokenizer" class="StringTokenizer">
          </operator>
          <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
          </operator>
          <operator name="TokenLengthFilter" class="TokenLengthFilter">
              <parameter key="min_chars"	value="5"/>
          </operator>
          <operator name="PorterStemmer" class="PorterStemmer">
          </operator>
      </operator>
      <operator name="KMeans" class="KMeans">
      </operator>
      <operator name="AttributeSumClusterCharacterizer" class="AttributeSumClusterCharacterizer">
      </operator>
  </operator>

</process>

However, the result of running this process differs from that of the original process:


IOContainer (2 objects):
A cluster model with the following properties:

Cluster 0 [characterization: graphic buffer memori]: 12 items
Cluster 1 [characterization: appl state problem]: 8 items
Total number of items: 20

Any ideas what is causing this?

Thanks in advance,
Behi

IngoRM · September 2008

Hi,

behrangsa wrote: "If I run the process multiple times, I get the same result. So I assume that the initial centroids are not selected randomly and the outcome is always the same."

Yes, the results are always the same, but not because the centroids are chosen non-randomly. They actually are chosen randomly. RM makes repeated runs of a process reproducible by ensuring that the sequence of random numbers used is always the same for a specific process. By the way, you can change this behaviour by setting the random seed parameter of the root operator to -1.
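
To illustrate why a fixed random sequence gives repeatable "random" centroids, here is a minimal sketch in plain Java. It is not RapidMiner's implementation: the class SeededCentroids, the method pickInitialCentroids, and the seed value 2001 are assumptions chosen only for illustration. The sketch picks k distinct item indices with a seeded java.util.Random and shows that two runs with the same seed pick the same items.

```java
import java.util.Arrays;
import java.util.Random;

class SeededCentroids {

    // Picks k distinct item indices (hypothetical "initial centroids")
    // using a random generator initialised with the given seed.
    static int[] pickInitialCentroids(int numItems, int k, long seed) {
        Random rng = new Random(seed);
        int[] indices = new int[numItems];
        for (int i = 0; i < numItems; i++) {
            indices[i] = i;
        }
        // Partial Fisher-Yates shuffle: only the first k positions are needed.
        for (int i = 0; i < k; i++) {
            int j = i + rng.nextInt(numItems - i);
            int tmp = indices[i];
            indices[i] = indices[j];
            indices[j] = tmp;
        }
        return Arrays.copyOf(indices, k);
    }

    public static void main(String[] args) {
        // 20 items and 2 clusters, as in the thread; the seed is arbitrary.
        int[] firstRun = pickInitialCentroids(20, 2, 2001L);
        int[] secondRun = pickInitialCentroids(20, 2, 2001L);
        // Same seed, same sequence of random numbers, same picks.
        System.out.println(Arrays.equals(firstRun, secondRun)); // prints "true"
    }
}
```

With a different seed on each run (for example one derived from the system clock), the two runs would generally pick different items, which is the kind of behaviour the -1 setting is described as enabling.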

The reason for the difference could be the value of "prune_above". It's 10 in the original process and 15 in yours.
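
Differences like this are easy to miss when comparing two process files by eye. A rough sketch of a parameter diff, using only the XML parser that ships with Java, might look like the following. The class ProcessParameterDiff and its methods are hypothetical helpers, not part of RapidMiner, and the two inline XML strings are trimmed-down stand-ins for the two listings in this thread.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ProcessParameterDiff {

    // Maps "OperatorName/[listKey/]parameterKey" to the parameter value.
    static Map<String, String> collectParameters(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse process XML", e);
        }
        Map<String, String> result = new TreeMap<>();
        NodeList params = doc.getElementsByTagName("parameter");
        for (int i = 0; i < params.getLength(); i++) {
            Element param = (Element) params.item(i);
            StringBuilder path = new StringBuilder(param.getAttribute("key"));
            // Walk up to the enclosing <operator>, prefixing any <list> key on the way.
            for (Node n = param.getParentNode(); n instanceof Element; n = n.getParentNode()) {
                Element e = (Element) n;
                if (e.getTagName().equals("list")) {
                    path.insert(0, e.getAttribute("key") + "/");
                } else if (e.getTagName().equals("operator")) {
                    path.insert(0, e.getAttribute("name") + "/");
                    break;
                }
            }
            result.put(path.toString(), param.getAttribute("value"));
        }
        return result;
    }

    // Returns one entry per parameter whose value differs or exists on one side only.
    static Map<String, String> diff(Map<String, String> left, Map<String, String> right) {
        Map<String, String> out = new TreeMap<>();
        Set<String> keys = new TreeSet<>(left.keySet());
        keys.addAll(right.keySet());
        for (String key : keys) {
            String l = left.getOrDefault(key, "(absent)");
            String r = right.getOrDefault(key, "(absent)");
            if (!l.equals(r)) {
                out.put(key, l + " -> " + r);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Shortened versions of the shipped process and the coded one.
        String original = "<operator name='Root' class='Process'>"
                + "<parameter key='logverbosity' value='status'/>"
                + "<operator name='TextInput' class='TextInput'>"
                + "<parameter key='prune_above' value='10'/>"
                + "<parameter key='prune_below' value='5'/>"
                + "</operator></operator>";
        String coded = "<process version='4.2'>"
                + "<operator name='Root' class='Process'>"
                + "<operator name='TextInput' class='TextInput'>"
                + "<parameter key='prune_above' value='15'/>"
                + "<parameter key='prune_below' value='5'/>"
                + "</operator></operator></process>";
        diff(collectParameters(original), collectParameters(coded))
                .forEach((key, change) -> System.out.println(key + ": " + change));
        // prints:
        // Root/logverbosity: status -> (absent)
        // TextInput/prune_above: 10 -> 15
    }
}
```

Run against the full listings posted above, a diff of this kind would also flag the entries of the "texts" list, since the original uses "../data/newsgroup/..." paths and the coded process uses "newsgroup/...".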

Cheers,
Ingo
