"The regression trees returned by the operators W-M5P and W-REPTree"

nicugeorgian Member Posts: 31 Maven
edited May 2019 in Help
Hello,

in the text version of the regression trees returned by W-M5P and W-REPTree: how should one read a tree branch (leaf) of the following form:
 attribute = RU,PK,TW,TR,IT <= 0.5 : : LM5 (798/81.241%) 
Does
 attribute = RU,PK,TW,TR,IT <= 0.5 
mean that attribute is not among the values RU,PK,TW,TR,IT?

LM5 is defined below the tree, and I assume it represents the value predicted (forecasted) for that leaf, correct? 

What do the numbers 798 and 81.241% represent?

It seems to me that, in my example, attribute is treated as numerical although it's categorical (nominal). Is there a way to specify before the regression trees are run?

Many thanks for any idea!

Cheers,
Geo

Answers

  • IngoRM Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    mean that attribute  is not among the values RU,PK,TW,TR,IT?
    Yes. As far as I remember, this is how it is represented. You can see this if you look at the graph view.

    What do the numbers 798 and 81.241% represent?
    The first number is the number of training instances falling into this leaf and the second number is the root mean squared error of the linear model on these training examples divided by the global absolute deviation.

    It seems to me that, in my example, attribute is treated as numerical although it's categorical (nominal). Is there a way to specify before the regression trees are run?
    As far as I know, all nominal attributes are converted internally into binary attributes, which are then handled as numerical (hence the split value 0.5). I don't think you can change this behavior, since it is one of the basic ideas of the M5 algorithm.

    Cheers,
    Ingo
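The reading of the `<= 0.5` condition described above can be illustrated with a small Python sketch. It assumes, as Ingo's answer suggests, that the nominal attribute is turned into a 0/1 indicator for "value is one of the listed categories" and that the tree then splits that indicator at 0.5. The function names are hypothetical and this is not Weka's actual code.

```python
# Hypothetical illustration of the explanation above; not taken from Weka.
SUBSET = {"RU", "PK", "TW", "TR", "IT"}


def indicator(value, subset=SUBSET):
    """Encode 'value is one of the listed categories' as 1.0, otherwise 0.0."""
    return 1.0 if value in subset else 0.0


def takes_le_branch(value, threshold=0.5):
    """True when the printed condition 'attribute = RU,PK,TW,TR,IT <= 0.5' holds."""
    return indicator(value) <= threshold


for country in ["DE", "IT"]:
    # A value outside the subset has indicator 0.0, so it satisfies '<= 0.5'.
    print(country, takes_le_branch(country))
```

Under this assumption, a value such as DE follows the `<= 0.5` branch and a value such as IT follows the `> 0.5` branch, which matches the "not among the values" reading.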
  • nicugeorgian Member Posts: 31 Maven
    Hi Ingo,

    thanks for the explanations.
    mierswa wrote:

    the second number is the root mean squared error of the linear model on these training examples divided by the global absolute deviation.
    What exactly do you mean by "the global absolute deviation"? Do you mean the global average absolute deviation, defined as

    the average of all the absolute differences between every element of the whole sample (not only the instances falling into that leaf) and the mean of the whole sample set?

    Is there a document where I can see the exact definitions of the numbers in the tree's leaves?

    Thanks in advance!

    Cheers,
    Geo
  • IngoRM Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hello,

    the average of all the absolute differences between every element of the whole  sample (not only the instances falling into that leaf) and the mean of the whole  sample set?
    Yes.

    Is there a document where I can see the exact definitions of the numbers in the tree's leaves?
    I took the information from the Weka source code and as far as I know there is no document describing this.

    Cheers,
    Ingo
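The definition confirmed above can be written out as a short Python sketch: the RMSE of the leaf's predictions on the training instances in that leaf, divided by the mean absolute deviation of the whole training sample, shown as a percentage. The function names and the toy numbers are assumptions made for illustration; the thread only describes the statistic in words, and this has not been checked against the Weka source.

```python
from math import sqrt


def mean_absolute_deviation(values):
    """Average absolute difference between each value and the sample mean."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


def leaf_error_percent(leaf_actual, leaf_predicted, all_targets):
    """Leaf RMSE relative to the global mean absolute deviation, in percent."""
    return 100.0 * rmse(leaf_actual, leaf_predicted) / mean_absolute_deviation(all_targets)


# Toy data (made up): targets of the whole training set and of one leaf.
all_targets = [1.0, 2.0, 3.0, 4.0, 10.0]
leaf_actual = [1.0, 2.0, 3.0]
leaf_predicted = [1.5, 2.0, 2.5]  # what the leaf's linear model would predict

print(f"{leaf_error_percent(leaf_actual, leaf_predicted, all_targets):.3f}%")
```

Read this way, a value such as 81.241% would mean that the leaf's linear model has an RMSE of about 0.81 times the global mean absolute deviation of the label.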
  • BAMBAMBAM Member Posts: 20 Maven
    Further questions on "text view" and "graph view" when viewing (tree) models :

    This is the output (RapidMiner version 4.4)

    W-REPTree
    REPTree
    ============
    Intensity < 0.98 : 0.23 (240/0.48) [144/0.49]
    Intensity >= 0.98 : -0.07 (1754/0.47) [853/0.48]
    Size of the tree : 3


    or to simplify, for each leaf we have

    Condition : A (B/C) [D/E]

    I'm guessing that:

    A is the label or predicted class
    B is the number of training samples found at this leaf and used to calculate the statistics
    C is the RMSE (root mean squared error) when 'A' is used as the prediction for the B samples,  divided by the global absolute deviation

    ... but I don't know what D or E are...

    Any help would be appreciated.
    Thanks!
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    I'm sorry, but I'm completely unfamiliar with the Weka learners. Unlike Ingo, I don't even have the source code to take a deeper look. Did you search the Weka mailing list for information about that?

    Greetings,
      Sebastian
  • radone RapidMiner Certified Expert, Member Posts: 74 Guru
    The code is the following in the case of numeric values:
        buffer.append(" : " + Utils.doubleToString(classMean, 2));
        double avgError = 0;
        if (m_Distribution[1] > 0) {
            avgError = m_Distribution[0] / m_Distribution[1];
        }
        buffer.append(" (" + Utils.doubleToString(m_Distribution[1], 2) + "/"
                + Utils.doubleToString(avgError, 2) + ")");
        avgError = 0;
        if (m_HoldOutDist[0] > 0) {
            avgError = m_HoldOutError / m_HoldOutDist[0];
        }
        buffer.append(" [" + Utils.doubleToString(m_HoldOutDist[0], 2) + "/"
                + Utils.doubleToString(avgError, 2) + "]");
    and the following in the case of nominal values:
        return " : " + m_Info.classAttribute().value(maxIndex)
                + " (" + Utils.doubleToString(Utils.sum(m_Distribution), 2)
                + "/"
                + Utils.doubleToString(
                        Utils.sum(m_Distribution) - m_Distribution[maxIndex], 2)
                + ")"
                + " [" + Utils.doubleToString(Utils.sum(m_HoldOutDist), 2)
                + "/"
                + Utils.doubleToString(
                        Utils.sum(m_HoldOutDist) - m_HoldOutDist[maxIndex], 2)
                + "]";
    Honestly, I would be glad if anyone can make sense of this code.
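Going by the numeric-class excerpt above, a leaf line of the form `Condition : A (B/C) [D/E]` prints `classMean` as A, `m_Distribution[1]` as B, `m_Distribution[0] / m_Distribution[1]` as C, `m_HoldOutDist[0]` as D and `m_HoldOutError / m_HoldOutDist[0]` as E. The variable names suggest that the bracketed pair repeats the parenthesised statistics for a hold-out set, presumably the one used for pruning; that reading is an assumption based on the names, not something the excerpt states. The Python sketch below only splits such a line into its parts. The dictionary keys and the function name are hypothetical labels for illustration.

```python
import re

# Matches leaf lines such as: Intensity < 0.98 : 0.23 (240/0.48) [144/0.49]
LEAF_PATTERN = re.compile(
    r"^(?P<condition>.+?) : (?P<prediction>-?\d+(?:\.\d+)?)"
    r" \((?P<train_weight>[\d.]+)/(?P<train_error>[\d.]+)\)"
    r" \[(?P<holdout_weight>[\d.]+)/(?P<holdout_error>[\d.]+)\]$"
)


def parse_leaf(line):
    """Split a numeric-class leaf line into its parts, or return None if it does not match."""
    match = LEAF_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "condition": fields["condition"],                     # split condition
        "prediction": float(fields["prediction"]),            # A: classMean
        "train_weight": float(fields["train_weight"]),        # B: m_Distribution[1]
        "train_error": float(fields["train_error"]),          # C: m_Distribution[0] / m_Distribution[1]
        "holdout_weight": float(fields["holdout_weight"]),    # D: m_HoldOutDist[0]
        "holdout_error": float(fields["holdout_error"]),      # E: m_HoldOutError / m_HoldOutDist[0]
    }


for text in [
    "Intensity < 0.98 : 0.23 (240/0.48) [144/0.49]",
    "Intensity >= 0.98 : -0.07 (1754/0.47) [853/0.48]",
    "Size of the tree : 3",
]:
    print(parse_leaf(text))
```

The last line is not a leaf, so the sketch returns None for it. In the nominal-class excerpt the two pairs are built differently: the first number is the sum of the distribution and the second is that sum minus the entry for the majority class.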