"Decision Stump Split"
... this seems to be too simple to raise a question. Still, I'm puzzled.
Very simple dataset, and a very simple process with just a Decision Stump:
"t_90005";"Label"
-16.3319;"0"
0.0;"0"
0.0;"0"
0.0;"0"
0.0;"0"
0.0;"0"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
"t_90005";"Label"... but why doesn't it split out the one example with a separate value (-16.33)? ???
-16.3319;"0"
0.0;"0"
0.0;"0"
0.0;"0"
0.0;"0"
0.0;"0"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
0.0;"1"
Thanks for your help!
Stefan
Answers
But apart from that: why shouldn't it? The stump is configured to do whatever it can, without a minimal gain requirement.
There is one example of class 0 which has an attribute value clearly different from zero. All the other examples - including all of class 1 - have attribute value zero. So I can successfully split out at least one example with a cut at, let's say, -8 (or at any other value between -16 and 0, that is).
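To make this concrete, here is a small Python sketch (my own illustration, not RapidMiner code; only the data shape is taken from the dataset above) computing the information gain of a cut at -8:

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum(labels.count(c) / n * math.log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(values, labels, cut):
    """Information gain of splitting at `cut` (value <= cut goes left)."""
    left = [l for v, l in zip(values, labels) if v <= cut]
    right = [l for v, l in zip(values, labels) if v > cut]
    n = len(labels)
    after = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - after

# Shape of the data set above: one class-0 example at -16.3319,
# five more class-0 examples at 0.0, and 45 class-1 examples at 0.0.
values = [-16.3319] + [0.0] * 50
labels = ["0"] * 6 + ["1"] * 45

print(info_gain(values, labels, -8.0) > 0)  # True: the cut at -8 has positive gain
```

Any cut strictly between -16.33 and 0 produces the same partition and therefore the same positive gain, so a stump without a minimal-gain requirement should take it.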
If - in the example set - the numerical attribute is replaced with a polynomial one, the thing works as expected.
Hence, on second thinking:
- Decision Stump claims to support numerical attributes
- but it clearly has a discretization algorithm that is both undocumented and evidently flawed
Of course, this is not the full example set or process - but it more and more makes me believe that data mining, at least with RapidMiner, is more black magic than science. At every corner one has to trust undocumented operator behavior much like one has to trust the oracle of Delphi.
How can I ever trust a tool if it claims to handle something - but when looking close, evidently it does not?
Here is the code:
A bit frustrated - mostly at the quality of documentation. Maybe it's the price one pays for ???
Stefan
I'm with you - if it doesn't make sense, figure it out! My two cents says that it has to do with parsing numbers, because if I disable that bit of the CSV reader, and if I trash the Numerical to Binominal operator, then I get something more sensible, as below. The good news is that the Code Police already have this reprobate in custody, and new incarnations of CSV readers will do a less eccentric job of parsing those naughty numbers! This produces the following on your data for me.
I see your answer but I don't quite believe it. If I look at the example set after ReadCSV, in your case the variable type is "nominal" - so -16.33 is treated much like "category_A". So the learner didn't have to discretize based on a numerical threshold value.
So, I still guess the ball is back in the court of the learner. (At some point I'll have a look at the source... )
Hi Wessel,
yes - I wouldn't have a problem with that kind of interpretation: each number gives the same gain.
Stefan
I still think it is about the data or the setup, rather than the inner workings of the decision stump operator, because the following code does exactly what you'd expect.
thanks for your help! ... but I still disagree (sorry for the insistence). I changed my dataset as below, in two steps:
- Step 1: change -16.33 to -16; reason was that the csv reader was convinced that -16.33 was an integer - so I gave up on that front. I also converted all "0.0" to "0" to have something more integerish ... Still no split was performed.
- Step 2: I converted one value of 'class 1' to -1 (before all were 0); this created a nice split:
So, I don't think a random dataset helps here - this is too data dependent in a random fashion >:(
Here is the dataset after Step 2:
thanks - on the example set here, Weka works; but on the 'real-life example' I have the same problem: in some cases where a split is obvious, it doesn't do it. Since the Weka stump has no parameters, I give it the benefit of the doubt and assume that it stops splitting because of a too small gain, but I need to trace this further (I deliberately want to accept small gains).
Stefan
Like a pseudo code on how decision stumps are supposed to work?
Machine learning: Mitchell
3.4.1.2 INFORMATION GAIN MEASURES THE EXPECTED REDUCTION IN ENTROPY
3.7 ISSUES IN DECISION TREE LEARNING
Practical issues in learning decision trees include determining how deeply to grow
the decision tree, handling continuous attributes, choosing an appropriate attribute
selection measure, handling training data with missing attribute values, handling
attributes with differing costs, and improving computational efficiency. Below
we discuss each of these issues and extensions to the basic ID3 algorithm that
address them.
It can be shown that the value of c that maximizes information gain
must always lie at such a boundary (Fayyad 1991).
Utgoff and Brodley (1991) and Murthy et al.
( 1994) discuss approaches that define features by thresholding linear combinations
of several continuous-valued attributes.
thanks - and sorry for the long delay in replying: I'm currently travelling and have little access to the internet. The benefit of that is that it gives me some time for reading too ... Hence, I'm now convinced that all I want is a properly working Gini index.
Meanwhile I had a look at the code of DecisionStump (not of Weka - I didn't trace that route any further, as I didn't yet figure out how to get that source). Eventually, it calls NumericalSplitter.getBestSplit, which is buggy: in the above code, I commented out the offending two "if"s, which fixes the problem (although I don't have evidence for a case of a criterion which doesn't support incremental calculation).
It's clear what happens in the data set provided at the top of this thread: it is sorted according to the attribute, but where the gain increases, the labels of the previous and new examples are identical. Later, when the labels differ, the values are the same... hence, in neither case does it update the split value.
Of course the embedded if (on identical attribute values) needs to be there...
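For illustration, here is a hypothetical Python reconstruction of the loop described above (function and variable names are my own invention - this is not the actual NumericalSplitter source). The commented-out condition corresponds to the offending "if" on equal labels:

```python
def best_split(pairs, gain_of):
    """pairs: (value, label) tuples, sorted by value.
    gain_of(cut): evaluates the splitting criterion for a cut point."""
    best_cut, best_gain = None, 0.0
    for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue  # this check must stay: no boundary between equal values
        # The buggy variant additionally skipped boundaries with equal labels:
        # if l1 == l2: continue
        # On the data above, every value change coincides with equal labels,
        # so the buggy variant never proposes any cut at all.
        cut = (v1 + v2) / 2.0
        g = gain_of(cut)
        if g > best_gain:
            best_cut, best_gain = cut, g
    return best_cut, best_gain
```

With the equal-label skip in place, a dataset like the one at the top of the thread yields no candidate cut whatsoever, which matches the observed "no split" behavior.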
Other interesting side effects can be demonstrated with the data set below, which creates a split at 0.5.
I raised a bug (#401) in the bug tracker.
Additionally, I observed that DecisionStump doesn't observe Minimal Size for Split, Minimal Leaf Size, or Minimal Gain, and that DecisionTree doesn't observe Minimal Gain. This is reported as bug #402.
Thanks for all your help!
Stefan
I'm not sure how to change Java code, compile it, and compare results.
check out here. You need to install Eclipse first, then follow the steps on the link - pretty straightforward. You can then go down the tree on the left side to find NumericalSplitter.java, edit, and run again ...
Stefan
this thread has fallen asleep since the bug was assigned to the bug tracker, but perhaps there's somebody out there who still reads this.
I see there's a problem with the split point, but I don't agree that you need to comment one of the if conditions out. Since:
1. If the labels are the same, this cannot be the optimal splitting point; taking the next one into account will increase the purity of the example set and hence the splitting criterion.
2. If the values are the same, you can't split between them.
So I guess the real problem is that it does not test the LAST example according to the sorting. If you entered a 17 into your data set, the split point would be made before the 16.
Greetings,
Sebastian
I corrected this behavior to check the last example. Would you please check out the latest version and try whether it performs as you expected?
Greetings,
Sebastian
- the last two postings of Sebastian escaped me, as I had a working solution in 5.0 with my patch proposed above
- unfortunately, Sebastian's remarks were not added to the bug tracker, hence I was not notified to look at them
- even more unfortunately, the "fix" in 5.1 not only fails to fix the problem, but also introduces new ones.
See bug 952 for details. This is a desolate story!
Stefan
Entropy before split:
-3/5 * log2(3/5) -2/5 * log2(2/5) = 0.971
Label Attribute1 // optimal split value is anywhere between 0 and 16?
Entropy after split = -1/1 * log2(1/1) -0/1 * log2(0/1) + -2/4 * log2(2/4) -2/4 * log2(2/4) = 1
0 16
0 0
0 0
1 0
1 0
So actually splitting here on Attribute1 leads to a negative Information Gain, meaning you don't want to split!
Label Attribute1 // optimal split value is anywhere between 0 and 16?
Entropy after split = -2/2 * log2(2/2) -0/2 * log2(0/2) + -1/3 * log2(1/3) -2/3 * log2(2/3) = 0.918
0 16
0 16
0 0
1 0
1 0
Splitting here gives information gain 0.971 - 0.918 = 0.053
So, even though these examples look simple they are not.
Such a small information gain is never worthwhile, so asking where the optimal split value is seems a weird question.
thanks for your comments. I assume you intend to use "information_gain" in your calculations below, as this is not explicitly stated?
If so, I agree on the entropy before the split. But for the entropy after the split you need to weight with the partition sizes. So you end up with (first example, with one value 16):
Entropy after split = -1/5 * [1/1 * log2(1/1) + 0/1 * log2(0/1)] + -4/5 * [2/4 * log2(2/4) + 2/4 * log2(2/4)] = 0.8
... so with a gain of 0.171.
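The arithmetic is easy to double-check; a quick Python sketch (my own notation, reproducing the numbers above):

```python
import math

def H(*counts):
    """Entropy of a class-count distribution, in bits."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

before = H(3, 2)                       # three class-0, two class-1 examples
after = 1/5 * H(1, 0) + 4/5 * H(2, 2)  # weighted: partition {16} vs. {0, 0, 0, 0}
print(round(before, 3))          # 0.971
print(round(after, 3))           # 0.8
print(round(before - after, 3))  # 0.171
```

The pure left partition contributes zero entropy, so the weighted entropy drops from 0.971 to 0.8 and the gain is 0.171 - positive, never negative, as the unweighted calculation suggested.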
(I stepped through the example with the debugger, and these are also exactly the values used internally.)
The gain would obviously be the same wherever you split. Splitting in the middle (what the code does) is likely the reasonable assumption. (I could imagine that domain knowledge implies that raw data is not linear but e.g. logarithmic... It would then be a pre-processing task to transform the data before giving it to the splitter.)
I find this a weird remark. The code is supposed to exactly implement an algorithm described by formulas. As long as we don't discuss rounding issues of floats, however small the gain is, it is a gain and has to be considered.
Note that the minimal gain is a property which can be controlled in the Decision Tree operator but for whatever reason is removed from Decision Stump. (One would imagine that a tree is built of stumps, but if I understand the code hierarchy correctly, it's the other way around and the stump is a degenerate tree ...)
I do agree though that my application is a bit off the beaten track for decision trees. I'm working on outlier identification, which may explain why I come across below example sets ...
Stefan
Actually I used for the calculation: http://en.wikipedia.org/wiki/Information_gain_in_decision_trees
And yes, I made a mistake, I forgot to multiply with the priors.
So, yes, -1/5 * (1/1 * log2(1/1) + 0/1 * 0) + -4/5 * (2/4 * log2(2/4) + 2/4 * log2(2/4)) gives entropy 0.8 after the split, i.e. a gain of 0.171.
You say:
"It would have been quite easy to disproof the 5.1.0 solution by changing the 401 example set to something like <insert example here>".
I don't follow this reasoning.
I don't see why these examples are good at all.
3.7.2 Incorporating Continuous-Valued Attributes
Page 72 Machine Learning Tom M. Mitchell
"What threshold-based boolean attribute should be defined based on Temperature?
Clearly, we would like to pick a threshold, c, that produces the greatest
information gain. By sorting the examples according to the continuous attribute
A, then identifying adjacent examples that differ in their target classification, we
can generate a set of candidate thresholds midway between the corresponding
values of A. It can be shown that the value of c that maximizes information gain
must always lie at such a boundary (Fayyad 1991). These candidate thresholds
can then be evaluated by computing the information gain associated with each."
... let me see. If we apply the weighting with partition size to the second example (2x value 16), we end up with a gain of about 0.42 (hope I got this correct ??? ). So, we have just shown that this is a good split and creates gain according to the formulas involved.
Sebastian argued above that the problem in the code was related to handling the last example - and implemented a 'fix' for that. Whereas I keep creating examples showing that the issue is with no boundary existing where both the value and the label changes.
Mitchell (quoting Fayyad) states differently. Well, either there are some assumptions which we haven't available here, or something is just plain wrong in that text. After all, the best mathematical proof is falsified with a simple counter example.... In fact, it's quite easy to construct such examples:
Label <--positive--> | <----------------------------------negative-------------------------------------->
Attr p p 000000000000000000000000000000000000000000000000000000000000000 n n
Here, the attribute is sorted, p means a positive number, n a negative one; 0 = zero. Evidently, there is no place where both the label and the attribute value change. Of course, I can imagine that in some cases class sizes and number of 'p' and 'n' in each class conspire such that no information gain is possible. But our example shows that evidently once in a while it is ...
Maybe Fayyad presumed that at least one such boundary existed, and then proved that this gives the best split? We can't tell without the article.
For sure, my two textbooks (Jiawei Han, Micheline Kamber, Data Mining, 2nd edition, p. 300, and Ian H. Witten, Eibe Frank, Data Mining, 2nd edition, p. 189) don't make such a statement and suggest testing all value boundaries. Well, one good reason for them being good examples is that they fall out of my computer program creating the example set based on measured data ;D
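The textbook recommendation - test every boundary between distinct sorted values, not only those where the label changes - can be sketched like so (my own Python illustration of the schematic counterexample above, not RapidMiner code):

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum(labels.count(c) / n * math.log2(labels.count(c) / n)
                for c in set(labels))

def exhaustive_best_split(values, labels):
    """Evaluate every boundary between distinct sorted attribute values."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy([l for _, l in pairs])
    best_cut, best_gain = None, 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between identical values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        gain = base - (len(left) / n * entropy(left)
                       + len(right) / n * entropy(right))
        if gain > best_gain:
            best_cut, best_gain = cut, gain
    return best_cut, best_gain

# Schematic counterexample from above: positives carry values p or 0,
# negatives carry 0 or n. No boundary combines a label change with a
# value change, yet splitting off the p-values gives a real gain.
values = [5.0, 4.0, 0.0, 0.0] + [0.0] * 20 + [-3.0, -4.0]
labels = ["pos"] * 4 + ["neg"] * 22
cut, gain = exhaustive_best_split(values, labels)
print(cut is not None and gain > 0)  # True
```

A heuristic that only considers boundaries where the label changes finds no candidate cut at all on this data, while the exhaustive scan finds one with positive gain.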
Stefan
PS: It's as late as it is because I've just gone through a validation (on independent dataset, not x-validation!) of my process (with patched NumericSplitter of course) and it looks very promising. Time to catch some sleep ::) .
just wanted to step in for a second to show that somebody of Rapid-I is actually reading this text
First of all: thanks again for bringing this topic up again. Although decision stumps are usually not the most important modeling schemes, it is such a simple building block that it should properly work as expected. The problem here seems to be: What is the expected way?
For your information: the implementation of the decision tree operators followed the book
"C4.5: programs for machine learning" of Ross Quinlan, the inventor of ID3 and C4.5.
I don't have the book here at home so I cannot check again, but I am really sure that there was also a statement that the label has to change in order to find a valid split (that's the reason why the implementation has been done as it is). And this also explains Sebastian's reaction ("this is set, that cannot be the error - let's search for another reason"...). This problem also slipped through our tests, since in many real-life scenarios on, let's say, more realistic data, this change will probably not make a difference. But as can clearly be seen: it makes a huge difference on the data you have provided, and that is problem enough.
So the consequence would be to just change the split mechanism as suggested by Stefan. I have, however, two additional things to consider: First, changing this behavior breaks compatibility, and since we have at least two or three well-known authors (Quinlan and Mitchell / Fayyad) this is probably not desired - it is only a matter of time until somebody else asks for the old behavior, since those authors have suggested it. Second, calculating a split at every possible point makes the learning (much) slower. We will have a look into both issues on real-world data and see if we A) just change the behavior as described and use different behaviors for different compatibility levels in order to ensure backwards compatibility, or B) add a parameter so the user can control which type of numerical splitting should be used (slow and exact vs. fast and sometimes wrong). I appreciate any comment on this.
Stefan, I can understand your frustration and, as I said, really appreciate that you have brought up the topic again. However, please do not get too upset (especially your contribution in the bug tracker is close to an insult...) - this does not help at all. I hope that the references to Quinlan and Mitchell (those have been our pages 2 in the nineties ) help to understand the reasons for the implementation as it is. And please also consider that we get about 10 to 20 bugfix contributions every month from the community. All of them help since they point out a problem; most are perfect and can be directly used, but some are unfortunately flawed again, and it's part of our quality assurance to check those fixes before just committing them. I hope you understand our situation here.
Cheers,
Ingo
thanks for your reply. Always good to have somebody at Rapid-I following
First, yes, I was a bit sharp in the bug report. There are two discussions here; one about the example set submitted with 401 (the cases discussed here, with attribute values 16 and 0). This case is obviously not fixed in 5.1, as we can conclude from all these postings here.
But what angered me more was that the 'fix' breaks the new sample data submitted with 952: this data set has one attribute degenerate in the way we discuss here, and one perfectly "normal" attribute which gives a much better split. The 'fix' now gives preference to the degenerate attribute and thus breaks cases where even "Remove Useless Attributes" cannot easily be used to create a work-around. So we are simply stuck with a broken DecisionStump. As I understand it, the Stump is implemented as a constrained tree - so the trees are presumably flawed as well... I didn't check that.
So, we are in a situation where I submit a change; I get a reply "Fixed in a proper manner" (which I regarded as quite arrogant as well, but ok), and only later have to discover that the more detailed description of the change is hidden in the forum and not referenced from the bug tracker, where I'd have gotten a push notification to check it for correctness. Yes, this triggered my anger. Apologies for the insult though!
But now back to the case discussed here. Google Books gives me access to C4.5 (Quinlan, 1993), where in chapter 2.4, Tests on continuous attributes, p. 25, it reads: So, he is quite explicit that all m-1 cases are (have to be?) examined. And this is after Fayyad (1991). Two paragraphs later, Quinlan claims: ... and the sorting you have to do anyway in getBestSplit (first code line).
Unfortunately, I don't find Mitchell on Google >:( He only has his slides up on the web, which are obviously not detailed enough. I also can't find Fayyad (1991); was that his thesis?
I share your concern about compatibility. But then, I see my 5.0 processes broken in other places too in 5.1; so either there needs to be a very consistent way throughout the whole system to maintain compatibility, or I wouldn't care that much. (I recall that in 5.0 there was an expert setting on some operators to select the compatibility level, but this seems gone in 5.1?)
At the very minimum, the code change in 5.1 needs be rolled back, as it breaks common data sets as shown in 952 without really fixing the problem of 401.
I'm obviously not neutral here, but I'd strongly advise implementation of the proposed change. I don't think a compatibility flag is worthwhile - if done, a link to this thread in the help text is required (or better yet, a concise summary), else nobody will understand what it's good for ...
Stefan
PS: Yes, you are right, DecisionStumps are not the most prominent learners ... I need to build a tree, but need some additional checking as to which splits I do and do not accept. Hence, I build the tree in a Groovy script out of individual stumps. I really like this Groovy thing! Two exceptions:
- would be great if it used the editor configured in Tools / Preferences / Tools
- would be great if it could do some syntax check: my script is embedded in an "Execute Process" operator, hence debugging gets a bit tedious
(Should probably put that in a separate thread...)
I'm getting confused - are we still talking about Decision Stumps? Because if we are, could you not use the Weka:W-OneR operator instead?
http://en.wikipedia.org/wiki/Decision_stump
@haddock: nope, unfortunately not, since W-OneR can perform multiple splits on numerical attributes, while W-DecisionStump and the DecisionStump of RapidMiner only perform a binary split.
@Stefan: ok, so we should be fine now :-*
And back to the discussion: thanks for the Google search - I was really sure that Quinlan himself also suggested testing only in cases where the label has changed - which obviously was wrong. So all the anger goes back to myself, since this is something I have done and taught (also to Sebastian ) since 1996 - sorry for that!
This makes things much easier, since we can indeed simply consider this to be a bug, and compatibility is less of an issue now. Sorry for the broken processes. Sometimes this can indeed hardly be prevented (especially in the case of bugs or fixes), but in general we try our best to keep compatibility since version 5. The setting is still there (no longer an expert setting), but it is only shown for operators where the behaviour has changed and different compatibility levels can be selected (for example, the X-Validation operator).
I fully agree. So I will A) remove the check for the last example introduced by Sebastian and B) remove the if-clauses checking for non-equality of the labels, as suggested in your original patch. The patched version will be available at some point tomorrow; the internal tests will also be finished by then, which should give us some insight into whether there are other issues with the fix.
Sorry again for the inconvenience and thanks again. Cheers,
Ingo
P.S.: Thanks for the suggestion regarding the Groovy editor. Yes, a specific thread or a request in Bugzilla would increase the chance of a change...