"Minimal use-case: YAGGA2, YAGGA"
dromiceiomimus
Member Posts: 4 Contributor I
Hi all,
First let me thank the developers for this wonderful tool. I've already had great success with some models.
Now, I'm trying to get YAGGA2 to work. My actual application is more complex than what's presented here, but I'd like to figure out a minimal setup that results in YAGGA2 functioning correctly before trying to apply it there.
So, here's some example data:
a and b will be our attributes, c will be our label. c is log10(max(abs(b-a),50)*a) -- presumably a good candidate for YAGGA2.
a,b,c
1, 1, 1.698970004
2, 13, 2
4, 26, 2.301029996
8, 40, 2.602059991
16, 55, 2.903089987
32, 71, 3.204119983
64, 88, 3.505149978
128,106,3.806179974
256,125,4.525511261
512,235,5.15174973
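For anyone who wants to regenerate this data, a short Python sketch that reproduces the rows above from the stated formula (the column values are taken straight from the table):

```python
import math

# (a, b) pairs from the example data; c = log10(max(abs(b - a), 50) * a)
rows = [(1, 1), (2, 13), (4, 26), (8, 40), (16, 55),
        (32, 71), (64, 88), (128, 106), (256, 125), (512, 235)]

for a, b in rows:
    c = math.log10(max(abs(b - a), 50) * a)
    print(f"{a},{b},{c:.9f}")
```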
Here's my process:
[CSV] -> [YAGGA2 (NN -> Apply Model -> Performance)]
This consistently errors with "Process failed: Generation exception: 'java.lang.IllegalArgumentException: Duplicate attribute name: prediction(c)'". Attempting to remove this attribute anywhere in the above chain does no good.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.017">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.017" expanded="true" name="Process">
<process expanded="true" height="467" width="815">
<operator activated="true" class="read_csv" compatibility="5.1.017" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
<parameter key="csv_file" value="C:\data\simpletest.csv"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="a.true.integer.attribute"/>
<parameter key="1" value="b.true.integer.attribute"/>
<parameter key="2" value="c.true.real.label"/>
</list>
</operator>
<operator activated="true" class="optimize_by_generation_yagga2" compatibility="5.1.017" expanded="true" height="94" name="Generate" width="90" x="246" y="30">
<process expanded="true" height="647" width="950">
<operator activated="true" class="neural_net" compatibility="5.1.017" expanded="true" height="76" name="Neural Net" width="90" x="112" y="30">
<list key="hidden_layers"/>
<parameter key="training_cycles" value="5"/>
</operator>
<operator activated="true" class="apply_model" compatibility="5.1.017" expanded="true" height="76" name="Apply Model" width="90" x="246" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance" compatibility="5.1.017" expanded="true" height="76" name="Performance" width="90" x="380" y="30"/>
<connect from_port="example set source" to_op="Neural Net" to_port="training set"/>
<connect from_op="Neural Net" from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_op="Neural Net" from_port="exampleSet" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="performance sink"/>
</process>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Generate" to_port="example set in"/>
<connect from_op="Generate" from_port="example set out" to_port="result 3"/>
<connect from_op="Generate" from_port="attribute weights out" to_port="result 2"/>
<connect from_op="Generate" from_port="performance out" to_port="result 1"/>
</process>
</operator>
</process>
Using YAGGA (not 2) this process will run, but no new attributes will be generated.
What am I doing wrong?
Answers
The "Apply Model" operator adds new attributes (the prediction) to the example set, and these are passed back to the upper level of the YAGGA operator. In the next generation the attributes are added again, which produces the duplicate name.
One way to fix it is to use a cross-validation operator inside the YAGGA operator. This leaves the example set alone and produces an averaged estimate of the performance you could expect on unseen data.
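The duplicate-name mechanism can be sketched in plain Python (this is only an illustration of the logic, not RapidMiner internals; the function name is made up):

```python
# Each YAGGA generation re-enters the inner subprocess with the example
# set left over from the previous generation, prediction column included.
attributes = ["a", "b", "c"]

def apply_model(attrs):
    name = "prediction(c)"  # Apply Model appends a prediction for label c
    if name in attrs:
        raise ValueError(f"Duplicate attribute name: {name}")
    return attrs + [name]

attributes = apply_model(attributes)      # generation 1: works
try:
    attributes = apply_model(attributes)  # generation 2: duplicate name
except ValueError as err:
    print(err)  # Duplicate attribute name: prediction(c)
```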
regards
Andrew
Thanks, that makes enough sense to me. I wasn't thinking about the YAGGA operator's internal state as being the place where the duplicates needed to not occur. Though I must admit, I don't quite understand why, and a few things still have me confused.
Why does cross validation work but not cross validation (parallel)? Are there other operators I could use there aside from normal cross validation? And is there some other (I suppose, better) way to employ the YAGGA operator, or am I going about this wrong from the beginning?
Any hints on those?
Cheers.
By using an internal cross-validation (or split validation) you will get a better and more robust performance estimate anyway, and you don't have to clean up yourself; this is done automatically by the validation operator. So I also highly recommend using either a cross-validation or a single split validation inside the YAGGA operators. The same is true for basically all wrapper approaches to feature selection, generation, weighting...
Hope that clarifies things a bit. In principle the parallel version should also be possible. You should, however, not nest different parallel algorithms, i.e. you should not nest a parallel cross-validation inside a parallel feature selection / generation, for example.
Yes, you could use X-Validation, Split Validation, Bootstrapping Validation, or Batch-X-Validation. If you know what you are doing, you could also create specialized subprocesses, but in that case you have to make sure you clean up the predictions yourself.
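"Cleaning up the predictions yourself" amounts to removing the prediction attribute before the subprocess hands the example set back. A hypothetical Python sketch of that pattern (the function and attribute names are made up for illustration, not RapidMiner APIs):

```python
def evaluate_generation(attributes):
    """Run one fitness evaluation without polluting the caller's example set."""
    working = list(attributes)          # work on a copy
    working.append("prediction(c)")     # Apply Model adds the prediction column
    # ... score "prediction(c)" against the label "c" here ...
    working.remove("prediction(c)")     # clean up before returning
    return working

attrs = ["a", "b", "c"]
attrs = evaluate_generation(attrs)
attrs = evaluate_generation(attrs)      # safe to call repeatedly
print(attrs)
```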
No, in principle you should be fine. The rest is more about parameter tuning. One tip though: I would try YAGGA2 on slightly bigger data sets, since otherwise probably either no new and interesting attributes will be created or it will directly result in overfitting. In your case, log(a) is already highly correlated with the label c, so any additional attribute does not really help...
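You can check that correlation directly on the ten example rows; a small Python sketch with a hand-rolled Pearson correlation (assuming the data posted above):

```python
import math

a_vals = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
c_vals = [1.698970004, 2.0, 2.301029996, 2.602059991, 2.903089987,
          3.204119983, 3.505149978, 3.806179974, 4.525511261, 5.15174973]

log_a = [math.log10(a) for a in a_vals]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

# log10(a) alone tracks the label almost perfectly on this tiny data set
print(pearson(log_a, c_vals))
```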
By the way, there is also a sample process for YAGGA in the Sample repository delivered with RapidMiner: Sample/processes/04_attributes/19_YAGGA, in case you have not seen this one yet...
Cheers,
Ingo