"X-Validation for dependent data"
Hello,
I am looking for a way to validate a classification model whose input consists of partially dependent examples. I read a couple of research papers that suggest h-block / hv-block cross-validation as a more robust way to validate a model in such a scenario. Although I believe I generally understand those concepts, I am pretty much clueless when it comes to implementing them in RapidMiner.
To give a bit more color around my scenario, I am attaching a short csv file with made-up data. I basically have a number of identical machines, each of them running independently from the others. All machines have the same attributes, and the examples consist of those attribute values taken at different points in time during a production run (those time points are usually different for each machine, with irregular intervals). The label indicates whether a machine needs maintenance during the current production run.
Ignoring the dependence of examples that belong to one unique machine and just running a regular cross validation across all data points leads to beautifully accurate models. However, applying those models to fresh and unseen data results in quite bad predictions (independent of the chosen model type).
I would like to know how others are dealing with such datasets. I also considered pivoting the data so that the examples become attributes (value at time x), leaving only one example per machine, but this would lead to a very wide and not necessarily useful dataset.
Also, and somewhat related: my dataset is imbalanced, with a ratio of about 0.65/0.35 for the two classes. How do I make sure that "useful" examples are chosen when I want to sample it down to a balanced dataset?
Thank you very much!
Best Answer
MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
Hey,
I think what you want to do is train on one machine and test on the others, or at least make sure that examples of one machine are never in both training and testing.
To do this you can generate yourself a BatchId from your MachineId with BatchId = MachineId % 10 and assign it the batch role. There is an option in X-Validation to use this kind of splitting.
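To illustrate the splitting logic outside RapidMiner, here is a minimal Python sketch (toy data and function names are made up; this is the same idea as the batch split, not the operator's internals):

```python
# Hypothetical sketch: derive a batch id from the machine id so that all
# examples of one machine land in the same fold and are never split
# across training and test.

def batch_folds(examples, n_batches=10):
    """Yield (train_idx, test_idx) per fold; examples = [(machine_id, x), ...]."""
    for b in range(n_batches):
        test = [i for i, (mid, _) in enumerate(examples) if mid % n_batches == b]
        train = [i for i, (mid, _) in enumerate(examples) if mid % n_batches != b]
        yield train, test

# Toy data: three time points per machine, five machines.
examples = [(mid, None) for mid in range(5) for _ in range(3)]

for train, test in batch_folds(examples, n_batches=5):
    # machines in training and test are disjoint in every fold
    assert not ({examples[i][0] for i in train} & {examples[i][0] for i in test})
```

Each fold holds out all examples of the machines whose id maps to that batch, which is exactly why the optimistic bias of a plain row-wise cross-validation disappears.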
Edit: attached is an example process on your data doing exactly that. I would also recommend using Generate Weight (Stratification) in a first run to fix the class imbalance.
Best,
Martin
<?xml version="1.0" encoding="UTF-8"?><process version="7.3.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.3.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_csv" compatibility="7.3.001" expanded="true" height="68" name="Read CSV" width="90" x="45" y="34">
<parameter key="csv_file" value="/Users/mschmitz/Downloads/DependentData_Example.csv"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="UTF-8"/>
<list key="data_set_meta_data_information"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="7.3.001" expanded="true" height="82" name="Generate Attributes" width="90" x="179" y="34">
<list key="function_descriptions">
<parameter key="batchid" value="[Machine ID]%10"/>
</list>
</operator>
<operator activated="true" class="set_role" compatibility="7.3.001" expanded="true" height="82" name="Set Role" width="90" x="313" y="34">
<parameter key="attribute_name" value="batchid"/>
<parameter key="target_role" value="batch"/>
<list key="set_additional_roles">
<parameter key="Machine ID" value="machine id"/>
<parameter key="Maintenance (Label)" value="label"/>
</list>
</operator>
<operator activated="true" class="generate_weight_stratification" compatibility="7.3.001" expanded="true" height="82" name="Generate Weight (Stratification)" width="90" x="447" y="34">
<parameter key="total_weight" value="25.0"/>
<description align="center" color="transparent" colored="false" width="126">To balance classes</description>
</operator>
<operator activated="true" class="concurrency:cross_validation" compatibility="7.3.001" expanded="true" height="145" name="Cross Validation" width="90" x="648" y="34">
<parameter key="split_on_batch_attribute" value="true"/>
<process expanded="true">
<operator activated="true" class="h2o:logistic_regression" compatibility="7.3.000" expanded="true" height="103" name="Logistic Regression" width="90" x="112" y="34"/>
<connect from_port="training set" to_op="Logistic Regression" to_port="training set"/>
<connect from_op="Logistic Regression" from_port="model" to_port="model"/>
<portSpacing port="source_training set" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="7.3.001" expanded="true" height="82" name="Apply Model" width="90" x="112" y="34">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_binominal_classification" compatibility="7.3.001" expanded="true" height="82" name="Performance" width="90" x="246" y="34"/>
<connect from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="performance 1"/>
<connect from_op="Performance" from_port="example set" to_port="test set results"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_test set results" spacing="0"/>
<portSpacing port="sink_performance 1" spacing="0"/>
<portSpacing port="sink_performance 2" spacing="0"/>
</process>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Generate Weight (Stratification)" to_port="example set input"/>
<connect from_op="Generate Weight (Stratification)" from_port="example set output" to_op="Cross Validation" to_port="example set"/>
<connect from_op="Cross Validation" from_port="model" to_port="result 1"/>
<connect from_op="Cross Validation" from_port="performance 1" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany
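For readers without the operator at hand, the Generate Weight (Stratification) step can be sketched in Python as well (a sketch based on the operator's documented behaviour of distributing a total weight equally across classes, not its actual source):

```python
from collections import Counter

def stratified_weights(labels, total_weight=25.0):
    """Give each class the same total weight (total_weight / #classes),
    spread evenly over its examples."""
    counts = Counter(labels)
    per_class = total_weight / len(counts)
    return [per_class / counts[y] for y in labels]

# The 0.65 / 0.35 imbalance from the question:
labels = ["ok"] * 65 + ["maintenance"] * 35
weights = stratified_weights(labels)
```

Minority-class examples end up with larger individual weights, so a weight-aware learner sees both classes as equally important without throwing away any examples.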
Answers
Interesting question indeed. While there is no HV-Block Validation operator, there is one that might be similar. I'd have to check whether they do the same thing, but have you looked at the Sliding Window Validation operator? It's in the Series extension and lets you create a window over time-series data and build a training set on the previous cumulative windows (i.e. respecting the dependency).
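The windowing idea can be sketched like this, assuming a time-ordered example set (window sizes and the cumulative flag are illustrative, not the Series extension's actual parameter names):

```python
# Sketch of sliding-window validation: the test window always follows the
# training window in time, and with cumulative=True each split trains on
# all previous windows.

def sliding_window_splits(n, train_size, test_size, cumulative=False):
    """Yield (train_idx, test_idx) pairs over n time-ordered examples."""
    start = 0
    while start + train_size + test_size <= n:
        train_end = start + train_size
        train = list(range(0 if cumulative else start, train_end))
        test = list(range(train_end, train_end + test_size))
        yield train, test
        start += test_size  # advance by one test window

splits = list(sliding_window_splits(10, train_size=4, test_size=2, cumulative=True))
```

Because the model is never evaluated on examples that precede its training data, this avoids the leakage that makes a plain shuffled cross-validation look deceptively good on dependent data.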
Dear both,
thank you very much. I am always amazed how quickly you guys respond in a free community forum. I will try the batchid avenue and let you know how it goes. I am also quite curious about Thomas' windowing suggestion; we'll see if I get that to work. My real data is quite complex, with several hundred machines that unfortunately don't behave as identically as I would like.
Also, if there are any other suggestions or references to literature, I would be happy to know.
Thanks again.
You're very welcome. @mschmitz and I are super happy to help and we hope that you find RapidMiner as awesome as we do!
Just a side note: there is also a Batch Sliding Window Validation operator to check out.