Feature and instance selection for time series data
Hey there,
I have a binary classification problem with 400 attributes (time series) and 3,000 instances. The final 300 samples are for testing and remain unseen until the end; the first 2,700 samples are used for training. I therefore use the Split Data operator at the beginning.
For the training set: since there are probably a lot of useless features, I want to apply Forward Selection. Because of the danger of overfitting, I don’t want to use a simple split validation as the inner operator. Even though there is no explicit time index, I also don’t want to use cross-validation, since I would then be using future values to learn to predict past developments. And since there are structural breaks within the time series features in any case, I am thinking of using the Sliding Window Validation with training windows of 1,500 data points and test windows of 150. So I would have 18 validations (2,700/150 = 18) for the whole training set.
Furthermore, since I want to apply an SVM (any suggestions for an appropriate time series kernel?), I am thinking of taking the RBF kernel and optimizing C and gamma. For that I want to use the grid search and the ParameterSetter operator. So it is a kind of nested sliding window validation as the inner operator of the forward selection.
Last but not least, since I assume there is a lot of “noise” in the data, not every instance of the training set will be useful for training the SVM (maybe only half of the 2,700 instances). So I also want to apply instance selection to get better classification results on the test set. Do you have any suggestions which method I should use and how to implement it?
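As a sanity check on the window count, here is a minimal Python sketch of the sliding-window scheme described above. It is not RapidMiner's implementation: `sliding_windows` is a hypothetical helper, it assumes the window advances by one test window (150 rows) per step, and it ignores any horizon offset between training and test windows. Under those assumptions the 2,700 training rows yield 8 validations rather than 18, because the first 1,500 rows are consumed by the first training window.

```python
def sliding_windows(n_rows, train_width, test_width, step=None, cumulative=False):
    """Yield (train_start, train_end, test_start, test_end) as half-open row ranges."""
    if step is None:
        step = test_width  # assumption: advance by one test window per iteration
    start = 0
    # Stop as soon as the test window would run past the last row.
    while start + train_width + test_width <= n_rows:
        # With cumulative training, the training window always starts at row 0.
        train_start = 0 if cumulative else start
        train_end = start + train_width
        yield (train_start, train_end, train_end, train_end + test_width)
        start += step


windows = list(sliding_windows(2700, 1500, 150))
print(len(windows))   # 8
print(windows[0])     # (0, 1500, 1500, 1650)
print(windows[-1])    # (1050, 2550, 2550, 2700)
```

Passing `cumulative=True` keeps the same 8 test windows but lets each training window grow from row 0, which is how I understand the `cumulative_training` setting.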
After the whole training process I want to obtain the optimal feature set, the optimal parameters for the SVM, and the useful instances of the training set.
Then I want to retrain the “optimal” SVM on the whole training set (2,700 minus the useless instances) with the optimal features, and get its performance on the training set.
Finally, I apply the learned model to the unseen test set (300 data points). The important point is that, besides the test set performance, I need the predicted labels for the test set.
I know that this is computationally very expensive, but I need an optimal process. Please be critical!
Here is the code of the process with everything included except the instance selection. If you have recommendations on any point, please let me know.
Thanks in advance
Daniel
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.005">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.005" expanded="true" name="Root">
<description><p> Often the different operators have many parameters and it is not clear which parameter values are best for the learning task at hand. The parameter optimization operator helps to find an optimal parameter set for the used operators. </p> <p> The inner crossvalidation estimates the performance for each parameter set. In this process two parameters of the SVM are tuned. The result can be plotted in 3D (using gnuplot) or in color mode. </p> <p> Try the following: <ul> <li>Start the process. The result is the best parameter set and the performance which was achieved with this parameter set.</li> <li>Edit the parameter list of the ParameterOptimization operator to find another parameter set.</li> </ul> </p> </description>
<parameter key="parallelize_main_process" value="true"/>
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="5.3.005" expanded="true" height="60" name="Read Excel" width="90" x="45" y="30">
<parameter key="excel_file" value="D:\Promotion\Matlab\Ich\Workspaces\Tag\Feature_Matrix_nonlin_test.xls"/>
<parameter key="imported_cell_range" value="A1:MH3000"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<list key="data_set_meta_data_information">
<parameter key="0" value="label.true.binominal.label"/>
<parameter key="1" value="a1.true.real.attribute"/>
<parameter key="2" value="a2.true.real.attribute"/>
<parameter key="396" value="a251.true.real.attribute"/>
<parameter key="397" value="a252.true.real.attribute"/>
<parameter key="398" value="a253.true.real.attribute"/>
<parameter key="399" value="a254.true.real.attribute"/>
<parameter key="400" value="a255.true.real.attribute"/>
</list>
</operator>
<operator activated="true" class="split_data" compatibility="5.3.005" expanded="true" height="94" name="Split Data_train_and_test" width="90" x="45" y="120">
<enumeration key="partitions">
<parameter key="ratio" value="0.9"/>
<parameter key="ratio" value="0.1"/>
</enumeration>
<parameter key="sampling_type" value="linear sampling"/>
</operator>
<operator activated="true" class="multiply" compatibility="5.3.005" expanded="true" height="112" name="Multiply_trainset" width="90" x="179" y="30"/>
<operator activated="true" class="optimize_selection_forward" compatibility="5.3.005" expanded="true" height="94" name="Forward Selection" width="90" x="380" y="30">
<parameter key="maximal_number_of_attributes" value="30"/>
<parameter key="speculative_rounds" value="10"/>
<process expanded="true">
<operator activated="true" class="optimize_parameters_grid" compatibility="5.3.005" expanded="true" height="148" name="loopThroughLocalParams" width="90" x="246" y="75">
<list key="parameters">
<parameter key="SVM_train.gamma" value="[0.0;1000;10;linear]"/>
<parameter key="SVM_train.C" value="[0;1000;10;linear]"/>
</list>
<parameter key="parallelize_optimization_process" value="true"/>
<process expanded="true">
<operator activated="true" class="series:sliding_window_validation" compatibility="5.3.000" expanded="true" height="112" name="Slidingwindow_Validation" width="90" x="45" y="75">
<parameter key="training_window_width" value="1500"/>
<parameter key="test_window_width" value="150"/>
<parameter key="cumulative_training" value="true"/>
<process expanded="true">
<operator activated="true" class="support_vector_machine_libsvm" compatibility="5.3.005" expanded="true" height="76" name="SVM_train" width="90" x="179" y="30">
<parameter key="gamma" value="1000.0"/>
<parameter key="C" value="1000.0"/>
<list key="class_weights"/>
</operator>
<connect from_port="training" to_op="SVM_train" to_port="training set"/>
<connect from_op="SVM_train" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="5.3.005" expanded="true" height="76" name="Test_train" width="90" x="45" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="5.3.005" expanded="true" height="76" name="ClassificationPerformance_train_train" width="90" x="179" y="30">
<parameter key="weighted_mean_recall" value="true"/>
<parameter key="weighted_mean_precision" value="true"/>
<parameter key="absolute_error" value="true"/>
<parameter key="root_mean_squared_error" value="true"/>
<list key="class_weights"/>
</operator>
<connect from_port="model" to_op="Test_train" to_port="model"/>
<connect from_port="test set" to_op="Test_train" to_port="unlabelled data"/>
<connect from_op="Test_train" from_port="labelled data" to_op="ClassificationPerformance_train_train" to_port="labelled data"/>
<connect from_op="ClassificationPerformance_train_train" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="multiply" compatibility="5.3.005" expanded="true" height="94" name="Multiply_trainset (2)" width="90" x="246" y="165"/>
<operator activated="true" class="log" compatibility="5.3.005" expanded="true" height="76" name="ProcessLog" width="90" x="380" y="120">
<list key="log">
<parameter key="generation" value="operator.Forward Selection.value.number of attributes"/>
<parameter key="performance_traintrain" value="operator.Slidingwindow_Validation.value.performance"/>
<parameter key="performance_grid_search_loop" value="operator.loopThroughLocalParams.value.performance"/>
<parameter key="feature_name" value="operator.Forward Selection.value.feature_names"/>
<parameter key="parameter_c" value="operator.SVM_train.parameter.C"/>
<parameter key="parameter_gamma" value="operator.SVM_train.parameter.gamma"/>
<parameter key="anzahl_validierungen" value="operator.Slidingwindow_Validation.value.iteration"/>
</list>
</operator>
<connect from_port="input 1" to_op="Slidingwindow_Validation" to_port="training"/>
<connect from_op="Slidingwindow_Validation" from_port="model" to_port="result 1"/>
<connect from_op="Slidingwindow_Validation" from_port="training" to_port="result 2"/>
<connect from_op="Slidingwindow_Validation" from_port="averagable 1" to_op="Multiply_trainset (2)" to_port="input"/>
<connect from_op="Multiply_trainset (2)" from_port="output 1" to_op="ProcessLog" to_port="through 1"/>
<connect from_op="Multiply_trainset (2)" from_port="output 2" to_port="result 3"/>
<connect from_op="ProcessLog" from_port="through 1" to_port="performance"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
</process>
</operator>
<operator activated="true" class="set_parameters" compatibility="5.3.005" expanded="true" height="130" name="ParameterSetter" width="90" x="514" y="75">
<list key="name_map">
<parameter key="SVM_train" value="SVM_test"/>
</list>
</operator>
<connect from_port="example set" to_op="loopThroughLocalParams" to_port="input 1"/>
<connect from_op="loopThroughLocalParams" from_port="performance" to_op="ParameterSetter" to_port="through 1"/>
<connect from_op="loopThroughLocalParams" from_port="parameter" to_op="ParameterSetter" to_port="parameter set"/>
<connect from_op="loopThroughLocalParams" from_port="result 1" to_op="ParameterSetter" to_port="through 2"/>
<connect from_op="loopThroughLocalParams" from_port="result 2" to_op="ParameterSetter" to_port="through 3"/>
<connect from_op="ParameterSetter" from_port="through 1" to_port="performance"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
</process>
</operator>
<operator activated="true" class="multiply" compatibility="5.3.005" expanded="true" height="112" name="Multiply_weights" width="90" x="581" y="30"/>
<operator activated="true" class="select_by_weights" compatibility="5.3.005" expanded="true" height="94" name="Weights_for_svm_learn" width="90" x="45" y="255"/>
<operator activated="true" class="support_vector_machine_libsvm" compatibility="5.3.005" expanded="true" height="76" name="SVM_test" width="90" x="179" y="255">
<parameter key="gamma" value="500.0"/>
<parameter key="C" value="500.5"/>
<list key="class_weights"/>
<parameter key="calculate_confidences" value="true"/>
</operator>
<operator activated="true" class="multiply" compatibility="5.3.005" expanded="true" height="94" name="Multiply_model_for_apply" width="90" x="246" y="165"/>
<operator activated="true" class="select_by_weights" compatibility="5.3.005" expanded="true" height="94" name="Weights_for_train_new" width="90" x="380" y="300"/>
<operator activated="true" class="apply_model" compatibility="5.3.005" expanded="true" height="76" name="applyModel_train_new" width="90" x="514" y="255">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="5.3.005" expanded="true" height="76" name="Performance_train_new" width="90" x="648" y="255">
<parameter key="weighted_mean_recall" value="true"/>
<parameter key="weighted_mean_precision" value="true"/>
<parameter key="absolute_error" value="true"/>
<parameter key="root_mean_squared_error" value="true"/>
<list key="class_weights"/>
</operator>
<operator activated="true" class="select_by_weights" compatibility="5.3.005" expanded="true" height="94" name="Weights_for_test" width="90" x="380" y="165"/>
<operator activated="true" class="apply_model" compatibility="5.3.005" expanded="true" height="76" name="applyModel_test" width="90" x="514" y="165">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="5.3.005" expanded="true" height="76" name="Performance_testset" width="90" x="648" y="165">
<parameter key="weighted_mean_recall" value="true"/>
<parameter key="weighted_mean_precision" value="true"/>
<parameter key="absolute_error" value="true"/>
<parameter key="root_mean_squared_error" value="true"/>
<list key="class_weights"/>
</operator>
<connect from_op="Read Excel" from_port="output" to_op="Split Data_train_and_test" to_port="example set"/>
<connect from_op="Split Data_train_and_test" from_port="partition 1" to_op="Multiply_trainset" to_port="input"/>
<connect from_op="Split Data_train_and_test" from_port="partition 2" to_op="Weights_for_test" to_port="example set input"/>
<connect from_op="Multiply_trainset" from_port="output 1" to_op="Weights_for_train_new" to_port="example set input"/>
<connect from_op="Multiply_trainset" from_port="output 2" to_op="Weights_for_svm_learn" to_port="example set input"/>
<connect from_op="Multiply_trainset" from_port="output 3" to_op="Forward Selection" to_port="example set"/>
<connect from_op="Forward Selection" from_port="example set" to_port="result 5"/>
<connect from_op="Forward Selection" from_port="attribute weights" to_op="Multiply_weights" to_port="input"/>
<connect from_op="Forward Selection" from_port="performance" to_port="result 4"/>
<connect from_op="Multiply_weights" from_port="output 1" to_op="Weights_for_svm_learn" to_port="weights"/>
<connect from_op="Multiply_weights" from_port="output 2" to_op="Weights_for_train_new" to_port="weights"/>
<connect from_op="Multiply_weights" from_port="output 3" to_op="Weights_for_test" to_port="weights"/>
<connect from_op="Weights_for_svm_learn" from_port="example set output" to_op="SVM_test" to_port="training set"/>
<connect from_op="SVM_test" from_port="model" to_op="Multiply_model_for_apply" to_port="input"/>
<connect from_op="Multiply_model_for_apply" from_port="output 1" to_op="applyModel_test" to_port="model"/>
<connect from_op="Multiply_model_for_apply" from_port="output 2" to_op="applyModel_train_new" to_port="model"/>
<connect from_op="Weights_for_train_new" from_port="example set output" to_op="applyModel_train_new" to_port="unlabelled data"/>
<connect from_op="applyModel_train_new" from_port="labelled data" to_op="Performance_train_new" to_port="labelled data"/>
<connect from_op="Performance_train_new" from_port="performance" to_port="result 2"/>
<connect from_op="Performance_train_new" from_port="example set" to_port="result 6"/>
<connect from_op="Weights_for_test" from_port="example set output" to_op="applyModel_test" to_port="unlabelled data"/>
<connect from_op="applyModel_test" from_port="labelled data" to_op="Performance_testset" to_port="labelled data"/>
<connect from_op="applyModel_test" from_port="model" to_port="result 3"/>
<connect from_op="Performance_testset" from_port="performance" to_port="result 1"/>
<connect from_op="Performance_testset" from_port="example set" to_port="result 7"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
<portSpacing port="sink_result 6" spacing="0"/>
<portSpacing port="sink_result 7" spacing="0"/>
<portSpacing port="sink_result 8" spacing="0"/>
</process>
</operator>
</process>
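To put a rough number on the computational cost of this process, the following Python sketch counts the SVM fits implied by the settings above. It rests on several assumptions: that a range such as `[0;1000;10;linear]` expands to 11 evenly spaced values (10 steps plus both endpoints), that forward selection runs all 30 rounds over the 400 attributes stated in the post (speculative rounds and early stopping are ignored), and that the sliding window validation yields 8 windows per run. `linear_grid` is a hypothetical helper, not a RapidMiner function.

```python
def linear_grid(lo, hi, steps):
    """Values of a '[lo;hi;steps;linear]' range, assuming steps + 1 evenly spaced points."""
    return [lo + (hi - lo) * i / steps for i in range(steps + 1)]


gammas = linear_grid(0.0, 1000.0, 10)
cs = linear_grid(0.0, 1000.0, 10)
grid = [(g, c) for g in gammas for c in cs]  # every (gamma, C) combination

n_windows = 8        # assumption: 1,500-row training, 150-row test, step 150 on 2,700 rows
n_attributes = 400   # as stated in the post
max_selected = 30    # maximal_number_of_attributes

# Forward selection: round k tries every attribute that is not yet selected.
subset_evaluations = sum(n_attributes - k for k in range(max_selected))
svm_fits = subset_evaluations * len(grid) * n_windows

print(gammas[:3], gammas[-1])  # [0.0, 100.0, 200.0] 1000.0
print(len(grid))               # 121
print(subset_evaluations)      # 11565
print(svm_fits)                # 11194920
```

Under these assumptions the upper bound is roughly 11.2 million SVM fits before the final retraining. The grid as specified also includes the value 0 for both gamma and C, which may be worth checking against what the SVM operator accepts.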