"automatic feature engineering sample process with demonstrative results"
Telcontar120
RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
I've been playing with the new Automatic Feature Engineering operator, but most of the discussion I have seen so far has centered around the Feature Selection components of its capabilities. Is there any sample process or further documentation/discussion available for the feature engineering aspects of the operator? The tutorial process doesn't include feature engineering (and enabling it doesn't produce any new attributes), and the in-program help doesn't discuss the feature engineering options or parameters either. Even when I have applied this operator to other sample datasets (the usual suspects: Titanic, Sonar, etc.) I haven't been able to get it to generate anything useful. Any sample processes on available datasets (perhaps those used in development or other suggestions) would be appreciated. Thanks.
Best Answers
lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
Hi Brian,
I had the same impression as you after my first tests with this new tool (i.e., feature generation did not produce any new attributes).
But after digging further, I adapted a process to use the S&P 500 dataset, and in this case the AFE operator generates some relevant new attributes.
Indeed, with these new attributes, the final fitness function is lower than with the "original set".
The process:
<?xml version="1.0" encoding="UTF-8"?><process version="9.2.000-SNAPSHOT">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.000-SNAPSHOT" expanded="true" name="Process" origin="GENERATED_TUTORIAL">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" breakpoints="after" class="retrieve" compatibility="9.2.000-SNAPSHOT" expanded="true" height="68" name="Retrieve s&amp;p-500-data" width="90" x="45" y="289">
        <parameter key="repository_entry" value="//Samples/Deep Learning/data/s&amp;p-500-data"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.2.000-SNAPSHOT" expanded="true" height="82" name="Set Role" width="90" x="179" y="289">
        <parameter key="attribute_name" value="Adj Close"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="nominal_to_date" compatibility="9.2.000-SNAPSHOT" expanded="true" height="82" name="Nominal to Date" width="90" x="313" y="289">
        <parameter key="attribute_name" value="Date"/>
        <parameter key="date_type" value="date"/>
        <parameter key="date_format" value="yyyy-MM-dd"/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="locale" value="English (United States)"/>
        <parameter key="keep_old_attribute" value="false"/>
      </operator>
      <operator activated="true" class="split_data" compatibility="9.2.000-SNAPSHOT" expanded="true" height="103" name="Validation Set" origin="GENERATED_TUTORIAL" width="90" x="447" y="289">
        <enumeration key="partitions">
          <parameter key="ratio" value="0.8"/>
          <parameter key="ratio" value="0.2"/>
        </enumeration>
        <parameter key="sampling_type" value="automatic"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <description align="center" color="transparent" colored="false" width="126">Split off 20% validation set.</description>
      </operator>
      <operator activated="true" class="multiply" compatibility="9.2.000-SNAPSHOT" expanded="true" height="103" name="Multiply Training Set" origin="GENERATED_TUTORIAL" width="90" x="514" y="187">
        <description align="center" color="transparent" colored="false" width="126">Copy training set.</description>
      </operator>
      <operator activated="true" class="split_data" compatibility="9.2.000-SNAPSHOT" expanded="true" height="82" name="Split Data for FS" origin="GENERATED_TUTORIAL" width="90" x="581" y="34">
        <enumeration key="partitions">
          <parameter key="ratio" value="0.8"/>
          <parameter key="ratio" value="0.2"/>
        </enumeration>
        <parameter key="sampling_type" value="automatic"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <description align="center" color="transparent" colored="false" width="126">Create data split for feature selection.</description>
      </operator>
      <operator activated="true" class="model_simulator:automatic_feature_engineering" compatibility="9.1.000" expanded="true" height="103" name="Feature Selection" origin="GENERATED_TUTORIAL" width="90" x="715" y="34">
        <parameter key="mode" value="feature selection and generation"/>
        <parameter key="balance for accuracy" value="0.9"/>
        <parameter key="show progress dialog" value="false"/>
        <parameter key="use_local_random_seed" value="true"/>
        <parameter key="local_random_seed" value="1966"/>
        <parameter key="use optimization heuristics" value="true"/>
        <parameter key="maximum generations" value="30"/>
        <parameter key="population size" value="10"/>
        <parameter key="use multi-starts" value="true"/>
        <parameter key="number of multi-starts" value="3"/>
        <parameter key="generations until multi-start" value="10"/>
        <parameter key="use time limit" value="false"/>
        <parameter key="time limit in seconds" value="300"/>
        <parameter key="use subset for generation" value="false"/>
        <parameter key="maximum function complexity" value="6"/>
        <parameter key="use_plus" value="false"/>
        <parameter key="use_diff" value="false"/>
        <parameter key="use_mult" value="true"/>
        <parameter key="use_div" value="true"/>
        <parameter key="reciprocal_value" value="true"/>
        <parameter key="use_square_roots" value="true"/>
        <parameter key="use_exp" value="true"/>
        <parameter key="use_log" value="true"/>
        <parameter key="use_absolute_values" value="true"/>
        <parameter key="use_sgn" value="false"/>
        <parameter key="use_min" value="false"/>
        <parameter key="use_max" value="false"/>
        <process expanded="true">
          <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.2.000-SNAPSHOT" expanded="true" height="103" name="Decision Tree" width="90" x="45" y="34">
            <parameter key="criterion" value="least_square"/>
            <parameter key="maximal_depth" value="10"/>
            <parameter key="apply_pruning" value="true"/>
            <parameter key="confidence" value="0.1"/>
            <parameter key="apply_prepruning" value="true"/>
            <parameter key="minimal_gain" value="0.01"/>
            <parameter key="minimal_leaf_size" value="2"/>
            <parameter key="minimal_size_for_split" value="4"/>
            <parameter key="number_of_prepruning_alternatives" value="3"/>
          </operator>
          <operator activated="true" class="apply_model" compatibility="9.2.000-SNAPSHOT" expanded="true" height="82" name="Apply Model FS" origin="GENERATED_TUTORIAL" width="90" x="246" y="136">
            <list key="application_parameters"/>
            <parameter key="create_view" value="false"/>
          </operator>
          <operator activated="true" class="performance" compatibility="9.2.000-SNAPSHOT" expanded="true" height="82" name="Performance" width="90" x="380" y="136">
            <parameter key="use_example_weights" value="true"/>
          </operator>
          <connect from_port="example set source" to_op="Decision Tree" to_port="training set"/>
          <connect from_op="Decision Tree" from_port="model" to_op="Apply Model FS" to_port="model"/>
          <connect from_op="Decision Tree" from_port="exampleSet" to_op="Apply Model FS" to_port="unlabelled data"/>
          <connect from_op="Apply Model FS" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="performance sink"/>
          <portSpacing port="source_example set source" spacing="0"/>
          <portSpacing port="sink_performance sink" spacing="0"/>
        </process>
        <description align="center" color="transparent" colored="false" width="126">Feature selection.</description>
      </operator>
      <operator activated="true" class="multiply" compatibility="9.2.000-SNAPSHOT" expanded="true" height="103" name="Multiply Feature Set" origin="GENERATED_TUTORIAL" width="90" x="849" y="238">
        <description align="center" color="transparent" colored="false" width="126">Copy feature set.</description>
      </operator>
      <operator activated="true" class="model_simulator:apply_feature_set" compatibility="9.1.000" expanded="true" height="82" name="Apply Feature Set on Validation" origin="GENERATED_TUTORIAL" width="90" x="983" y="391">
        <parameter key="handle missings" value="true"/>
        <parameter key="keep originals" value="false"/>
        <parameter key="originals special role" value="true"/>
        <description align="center" color="transparent" colored="false" width="126">Apply feature set on validation set.</description>
      </operator>
      <operator activated="true" class="model_simulator:apply_feature_set" compatibility="9.1.000" expanded="true" height="82" name="Apply Feature Set on Training" origin="GENERATED_TUTORIAL" width="90" x="983" y="187">
        <parameter key="handle missings" value="true"/>
        <parameter key="keep originals" value="false"/>
        <parameter key="originals special role" value="true"/>
        <description align="center" color="transparent" colored="false" width="126">Apply feature set on training set.</description>
      </operator>
      <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.2.000-SNAPSHOT" expanded="true" height="103" name="Decision Tree (2)" width="90" x="1117" y="187">
        <parameter key="criterion" value="least_square"/>
        <parameter key="maximal_depth" value="10"/>
        <parameter key="apply_pruning" value="true"/>
        <parameter key="confidence" value="0.1"/>
        <parameter key="apply_prepruning" value="true"/>
        <parameter key="minimal_gain" value="0.01"/>
        <parameter key="minimal_leaf_size" value="2"/>
        <parameter key="minimal_size_for_split" value="4"/>
        <parameter key="number_of_prepruning_alternatives" value="3"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="9.2.000-SNAPSHOT" expanded="true" height="82" name="Apply Model" origin="GENERATED_TUTORIAL" width="90" x="1251" y="289">
        <list key="application_parameters"/>
        <parameter key="create_view" value="false"/>
        <description align="center" color="transparent" colored="false" width="126">Apply final prediction model on validation set.</description>
      </operator>
      <connect from_op="Retrieve s&amp;p-500-data" from_port="output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Nominal to Date" to_port="example set input"/>
      <connect from_op="Nominal to Date" from_port="example set output" to_op="Validation Set" to_port="example set"/>
      <connect from_op="Validation Set" from_port="partition 1" to_op="Multiply Training Set" to_port="input"/>
      <connect from_op="Validation Set" from_port="partition 2" to_op="Apply Feature Set on Validation" to_port="example set"/>
      <connect from_op="Multiply Training Set" from_port="output 1" to_op="Split Data for FS" to_port="example set"/>
      <connect from_op="Multiply Training Set" from_port="output 2" to_op="Apply Feature Set on Training" to_port="example set"/>
      <connect from_op="Split Data for FS" from_port="partition 1" to_op="Feature Selection" to_port="example set in"/>
      <connect from_op="Feature Selection" from_port="feature set" to_op="Multiply Feature Set" to_port="input"/>
      <connect from_op="Feature Selection" from_port="population" to_port="result 2"/>
      <connect from_op="Feature Selection" from_port="optimization log" to_port="result 3"/>
      <connect from_op="Multiply Feature Set" from_port="output 1" to_op="Apply Feature Set on Training" to_port="feature set"/>
      <connect from_op="Multiply Feature Set" from_port="output 2" to_op="Apply Feature Set on Validation" to_port="feature set"/>
      <connect from_op="Apply Feature Set on Validation" from_port="example set" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Apply Feature Set on Training" from_port="example set" to_op="Decision Tree (2)" to_port="training set"/>
      <connect from_op="Decision Tree (2)" from_port="model" to_op="Apply Model" to_port="model"/>
      <connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="252"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>
Regards,
Lionel
IngoRM Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
Hi Brian,
Thanks for trying out the operator and sharing your experience. The reason why you do not get new features for the "usual" suspects is simply that they do not benefit much from newly generated features. Our approach also prevents so-called "feature bloat", which only adds complexity but no real benefit and is a sure recipe for overfitting.
Another side note: the simple methods (like linear regression etc.) benefit the most. Deep learning or GBT would be able to grasp a lot of the more complex patterns already without generating new features (but may still benefit from feature selection: if not through higher accuracy, then through smaller feature sets and therefore faster training and scoring times).
The process below should give you an idea. In general I find it easier to get good "examples" for regression / function learning.
Hope that helps,
Ingo
<?xml version="1.0" encoding="UTF-8"?><process version="9.2.000-BETA2">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.000-BETA2" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="UTF-8"/>
    <process expanded="true">
      <operator activated="true" class="generate_data" compatibility="9.2.000-BETA2" expanded="true" height="68" name="Generate Data" width="90" x="45" y="136">
        <parameter key="target_function" value="one variable non linear"/>
        <parameter key="number_examples" value="1000"/>
        <parameter key="number_of_attributes" value="1"/>
        <parameter key="attributes_lower_bound" value="-20.0"/>
        <parameter key="attributes_upper_bound" value="25.0"/>
        <parameter key="gaussian_standard_deviation" value="10.0"/>
        <parameter key="largest_radius" value="10.0"/>
        <parameter key="use_local_random_seed" value="true"/>
        <parameter key="local_random_seed" value="1990"/>
        <parameter key="datamanagement" value="double_array"/>
        <parameter key="data_management" value="auto"/>
      </operator>
      <operator activated="true" class="add_noise" compatibility="9.2.000-BETA2" expanded="true" height="103" name="Add Noise" width="90" x="179" y="136">
        <parameter key="return_preprocessing_model" value="false"/>
        <parameter key="create_view" value="false"/>
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="random_attributes" value="3"/>
        <parameter key="label_noise" value="0.01"/>
        <parameter key="default_attribute_noise" value="0.0"/>
        <list key="noise"/>
        <parameter key="offset" value="0.0"/>
        <parameter key="linear_factor" value="1.0"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="9.2.000-BETA2" expanded="true" height="103" name="Multiply" width="90" x="313" y="136"/>
      <operator activated="true" class="model_simulator:automatic_feature_engineering" compatibility="9.2.000-BETA2" expanded="true" height="103" name="Automatic Feature Engineering" width="90" x="447" y="34">
        <parameter key="mode" value="feature selection and generation"/>
        <parameter key="balance for accuracy" value="1.0"/>
        <parameter key="show progress dialog" value="true"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <parameter key="use optimization heuristics" value="true"/>
        <parameter key="maximum generations" value="30"/>
        <parameter key="population size" value="10"/>
        <parameter key="use multi-starts" value="true"/>
        <parameter key="number of multi-starts" value="5"/>
        <parameter key="generations until multi-start" value="10"/>
        <parameter key="use time limit" value="true"/>
        <parameter key="time limit in seconds" value="30"/>
        <parameter key="use subset for generation" value="false"/>
        <parameter key="maximum function complexity" value="5"/>
        <parameter key="use_plus" value="false"/>
        <parameter key="use_diff" value="false"/>
        <parameter key="use_mult" value="true"/>
        <parameter key="use_div" value="true"/>
        <parameter key="reciprocal_value" value="false"/>
        <parameter key="use_square_roots" value="true"/>
        <parameter key="use_exp" value="false"/>
        <parameter key="use_log" value="false"/>
        <parameter key="use_absolute_values" value="true"/>
        <parameter key="use_sgn" value="false"/>
        <parameter key="use_min" value="false"/>
        <parameter key="use_max" value="false"/>
        <process expanded="true">
          <operator activated="true" class="concurrency:cross_validation" compatibility="9.2.000-BETA2" expanded="true" height="145" name="Cross Validation" width="90" x="45" y="34">
            <parameter key="split_on_batch_attribute" value="false"/>
            <parameter key="leave_one_out" value="false"/>
            <parameter key="number_of_folds" value="2"/>
            <parameter key="sampling_type" value="automatic"/>
            <parameter key="use_local_random_seed" value="true"/>
            <parameter key="local_random_seed" value="1990"/>
            <parameter key="enable_parallel_execution" value="true"/>
            <process expanded="true">
              <operator activated="true" class="support_vector_machine_linear" compatibility="9.2.000-BETA2" expanded="true" height="82" name="SVM (2)" width="90" x="45" y="34">
                <parameter key="kernel_cache" value="200"/>
                <parameter key="C" value="100.0"/>
                <parameter key="convergence_epsilon" value="0.001"/>
                <parameter key="max_iterations" value="100000"/>
                <parameter key="scale" value="true"/>
                <parameter key="L_pos" value="1.0"/>
                <parameter key="L_neg" value="1.0"/>
                <parameter key="epsilon" value="0.0"/>
                <parameter key="epsilon_plus" value="0.0"/>
                <parameter key="epsilon_minus" value="0.0"/>
                <parameter key="balance_cost" value="false"/>
                <parameter key="quadratic_loss_pos" value="false"/>
                <parameter key="quadratic_loss_neg" value="false"/>
              </operator>
              <connect from_port="training set" to_op="SVM (2)" to_port="training set"/>
              <connect from_op="SVM (2)" from_port="model" to_port="model"/>
              <portSpacing port="source_training set" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="true" class="apply_model" compatibility="9.2.000-BETA2" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
                <list key="application_parameters"/>
                <parameter key="create_view" value="false"/>
              </operator>
              <operator activated="true" class="performance_regression" compatibility="9.2.000-BETA2" expanded="true" height="82" name="Performance" width="90" x="179" y="34">
                <parameter key="main_criterion" value="relative_error"/>
                <parameter key="root_mean_squared_error" value="false"/>
                <parameter key="absolute_error" value="false"/>
                <parameter key="relative_error" value="true"/>
                <parameter key="relative_error_lenient" value="false"/>
                <parameter key="relative_error_strict" value="false"/>
                <parameter key="normalized_absolute_error" value="false"/>
                <parameter key="root_relative_squared_error" value="false"/>
                <parameter key="squared_error" value="false"/>
                <parameter key="correlation" value="false"/>
                <parameter key="squared_correlation" value="false"/>
                <parameter key="prediction_average" value="false"/>
                <parameter key="spearman_rho" value="false"/>
                <parameter key="kendall_tau" value="false"/>
                <parameter key="skip_undefined_labels" value="true"/>
                <parameter key="use_example_weights" value="true"/>
              </operator>
              <connect from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_test set results" spacing="0"/>
              <portSpacing port="sink_performance 1" spacing="0"/>
              <portSpacing port="sink_performance 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="example set source" to_op="Cross Validation" to_port="example set"/>
          <connect from_op="Cross Validation" from_port="performance 1" to_port="performance sink"/>
          <portSpacing port="source_example set source" spacing="0"/>
          <portSpacing port="sink_performance sink" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="model_simulator:apply_feature_set" compatibility="9.2.000-BETA2" expanded="true" height="82" name="Apply Feature Set" width="90" x="581" y="187">
        <parameter key="handle missings" value="true"/>
        <parameter key="keep originals" value="false"/>
        <parameter key="originals special role" value="true"/>
      </operator>
      <connect from_op="Generate Data" from_port="output" to_op="Add Noise" to_port="example set input"/>
      <connect from_op="Add Noise" from_port="example set output" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Automatic Feature Engineering" to_port="example set in"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Apply Feature Set" to_port="example set"/>
      <connect from_op="Automatic Feature Engineering" from_port="feature set" to_op="Apply Feature Set" to_port="feature set"/>
      <connect from_op="Apply Feature Set" from_port="example set" to_port="result 1"/>
      <connect from_op="Apply Feature Set" from_port="feature set" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="147"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
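Ingo's point that simple learners gain the most from generated features can also be seen outside RapidMiner. The snippet below is only a rough sketch in plain Python, not what the operator does internally: the target y = x*x is a hypothetical stand-in for the "one variable non linear" function, and a one-variable least-squares fit stands in for the linear SVM used in the process. It fits the same simple model once on the original attribute and once on a generated attribute x*x.

```python
import math


def fit_line(feature, target):
    """Ordinary least squares for target ~ intercept + slope * feature."""
    n = len(feature)
    mean_f = sum(feature) / n
    mean_t = sum(target) / n
    cov = sum((f - mean_f) * (t - mean_t) for f, t in zip(feature, target))
    var = sum((f - mean_f) ** 2 for f in feature)
    slope = cov / var
    intercept = mean_t - slope * mean_f
    return intercept, slope


def rmse(feature, target, intercept, slope):
    """Root mean squared error of the fitted line on the given data."""
    n = len(feature)
    squared_errors = ((intercept + slope * f - t) ** 2
                      for f, t in zip(feature, target))
    return math.sqrt(sum(squared_errors) / n)


# Hypothetical data: x spread evenly over [-20, 25] (the bounds used in
# Generate Data); y = x*x is an assumed target, not RapidMiner's function.
xs = [-20 + 45 * i / 199 for i in range(200)]
ys = [x * x for x in xs]

# Simple linear model on the original attribute only.
a_raw, b_raw = fit_line(xs, ys)
rmse_raw = rmse(xs, ys, a_raw, b_raw)

# Same model on a generated attribute (a product, as "use_mult" would allow).
squared = [x * x for x in xs]
a_gen, b_gen = fit_line(squared, ys)
rmse_gen = rmse(squared, ys, a_gen, b_gen)

print(f"RMSE with original attribute x:    {rmse_raw:.3f}")
print(f"RMSE with generated attribute x*x: {rmse_gen:.3f}")
```

The linear fit on x alone leaves a large error, while the same fit on the generated attribute is essentially exact. A learner that can already model the curvature itself would have much less to gain, which matches the remark about deep learning and GBT.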
Answers
To add to my previous post: you can also play with the balance for accuracy parameter.
For example, increasing its value to 1 makes AFE generate a new set of attributes, which increases the overall complexity of the feature set and ultimately
lowers the final "fitness function" (compared with the parameter settings of the process in my previous post).
Regards,
Lionel
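One way to picture the trade-off described above is as a weighted choice between candidate feature sets that differ in complexity and error. The sketch below is only an illustration under that assumption. The thread does not describe the operator's actual selection rule, and the function `pick_feature_set`, the candidate list, and all numbers in it are hypothetical.

```python
def pick_feature_set(candidates, balance_for_accuracy):
    """Pick one candidate by weighting error against complexity.

    balance_for_accuracy is expected in [0, 1]: 1 looks only at error,
    0 looks only at complexity. This is an assumed reading of the
    parameter, not the documented behavior of the AFE operator.
    """
    if not 0.0 <= balance_for_accuracy <= 1.0:
        raise ValueError("balance_for_accuracy must be between 0 and 1")

    errors = [c["error"] for c in candidates]
    complexities = [c["complexity"] for c in candidates]

    def scale(value, values):
        # Min-max scaling so error and complexity are comparable.
        lo, hi = min(values), max(values)
        return 0.0 if hi == lo else (value - lo) / (hi - lo)

    def score(candidate):
        # Lower is better for both error and complexity.
        return (balance_for_accuracy * scale(candidate["error"], errors)
                + (1.0 - balance_for_accuracy)
                * scale(candidate["complexity"], complexities))

    return min(candidates, key=score)


# Hypothetical candidates; attribute names and values are made up.
candidates = [
    {"features": ["a"], "complexity": 1, "error": 0.30},
    {"features": ["a", "b"], "complexity": 2, "error": 0.22},
    {"features": ["a", "a*b", "sqrt(b)"], "complexity": 6, "error": 0.10},
]

for balance in (0.0, 0.5, 1.0):
    chosen = pick_feature_set(candidates, balance)
    print(balance, chosen["features"])
```

With these made-up numbers, a balance of 0 keeps the single original attribute, 0.5 picks the middle candidate, and 1 picks the most complex set with generated attributes and the lowest error, which is the direction of the effect described in the post.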