How to split an attribute when the split pattern depends on a condition?
lionelderkrikor
RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
in Help
Hi,
I'm extracting the usernames from e-mail addresses and I want to split each username at the
separator between the first name and the last name (the separator differs from one username to the next).
For example, here is the initial dataset:
Username
john.doe
John_Doe
I want to obtain the following dataset:
Username_1 Username_2
john doe
John Doe
For this I tried to use the Branch operator, but I encountered an error.
Here is my process:
<?xml version="1.0" encoding="UTF-8"?><process version="9.3.000"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="9.3.000" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="utility:create_exampleset" compatibility="9.3.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="85"> <parameter key="generator_type" value="comma separated text"/> <parameter key="number_of_examples" value="100"/> <parameter key="use_stepsize" value="false"/> <list key="function_descriptions"/> <parameter key="add_id_attribute" value="false"/> <list key="numeric_series_configuration"/> <list key="date_series_configuration"/> <list key="date_series_configuration (interval)"/> <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/> <parameter key="time_zone" value="SYSTEM"/> <parameter key="input_csv_text" value="Username&#10;john.doe&#10;John_Doe"/> <parameter key="column_separator" value=","/> <parameter key="parse_all_as_nominal" value="false"/> <parameter key="decimal_point_character" value="."/> <parameter key="trim_attribute_names" value="true"/> </operator> <operator activated="true" class="multiply" compatibility="9.3.000" expanded="true" height="103" name="Multiply (2)" width="90" x="313" y="85"/> <operator activated="true" breakpoints="before" class="branch" compatibility="9.3.000" expanded="true" height="103" name="Branch" width="90" x="514" y="85"> <parameter key="condition_type" value="expression"/> <parameter key="condition_value" value="[Username]==john.doe"/> <parameter key="expression" value="contains([Username],&quot;.&quot;)==TRUE"/> <parameter key="io_object" value="ANOVAMatrix"/> 
<parameter key="return_inner_output" value="true"/> <process expanded="true"> <operator activated="true" class="multiply" compatibility="9.3.000" expanded="true" height="103" name="Multiply (3)" width="90" x="45" y="238"/> <operator activated="true" class="select_attributes" compatibility="9.3.000" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="238"> <parameter key="attribute_filter_type" value="single"/> <parameter key="attribute" value="Username"/> <parameter key="attributes" value=""/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="attribute_value"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="time"/> <parameter key="block_type" value="attribute_block"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="value_matrix_row_start"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="false"/> </operator> <operator activated="true" breakpoints="before" class="split" compatibility="9.3.000" expanded="true" height="82" name="Split (2)" width="90" x="179" y="136"> <parameter key="attribute_filter_type" value="single"/> <parameter key="attribute" value="Username"/> <parameter key="attributes" value=""/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="nominal"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="file_path"/> <parameter key="block_type" value="single_value"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="single_value"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="false"/> <parameter key="split_pattern" value="[.]"/> <parameter key="split_mode" value="ordered_split"/> </operator> <operator activated="true" class="union" 
compatibility="9.3.000" expanded="true" height="82" name="Union" width="90" x="380" y="136"/> <connect from_port="condition" to_port="input 1"/> <connect from_port="input 1" to_op="Multiply (3)" to_port="input"/> <connect from_op="Multiply (3)" from_port="output 1" to_op="Split (2)" to_port="example set input"/> <connect from_op="Multiply (3)" from_port="output 2" to_op="Select Attributes" to_port="example set input"/> <connect from_op="Select Attributes" from_port="example set output" to_op="Union" to_port="example set 2"/> <connect from_op="Split (2)" from_port="example set output" to_op="Union" to_port="example set 1"/> <connect from_op="Union" from_port="union" to_port="input 2"/> <portSpacing port="source_condition" spacing="0"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="source_input 2" spacing="0"/> <portSpacing port="sink_input 1" spacing="0"/> <portSpacing port="sink_input 2" spacing="0"/> <portSpacing port="sink_input 3" spacing="0"/> </process> <process expanded="true"> <connect from_port="condition" to_port="input 1"/> <connect from_port="input 1" to_port="input 2"/> <portSpacing port="source_condition" spacing="0"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="source_input 2" spacing="0"/> <portSpacing port="sink_input 1" spacing="0"/> <portSpacing port="sink_input 2" spacing="0"/> <portSpacing port="sink_input 3" spacing="0"/> </process> </operator> <operator activated="true" class="branch" compatibility="9.3.000" expanded="true" height="103" name="Branch (2)" width="90" x="648" y="85"> <parameter key="condition_type" value="expression"/> <parameter key="condition_value" value="Username==John_doe"/> <parameter key="expression" value="contains([Username],&quot;_&quot;)==TRUE"/> <parameter key="io_object" value="ANOVAMatrix"/> <parameter key="return_inner_output" value="true"/> <process expanded="true"> <operator activated="true" breakpoints="after" class="split" compatibility="9.3.000" expanded="true" height="82" 
name="Split" width="90" x="179" y="136"> <parameter key="attribute_filter_type" value="single"/> <parameter key="attribute" value="Username"/> <parameter key="attributes" value=""/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="nominal"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="file_path"/> <parameter key="block_type" value="single_value"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="single_value"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="false"/> <parameter key="split_pattern" value="[_]"/> <parameter key="split_mode" value="ordered_split"/> </operator> <connect from_port="condition" to_port="input 1"/> <connect from_port="input 1" to_op="Split" to_port="example set input"/> <connect from_op="Split" from_port="example set output" to_port="input 2"/> <portSpacing port="source_condition" spacing="0"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="source_input 2" spacing="0"/> <portSpacing port="sink_input 1" spacing="0"/> <portSpacing port="sink_input 2" spacing="0"/> <portSpacing port="sink_input 3" spacing="0"/> </process> <process expanded="true"> <connect from_port="condition" to_port="input 1"/> <connect from_port="input 1" to_port="input 2"/> <portSpacing port="source_condition" spacing="0"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="source_input 2" spacing="0"/> <portSpacing port="sink_input 1" spacing="0"/> <portSpacing port="sink_input 2" spacing="0"/> <portSpacing port="sink_input 3" spacing="0"/> </process> </operator> <connect from_op="Create ExampleSet" from_port="output" to_op="Multiply (2)" to_port="input"/> <connect from_op="Multiply (2)" from_port="output 1" to_op="Branch" to_port="condition"/> <connect from_op="Multiply (2)" from_port="output 2" to_op="Branch" to_port="input 1"/> 
<connect from_op="Branch" from_port="input 1" to_op="Branch (2)" to_port="condition"/> <connect from_op="Branch" from_port="input 2" to_op="Branch (2)" to_port="input 1"/> <connect from_op="Branch (2)" from_port="input 2" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> </process> </operator> </process>
Can you help me?
Regards,
Lionel
Best Answers
kayman Member Posts: 662 Unicorn
This looks more like a bug in the Branch operator, as it should recognize the attribute to start with.
As for your issue, why not simply replace all known separator symbols with an underscore using a regex? I'd assume that, apart from the dot, not many separators are commonly used in email addresses. The split can then be done on the underscore for every username.
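The replace-then-split idea above could be sketched as a process like the one below. This is only a minimal sketch, not a tested solution: it assumes the dot and the hyphen are the only separators besides the underscore, and it relies on the Replace operator's `replace_what` and `replace_by` parameters as I remember them from RapidMiner 9.x, so check them against your version. Parameters that are left out fall back to their defaults.

```xml
<?xml version="1.0" encoding="UTF-8"?><process version="9.3.000">
  <operator activated="true" class="process" compatibility="9.3.000" expanded="true" name="Process">
    <process expanded="true">
      <!-- Same two sample usernames as in the question -->
      <operator activated="true" class="utility:create_exampleset" compatibility="9.3.000" expanded="true" name="Create ExampleSet">
        <parameter key="generator_type" value="comma separated text"/>
        <parameter key="input_csv_text" value="Username&#10;john.doe&#10;John_Doe"/>
        <parameter key="column_separator" value=","/>
      </operator>
      <!-- Assumption: dot and hyphen are the only other separators; both become an underscore -->
      <operator activated="true" class="replace" compatibility="9.3.000" expanded="true" name="Replace">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Username"/>
        <parameter key="replace_what" value="[.-]"/>
        <parameter key="replace_by" value="_"/>
      </operator>
      <!-- One Split on the underscore now covers every username, so no Branch is needed -->
      <operator activated="true" class="split" compatibility="9.3.000" expanded="true" name="Split">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Username"/>
        <parameter key="split_pattern" value="_"/>
        <parameter key="split_mode" value="ordered_split"/>
      </operator>
      <connect from_op="Create ExampleSet" from_port="output" to_op="Replace" to_port="example set input"/>
      <connect from_op="Replace" from_port="example set output" to_op="Split" to_port="example set input"/>
      <connect from_op="Split" from_port="example set output" to_port="result 1"/>
    </process>
  </operator>
</process>
```

With `ordered_split`, the Split operator should name the resulting attributes Username_1 and Username_2, which matches the target layout in the question. Since `split_pattern` is itself a regular expression, using `[._-]` directly in Split would likely make the Replace step unnecessary.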
kayman Member Posts: 662 Unicorn
Jacob, this would basically only work if there is exactly one first name and one last name. Granted, having several of these would be a problem anyway, but if you want to split on all of them you need a more flexible pattern.
Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
I think the solution from @kayman is the easiest: since there are only a few common email separators, such as ".", "-" and "_", they can easily be replaced by a single one, which is then used for the split.
Answers
Scott
Thank you for your contributions.
Indeed, @kayman's solution gives good results on my original dataset and solves this problem.
Once again, thank you for spending time on this problem.
Regards,
Lionel