Get the start and end index of every word in a sentence
Hi, I am trying to generate two new columns that give the start and end index of every word in a sentence. There is only one sentence.
Ex:    This is an apple
       |  | |  |  |   |
index: 0  3 5  8  11  15
Word | Start | End
This | 0 | 3
is | 5 | 6
an | 8 | 9
apple | 11 | 15
Can anyone please help me here?
Answers
You could use the Split operator with a space (or a more complex regular expression for separating the words) to put each word into its own attribute.
Then use Loop Attributes to process each attribute, determining its start and end position from the length of the word. Inside the loop you could, for example, use macros to keep track of the current position.
Here is a solution:
<?xml version="1.0" encoding="UTF-8"?><process version="9.10.013"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="9.10.013" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="-1"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="utility:create_exampleset" compatibility="9.10.013" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="34"> <parameter key="generator_type" value="comma separated text"/> <parameter key="number_of_examples" value="100"/> <parameter key="use_stepsize" value="false"/> <list key="function_descriptions"/> <parameter key="add_id_attribute" value="false"/> <list key="numeric_series_configuration"/> <list key="date_series_configuration"/> <list key="date_series_configuration (interval)"/> <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/> <parameter key="time_zone" value="SYSTEM"/> <parameter key="input_csv_text" value="text This is an apple."/> <parameter key="column_separator" value=","/> <parameter key="parse_all_as_nominal" value="false"/> <parameter key="decimal_point_character" value="."/> <parameter key="trim_attribute_names" value="true"/> </operator> <operator activated="true" class="split" compatibility="9.10.013" expanded="true" height="82" name="Split" width="90" x="246" y="34"> <parameter key="attribute_filter_type" value="all"/> <parameter key="attribute" value=""/> <parameter key="attributes" value=""/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="nominal"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="file_path"/> <parameter key="block_type" value="single_value"/> <parameter 
key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="single_value"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="false"/> <parameter key="split_pattern" value=" "/> <parameter key="split_mode" value="ordered_split"/> </operator> <operator activated="true" class="set_macro" compatibility="9.10.013" expanded="true" height="82" name="Set Macro" width="90" x="380" y="34"> <parameter key="macro" value="counter"/> <parameter key="value" value="0"/> </operator> <operator activated="true" class="concurrency:loop_attributes" compatibility="9.10.013" expanded="true" height="103" name="Loop Attributes" width="90" x="514" y="34"> <parameter key="attribute_filter_type" value="all"/> <parameter key="attribute" value=""/> <parameter key="attributes" value=""/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="attribute_value"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="time"/> <parameter key="block_type" value="attribute_block"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="value_matrix_row_start"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="false"/> <parameter key="attribute_name_macro" value="attr"/> <parameter key="reuse_results" value="false"/> <parameter key="enable_parallel_execution" value="false"/> <process expanded="true"> <operator activated="true" class="extract_macro" compatibility="9.10.013" expanded="true" height="68" name="Extract Macro" width="90" x="45" y="34"> <parameter key="macro" value="word"/> <parameter key="macro_type" value="data_value"/> <parameter key="statistics" value="average"/> <parameter key="attribute_name" value="%{attr}"/> <parameter key="example_index" value="1"/> <list key="additional_macros"/> </operator> <operator activated="true" 
class="generate_macro" compatibility="9.10.013" expanded="true" height="82" name="Calculate end" width="90" x="179" y="34"> <list key="function_descriptions"> <parameter key="end" value="eval(%{counter}) + length(%{word}) - 1"/> </list> </operator> <operator activated="true" class="utility:create_exampleset" compatibility="9.10.013" expanded="true" height="68" name="Create ExampleSet (2)" width="90" x="313" y="136"> <parameter key="generator_type" value="comma separated text"/> <parameter key="number_of_examples" value="100"/> <parameter key="use_stepsize" value="false"/> <list key="function_descriptions"/> <parameter key="add_id_attribute" value="false"/> <list key="numeric_series_configuration"/> <list key="date_series_configuration"/> <list key="date_series_configuration (interval)"/> <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/> <parameter key="time_zone" value="SYSTEM"/> <parameter key="input_csv_text" value="Word,Start,End %{word},%{counter},%{end}"/> <parameter key="column_separator" value=","/> <parameter key="parse_all_as_nominal" value="false"/> <parameter key="decimal_point_character" value="."/> <parameter key="trim_attribute_names" value="true"/> </operator> <operator activated="true" class="generate_macro" compatibility="9.10.013" expanded="true" height="82" name="Count to the next word" width="90" x="514" y="136"> <list key="function_descriptions"> <parameter key="counter" value="eval(%{end}) + 2"/> </list> </operator> <connect from_port="input 1" to_op="Extract Macro" to_port="example set"/> <connect from_op="Extract Macro" from_port="example set" to_op="Calculate end" to_port="through 1"/> <connect from_op="Calculate end" from_port="through 1" to_port="output 1"/> <connect from_op="Create ExampleSet (2)" from_port="output" to_op="Count to the next word" to_port="through 1"/> <connect from_op="Count to the next word" from_port="through 1" to_port="output 2"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing 
port="source_input 2" spacing="0"/> <portSpacing port="sink_output 1" spacing="0"/> <portSpacing port="sink_output 2" spacing="0"/> <portSpacing port="sink_output 3" spacing="0"/> </process> </operator> <operator activated="true" class="append" compatibility="9.10.013" expanded="true" height="82" name="Append" width="90" x="648" y="85"> <parameter key="datamanagement" value="double_array"/> <parameter key="data_management" value="auto"/> <parameter key="merge_type" value="all"/> </operator> <connect from_op="Create ExampleSet" from_port="output" to_op="Split" to_port="example set input"/> <connect from_op="Split" from_port="example set output" to_op="Set Macro" to_port="through 1"/> <connect from_op="Set Macro" from_port="through 1" to_op="Loop Attributes" to_port="input 1"/> <connect from_op="Loop Attributes" from_port="output 2" to_op="Append" to_port="example set 1"/> <connect from_op="Append" from_port="merged set" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> </process> </operator> </process>
This is a very limited solution. It expects the separator to be one character. You should remove characters that you don't want to count (e.g. the dot at the end of the sentence) before putting data into this process.
You should be able to take it from here.
Regards,
Balázs
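The two limitations mentioned above (a single-character separator and unwanted characters such as the trailing dot) could also be handled outside the process with a regular expression. The following is only a minimal Python sketch, not part of the RapidMiner process: the function name `word_spans` is made up for illustration, and it assumes a "word" is any run of letters, digits, or underscores (`\w+`), so a word like "don't" would be split in two.

```python
import re


def word_spans(sentence):
    """Return (word, start, end) tuples with 0-based, inclusive indices.

    Hypothetical helper: a word is assumed to be a run of \\w characters,
    so punctuation and any amount of whitespace act as separators.
    """
    spans = []
    for match in re.finditer(r"\w+", sentence):
        # match.end() is exclusive, so subtract 1 to get an inclusive end index
        spans.append((match.group(), match.start(), match.end() - 1))
    return spans


for word, start, end in word_spans("This is an apple."):
    print(word, start, end)
```

Because the positions come from the match objects rather than from a running counter, repeated spaces or punctuation between words do not shift the indices.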
I don't understand your question. Which kind of system is this, and what input does it expect?
You already have the solution for getting the start and end index of the words. Do you need to pass those?
Regards,
Balázs
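sentence = "This is an apple"
start = 0  # index of the first character of the current word
# Assumes the words are separated by exactly one space
for word in sentence.split():
    end = start + len(word) - 1  # inclusive end index
    print(word, start, end)
    start = end + 2  # skip the last character and the single space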
This will give you the start and end indices for each word.
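To end up with the Word, Start, and End columns from the question instead of printed lines, the loop above can be wrapped in a function that returns rows. This is a sketch under the same single-space assumption; the name `word_index_rows` is hypothetical. The slice check confirms that each inclusive index pair really points at its word.

```python
def word_index_rows(sentence):
    """Build one (word, start, end) row per word.

    Hypothetical helper; assumes words are separated by exactly one space.
    Indices are 0-based and the end index is inclusive.
    """
    rows = []
    start = 0
    for word in sentence.split(" "):
        end = start + len(word) - 1
        rows.append((word, start, end))
        start = end + 2  # step over the last character and the space
    return rows


sentence = "This is an apple"
rows = word_index_rows(sentence)

# Sanity check: slicing with the inclusive end index gives back the word.
for word, start, end in rows:
    assert sentence[start:end + 1] == word

# Print the rows in the same layout as the table in the question.
print("Word | Start | End")
for word, start, end in rows:
    print(f"{word} | {start} | {end}")
```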