Get the start and end index of every word in a sentence

Teja_Varanasi · July 2023

Hi, I am trying to generate two new columns that give the start and end index of every word in a sentence. There is only one sentence.
Example: This is an apple
Marked character positions (0-based): 0, 3, 5, 8, 11, 15

Word | Start | End
This | 0 | 3
is | 5 | 6
an | 8 | 9
apple | 11 | 15

Can anyone please help me with this?

BalazsBarany · July 2023

Hi!

You could use the Split operator with a space (or a more complex regular expression for separating the words) to put each word into its own attribute.
Then use Loop Attributes to work on each attribute, determining its start and end position based on its length. Inside the loop you could, for example, use macros to keep track of the position.

Here is a solution:

<?xml version="1.0" encoding="UTF-8"?><process version="9.10.013">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.10.013" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="-1"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="utility:create_exampleset" compatibility="9.10.013" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="34">
        <parameter key="generator_type" value="comma separated text"/>
        <parameter key="number_of_examples" value="100"/>
        <parameter key="use_stepsize" value="false"/>
        <list key="function_descriptions"/>
        <parameter key="add_id_attribute" value="false"/>
        <list key="numeric_series_configuration"/>
        <list key="date_series_configuration"/>
        <list key="date_series_configuration (interval)"/>
        <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="input_csv_text" value="text&#10;This is an apple."/>
        <parameter key="column_separator" value=","/>
        <parameter key="parse_all_as_nominal" value="false"/>
        <parameter key="decimal_point_character" value="."/>
        <parameter key="trim_attribute_names" value="true"/>
      </operator>
      <operator activated="true" class="split" compatibility="9.10.013" expanded="true" height="82" name="Split" width="90" x="246" y="34">
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="nominal"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="file_path"/>
        <parameter key="block_type" value="single_value"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="single_value"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="split_pattern" value=" "/>
        <parameter key="split_mode" value="ordered_split"/>
      </operator>
      <operator activated="true" class="set_macro" compatibility="9.10.013" expanded="true" height="82" name="Set Macro" width="90" x="380" y="34">
        <parameter key="macro" value="counter"/>
        <parameter key="value" value="0"/>
      </operator>
      <operator activated="true" class="concurrency:loop_attributes" compatibility="9.10.013" expanded="true" height="103" name="Loop Attributes" width="90" x="514" y="34">
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="attribute_name_macro" value="attr"/>
        <parameter key="reuse_results" value="false"/>
        <parameter key="enable_parallel_execution" value="false"/>
        <process expanded="true">
          <operator activated="true" class="extract_macro" compatibility="9.10.013" expanded="true" height="68" name="Extract Macro" width="90" x="45" y="34">
            <parameter key="macro" value="word"/>
            <parameter key="macro_type" value="data_value"/>
            <parameter key="statistics" value="average"/>
            <parameter key="attribute_name" value="%{attr}"/>
            <parameter key="example_index" value="1"/>
            <list key="additional_macros"/>
          </operator>
          <operator activated="true" class="generate_macro" compatibility="9.10.013" expanded="true" height="82" name="Calculate end" width="90" x="179" y="34">
            <list key="function_descriptions">
              <parameter key="end" value="eval(%{counter}) + length(%{word}) - 1"/>
            </list>
          </operator>
          <operator activated="true" class="utility:create_exampleset" compatibility="9.10.013" expanded="true" height="68" name="Create ExampleSet (2)" width="90" x="313" y="136">
            <parameter key="generator_type" value="comma separated text"/>
            <parameter key="number_of_examples" value="100"/>
            <parameter key="use_stepsize" value="false"/>
            <list key="function_descriptions"/>
            <parameter key="add_id_attribute" value="false"/>
            <list key="numeric_series_configuration"/>
            <list key="date_series_configuration"/>
            <list key="date_series_configuration (interval)"/>
            <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="input_csv_text" value="Word,Start,End&#10;%{word},%{counter},%{end}"/>
            <parameter key="column_separator" value=","/>
            <parameter key="parse_all_as_nominal" value="false"/>
            <parameter key="decimal_point_character" value="."/>
            <parameter key="trim_attribute_names" value="true"/>
          </operator>
          <operator activated="true" class="generate_macro" compatibility="9.10.013" expanded="true" height="82" name="Count to the next word" width="90" x="514" y="136">
            <list key="function_descriptions">
              <parameter key="counter" value="eval(%{end}) + 2"/>
            </list>
          </operator>
          <connect from_port="input 1" to_op="Extract Macro" to_port="example set"/>
          <connect from_op="Extract Macro" from_port="example set" to_op="Calculate end" to_port="through 1"/>
          <connect from_op="Calculate end" from_port="through 1" to_port="output 1"/>
          <connect from_op="Create ExampleSet (2)" from_port="output" to_op="Count to the next word" to_port="through 1"/>
          <connect from_op="Count to the next word" from_port="through 1" to_port="output 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
          <portSpacing port="sink_output 3" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="append" compatibility="9.10.013" expanded="true" height="82" name="Append" width="90" x="648" y="85">
        <parameter key="datamanagement" value="double_array"/>
        <parameter key="data_management" value="auto"/>
        <parameter key="merge_type" value="all"/>
      </operator>
      <connect from_op="Create ExampleSet" from_port="output" to_op="Split" to_port="example set input"/>
      <connect from_op="Split" from_port="example set output" to_op="Set Macro" to_port="through 1"/>
      <connect from_op="Set Macro" from_port="through 1" to_op="Loop Attributes" to_port="input 1"/>
      <connect from_op="Loop Attributes" from_port="output 2" to_op="Append" to_port="example set 1"/>
      <connect from_op="Append" from_port="merged set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

This is a very limited solution. It expects the separator to be exactly one character. You should remove characters that you don't want to count (e.g. the dot at the end of the sentence) before putting data into this process.

You should be able to take it from here.

Regards,

Balázs

Teja_Varanasi · July 2023

Hi, thank you. Your solution is awesome. What I am actually doing is sending the example set to the NLP Tagger, and I want to get the start and end index of each word there. That is proving difficult. Can you please help with that?

BalazsBarany · July 2023

Hi,

I don't understand your question. Which kind of system is this, what input does it expect?

You already have the solution for getting the start and end index of the words. Do you need to pass those?

Regards,
Balázs

Teja_Varanasi · July 2023

I am working with an NER model, so I used the NLP Tagger. I want to get the NLP Tagger result together with the start index and end index of each word.

rdesai · July 2023

Hi Teja, unfortunately the NLP Tagger doesn't have that functionality implemented inside the operator, but if you like, you can write a short Python script that calls a function to retrieve the word indices. Hope that helps!
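
A minimal sketch of such a function, in case it helps. It assumes the tagger result can be reduced to a list of tokens in sentence order. The actual NLP Tagger output format is not shown in this thread, so the name `align_tokens` and the sample tokens and tags below are hypothetical. Each token is searched for in the original sentence from left to right, so the end index is inclusive and irregular spacing does not matter.

```python
def align_tokens(sentence, tokens):
    """Return (token, start, end) for each token, with an inclusive end index.

    Assumes the tokens appear in the sentence in the given order.
    """
    spans = []
    cursor = 0  # search position; moves past each token once it is found
    for token in tokens:
        start = sentence.find(token, cursor)
        if start == -1:
            raise ValueError(f"token {token!r} not found from position {cursor}")
        end = start + len(token) - 1
        spans.append((token, start, end))
        cursor = end + 1
    return spans


# Hypothetical tagger result: (token, tag) pairs with made-up tags.
tagged = [("This", "O"), ("is", "O"), ("an", "O"), ("apple", "FOOD")]
sentence = "This is an apple"

spans = align_tokens(sentence, [token for token, _ in tagged])
for (token, tag), (_, start, end) in zip(tagged, spans):
    print(token, tag, start, end)
```

Searching from a moving cursor rather than from the start of the sentence keeps a short token such as "is" from being matched inside an earlier word such as "This".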

Mia_Smith · August 2023

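sentence = "This is an apple"
start = 0  # index of the first character of the current word
for word in sentence.split():
    end = start + len(word) - 1  # inclusive end index
    print(word, start, end)
    start = end + 2  # skip the single space before the next word

This will give you the start and end indices for each word, provided the words are separated by exactly one space.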
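
If the text may contain repeated spaces or punctuation, a regular expression can report the positions directly instead of computing them from word lengths. This is only a sketch: the function name `word_spans` is made up for illustration, and the pattern `\w+` treats any run of letters, digits, or underscores as a word, so a word like "don't" would be split at the apostrophe.

```python
import re


def word_spans(sentence):
    """Return (word, start, end) tuples with an inclusive end index."""
    # m.end() is exclusive, so subtract 1 to match the inclusive convention.
    return [(m.group(), m.start(), m.end() - 1)
            for m in re.finditer(r"\w+", sentence)]


for word, start, end in word_spans("This is an apple."):
    print(word, start, end)
```

The trailing dot is not part of any match, so "apple" still ends at index 15.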
