Need help automating large demographic datasets into more manageable variables for SPSS

James_julian · January 2024

Hi,

Is there a way to shorten the titles of long demographic variables in an excel spreadsheet to make them more concise? I would need to analyze them in SPSS. Would I be able to automate this process or shorten the manual labour that's required? Unfortunately, I'm not able to share the data here.

Thank you

rjones13 · January 2024

Hi @James_julian,

Do you have a desired way of shortening the names? The simplest option would be to rename with generic names (att1, att2, and so on) but there would be ways to do more complex rules such as taking the first n characters. If you’re interested I can also comment if the subsequent analysis you’d like to do is possible in RapidMiner.

Best,
Roland

James_julian · January 2024

Hi rjones13,

Well, the names should capture the basic meaning of the variable but be kept as short as possible. For example the variable "Number of persons in semi-detached households" can be shortened to "persons_semi_HH". I'm very new to using RapidMiner. Could you please elaborate on what you mean with the simple version and steps necessary? I'd appreciate if you could also explain a bit about the complex rules.

Thanks

rjones13 · January 2024

Hi @James_julian,

An example of a simple implementation is attached at the bottom here. It takes the first 12 characters of an attribute name as it renames. There's also error handling in case the attribute name is shorter than 10 characters, or there's subsequently duplicate attribute names (which isn't allowed in RapidMiner).

For a more complex approach which takes into account the meaning of the names (which it sounds like you may need based off your example), I would perhaps think about leveraging the new Generative Models extension. You could extract the variable names as a column, transform the names into something shorter but still meaningful, and then rename based off this transformed list. However, this would perhaps be quite a complex solution and it may be quicker to manually rename them. Could you give an idea as to how many variables you're looking to transform?

Best,
Roland

<?xml version="1.0" encoding="UTF-8"?><process version="10.3.000">

  <context>

    <input/>

    <output/>

    <macros/>

  </context>

  <operator activated="true" class="process" compatibility="10.3.000" expanded="true" name="Process">

    <parameter key="logverbosity" value="init"/>

    <parameter key="random_seed" value="2001"/>

    <parameter key="send_mail" value="never"/>

    <parameter key="notification_email" value=""/>

    <parameter key="process_duration_for_mail" value="30"/>

    <parameter key="encoding" value="UTF-8"/>

    <process expanded="true">

      <operator activated="true" breakpoints="after" class="retrieve" compatibility="10.3.000" expanded="true" height="68" name="Retrieve Golf" width="90" x="45" y="34">

        <parameter key="repository_entry" value="//Samples/data/Golf"/>

      </operator>

      <operator activated="true" class="concurrency:loop_attributes" compatibility="10.3.000" expanded="true" height="82" name="Loop Attributes" width="90" x="179" y="34">

        <parameter key="attribute_filter_type" value="all"/>

        <parameter key="attribute" value=""/>

        <parameter key="attributes" value=""/>

        <parameter key="use_except_expression" value="false"/>

        <parameter key="value_type" value="attribute_value"/>

        <parameter key="use_value_type_exception" value="false"/>

        <parameter key="except_value_type" value="time"/>

        <parameter key="block_type" value="attribute_block"/>

        <parameter key="use_block_type_exception" value="false"/>

        <parameter key="except_block_type" value="value_matrix_row_start"/>

        <parameter key="invert_selection" value="false"/>

        <parameter key="include_special_attributes" value="false"/>

        <parameter key="attribute_name_macro" value="loop_attribute"/>

        <parameter key="reuse_results" value="true"/>

        <parameter key="enable_parallel_execution" value="true"/>

        <process expanded="true">

          <operator activated="true" class="handle_exception" compatibility="10.3.000" expanded="true" height="82" name="Handle Exception" width="90" x="112" y="34">

            <parameter key="add_details_to_log" value="false"/>

            <process expanded="true">

              <operator activated="true" class="generate_macro" compatibility="10.3.000" expanded="true" height="82" name="Generate Macro" width="90" x="45" y="34">

                <list key="function_descriptions">

                  <parameter key="new_name" value="cut(%{loop_attribute},0,4)"/>

                </list>

              </operator>

              <operator activated="true" class="blending:rename" compatibility="10.3.000" expanded="true" height="82" name="Rename" width="90" x="179" y="34">

                <list key="rename attributes">

                  <parameter key="%{loop_attribute}" value="%{new_name}"/>

                </list>

                <parameter key="from_attribute" value=""/>

                <parameter key="to_attribute" value=""/>

              </operator>

              <connect from_port="in 1" to_op="Generate Macro" to_port="through 1"/>

              <connect from_op="Generate Macro" from_port="through 1" to_op="Rename" to_port="example set input"/>

              <connect from_op="Rename" from_port="example set output" to_port="out 1"/>

              <portSpacing port="source_in 1" spacing="0"/>

              <portSpacing port="source_in 2" spacing="0"/>

              <portSpacing port="sink_out 1" spacing="0"/>

              <portSpacing port="sink_out 2" spacing="0"/>

            </process>

            <process expanded="true">

              <connect from_port="in 1" to_port="out 1"/>

              <portSpacing port="source_in 1" spacing="0"/>

              <portSpacing port="source_in 2" spacing="0"/>

              <portSpacing port="sink_out 1" spacing="0"/>

              <portSpacing port="sink_out 2" spacing="0"/>

            </process>

          </operator>

          <connect from_port="input 1" to_op="Handle Exception" to_port="in 1"/>

          <connect from_op="Handle Exception" from_port="out 1" to_port="output 1"/>

          <portSpacing port="source_input 1" spacing="0"/>

          <portSpacing port="source_input 2" spacing="0"/>

          <portSpacing port="sink_output 1" spacing="0"/>

          <portSpacing port="sink_output 2" spacing="0"/>

        </process>

      </operator>

      <connect from_op="Retrieve Golf" from_port="output" to_op="Loop Attributes" to_port="input 1"/>

      <connect from_op="Loop Attributes" from_port="output 1" to_port="result 1"/>

      <portSpacing port="source_input 1" spacing="0"/>

      <portSpacing port="sink_result 1" spacing="0"/>

      <portSpacing port="sink_result 2" spacing="0"/>

    </process>

  </operator>

</process>

James_julian · January 2024

Hi Rjones13,

Thanks for clarifying the approaches you mentioned. Hmm, that's what I feared with the more complex way, as I'm guessing it would include more complex programming as well.

The number of variables depends on each group I'm looking at. Some groups have around 50-100 variables, where others have several hundred.

Thanks

James_julian · January 2024

Hi, @rjones13

Would you mind elaborating on the complex solution please? I may be able to get some help with the programming it requires.

Thanks!

rjones13 · January 2024

Hi @James_julian,

I’ll look to build a small example. It might take a couple of days if that’s okay?

Best,
Roland

James_julian · January 2024

HI Roland,

Yes, that is fine thanks.

If you need me to PM you with some abbreviations to consider I can do that if it will help.

rjones13 · January 2024

Hi @James_julian,

Would the following process work for you? You'll need to install the Generative Models extension - see the steps here. Then replace the Titanic dataset with your dataset. Of course, further processing can be done to modify the output once it has been abbreviated. I have used a Summarization model which seemed to give reasonable results, but I was also debating using Text2Text Generation - have a play around here as I may have missed a more suitable model on Huggingface.

Let me know how this works for you. Always happy to explore if you'd like to perform subsequent analysis in RapidMiner.

Best,

Roland

<?xml version="1.0" encoding="UTF-8"?><process version="10.3.000">

  <context>

    <input/>

    <output/>

    <macros/>

  </context>

  <operator activated="true" class="process" compatibility="10.3.000" expanded="true" name="Process">

    <parameter key="logverbosity" value="init"/>

    <parameter key="random_seed" value="2001"/>

    <parameter key="send_mail" value="never"/>

    <parameter key="notification_email" value=""/>

    <parameter key="process_duration_for_mail" value="30"/>

    <parameter key="encoding" value="UTF-8"/>

    <process expanded="true">

      <operator activated="true" class="retrieve" compatibility="10.3.000" expanded="true" height="68" name="Retrieve Titanic" width="90" x="45" y="34">

        <parameter key="repository_entry" value="//Samples/data/Titanic"/>

        <description align="center" color="transparent" colored="false" width="126">Replace this with your dataset</description>

      </operator>

      <operator activated="true" class="transpose" compatibility="10.3.000" expanded="true" height="82" name="Transpose" width="90" x="179" y="34"/>

      <operator activated="true" class="rename_by_generic_names" compatibility="10.3.000" expanded="true" height="82" name="Rename by Generic Names" width="90" x="514" y="289">

        <parameter key="attribute_filter_type" value="all"/>

        <parameter key="attribute" value=""/>

        <parameter key="attributes" value=""/>

        <parameter key="use_except_expression" value="false"/>

        <parameter key="value_type" value="attribute_value"/>

        <parameter key="use_value_type_exception" value="false"/>

        <parameter key="except_value_type" value="time"/>

        <parameter key="block_type" value="attribute_block"/>

        <parameter key="use_block_type_exception" value="false"/>

        <parameter key="except_block_type" value="value_matrix_row_start"/>

        <parameter key="invert_selection" value="false"/>

        <parameter key="include_special_attributes" value="false"/>

        <parameter key="generic_name_stem" value="att"/>

      </operator>

      <operator activated="true" class="blending:select_attributes" compatibility="10.3.000" expanded="true" height="82" name="Select column names" width="90" x="313" y="34">

        <parameter key="type" value="include attributes"/>

        <parameter key="attribute_filter_type" value="one attribute"/>

        <parameter key="select_attribute" value="id"/>

        <parameter key="select_subset" value=""/>

        <parameter key="also_apply_to_special_attributes_(id,_label..)" value="true"/>

        <description align="center" color="transparent" colored="false" width="126"/>

      </operator>

      <operator activated="true" class="generative_models:summarization" compatibility="1.0.000" expanded="true" height="82" name="Summarization" origin="GENERATED_TUTORIAL" width="90" x="447" y="34">

        <parameter key="conda_environment" value="rm_genai"/>

        <parameter key="use_local_model" value="false"/>

        <parameter key="model" value="Falconsai/text_summarization"/>

        <parameter key="name" value="summary"/>

        <parameter key="prompt" value="[[id]]"/>

        <parameter key="min_target_length" value="2"/>

        <parameter key="max_target_length" value="5"/>

        <parameter key="device" value="Automatic"/>

        <parameter key="device_indices" value="0"/>

        <parameter key="temperature" value="1.0"/>

        <description align="center" color="transparent" colored="false" width="126">Apply the model to the prompts and deliver the results</description>

      </operator>

      <operator activated="true" class="blending:select_attributes" compatibility="10.3.000" expanded="true" height="82" name="Drop original" width="90" x="45" y="289">

        <parameter key="type" value="include attributes"/>

        <parameter key="attribute_filter_type" value="one attribute"/>

        <parameter key="select_attribute" value="summary"/>

        <parameter key="select_subset" value=""/>

        <parameter key="also_apply_to_special_attributes_(id,_label..)" value="true"/>

      </operator>

      <operator activated="true" class="transpose" compatibility="10.3.000" expanded="true" height="82" name="Transpose back" width="90" x="179" y="289"/>

      <operator activated="true" class="blending:select_attributes" compatibility="10.3.000" expanded="true" height="82" name="Drop id" width="90" x="313" y="289">

        <parameter key="type" value="exclude attributes"/>

        <parameter key="attribute_filter_type" value="one attribute"/>

        <parameter key="select_attribute" value="id"/>

        <parameter key="select_subset" value=""/>

        <parameter key="also_apply_to_special_attributes_(id,_label..)" value="true"/>

      </operator>

      <operator activated="true" class="rename_by_generic_names" compatibility="10.3.000" expanded="true" height="82" name="Rename by Generic Names (2)" width="90" x="514" y="187">

        <parameter key="attribute_filter_type" value="all"/>

        <parameter key="attribute" value=""/>

        <parameter key="attributes" value=""/>

        <parameter key="use_except_expression" value="false"/>

        <parameter key="value_type" value="attribute_value"/>

        <parameter key="use_value_type_exception" value="false"/>

        <parameter key="except_value_type" value="time"/>

        <parameter key="block_type" value="attribute_block"/>

        <parameter key="use_block_type_exception" value="false"/>

        <parameter key="except_block_type" value="value_matrix_row_start"/>

        <parameter key="invert_selection" value="false"/>

        <parameter key="include_special_attributes" value="false"/>

        <parameter key="generic_name_stem" value="att"/>

      </operator>

      <operator activated="true" class="numerical_to_polynominal" compatibility="10.3.000" expanded="true" height="82" name="Numerical to Polynominal" width="90" x="648" y="289">

        <parameter key="attribute_filter_type" value="all"/>

        <parameter key="attribute" value=""/>

        <parameter key="attributes" value=""/>

        <parameter key="use_except_expression" value="false"/>

        <parameter key="value_type" value="numeric"/>

        <parameter key="use_value_type_exception" value="false"/>

        <parameter key="except_value_type" value="real"/>

        <parameter key="block_type" value="value_series"/>

        <parameter key="use_block_type_exception" value="false"/>

        <parameter key="except_block_type" value="value_series_end"/>

        <parameter key="invert_selection" value="false"/>

        <parameter key="include_special_attributes" value="false"/>

      </operator>

      <operator activated="true" class="append" compatibility="10.3.000" expanded="true" height="103" name="Append" width="90" x="648" y="34">

        <parameter key="datamanagement" value="double_array"/>

        <parameter key="data_management" value="auto"/>

        <parameter key="merge_type" value="all"/>

      </operator>

      <operator activated="true" class="blending:rename_by_example_values" compatibility="10.3.000" expanded="true" height="82" name="Rename by Example Values" width="90" x="782" y="34">

        <parameter key="row_number" value="1"/>

      </operator>

      <operator activated="true" class="parse_numbers" compatibility="10.3.000" expanded="true" height="82" name="Parse Numbers" width="90" x="916" y="34">

        <parameter key="attribute_filter_type" value="all"/>

        <parameter key="attribute" value=""/>

        <parameter key="attributes" value=""/>

        <parameter key="use_except_expression" value="false"/>

        <parameter key="value_type" value="nominal"/>

        <parameter key="use_value_type_exception" value="false"/>

        <parameter key="except_value_type" value="file_path"/>

        <parameter key="block_type" value="single_value"/>

        <parameter key="use_block_type_exception" value="false"/>

        <parameter key="except_block_type" value="single_value"/>

        <parameter key="invert_selection" value="false"/>

        <parameter key="include_special_attributes" value="false"/>

        <parameter key="decimal_character" value="."/>

        <parameter key="grouped_digits" value="false"/>

        <parameter key="grouping_character" value=","/>

        <parameter key="infinity_representation" value=""/>

        <parameter key="unparsable_value_handling" value="skip attribute"/>

      </operator>

      <connect from_op="Retrieve Titanic" from_port="output" to_op="Transpose" to_port="example set input"/>

      <connect from_op="Transpose" from_port="example set output" to_op="Select column names" to_port="example set input"/>

      <connect from_op="Transpose" from_port="original" to_op="Rename by Generic Names" to_port="example set input"/>

      <connect from_op="Rename by Generic Names" from_port="example set output" to_op="Numerical to Polynominal" to_port="example set input"/>

      <connect from_op="Select column names" from_port="example set output" to_op="Summarization" to_port="input 1"/>

      <connect from_op="Summarization" from_port="result 1" to_op="Drop original" to_port="example set input"/>

      <connect from_op="Drop original" from_port="example set output" to_op="Transpose back" to_port="example set input"/>

      <connect from_op="Transpose back" from_port="example set output" to_op="Drop id" to_port="example set input"/>

      <connect from_op="Drop id" from_port="example set output" to_op="Rename by Generic Names (2)" to_port="example set input"/>

      <connect from_op="Rename by Generic Names (2)" from_port="example set output" to_op="Append" to_port="example set 1"/>

      <connect from_op="Numerical to Polynominal" from_port="example set output" to_op="Append" to_port="example set 2"/>

      <connect from_op="Append" from_port="merged set" to_op="Rename by Example Values" to_port="example set input"/>

      <connect from_op="Rename by Example Values" from_port="example set output" to_op="Parse Numbers" to_port="example set input"/>

      <connect from_op="Parse Numbers" from_port="example set output" to_port="result 1"/>

      <portSpacing port="source_input 1" spacing="0"/>

      <portSpacing port="sink_result 1" spacing="0"/>

      <portSpacing port="sink_result 2" spacing="0"/>

    </process>

  </operator>

</process>

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

Need help automating large demographic datasets into more manageable variables for SPSS

Answers