Export "Data To Similarity" results to a CSV

ClaraCaba Member Posts: 9 Contributor II
edited November 2018 in Help
Hi!

I am working on text mining in RapidMiner and the following problem has arisen:

I use the Data to Similarity operator from the Text Extension and the "sim" output port gives a table with three columns: one object, another object, and the similarity between them. However, I can't sort or export that result, which I'd love to do, in order to be able to work with that data as a CSV file.

Is there any way to export that table?

Thank you very much!

Answers

  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    Hi,

    You can use the Similarity to Data operator to get an ExampleSet out of it. Afterwards you can store it with any Write operator, for example Write CSV.
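
For readers who want to see the target file shape outside RapidMiner, here is a minimal Python sketch that sorts a pair table and serializes it as CSV. The column names `FIRST_ID`, `SECOND_ID` and `SIMILARITY` and the sample values are assumptions for illustration; this is not what the Write operators do internally.

```python
import csv
import io

# Hypothetical rows shaped like the pair table (column names are assumed).
pairs = [
    {"FIRST_ID": 1, "SECOND_ID": 2, "SIMILARITY": 0.5},
    {"FIRST_ID": 1, "SECOND_ID": 3, "SIMILARITY": 0.8},
]


def pairs_to_csv(rows):
    """Return the pair table as CSV text, sorted by descending similarity."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["FIRST_ID", "SECOND_ID", "SIMILARITY"],
        lineterminator="\n",
    )
    writer.writeheader()
    for row in sorted(rows, key=lambda r: r["SIMILARITY"], reverse=True):
        writer.writerow(row)
    return buffer.getvalue()


csv_text = pairs_to_csv(pairs)
print(csv_text)
```

To write a real file, the same writer could be pointed at `open(path, "w", newline="")` instead of the in-memory buffer.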

    ~Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • ClaraCaba Member Posts: 9 Contributor II
    Thank you very much, that worked perfectly.

    I am facing now another problem, though.

    After using the Similarity to Data operator, I have a dataset with three columns: the first id used for comparison, the second id used for comparison, and the similarity percentage. Now, I would like to combine that information with my original database (which has many attributes). I don't know how to, for example, obtain the rows from my original database where the similarity percentage is greater than 50%. Any idea?

    Thank you in advance.
  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    Hi,

    Use Filter Examples to remove the examples with a similarity below 0.5. Afterwards you can join the result with the original data. If you do not have an ID in the dataset, you can use Generate ID beforehand to add one.
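
The filter-then-join logic can be sketched in plain Python as follows. The column names, the `text` attribute, the `first_` prefix and the sample data are all hypothetical; in RapidMiner the same steps are done with the operators named above.

```python
# Hypothetical original data, keyed by the ID that was added beforehand.
original = {
    1: {"id": 1, "text": "first document"},
    2: {"id": 2, "text": "second document"},
    3: {"id": 3, "text": "third document"},
}

# Hypothetical pair table (column names are assumed).
pairs = [
    {"FIRST_ID": 1, "SECOND_ID": 2, "SIMILARITY": 0.9},
    {"FIRST_ID": 1, "SECOND_ID": 3, "SIMILARITY": 0.2},
]


def filter_and_join(pair_rows, original_rows, threshold=0.5):
    """Drop pairs below the threshold, then attach the attributes of the
    original example referenced by FIRST_ID."""
    joined = []
    for pair in pair_rows:
        if pair["SIMILARITY"] < threshold:
            continue  # remove examples with similarity < threshold
        row = dict(pair)
        source = original_rows[pair["FIRST_ID"]]
        for name, value in source.items():
            if name != "id":
                row["first_" + name] = value
        joined.append(row)
    return joined


result = filter_and_join(pairs, original)
print(result)
```

Joining on `SECOND_ID` as well would work the same way with a second lookup.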

    ~Martin
  • ClaraCaba Member Posts: 9 Contributor II
    Hi,

    Thank you very much!

    However, I have a last question. I have applied Data to Similarity and then Similarity to Data right after, to be able to use the output dataset. But the dataset contains all results duplicated, since I have applied both operators. How could I prevent this from happening? Or how could I get rid of the duplicated results and just keep a row per similarity between two objects?

    Thank you.
  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    Hi,

    As a general tip, consider the Cross Distances operator; it is a bit more flexible.

    For your question:
    Do I understand correctly that each similarity value appears twice, like this?

    ID1  ID2  SIM
    2      1      0.5
    1      2      0.5
    My first idea would be to create a new ID from the two IDs you have, always putting them in the same order. The expression below puts the larger one first, so both directions of a pair yield the same string:

    BigNumber_SmallNumber

    The expression looks like this:

    if([FIRST_ID]>[SECOND_ID],
    concat(str([FIRST_ID]),"_",str([SECOND_ID])),
    concat(str([SECOND_ID]),"_",str([FIRST_ID]))
    )
    Afterwards you can use Remove Duplicates on this new attribute. See the attached process:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="7.0.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.0.001" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="7.0.001" expanded="true" height="68" name="Retrieve Golf" width="90" x="112" y="34">
            <parameter key="repository_entry" value="//Samples/data/Golf"/>
          </operator>
          <operator activated="true" class="data_to_similarity" compatibility="7.0.001" expanded="true" height="82" name="Data to Similarity" width="90" x="246" y="34"/>
          <operator activated="true" class="similarity_to_data" compatibility="7.0.001" expanded="true" height="82" name="Similarity to Data" width="90" x="380" y="34"/>
          <operator activated="true" class="generate_attributes" compatibility="7.0.001" expanded="true" height="82" name="Generate Attributes" width="90" x="514" y="34">
            <list key="function_descriptions">
              <parameter key="IdToRemoveDuplicates" value="if([FIRST_ID]&gt;[SECOND_ID],&#10;&#9;concat(str([FIRST_ID]),&quot;_&quot;,str([SECOND_ID])),&#10;&#9;concat(str([SECOND_ID]),&quot;_&quot;,str([FIRST_ID]))&#10;)"/>
            </list>
            <description align="center" color="transparent" colored="false" width="126">Create an ID to remove the stuff</description>
          </operator>
          <operator activated="true" class="remove_duplicates" compatibility="7.0.001" expanded="true" height="82" name="Remove Duplicates" width="90" x="648" y="34">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="IdToRemoveDuplicates"/>
          </operator>
          <connect from_op="Retrieve Golf" from_port="output" to_op="Data to Similarity" to_port="example set"/>
          <connect from_op="Data to Similarity" from_port="similarity" to_op="Similarity to Data" to_port="similarity"/>
          <connect from_op="Data to Similarity" from_port="example set" to_op="Similarity to Data" to_port="exampleSet"/>
          <connect from_op="Similarity to Data" from_port="exampleSet" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_op="Remove Duplicates" to_port="example set input"/>
          <connect from_op="Remove Duplicates" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
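
The same de-duplication idea can be sketched outside RapidMiner. The following Python is only an illustration under assumed column names and sample rows; it follows the order used in the expression (larger ID first) and keeps the first row seen for each pair.

```python
def pair_key(first_id, second_id):
    """Build an order-independent key for a pair, larger ID first."""
    larger = max(first_id, second_id)
    smaller = min(first_id, second_id)
    return f"{larger}_{smaller}"


def remove_mirrored(rows):
    """Keep only the first row seen for each unordered pair of IDs."""
    seen = set()
    unique = []
    for row in rows:
        key = pair_key(row["FIRST_ID"], row["SECOND_ID"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical mirrored rows, as in the small table above.
rows = [
    {"FIRST_ID": 2, "SECOND_ID": 1, "SIMILARITY": 0.5},
    {"FIRST_ID": 1, "SECOND_ID": 2, "SIMILARITY": 0.5},
]
deduplicated = remove_mirrored(rows)
print(deduplicated)
```

Which ID comes first in the key does not matter, as long as the order is the same for every row.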
  • ClaraCaba Member Posts: 9 Contributor II
    Thank you very very very much!!! :D

    That worked perfectly.
  • sangeet171188 Member Posts: 8 Contributor II

    How can I get the count of similar-looking sets (by text field)? For the set below, I want counts like:

    ABC is a good text ----- 3
    XYZ is great ----------- 2

    FIRST  SECOND  SIMILARITY  textfield
    1      2       1           ABC is a good text
    3      8       1           ABC is a good text
    4      9       1           ABC is a good text
    12     32      1           XYZ is great
    31     77      1           XYZ is great

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Can't you use an Aggregate operator for this?
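
Conceptually, this is a group-by on the text field with a count per group. A minimal Python sketch of that aggregation, using the rows from the question as sample data (the tuple layout is an assumption for illustration):

```python
from collections import Counter

# Rows from the question: (FIRST, SECOND, SIMILARITY, textfield).
rows = [
    (1, 2, 1, "ABC is a good text"),
    (3, 8, 1, "ABC is a good text"),
    (4, 9, 1, "ABC is a good text"),
    (12, 32, 1, "XYZ is great"),
    (31, 77, 1, "XYZ is great"),
]

# Group by the text field and count the rows in each group.
counts = Counter(text for _first, _second, _similarity, text in rows)

for text, count in counts.items():
    print(text, count)
```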

  • sangeet171188 Member Posts: 8 Contributor II

    Thanks Thomas. Results achieved.
