
Analyze two notepad files

Sneimeh Member Posts: 1 Learner I
Dear all,
I am very new to RapidMiner.
I have been asked to compare the similarity of the content of two Notepad files using RapidMiner.
So far I have installed it, put the two Notepad files on the design page, and added the Data to Similarity operator (under Modeling), but I don't know how to connect the two files to the operator or how to see the results. How can I test the similarity of the two files, file1.txt and file2.txt, using the design area, and how do I see the results?

Answers

  • sgenzer Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    hi @Sneimeh I'm sorry no one has chimed in here. Can you please post the notepad files so we can see what they look like?
  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    If the similarity is to be measured line by line then this is hard. If it is to be measured based on the number of words appearing in both documents then this is easy. What you can do, which may be hard for a newbie, is to do something like this (also see the attached process file):



    So, read both documents in (here I simply create them and turn them into a single-attribute example set), then parse the first document into a binary representation (1 if a word appears and 0 if it does not). Next, use the word list from the first document as the starting word list when parsing the second document, again into a binary representation. The second parse then detects only words that appeared in the first document: a 1 means the word appears in both documents, and a 0 means the second document does not contain it. If you want a measure of similarity, add up all the ones to get the number of words that appear in both. Well, this is a starting point :)
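For readers who want to check the idea outside RapidMiner, here is a minimal Python sketch of the same word-overlap logic. It is not the attached process, only an illustration: the helper names `tokenize` and `shared_word_count`, the tokenization rule, and the two sample sentences are all assumptions made for this example.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of distinct words (assumed tokenization)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def shared_word_count(text_a, text_b):
    """Count how many distinct words of text_a also occur in text_b.

    The vocabulary comes from text_a only, mirroring the idea of reusing the
    first document's word list when parsing the second document.
    """
    vocabulary = sorted(tokenize(text_a))
    words_b = tokenize(text_b)
    # Binary vector over the first document's vocabulary: 1 if the word is in text_b.
    presence = [1 if word in words_b else 0 for word in vocabulary]
    return sum(presence), len(vocabulary)

# Hypothetical sample texts standing in for the contents of the two files.
doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the lazy dog sleeps while the fox runs"

shared, total = shared_word_count(doc1, doc2)
print(f"{shared} of {total} words from document 1 also appear in document 2")
```

To try it on real files, the two sample strings could be replaced with the text read from file1.txt and file2.txt, for example via `open("file1.txt", encoding="utf-8").read()`.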

    Jacob


  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    You can also use the Cross Distances operator after putting the documents into example sets, as noted in the first part of @jacobcybulski's solution above. This will indicate whether the documents are the same (the distance will be zero); the higher the value, the farther apart they are. Read the explanation and the tutorial of the Cross Distances operator for more information.
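As a rough illustration of the distance idea (not the operator itself), the Python sketch below encodes each document as a binary word vector over their combined vocabulary and computes the Euclidean distance between the two. The choice of Euclidean distance, the helper names `word_set` and `binary_euclidean_distance`, and the tokenization rule are assumptions for this example; the operator's own settings may differ.

```python
import math
import re

def word_set(text):
    """Lowercase the text and return its set of distinct words (assumed tokenization)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def binary_euclidean_distance(text_a, text_b):
    """Euclidean distance between binary word-presence vectors of two texts."""
    words_a, words_b = word_set(text_a), word_set(text_b)
    # Use the combined vocabulary so both vectors have the same dimensions.
    vocabulary = sorted(words_a | words_b)
    vec_a = [1 if word in words_a else 0 for word in vocabulary]
    vec_b = [1 if word in words_b else 0 for word in vocabulary]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))

# Identical word content gives a distance of zero; differing content gives a larger value.
print(binary_euclidean_distance("same words here", "same words here"))
print(binary_euclidean_distance("alpha beta", "beta gamma"))
```

With this encoding, the distance grows with the number of words that occur in only one of the two documents, which matches the "zero means the same, higher means farther apart" reading above.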

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts