
Hi All, I need to remove duplicates from each cell of an attribute/column. Do we have a quick soluti

Achint Member Posts: 5 Learner I
I am new to RapidMiner and am working on a large project at my company, so I would appreciate your help here.

This is what I need to implement:

Problem:
COLUMN
A|B|V|A|B
C|V|B|C
E|R|T|Y|E

Solution required:
COLUMN
A|B|V
C|V|B
E|R|T|Y

I need a solution like the one above, where the duplicate entries within each cell (separated by "|") are removed.

Appreciate your help on this.
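
For reference, the transformation described above can be sketched in plain Python. This is only an illustration of the expected result, not the RapidMiner process attached in the answers; the function name `dedupe_cell` is hypothetical, and it assumes the first occurrence of each entry should be kept in its original order.

```python
def dedupe_cell(cell, sep="|"):
    """Remove repeated entries from one delimited cell, keeping first occurrences in order."""
    # Leave missing or empty cells untouched (assumption: missings are None or "").
    if cell is None or cell == "":
        return cell
    # dict.fromkeys keeps insertion order, so the first occurrence of each entry wins.
    return sep.join(dict.fromkeys(cell.split(sep)))


column = ["A|B|V|A|B", "C|V|B|C", "E|R|T|Y|E"]
cleaned = [dedupe_cell(value) for value in column]
print(cleaned)
```

With the sample data from the question, `cleaned` matches the "Solution required" values.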

Answers

  • IngoRM Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    edited December 2018
    @lionelderkrikor has been beating me to it :blush:
    But then again, my process handles missing values and works for an arbitrary number of items.  So I consider this an "even" :wink:
  • lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi @IngoRM,

    Let's be objective: I admit my defeat ... ;) Great process!

    I did not think about Generate Aggregation.

    Regards and ... Congratulations

    Lionel
  • IngoRM Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Haha :smiley:  That's how I constantly feel: there is always another operator I did not think of :smiley:
  • Achint Member Posts: 5 Learner I
    @IngoRM: Thanks a lot for the solution, along with the attachments. It's quite a process to follow, but understandable. :)
  • Achint Member Posts: 5 Learner I
    @lionelderkrikor: Thank you as well for your time on this solution. :)
  • Achint Member Posts: 5 Learner I
    edited January 2019
    Hi @IngoRM:

    Hope you are doing well. 

    Thanks for the solution you provided earlier. The example you sent me, with the data and the RMP file, contains only one column in the data file. My data file has multiple columns, but I need to remove duplicates from only one specified column. How do we select that particular attribute for removing the duplicates, so that the loop runs only on that column and not on all of them?
    Please find attached the Excel file with multiple columns. The column to be worked on is highlighted in yellow (the two columns highlighted in orange are concatenated into the new column in yellow).

    Also attached is the RMP file you provided. It would be great if you could provide the change required in it when it is connected to the attached example set.

    Hoping to find a way to make this possible as well from your side.

    Looking forward to hearing from you! Thanks a lot and a happy new year.   

    Regards,
    Achint Kr
  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    You can try using Ingo's solution inside a Loop Attributes operator and restrict it to the relevant attributes, either with the subset selection or, if they share a naming convention, with a regular expression.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • IngoRM Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi again,
    Yes, you could use a loop.  But since you mentioned that you only have one column for which you want to perform the "deduplication", it might be easier just to divide the data into two parts (one with that column only and one with all the other columns).  You can then perform the process above on the selected column and join the results later on.
    Attached is the modified process including some annotations as well as the data this process runs on (it wasn't immediately obvious to me in your data which one is the column you want to work on so I used the original data here again - I am sure you can adapt to your data).
    Hope this helps,
    Ingo
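
For readers without the attachments, the "split, deduplicate one column, join back" idea from the last answer can be sketched roughly in plain Python. This is an assumption-laden illustration rather than the attached RapidMiner process: rows are modelled as dictionaries, the row position stands in for a join key, and the names `dedupe_cell`, `dedupe_column`, `ID`, `COLUMN` and `NOTE` are hypothetical.

```python
def dedupe_cell(cell, sep="|"):
    """Remove repeated entries from one delimited cell, keeping first occurrences in order."""
    if cell is None or cell == "":
        return cell  # leave missing or empty cells untouched
    return sep.join(dict.fromkeys(cell.split(sep)))


def dedupe_column(rows, column, sep="|"):
    """Deduplicate the entries of a single column and leave all other columns as they are."""
    # Split: the target column on one side, all other columns on the other,
    # both keyed by row position so they can be joined again afterwards.
    target = {i: row.get(column) for i, row in enumerate(rows)}
    others = {i: {k: v for k, v in row.items() if k != column}
              for i, row in enumerate(rows)}
    # Process only the selected column.
    cleaned = {i: dedupe_cell(value, sep) for i, value in target.items()}
    # Join: put the cleaned column back next to the untouched columns.
    return [{**others[i], column: cleaned[i]} for i in range(len(rows))]


rows = [
    {"ID": 1, "COLUMN": "A|B|V|A|B", "NOTE": "x|x"},
    {"ID": 2, "COLUMN": "C|V|B|C", "NOTE": "y"},
    {"ID": 3, "COLUMN": None, "NOTE": "z|z"},
]
result = dedupe_column(rows, "COLUMN")
for row in result:
    print(row)
```

Only `COLUMN` is changed here; `NOTE` keeps its repeated entries because it was not selected, and the missing value in the third row stays missing.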