
deleting data based on 2 conditions - not just filter

thesoletraveler Member Posts: 3 Learner III
edited February 2020 in Help

Hi there

 

I have only just started using RapidMiner, so my knowledge is quite basic - apologies.

 

I would like to reduce my data to include only rows that meet specific conditions. When I use the filter operators, they return the data that I want removed. My dilemma is this: I want to delete all rows that have a Yes, or a 1, in the first column and a value greater than 150 in the second column. The filter returns the 80 of the 750 entries that meet both conditions, but I want those rows deleted so that I keep the other 670 entries, not the 80. I hope I'm being clear.

 

Thank you. I have spent hours trying different operators and searching the community and YouTube.

Best Answer

  • lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Solution Accepted

    Hi again @thesoletraveler,

     

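    If I understand correctly, you just have to apply the two filter conditions and check invert filter. With both conditions combined by AND, the inverted filter keeps every row that does not match both of them.

    Does this process meet your needs?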

    <?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="read_csv" compatibility="8.2.000" expanded="true" height="68" name="Read CSV" width="90" x="112" y="85">
    <parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Double_filters\diabetes.csv"/>
    <parameter key="column_separators" value=","/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <parameter key="encoding" value="windows-1252"/>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="Pregnancies.true.integer.attribute"/>
    <parameter key="1" value="Glucose.true.integer.attribute"/>
    <parameter key="2" value="BloodPressure.true.integer.attribute"/>
    <parameter key="3" value="SkinThickness.true.integer.attribute"/>
    <parameter key="4" value="Insulin.true.integer.attribute"/>
    <parameter key="5" value="BMI.true.real.attribute"/>
    <parameter key="6" value="DiabetesPedigreeFunction.true.real.attribute"/>
    <parameter key="7" value="Age.true.integer.attribute"/>
    <parameter key="8" value="Outcome.true.integer.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="filter_examples" compatibility="8.2.000" expanded="true" height="103" name="Filter Examples" width="90" x="246" y="85">
    <parameter key="invert_filter" value="true"/>
    <list key="filters_list">
    <parameter key="filters_entry_key" value="Outcome.eq.1"/>
    <parameter key="filters_entry_key" value="Insulin.gt.150"/>
    </list>
    </operator>
    <connect from_op="Read CSV" from_port="output" to_op="Filter Examples" to_port="example set input"/>
    <connect from_op="Filter Examples" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    I hope it helps,

     

    Regards,

     

    Lionel

Answers

  • lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @thesoletraveler,

     

    Can you share your dataset and explain, with an example, what you want to obtain, please?

     

    Regards,

     

    Lionel

  • thesoletraveler Member Posts: 3 Learner III

    Hi

     

    In the Outcome column of the raw data set you will see either 1 (true) or 0 (false), which I know how to switch. In the Insulin column there are numerous values. Based on literature (and my directive), it is argued that anyone with insulin over 150 is on insulin therapy. So when building a model to predict diabetes, I want to eliminate anyone who is 1 or true (they have diabetes) with an insulin reading above 150, as they already have diabetes and are receiving therapy, and are therefore not appropriate records to include when building my model.

     

    As mentioned previously, I want to delete these 70+ records so that I am left with data from approx. 680 people (records that need additional work in themselves) who may or may not have diabetes and who are not receiving insulin therapy.

     

    I believe it adds far more value to my model if insulin results are more relevant. Thanks in advance for your assistance!

     

    C

  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist

    Hi @thesoletraveler

    Another way to solve this is the expression option of Filter Examples, which is very flexible and powerful.

     

    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • thesoletraveler Member Posts: 3 Learner III

    Thank you so much Lionel. So much to learn in this program! I really appreciate this community and their time which has been helpful for me in my analytics study. All the best.

     

    C
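
The expression option of Filter Examples mentioned above can state the same rule as a single condition. Here is a minimal sketch, not a tested process. It assumes the operator takes `condition_class` set to `expression` and a `parameter_expression` key, as recalled from exported RapidMiner 8.x process XML rather than taken from this thread, so check both in Studio. The attribute names `Outcome` and `Insulin` come from the process in the accepted answer.

```xml
<!-- Sketch only: a possible replacement for the Filter Examples operator.
     The condition_class / parameter_expression keys are assumptions to verify. -->
<operator activated="true" class="filter_examples" compatibility="8.2.000" expanded="true" name="Filter Examples">
  <parameter key="condition_class" value="expression"/>
  <!-- Keep every row that is NOT (Outcome = 1 AND Insulin > 150).
       In XML attribute values, && is written &amp;&amp; and > is written &gt; -->
  <parameter key="parameter_expression" value="!(Outcome == 1 &amp;&amp; Insulin &gt; 150)"/>
  <!-- The negation is already inside the expression, so no inversion here -->
  <parameter key="invert_filter" value="false"/>
  <list key="filters_list"/>
</operator>
```

Because the expression already negates the combined condition, invert filter stays unchecked. Use `&gt;=` instead of `&gt;` if a reading of exactly 150 should also be dropped.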
