Sample One row within a group
Hi Experts,
I have a table with PatientID, the day of their stay and max vital signs for the day.
I want to create a process that randomly samples one day for each patient.
Table structure:
PatientID  Day Number  Max_Temp  Max_Resp  Max_SBP  Max_HR
ABC        1           98.7      32        90       72
ABC        2           98.8      33        95       75
ABC        3           95        35        90       78
DEF        1           98.7      32        90       72
DEF        2           95        35        90       78
The output of my process should contain one randomly picked day for each patient and look like this:
PatientID  Day Number  Max_Temp  Max_Resp  Max_SBP  Max_HR
ABC        2           98.8      33        95       75
DEF        1           98.7      32        90       72
Methods I have tried:
- I tried the Sample operator with the balance data option, but it requires me to list every PatientID in the parameter list (sample size per class). This is not feasible because there are more than 50,000 patient IDs.
- Using R code (Execute R) would solve this, but I am trying to find out whether there is a way to do it within RapidMiner itself.
I am looking for a more automated method to achieve this in RapidMiner.
Please let me know if you need more info.
Thanks in advance
Answers
You can sort your dataset by a random variable (which you can add, if needed, using "Generate Attributes") and then simply use "Remove Duplicates" to drop records based on the patient ID. This should give you one random day per patient in the resulting dataset.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
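For readers who want to check the logic outside RapidMiner, the same idea (add a random key, sort by it, keep the first row per patient) can be sketched in Python with pandas. This is only an illustration of the approach described above, not the RapidMiner process itself; the function name sample_one_day_per_patient, the helper column _rand, and the seed value are assumptions made for this example, and the data is the sample table from the question.

```python
import numpy as np
import pandas as pd


def sample_one_day_per_patient(df, id_col="PatientID", seed=None):
    """Return one randomly chosen row per value of id_col.

    Hypothetical helper for illustration: attach a random key, sort by it,
    then keep the first row seen for each patient.
    """
    rng = np.random.default_rng(seed)
    # Step 1: add a random variable (analogous to "Generate Attributes").
    keyed = df.assign(_rand=rng.random(len(df)))
    # Step 2: sort by the random variable so row order is random.
    shuffled = keyed.sort_values("_rand")
    # Step 3: keep the first row per patient (analogous to "Remove Duplicates").
    picked = shuffled.drop_duplicates(subset=id_col, keep="first")
    # Tidy up: drop the helper column and restore a readable order.
    return picked.drop(columns="_rand").sort_values(id_col).reset_index(drop=True)


# Sample data copied from the question.
vitals = pd.DataFrame({
    "PatientID": ["ABC", "ABC", "ABC", "DEF", "DEF"],
    "Day Number": [1, 2, 3, 1, 2],
    "Max_Temp": [98.7, 98.8, 95.0, 98.7, 95.0],
    "Max_Resp": [32, 33, 35, 32, 35],
    "Max_SBP": [90, 95, 90, 90, 90],
    "Max_HR": [72, 75, 78, 72, 78],
})

sampled = sample_one_day_per_patient(vitals, seed=42)
print(sampled)
```

Because every row gets an independent uniform random key, each day of a patient's stay has the same chance of ending up first, regardless of how many patients there are. No per-patient parameter list is needed.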
@Telcontar120 - pretty elegant solution! However, why would you want to sort the dataset by a random variable beforehand?
Vladimir
http://whatthefraud.wtf
@kypexin Sorting by a random variable should help ensure it doesn't systematically keep the same day for each patient. (I'm not 100% sure what the internal logic is for removing duplicates, but it might conceivably be related to the order in which the records appear, so if your dataset is sorted by patient/day, that could lead to non-random sampling results.)
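The order-dependence concern can be illustrated with pandas, whose drop_duplicates with keep="first" retains the first row it encounters for each key. This is a sketch of the analogous behaviour in pandas only; it assumes nothing about how RapidMiner's Remove Duplicates operator works internally, which this thread leaves unconfirmed. The variable names and the range of seeds are made up for the example.

```python
import pandas as pd

# Patient/day pairs from the question, already sorted by patient and day.
stays = pd.DataFrame({
    "PatientID": ["ABC", "ABC", "ABC", "DEF", "DEF"],
    "Day Number": [1, 2, 3, 1, 2],
})

# Without shuffling, keep-first always returns the earliest day per patient.
unshuffled = stays.drop_duplicates(subset="PatientID", keep="first")
days_without_shuffle = set(unshuffled["Day Number"])

# With a random row order, different days are kept from run to run.
days_seen_for_abc = set()
for seed in range(200):
    shuffled = stays.sample(frac=1, random_state=seed)  # random row order
    kept = shuffled.drop_duplicates(subset="PatientID", keep="first")
    abc_day = kept.loc[kept["PatientID"] == "ABC", "Day Number"].iloc[0]
    days_seen_for_abc.add(int(abc_day))

print("Days kept without shuffling:", days_without_shuffle)
print("Days kept for ABC across shuffles:", sorted(days_seen_for_abc))
```

If the deduplication step behaves like keep-first, skipping the random sort would select day 1 for every patient, which is exactly the systematic bias the random sort is meant to avoid.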