Count examples in overlapping time frame
Hi RM friends,
My data set consists of ID, a time of admission, a time of discharge. I would like to calculate the number of examples (patients in casu) that are simultaneously admitted. As such a new attribute should provide me the number of patients admitted for each patient (example) during his/her admission time frame.
Thanks a lot!!!
Sven
Best Answers
lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
Hi Sven,
Here is a solution using a Python script:
First, at the beginning of the Python script, you have to set:
- the names of your attributes
- the date format
As a result, you obtain an example set like this:
The process is in the attached file.
Hope this helps,
Regards,
Lionel
lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
You're welcome, Sven.
Keep us informed of the results of your work! (if it's not confidential, of course)
Regards,
Lionel
lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
Hi Sven,
To be honest, I am pessimistic... Here are the results of my investigations and experiments:
I timed the process for 100 examples: it takes 10 seconds (my PC: quad-core, 16 GB RAM).
Your whole dataset has around 69,000 examples, so as a first approximation I would say
that the duration for the whole dataset is (69000*10)/100 = 6900 seconds = around 2 hours. ==> This is obviously not the case.
So I suspect the complexity of this algorithm is proportional to NxN (and not to N), where N is the number of your examples. In that case, the duration for the whole dataset will be:
10s x 690 x 690 = 4,761,000 seconds = 1322.5 hours = 55 days !!!!
That's why the process takes so long...
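The two estimates above can be reproduced with a few lines of Python. This is only a sketch of the extrapolation: the helper name `extrapolate_seconds` is made up for illustration, and the exponent (1 for linear, 2 for quadratic) is the assumption being compared, not something measured.

```python
def extrapolate_seconds(measured_seconds, measured_n, target_n, exponent):
    """Scale a measured runtime to target_n, assuming runtime grows like n**exponent."""
    return measured_seconds * (target_n / measured_n) ** exponent

# Measurement from the post: 10 seconds for 100 examples; full dataset ~69,000 examples.
linear = extrapolate_seconds(10, 100, 69_000, exponent=1)
quadratic = extrapolate_seconds(10, 100, 69_000, exponent=2)

print(f"linear:    {linear:,.0f} s = {linear / 3600:.1f} hours")
print(f"quadratic: {quadratic:,.0f} s = {quadratic / 86400:.1f} days")
```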
Moreover, I remember a thread where a user observed that a Python script runs significantly slower inside RapidMiner than the same script in a Python notebook. ==> So I will try to execute the script directly in a Python notebook to see whether it runs faster.
And finally, to answer your question: the algorithm needs to go through the entire dataset to find the overlaps, so from my point of view it is impossible to run the process in steps...
Regards,
Lionel
BalazsBarany Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
Hi,
this is a problem that you can reduce by executing in batches if you can sort and separate your data.
For example, it doesn't make sense to compare records from different years (unless some patients are in care for years).
Instead of making n^2 comparisons (for a large n), you could make 10 joins of (n / 10)^2 comparisons each, which is n^2 / 10 in total; the more batches you can separate, the larger the saving.
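As a rough illustration of that arithmetic, the comparison counts can be tallied as below. The helper `pairwise_comparisons` is hypothetical and assumes equal-sized, fully independent batches, which real admission data will only approximate; under that assumption, k batches reduce the total by a factor of k.

```python
def pairwise_comparisons(n, batches=1):
    """Comparisons needed for a self-join split into equal, independent batches."""
    per_batch = n // batches
    return batches * per_batch ** 2

# Dataset size taken from the thread (~69,000 examples).
full = pairwise_comparisons(69_000)
batched = pairwise_comparisons(69_000, batches=10)

print(f"one join:  {full:,} comparisons")
print(f"10 joins:  {batched:,} comparisons")
print(f"reduction: {full / batched:.0f}x")
```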
Regards,
Balázs
Answers
PatientId DateOfEntry DateOfExit
Dortmund, Germany
In fact, I want to know for each patient how many patients are in at the time of admission and at discharge. This might give me insight into whether length of stay is related to the number of patients admitted.
Sven
sounds like you are looking for a "generic join" functionality where you can specify the join criterion. In this case, the criterion would be:
a.id <> b.id and overlaps(a.admission, a.discharge, b.admission, b.discharge)
Assuming that both a and b refer to copies of your patient example set (a self join).
Check out this contribution: https://community.rapidminer.com/discussion/33908/generic-join-script
You can write the overlaps() function yourself and specify the above join criterion to self-join your example set, and then aggregate by the a.id to count the number of the joined patients. You'll need to rename the attributes in the second copy before joining.
Here are some example "overlaps" implementations (depends on your requirement, e.g. discharge time could be missing, meaning that the patient is still there):
https://stackoverflow.com/questions/17106670/how-to-check-a-timeperiod-is-overlapping-another-time-period-in-java
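A minimal Python sketch of the join criterion above, for illustration only: it is not the generic join script from the linked contribution. It assumes closed intervals (a patient discharged at the exact moment another is admitted counts as overlapping), no missing discharge times, and one stay per ID. The function names and the sample data are made up.

```python
from datetime import datetime

def overlaps(a_start, a_end, b_start, b_end):
    """True if the closed intervals [a_start, a_end] and [b_start, b_end] share an instant."""
    return a_start <= b_end and b_start <= a_end

def count_overlaps_naive(stays):
    """stays: list of (patient_id, admission, discharge). Returns {patient_id: count}.

    Self-join plus aggregation: every stay is compared with every other stay (n^2).
    """
    counts = {}
    for a_id, a_adm, a_dis in stays:
        counts[a_id] = sum(
            1
            for b_id, b_adm, b_dis in stays
            if a_id != b_id and overlaps(a_adm, a_dis, b_adm, b_dis)
        )
    return counts

# Made-up sample data.
stays = [
    (1, datetime(2024, 1, 1, 8), datetime(2024, 1, 3, 12)),
    (2, datetime(2024, 1, 2, 9), datetime(2024, 1, 2, 18)),
    (3, datetime(2024, 1, 3, 10), datetime(2024, 1, 5, 10)),
    (4, datetime(2024, 1, 6, 0), datetime(2024, 1, 7, 0)),
]
print(count_overlaps_naive(stays))
```

If a missing discharge should mean "still admitted", `overlaps` would need an extra rule for that case.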
Regards,
Balázs
Sven
lionelderkrikor,
Your Python script worked with the example you provided, so I considered this a solution. However, with the dataset I am using, I set "sehid" as the group, "aankomstdt" as the admission datetime and "ontslagdt" as the discharge datetime.
I adapted the variables in the Python script, but the final count is zero for all examples. What am I doing wrong here?
Hopefully you could help me?
Cheers
Sven
More seriously, yes, of course I can help you: in reality, "sehid" is your "id", not the new "group" variable I introduced.
The "group" variable lets you build groups of patients in order to study the overlaps within each group.
If you use the "id" as the "group", the count is indeed zero for all examples...
In your case, you have a priori only one group.
The bad news is that the process I shared raises an error when it is executed with only one group.
The good news is that there is a (far-fetched) workaround.
Give me one hour to build a new version of the process and check that it works with the data you provided.
Sorry for the inconvenience..
See you soon !
Regards,
Lionel
As I said in my previous post, I found a workaround.
It consists of extracting the first example from your dataset, assigning it to group "B",
and appending it to the end of your dataset (so that the whole dataset has 2 groups).
Given that you have a huge dataset and the computation time is very long, I used a Filter Example Range operator
to execute the process on only a fraction of your data to check the execution. A priori the process works fine... and gives more relevant results!
To execute the process on your whole dataset, please remove/disable the last Filter Example Range operator (between
the Append operator and the Execute Python operator).
The working process is in the attached file.
Keep me informed !
Regards,
Lionel
Lionel,
I gave the process a try on the full example range. It has now been running for 1 day and 20 hours, with memory consumption stable at only 1.2 GB over the entire period. What do you think: should I just let it run until it finishes (in that case, how long would it take by your estimate?), or is there a way to run the process in steps?
Cheers
Sven
Hi Lionel and Balazs
Thanks for the reply. I also thought that overlap is only computable in one batch because patient admission is continuous: each split of the dataset risks missing cases that overlap between the subsets. It is interesting to "feel" the impact of dimensionality on calculation time. I am trying to reconstruct how a human brain looks for overlap, and I wonder whether looping over ascending or descending times could reduce the possible combinations. Although theoretically all combinations can overlap, this is only the case if the differences between admission and discharge range between zero and infinity, which is not realistic. Maybe the number of combinations can be reduced by starting from the median or average length of stay, which could already cover x % of the cases in a fraction of the time?
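One way to make the "ascending times" idea concrete is sketched below. It is not something proposed in this thread, just a standard sorting-and-binary-search technique. It assumes closed intervals, admission <= discharge for every stay, and one stay per ID; the function name and sample data are made up. The observation is that the stays overlapping a given patient are all stays admitted by that patient's discharge, minus all stays already discharged before that patient's admission, minus the patient itself.

```python
from bisect import bisect_left, bisect_right

def count_overlaps_sorted(stays):
    """stays: list of (patient_id, admission, discharge). Returns {patient_id: count}.

    Sorting costs O(n log n); each patient then needs only two binary searches.
    """
    admissions = sorted(adm for _, adm, _ in stays)
    discharges = sorted(dis for _, _, dis in stays)
    counts = {}
    for pid, adm, dis in stays:
        admitted_by_my_discharge = bisect_right(admissions, dis)    # admission <= dis
        discharged_before_my_admission = bisect_left(discharges, adm)  # discharge < adm
        # Subtract 1 because the patient's own stay is included in the first term.
        counts[pid] = admitted_by_my_discharge - discharged_before_my_admission - 1
    return counts

# Made-up sample data; plain numbers stand in for timestamps.
sample = [(1, 1, 5), (2, 2, 3), (3, 5, 8), (4, 10, 12)]
print(count_overlaps_sorted(sample))
```

The same function works with datetime values, since it only relies on comparisons.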
Thanks anyway!!!!
Sven
you can create "fuzzy" batches, like taking 10 % of the data, calculating minimum and maximum entry and exit times, and filtering the candidate dataset accordingly. Then you would remove duplicates that inevitably appear in the result.
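A rough Python sketch of this batching idea, under assumptions of my own: stays are sorted by admission, cut into equal-sized batches, and each batch is compared only with candidates whose stay falls inside the batch's minimum-admission to maximum-discharge window. All names are hypothetical. Because each patient is counted within exactly one batch here, this variant does not produce the duplicates that a join-based version would need to remove.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if the closed intervals share at least one instant."""
    return a_start <= b_end and b_start <= a_end

def count_overlaps_batched(stays, n_batches=10):
    """stays: list of (patient_id, admission, discharge). Returns {patient_id: count}."""
    ordered = sorted(stays, key=lambda s: s[1])       # ascending admission time
    size = max(1, -(-len(ordered) // n_batches))      # ceiling division
    counts = {}
    for start in range(0, len(ordered), size):
        batch = ordered[start:start + size]
        lo = min(adm for _, adm, _ in batch)          # earliest admission in batch
        hi = max(dis for _, _, dis in batch)          # latest discharge in batch
        # Only stays touching the window [lo, hi] can overlap a batch member.
        candidates = [s for s in ordered if s[1] <= hi and s[2] >= lo]
        for a_id, a_adm, a_dis in batch:
            counts[a_id] = sum(
                1
                for b_id, b_adm, b_dis in candidates
                if b_id != a_id and overlaps(a_adm, a_dis, b_adm, b_dis)
            )
    return counts

# Made-up sample data; plain numbers stand in for timestamps.
sample = [(1, 1, 5), (2, 2, 3), (3, 5, 8), (4, 10, 12)]
print(count_overlaps_batched(sample, n_batches=2))
```

The saving depends on how narrow the windows are: a single very long stay in a batch widens its window and pulls in many candidates.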
Regards,
Balázs
Hi,
What is your opinion on interlap (https://brentp.github.io/interlap/)?
Regards
Sven
it sounds good (without having enough Python knowledge to check it).
Actually, my first idea was to solve your problem by exporting the data to a PostgreSQL database and joining there. Postgres has an interval datatype and matching operators, with index support. I just didn't post that idea because I assumed you would be easily able to solve it in Studio. Maybe now's the time to try it.
Regards,
Balázs