identify consecutive dates/values
Hi everyone,
I'm trying to use rapidminer to filter examples in an exampleset with at least 3 consecutive dates (see below). So more specifically, I want to filter those ID's that have dates where there is no more than -+ 1 month between two dates.
id | date |
1 | 20/01/2018 |
1 | 21/02/2018 |
2 | 19/01/2018 |
2 | 19/02/2018 |
2 | 21/04/2018 |
2 | 21/05/2018 |
3 | 22/03/2018 |
3 | 22/04/2018 |
3 | 23/05/2018 |
3 | 22/07/2018 |
3 | 23/08/2018 |
3 | 22/09/2018 |
4 | 20/01/2018 |
4 | 21/02/2018 |
4 | 21/03/2018 |
5 | 19/01/2018 |
5 | 20/02/2018 |
5 | 19/05/2018 |
5 | 19/06/2018 |
5 | 20/08/2018 |
5 | 20/09/2018 |
6 | 22/01/2018 |
6 | 21/02/2018 |
6 | 20/04/2018 |
6 | 20/05/2018 |
6 | 20/07/2018 |
6 | 21/08/2018 |
6 | 20/09/2018 |
In order to do that I created a script in where I generated a date_id that was made up of the year + month, and lagged on date_id in order to calculate the difference between two dates (which could not be more than 1). Working with days was not really an option since the transaction was/is not always exactly one month apart. So my data looked more or less like this:
id | date | date_id | date_id-1 | difference |
1 | 20/01/2018 | 20180 | ? | ? |
1 | 21/02/2018 | 20181 | 20180 | 1 |
2 | 19/01/2018 | 20180 | ? | ? |
2 | 19/02/2018 | 20181 | 20180 | 1 |
2 | 21/04/2018 | 20183 | 20181 | 2 |
2 | 21/05/2018 | 20184 | 20183 | 1 |
3 | 22/03/2018 | 20182 | ? | ? |
3 | 22/04/2018 | 20183 | 20182 | 1 |
3 | 23/05/2018 | 20184 | 20183 | 1 |
3 | 22/07/2018 | 20186 | 20184 | 2 |
3 | 23/08/2018 | 20187 | 20186 | 1 |
3 | 22/09/2018 | 20188 | 20187 | 1 |
4 | 20/01/2018 | 20180 | ? | ? |
4 | 21/02/2018 | 20181 | 20180 | 1 |
4 | 21/03/2018 | 20182 | 20181 | 1 |
5 | 19/01/2018 | 20180 | ? | ? |
5 | 20/02/2018 | 20181 | 20180 | 1 |
5 | 19/05/2018 | 20184 | 20181 | 3 |
5 | 19/06/2018 | 20185 | 20184 | 1 |
5 | 20/08/2018 | 20187 | 20185 | 2 |
5 | 20/09/2018 | 20188 | 20187 | 1 |
6 | 22/01/2018 | 20180 | ? | ? |
6 | 21/02/2018 | 20181 | 20180 | 1 |
6 | 20/04/2018 | 20183 | 20181 | 2 |
6 | 20/05/2018 | 20184 | 20183 | 1 |
6 | 20/07/2018 | 20186 | 20184 | 2 |
6 | 21/08/2018 | 20187 | 20186 | 1 |
6 | 20/09/2018 | 20188 | 20187 | 1 |
So in my example ID's 4 en 6 should be included. My plan was to use an aggregate function on 'difference' and compare them to the number of examples in each loop on ID, to idenify consecutive dates (because the aggregate would be equal to the number of examples -1, since every first row is empty). However, this method is only good for identifying cases that have only exactly three dates that are consecutive dates, or cases where every date is consecutive. So, I'm a bit at loss of what I could do to make this work (maybe I'm making things needlessly complicated, but I have no knowledge of a function in RM that allows to identify consecutive values or dates).
Lise
Answers
I think you have overcomplicated this. There is a datediff function in RapidMiner that you can use to do something very similar. Just lag the original date (not your polynominal id version) and then use the datediff function, and you can then convert it to whatever unit you want (it's calculated in milliseconds) and filter for whatever period gap you want (e.g., 1 month).
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts