What is the best way to build an incremental load to prepare data?
dllanos wants to know the answer to this question: "I'm trying to understand the process of an incremental load for data that will continue to grow over time."
Answers
If it's from a DB, the best way is to manage a GUID and a timestamp (see the sketch below).
If it's from a web source like Twitter, you will need the max id from the last time you pulled data from that source.
If the data is extracted from files, you will need to keep track of what you have already processed and what is new.
Give us more information about the source of the data and then we can begin to work on the process.
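A minimal sketch of that bookkeeping, assuming a hypothetical `events` table with `id`, `created_at`, and `payload` columns (the names and the sqlite3 backend are illustrative, not from this thread); the same high-water-mark idea covers the Twitter max id and the processed-files cases:

```python
import sqlite3

def incremental_pull(source_conn: sqlite3.Connection, target_conn: sqlite3.Connection):
    """Copy only rows newer than what the target already holds."""
    # 1. High-water mark: the newest timestamp already loaded (None on first run).
    watermark = target_conn.execute(
        "SELECT MAX(created_at) FROM events").fetchone()[0]

    # 2. Pull only the delta; fall back to a full load on the very first run.
    if watermark is None:
        new_rows = source_conn.execute(
            "SELECT id, created_at, payload FROM events").fetchall()
    else:
        new_rows = source_conn.execute(
            "SELECT id, created_at, payload FROM events WHERE created_at > ?",
            (watermark,)).fetchall()

    # 3. Append the delta; a unique GUID/id column makes re-runs idempotent.
    target_conn.executemany(
        "INSERT OR IGNORE INTO events (id, created_at, payload) VALUES (?, ?, ?)",
        new_rows)
    target_conn.commit()
```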
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
There are four kinds of fields that can help us perform an incremental load; here they are from best to worst:
- Creation Date (date type).
- Modification Date (date type).
- Deletion Date (date type).
- ID (numerical).
Here is how to use these approaches (a sketch follows the list):
- If there is a creation date, we can use Aggregate to read the last date on the database and perform INSERT.
- If there is a modification date, we can use the last Modification Date to perform UPDATE.
- If there is a deletion date, we can use the last Deletion Date to perform DELETE. I haven't seen anyone use the deletion date without the other two.
- If there are no dates because the system doesn't have auditing, we can rely on a numerical ID. Note that IDs are created with sequences, and sequences are prone to lack of synchronization (at least in some databases). In that case, the Aggregate is still required, just on the ID field.
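A hedged sketch of those INSERT/UPDATE/DELETE steps as SQL issued from Python; the `src`/`tgt` tables and the `created_at`/`modified_at`/`deleted_at` columns are assumptions for illustration, not names from this thread:

```python
import sqlite3

def sync(conn: sqlite3.Connection):
    # Aggregate step: read the last dates the target already knows about.
    # COALESCE to '' so the very first run loads everything
    # (ISO-8601 date strings compare lexically in SQLite).
    last_created, last_modified = conn.execute(
        "SELECT COALESCE(MAX(created_at), ''), COALESCE(MAX(modified_at), '') FROM tgt"
    ).fetchone()

    # INSERT: everything created since the last load.
    conn.execute(
        """INSERT INTO tgt (id, created_at, modified_at, payload)
           SELECT id, created_at, modified_at, payload
           FROM src WHERE created_at > ?""", (last_created,))

    # UPDATE: rows modified since the last load.
    conn.execute(
        """UPDATE tgt
           SET payload = (SELECT payload FROM src WHERE src.id = tgt.id),
               modified_at = (SELECT modified_at FROM src WHERE src.id = tgt.id)
           WHERE id IN (SELECT id FROM src WHERE modified_at > ?)""",
        (last_modified,))

    # DELETE: rows the source has flagged as removed.
    conn.execute(
        "DELETE FROM tgt WHERE id IN (SELECT id FROM src WHERE deleted_at IS NOT NULL)")
    conn.commit()
    # No audit dates at all? Run the same MAX() aggregate on the numeric ID
    # column instead, keeping in mind that sequences can fall out of sync.
```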
This basically sums up what my great senseis @Telcontar120 and @MarcoBarradas already told you. If this is not what you are looking for, please break this glass ONLY in case of emergency:
Again, this is overkill, as it performs up to three operations per row. I have to do something similar on a little pet project (dear @sgenzer, please insert recorded laughs here): upserts on a table that takes in 4.7 GB of floating-point numbers per hour. I ended up creating the table dynamically and updating a view to point to the latest one, because that was the best way to win back my hours of sleep (a sketch follows below).
Still, this is not a common approach and it is not recommended. Only a seasoned DBA or a totally insane and fearless person would hand this one to you. I am in the second class of people.
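Here is a hedged sketch of that "break glass" trick, again assuming sqlite3 and made-up table names: each batch lands in a fresh table and a view is repointed at it, so nothing large is ever rewritten in place.

```python
import sqlite3
import time

def load_batch(conn: sqlite3.Connection, rows):
    # Each batch gets its own table; we never UPDATE the big existing ones.
    batch_table = f"measurements_{int(time.time())}"
    conn.execute(f"CREATE TABLE {batch_table} (id INTEGER, value REAL)")
    conn.executemany(f"INSERT INTO {batch_table} VALUES (?, ?)", rows)

    # Readers always query the view, so repointing it swaps the new batch
    # in without touching any of the older (large) tables.
    conn.execute("DROP VIEW IF EXISTS measurements_current")
    conn.execute(
        f"CREATE VIEW measurements_current AS SELECT * FROM {batch_table}")
    conn.commit()
```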
Hope this helps,
Rodrigo.