Querying Teradata takes too long
RapidMinerUser12
Member Posts: 11 Learner I
in Help
Hi all,
I have a very big Teradata database that has more than 35 000 000 rows.
When I query 1 000 000 rows, the Read Database operator executes in 12 s, but when I tried to select all rows, the process ran for more than 40 minutes and I had to stop it.
My question is: is this waiting time normal? If not, how can I shorten it and import all of the data from Teradata into RapidMiner?
I want to do the ETL in RapidMiner.
Thank you in advance.
Best Answer
BalazsBarany Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
Hi,
there are a few tricks you can do; rough SQL sketches follow below.
The first is using LIMIT ... OFFSET ... in SQL. However, this is often inefficient, so talk to your database admins.
Another would be finding a nominal attribute with 10 to 100 possible values and looping over these, selecting the subset with the current value. This works if the distribution in your table is more or less balanced. (No single value covers something like 90 % of the table.)
If you have numeric IDs, you can use a modulo function to only select IDs ending with 0, 1, 2, 3, etc.
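Roughly, the three variants could look like this in the Read Database query (table and column names are invented for illustration; Teradata's dialect may prefer QUALIFY ROW_NUMBER() over LIMIT ... OFFSET, so verify the exact paging syntax with your DBAs):

    -- 1) Paging: read one window of rows per iteration.
    SELECT *
    FROM my_big_table
    QUALIFY ROW_NUMBER() OVER (ORDER BY id) BETWEEN 1 AND 1000000;

    -- 2) Loop over a nominal attribute with a handful of values,
    --    running the query once per value (here 'EU').
    SELECT *
    FROM my_big_table
    WHERE region = 'EU';

    -- 3) Modulo on a numeric ID: ten roughly equal slices (0 .. 9).
    SELECT *
    FROM my_big_table
    WHERE id MOD 10 = 0;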
It would be interesting for the RapidMiner developers if you could benchmark the performance with 1, 5, 10, 15, etc. million rows and post the threshold here at which it becomes impractical. Maybe they could optimize the memory handling with that information.
Regards,
Balázs
Answers
This is probably a memory issue. RapidMiner works by reading complete data sets into the main memory of your computer. If you overwhelm the available memory with data (and it sounds like you're doing that), everything gets slow, e.g. because of swapping.
It's better to process the 35 M rows in batches, as you did with the 1 M rows. You would for example use one of the Loop operators.
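As a minimal sketch of the batch idea (the operator and macro names are assumptions, not a tested process): put Read Database inside a Loop operator that exposes the iteration number as a macro, and reference that macro in the query so each iteration reads the next window of rows:

    -- %{batch} stands for a hypothetical iteration macro (0, 1, 2, ...)
    -- that RapidMiner substitutes before sending the query to Teradata.
    SELECT *
    FROM my_big_table
    QUALIFY ROW_NUMBER() OVER (ORDER BY id)
        BETWEEN %{batch} * 1000000 + 1 AND (%{batch} + 1) * 1000000;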
With this kind of big data, always try to do as much as possible inside the database. It is better at filtering, joining and sorting than a separate in-memory process can ever be.
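For example, instead of pulling all 35 M rows into memory and aggregating there, a pre-aggregated query (columns invented for illustration) lets Teradata do the heavy lifting and returns only a small result set to RapidMiner:

    -- Filtering and aggregation happen inside Teradata;
    -- only the summary rows travel to RapidMiner.
    SELECT customer_id,
           COUNT(*)    AS order_cnt,
           SUM(amount) AS total_amount
    FROM my_big_table
    WHERE order_date >= DATE '2024-01-01'
    GROUP BY customer_id;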
You don't even have to learn SQL for this if you use the In-Database Processing extension.
Regards,
Balázs
Thanks for your swift answer.
We have 256 GB of memory on our machines. The In-Database Processing extension doesn't work with Teradata.
Our requirements are that we do everything, from the ETL onward, in RapidMiner, not with queries.
Can you explain further how we can process the data in batches? We have to have some pointer-like indicator that tells the database where to start the next batch of data.
Thank you in advance.