Performance Issues
Good morning,
I am having serious performance issues when handling large databases. I configured an Oracle connection, and when I try to filter a table (about 15 million rows) by a simple attribute, I get the "Retrieve [tablename]..." message and it just does nothing. When I access the table with a standard SQL editor, I get the results almost instantly. I don't know if this is common when handling such large volumes of data, or if it should just work.
Thanks in advance.
Best Answers
Edin_Klapic Employee-RapidMiner, RMResearcher, Member Posts: 299 RM Data Scientist
Hi @agurruchaga,
From your problem description, which states that the process outputs "Retrieve [tablename]", I assume you are using the Retrieve operator and apply the Filter Examples operator afterwards. The Retrieve operator always fetches the full dataset; the filters are only applied afterwards.
Thus, I suggest using the Read Database operator and writing the SQL query with your desired filter condition in it. This way the filtering is executed directly in the database, and only the relevant Examples are loaded into RapidMiner.
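To illustrate the difference, here is a minimal sketch using Python's `sqlite3` as a stand-in for the Oracle connection; the table and column names are illustrative, not from the original process:

```python
import sqlite3

# Stand-in for the large Oracle table; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bigtable (id INTEGER, status TEXT)")
conn.executemany("INSERT INTO bigtable VALUES (?, ?)",
                 [(i, "open" if i % 2 else "closed") for i in range(10_000)])

# What Retrieve + Filter Examples effectively does: fetch everything,
# then filter on the client side.
all_rows = conn.execute("SELECT id, status FROM bigtable").fetchall()
filtered_in_client = [r for r in all_rows if r[1] == "open"]

# What Read Database with a WHERE clause does: the database applies
# the filter, so only matching rows are transferred.
filtered_in_db = conn.execute(
    "SELECT id, status FROM bigtable WHERE status = ?", ("open",)
).fetchall()

assert filtered_in_client == filtered_in_db
```

Both paths return the same rows, but the second one never moves the non-matching millions of rows out of the database.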
Best,
Edin
BalazsBarany Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
Hi!
No need to create a temporary table; you can use SQL's IN clause.
You'd aggregate the IDs from your Excel table using the Aggregate operator and the concatenation function. Then you would use Replace to change the separator to a comma.
For example, you get a value like "13,25,45,133". You then use Extract Macro to convert this to a macro named "filter" that you can use in the SQL query.
The SQL query would look like this:
SELECT ...
FROM bigtable
WHERE key IN (%{filter})
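The same Aggregate-then-substitute idea can be sketched in Python with `sqlite3` standing in for Oracle; the table name, key values, and the `filter_macro` variable are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bigtable (key INTEGER, val TEXT)")
conn.executemany("INSERT INTO bigtable VALUES (?, ?)",
                 [(i, f"row{i}") for i in range(200)])

# Key values from the small Excel table (illustrative).
ids = [13, 25, 45, 133]

# Equivalent of Aggregate (concatenation) + Replace: one
# comma-separated string, like the %{filter} macro.
filter_macro = ",".join(str(i) for i in ids)   # "13,25,45,133"

# The macro text is substituted into the query, as %{filter} would be.
query = f"SELECT key, val FROM bigtable WHERE key IN ({filter_macro})"
rows = conn.execute(query).fetchall()
```

Only the rows whose key appears in the list come back, so the 15-million-row table is never loaded into the client.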
Regards,
Balázs
Answers
I am interested in any of the developers' comments on this issue. Ever since the "Stream Database" operator was deprecated, I have also had problems with large datasets. Are there any suggested remedies or plans to bring back "Stream Database"?
Hello Edin,
This worked like a charm! Thank you!
Greetings.
Hello again,
I was thinking about a similar but not identical situation: imagine I import a 100-row Excel sheet into RapidMiner that I want to join with a table (15,000,000 rows) in my DB. If I want to perform a join, the first step is to retrieve the database table, so I would have the same problem I stated before, but I can't use SQL in this case since the 100-row table is not part of the DB. Is there a similar operator that allows joining without having to retrieve the entire DB?
Thanks!
Hi,
the solution to the second problem is actually very similar to the first one. You can use Read Excel and Write Database to import your Excel sheet into the DB. Then you can simply join with an SQL command.
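A minimal sketch of that idea, again with `sqlite3` standing in for Oracle and illustrative table names and values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for the existing 15M-row table (names illustrative).
conn.execute("CREATE TABLE bigtable (key INTEGER, val TEXT)")
conn.executemany("INSERT INTO bigtable VALUES (?, ?)",
                 [(i, f"row{i}") for i in range(1000)])

# Stand-in for the 100-row Excel sheet, written into the same DB
# (the effect of Read Excel + Write Database).
excel_rows = [(13, "a"), (25, "b"), (45, "c")]
conn.execute("CREATE TABLE excel_keys (key INTEGER, label TEXT)")
conn.executemany("INSERT INTO excel_keys VALUES (?, ?)", excel_rows)

# The join now runs entirely inside the database; only the
# matching rows are returned to the client.
joined = conn.execute(
    "SELECT b.key, b.val, e.label "
    "FROM bigtable b JOIN excel_keys e ON b.key = e.key"
).fetchall()
```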
Best,
Sebastian
Hi Sebastian,
I thought about that solution, but the problem is that I do not have privileges to write to that database.
I will have to think about another solution.
Thanks!
You might be able to create a temporary table.
http://devzone.advantagedatabase.com/dz/webhelp/advantage9.0/adssql/using_temporary_tables_in_sql_statements.htm
To be honest, I'm not sure how to implement this solution (maybe @BalazsBarany has a better idea). My guess is that you could convert the Excel table to a CREATE TABLE statement, use the Execute SQL operator to create the temporary table, perform the join, and finally retrieve the data.
If you make it work, I'll be very interested in seeing the solution.
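That guess can be sketched as follows, with `sqlite3` standing in for Oracle; note that temporary tables typically require no write privilege on the main schema and disappear when the session ends, though the exact rules vary by database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for the large read-only table (names illustrative).
conn.execute("CREATE TABLE bigtable (key INTEGER, val TEXT)")
conn.executemany("INSERT INTO bigtable VALUES (?, ?)",
                 [(i, f"row{i}") for i in range(1000)])

# Session-scoped temporary table holding the Excel keys; this is
# what the Execute SQL operator would create.
conn.execute("CREATE TEMPORARY TABLE excel_keys (key INTEGER)")
conn.executemany("INSERT INTO excel_keys VALUES (?)",
                 [(13,), (25,), (45,)])

# Join against the temporary table inside the database, then
# retrieve only the result.
joined = conn.execute(
    "SELECT b.key, b.val FROM bigtable b "
    "JOIN excel_keys e ON b.key = e.key"
).fetchall()
```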
Sorry for the late reply. This solution is super: fast and easy to use. Thank you very much, guys!
Edit: It seems my Oracle DB has a 1,000-element limit on IN lists, so this might not be a solution in some cases.
Greetings
Hi,
you can use Loop Batches or something similar in that case ;-) Then you would make as many queries as necessary and union or append the results.
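The batching idea can be sketched in Python (with `sqlite3` standing in for Oracle); the batch size of 1,000 matches Oracle's IN-list limit, and the key values are interpolated as literals here purely for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bigtable (key INTEGER)")
conn.executemany("INSERT INTO bigtable VALUES (?)",
                 [(i,) for i in range(5000)])

ids = list(range(0, 5000, 2))   # 2500 keys: more than one IN list allows
BATCH = 1000                    # Oracle allows at most 1000 items per IN list

# Equivalent of Loop Batches + Append: query each chunk separately,
# then collect (union) the partial results.
results = []
for start in range(0, len(ids), BATCH):
    chunk = ids[start:start + BATCH]
    in_list = ",".join(str(k) for k in chunk)
    results.extend(conn.execute(
        f"SELECT key FROM bigtable WHERE key IN ({in_list})"
    ).fetchall())
```

Each query stays under the limit, and appending the partial results reproduces the single large IN query.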
Regards,
Balázs