
[Solved] Out of Memory with Big Data; adding RAM and sampling didn't help.

BradJCox Member Posts: 2 Contributor I
edited November 2019 in Help
I'm trying to determine whether RapidMiner might be easier to work with than Google BigQuery and GraphChi for a big data project. The full test case is the ASA project at http://stat-computing.org/dataexpo/2009/the-data.html, but the problem already surfaces with just the 2008 flight data from the same page, which is about 632.2 MB when cleaned.

This is a CSV file that I have also imported into MySQL. I see similar problems reading from either CSV or MySQL.

I've edited RapidMinerGUI like this to give it 2 GB of RAM on an 8 GB machine. It didn't help; there was no noticeable difference.
    MAX_JAVA_MEMORY=2000
As near as I can tell, RapidMiner is trying to load the whole database regardless of the Sample step in the process, which specifies 1000 rows. This happens via both MySQL and CSV, although MySQL generally fails with an "attempting to reuse connection after closed" error, presumably secondary to running out of RAM.

A confusing factor is that I keep getting a Sample operator error to the effect of (from memory) "SampleSet contains too few records, 1000 is required", which I think means it hasn't tried to determine the actual record count yet and is working from flaky metadata.

I've invested a weekend just getting this far and am close to giving up. Can someone help me get out of the weeds? Thanks!

Here's the latest process:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.008">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
       <process expanded="true" height="190" width="413">
         <operator activated="true" class="stream_database" compatibility="5.2.008" expanded="true" height="60" name="Stream Database" width="90" x="45" y="30">
           <parameter key="connection" value="mysql"/>
           <parameter key="table_name" value="flights"/>
           <parameter key="label_attribute" value="ArrDelay"/>
           <parameter key="id_attribute" value="TailNum"/>
         </operator>
         <operator activated="true" class="sample" compatibility="5.2.008" expanded="true" height="76" name="Sample" width="90" x="179" y="30">
           <parameter key="sample_size" value="1000"/>
           <list key="sample_size_per_class"/>
           <list key="sample_ratio_per_class"/>
           <list key="sample_probability_per_class"/>
           <parameter key="use_local_random_seed" value="true"/>
         </operator>
         <operator activated="true" class="set_role" compatibility="5.2.008" expanded="true" height="76" name="Set Role" width="90" x="313" y="30">
           <parameter key="name" value="ArrDelay"/>
           <parameter key="target_role" value="label"/>
           <list key="set_additional_roles">
             <parameter key="ArrDelay" value="label"/>
           </list>
         </operator>
         <connect from_op="Stream Database" from_port="output" to_op="Sample" to_port="example set input"/>
         <connect from_op="Sample" from_port="example set output" to_op="Set Role" to_port="example set input"/>
         <connect from_op="Set Role" from_port="example set output" to_port="result 1"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
       </process>
     </operator>
    </process>
And the stacktrace, which is what makes me think the sample operator is ignored:
Aug 27, 2012 4:37:17 PM com.rapidminer.tools.jdbc.DatabaseHandler executeStatement
INFO: Executing query: 'SELECT *
FROM `flights`'
Exception in thread "RemoteProcess-Updater" Exception in thread "ProgressThread" java.lang.OutOfMemoryError: Java heap space
at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:1649)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1426)
at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:2924)
at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:477)
at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:2619)
at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:1788)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2209)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2619)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2569)
at com.mysql.jdbc.StatementImpl.executeQuery(StatementImpl.java:1521)
at com.rapidminer.tools.jdbc.DatabaseHandler.executeStatement(DatabaseHandler.java:1258)
at com.rapidminer.operator.io.DatabaseDataReader.getResultSet(DatabaseDataReader.java:116)
at com.rapidminer.operator.io.DatabaseDataReader.createExampleSet(DatabaseDataReader.java:124)
at com.rapidminer.gui.tools.dialogs.wizards.dataimport.DataImportWizard$1.run(DataImportWizard.java:73)
at com.rapidminer.gui.tools.ProgressThread$2.run(ProgressThread.java:189)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:680)
java.lang.OutOfMemoryError: Java heap space
at java.util.LinkedList.<init>(LinkedList.java:78)
at com.rapidminer.repository.remote.RemoteRepository.getAll(RemoteRepository.java:482)
at com.rapidminer.repository.gui.process.RemoteProcessesTreeModel$UpdateTask.run(RemoteProcessesTreeModel.java:129)
at java.util.TimerThread.mainLoop(Timer.java:512)
at java.util.TimerThread.run(Timer.java:462)

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    2 GB of RAM is not that much; please try a larger value.
    You can safely ignore the "error" on the Sample operator. It is simply a metadata error, which occurs because the Stream Database operator does not report how many data rows it will return before it has been executed. During execution everything should be fine, since the metadata is only used *before* executing the process to detect *potential* problems.
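
    When raising the memory setting seems to make no difference, it can help to check how much heap a JVM actually receives. The following is a minimal sketch, not part of RapidMiner: the class name HeapCheck is hypothetical, and it reports the limit only for the JVM it runs in, so it assumes you start it with the same heap option you expect the launcher to pass (for example -Xmx2000m, which is an assumption about how MAX_JAVA_MEMORY is translated).

```java
public class HeapCheck {
    /** Returns the maximum heap this JVM will attempt to use, in megabytes. */
    public static long maxHeapMb() {
        // maxMemory() reports the upper heap limit (roughly the -Xmx value),
        // or Long.MAX_VALUE if there is no inherent limit.
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: " + maxHeapMb() + " MB");
    }
}
```

    If the reported value is far below what you configured, the setting is probably not reaching the JVM at all.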

    From the stack trace it seems that you used some kind of wizard. Which one is it? Did you try to import the complete database table into the repository? In that case, the wizard does indeed try to read the complete table.
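
    For comparison, drawing a fixed-size sample does not in principle require holding the whole table in memory. The sketch below illustrates the general idea with reservoir sampling (Algorithm R) in plain Java. It is not how RapidMiner's Sample operator or the import wizard is implemented; the class ReservoirSample, its sample method, and the generated "row" strings are all hypothetical names used for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;
import java.util.stream.IntStream;

public class ReservoirSample {
    /**
     * Consumes rows one at a time and keeps a uniform random sample of at
     * most k rows. Memory use is bounded by k rows, not by the input size.
     */
    public static List<String> sample(Iterator<String> rows, int k, long seed) {
        List<String> reservoir = new ArrayList<>(k);
        Random rng = new Random(seed);
        long seen = 0;
        while (rows.hasNext()) {
            String row = rows.next();
            seen++;
            if (reservoir.size() < k) {
                // Fill the reservoir with the first k rows.
                reservoir.add(row);
            } else {
                // Pick j uniformly from [0, seen); the new row replaces
                // slot j only if j falls inside the reservoir.
                long j = (long) (rng.nextDouble() * seen);
                if (j < k) {
                    reservoir.set((int) j, row);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        // Stand-in for a large table: rows are generated lazily, never stored.
        Iterator<String> rows =
                IntStream.range(0, 10000).mapToObj(i -> "row" + i).iterator();
        List<String> picked = sample(rows, 1000, 42L);
        System.out.println(picked.size()); // prints 1000
    }
}
```

    For a CSV file, the iterator could come from reading the file line by line, so that only the 1000 sampled rows stay in memory at any time.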

    Best,
    Marius