Slow sparse file loading with sparse datamanagement
Hi,
I want to load some vectors with RapidMiner 5.0.001 RC from a file in sparse format for a similarity computation.
The file contains approx. 14 million non-zero entries in 9900 vectors. Each vector has 9900 dimensions, so about 1400 components per vector are non-zero.
For this task I used a read_sparse operator followed by a data_to_similarity operator. The datamanagement property of read_sparse is set to int_sparse_array to save memory. When the process is started, it gets stuck while reading the sparse file; after waiting 20 minutes I terminated the process.
With the int_array datamanagement, by contrast, the read_sparse operator finished in 30 seconds.
To check whether the read_sparse operator works at all with int_sparse_array, I measured the time for reading one example in readExamples() in MemoryExampleTable.java. It takes about 80 seconds for 20 lines, so it would take about 11 hours to read all 9900 vectors.
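(80 seconds for 20 lines is 4 seconds per line; 4 s × 9900 lines ≈ 39,600 s ≈ 11 hours.)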
Is the creation of an int_sparse_array really that slow, or am I doing something wrong?
Although the non-sparse datamanagement works, I need the sparse representation because the process will be applied to bigger datasets later.
I also know that even if the sparse reader finishes, the similarity operator will take hours to compute its results, but it might be replaced by a faster algorithm someday.
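(With 9900 examples, Data to Similarity has to compute on the order of 9900 × 9899 / 2 ≈ 49 million pairwise similarities, each over vectors with roughly 1400 non-zero components.)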
My computer has a Pentium M 1.6 GHz CPU and 2 GB RAM. According to RapidMiner's System Monitor, the JVM reserved 1.1 GB (max = total) but used only 10% of it while running the process (during the first 20 minutes).
The process file looks like this:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input>
      <location/>
    </input>
    <output>
      <location/>
      <location/>
    </output>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="566" width="685">
      <operator activated="true" class="read_sparse" expanded="true" height="60" name="Read Sparse" width="90" x="112" y="255">
        <parameter key="format" value="no_label"/>
        <parameter key="attribute_description_file" value="vectors.aml"/>
        <parameter key="datamanagement" value="int_sparse_array"/>
        <list key="prefix_map"/>
      </operator>
      <operator activated="true" class="data_to_similarity" expanded="true" height="76" name="Data to Similarity" width="90" x="313" y="255">
        <parameter key="measure_types" value="NumericalMeasures"/>
        <parameter key="numerical_measure" value="CosineSimilarity"/>
      </operator>
      <connect from_op="Read Sparse" from_port="output" to_op="Data to Similarity" to_port="example set"/>
      <connect from_op="Data to Similarity" from_port="similarity" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
my vectors.aml file:
<?xml version="1.0" encoding="windows-1252" standalone="no"?>
<attributeset default_source="vectors.dat" encoding="windows-1252">
  <id name="id" valuetype="integer"/>
  <attribute name="dim" sourcecol="1" sourcecol_end="9945" valuetype="integer"/>
</attributeset>
and just a few samples of vectors.dat:
id:1 2:7 3:1 5:2 7:61 8:1 10:1 11:44 12:2 13:1 14:2 16:1 ...
id:2 1:7 3:1 4:27 5:1695 6:268 7:12457 8:961 9:46 10:35 ...
...
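(For clarity: each line starts with an "id:<n>" token and then lists only the non-zero components as "index:value" pairs; everything not listed is implicitly zero. Just as an illustration of the format, my own sketch and not RapidMiner code, a tiny standalone Java snippet that walks over one such line:)

// Illustration of the sparse line format only, not RapidMiner code:
// a line is "id:<n>" followed by "index:value" pairs for the non-zero
// components; every component that is not listed is implicitly 0.
public class SparseLineDemo {
    public static void main(String[] args) {
        String line = "id:1 2:7 3:1 5:2 7:61 8:1 10:1 11:44 12:2 13:1 14:2 16:1";
        for (String token : line.split("\\s+")) {
            String[] parts = token.split(":");
            if (parts[0].equals("id")) {
                System.out.println("vector id = " + parts[1]);   // row identifier
            } else {
                // 1-based attribute index and its non-zero value
                System.out.println("  dim " + parts[0] + " = " + parts[1]);
            }
        }
    }
}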
Thanks in advance,
Tobias
Answers
Thank you for this detailed report. I tried to reproduce the problem, but I didn't succeed. I built this process for generating sparse data, but loading was always very fast:
Any comments on that?
Greetings,
Sebastian
Your example works, but it generates very little data: 1000 × 50 values, if I interpret this correctly. The file size of the generated data is 370 KB. The file size for my 9500 vectors of dimension 9500 with 85% sparseness (14,000,000 of the 9500*9500 values are non-zero) is about 120 MB.
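(Rough sanity check of those numbers: 14,000,000 / (9500 × 9500) ≈ 15.5% of the cells are non-zero, i.e. about 85% sparseness; assuming roughly 8-9 characters per "index:value" token on disk, that is about 14,000,000 × 8.5 bytes ≈ 120 MB, which matches the file size.)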
I modified your example a little bit to be closer to my input file. The process is not as slow as with my original data, but it is still slow (it will "only" take one hour to read all the data).
Note that an ID is generated for each vector (= data row). The ID attribute seems to cause the long loading time; if it is removed, loading is much faster.
Since some vectors are zero in all attributes and hence are not written to the data file, I need those IDs. Maybe empty lines could be used instead, I don't know. Anyway, adding IDs should not slow down loading that much.
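My guess at what might be going on, and it really is only a guess since I have not read the data management code, is something like the sketch below: if the sparse row stores its non-zero entries in a sorted index array and every value (including the generated ID) is inserted one by one via arraycopy, then filling a row becomes quadratic in its number of non-zeros. This is just my own illustration of that effect, not RapidMiner code:

import java.util.Arrays;

// Illustration only, NOT the actual RapidMiner data management code:
// inserting each non-zero into a sorted index array with arraycopy costs
// O(k) per insert, so filling one row with k non-zeros costs O(k^2),
// and 9900 such rows add up quickly.
public class SparseInsertCost {

    static int[] indices = new int[0];

    // insert one attribute index into the sorted array, O(k) per call
    static void insertSorted(int index) {
        int found = Arrays.binarySearch(indices, index);
        int pos = found >= 0 ? found : -(found + 1);   // insertion point
        int[] grown = new int[indices.length + 1];
        System.arraycopy(indices, 0, grown, 0, pos);
        grown[pos] = index;
        System.arraycopy(indices, pos, grown, pos + 1, indices.length - pos);
        indices = grown;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        for (int i = 0; i < 1400; i++) {               // ~1400 non-zeros per vector
            insertSorted(i * 7);
        }
        System.out.printf("one row: %.2f ms; 9900 rows would multiply this, "
                + "plus whatever extra work the ID attribute triggers%n",
                (System.nanoTime() - start) / 1e6);
    }
}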
Bye
Tobias