"Creating SVDs in X-Validation operator very slow"
text_miner
Member Posts: 11 Contributor II
I am trying to set up a process in RapidMiner for text mining that uses SVDs. I have compared the time it takes to create SVDs using the entire dataset versus only a training set (within the training subprocess of an X-Validation operator). (Both processes I used are detailed below.) Using the entire dataset, the whole process finishes within a minute or so. When running the process with an X-Validation operator, the time increases dramatically; after 45 minutes the SVDs had still not been created. Any ideas on why creating SVDs takes so much longer inside the X-Validation operator?
For both processes I am using the comp.graphics and comp.windows.x newsgroups mini-datasets available from http://archive.ics.uci.edu/ml/databases/20newsgroups/20newsgroups.html (mini_newsgroups.tar.gz).
Entire Dataset:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input>
<location/>
</input>
<output>
<location/>
<location/>
<location/>
<location/>
</output>
<macros/>
</context>
<operator activated="true" class="process" expanded="true" name="Process">
<process expanded="true" height="521" width="614">
<operator activated="true" class="text:process_document_from_file" expanded="true" height="76" name="Process Documents from Files" width="90" x="112" y="75">
<list key="text_directories">
<parameter key="comp.graphics" value="/misc_datasets/mini_newsgroups/comp.graphics"/>
<parameter key="comp.windows.x" value="/misc_datasets/mini_newsgroups/comp.windows.x"/>
</list>
<parameter key="vector_creation" value="Term Occurrences"/>
<parameter key="prune_method" value="absolute"/>
<parameter key="prune_below_absolute" value="2"/>
<parameter key="prune_above_absolute" value="200"/>
<process expanded="true" height="650" width="1092">
<operator activated="true" class="text:transform_cases" expanded="true" height="60" name="Transform Cases" width="90" x="73" y="30"/>
<operator activated="true" class="text:tokenize" expanded="true" height="60" name="Tokenize" width="90" x="246" y="30"/>
<operator activated="true" class="text:filter_by_length" expanded="true" height="60" name="Filter by Length" width="90" x="380" y="30">
<parameter key="min_chars" value="2"/>
<parameter key="max_chars" value="50"/>
</operator>
<connect from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter by Length" to_port="document"/>
<connect from_op="Filter by Length" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="generate_tfidf" expanded="true" height="76" name="Generate TFIDF" width="90" x="313" y="165"/>
<operator activated="true" class="singular_value_decomposition" expanded="true" height="94" name="SVD" width="90" x="447" y="165">
<parameter key="return_preprocessing_model" value="true"/>
<parameter key="dimensions" value="100"/>
</operator>
<connect from_op="Process Documents from Files" from_port="example set" to_op="Generate TFIDF" to_port="example set input"/>
<connect from_op="Process Documents from Files" from_port="word list" to_port="result 3"/>
<connect from_op="Generate TFIDF" from_port="example set output" to_op="SVD" to_port="example set input"/>
<connect from_op="SVD" from_port="example set output" to_port="result 1"/>
<connect from_op="SVD" from_port="preprocessing model" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
</process>
</operator>
</process>
Note: I tried putting a Materialize Data operator in before creating the SVDs, but it doesn't seem to speed up the creation of the SVDs.
Any help would be greatly appreciated. Thanks!

X-Validation:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input>
<location/>
</input>
<output>
<location/>
<location/>
<location/>
</output>
<macros/>
</context>
<operator activated="true" class="process" expanded="true" name="Process">
<process expanded="true" height="521" width="614">
<operator activated="true" class="text:process_document_from_file" expanded="true" height="76" name="Process Documents from Files" width="90" x="112" y="75">
<list key="text_directories">
<parameter key="comp.graphics" value="/misc_datasets/mini_newsgroups/comp.graphics"/>
<parameter key="comp.windows.x" value="/misc_datasets/mini_newsgroups/comp.windows.x"/>
</list>
<parameter key="vector_creation" value="Term Occurrences"/>
<parameter key="prune_method" value="absolute"/>
<parameter key="prune_below_absolute" value="2"/>
<parameter key="prune_above_absolute" value="200"/>
<process expanded="true" height="650" width="1092">
<operator activated="true" class="text:transform_cases" expanded="true" height="60" name="Transform Cases" width="90" x="73" y="30"/>
<operator activated="true" class="text:tokenize" expanded="true" height="60" name="Tokenize" width="90" x="246" y="30"/>
<operator activated="true" class="text:filter_by_length" expanded="true" height="60" name="Filter by Length" width="90" x="380" y="30">
<parameter key="min_chars" value="2"/>
<parameter key="max_chars" value="50"/>
</operator>
<connect from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter by Length" to_port="document"/>
<connect from_op="Filter by Length" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="x_validation" expanded="true" height="112" name="Validation" width="90" x="246" y="300">
<process expanded="true" height="650" width="614">
<operator activated="true" class="generate_tfidf" expanded="true" height="76" name="Generate TFIDF" width="90" x="45" y="30"/>
<operator activated="true" class="materialize_data" expanded="true" height="76" name="Materialize Data" width="90" x="179" y="30">
<parameter key="datamanagement" value="double_sparse_array"/>
</operator>
<operator activated="true" class="singular_value_decomposition" expanded="true" height="94" name="SVD" width="90" x="313" y="30">
<parameter key="return_preprocessing_model" value="true"/>
<parameter key="dimensions" value="100"/>
</operator>
<operator activated="true" class="logistic_regression" expanded="true" height="94" name="Logistic Regression" width="90" x="447" y="30"/>
<connect from_port="training" to_op="Generate TFIDF" to_port="example set input"/>
<connect from_op="Generate TFIDF" from_port="example set output" to_op="Materialize Data" to_port="example set input"/>
<connect from_op="Materialize Data" from_port="example set output" to_op="SVD" to_port="example set input"/>
<connect from_op="SVD" from_port="example set output" to_op="Logistic Regression" to_port="training set"/>
<connect from_op="SVD" from_port="preprocessing model" to_port="through 1"/>
<connect from_op="Logistic Regression" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
<portSpacing port="sink_through 2" spacing="0"/>
</process>
<process expanded="true" height="650" width="547">
<operator activated="true" class="generate_tfidf" expanded="true" height="76" name="Generate TFIDF (2)" width="90" x="45" y="30"/>
<operator activated="true" class="apply_model" expanded="true" height="76" name="Apply Model" width="90" x="179" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="apply_model" expanded="true" height="76" name="Apply Model (2)" width="90" x="313" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_binominal_classification" expanded="true" height="76" name="Performance" width="90" x="380" y="165">
<parameter key="main_criterion" value="f_measure"/>
<parameter key="AUC (optimistic)" value="true"/>
<parameter key="precision" value="true"/>
<parameter key="recall" value="true"/>
<parameter key="lift" value="true"/>
<parameter key="fallout" value="true"/>
<parameter key="f_measure" value="true"/>
<parameter key="false_positive" value="true"/>
<parameter key="false_negative" value="true"/>
<parameter key="true_positive" value="true"/>
<parameter key="true_negative" value="true"/>
<parameter key="sensitivity" value="true"/>
<parameter key="specificity" value="true"/>
<parameter key="youden" value="true"/>
<parameter key="positive_predictive_value" value="true"/>
<parameter key="negative_predictive_value" value="true"/>
<parameter key="psep" value="true"/>
</operator>
<connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
<connect from_port="test set" to_op="Generate TFIDF (2)" to_port="example set input"/>
<connect from_port="through 1" to_op="Apply Model" to_port="model"/>
<connect from_op="Generate TFIDF (2)" from_port="example set output" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="source_through 2" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<connect from_op="Process Documents from Files" from_port="example set" to_op="Validation" to_port="training"/>
<connect from_op="Process Documents from Files" from_port="word list" to_port="result 1"/>
<connect from_op="Validation" from_port="averagable 1" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
Answers
I would guess the problem arises because there are fewer examples. This might produce a worse-conditioned matrix, so that the SVD algorithm either hangs or needs much longer to compute the results. Did you try changing the random seed? A new distribution of the examples across the folds might solve the problem.
Greetings,
Sebastian
Thanks for the reply. After trying different seed values I was still getting the same problem, so I investigated a little further and found the solution.
The issue was missing values being introduced into the dataset when the TFIDF values were calculated for the term-by-document matrix. Since only a subset of the data is used to train each fold, certain attributes (i.e., terms) had zero occurrences across all examples in that fold. For those attributes, the TFIDF operator produced missing values ("?") for every example.
The solution was to use the Replace Missing Values operator after the TFIDF operator to replace all missing values with zero. After replacing the missing values, the SVD operator worked without a problem.
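For reference, the workaround can be expressed as an operator snippet placed between Generate TFIDF and SVD inside the training subprocess. This is a sketch based on the RapidMiner 5 Replace Missing Values operator; the position attributes are illustrative:

```xml
<operator activated="true" class="replace_missing_values" expanded="true" height="76" name="Replace Missing Values" width="90" x="179" y="120">
  <parameter key="attribute_filter_type" value="all"/>
  <parameter key="default" value="zero"/>
  <list key="columns"/>
</operator>
```

With `default` set to `zero`, every "?" produced by the TFIDF operator for all-zero term columns is replaced with 0 before the data reaches the SVD operator.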
Thanks again for the reply!
Ok, then it seems to be a good idea to throw a warning that the operator cannot cope with missing values. I will note that down.
Greetings,
Sebastian
I agree, a warning would be nice.
In addition, another thing to consider is changing the TFIDFFilter class to set zeros for columns without any counts. Although the missing values can currently be replaced with zeros using the Replace Missing Values operator, this (1) requires an additional operator and (2) changes the order of the attributes in the matrix. While the first point is not a big deal, I imagine the second may cause problems. For example, consider creating SVDs on a training set and then wanting to map (i.e., fold in) examples from the test set into the pre-existing latent semantic space. (This example assumes that TF-IDF is applied to the training and test sets separately (although in practice, the IDF values from the training set would probably be applied to the test set) and that the two sets have different attributes with zero counts.) To fold in these new "pseudo-documents", the order of the attributes must be the same between the two sets.
Listed below is the TFIDFFilter class with two simple changes to set zeros for columns without any counts. The first change is on line 106 and simply ensures that at least one document has a count for the current term before calculating the IDF. The second change adds an OR condition to lines 118-119: the value is set to zero if the IDF is zero for the current term. Thanks!
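The gist of the two changes can be sketched like this. This is a simplified stand-in in plain Java, not the actual TFIDFFilter source; the method and variable names are illustrative:

```java
import java.util.Arrays;

public class TfIdfSketch {

    // Computes TF-IDF for a term-occurrence matrix (rows = documents,
    // columns = terms). Terms with zero document frequency get 0.0
    // instead of an undefined/missing value.
    static double[][] tfidf(int[][] counts) {
        int numDocs = counts.length;
        int numTerms = counts[0].length;
        double[][] result = new double[numDocs][numTerms];

        // Document frequency per term: number of documents containing it.
        int[] df = new int[numTerms];
        for (int[] doc : counts)
            for (int t = 0; t < numTerms; t++)
                if (doc[t] > 0) df[t]++;

        for (int d = 0; d < numDocs; d++) {
            int totalTerms = Arrays.stream(counts[d]).sum();
            for (int t = 0; t < numTerms; t++) {
                // First change: guard against terms that occur in no
                // document (df == 0), whose IDF would be undefined.
                if (df[t] == 0 || totalTerms == 0) {
                    result[d][t] = 0.0;
                    continue;
                }
                double tf = (double) counts[d][t] / totalTerms;
                double idf = Math.log((double) numDocs / df[t]);
                // Second change: emit zero when the IDF itself is zero,
                // rather than propagating a degenerate value.
                result[d][t] = (idf == 0.0) ? 0.0 : tf * idf;
            }
        }
        return result;
    }
}
```

The key point is that an all-zero column yields 0.0 directly, so no missing values reach a downstream SVD and the attribute order is left untouched.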
I will add this and it will be included in the upcoming final version.
Anyway, we usually use the TFIDF filter of the Process Documents operator, where, as far as I know, this error does not arise.
Greetings,
Sebastian