"how to save the clustered result into two folders"
huaiyanggongzi
Member Posts: 39 Contributor II
I have a set of documents stored in a single folder. I run an unsupervised clustering algorithm such as K-means to split them into two groups; the process I created is below. Is there a way to split the original folder into two folders based on the clustering result? In other words, I want to put the files belonging to cluster 1 into one folder and the files belonging to cluster 2 into another folder.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.011">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
<parameter key="parallelize_main_process" value="true"/>
<process expanded="true" height="370" width="656">
<operator activated="true" class="text:process_document_from_file" compatibility="5.1.002" expanded="true" height="76" name="Process Documents from Files (2)" width="90" x="45" y="75">
<list key="text_directories">
<parameter key="NotResponsive" value="D:\User1\datamining\Data\training Sets"/>
</list>
<parameter key="extract_text_only" value="false"/>
<parameter key="vector_creation" value="Term Frequency"/>
<parameter key="prune_method" value="absolute"/>
<parameter key="prune_below_absolute" value="5"/>
<parameter key="prune_above_absolute" value="5000000"/>
<parameter key="parallelize_vector_creation" value="true"/>
<process expanded="true" height="380" width="674">
<operator activated="true" class="text:tokenize" compatibility="5.1.002" expanded="true" height="60" name="Tokenize (2)" width="90" x="45" y="30"/>
<operator activated="true" class="text:transform_cases" compatibility="5.1.002" expanded="true" height="60" name="Transform Cases (2)" width="90" x="180" y="30"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.002" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="313" y="30"/>
<operator activated="true" class="text:generate_n_grams_terms" compatibility="5.1.002" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="514" y="120"/>
<connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
<connect from_op="Transform Cases (2)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
<connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="k_means" compatibility="5.1.011" expanded="true" height="76" name="Clustering" width="90" x="305" y="84"/>
<connect from_op="Process Documents from Files (2)" from_port="example set" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
<connect from_op="Clustering" from_port="clustered set" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
Answers
First, filter the clustered data set by the "cluster" attribute with Filter Examples so that only the examples of one cluster remain. Then use "Loop Values" to loop over the "metadata_path" attribute. Loop Values creates an iteration macro that contains the current value, in this case the path of the document. You can use that macro as the "file" parameter of Move File. The second parameter of Move File, the target location, is up to you and depends on the cluster value.
Of course, instead of manually filtering each cluster value in the first step, you could use a second Loop Values to loop the cluster values.
Best,
Marius
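For readers who want to see the logic of the steps above spelled out, here is a minimal sketch in plain Python rather than as a RapidMiner process. It assumes the clustered example set has been exported as (metadata_path, cluster) pairs; the function name `move_by_cluster` and the target folder layout (one subfolder per cluster label, for example cluster_0 and cluster_1) are hypothetical choices made for illustration and are not part of RapidMiner.

```python
import shutil
from pathlib import Path


def move_by_cluster(rows, target_root):
    """Move each file into target_root/<cluster label>/.

    rows: iterable of (metadata_path, cluster) pairs, assumed to be
          exported from the clustered example set (hypothetical input).
    target_root: folder under which one subfolder per cluster is created.
    Returns a dict mapping each cluster label to the list of new paths.
    """
    target_root = Path(target_root)
    moved = {}
    for path, cluster in rows:
        src = Path(path)
        # One folder per cluster value, created on first use.
        dest_dir = target_root / str(cluster)
        dest_dir.mkdir(parents=True, exist_ok=True)
        dest = dest_dir / src.name
        # Counterpart of Move File: "file" is src, the target is dest.
        shutil.move(str(src), str(dest))
        moved.setdefault(str(cluster), []).append(dest)
    return moved
```

Grouping by the cluster value inside a single loop plays the role of the outer Loop Values over the cluster attribute. If the original folder should stay intact, replacing `shutil.move` with `shutil.copy2` copies the files instead of moving them.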
If you don't have the Move File operator, please update RapidMiner to the latest version (5.2.008). You'll find an explanation of Loop Values in this thread.
Best, Marius