[SOLVED] postprocessing based on predictions

TheBen · July 2012

Can I use the predictions to trigger actions such as renaming a file, copying it to a new directory (sorting), or executing a program?

MariusHelf · July 2012

Can you please describe in more detail what you want to do? If you already have a process as a starting point, you could also post that one. Please have a look at the link in my signature. I'm sure that we will be able to help you!

Best,
~Marius

TheBen · July 2012

1. Describe what you are doing
-> I want to classify PDF documents into multiple categories by their text content.

2. If you are working with data, give a detailed description of your data (number of examples and attributes, attribute types, label type etc.).
-> A database of already classified PDF documents exists: hundreds of scanned PDF documents with OCR text. There are 150 attributes/categories (customer A, B, C, ...; topics such as printer, monitor, laptop, etc.).

3. Describe which results or actions you are expecting.
-> The program should learn the classes and then apply the model to the unknown documents. Finally, it should perform actions such as renaming the files and moving them from the incoming directory into the database. (Sequence: learn PDFs -> classify into multiple categories -> analyse unknown PDFs -> perform an action depending on the prediction for each category.)

4. Describe which results you actually get.
-> load pdf content into example set
-> train, verify performance
-> apply model to "unknown" labeled exampleset
-> RESULT: the data view table with the prediction(label) column is shown as the result
-> HOW TO?: perform an action (my solution: export an Excel table with the results, then use a short Java program to rename the files)
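
For reference, a minimal sketch of what such a Java helper might look like. It is not taken from the thread: it assumes the results were saved as a semicolon-separated text file with a header row and two columns (file path; predicted class), and the class name `PredictionSorter` and the folder layout `targetRoot/<class>/` are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.List;

public class PredictionSorter {

    /** Moves one file into targetRoot/<predictedClass>/ and returns its new location. */
    static Path moveToCategory(Path source, String predictedClass, Path targetRoot) throws IOException {
        Path categoryDir = targetRoot.resolve(predictedClass);
        Files.createDirectories(categoryDir); // no-op if the directory already exists
        Path target = categoryDir.resolve(source.getFileName());
        return Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
    }

    /**
     * Reads lines of the assumed form "path;prediction", skips the header row
     * and moves every listed file. Returns the number of files moved.
     */
    static int sortAll(Path csv, Path targetRoot) throws IOException {
        List<String> lines = Files.readAllLines(csv);
        int moved = 0;
        for (String line : lines.subList(Math.min(1, lines.size()), lines.size())) {
            String[] cols = line.split(";", 2);
            if (cols.length < 2 || cols[0].trim().isEmpty()) {
                continue; // ignore blank or malformed rows
            }
            moveToCategory(Paths.get(cols[0].trim()), cols[1].trim(), targetRoot);
            moved++;
        }
        return moved;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java PredictionSorter <predictions.csv> <target directory>");
            return;
        }
        int moved = sortAll(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("Moved " + moved + " file(s)");
    }
}
```

Renaming instead of sorting would only change how `target` is built inside `moveToCategory`.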

MariusHelf · July 2012

Hi Ben,

Now that's actually an excellent problem description!

You can use Loop Examples, Extract Macro and Execute Program to solve your problem. Please have a look at the attached process. You have to adapt the command line in Execute Program to your operating system.
If you are not familiar with macros in RapidMiner and have problems understanding the process, don't hesitate to ask again!
Note: you could also use Execute Script to run a Groovy script, but that requires knowledge of the Java API.
Note 2: you can search in the operator list. If you enter e.g. "execute", you see all operators with "execute" in their name. That way you can find operators even if you don't know where they are located in the hierarchy.

Best,
~Marius

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.009">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.009" expanded="true" name="Process">
    <process expanded="true" height="415" width="748">
      <operator activated="true" class="text:process_document_from_file" compatibility="5.2.005" expanded="true" height="76" name="Process Documents (Training)" width="90" x="45" y="120">
        <list key="text_directories">
          <parameter key="class1" value="C:\Users\mhelf\Documents\schulungen\4 - Text and Web Mining\Data\files - newsgroup"/>
          <parameter key="class2" value="C:\Users\mhelf\Documents\schulungen\4 - Text and Web Mining\Data\files - various encodings"/>
        </list>
        <process expanded="true" height="546" width="658">
          <operator activated="true" class="text:tokenize" compatibility="5.2.005" expanded="true" height="60" name="Tokenize" width="90" x="179" y="30"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:process_document_from_file" compatibility="5.2.005" expanded="true" height="76" name="Process Documents (Application)" width="90" x="45" y="300">
        <list key="text_directories">
          <parameter key="class1" value="C:\Users\mhelf\Documents\schulungen\4 - Text and Web Mining\Data\files - newsgroup"/>
          <parameter key="class2" value="C:\Users\mhelf\Documents\schulungen\4 - Text and Web Mining\Data\files - various encodings"/>
        </list>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.2.005" expanded="true" name="Tokenize (2)"/>
          <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="naive_bayes" compatibility="5.2.009" expanded="true" height="76" name="Naive Bayes" width="90" x="313" y="30"/>
      <operator activated="true" class="apply_model" compatibility="5.2.009" expanded="true" height="76" name="Apply Model" width="90" x="447" y="120">
        <list key="application_parameters"/>
      </operator>
      <operator activated="true" class="loop_examples" compatibility="5.2.009" expanded="true" height="76" name="Loop Examples" width="90" x="581" y="120">
        <process expanded="true" height="546" width="658">
          <operator activated="true" class="extract_macro" compatibility="5.2.009" expanded="true" height="60" name="Extract Macro" width="90" x="45" y="30">
            <parameter key="macro" value="class"/>
            <parameter key="macro_type" value="data_value"/>
            <parameter key="attribute_name" value="prediction(label)"/>
            <parameter key="example_index" value="%{example}"/>
          </operator>
          <operator activated="true" class="extract_macro" compatibility="5.2.009" expanded="true" height="60" name="Extract Macro (2)" width="90" x="179" y="30">
            <parameter key="macro" value="path"/>
            <parameter key="macro_type" value="data_value"/>
            <parameter key="attribute_name" value="metadata_path"/>
            <parameter key="example_index" value="%{example}"/>
          </operator>
          <operator activated="true" class="execute_program" compatibility="5.2.009" expanded="true" height="76" name="Execute Program" width="90" x="380" y="30">
            <parameter key="command" value="your_system_command_to_move &quot;%{path}&quot; &quot;destination_path/%{class}&quot;"/>
          </operator>
          <connect from_port="example set" to_op="Extract Macro" to_port="example set"/>
          <connect from_op="Extract Macro" from_port="example set" to_op="Extract Macro (2)" to_port="example set"/>
          <connect from_op="Extract Macro (2)" from_port="example set" to_op="Execute Program" to_port="through 1"/>
          <connect from_op="Execute Program" from_port="through 1" to_port="example set"/>
          <portSpacing port="source_example set" spacing="0"/>
          <portSpacing port="sink_example set" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Process Documents (Training)" from_port="example set" to_op="Naive Bayes" to_port="training set"/>
      <connect from_op="Process Documents (Training)" from_port="word list" to_op="Process Documents (Application)" to_port="word list"/>
      <connect from_op="Process Documents (Application)" from_port="example set" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Naive Bayes" from_port="model" to_op="Apply Model" to_port="model"/>
      <connect from_op="Apply Model" from_port="labelled data" to_op="Loop Examples" to_port="example set"/>
      <connect from_op="Loop Examples" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

TheBen · July 2012

OK, thanks, this works fine. But if I replace "Process Documents from Files" with "Loop Files", "Read Document", "Process Documents" and "Set Role" like this:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="false" class="loop_files" compatibility="5.2.008" expanded="true" height="76" name="Loop Files" width="90" x="45" y="75">
    <parameter key="directory" value="C:\trainpdfs\"/>
    <parameter key="filtered_string" value="file name (last part of the path)"/>
    <parameter key="file_name_macro" value="file_name"/>
    <parameter key="file_path_macro" value="file_path"/>
    <parameter key="parent_path_macro" value="parent_path"/>
    <parameter key="recursive" value="true"/>
    <parameter key="iterate_over_files" value="true"/>
    <parameter key="iterate_over_subdirs" value="false"/>
    <parameter key="parallelize_nested_process" value="false"/>
    <process expanded="true" height="650" width="1080">
      <operator activated="false" class="text:read_document" compatibility="5.2.004" expanded="true" height="60" name="Read Document" width="90" x="504" y="30">
        <parameter key="extract_text_only" value="true"/>
        <parameter key="use_file_extension_as_type" value="true"/>
        <parameter key="content_type" value="pdf"/>
        <parameter key="encoding" value="SYSTEM"/>
      </operator>
      <connect from_port="file object" to_op="Read Document" to_port="file"/>
      <connect from_op="Read Document" from_port="output" to_port="out 1"/>
      <portSpacing port="source_file object" spacing="0"/>
      <portSpacing port="source_in 1" spacing="0"/>
      <portSpacing port="sink_out 1" spacing="0"/>
      <portSpacing port="sink_out 2" spacing="0"/>
    </process>
  </operator>
</process>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="false" class="text:process_documents" compatibility="5.2.004" expanded="true" height="94" name="Process Documents (2)" width="90" x="179" y="75">
    <parameter key="create_word_vector" value="true"/>
    <parameter key="vector_creation" value="TF-IDF"/>
    <parameter key="add_meta_information" value="true"/>
    <parameter key="keep_text" value="false"/>
    <parameter key="prune_method" value="none"/>
    <parameter key="prunde_below_percent" value="3.0"/>
    <parameter key="prune_above_percent" value="30.0"/>
    <parameter key="prune_below_rank" value="0.05"/>
    <parameter key="prune_above_rank" value="0.05"/>
    <parameter key="datamanagement" value="double_sparse_array"/>
    <parameter key="parallelize_vector_creation" value="false"/>
    <process expanded="true" height="632" width="1080">
      <operator activated="false" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize" width="90" x="504" y="30">
        <parameter key="mode" value="non letters"/>
        <parameter key="characters" value=".:"/>
        <parameter key="language" value="English"/>
        <parameter key="max_token_length" value="3"/>
      </operator>
      <connect from_port="document" to_op="Tokenize" to_port="document"/>
      <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
      <portSpacing port="source_document" spacing="0"/>
      <portSpacing port="sink_document 1" spacing="0"/>
      <portSpacing port="sink_document 2" spacing="0"/>
    </process>
  </operator>
</process>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="set_role" compatibility="5.2.008" expanded="true" height="76" name="Set Role" width="90" x="175" y="266">
    <parameter key="name" value="metadata_file"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
  </operator>
</process>

then the "Neural Net" operator doesn't learn the correct relationships (the predictions are almost all the same). In contrast, if I use "Naive Bayes", the same scenario works (the predictions are mostly accurate).

I want to use "Neural Net" instead of "Naive Bayes". The reason is that with the "Process Documents from Files" implementation I get better accuracy and confidence levels...

EDIT: I found differences in the example sets:
- the example set from "Process Documents from Files" has a "label" attribute of role "label" and type "polynominal"
- the example set from "Loop Files", "Process Documents" and "Set Role" has a "metadata_file" attribute of role "label" and type "nominal"

How to change the type?

TheBen · July 2012

I would really like to get this done. It is a great program. There is a manual for 40€. If it really helps me to use RapidMiner to implement the PDF classifier, I wouldn't hesitate to buy it.

Nils_Woehler · July 2012

Hi,

Why do you want to use a neural net if Naive Bayes also works? Have you tried an SVM? Most of the time it gives better results for text processing than other algorithms.
Nominal and polynominal should not create any errors.

Do you mean "How to Extend RapidMiner 5.0"? You should only buy it if you want to create a new extension for yourself.

Best,
Nils

TheBen · July 2012

If I use SVM then I get this error:

Support Vector Machine cannot handle polynominal label.

EDIT: Working solution: I embedded the SVM into a "Classification by Regression" operator. Now it works fine. Even with only a few examples the classification is correct. That's great!
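
For anyone wondering why the wrapper helps: as far as I understand it, "Classification by Regression" trains one regression model per class (the label becomes +1 for that class and -1 for all others) and then predicts the class whose model returns the highest value. Below is a rough, self-contained Java sketch of just these two steps. It does not use the RapidMiner API, the class and method names are invented for illustration, and the scores in `main` are made-up stand-ins for the outputs of trained models.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class OneVsRestSketch {

    /** Turns a polynominal label column into +1/-1 regression targets for one class. */
    static double[] encodeTargets(String[] labels, String positiveClass) {
        double[] targets = new double[labels.length];
        for (int i = 0; i < labels.length; i++) {
            targets[i] = labels[i].equals(positiveClass) ? 1.0 : -1.0;
        }
        return targets;
    }

    /** Picks the class whose regression model returned the highest score. */
    static String predict(Map<String, Double> scorePerClass) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> entry : scorePerClass.entrySet()) {
            if (entry.getValue() > bestScore) {
                bestScore = entry.getValue();
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] labels = {"printer", "monitor", "printer", "laptop"};
        // Targets that the hypothetical "printer" model would be trained on.
        System.out.println(Arrays.toString(encodeTargets(labels, "printer")));

        // Made-up scores, standing in for the per-class model outputs on one document.
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("printer", 0.7);
        scores.put("monitor", -0.4);
        scores.put("laptop", -0.9);
        System.out.println(predict(scores));
    }
}
```

Because each inner model only ever sees a two-valued target, a learner that refuses a polynominal label can still be used for many categories.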
