Split a single XML file into several docs or an example set

mohammadreza Member Posts: 23 Contributor II
edited February 2020 in Help
Hi. I am new to the RapidMiner text plugin.

I have an XML file consisting of <document> elements. Each document tag contains one document as follows:
<documents>
    <document>
        <id> 1 </id>
        <text>...............</text>
    </document>
    <document>
        <id> 2 </id>
        <text>...............</text>
    </document>
    ...
</documents>
I think I have to split them first and extract documents to be able to construct the word vector. Is there any way to do that?

Answers

  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    Is there any reason not to use read xml and convert the example set to a document afterwards?
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • mohammadreza Member Posts: 23 Contributor II
    Thanks Martin,

    I think the Read XML operator is the wise option, but I need to do some text classification after that. That's why I wanted to work with documents through the text plugin. Assuming that, following your explanation, I use Read XML, is there any way to work with the text plugin? I mean, how should I connect the output of Read XML to an operator like "Process Documents" (or any other operator) that lets me do the tokenization and stemming and build the word vector?
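
    For illustration, here is a rough outline of the chain I have in mind. This is only a sketch: the Read XML parameters (the file and the XPath expressions) still have to be configured, for example via the import wizard, and I am not sure the port wiring is right:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.2.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.2.000" expanded="true" name="Process">
        <process expanded="true">
          <!-- Read XML: file and XPath expressions still need to be set, e.g. via the import wizard -->
          <operator activated="true" class="read_xml" compatibility="6.2.000" expanded="true" height="60" name="Read XML" width="90" x="45" y="75"/>
          <!-- Process Documents from Data builds the word vector from the text attribute -->
          <operator activated="true" class="text:process_document_from_data" compatibility="6.1.000" expanded="true" height="76" name="Process Documents from Data" width="90" x="179" y="75">
            <parameter key="vector_creation" value="Term Occurrences"/>
            <parameter key="keep_text" value="true"/>
            <list key="specify_weights"/>
            <process expanded="true">
              <!-- tokenization; stemming, filtering etc. would be chained in here as well -->
              <operator activated="true" class="text:tokenize" compatibility="6.1.000" expanded="true" height="60" name="Tokenize" width="90" x="112" y="30"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Read XML" from_port="output" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>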

    Thanks
  • fras Member Posts: 93 Contributor II
    Hi, try this as a starting point:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.2.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.2.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="text:create_document" compatibility="6.1.000" expanded="true" height="60" name="Create Document (2)" width="90" x="45" y="75">
            <parameter key="text" value="&lt;documents&gt;&#10;    &lt;document&gt;&#10;        &lt;id&gt; 1 &lt;/id&gt;&#10;        &lt;text&gt; content_A &lt;/text&gt;&#10;    &lt;/document&gt;&#10;    &lt;document&gt;&#10;        &lt;id&gt; 2 &lt;/id&gt;&#10;        &lt;text&gt; content_B &lt;/text&gt;&#10;    &lt;/document&gt;&#10;    ...&#10;&lt;/documents&gt;"/>
            <parameter key="add label" value="true"/>
            <parameter key="label_value" value="SOURCE01"/>
          </operator>
          <operator activated="true" class="text:cut_document" compatibility="6.1.000" expanded="true" height="60" name="Cut Document (10)" width="90" x="112" y="165">
            <parameter key="query_type" value="Regular Region"/>
            <list key="string_machting_queries">
              <parameter key="empty" value="&lt;Family.&lt;/Family&gt;"/>
            </list>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries">
              <parameter key="empty" value="&lt;document.&lt;/document&gt;"/>
            </list>
            <list key="xpath_queries"/>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <list key="jsonpath_queries"/>
            <process expanded="true">
              <connect from_port="segment" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="loop_collection" compatibility="6.2.000" expanded="true" height="76" name="Loop Collection (2)" width="90" x="246" y="75">
            <parameter key="set_iteration_macro" value="true"/>
            <process expanded="true">
              <operator activated="true" class="text:documents_to_data" compatibility="6.1.000" expanded="true" height="76" name="Documents to Data" width="90" x="112" y="75">
                <parameter key="text_attribute" value="text"/>
              </operator>
              <connect from_port="single" to_op="Documents to Data" to_port="documents 1"/>
              <connect from_op="Documents to Data" from_port="example set" to_port="output 1"/>
              <portSpacing port="source_single" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="append" compatibility="6.2.000" expanded="true" height="76" name="Append (2)" width="90" x="380" y="75"/>
          <operator activated="true" class="text:process_document_from_data" compatibility="6.1.000" expanded="true" height="76" name="Process Documents from Data (2)" width="90" x="514" y="75">
            <parameter key="vector_creation" value="Term Occurrences"/>
            <parameter key="keep_text" value="true"/>
            <list key="specify_weights"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="6.1.000" expanded="true" height="60" name="Tokenize" width="90" x="179" y="30"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Create Document (2)" from_port="output" to_op="Cut Document (10)" to_port="document"/>
          <connect from_op="Cut Document (10)" from_port="documents" to_op="Loop Collection (2)" to_port="collection"/>
          <connect from_op="Loop Collection (2)" from_port="output 1" to_op="Append (2)" to_port="example set 1"/>
          <connect from_op="Append (2)" from_port="merged set" to_op="Process Documents from Data (2)" to_port="example set"/>
          <connect from_op="Process Documents from Data (2)" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

  • mohammadreza Member Posts: 23 Contributor II
    Thank you indeed, Fras. I will try your solution and let you know about the results ASAP. I think your solution will be more efficient if I can adapt it, because I designed the RM process with the Read XML operator (as Martin suggested) and ran out of memory even with 32 GB of RAM. My XML file is only about 160 MB, but the de-serialization in Read XML takes a lot of RAM. So I want to try your approach and report back whether it can handle my 160 MB XML file. Thanks again.
  • mohammadreza Member Posts: 23 Contributor II
    Hi Fras. I am trying your solution for reading my 160 MB XML file. I got stuck dealing with the following XML schema, which has more than one <text> node in each document.
    <documents>
       <document>
           <id> 1 </id>
           <message>
                   <author>..........</author>        
                   <text>...............</text>
           </message>
           <message>
                  <text>...............</text>
                  <text>...............</text>
           </message>
       </document>
       <document>
           <id> 2 </id>
           <message>
                   <author>..........</author>        
                   <text>...............</text>
            </message>
       </document>
       ...
    </documents>
    In the previous solution (Martin's), I used the Read XML operator and set the "XPath for attributes" property to extract all of the <text> nodes for each document. But in the new solution, as you explained, the "Cut Document" operator nicely separates each document, which is then passed through the "Loop Collection" operator. This is where I need to extract all of the <text> nodes in each document (e.g. via XPath) and convert them into one attribute for my example set. But I cannot get all of the <text> nodes for each document. Do you think there is any solution for this?
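
    For reference, this is roughly how I imagine the Loop Collection step would have to change. It is only a sketch: the "XPath" query type string, the //text expression and the Combine Documents operator are my guesses (and depending on the XPath settings the element names may need a namespace prefix). The idea is that Cut Document would produce one sub-document per <text> node and Combine Documents would glue them back together before Documents to Data:

    <operator activated="true" class="loop_collection" compatibility="6.2.000" expanded="true" height="76" name="Loop Collection (2)" width="90" x="246" y="75">
      <parameter key="set_iteration_macro" value="true"/>
      <process expanded="true">
        <!-- cut each <document> into one sub-document per <text> node -->
        <operator activated="true" class="text:cut_document" compatibility="6.1.000" expanded="true" height="60" name="Cut Texts" width="90" x="45" y="75">
          <parameter key="query_type" value="XPath"/>
          <list key="string_machting_queries"/>
          <list key="regular_expression_queries"/>
          <list key="regular_region_queries"/>
          <list key="xpath_queries">
            <parameter key="txt" value="//text"/>
          </list>
          <list key="namespaces"/>
          <list key="index_queries"/>
          <list key="jsonpath_queries"/>
          <process expanded="true">
            <connect from_port="segment" to_port="document 1"/>
            <portSpacing port="source_segment" spacing="0"/>
            <portSpacing port="sink_document 1" spacing="0"/>
            <portSpacing port="sink_document 2" spacing="0"/>
          </process>
        </operator>
        <!-- merge the extracted <text> fragments back into a single document -->
        <operator activated="true" class="text:combine_documents" compatibility="6.1.000" expanded="true" height="76" name="Combine Documents" width="90" x="179" y="75"/>
        <operator activated="true" class="text:documents_to_data" compatibility="6.1.000" expanded="true" height="76" name="Documents to Data" width="90" x="313" y="75">
          <parameter key="text_attribute" value="text"/>
        </operator>
        <connect from_port="single" to_op="Cut Texts" to_port="document"/>
        <connect from_op="Cut Texts" from_port="documents" to_op="Combine Documents" to_port="documents 1"/>
        <connect from_op="Combine Documents" from_port="document" to_op="Documents to Data" to_port="documents 1"/>
        <connect from_op="Documents to Data" from_port="example set" to_port="output 1"/>
        <portSpacing port="source_single" spacing="0"/>
        <portSpacing port="sink_output 1" spacing="0"/>
        <portSpacing port="sink_output 2" spacing="0"/>
      </process>
    </operator>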

    Thanks in advance.
  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    Hi,

    Looks to me like an XPath can solve this.
    Have you tried the import wizard?

    Sadly I have no time to try it myself, but I guess it works.

    best
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • mohammadreza Member Posts: 23 Contributor II
    Thanks for the answer, Martin; XPath does solve this problem in the Read XML operator. But Read XML cannot handle a 160 MB file, so I am playing around with Fras' solution, and I need to use XPath in that one. Any ideas, please?
  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    The file size should be no problem for Read XML.
    The wizard might get slow because it caches the file at some point, but it still works.
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • mohammadreza Member Posts: 23 Contributor II
    Hi Martin. That's interesting about Read XML. But I used it on my 160 MB of XML data and waited 2 days and 4 hours (52 hours in total) on a system with 32 GB of memory. After 52 hours the process was still busy with Read XML, so I stopped it, thinking that something was wrong. Do you think I should have waited longer, or is something perhaps wrong with big files? As an experiment, I split the file into several pieces and got results after 9 hours. In neither case did I use the import wizard, so I am sure that my XPath expressions are correct. This experiment might be helpful for others. Please let me know what you think about it.
  • xmlguy Member Posts: 1 Learner III
    Why not use a tool designed for splitting XML? Over on Stack Overflow, an answer to the following question lists some tools:
    http://stackoverflow.com/questions/700213/xml-split-of-a-large-file/7823719#7823719