"Parsing XML"

jshiller Member Posts: 5 Contributor II
edited May 2019 in Help
I've been experimenting with the REST API from LastFM. My query to the API asks for artists similar to Bono.

Here's the XML file that the query generates:
http://ws.audioscrobbler.com/2.0/?method=artist.getsimilar&artist=bono&api_key=b25b959554ed76058ac220b7b2e0a026

I'm trying to parse the XML file and generate output that provides "artist" and "match" for each of the 100 entries in the file. The current output has 200 rows, each containing the URL I'm querying, the full contents of the page, and the names of the attributes I set up with XPath queries. The output I want is one row per artist, with the artist name and its associated match number. Any advice on how to achieve this is greatly appreciated.

Thanks,
Jamie

This is what I want to see in the Data View:

image

This is what I currently see in the Data View:

image

Here's my process:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.0.8" expanded="true" name="Process">
   <process expanded="true" height="628" width="736">
     <operator activated="true" class="web:process_web" compatibility="5.0.4" expanded="true" height="60" name="Process Documents from Web" width="90" x="45" y="30">
       <parameter key="url" value="http://ws.audioscrobbler.com/2.0/?method=artist.getsimilar&amp;artist=bono&amp;api_key=b25b959554ed76058ac220b7b2e0a026"/>
       <list key="crawling_rules">
         <parameter key="0" value="http://ws.audioscrobbler.com/2.0/?method=artist.getsimilar&amp;artist=bono&amp;api_key=b25b959554ed76058ac220b7b2e0a026"/>
       </list>
       <parameter key="add_pages_as_attribute" value="true"/>
       <parameter key="max_pages" value="1"/>
       <process expanded="true" height="481" width="788">
         <operator activated="true" class="text:cut_document" compatibility="5.0.7" expanded="true" height="60" name="Cut Document" width="90" x="70" y="46">
           <parameter key="query_type" value="XPath"/>
           <list key="string_machting_queries"/>
           <list key="regular_expression_queries"/>
           <list key="regular_region_queries"/>
           <list key="xpath_queries">
             <parameter key="name" value="/h:lfm/h:similarartists/h:artist/h:name"/>
             <parameter key="match" value="/h:lfm/h:similarartists/h:artist/h:match"/>
           </list>
           <list key="namespaces"/>
           <list key="index_queries"/>
           <process expanded="true" height="463" width="702">
             <connect from_port="segment" to_port="document 1"/>
             <portSpacing port="source_segment" spacing="0"/>
             <portSpacing port="sink_document 1" spacing="0"/>
             <portSpacing port="sink_document 2" spacing="0"/>
           </process>
         </operator>
         <connect from_port="document" to_op="Cut Document" to_port="document"/>
         <connect from_op="Cut Document" from_port="documents" to_port="document 1"/>
         <portSpacing port="source_document" spacing="0"/>
         <portSpacing port="sink_document 1" spacing="0"/>
         <portSpacing port="sink_document 2" spacing="0"/>
       </process>
     </operator>
     <operator activated="true" class="write_database" compatibility="5.0.8" expanded="true" height="60" name="Write Database" width="90" x="246" y="30">
       <parameter key="connection" value="AWS RDS"/>
       <parameter key="table_name" value="artists"/>
       <parameter key="overwrite_mode" value="append"/>
     </operator>
     <connect from_op="Process Documents from Web" from_port="example set" to_op="Write Database" to_port="input"/>
     <connect from_op="Write Database" from_port="through" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
   </process>
 </operator>
</process>

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    this is really advanced parsing. Normally I would not post a complete process but would simply outline the way to go, but it's a great example of what one can do with the Text Processing and Web extensions in combination. So here's this very cool process:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.0.8" expanded="true" name="Process">
        <process expanded="true" height="628" width="736">
          <operator activated="true" class="web:get_webpage" compatibility="5.0.4" expanded="true" height="60" name="Get Page" width="90" x="45" y="30">
            <parameter key="url" value="http://ws.audioscrobbler.com/2.0/?method=artist.getsimilar&amp;artist=bono&amp;api_key=b25b959554ed76058ac220b7b2e0a026"/>
            <list key="query_parameters"/>
          </operator>
          <operator activated="true" class="text:cut_document" compatibility="5.0.7" expanded="true" height="60" name="Cut Document" width="90" x="179" y="30">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="artist" value="h:lfm/h:similarartists/h:artist"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <process expanded="true" height="279" width="743">
              <operator activated="true" class="text:extract_information" compatibility="5.0.7" expanded="true" height="60" name="Extract Information" width="90" x="335" y="30">
                <parameter key="query_type" value="XPath"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries"/>
                <list key="regular_region_queries"/>
                <list key="xpath_queries">
                  <parameter key="name" value="//h:name/text()"/>
                  <parameter key="match" value="//h:match/text()"/>
                </list>
                <list key="namespaces"/>
                <list key="index_queries"/>
              </operator>
              <connect from_port="segment" to_op="Extract Information" to_port="document"/>
              <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="5.0.7" expanded="true" height="94" name="Process Documents" width="90" x="313" y="30">
            <parameter key="create_word_vector" value="false"/>
            <process expanded="true" height="261" width="743">
              <connect from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Get Page" from_port="output" to_op="Cut Document" to_port="document"/>
          <connect from_op="Cut Document" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Greetings,
      Sebastian
  • jshiller Member Posts: 5 Contributor II
    Sebastian,

    Thanks so much for providing the complete process! This helps a lot.

    Best,

    Jamie
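
For readers who want to see the idea behind the process above outside of RapidMiner, here is a minimal Python sketch of the same two-step approach: first cut the response into one segment per artist element, then read name and match relative to each segment so the two values stay paired on one row. It only illustrates the logic and is not what RapidMiner runs internally. The sample XML is a hypothetical, trimmed-down stand-in for the artist.getsimilar response (the artist names and match values are made up), and the function name extract_rows is likewise just an assumption for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the artist.getsimilar response.
# The artist names and match values are made up for illustration.
SAMPLE = """<lfm status="ok">
  <similarartists artist="Bono">
    <artist><name>Artist A</name><match>1</match></artist>
    <artist><name>Artist B</name><match>0.75</match></artist>
  </similarartists>
</lfm>"""


def extract_rows(xml_text):
    """Return one dict per <artist> element with its name and match."""
    root = ET.fromstring(xml_text)  # root is the <lfm> element
    rows = []
    # Step 1: one segment per <artist> (comparable to Cut Document).
    for artist in root.findall("./similarartists/artist"):
        # Step 2: read both fields relative to that segment
        # (comparable to Extract Information inside Cut Document).
        rows.append({
            "name": artist.findtext("name"),
            "match": float(artist.findtext("match")),
        })
    return rows


rows = extract_rows(SAMPLE)
for row in rows:
    print(row["name"], row["match"])
```

Selecting name and match with two independent absolute paths, as in the original question, gives two separate lists of nodes rather than paired values, which is likely why the first process ended up with 200 rows instead of 100.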