Extract data from XML files

Lei · November 2021

I have many XML files. They share a similar structure but differ in some details.

The XML structure looks roughly like this:

<article>
<art-front>
<titlegrp>
<title>Integrated phytoremediation</title>

</titlegrp>
<abstract>

<p>Phytoremediation is green rehabilitation technology .</p>
</abstract>
</art-front>
<art-body>
<section>
<title>One thing</title>
<p>the main technologies 1...</p>
<p>the main technologies 2...</p>
</section>
<section>

<title>Others</title>
<subsect1>
<p>the main technologies 3...</p>
<p>the main technologies 4...</p>
<p>the main technologies 5...</p>
</subsect1>

</section>
</art-body>
<art-back>
<biblist title="References">
<citauth>

</citauth>

</biblist>

</art-back>

</article>

The files differ in what sits between <art-body> and </art-body>. Some files have four <section> elements, some have five, and so on, and the number of <p> elements inside each <section> can also differ. In addition, some files have no <subsect> content at all, only multiple <section> elements.

I want to extract the contents of <art-front> and <art-body>, but not the contents of <art-back>.

I know that the Read XML operator can extract content from an XML file, and that the Read Document operator can do it too. Because my XML files are not exactly the same, I don't know how to handle them. Is there a way to do this?

Thanks

BalazsBarany · November 2021

Hi!

In these cases I usually build the process with multiple Read XML operators.

One would extract the common information, e.g. from the constant header. Another would extract the variable information, like the repeating entries. I can then join the results, e.g. based on the file name or some other common attribute.

Use the most specific XPath for selecting what you need in each Read XML and figure out which join is the best for the task.

Regards,
Balázs
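
For readers who want to see the idea outside RapidMiner, here is a minimal Python sketch of the approach described above: one extraction for the constant header, one for the variable body, joined on the file name. It is only an illustration, not the RapidMiner process itself. The sample document, the file name example.xml and the helper names (text_of, extract_front, extract_body, extract_file) are assumptions made up for this example, and the sample is a well-formed variant of the structure in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modelled on the structure in the question.
SAMPLE = """<article>
<art-front>
<titlegrp><title>Integrated phytoremediation</title></titlegrp>
<abstract><p>Phytoremediation is green rehabilitation technology.</p></abstract>
</art-front>
<art-body>
<section>
<title>One thing</title>
<p>the main technologies 1...</p>
<p>the main technologies 2...</p>
</section>
<section>
<title>Others</title>
<subsect1>
<p>the main technologies 3...</p>
</subsect1>
</section>
</art-body>
<art-back>
<biblist title="References"><citauth>ignored</citauth></biblist>
</art-back>
</article>"""


def text_of(elem):
    """Return all text inside an element with whitespace collapsed."""
    if elem is None:
        return ""
    return " ".join("".join(elem.itertext()).split())


def extract_front(root):
    """Constant part: title and abstract, one record per file."""
    front = root.find("art-front")
    if front is None:
        return {"title": "", "abstract": ""}
    return {
        "title": text_of(front.find(".//title")),
        "abstract": text_of(front.find("abstract")),
    }


def extract_body(root):
    """Variable part: one record per <p>, however deeply it is nested."""
    rows = []
    for section in root.findall("art-body/section"):
        heading = text_of(section.find("title"))
        # iter("p") also reaches paragraphs inside <subsect1>, if present.
        for p in section.iter("p"):
            rows.append({"section": heading, "text": text_of(p)})
    return rows


def extract_file(file_name, xml_text):
    """Join the two extractions on the file name."""
    root = ET.fromstring(xml_text)
    front = extract_front(root)
    return [{"file": file_name, **front, **row} for row in extract_body(root)]


rows = extract_file("example.xml", SAMPLE)
for row in rows:
    print(row["file"], "|", row["section"], "|", row["text"])
```

Because the body extraction only looks under art-body, nothing from art-back is picked up, and the number of section or p elements does not need to be known in advance.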
