Extract data from XML files
I have many XML files. They have a similar structure but differ in some details.
The XML structure is similar to the following:
<article>
<art-front>
<titlegrp>
<title>Integrated phytoremediation</title>
</titlegrp>
<abstract>
<p>Phytoremediation is green rehabilitation technology .</p>
</abstract>
</art-front>
<art-body>
<section>
<title>One thing</title>
<p>the main technologies 1...</p>
<p>the main technologies 2...</p>
</section>
<section>
<title>Others</title>
<subsect1>
<p>the main technologies 3...</p>
<p>the main technologies 4...</p>
<p>the main technologies 5...</p>
</subsect1>
</section>
</art-body>
<art-back>
<biblist title="References">
<citauth>
<fname>H.</fname>
<surname>Ali</surname>
</citauth>
</biblist>
</art-back>
</article>
The files differ between <art-body> and </art-body>. Some XML files have four <section> elements, some have five, and so on; the number of <p> elements in a <section> can also differ. In addition, some XML files have no <subsect> content, only multiple <section> elements.
I want to extract the <art-front> and <art-body> contents, but not the <art-back> content.
I know that the Read XML operator can extract content from an XML file, and that the Read Document operator can do it as well. But because my XML files are not identical in structure, I don't know how to handle them. Is there any way to do that?
Thanks
Best Answer
BalazsBarany
Hi!
In these cases I usually build the process with multiple Read XML operators.
One would extract the common information, e.g. from the constant header. Another would extract the variable information, like the repeating entries. I can then join the results, e.g. based on the file name or some other common attribute.
Use the most specific XPath that selects what you need in each Read XML operator, and figure out which join type fits the task best.
Regards,
Balázs
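The split-and-join idea from the answer can also be tried outside RapidMiner. The following is only a rough Python sketch using the standard library's xml.etree.ElementTree, not the Read XML operator itself. The sample string, the file name article1.xml, and the helper names extract_header, extract_paragraphs and extract_file are all hypothetical and assume files shaped like the one in the question. One extraction reads the fixed <art-front> header, another collects every <p> under <art-body> at any depth, and the two are joined on the file name; <art-back> is never read.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the structure in the question; real files
# would be loaded with ET.parse(path).getroot() instead.
SAMPLE = """<article>
<art-front>
<titlegrp><title>Integrated phytoremediation</title></titlegrp>
<abstract><p>Phytoremediation is green rehabilitation technology .</p></abstract>
</art-front>
<art-body>
<section>
<title>One thing</title>
<p>the main technologies 1...</p>
<p>the main technologies 2...</p>
</section>
<section>
<title>Others</title>
<subsect1>
<p>the main technologies 3...</p>
<p>the main technologies 4...</p>
<p>the main technologies 5...</p>
</subsect1>
</section>
</art-body>
<art-back>
<biblist title="References">
<citauth><fname>H.</fname><surname>Ali</surname></citauth>
</biblist>
</art-back>
</article>"""


def extract_header(root):
    """Constant part: one record per file, taken from <art-front>."""
    front = root.find("art-front")
    title = front.findtext("titlegrp/title", default="")
    abstract = " ".join(p.text or "" for p in front.findall("abstract/p"))
    return {"title": title, "abstract": abstract}


def extract_paragraphs(root):
    """Variable part: one record per <p> under <art-body>."""
    body = root.find("art-body")
    rows = []
    for index, section in enumerate(body.findall("section"), start=1):
        heading = section.findtext("title", default="")
        # ".//p" matches <p> directly in <section> and nested in <subsect1>,
        # so the number of sections and the nesting depth may vary per file.
        for p in section.findall(".//p"):
            rows.append({"section": index, "heading": heading, "text": p.text or ""})
    return rows


def extract_file(name, xml_text):
    """Join header and paragraph records on the file name."""
    root = ET.fromstring(xml_text)
    header = extract_header(root)
    return [{"file": name, **header, **row} for row in extract_paragraphs(root)]


rows = extract_file("article1.xml", SAMPLE)
for row in rows:
    print(row["file"], row["section"], row["heading"], row["text"])
```

Because the paragraph query is relative to each <section>, files with four or five sections, or without any <subsect> level, go through the same code unchanged.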