Parsing out plain text from the Reuters RCV1 corpus - XPath, XML
I have a question about reading node content out of several XML files with XPath. I am fully aware that there are masses of resources on the internet on this issue, and please believe me, it is really driving me crazy. I want to extract information from the files of the Reuters RCV1 experimental corpus. All files in this corpus share the same structure; I post it here as an example:
<?xml version="1.0" encoding="iso-8859-1" ?>
<newsitem itemid="1000000" id="root" date="xxx" xml:lang="en">
<title>title title title</title>
<headline>headline headline headline</headline>
<byline>Jack Daniels</byline>
<dateline>Blabla</dateline>
<text>
<p> Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 </p>
<p> Paragraph 2 Paragraph 2 Paragraph 2 Paragraph 2 Paragraph 2 </p>
<p> Paragraph 3 Paragraph 3 Paragraph 3 Paragraph 3 Paragraph 3 </p>
<p> Paragraph 4 Paragraph 4 Paragraph 4 Paragraph 4 Paragraph 4 </p>
</text>
<copyright>(c) Reuters Limited 1996</copyright>
<metadata>
<codes class="bip:countries:1.0">
<code code="MEX">
<editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="1996-02-20"/>
</code>
</codes>
<codes class="bip:topics:1.0">
<code code="xxx">
<editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="1996-08-20"/>
</code>
<code code="xxx">
<editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="xxx"/>
</code>
<code code="xxx">
<editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="xxx"/>
</code>
<code code="xxx">
<editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="xxx"/>
</code>
<code code="xxx">
<editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="xxx"/>
</code>
</codes>
<dc element="dc.publisher" value="Reuters Holdings Plc"/>
<dc element="dc.date.published" value="xxx"/>
<dc element="dc.source" value="Reuters"/>
<dc element="dc.creator.location" value="xxx"/>
<dc element="dc.creator.location.country.name" value="xxx"/>
<dc element="dc.source" value="Reuters"/>
</metadata>
</newsitem>
The final goal of my task is to transfer these several thousand files into a table or a CSV file. I am doing this by addressing the different node contents via their XPath addresses. This works without any problem for every field but one: the content of <text></text>. With //newsitem/text/p/node() I always get only the first paragraph. What I am looking for is to extract the plain text of all paragraphs, so the CSV file should look approximately like this:
title, headline, date, text, location
titleblabla, headlineblabla, xxx, paragraph 1 paragraph 2 paragraph 3, anywhere
othertitleblabla, otherheadlineblabla, otherdatexxx, other paragraph 1 paragraph 2 paragraph 3, nowhere
The paragraphs of each item should thus be collapsed into one field. With the query /newsitem/text I get the whole text body, but including all the tags, which is annoying with so many files.
Could somebody please be so kind as to explain how to achieve this with XPath? The additional difficulty is that I have to parse out the other information at the same time, so the plain text and the attributes should end up in the same row of the table.
Thank you very much,
a desperate xml/xpath newbie
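For illustration, here is a minimal sketch of the collapsing step outside RapidMiner. It assumes Python with lxml (neither is part of the setup described in this thread), uses the file layout shown in the example above, and the file name and the choice of dateline as the "location" column are placeholders. The key point is that XPath 1.0's string() and normalize-space() return the concatenated text of all descendant nodes, so a single expression yields the whole text body without tags.

# Sketch only (not the RapidMiner process from this thread): collapse all <p>
# paragraphs of one RCV1 news item into a single string and write it into the
# same CSV row as the other fields.
from lxml import etree
import csv

def extract_row(path):
    tree = etree.parse(path)
    return {
        "title":    tree.xpath("string(/newsitem/title)"),
        "headline": tree.xpath("string(/newsitem/headline)"),
        "date":     tree.xpath("string(/newsitem/@date)"),
        # normalize-space() concatenates the text of all <p> children,
        # so one expression returns the whole text body without any tags.
        "text":     tree.xpath("normalize-space(/newsitem/text)"),
        # dateline is one possible source for the "location" column (an assumption).
        "location": tree.xpath("string(/newsitem/dateline)"),
    }

with open("rcv1.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.DictWriter(out, fieldnames=["title", "headline", "date", "text", "location"])
    writer.writeheader()
    writer.writerow(extract_row("1000000newsML.xml"))  # placeholder file name

Whether RapidMiner's XPath query field accepts string-valued expressions like normalize-space(/newsitem/text) is not verified here, but the expression itself is plain XPath 1.0.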
Answers
Two add-on questions:
1) I am sure I am not the only one here working with the Reuters Corpus Volume 1 & 2: do you know of a way to read in the whole corpus more efficiently, e.g. into a database?
2) Although it is just text, albeit 800k files, RapidMiner has enormous memory problems parsing the files on a 4 GB RAM machine. Is that normal? Because of that I had to manually split the files into six parts of a little more than 600 MB each, and parsing one of these splits takes around 70 minutes. I suspect that memory fills up and the system slows down. Isn't there a way to tell RapidMiner to serialize the whole process, i.e. read one file, extract the information I need via XPath, write that result to the CSV, and only then move on to the next file? Other software does this and seems to be much more efficient, although it cannot achieve what I can do with RapidMiner.
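As a sketch of what such a serialized run could look like outside RapidMiner (again assuming Python with lxml and the standard sqlite3 module, plus a hypothetical rcv1/ directory holding the files), each file is parsed, reduced to one row, and inserted into a SQLite database immediately, so memory usage stays roughly flat no matter how many files are processed:

# Streaming sketch (not RapidMiner): one file in, one row out, written right away.
import glob
import sqlite3
from lxml import etree

def extract_row(path):
    tree = etree.parse(path)
    return (
        tree.xpath("string(/newsitem/title)"),
        tree.xpath("string(/newsitem/headline)"),
        tree.xpath("string(/newsitem/@date)"),
        tree.xpath("normalize-space(/newsitem/text)"),   # collapsed plain text
    )

con = sqlite3.connect("rcv1.db")
con.execute("CREATE TABLE IF NOT EXISTS newsitems (title TEXT, headline TEXT, date TEXT, body TEXT)")

batch = []
for path in glob.iglob("rcv1/**/*.xml", recursive=True):   # hypothetical directory layout
    batch.append(extract_row(path))
    if len(batch) >= 1000:                                  # write out in small batches
        con.executemany("INSERT INTO newsitems VALUES (?, ?, ?, ?)", batch)
        con.commit()
        batch.clear()

if batch:                                                    # flush the last partial batch
    con.executemany("INSERT INTO newsitems VALUES (?, ?, ?, ?)", batch)
    con.commit()
con.close()

Committing in batches of about 1000 rows keeps transaction overhead low; the same loop could just as well append rows to a CSV file instead of a database.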
Any help would be very much appreciated!
best regards,
amdk
Is this what you're after?
JEdward.
Thank you for your post. But it is really strange, because the "Read XML" operator seems not to exist anymore after I updated RapidMiner this morning; your operator just appears as a dummy. Is this unique to my machine or my setup? Nevertheless, I tried the Read XML operator before, and it doesn't work for me because I want to read in 800k XML files instead of just one. Therefore I used the Process Documents from Files operator, then the Generate Extract operator to read out the paths I need, and wrote the result to a CSV file with the Write CSV operator. As I said, all parts but one work like a charm. The problem is the text part, because it is split up into paragraphs, and it seems that this is a problem for RapidMiner's XPath parser.
I'm using what sounds like the same method on some other XML documents.
If I manage to get a few moments today I'll try to have a quick look at what differences there are between my XML documents & process and yours.
Thanks,
JEdward.