Xpath Problem regarding multiple extracts
Hi. I can't manage to extract several items (for example, titles or other text) from an HTML document with the RapidMiner XPath processor. With the XPath query //h:title/text(), the Extract Information operator returns only the first title and writes it to the resulting metadata attribute, while other XPath visualizers (XPather etc.) show, for example, all three titles for the same query. Is this a RapidMiner problem, an operator limitation, or just stupidity on my part?
Answers
I have run into this problem before as well: for XPath and regular expressions, only the first match is delivered. I am not sure whether RapidMiner offers a way to get a collection of all matches (Java should allow this easily).
Which operator do you use for the extraction? I worked around this problem by using the document type (instead of ExampleSet) and extracting all matches as single documents in a collection for the original document. You can easily achieve this with the "Cut Document" operator. This worked well in my case, but I would also like to know whether there are other ways to handle multiple matches.
Regards,
Matthias
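As a rough illustration of the remark that Java makes this easy, here is a minimal sketch using the JDK's standard javax.xml.xpath API outside RapidMiner. It is not RapidMiner's operator code; the class name, the method names, and the sample document (which omits the h: namespace to stay short) are assumptions made for this example. Requesting a NODESET returns every match, while the plain string evaluation yields only the first one, which resembles the behaviour described above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of RapidMiner.
class AllXPathMatches {

    // Returns the text content of every node matched by the expression.
    static List<String> extractAll(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> matches = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                matches.add(nodes.item(i).getTextContent());
            }
            return matches;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate: " + expression, e);
        }
    }

    // String evaluation: XPath 1.0 converts a node set to the
    // string value of its first node in document order.
    static String extractFirst(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample document with three titles.
        String xml = "<doc><title>First</title><title>Second</title><title>Third</title></doc>";
        System.out.println(extractFirst(xml, "//title/text()")); // First
        System.out.println(extractAll(xml, "//title/text()"));   // [First, Second, Third]
    }
}
```

For a real XHTML page, the h: prefix would additionally have to be bound through a NamespaceContext set on the XPath object.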
Actually, there should be no problem nesting multiple "Cut Document" operators inside each other. The memory consumption of documents should be fairly low.
We could also add an option to include all matches in a meta information attribute. I will note this down for the next version.
Greetings,
Sebastian
If you have the forum page split into documents, I suppose the HTML structure should be the same for each posting. In that case, you can directly address the XPath matches you want for a particular piece of information by using the proper predicates.
Here is a simple example: if you want to grab the author, you can use the default first match, /div/div, but you won't get the other information this way.
Using predicates, you can easily extract them into different attributes (which in this case might be an advantage, because you get named entities instead of a list of matches). Note that a numeric predicate such as [1] is simply shorthand for the boolean predicate [position()=1].
Regards,
Matthias
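To make the predicate idea concrete, here is a small sketch with the JDK's javax.xml.xpath API. The posting structure (author, date, and text as sibling div elements), the sample values, and the class and method names are all assumptions for illustration; a real forum page will look different.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical example class; the posting layout below is made up.
class PositionPredicates {

    // Evaluates the expression and returns the string value of the first match.
    static String evaluate(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        String posting = "<div><div>alice</div><div>2010-03-01</div><div>Post text</div></div>";

        // Default first match and the two equivalent position predicates.
        System.out.println(evaluate(posting, "/div/div"));               // alice
        System.out.println(evaluate(posting, "/div/div[1]"));            // alice
        System.out.println(evaluate(posting, "/div/div[position()=1]")); // alice

        // Other positions can be written to separately named attributes.
        System.out.println(evaluate(posting, "/div/div[2]"));            // 2010-03-01
        System.out.println(evaluate(posting, "/div/div[3]"));            // Post text
    }
}
```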
Thanks, that's what I tried to do. But there is a problem if the forum post contains an unknown number n of citations (queried, for example, by an XPath expression such as //div[@style='italic']) or, in your example, n authors. The [1] will give you the first author, but how do you handle multiple authors (for an unknown n)?
Usually, XPath queries (for example, those starting with //) automatically return every author or citation that matches the expression, but RapidMiner just writes the first matching author to the metadata attribute. That is the case with the Extract Information operator; the Cut Document operator seems to work correctly, although you then have to find a way to merge the n cut documents back together, since each of the n citations has to be assigned to the posting it belongs to.
If you have an unknown number of relevant elements, the predicate of course isn't of much help. Using "Cut Documents" is the only way I know of to make RapidMiner deliver something like the usual enumeration of multiple matches.
To assign the citations (inner "Cut Documents" operator) to the postings (outer "Cut Documents" operator) you could perhaps assign an id (unique for each posting, equal for all citations belonging to a posting). The use of a counting variable for the outer "Cut Documents" should do the job. Working with the document type might be hard in this case - perhaps you should consider a conversion to ExampleSets.
Regards,
Matthias
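The id idea can be sketched in plain Java, outside RapidMiner, with the JDK's DOM and javax.xml.xpath APIs: an outer query selects the postings, a relative inner query selects the citations of each posting, and a counter serves as the posting id. The class name, the class='post' marker, and the sample markup are assumptions for this example; only the //div[@style='italic'] idea comes from the thread.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class; not RapidMiner code.
class CitationsPerPosting {

    // Maps a running posting id (1, 2, ...) to the citations found inside that posting.
    static Map<Integer, List<String>> citationsByPosting(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Outer query: one node per posting (assumed to be marked with class='post').
            NodeList postings = (NodeList) xpath.evaluate(
                    "//div[@class='post']", doc, XPathConstants.NODESET);

            Map<Integer, List<String>> result = new LinkedHashMap<>();
            for (int i = 0; i < postings.getLength(); i++) {
                Node posting = postings.item(i);
                // Inner query: relative to the current posting, so matches stay assigned to it.
                NodeList citations = (NodeList) xpath.evaluate(
                        ".//div[@style='italic']", posting, XPathConstants.NODESET);
                List<String> texts = new ArrayList<>();
                for (int j = 0; j < citations.getLength(); j++) {
                    texts.add(citations.item(j).getTextContent());
                }
                result.put(i + 1, texts); // the counter acts as the posting id
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not process document", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html>"
                + "<div class='post'><div style='italic'>q1</div><div style='italic'>q2</div><p>reply</p></div>"
                + "<div class='post'><p>no citation here</p></div>"
                + "<div class='post'><div style='italic'>q3</div></div>"
                + "</html>";
        System.out.println(citationsByPosting(page)); // {1=[q1, q2], 2=[], 3=[q3]}
    }
}
```

Each (id, citation) pair could then become one row of a table, which corresponds to the suggested conversion to ExampleSets.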
Has this option been implemented yet? I am currently working with XML files that have multiple matches (e.g. http://dblp.uni-trier.de/rec/bibtex/journals/umuai/ParamythisWM10.xml with three <author> elements).
Sorry, but I don't think so. If you are familiar with Java, though, you could probably add this quite easily yourself.
Greetings,
Sebastian