Generate meta data for a document
Dear RapidMiners,
I'd like to know whether it is possible to generate meta data for a document. Here is the context:
In my web/text mining project I have an example set, in which one attribute is a URL request link. I need to mine the content to which this link points and store it back in the original example set.
My current implementation is: I fork my process such that my original example set flows through one branch, while in the other branch I loop over the values of my URL request link attribute, access the content using the Get Page operator, and extract the information I need from there. The problem is: how do I merge the new example set I got here with the original one?
My idea was to merge the two example sets, using the URL request link as a unique identifier. However, I don't know how to do it :-( What I tried was to use the URL I get as part of the meta data attached to the output of the Get Page operator. However, that URL is slightly modified with respect to the one I used as the input to Get Page:
URL in the meta data of the Get Page output:
Although it looks like only a small part of the link is modified, according to one of your developers, we cannot have any guarantee on the modification pattern.
My next idea is to somehow attach the input URL, which I can access as the %{loop_value} macro, as meta data to the document I get out of Get Page. However, I didn't find a way to do this.
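To make the idea concrete outside of RapidMiner, here is a minimal Python sketch of the same pattern: each fetched result is stored under the URL that was requested (the role %{loop_value} plays in the loop), not under whatever URL the fetch reports back, and the original rows are then joined on that key. This is not a RapidMiner process. The names `fetch_content_by_url`, `merge_on_url`, and `fake_fetch` are hypothetical, and `fake_fetch` is a stand-in for a real page download so the example runs offline.

```python
from typing import Callable, Dict, List


def fetch_content_by_url(urls: List[str], fetch: Callable[[str], str]) -> Dict[str, str]:
    """Fetch each URL and key the result by the URL that was requested."""
    fetched = {}
    for requested_url in urls:
        # The requested URL is the key, even if the fetch was redirected
        # and the page reports a slightly different URL.
        fetched[requested_url] = fetch(requested_url)
    return fetched


def merge_on_url(rows: List[dict], fetched: Dict[str, str], url_key: str = "url") -> List[dict]:
    """Left-join the fetched content onto copies of the original rows."""
    merged = []
    for row in rows:
        new_row = dict(row)  # leave the original row untouched
        new_row["content"] = fetched.get(row[url_key])  # None if nothing was fetched
        merged.append(new_row)
    return merged


def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for a real download."""
    return "content of " + url


rows = [
    {"id": 1, "url": "http://example.com/a"},
    {"id": 2, "url": "http://example.com/b"},
]
fetched = fetch_content_by_url([r["url"] for r in rows], fake_fetch)
merged = merge_on_url(rows, fetched)
```

Because the join key never leaves the loop, it does not matter how the returned URL was modified.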
Does anyone have any idea how I could go about my problem? Any inputs would be much appreciated!
Cheers,
Snežana
Answers
Hi @s_nektarijevic,
Can't you use the Annotation feature for this?
~Martin
Dortmund, Germany
Dear Martin,
Many thanks for your reply and suggestion!
Perhaps this could solve my problem, but I don't see how :-( I am new to RapidMiner and still struggling with getting the data I need in the format I want.
What I do is: I loop over the URL request links and retrieve the content, which gives me a collection of documents. What I need is to add a piece of information to each of these documents that points me back to the corresponding %{loop_value}. I then use the Process Documents operator to transform my document collection into an example set. If I add an annotation per document, I don't see how I can get that annotation into the same example set in which I store the result of Process Documents. Is that possible?
I am attaching my code too, in case it is useful.
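For illustration only, this is roughly what I would like to end up with, sketched in Python and not as a RapidMiner process: every document carries a label, and that label becomes an ordinary column next to the word counts when the documents are turned into rows. The function name `documents_to_rows` and the `source_url` and `term_` column names are made up for this sketch, and the tokenization is a plain whitespace split.

```python
from collections import Counter
from typing import Dict, List, Tuple


def documents_to_rows(labelled_docs: List[Tuple[str, str]]) -> List[Dict[str, object]]:
    """Turn (source_url, text) pairs into rows of term counts plus the label."""
    rows = []
    for source_url, text in labelled_docs:
        counts = Counter(text.lower().split())  # naive whitespace tokenization
        row: Dict[str, object] = {"source_url": source_url}  # label travels with the row
        for term, count in counts.items():
            row["term_" + term] = count
        rows.append(row)
    return rows


docs = [
    ("http://example.com/a", "Data mining data"),
    ("http://example.com/b", "text mining"),
]
table = documents_to_rows(docs)
```

With the label stored as a column, the resulting rows could be matched to the original example set by that column.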
Thanks a lot for looking into this!
Snežana
Hi all,
I'd like to conclude this topic. I found a solution that doesn't require me to generate meta data. Sorry for the noise :-)
Snežana