Generate meta data for a document

s_nektarijevic RapidMiner Certified Analyst, Member Posts: 12 Contributor II
edited December 2018 in Help

Dear RapidMiners,

 

I'd like to know whether it is possible to generate meta data for a document. Here is the context:

 

In my web/text mining project I have an example set, in which one attribute is a URL request link. I need to mine the content to which this link points and store it back in the original example set.

 

My current implementation is: I fork my process such that my original example set flows through one branch, while in the other branch I loop over the values of my URL request link attribute, access the content using the Get Page operator, and extract the information I need from there. The problem is: how do I merge the new example set I got here with the original one?

 

My idea was to merge the two example sets using the URL request link as a unique identifier. However, I don't know how to do it :-( What I tried was to use the URL that I get as part of the meta data attached to the output of the Get Page operator. However, that URL is slightly modified with respect to the one I used as the input to Get Page:

 

Input link:
http://s2027422842.t.en25.com/e/er?utm_campaign=Recently%20Posted%20Guidance%20Documents%206%2F4%2F2018&utm_medium=email&utm_source=Eloqua&s=2027422842&lid=3746&elqTrackId=C6E3FEC14CE46AD6E741C4220F54E866&elq=78713a7a038c4be4a166b543d4ed17a9&elqaid=3762&elqat=1

 

URL in the meta data of the Get Page output:

http://s2027422842.t.eloqua.com/e/er?utm_campaign=Recently%20Posted%20Guidance%20Documents%206%2F4%2F2018&utm_medium=email&utm_source=Eloqua&s=2027422842&lid=3746&elqTrackId=C6E3FEC14CE46AD6E741C4220F54E866&elq=78713a7a038c4be4a166b543d4ed17a9&elqaid=3762&elqat=1

 

Although it looks like only a small part of the link is modified, according to one of your developers there is no guarantee about the modification pattern, so I cannot reliably map one URL back to the other.

 

My next idea is to somehow attach the input URL, which I can access through the %{loop_value} macro, as meta data of the document I get out of Get Page. However, I didn't find a way to do this.
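
To make the idea concrete outside of RapidMiner, here is a minimal Python sketch of what I am after: each fetched document keeps the input URL (the value that %{loop_value} holds in my loop) next to its content, and the join back to the original rows uses that input URL rather than the URL reported after fetching. The names `fetch_documents`, `join_on_input_url`, `fake_fetch`, `input_url` and `page_text` are made up for illustration, and `fake_fetch` merely stands in for Get Page so that the example runs without network access.

```python
from typing import Callable, Dict, List


def fetch_documents(urls: List[str], fetch: Callable[[str], str]) -> List[Dict[str, str]]:
    """Fetch each URL and store the *input* URL alongside the content."""
    docs = []
    for input_url in urls:  # input_url plays the role of %{loop_value}
        docs.append({"input_url": input_url, "text": fetch(input_url)})
    return docs


def join_on_input_url(rows: List[dict], docs: List[Dict[str, str]],
                      url_field: str = "url") -> List[dict]:
    """Add the fetched text to each original row, matched on the input URL."""
    text_by_url = {doc["input_url"]: doc["text"] for doc in docs}
    return [{**row, "page_text": text_by_url.get(row[url_field])} for row in rows]


def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for Get Page; returns dummy content."""
    return f"content of {url}"


# Original "example set": one row per URL request link.
rows = [
    {"id": 1, "url": "http://example.com/a"},
    {"id": 2, "url": "http://example.com/b"},
]

docs = fetch_documents([row["url"] for row in rows], fake_fetch)
merged = join_on_input_url(rows, docs)
```

Because the key is the URL I passed in, any rewriting of the address during the request does not affect the join.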

 

Does anyone have any idea how I could go about my problem? Any inputs would be much appreciated!

Cheers,

Snežana

Answers

  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist

    Hi @s_nektarijevic,

    Can't you use the Annotation feature for this?

     

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • s_nektarijevic RapidMiner Certified Analyst, Member Posts: 12 Contributor II

    Dear Martin,

     

    Many thanks for your reply and suggestion!

     

    Perhaps this could solve my problem, but I don't see how :-( I am new to RapidMiner and still struggling with getting the data I need in the format I want.

     

    What I do is: I loop over the URL request links and retrieve the content, which gives me a collection of documents. What I need is to add a piece of information to each of these documents that points me back to the corresponding %{loop_value}. I then use the Process Documents operator to transform my document collection into an example set. If I add an annotation per document, I don't see how I can get that annotation into the same example set in which I store the result of Process Documents. Is that possible?

    I am attaching my code too, in case it is useful.

     

    Thanks a lot for looking into this!

     

    Snežana

  • s_nektarijevic RapidMiner Certified Analyst, Member Posts: 12 Contributor II

    Hi all,

     

    I'd like to conclude this topic. I found a solution that doesn't require me to generate meta data. Sorry for the noise :-)

     

    Snežana