
Web page selection.

ratheesan Member Posts: 68 Maven
Hi,
How can I select the contents of a particular web page using RM? I tried it with the crawler, but I am getting more pages than I specified.

Thanks,
Ratheesan

Answers

  • fischer Member Posts: 439 Maven
    Hi,

    The question is unclear. What exactly do you mean by "contents"? Do you want only a specific web page (or a list of pages)? Or do you want to extract information from the web page?
    Please specify.

    Cheers,
    Simon
  • ratheesan Member Posts: 68 Maven
    Hi Simon,
    I want to extract information from a web page. If I can copy the contents of the web page into a text file, I can then apply text mining algorithms. So for now I need to copy the web page into a text file.

    Thanks
    Ratheesan.
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    I guess you could set the "max_depth" parameter to zero. The crawler should then not follow any links.

    With RapidMiner 5 there will soon be a web mining extension that makes this easier.

    Greetings,
    Sebastian
  • ratheesan Member Posts: 68 Maven
    Hi,

    I tried the above method and saved the result as a text file. The saved text contains HTML tags, image URLs, etc. Is there any way to save only the text (the text a user sees when opening the web page)?

    Thanks,
    Ratheesan
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    With 5.0 this would be easy. In 4.x you can only set the contenttype of TextInput to html, so that all tags are filtered out.

    Greetings,
    Sebastian
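
For readers who want to see what "only the text a user sees" means in practice, here is a minimal sketch done outside RapidMiner with the Python standard library. It is not how TextInput or the crawler work internally; the names `VisibleTextExtractor`, `html_to_text` and `fetch_page_text` are hypothetical and chosen only for illustration. It fetches exactly one URL (no links are followed, roughly the effect of a crawl depth of zero) and keeps the text nodes while dropping tags, attributes such as image URLs, and script/style content.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes and ignores anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # non-empty pieces of visible text

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Tags and attributes (e.g. <img src=...>) never reach this method,
        # so only character data between tags is collected here.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the text content of an HTML string, one chunk per line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def fetch_page_text(url, encoding="utf-8"):
    """Download a single page (no link following) and return its text.

    The encoding default is an assumption; real pages may declare another.
    """
    with urlopen(url) as response:
        html = response.read().decode(encoding, errors="replace")
    return html_to_text(html)
```

The result of `fetch_page_text(...)` can be written to a plain text file and then loaded into whatever text mining process you use. Note that this only removes markup; it does not try to separate the main article from menus or footers.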