
"Mining online reviews for sentiment analysis"

janjanjanjan Member Posts: 1 Learner III
edited June 2019 in Help
I am trying to collect reviews of a specific product from Amazon in order to do sentiment analysis, applying a classification model to predict positive or negative attitudes. Two questions:

1) Regarding getting the data: how do you limit the crawl to just the reviews? The reviews for the product span several pages, and each page link looks like this:
http://www.amazon.com/Rainbow-Loom-Twistz-Bandz/product-reviews/B00DMC6KAC/ref=cm_cr_pr_btm_link_2?ie=UTF8&pageNumber=2&showViewpoints=0&sortBy=byRankDescending

...with the pageNumber value in the link changing from page to page. I want to crawl just these pages, but each review page has tons of other links, e.g. to amazon.com, to online ads, etc. Is there a character (like *) that I can use in place of the page number to specify that I want to crawl only these links?

2) How can I get each individual review (there are several on a page) into its own text document (or maybe its own field in a database record) so it can be classified?

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    I suppose you are using the Crawl Web operator to crawl the pages. That operator supports regular expressions in the crawling rules. You'll find tons of documentation for regular expressions on the web. The pattern for an arbitrary number of digits is \d+ (\d = one digit, + means one or more of them).

    To split the reviews, one option would be to use Process Documents on the crawled pages and, inside it, use Split Documents to split each complete page into single reviews.

    Best regards,
    Marius
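
The two suggestions above can be illustrated outside RapidMiner with a small Python sketch. It is not the Crawl Web or Split Documents operator, only an illustration of the same ideas: a regular expression with \d+ that accepts only the paginated review URLs from the question, and a split on a delimiter that turns one page into one text per review. The function names and the "---" delimiter are hypothetical; on a real page the delimiter would have to be taken from the page's actual markup.

```python
import re
from typing import List

# Assumed pattern, built from the URL in the question: anything after the
# product ID is allowed as long as a pageNumber=<digits> parameter is present.
REVIEW_PAGE = re.compile(
    r"https?://www\.amazon\.com/Rainbow-Loom-Twistz-Bandz/product-reviews/B00DMC6KAC/"
    r".*[?&]pageNumber=\d+.*"
)


def is_review_page(url: str) -> bool:
    """Return True only if the whole URL matches the review-page pattern."""
    return REVIEW_PAGE.fullmatch(url) is not None


def split_reviews(page_text: str, delimiter: str) -> List[str]:
    """Split one page of text into one string per review.

    `delimiter` is a regular expression marking the boundary between reviews
    (hypothetical here; it depends on the structure of the crawled page).
    """
    parts = re.split(delimiter, page_text)
    # Drop empty fragments left over at the start or end of the page.
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    url = (
        "http://www.amazon.com/Rainbow-Loom-Twistz-Bandz/product-reviews/"
        "B00DMC6KAC/ref=cm_cr_pr_btm_link_2?ie=UTF8&pageNumber=2"
        "&showViewpoints=0&sortBy=byRankDescending"
    )
    print(is_review_page(url))
    print(is_review_page("http://www.amazon.com/gp/help/customer/display.html"))
    print(split_reviews("Great toy\n---\nBroke quickly\n---\n", r"\n---\n"))
```

The pattern is matched against the full URL, so links to other parts of the site or to ads are rejected even if they happen to contain digits.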
  • sourabhchoudhar Member Posts: 6 Contributor II
    Hi Marius

    I want to get the previous year's news from the web using the Crawl Web operator. I am crawling, but it only gives me results from a few months back, even when I increase the depth. Can you guide me on how to refine my search to get the best historic data from websites?

    Thanks
    Sourabh Choudhary
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi Sourabh,

    That depends completely on the websites: you have to define the correct crawling rules, maybe combined with filters on the retrieved documents afterwards.
    Unfortunately, there is no general rule; you really have to look into the structure of the websites.

    Best regards,
    Marius
  • sourabhchoudhar Member Posts: 6 Contributor II
    Hi Marius,

    Thanks for your suggestions. I am trying out combinations of filters and crawling rules. As soon as I am able to do exactly what I want, I will share it on the forum.


    Regards

    Sourabh
  • sourabhchoudhar Member Posts: 6 Contributor II
    Hi Marius

    I want to search the web (social media and forums, blogs, search engines, news websites, news blogs, etc.) for relevant, valuable information about a specific keyword or name using RapidMiner. Please help me with how I can do this.

    Thanks

    Sourabh