Extracting data from web pages
Hi,
I am trying to extract data from HTML pages. I tried both regular expressions and XPath queries.
I was able to extract some details using XPath queries, but the HTML page I am extracting from is so complex that I cannot make out the tag hierarchy. So it is very difficult to specify XPath queries for all the data.
Is there any other method to find out the hierarchy of the HTML, so that I can extract the details using XPath queries?
regards,
siju sony mathew
Answers
You might try to solve your problem by using tags with certain attribute values as anchors for your XPath query, for example div tags with a class, id, or name attribute.
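A minimal sketch of the anchoring idea, assuming Python's standard `xml.etree.ElementTree` (which supports only a subset of XPath) rather than RapidMiner's own operators. The page fragment and its class, id, and name values are made up for illustration, and the fragment has to be well-formed for this parser to accept it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """
<html>
  <body>
    <div id="header"><span>Site name</span></div>
    <div class="article">
      <h1>First headline</h1>
      <p name="summary">Short summary text.</p>
    </div>
    <div class="article">
      <h1>Second headline</h1>
      <p name="summary">Another summary.</p>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Anchor on the class attribute instead of spelling out the full path from <html>.
headlines = [h.text for h in root.findall(".//div[@class='article']/h1")]

# Anchor on a name attribute, wherever the tag sits in the tree.
summaries = [p.text for p in root.findall(".//p[@name='summary']")]

# Anchor on an id attribute.
site_name = root.find(".//div[@id='header']/span").text

print(headlines, summaries, site_name)
```

The `.//` prefix searches at any depth, so the query keeps working even when the layers above the anchored tag are unknown or change.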
For easier orientation in the DOM tree, you could use a DOM explorer, which is available for every browser. It shows the DOM tree in an explorer-like fashion, which makes orientation easier. Some even let you select a tag by clicking the corresponding area of the web page itself.
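If no browser is at hand, a rough outline of the tag hierarchy can also be printed with a few lines of code. This is only a sketch, assuming Python's standard `html.parser`; the `OutlineParser` class and the sample markup are hypothetical. It tracks depth naively, so tags that are left unclosed in the page will make the indentation drift.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not increase the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class OutlineParser(HTMLParser):
    """Collects one indented line per opening tag, with id/class/name attributes."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Keep only the attributes that are useful as XPath anchors.
        anchors = " ".join(f'{k}="{v}"' for k, v in attrs
                           if k in ("id", "class", "name"))
        label = tag + (f" [{anchors}]" if anchors else "")
        self.lines.append("  " * self.depth + label)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


# Hypothetical sample markup.
SAMPLE = ('<html><body><div id="main"><p class="intro">Hi<br>there</p>'
          '<img src="x.png"></div></body></html>')

parser = OutlineParser()
parser.feed(SAMPLE)
print("\n".join(parser.lines))
```

Each printed line is one tag, indented by its nesting level, with any id, class, or name attribute shown next to it as a candidate anchor.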
Greetings,
Sebastian
Thank you for your suggestion. I was able to extract data from some intranet RSS feeds.
But I am having two problems now:
1) With the user agent I am using (i.e. RapidMiner's default user agent), I am not able to crawl internet RSS feeds. Is there any user agent with which we would be able to crawl sites? I am trying to crawl www.ndtv.com, but I cannot do so with RapidMiner's default user agent. Is there any method to find out which user agents a website supports?
2) If the web page is not well-formed HTML, is there any way to extract the data, given that XPath queries only work with well-formed HTML pages?
Greetings,
Siju
Most sites should support the most common browsers, especially Internet Explorer, so you could try the user agent string of one of them. If this does not work, the site might exclude crawlers in its robots.txt.
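Both checks can be sketched outside RapidMiner, assuming Python's standard `urllib`. The user agent string, the feed URL, and the robots.txt content below are placeholders, not values taken from any real site; a real crawler would download robots.txt from the site it wants to crawl.

```python
import urllib.request
import urllib.robotparser

# Assumed values for illustration; neither string comes from a real site.
USER_AGENT = "Mozilla/5.0 (compatible; ExampleFeedReader/1.0)"
FEED_URL = "https://www.example.com/rss/top-stories.xml"

# Hypothetical robots.txt content.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

robots = urllib.robotparser.RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

# Ask whether this user agent may fetch a given URL under the rules above.
allowed = robots.can_fetch(USER_AGENT, FEED_URL)
blocked = robots.can_fetch(USER_AGENT, "https://www.example.com/private/feed.xml")

# Build the request with an explicit User-Agent header.
request = urllib.request.Request(FEED_URL, headers={"User-Agent": USER_AGENT})


def fetch(req):
    """Download the resource; needs network access, so it is not called here."""
    with urllib.request.urlopen(req, timeout=10) as response:
        return response.read()


print(allowed, blocked)
```

A site does not usually publish which user agents it accepts, so robots.txt is the closest thing to a documented answer; beyond that, it comes down to trying a browser-like string and checking the response.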
If XPath does not work, you could use regular expressions to specify the regions of interest.
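As a small illustration of the regular expression route, assuming Python's `re` module and a made-up fragment with unclosed tags and an unquoted attribute (the kind of markup a strict XML parser would reject):

```python
import re

# Hypothetical malformed fragment: unclosed <li> and <b> tags, unquoted attribute.
BROKEN_HTML = """
<ul class=news>
  <li><a href="/story/1">First story<b></a>
  <li><a href="/story/2">Second story</a>
</ul>
"""

# Capture the href value and the raw text between <a ...> and </a>.
LINK_PATTERN = re.compile(r'<a\s+[^>]*href="([^"]*)"[^>]*>(.*?)</a>',
                          re.IGNORECASE | re.DOTALL)

# Used to strip any leftover tags from the captured link text.
TAG_PATTERN = re.compile(r"<[^>]+>")

links = [(href, TAG_PATTERN.sub("", text).strip())
         for href, text in LINK_PATTERN.findall(BROKEN_HTML)]

print(links)
```

The pattern only looks at the local region around each link, so it does not care whether the rest of the page is well-formed. The price is that it is tied to this exact markup style, for example double-quoted href values.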
Greetings,
Sebastian