Web Mining
newbierapid
Member Posts: 6 Contributor II
Hi all,
I am new to RapidMiner; I am currently using version 5.0. My objective is to crawl the web (using the Crawl Web operator), and I am able to save the URLs to an Excel file by specifying regular-expression crawling rules. The problem is that I cannot see the content belonging to each URL. Once I have the content, I need to strip the HTML markup from each page.
Can anyone suggest how to proceed? It would be great if someone could explain it with operator names in process order.
Thanks
Answers
I'm not really sure where the problem lies, since the description is a bit vague. You are using the "Crawl Web" operator and get URLs but no content? Then enable the "add pages as attribute" parameter and you will get both. I have no clue how regular expressions and an Excel file relate to this, though. Perhaps you could provide some more detail about what you have done (for example, post your process XML) and where you got stuck.
Regards
Matthias
Sorry for the lack of information. I used the Crawl Web operator to crawl a website and followed the way you suggested; now I can see the URLs listed. I would like to see the content of each URL (kindly excuse me if it is a silly question). After getting the content I have to remove the tags in each page and do further processing. Here is my process XML.
Thanks in advance
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.011">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
<process expanded="true" height="605" width="692">
<operator activated="true" class="web:crawl_web" compatibility="5.1.003" expanded="true" height="60" name="Crawl Web" width="90" x="112" y="75">
<parameter key="url" value="http://www.asklaila.com/search/Bangalore/-/shopping malls/?searchNearby=false&amp;ac=true"/>
<list key="crawling_rules">
<parameter key="follow_link_with_matching_text" value=".*Shopping Malls.*"/>
</list>
<parameter key="add_pages_as_attribute" value="true"/>
<parameter key="output_dir" value="C:\Documents and Settings\Sudheendra\Desktop\b"/>
<parameter key="extension" value="html"/>
<parameter key="max_pages" value="4"/>
<parameter key="user_agent" value="Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.2; .NET CLR 1.1.4322)"/>
</operator>
<connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Sorry, I'm a bit confused... the link and the website content are already there (the latter is contained in the attribute Page). If you want to get rid of the HTML markup, you might do something like this:
Regards
Matthias
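For illustration, here is a minimal sketch of such a tag-stripping step. This is a hypothetical fragment, not Matthias's original example: it assumes the Text Processing and Web Mining extensions are installed, and the compatibility version numbers are guesses that may differ on your installation. The idea is to wrap an Extract Content operator inside Process Documents from Data, so the HTML in the Page attribute is reduced to plain text:

```xml
<!-- Hypothetical fragment (assumed operator classes from the Text Processing
     and Web Mining extensions; compatibility versions may differ on your
     system). Connect the Example Set output of Crawl Web to the input of
     this operator. -->
<operator activated="true" class="text:process_document_from_data" compatibility="5.1.001" expanded="true" name="Process Documents from Data">
  <process expanded="true">
    <!-- Extract Content strips the HTML tags and keeps the readable text -->
    <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.003" expanded="true" name="Extract Content"/>
    <connect from_port="document" to_op="Extract Content" to_port="document"/>
    <connect from_op="Extract Content" from_port="document" to_port="document 1"/>
  </process>
</operator>
```

Inside Process Documents from Data, each row's Page attribute arrives as a document, Extract Content removes the markup, and the cleaned documents come out the other side for further processing (tokenizing, filtering, and so on).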