"Web Mining crawling prices of an internet page"
Guys,
I am trying to build a process that crawls pages from a site to collect the prices of a variety of products. I created a loop so that I can fetch the site page by page and save each page to disk. After that, I want to read the saved HTML and extract only the product name and price, for example, but I haven't been able to do that. Could you please help me?
I was able to fetch the pages in sequence, but I can't keep them all on disk because each saved page overwrites the previous one.
First, I want to collect these pages:
https://www.buscape.com.br/cerveja?pagina=1
https://www.buscape.com.br/cerveja?pagina=2
...
https://www.buscape.com.br/cerveja?pagina=200
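Outside RapidMiner, the same idea (one URL per page, one distinct file per page so nothing is overwritten) can be sketched in a few lines of Python. This is only an illustration, not the RapidMiner fix: the `page_jobs` and `save_page` names, the `pages` directory, and the `cerveja_NNN.html` file pattern are all made up for the example, and whether the site accepts plain scripted requests has not been checked.

```python
from pathlib import Path
from urllib.request import urlopen

# URL pattern taken from the post; {page} is filled in per iteration.
BASE_URL = "https://www.buscape.com.br/cerveja?pagina={page}"


def page_jobs(first, last, out_dir):
    """Yield (url, path) pairs with one distinct file name per page."""
    out = Path(out_dir)
    for page in range(first, last + 1):
        # The page number is part of the file name, so files never collide.
        yield BASE_URL.format(page=page), out / f"cerveja_{page:03d}.html"


def save_page(url, path):
    """Download one page and write it to its own file (hypothetical helper)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(url) as response:
        path.write_bytes(response.read())


if __name__ == "__main__":
    # Only prints the plan; call save_page(url, path) to actually download.
    for url, path in page_jobs(1, 3, "pages"):
        print(url, "->", path)
```

The point is simply that the loop counter has to appear in the output file name as well as in the URL.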
My process is below:
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="concurrency:loop" compatibility="8.0.001" expanded="true" height="103" name="Loop" width="90" x="179" y="34">
<process expanded="true">
<operator activated="true" class="web:crawl_web_modern" compatibility="7.3.000" expanded="true" height="68" name="Crawl Web" width="90" x="246" y="238">
<parameter key="url" value="https://www.buscape.com.br/cerveja?pagina=%{page}"/>
<list key="crawling_rules">
<parameter key="follow_link_with_matching_url" value="cerveja"/>
</list>
<parameter key="retrieve_as_html" value="true"/>
<parameter key="add_content_as_attribute" value="true"/>
<parameter key="write_pages_to_disk" value="true"/>
<parameter key="output_dir" value="C:\Users\luiz.vidal\Desktop\Cloudera\SEFA-PA\Cerveja"/>
</operator>
<operator activated="true" class="generate_macro" compatibility="8.0.001" expanded="true" height="82" name="Generate Macro" width="90" x="112" y="34">
<list key="function_descriptions">
<parameter key="page" value="%{page}"/>
</list>
</operator>
<connect from_port="input 1" to_op="Generate Macro" to_port="through 1"/>
<connect from_op="Crawl Web" from_port="example set" to_port="output 2"/>
<connect from_op="Generate Macro" from_port="through 1" to_port="output 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
<portSpacing port="sink_output 3" spacing="0"/>
</process>
</operator>
<connect from_op="Loop" from_port="output 1" to_port="result 2"/>
<connect from_op="Loop" from_port="output 2" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
After that, once I have all the pages "collected", I tried to use XPath to get only the fields I need from the HTML.
But somehow, when I copy and paste the XPath from Google, it doesn't work.
Could you please help me create a simple example process?
Thanks in advance.
Best Answer
luiz_vidal Member Posts: 14 Contributor II
Ugh,
After almost giving up, I was able to retrieve the piece of data I want. The problem is that it returns only the first match it finds.
I need to find a way to fetch all product names and prices.
Answers
Hi Luiz-Vidal,
I came across that issue a few days ago.
Just copying and pasting the XPath from Google Chrome won't work because of the namespace.
Google gives
//*[@id="product_383527"]/div/div[1]/div[3]/div[1]/a/span for the first product: Paulistânia Puro Malte Premium Lager Garrafa 600 ml 1 Unidade and
//*[@id="product_383527"]/div/div[2]/div[1]/div[1]/a/span for the price 14,99
In RM you have to use //*[@id="product_383527"]/h:div/h:div[1]/h:div[3]/h:div[1]/h:a/h:span
and //*[@id="product_383527"]/h:div/h:div[2]/h:div[1]/h:div[1]/h:a/h:span
See the discussion here: https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Extracting-Information-With-XPath/td-p/9883
Cheers
miner
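The namespace effect described in this reply can be reproduced outside RapidMiner. The sketch below uses Python's standard `xml.etree.ElementTree` on a small, made-up XHTML fragment (the real page is far more nested); the `h` prefix is bound to the XHTML namespace by hand here, which is assumed to mirror what RM does internally.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
NS = {"h": XHTML}  # prefix "h" bound to the XHTML namespace

# Hypothetical, heavily simplified markup for one product.
DOC = f"""<html xmlns="{XHTML}">
  <body>
    <div id="product_383527">
      <div><a><span>Paulistânia Puro Malte Premium Lager</span></a></div>
      <div><a><span>14,99</span></a></div>
    </div>
  </body>
</html>"""

root = ET.fromstring(DOC)

# Unprefixed steps look for elements in no namespace: nothing matches.
plain = root.findall(".//*[@id='product_383527']/div/a/span")

# Prefixed steps match the namespaced elements.
name = root.find(".//*[@id='product_383527']/h:div[1]/h:a/h:span", NS)
price = root.find(".//*[@id='product_383527']/h:div[2]/h:a/h:span", NS)

print(len(plain))
print(name.text, price.text)
```

Every element step needs the prefix; the `*` wildcard and attribute tests such as `@id` do not.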
Hey,
Thanks for your reply
I still can't get it to work, though.
Any idea what I am doing wrong?
I'm not quite sure.
The website uses a product ID for reference.
For the first product I picked, it was //*[@id="product_383527"]. Assuming the ID changes for every product, that XPath works only for this specific product.
You would then have to go up the tree to a node that does not depend on the ID and pick the details from there.
Would that be /html/body/main/div[3]/div/div[3]/section?
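To illustrate the "go up the tree" idea: select every product container under a shared parent, then use short relative paths inside each one, so no product ID appears in the expression. The markup below is invented for the example (the second product and the flat `span` layout are hypothetical), and it uses Python's standard `xml.etree.ElementTree` rather than RM, so treat it as a sketch of the approach only.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
NS = {"h": XHTML}

# Hypothetical markup: ids differ per product, the structure repeats.
DOC = f"""<html xmlns="{XHTML}"><body><section>
  <div id="product_383527"><span>Paulistânia Puro Malte</span><span>14,99</span></div>
  <div id="product_000001"><span>Hypothetical Lager</span><span>6,75</span></div>
</section></body></html>"""


def extract_products(xhtml_text):
    """Return (id, name, price) for every product container found."""
    root = ET.fromstring(xhtml_text)
    products = []
    # Every <div> directly under the shared <section>, whatever its id.
    for node in root.findall(".//h:section/h:div", NS):
        # Relative paths, evaluated inside the current product only.
        name = node.find("h:span[1]", NS).text
        price = node.find("h:span[2]", NS).text
        products.append((node.get("id"), name, price))
    return products


if __name__ == "__main__":
    for row in extract_products(DOC):
        print(row)
```

This also returns all products rather than only the first match, since the outer query selects a list of nodes.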
Sorry,
I know nothing about XPath. I've been trying all day to get these:
<input value="Brahma Pilsen Lata 350 ml 1 Unidade" name="productName" type="hidden">
<input value="6.75" name="priceProduct" type="hidden">
I try and try, and the extract document returns only true, false, or ?
name="productName"], it returns TRUE or FALSE, but what I want is the value of productName and of priceProduct, which will probably have to be returned as a list or as one long string to be split. I don't know yet.
Just getting one value returned would be a victory.
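In XPath terms, an expression that ends in a predicate such as `[@name="productName"]` selects the element, while appending `/@value` selects the attribute's content; whether that alone explains the TRUE/FALSE result in RM is not clear from the thread. As a cross-check outside RM, the sketch below does not use XPath at all: it relies on Python's standard `html.parser` to collect the `value` of every hidden `input` by `name`. The `HiddenInputCollector` class and the second product are made up for the example.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect the value attribute of <input> tags, grouped by name."""

    def __init__(self, wanted):
        super().__init__()
        self.values = {name: [] for name in wanted}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name in self.values:
            self.values[name].append(attributes.get("value"))


# First two lines are from the post; the last two are hypothetical.
HTML = """
<input value="Brahma Pilsen Lata 350 ml 1 Unidade" name="productName" type="hidden">
<input value="6.75" name="priceProduct" type="hidden">
<input value="Hypothetical Beer 600 ml" name="productName" type="hidden">
<input value="9.90" name="priceProduct" type="hidden">
"""

collector = HiddenInputCollector(["productName", "priceProduct"])
collector.feed(HTML)

# Pair names and prices by position to get one row per product.
rows = list(zip(collector.values["productName"],
                collector.values["priceProduct"]))
print(rows)
```

Pairing by position assumes every product has exactly one name and one price input, in the same order.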
Hi @luiz_vidal
XPath can be a mess...
A good way to test XPath strings is to use Google Docs, where you can quickly copy the XPath from Chrome into a spreadsheet and check the result. This is much faster than testing the structure in RM.
On YouTube you can find a lot of tutorials on XPath and Google Docs.
My recommendation is the video of community member el chief - find it here: https://www.youtube.com/watch?v=UG6223p9fZE
Cheers
miner
Overall,
It was a matter of learning how to use XPath and configuring it correctly across the operators.
Thanks for your help
"xpath can be a mess..."
Definitely agree, but it's powerful when it works.
Hi,
Trying the XPaths in a shell environment can make things faster.
A simple command line tool is XML Shell:
http://www.xmlsh.org/CommandXPath
You can also find the same functionality in Python's Scrapy, but it is overkill for what you actually need.
Regards,
Sebastian