No results from "Process Documents from Web"
Hello,
I have a problem with the "Process Documents from Web" operator. No matter how I configure it, it never finds any URLs, although the process still worked a few months ago and the URL structure hasn't changed.
I tried it with different domains, but unfortunately RapidMiner never finds any URLs.
What could be the reason? It would be great if someone could help me!
Greetings
Tim
Best Answer
kayman Member Posts: 662 Unicorn
You could use the Loop operator for this specific case.
Your site has 45 listing pages (about 670 links, 15 per page), so set up a loop with 45 iterations and do something in each pass.
The attached example loads the next page in each iteration (url/s[page_number]), grabs all the links (the <a> tags), and keeps only the ones you need.
This way you can build up a list of all the links, and that list can then be used to start crawling the other sites.

<?xml version="1.0" encoding="UTF-8"?>
<process version="9.0.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.0.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="concurrency:loop" compatibility="9.0.003" expanded="true" height="82" name="Loop" width="90" x="246" y="34">
        <parameter key="number_of_iterations" value="45"/>
        <parameter key="iteration_macro" value="page"/>
        <parameter key="enable_parallel_execution" value="false"/>
        <process expanded="true">
          <operator activated="true" class="web:get_webpage" compatibility="9.0.000" expanded="true" height="68" name="Get Page" width="90" x="313" y="34">
            <parameter key="url" value="https://www.gelbeseiten.de/reisebueros/berlin/s%{page}"/>
            <list key="query_parameters"/>
            <list key="request_properties"/>
          </operator>
          <operator activated="true" breakpoints="after" class="subprocess" compatibility="9.0.003" expanded="true" height="82" name="Extract Links" width="90" x="447" y="34">
            <process expanded="true">
              <operator activated="true" class="text:replace_tokens" compatibility="8.1.000" expanded="true" height="68" name="Replace Tokens" width="90" x="45" y="136">
                <list key="replace_dictionary">
                  <parameter key="\r?\n" value=" "/>
                  <parameter key="[ ]+" value=" "/>
                </list>
              </operator>
              <operator activated="true" class="text:keep_document_parts" compatibility="8.1.000" expanded="true" height="68" name="Keep Document Parts (2)" width="90" x="45" y="34">
                <parameter key="extraction_regex" value="(?s)&lt;a .*?&gt;"/>
              </operator>
              <operator activated="true" class="text:combine_documents" compatibility="8.1.000" expanded="true" height="82" name="Combine Documents" width="90" x="179" y="34"/>
              <operator activated="true" class="text:cut_document" compatibility="8.1.000" expanded="true" height="68" name="Cut Document" width="90" x="313" y="34">
                <parameter key="query_type" value="Regular Region"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries"/>
                <list key="regular_region_queries">
                  <parameter key="link" value="&lt;.&gt;"/>
                </list>
                <list key="xpath_queries"/>
                <list key="namespaces"/>
                <list key="index_queries"/>
                <list key="jsonpath_queries"/>
                <process expanded="true">
                  <connect from_port="segment" to_port="document 1"/>
                  <portSpacing port="source_segment" spacing="0"/>
                  <portSpacing port="sink_document 1" spacing="0"/>
                  <portSpacing port="sink_document 2" spacing="0"/>
                </process>
              </operator>
              <operator activated="true" class="text:documents_to_data" compatibility="8.1.000" expanded="true" height="82" name="Documents to Data" width="90" x="447" y="34">
                <parameter key="text_attribute" value="link"/>
              </operator>
              <operator activated="true" class="select_attributes" compatibility="9.0.003" expanded="true" height="82" name="Select Attributes" width="90" x="581" y="34">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="link"/>
              </operator>
              <operator activated="true" class="filter_examples" compatibility="9.0.003" expanded="true" height="103" name="Filter Examples" width="90" x="715" y="34">
                <list key="filters_list">
                  <parameter key="filters_entry_key" value="link.does_not_contain.www\.gelbeseiten\.de"/>
                  <parameter key="filters_entry_key" value="link.contains.class=&quot;link&quot;"/>
                </list>
              </operator>
              <operator activated="true" class="replace" compatibility="9.0.003" expanded="true" height="82" name="Replace" width="90" x="849" y="34">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="link"/>
                <parameter key="replace_what" value="^.*?href=&quot;(.*?)&quot;.*"/>
                <parameter key="replace_by" value="$1"/>
              </operator>
              <connect from_port="in 1" to_op="Replace Tokens" to_port="document"/>
              <connect from_op="Replace Tokens" from_port="document" to_op="Keep Document Parts (2)" to_port="document"/>
              <connect from_op="Keep Document Parts (2)" from_port="document" to_op="Combine Documents" to_port="documents 1"/>
              <connect from_op="Combine Documents" from_port="document" to_op="Cut Document" to_port="document"/>
              <connect from_op="Cut Document" from_port="documents" to_op="Documents to Data" to_port="documents 1"/>
              <connect from_op="Documents to Data" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
              <connect from_op="Select Attributes" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
              <connect from_op="Filter Examples" from_port="example set output" to_op="Replace" to_port="example set input"/>
              <connect from_op="Replace" from_port="example set output" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="source_in 2" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Get Page" from_port="output" to_op="Extract Links" to_port="in 1"/>
          <connect from_op="Extract Links" from_port="out 1" to_port="output 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
    </process>
  </operator>
</process>
Answers
As far as I know they still work as before (at least for me :-)), so unless your network has changed it should be fine.
Can you still access the Marketplace? That is usually a good indication that you can at least reach the internet through RapidMiner. If not, check Preferences -> Proxy.
Another possible scenario is that your site changed protocol and is no longer using http but https. So while the URL might still look the same at first glance, your request might get blocked.
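If you want to check that quickly outside of RapidMiner, a small Python sketch (my assumption, not part of the thread) makes any redirect visible:

# Does the plain-http URL get redirected (e.g. upgraded to https) or blocked?
import requests

resp = requests.get("http://www.gelbeseiten.de/", allow_redirects=True, timeout=10)
for hop in resp.history:                               # any 301/302 hops along the way
    print(hop.status_code, "->", hop.headers.get("Location"))
print("final:", resp.status_code, resp.url)            # final URL shows https if upgraded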
If I check the URL structure of old processes and it hasn't changed, it should still work, right?
Your current expression is /*, which in a regular expression just means 'zero or more slashes'; it does not match anything that comes after the slash.
Using /.* (dot star) instead, you say 'a slash followed by any characters, however many there are', which is what you want here.
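You can see the difference with a quick regex check (a hypothetical Python snippet; it assumes the crawl rule is applied as a full match against the URL, the way Java's String.matches works):

import re

url = "https://www.gelbeseiten.de/gsbiz/12345"
prefix = re.escape("https://www.gelbeseiten.de/gsbiz")

print(bool(re.fullmatch(prefix + "/*", url)))    # False: /* only allows extra slashes
print(bool(re.fullmatch(prefix + "/.*", url)))   # True: /.* allows a slash plus anything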
One thing I always recommend is to fetch at least the main page, or one of the links, directly before trying the crawl logic. This way you know up front that you can retrieve the page at all.
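For example (a hypothetical one-off check, not part of the original answer), fetching the first listing page directly tells you whether it is reachable and actually contains anchor tags before you build any crawl logic:

import re
import requests

html = requests.get("https://www.gelbeseiten.de/reisebueros/berlin/s1", timeout=10).text
print(len(html), "characters,", len(re.findall(r"<a ", html)), "anchor tags")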
URLs I want to crawl: https://www.gelbeseiten.de/gsbiz/*
URL I want to crawl: https://www.gelbeseiten.de/gsbiz/
I can't post pictures because I'm still new in the community, so here is a link to Google Drive with screenshots of the process.
https://drive.google.com/drive/folders/1PWt9zS2azBoR5DAhwI8Y17zetBTauUJ1?usp=sharing
Scott
As for the rules: if I recall correctly this is handled by setting the max crawl depth; try changing it to 3 or more.
With a depth of 2 it will take the main page and the pages it links to; with 3 it will also take the next level, and so on.
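As a rough illustration of what that depth setting means (a generic breadth-first sketch in Python, not RapidMiner's actual crawler):

# Depth-limited crawl: depth 1 is the start page itself, depth 2 adds the
# pages it links to, depth 3 adds their links, and so on.
import re
import requests

def crawl(start_url, max_depth):
    seen, frontier = {start_url}, [start_url]
    for _ in range(1, max_depth):
        next_frontier = []
        for url in frontier:
            html = requests.get(url, timeout=10).text
            for link in re.findall(r'href="(https?://.*?)"', html):
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return seen

pages = crawl("https://www.gelbeseiten.de/gsbiz/", max_depth=3)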
@sgenzer No problem!
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
And Merry Christmas in advance!
Scott