Dealing with JSON (downloading files with Crawl Web)
Hi everyone.
Since I'm very new to RapidMiner, I have a few questions for you.
Here is what I'm trying to do:
I have an Excel file filled with URLs, and each of those URLs is going to be crawled. Everything went fine until now: all my tests with HTML pages worked perfectly. The problem is that my URLs now return JSON files. I'm trying to store those files, but I get no results.
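(For context, the first step on its own — pulling the column of addresses out of the spreadsheet — could be sketched like this in Python. This is only an illustration, assuming the Excel sheet has been saved as CSV with the URLs in the first column; the file name `urls.csv` is made up.)

```python
import csv

def read_urls(path):
    """Return the URLs found in the first column of a CSV export of the sheet."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row[0] for row in csv.reader(f)
                if row and row[0].startswith("http")]
```
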
Here is my process:
Do any of you have any ideas for me? Maybe some User-Agent tricks so I can actually "see" those JSON files as text?
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.011">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
<process expanded="true" height="550" width="1150">
<operator activated="true" class="retrieve" compatibility="5.1.011" expanded="true" height="60" name="Retrieve (2)" width="90" x="112" y="300">
<parameter key="repository_entry" value="URLs"/>
</operator>
<operator activated="true" class="loop_examples" compatibility="5.1.011" expanded="true" height="94" name="Loop Examples" width="90" x="514" y="300">
<process expanded="true" height="969" width="547">
<operator activated="true" class="extract_macro" compatibility="5.1.011" expanded="true" height="60" name="Extract Macro" width="90" x="380" y="300">
<parameter key="macro" value="website_url"/>
<parameter key="macro_type" value="data_value"/>
<parameter key="statistics" value="max"/>
<parameter key="attribute_name" value="A"/>
<parameter key="example_index" value="%{example}"/>
</operator>
<operator activated="true" class="web:crawl_web" compatibility="5.1.004" expanded="true" height="60" name="Crawl Web" width="90" x="380" y="390">
<parameter key="url" value="%{website_url}"/>
<list key="crawling_rules">
<parameter key="follow_link_with_matching_url" value=".*"/>
<parameter key="store_with_matching_url" value=".*"/>
</list>
<parameter key="output_dir" value="C:\Users\ls\Desktop\test"/>
<parameter key="extension" value="json"/>
<parameter key="max_pages" value="5"/>
<parameter key="max_depth" value="1"/>
<parameter key="domain" value="server"/>
<parameter key="max_page_size" value="5000"/>
<parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0.2) Gecko/20100101 Firefox/6.0.2 "/>
</operator>
<connect from_port="example set" to_op="Extract Macro" to_port="example set"/>
<connect from_op="Extract Macro" from_port="example set" to_port="example set"/>
<connect from_op="Crawl Web" from_port="Example Set" to_port="output 1"/>
<portSpacing port="source_example set" spacing="234"/>
<portSpacing port="sink_example set" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<connect from_op="Retrieve (2)" from_port="output" to_op="Loop Examples" to_port="example set"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="54"/>
</process>
</operator>
</process>
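(For comparison, the same idea outside RapidMiner — fetch each URL with a browser-style User-Agent header and store the body only when it parses as JSON — might look like the sketch below. This is an assumption-laden illustration, not RapidMiner behavior; the User-Agent string is the one from the Crawl Web parameters above.)

```python
import json
import urllib.request
from pathlib import Path

# Browser-style User-Agent, same idea as the user_agent parameter in Crawl Web.
HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0.2) "
                   "Gecko/20100101 Firefox/6.0.2"),
}

def fetch(url):
    """Download the raw response body as text (works for JSON endpoints too)."""
    req = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

def save_if_json(text, path):
    """Store the text only if it parses as JSON; return True on success."""
    try:
        json.loads(text)
    except ValueError:
        return False
    Path(path).write_text(text, encoding="utf-8")
    return True
```

Looping `fetch(url)` over the URL list and passing each body to `save_if_json(...)` would replicate the Loop Examples / Crawl Web combination in the process above, without depending on the crawler treating the response as an HTML page.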
Thanks a lot in advance,
Loky.