"Using rapidminer as a crawler"
Hi
I would like to use RapidMiner as a web crawler. I want to give the program a list of URLs from a text file. Then RapidMiner should go through each URL and extract specific links from it, which should then be stored in another text file.
Can you do this with RapidMiner, please?
Answers
I am investigating Selenium + Chrome to allow AJAX/JavaScript scraping too.
Can you do this in MS Excel?
http://lmgtfy.com/?q=rapidminer+web+crawling+
http://rapid-i.com/rapidforum/index.php?action=printpage;topic=2753.0
How do I read my URL list file into an example set?
Read Document --> Documents to Data --> Loop Examples
Read Document --> Documents to Data --> Extract Macro --> Loop Examples
This is the underlying code:
Is it correct so far?
How do you connect the web crawler into this loop, please?
- The "Get Pages" operator accepts a list of links as reference but doesn't save to file.
- The "Crawl Web" operator saves pages to files but only accepts an URL as a fixed parameter.
Could any of the more advanced users help out?
Regards,
Joao G.
The process setup does not make much sense so far.
You have to be aware of what you want to do. If you want to start with a single URL, automatically catch links from the document, and follow them to a maximum depth of traversal, you need a crawler. Otherwise, if you already have a complete list of URLs you want to retrieve, you don't need to crawl and should use the links to retrieve the corresponding websites directly.
Since you mention a list of URLs, I guess you don't need the "Crawl Web" operator. "Get Page" or "Get Pages" are more appropriate in this case. I don't know in which format the links are stored in your file. It would be easiest if you had the URLs in table form, as in CSV or XLS files; these can simply be read as example sets. If you have them as single lines in an ordinary text file, you have to convert them to build a proper example set (please post an example from the list for further advice). After that you can use the operator "Get Pages" to retrieve the entire web page for each URL in your list (the HTML code will be added as an attribute).
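For readers who want to see the underlying idea outside of RapidMiner, here is a minimal sketch in plain Python of turning a one-URL-per-line text file into a simple list (the rough equivalent of an example set with a single attribute). The file name urls.txt is a hypothetical placeholder.

def read_url_list(path):
    """Return all non-empty, stripped lines of a text file as a list of URLs."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

urls = read_url_list("urls.txt")  # hypothetical input file
print(urls)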
If you want to write the websites to files (as Joao mentioned), I would suggest "Get Page" instead, since you have to use a loop anyway.
After converting the URLs to an example set, just add "Loop Examples" and go to the inner process of this operator (double-click it). Here you need "Extract Macro" to get the current URL. Add a "Get Page" operator (be sure to check the execution order of the operators; "Extract Macro" has to come first) and use the extracted macro value as the URL parameter. The operator delivers a single document, which can be written to disk via the "Write Document" operator.
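As a rough illustration of the same loop outside of RapidMiner, a short Python sketch: fetch each URL (what "Get Page" does) and write the HTML to its own file (what "Write Document" does). The URL list and output file names are placeholders.

import urllib.request

urls = ["http://www.example.com"]  # placeholder; in practice this comes from the URL list

for index, url in enumerate(urls, start=1):
    with urllib.request.urlopen(url) as response:          # roughly what "Get Page" does
        html = response.read().decode("utf-8", errors="replace")
    with open("page_%d.html" % index, "w", encoding="utf-8") as out:  # roughly "Write Document"
        out.write(html)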
This is just an outline of how to solve the task. If you need further help, please post your URL list (or an extract) and let us know which parts are still unclear.
Regards
Matthias
P.S. If you use the "Modify" option in your first post, you don't have to add a new one every 10 minutes
I did:
Read Excel (followed the import wizard) --> Loop Examples
Inside Loop Examples I did:
exa --> Extract Macro --> Get Pages --> exa.
This is the corresponding code and I get green lights with both:
With "Get Page" I could do just "Write Document" but I dont know how to use "Get Page" inside the "Loop examples" It has no input-connection and just one out. I tried to connect Extract Macro --> exa and Get Page --> out. But I did not seem to give me green lights.
And with "Get Pages" I get green lights, but I dont know how to get files after the "Loop Examples"
My starting point is a URL file with one URL per cell (in .xls) or per line (in .txt):
http://www.double.de
http://www.singel.de
http://www.tripple.de
And I would like to retrieve all pages of these URLs with depth 1:
http://www.double.de
http://www.double.de/C
http://www.singel.de
http://www.singel.de/A
http://www.tripple.de
http://www.tripple.de/8
...
Probably I will need the crawler for that. But once I am able to use "Get Page", I believe I could use "Crawl Web" as well. True?
You are right, you need to crawl if you want to follow links to a certain depth (in the case of depth 1 you could also extract them via XPath or regular expressions, but the crawler is more comfortable). And you are also right in your assumption that "Crawl Web" and "Get Page" have to be included in the process in a similar way. If you use "Get Pages" you don't need the loop; you should connect it to "Read Excel" directly. But this is just for clarification, since your list of URLs is not complete (sub-pages are not contained).
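For depth 1 only, the idea of extracting the links directly (instead of crawling) could look roughly like this in plain Python; the href pattern is a deliberately simple illustration and will not catch every kind of link, and the start URL is a placeholder.

import re
import urllib.request
from urllib.parse import urljoin

def links_at_depth_one(start_url):
    """Fetch one page and return the absolute URLs of all links found in it."""
    with urllib.request.urlopen(start_url) as response:
        html = response.read().decode("utf-8", errors="replace")
    hrefs = re.findall(r'href="([^"]+)"', html)          # very rough link extraction
    return [urljoin(start_url, href) for href in hrefs]  # resolve relative links

print(links_at_depth_one("http://www.example.com"))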
It seems that you are not really aware of the macro concept. If you set "Extract Macro" to macro type "data_value", you get the value of a defined attribute for a single row/example. The column is addressed by the parameter "attribute name" and the row is addressed by "example index". You could set the index to a fixed value such as "1", but the loop provides a macro which is automatically increased for each example considered. You can choose a name for this control variable via the parameter "iteration macro" of "Loop Examples" (the default is "example"). If you want to use a macro/variable value somewhere, just type %{macro_name}.

I built a small example to illustrate this (you should be able to see how to include the crawler operator and how to feed a URL to it). The first operator, "Subprocess", just generates some artificial data, as it might be delivered by "Read Excel" (in your case replace it by the "Read Excel" operator again). The operator "Delay" is optional and might be useful if you connect to pages from a similar URL multiple times (to avoid firing HTTP requests too rapidly and perhaps being banned in return).
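Expressed outside of RapidMiner, the iteration macro is simply a loop variable, and "Delay" is a pause between requests. A minimal Python sketch of that idea (the URL list is placeholder data):

import time
import urllib.request

urls = ["http://www.example.com", "http://www.example.org"]  # placeholder data

for example, url in enumerate(urls, start=1):
    # "example" plays the role of the iteration macro %{example},
    # "url" the role of the data_value macro extracted from the current row.
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    print("iteration %d: fetched %d characters from %s" % (example, len(html), url))
    time.sleep(1)  # roughly what the "Delay" operator does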
Regards
Matthias
Thank you very much. It does run, but there is now a new problem: storing a page always overwrites the page that was stored before, so I only get one page in the end. When the crawler finds several pages to store from the current URL, it does store multiple files. But across several URLs, each new result overwrites the old one.
In your code I changed the URLs of "Set Data" and ran it with two simple crawling rules. The process finishes quickly and the log says it stored two pages, but you get only one page, because the first gets overwritten. I think something is wrong with the variables in the iteration macro.
There is nothing wrong with the macros. This works just as it should, only not as intended.
The crawler receives the same output directory parameter setting for every loop execution, so of course the files are overwritten while looping. To avoid this, you have to set a specific value for each iteration. I extended the process with an example for this: I extracted the domain as a specific property and used it as a subfolder for the file output. This is again done by using a macro (a second "Extract Macro" operator) and appending the macro value to the "output dir" parameter of the "Crawl Web" operator.

Edit: I forgot that the output directories for the crawler have to exist before trying to save the files; otherwise no data is written to disk. I created them with the "Execute Program" operator, but that command is only valid for Windows operating systems. If you are working with another OS, you have to adapt the command.
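The same idea outside of RapidMiner: give every iteration its own output folder derived from the URL, and create that folder before writing. In this Python sketch, os.makedirs is OS-independent, so no "Execute Program"/md call is needed; the URL list and the base folder name are placeholders.

import os
import re
import urllib.request

urls = ["http://www.example.com", "http://www.example.org"]  # placeholder data
base_dir = "Sites"                                           # placeholder output root

for url in urls:
    match = re.search(r"https?://([^/]+)", url)   # extract the domain part of the URL
    domain = match.group(1) if match else "unknown"
    out_dir = os.path.join(base_dir, domain)
    os.makedirs(out_dir, exist_ok=True)           # create the folder before writing
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    with open(os.path.join(out_dir, "index.html"), "w", encoding="utf-8") as out:
        out.write(html)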
Best regards
Matthias
Great job. It took a while until I understood your program.
EDIT:
It still does not work smoothly. Something is not quite right with your regular expression. When the input URL is:
http://www.abc.de/ (with a slash at the end)
then it works perfectly,
but when the slash at the end is missing, the "Execute Program" operator fails, generating an error message:
Process 'cmd.exe /c "md C:\Users\Home\Desktop\Sites\?"' exited with error code 1.
So I think you have to change the regular expression somehow.
Regards
Ben
Extracting the domain with regular expressions was just a quick example to show one possible way of generating specific folders. You can use anything else you want (maybe a counting macro variable). I wanted to leave some work to you, but if you want to keep the domain regex, try adjusting it so that it also matches URLs without a trailing slash (one possible pattern is sketched after this post).

Regards
Matthias
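A minimal sketch of a domain pattern that matches with or without a trailing slash, in plain Python; this is an illustration, not necessarily the exact expression from the original post.

import re

pattern = r"https?://([^/]+)"   # the domain is everything between "://" and the next "/"

for url in ("http://www.abc.de/", "http://www.abc.de"):
    print(re.search(pattern, url).group(1))   # prints "www.abc.de" in both cases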