Downloading PDFs
Hi, I'm using the "Crawl Web" process to download PDF documents on a Windows 7 Pro machine, using version 5.3.008 of RapidMiner. Is there a way to get RapidMiner to download these documents without modifying them? The files I end up with are corrupted in at least two different ways.
When I try to download a PDF document directly, I get the following message:
"There was an error opening this document. The file is damaged and could not be repaired."
When I try to download a document that is accessed via a link such as ...download.php?id=..., I can open the resulting document, but it looks like multiple empty pages.
Inspecting both kinds of file in Notepad suggests that the latter is much closer to the correct format, which is ironic since the URL doesn't include a PDF filename in that case.
I have left the Encoding settings as the SYSTEM default, although I have tried one or two alternative settings to no avail.
Can anyone help?
Thanks!
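The two symptoms above (a file that cannot be opened, and one that opens as blank pages) are consistent with binary PDF data passing through a text encoding step somewhere in the download. That is an assumption, not something confirmed in this thread. The sketch below is independent of RapidMiner and only illustrates the effect; the helper names `roundtrip_as_text` and `looks_like_pdf` are hypothetical, and the sample bytes are a stand-in, not a real PDF.

```python
def roundtrip_as_text(data: bytes, encoding: str = "utf-8") -> bytes:
    """Decode bytes as text and encode them again, as a text-oriented pipeline might."""
    return data.decode(encoding, errors="replace").encode(encoding)


def looks_like_pdf(data: bytes) -> bool:
    """Rough check: a PDF starts with '%PDF-' and has '%%EOF' near the end."""
    return data.startswith(b"%PDF-") and b"%%EOF" in data[-1024:]


# Stand-in for a PDF: ASCII markers plus a few non-ASCII bytes,
# like those found in compressed streams of a real document.
sample = b"%PDF-1.4\n" + bytes([0xE2, 0xE3, 0xCF, 0xD3]) + b"\n%%EOF\n"
damaged = roundtrip_as_text(sample)

print(looks_like_pdf(sample), looks_like_pdf(damaged))  # markers survive in both
print(sample == damaged)                                # binary bytes do not
```

In this sketch the ASCII markers survive the round trip while the non-ASCII bytes are replaced. A file damaged this way could still look roughly right in Notepad yet fail to render its page content.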
Answers
Best regards,
Marius
Sorry for the confusion. By "directly" I simply meant that RapidMiner is trying to download a PDF via a direct URL, such as:
www.website.com/folder1/otherfolder/filename.pdf
Downloading the PDFs manually via the right-click options works fine, and I can also fetch them with a separate Wget application. The problems mentioned only occur when RapidMiner downloads the documents.
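Since manual downloads and Wget both produce intact files, one workaround outside RapidMiner is to fetch the raw bytes and write them in binary mode, so that no text decoding is applied. The following is a minimal sketch using only Python's standard library; `save_stream` and `download_binary` are hypothetical helper names, and it does not address how RapidMiner itself handles the download.

```python
import shutil
import urllib.request
from pathlib import Path


def save_stream(src, dest: Path) -> int:
    """Copy a binary file-like object to dest without decoding; return bytes written."""
    with open(dest, "wb") as out:  # "wb" keeps the bytes exactly as received
        shutil.copyfileobj(src, out)
    return dest.stat().st_size


def download_binary(url: str, dest: Path) -> int:
    """Fetch url and store the response body unchanged at dest."""
    with urllib.request.urlopen(url) as response:
        return save_stream(response, dest)
```

A call such as `download_binary(url, Path("filename.pdf"))` with a full URL would then write the file. Comparing its size with a manually downloaded copy is a quick way to check whether the bytes were altered.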