Can I get a solution for job-related web crawling from SEEK and Indeed?
Hi,
I am doing research on job skills assessment as an academic project.
I am looking for a web crawling script or solution that extracts job post details, such as job roles, location, skills and knowledge, from job portal websites like Indeed and Seek. Kindly help me in this matter.
Answers
There are plenty of things you can do:
- Use the Get Pages operator from RapidMiner.
- Use the Python Extension and program your own script with Scrapy (it's easier than you think).
- Use the Python Extension and program your own script with Selenium WebDriver and BeautifulSoup (it's harder to do and requires some more software, but gives better results if your pages are generated with JavaScript).
- Use a tool named "SiteSucker" and configure it to retrieve the data for RapidMiner. Then you can analyze the data inside RapidMiner, reading it from files.
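To make the "program your own script" options a bit more concrete, here is a minimal sketch that downloads pages and saves them as files. It uses only Python's standard library as a stand-in for Scrapy or Selenium, so it will not see content that a page builds with JavaScript. The names fetch_html and save_pages, the page_0001.html naming scheme, the User-Agent string and the one-second delay are all assumptions made for illustration.

```python
import time
import urllib.request
from pathlib import Path
from typing import Callable, Iterable, List


def fetch_html(url: str) -> str:
    """Download one page and return its HTML as text."""
    # Hypothetical User-Agent; a descriptive one tells the site who is asking.
    request = urllib.request.Request(
        url, headers={"User-Agent": "academic-research-sketch"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def save_pages(
    urls: Iterable[str],
    out_dir: str,
    fetch: Callable[[str], str] = fetch_html,
    delay_seconds: float = 1.0,
) -> List[Path]:
    """Save each URL as page_0001.html, page_0002.html, ... inside out_dir."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for number, url in enumerate(urls, start=1):
        path = target / f"page_{number:04d}.html"
        path.write_text(fetch(url), encoding="utf-8")
        saved.append(path)
        time.sleep(delay_seconds)  # pause between requests to go easy on the server
    return saved
```

Calling save_pages(list_of_urls, "pages") would leave numbered HTML files in a pages directory, which can then be read from files. The fetch parameter is there so the saving logic can be tried with a stub function before touching a real site.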
This is what I could come up with.
All the best,
Rod.
I really appreciate your help.
Have you downloaded the pages you want to scrape? And do you have some HTML knowledge? Let's build your database first. I already gave you several solutions you can count on to retrieve pages. Then we will go on to the other processes.
What will you do to download your data?
All the best,
Rod.
If you have your webpages downloaded already, do you have these as files inside of a directory, files inside many directories, or as entries in a database?
The first thing we need to do is to make these look like entries in a database (or in a RapidMiner Studio ExampleSet). For that, you need to do the following. Let's use just one file to build our process; then we will use loops to open all files, OK?
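As a preview of that later looping step, here is a rough sketch of turning a folder of downloaded pages into database-like rows, one row per file. The function name load_pages and the column names file and html are assumptions, and it presumes the pages were saved with an .html extension; it searches subdirectories too, so it covers both the single-directory and many-directories cases.

```python
from pathlib import Path
from typing import Dict, List


def load_pages(root: str) -> List[Dict[str, str]]:
    """Return one row per .html file found under root, subdirectories included."""
    rows = []
    for path in sorted(Path(root).rglob("*.html")):
        rows.append({
            # Path relative to root, with forward slashes on every platform.
            "file": path.relative_to(root).as_posix(),
            # Undecodable bytes are replaced instead of aborting the whole run.
            "html": path.read_text(encoding="utf-8", errors="replace"),
        })
    return rows
```

Each row is a plain dictionary, so the list can be handed to whatever table structure you prefer once the single-file process works.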
First, pick a file, open it with your browser, read the code and identify the HTML structure. The "Inspect Element" feature of Firefox and Chrome can help. Are you able to identify, inside an HTML file, how the job offers are marked up? For example, suppose each offer sits in a <div> element with class jo.
You then know that if you read all the <div> elements with class jo, you get all the divs that contain job offers, which is what we are looking for.
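Reading all the <div> elements with class jo could be sketched as follows. The sample HTML is invented for illustration and is not taken from Indeed or Seek; only the class name jo comes from the discussion above. The sketch uses the standard library's html.parser so that it runs without extra installs, though BeautifulSoup would probably do the same job in fewer lines. JobOfferExtractor and extract_job_offers are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List

# Invented sample page; real job portals will use different markup.
SAMPLE_HTML = """
<div id="results">
  <div class="jo">
    <h2>Data Analyst</h2>
    <span class="loc">Sydney</span>
    <div class="skills">SQL, Python</div>
  </div>
  <div class="jo featured">
    <h2>Web Developer</h2>
    <span class="loc">Melbourne</span>
  </div>
  <div class="ad">Sponsored</div>
</div>
"""


class JobOfferExtractor(HTMLParser):
    """Collect the text inside every <div> whose class list contains "jo"."""

    def __init__(self) -> None:
        super().__init__()
        self.offers: List[str] = []
        self._depth = 0  # <div> nesting depth inside the current offer
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth > 0:
            self._depth += 1  # a nested <div> inside an offer
        elif "jo" in (dict(attrs).get("class") or "").split():
            self._depth = 1  # entering a new offer
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:  # this was the offer's own closing tag
                self.offers.append(" ".join(self._chunks))

    def handle_data(self, data):
        if self._depth > 0 and data.strip():
            self._chunks.append(data.strip())


def extract_job_offers(html: str) -> List[str]:
    """Return the text of each job offer found in html."""
    parser = JobOfferExtractor()
    parser.feed(html)
    return parser.offers
```

On the sample page, extract_job_offers(SAMPLE_HTML) returns two entries, "Data Analyst Sydney SQL, Python" and "Web Developer Melbourne", and skips the sponsored block. Splitting each offer into separate fields such as role, location and skills would be the next step, and it depends on the inner markup of the real pages.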
BTW, I forgot: did you ask the website owners for permission to do this? Some of them don't really like users crawling their webpages.
All the best,
Rod.