Automated server-based web scraping application to develop
$250-750 USD
In Progress
Posted over 11 years ago
Paid on delivery
Developer needed to build an "intelligent", server-based, automated web scraping application that can
distinguish business websites from non-business websites in a large list of website URLs (over 200k).
(A business website is a website that belongs to a business providing services.)
The proposed way to do this is to
1) develop a server-based application that carries out the following steps for each URL:
a) verify whether the URL corresponds to an active website
b) browse the website and identify "intra-site" links (internal links)
c) determine whether the text of any link includes a keyword from a pre-determined set of keywords (such
as "about us", "services", "company", "clients"...)
For example: www. website .com/[login to view URL] - this link will give a "positive" result since the word
"services" appears in the link. (The word "services" would have been pre-determined by the user.)
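Steps a) to c) could be sketched roughly as follows. This is only an illustration using the Python standard library, not a required design: the names LinkCollector, internal_links, matching_keyword and is_active are hypothetical, and it assumes that a keyword counts as a match when it appears either in the link's URL path or in its anchor text, with "-" and "_" treated as spaces.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import urllib.error
import urllib.request


class LinkCollector(HTMLParser):
    """Collects (href, anchor text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href, self._text = href, []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def _host(url):
    # Assumption: "www.example.com" and "example.com" are the same site.
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def is_active(url, timeout=10):
    """Step a): True if the URL answers with a 2xx/3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        return False


def internal_links(base_url, html):
    """Step b): absolute (url, text) pairs that stay on the same host."""
    parser = LinkCollector()
    parser.feed(html)
    result = []
    for href, text in parser.links:
        absolute = urljoin(base_url, href)
        same_site = _host(absolute) == _host(base_url)
        if urlparse(absolute).scheme in ("http", "https") and same_site:
            result.append((absolute, text))
    return result


def matching_keyword(links, keywords):
    """Step c): first keyword found in a link's path or text, else None."""
    for url, text in links:
        haystack = (urlparse(url).path + " " + text).lower()
        haystack = haystack.replace("-", " ").replace("_", " ")
        for keyword in keywords:
            if keyword.lower() in haystack:
                return keyword
    return None
```

Only the URL path and the anchor text are searched, not the host name, so a domain such as "mycompany.com" would not make every link on the site a positive result.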
2) build a web interface from which the user must be able to:
- upload a list of URLs to scrape (up to 200k, or more if possible)
- add and remove keywords
- start the "mining" process, pause it, stop it, and resume it
- see a real-time count of URLs processed, along with counts of active websites, positive results, and negative results
- download the URL lists of active websites, positively identified websites, and negative ones
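One possible way to back the start/pause/stop/resume controls and the real-time counters is a small shared job object that the workers consult before each URL. The sketch below is an assumption, not part of the brief: the ScrapeJob class and its method names are hypothetical, and it assumes that positive and negative results are only counted for active websites.

```python
import threading


class ScrapeJob:
    """Shared state between the web interface and the scraping workers."""

    def __init__(self):
        self._lock = threading.Lock()
        self._running = threading.Event()   # set = workers may proceed
        self._stopped = threading.Event()   # set = workers should exit
        self.counts = {"processed": 0, "active": 0, "positive": 0, "negative": 0}

    def start(self):
        self._stopped.clear()
        self._running.set()

    def pause(self):
        self._running.clear()

    def resume(self):
        self._running.set()

    def stop(self):
        self._stopped.set()
        self._running.set()   # release any worker blocked in wait_if_paused()

    def wait_if_paused(self):
        """Workers call this before each URL; returns False once stopped."""
        self._running.wait()
        return not self._stopped.is_set()

    def record(self, active, positive):
        """Update the counters for one processed URL."""
        with self._lock:
            self.counts["processed"] += 1
            if active:
                self.counts["active"] += 1
                self.counts["positive" if positive else "negative"] += 1

    def snapshot(self):
        """Consistent copy of the counters for the web interface to display."""
        with self._lock:
            return dict(self.counts)
```

The web interface would poll snapshot() to refresh the displayed counts, while each worker loops on `while job.wait_if_paused(): ...`.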
IMPORTANT NOTES:
The application needs to be multi-threaded and efficient, for maximum processing speed.
PLEASE ONLY BID IF YOU ARE THE DEVELOPER. (NO AGENCIES PLEASE)
PLEASE INDICATE IN PMB WHAT DEVELOPMENT LANGUAGE YOU INTEND TO USE
Thanks for your bid
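As a rough illustration of the multi-threading requirement noted above: because the work is dominated by waiting on network responses, a thread pool is one common fit. The sketch below uses Python's concurrent.futures; process_all and check_url are hypothetical names, and the worker count of 32 is an arbitrary assumption that would need tuning.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_all(urls, check_url, max_workers=32):
    """Run check_url(url) for every URL on a pool of worker threads.

    check_url is a caller-supplied function (hypothetical here) that
    returns a result label for one URL, e.g. "positive" or "negative".
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(check_url, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception:
                # One failing site should not abort the whole run.
                results[url] = "error"
    return results
```

For a list of 200k URLs it would probably be better to submit the work in chunks rather than all at once, so that progress can be saved and the job paused between chunks.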
I would like to work on this project. I plan to use Ruby on Rails and MySQL for the web server, and Nokogiri (a very popular Ruby gem for web scraping). I would use background jobs so the application remains usable while scraping is in progress. I will keep a record of previous runs so users can download the site-list files any time they want, and will save data to the database in batches to provide the stop/pause/resume functionality.
$720 USD in 7 days
4.6 (2 reviews)
1.6
4 freelancers are bidding on average $693 USD for this job