Develop a web crawler in Python
£250-750 GBP
Paid on delivery
We are academics at Imperial College London and New York University conducting research on startup companies and their legal policies on the web.
The task is to write a web crawler in Python 3 that retrieves historical information about company websites from the Wayback Machine ([login to view URL]).
We will also need a [login to view URL] file that shows any Python dependencies, and a Jupyter demo notebook that runs the crawler on a sample of inputs.
There will be more work available for a freelancer who completes this task to a high standard.
*** DESCRIPTION OF DESIRED CODE ***
The crawler should have two high-level functions.
FUNCTION 1: find_snapshots
Please use one of the Wayback APIs for this function if possible ([login to view URL]).
Input:
A website (e.g., [login to view URL])
A list of two dates in YYYYMMDD format (e.g., [20150202, 20150802])
Output:
A dictionary of links to all snapshots available on the Wayback Machine between the given dates (return an empty dictionary if no snapshots are available).
Example output:
out_dict = {'20150202' : '[login to view URL]://[login to view URL]'}
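A minimal sketch of find_snapshots follows. It assumes the Wayback CDX Server API as the API choice (the posting does not say which one is meant) and uses only the standard library. The helper names build_cdx_url and rows_to_snapshots are hypothetical. Keeping one snapshot per day via collapse=timestamp:8 and keying the dictionary by YYYYMMDD are also assumptions, chosen to match the example output above.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the CDX Server API is the Wayback API used for this function.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(website, dates):
    """Build the CDX query URL for a website and a [start, end] date pair."""
    start, end = (str(d) for d in dates)
    params = {
        "url": website,
        "from": start,
        "to": end,
        "output": "json",
        "fl": "timestamp,original",   # only the fields needed to build links
        "filter": "statuscode:200",   # skip captures of redirects and errors
        "collapse": "timestamp:8",    # assumption: one capture per day
    }
    return CDX_ENDPOINT + "?" + urllib.parse.urlencode(params)


def rows_to_snapshots(rows):
    """Convert CDX JSON rows (first row is the header) into {date: link}."""
    snapshots = {}
    for timestamp, original in rows[1:]:
        link = f"https://web.archive.org/web/{timestamp}/{original}"
        snapshots[timestamp[:8]] = link
    return snapshots


def find_snapshots(website, dates, timeout=30):
    """Return a dict of snapshot links between the two dates (empty if none)."""
    url = build_cdx_url(website, dates)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    rows = json.loads(body) if body.strip() else []
    return rows_to_snapshots(rows)
```

Splitting the query construction and the row conversion from the network call keeps both testable offline; only find_snapshots itself touches the network.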
FUNCTION 2: get_snapshot_info
Input:
A link to a snapshot (e.g., a value of the dictionary returned by find_snapshots)
A keyword (e.g., 'privacy')
Output:
A dictionary with information about the snapshot, obtained as follows
a) Visit the snapshot link and download the homepage HTML code (discard garbage such as 404 error pages). It is particularly important that this step handles the redirects that sometimes happen on the Wayback Machine (HTTP 302 responses and pop-ups are common) and does not return garbage in those cases.
b) Extract all hyperlinks from the homepage HTML (use of BeautifulSoup preferred). Find all hyperlinks containing the keyword EITHER in their text OR in the URL. Example of such a hyperlink: [login to view URL]://[login to view URL]
c) Visit all hyperlinks containing the keyword and download their HTML code (discard garbage such as 404 errors)
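Step b) could be sketched roughly as below. To keep the example free of dependencies it swaps in the standard library's html.parser in place of the preferred BeautifulSoup; the matching logic would be the same with either parser. The names find_keyword_links and _LinkCollector are hypothetical, and case-insensitive matching is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only record text while inside an open <a> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_keyword_links(html, base_url, keyword):
    """Return absolute links whose URL or link text contains the keyword."""
    collector = _LinkCollector()
    collector.feed(html)
    needle = keyword.lower()  # assumption: matching is case-insensitive
    found = []
    for href, text in collector.links:
        if needle in href.lower() or needle in text.lower():
            absolute = urljoin(base_url, href)
            if absolute not in found:  # drop duplicates, keep page order
                found.append(absolute)
    return found
```

For example, a link with the text "Legal" pointing to /privacy-policy matches on its URL, while a link with the text "Privacy & Terms" pointing to /terms matches on its text.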
Example output:
out_dict = {'homepage_download' : True, # boolean flag for whether the download in step a) was successful
'homepage_html' : string, # string containing HTML code of homepage downloaded in step a)
'keyword_links' : ['xxx','yyy','zzz'], # list of links found to contain keyword in step b) (return empty list if none found)
'keyword_download' : [True, False, True], # list of boolean flags for whether each download in step c) was successful
'keyword_html' : [string, string, string]} # list of strings containing HTML code of keyword pages downloaded in step c)
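One possible way to assemble this dictionary is sketched below, using only the standard library. It relies on urllib following HTTP redirects on its own and raising an error for 4xx/5xx responses; it does not attempt to handle pop-ups. The names fetch_html and extract_links are hypothetical: extract_links stands for any callable implementing step b) with the signature (html, base_url, keyword). Rejecting responses whose Content-Type is not HTML is an assumed, partial garbage filter.

```python
import urllib.request


def fetch_html(url, timeout=30):
    """Return (success, html). Redirects are followed; HTTP errors fail."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            content_type = resp.headers.get("Content-Type", "")
            if "html" not in content_type.lower():
                return False, ""  # assumption: non-HTML counts as garbage
            charset = resp.headers.get_content_charset() or "utf-8"
            return True, resp.read().decode(charset, errors="replace")
    except (OSError, ValueError):
        # OSError covers URLError/HTTPError (e.g. 404) and timeouts;
        # ValueError covers malformed URLs.
        return False, ""


def get_snapshot_info(snapshot_url, keyword, extract_links, fetch=fetch_html):
    """Run steps a) to c) and return the output dictionary."""
    ok, html = fetch(snapshot_url)  # step a)
    info = {
        "homepage_download": ok,
        "homepage_html": html,
        "keyword_links": [],
        "keyword_download": [],
        "keyword_html": [],
    }
    if not ok:
        return info
    for link in extract_links(html, snapshot_url, keyword):  # step b)
        link_ok, link_html = fetch(link)  # step c)
        info["keyword_links"].append(link)
        info["keyword_download"].append(link_ok)
        info["keyword_html"].append(link_html)
    return info
```

Passing fetch and extract_links as arguments is a design choice that lets the assembly logic be exercised with stubs, without contacting the archive.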
*** EVALUATION AND PROJECT COMPLETION ***
DEFINITION OF SUCCESS:
For a given set of inputs (website, dates, keyword), we define success in two stages:
The crawler is successful in stage 1 if homepage_download = True for at least one snapshot found in the given date range (and if the HTML is not garbage)
The crawler is successful in stage 2 if keyword_download = True for at least one subpage of a snapshot (and if the HTML is not garbage)
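The two-stage definition can be expressed as a small check over the dictionaries returned by get_snapshot_info for one input. This is only a sketch: stage_success and is_garbage_default are hypothetical names, and treating empty HTML as the only form of garbage is a placeholder assumption, since the posting does not define garbage precisely.

```python
def is_garbage_default(html):
    """Placeholder garbage test (assumption): only empty HTML is garbage."""
    return not html.strip()


def stage_success(snapshot_infos, is_garbage=is_garbage_default):
    """Return (stage1, stage2) for one input, given its snapshot dicts."""
    # Stage 1: at least one homepage downloaded with non-garbage HTML.
    stage1 = any(
        info["homepage_download"] and not is_garbage(info["homepage_html"])
        for info in snapshot_infos
    )
    # Stage 2: at least one keyword subpage downloaded with non-garbage HTML.
    stage2 = any(
        ok and not is_garbage(html)
        for info in snapshot_infos
        for ok, html in zip(info["keyword_download"], info["keyword_html"])
    )
    return stage1, stage2
```

A stricter is_garbage callable (for instance, one that detects archived error pages) can be passed in without changing the stage logic.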
EVALUATION OF FREELANCER OUTPUT:
We will give you a list of 500 inputs (website, dates, keyword) for development. We will test your results on a list of 500 different inputs.
Human trials have the following success rates with keyword = 'privacy':
- Stage 1 success: 60% if the website is a startup company in the given date range, and 90% if it is a mature company.
- Stage 2 success: 50% if the website is a startup company in the given date range, and 80% if it is a mature company.
To complete the project, you must achieve the following:
- success rates are reasonably close to human trials
- source code is clean and well documented
- Jupyter demo notebook is clean and well documented
- [login to view URL] allows code to be run without errors
Project ID: #20270382
About the project
Awarded to:
Hi, my name is Selim. I am from Solihull, UK. I read your `Develop a web crawler in Python` project description carefully before bidding. I checked the target URL and your requirements as well... I got what you need…
43 freelancers are bidding on average £526 for this job
Hello. I have worked with Wayback before and can show you a demo and code if you are interested. It's in Python 2.7, but it's not hard to switch to Python 3. Thanks, Helmot
Hello sir, thanks for your detailed job description. I have a full understanding of the job description and am very clear about the task. I have 9 years of experience in web scraping and am suitable for th…
Sir/Ma'am, I am a senior Python developer and have been working for 3 years now. I have done such work previously for Amazon, Instagram, AliExpress, etc., and can deliver it to you in less than a day. I can also help…
100% Completion Rate and 5 Stars. Dear employer, my name is Lee. I am an experienced web developer and web scraping expert. I have good experience in web scraping using PHP, Python, Java and so on. I read your job…
Hello! I'm interested in doing your project for parsing historical snapshots using Python. I'm ready to write a script that will do both stages of work described in your project. Just please provide me with sample inp…
Hi, I am an experienced Python developer and I can complete the task using a Jupyter notebook. I use BeautifulSoup for all my scraping projects. I will use the Wayback API for the first function and Selenium for the second o…
Greetings, I am an experienced professional scraper and have done similar projects in the past. This can be verified from my profile. Allow me to assist you with your requirements. Thanks
Hi, I have gone through your requirement to scrape lots of websites. I am EXPERT in building scraping tools/scripts. Hence, I can SURELY work on your project. I have 4 YEARS of EXPERIENCE in developing PHP-PYTHO…
Hi there, I have read through the project description. I can help you complete the project using Python scripting. I look forward to hearing from you. Please contact me on PM for details.
Hey there! Python crawler, scraping: I can do any sort of data mining or web scraping that you need done in a reasonable amount of time. I have years of Python, HTML, CSS and JavaScript experience under my belt a…
Hi, I've carefully gone through your job post. I have more than 4 years of experience in Python development. I am very much interested in your project with all of your requirements. I feel very confident about your project a…
Hi, dear. I am a senior software developer. I am very familiar with web scraping. I have just checked your project description, and I am able to complete this project. I am looking forward to your response. Thanks.
Hi, I have experience in Web scraping using Python 2 and 3. Please contact me for more details. Thanks,