Develop a web crawler in Python

Completed Posted 4 years ago Paid on delivery

We are academics at Imperial College London and New York University conducting research on startup companies and their legal policies on the web.

The task is to write a web crawler in Python 3 that retrieves historical information about company websites from the Wayback Machine ([login to view URL]).

We will also need a [login to view URL] file that shows any Python dependencies, and a Jupyter demo notebook that runs the crawler on a sample of inputs.

There will be more work available for a freelancer who completes this task to a high standard.

*** DESCRIPTION OF DESIRED CODE ***

The crawler should have two high-level functions.

FUNCTION 1: find_snapshots

Please use one of the Wayback APIs for this function if possible ([login to view URL]).

Input:

A website (e.g., [login to view URL])

A list of two dates in YYYYMMDD format (e.g., [20150202, 20150802])

Output:

A dictionary of links to all snapshots available on the Wayback Machine between the given dates, keyed by snapshot date (return an empty dictionary if no snapshots are available).

Example output:

out_dict = {'20150202' : '[login to view URL]://[login to view URL]'}
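
A minimal sketch of how find_snapshots could look, assuming the Wayback CDX server API is used (the API linked in the post is not visible, so this choice is an assumption). The helper name _fetch_cdx_rows and the fetch_rows parameter are hypothetical and exist only so the function can be exercised without network access. Keys follow the YYYYMMDD format of the example output, so only the first snapshot of each day is kept.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the CDX server API is the Wayback API being used.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def _fetch_cdx_rows(params):
    """Hypothetical helper: query the CDX endpoint, return decoded JSON rows."""
    with urlopen(f"{CDX_ENDPOINT}?{urlencode(params)}", timeout=30) as resp:
        body = resp.read().decode("utf-8")
    return json.loads(body) if body.strip() else []


def find_snapshots(website, dates, fetch_rows=_fetch_cdx_rows):
    """Return {YYYYMMDD: snapshot_url} for snapshots between the two dates."""
    start, end = (str(d) for d in dates)
    params = {
        "url": website,
        "from": start,
        "to": end,
        "output": "json",
        "fl": "timestamp,original",   # only the fields needed below
        "filter": "statuscode:200",   # skip captures of redirects and errors
    }
    rows = fetch_rows(params)
    snapshots = {}
    for timestamp, original in rows[1:]:  # the first row is the field header
        day = timestamp[:8]
        # Keep the first capture of each day so keys match the example output.
        snapshots.setdefault(
            day, f"http://web.archive.org/web/{timestamp}/{original}"
        )
    return snapshots
```

An empty result from the API yields an empty dictionary, as the specification asks.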

FUNCTION 2: get_snapshot_info

Input:

A link to a snapshot (e.g., a value of the dictionary returned by find_snapshots)

A keyword (e.g., 'privacy')

Output:

A dictionary with information about the snapshot, obtained as follows

a) Visit the snapshot link. Download the homepage HTML code (discard garbage such as 404 errors). It is particularly important that this step deals with the redirects that sometimes happen on the Wayback Machine (HTTP 302 responses and pop-ups are common) and does not return garbage in those cases.

b) Extract all hyperlinks from the homepage HTML (use of BeautifulSoup preferred). Find all hyperlinks containing the keyword EITHER in their text OR in the URL. Example of such a hyperlink: [login to view URL]://[login to view URL]

c) Visit all hyperlinks containing the keyword and download their HTML code (discard garbage such as 404 errors)

Example output:

out_dict = {'homepage_download': True,  # boolean flag for whether the download in step a) was successful

'homepage_html': string,  # string containing the HTML code of the homepage downloaded in step a)

'keyword_links': ['xxx', 'yyy', 'zzz'],  # list of links found to contain the keyword in step b) (return empty list if none found)

'keyword_download': [True, False, True],  # list of boolean flags for whether each download in step c) was successful

'keyword_html': [string, string, string]}  # list of strings containing the HTML code of the keyword pages downloaded in step c)
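
Steps a) to c) could be sketched as follows. This is an illustration under stated assumptions, not a finished implementation: it uses the standard library's html.parser in place of the preferred BeautifulSoup so that it runs with no dependencies, it returns an empty string for a failed download (the post does not say what to return), and it does not attempt to detect pop-ups or "garbage" pages that arrive with a 200 status. The names download_html, find_keyword_links and the fetch parameter are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class _LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def download_html(url):
    """Return (success, html). urlopen follows HTTP redirects on its own."""
    try:
        with urlopen(url, timeout=30) as resp:
            return True, resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):  # HTTPError (e.g. 404) is an OSError
        return False, ""           # assumption: empty string on failure


def find_keyword_links(html, base_url, keyword):
    """Step b): links with the keyword in their text OR in their URL."""
    parser = _LinkCollector()
    parser.feed(html)
    needle = keyword.lower()
    found = []
    for href, text in parser.links:
        if needle in href.lower() or needle in text.lower():
            absolute = urljoin(base_url, href)
            if absolute not in found:  # keep order, drop duplicates
                found.append(absolute)
    return found


def get_snapshot_info(snapshot_url, keyword, fetch=download_html):
    ok, html = fetch(snapshot_url)                                   # step a)
    links = find_keyword_links(html, snapshot_url, keyword) if ok else []
    results = [fetch(link) for link in links]                        # step c)
    return {
        "homepage_download": ok,
        "homepage_html": html,
        "keyword_links": links,
        "keyword_download": [r[0] for r in results],
        "keyword_html": [r[1] for r in results],
    }
```

Passing the downloader in as fetch keeps the link-matching logic testable offline; a real crawler would likely add a check on the final URL and page content to catch the redirect and pop-up cases mentioned in step a).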

*** EVALUATION AND PROJECT COMPLETION ***

DEFINITION OF SUCCESS:

For a given set of inputs (website, dates, keyword), we define success in two stages:

The crawler is successful in stage 1 if homepage_download = True for at least one snapshot found in the given date range (and if the HTML is not garbage)

The crawler is successful in stage 2 if keyword_download = True for at least one subpage of a snapshot (and if the HTML is not garbage)
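
The two-stage definition above can be expressed as a short check over the dictionaries returned by get_snapshot_info. This is a sketch only: the function names stage_success and success_rates are hypothetical, and the "HTML is not garbage" condition is not implemented here because the post does not define how garbage is recognised.

```python
def stage_success(snapshot_infos):
    """Return (stage1, stage2) for one input (website, dates, keyword).

    snapshot_infos holds one get_snapshot_info-style dict per snapshot found.
    Assumption: the download flags already exclude garbage HTML.
    """
    stage1 = any(info["homepage_download"] for info in snapshot_infos)
    stage2 = any(any(info["keyword_download"]) for info in snapshot_infos)
    return stage1, stage2


def success_rates(results_per_input):
    """Fraction of inputs that pass stage 1 and stage 2, as (rate1, rate2)."""
    if not results_per_input:
        return 0.0, 0.0
    outcomes = [stage_success(infos) for infos in results_per_input]
    total = len(outcomes)
    return (
        sum(s1 for s1, _ in outcomes) / total,
        sum(s2 for _, s2 in outcomes) / total,
    )
```

The resulting rates are the figures that would be compared against the human-trial percentages.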

EVALUATION OF FREELANCER OUTPUT:

We will give you a list of 500 inputs (website, dates, keyword) for development. We will test your results on a list of 500 different inputs.

Human trials have the following success rates with keyword = 'privacy':

- Stage 1 success: 60% if the website is a startup company in the given date range, and 90% if it is a mature company.

- Stage 2 success: 50% if the website is a startup company in the given date range, and 80% if it is a mature company.

For completion of the project, you must achieve the following:

- success rates are reasonably close to those of the human trials

- source code is clean and well documented

- Jupyter demo notebook is clean and well documented

- [login to view URL] allows code to be run without errors

Python Web Scraping Software Architecture PHP Data Mining

Project ID: #20270382

About the project

43 proposals Remote project Active 4 years ago

Awarded to:

seaanddream

Hi, my name is Selim. I am from Solihull, UK. I read your `Develop a web crawler in Python` project description carefully before bidding. I checked the target URL, and your requirements as well... I got what you need...

£750 GBP in 10 days
(356 Reviews)
8.8

43 freelancers are bidding on average £526 for this job

helmot

Hello. I have worked with the Wayback Machine before and can show you a demo and code if you are interested. It's in Python 2.7, but it's not hard to switch to Python 3. Thanks, Helmot

£500 GBP in 7 days
(240 Reviews)
8.3
zhangyingtai

Hello sir, thanks for your detailed job description. I fully understand it and am very clear about the task. I have 9 years of experience in web scraping and am suitable for th...

£555 GBP in 5 days
(129 Reviews)
7.6
mananraja

Hey, I have read what you need and checked the website you mentioned. I can make a Python scraper script to get this done. I will also fulfill the 4 requirements that you listed at the end of your description. I have...

£250 GBP in 2 days
(374 Reviews)
7.5
Guptapuru304

Sir/Ma'am, I am a senior Python developer and have been working for 3 years now. I have done such work previously for Amazon, Instagram, AliExpress etc. and can deliver it to you in less than a day. I can also help...

£250 GBP in 2 days
(83 Reviews)
7.6
p4logics

Dear Sir, I am interested in your project. I'm a senior Core Java, J2EE, JavaFX, Spring Boot, Spring JPA, Hibernate, Angular developer. I'm also an expert in web scraping using Java Selenium, jsoup and Python. I assure,...

£500 GBP in 7 days
(91 Reviews)
7.4
C3guru

Hello. I am a talented web scraping solution developer. In particular, I've mastered Selenium and Scrapy with Python. You can see from my profile that I have finished a lot of scraping jobs. I've just reviewed your requirements and...

£500 GBP in 7 days
(51 Reviews)
7.2
zeke

I have written many web crawlers. This is my favorite type of job. I am absolutely confident I can finish this project to your satisfaction and on time. Available to start immediately and finish as soon as possible. Looking f...

£500 GBP in 7 days
(211 Reviews)
7.6
alexwmsoft

100% Completion Rate and 5 Stars. Dear employer, my name is Lee. I am an experienced web developer and web scraping expert. I have good experience in web scraping using PHP, Python, Java and so on. I read your job...

£500 GBP in 7 days
(44 Reviews)
6.6
kunitsynartem

Hello! I'm interested in doing your project on parsing historical snapshots using Python. I'm ready to write a script that will do both stages of work described in your project. Just please provide me with sample inp...

£500 GBP in 7 days
(47 Reviews)
6.4
rajorshi1001

Hi, I am an experienced Python developer and I can complete the task using Jupyter notebook. I use BeautifulSoup for all my scraping projects. I will use the Wayback API for the first function and Selenium for the second o...

£300 GBP in 7 days
(63 Reviews)
6.1
farooq4161

Greetings, I am an experienced professional scraper and have done similar projects in the past. This can be verified from my profile. Allow me to assist you with your requirements. Thanks

£750 GBP in 8 days
(75 Reviews)
6.4
esolzpk

Hi, I have gone through the requirements in detail and I have a few questions. I specialize in website design and development and am excited for the opportunity to work with you in accomplishing your goals. We h...

£555 GBP in 6 days
(26 Reviews)
6.1
smsaurabhv

Hi, I have gone through your requirement to scrape lots of websites. I am an EXPERT in building scraping tools/scripts. Hence, I can SURELY work on your project. I have 4 YEARS of EXPERIENCE in developing PHP-PYTHO...

£250 GBP in 3 days
(130 Reviews)
6.2
damilareisaac

Hi there, I have read through the project description. I can help you complete the project using Python scripting. I look forward to hearing from you. Please contact me on PM for details.

£500 GBP in 7 days
(55 Reviews)
6.2
maryumakhter5

Hey there! Python crawler, scraping: I can do any sort of data mining or web scraping that you need done in a reasonable amount of time. I have years of Python, HTML, CSS and JavaScript experience under my belt a...

£500 GBP in 10 days
(36 Reviews)
5.7
BestService222

Hi, I’ve carefully gone through your job post. I have more than 4 years of experience in Python development. I am very much interested in your project with all of your requirements. I feel very confident about your project a...

£750 GBP in 10 days
(39 Reviews)
5.7
naishodayo

Hi, dear. I am a senior software developer. I am very familiar with web scraping. I have just checked your project description, and I am able to complete this project. I am looking forward to your response. Thanks.

£500 GBP in 7 days
(4 Reviews)
4.8
arajdhar

Hello, from the given description, it seems that you want to perform the following activities as part of developing the web crawler: 1. Use the Wayback Availability JSON API to determine whether a snapshot for a give...

£250 GBP in 7 days
(8 Reviews)
4.8
friendzsoft

Hi, I have experience in Web scraping using Python 2 and 3. Please contact me for more details. Thanks,

£277 GBP in 5 days
(6 Reviews)
4.2
DrPeixoto

Greetings, friend. I already have your code written and working according to your specifications. Please send me a message and I will immediately send it to you. You can also send me a sample of the inputs lis...

£250 GBP in 1 day
(15 Reviews)
4.2