
248837 Web-based website scraper

We are looking for a web scraper to be built in C# as an ASP.NET 3.5 Web Application Project (not a Web Site Project), using LINQ to SQL, an MS SQL 2005 database, and Visual Studio 2008. Rather than using regular expressions, the scraper should parse pages with the Html Agility Pack. [url removed, login to view]

The project needs to start right away and be finished in about one week.

The basic program flow of the scraper should be as follows:

-Read a list of starting URLs from the database

-Scan the page for product information

-Info to scrape: product name, description, retail price, sale price, brand, product URL, image URL, in stock, sizes, colors, SKU, scrape date, expiration date (if applicable)

-Insert the information into a database table

-Go to subsequent pages, scanning each one and inserting into the database until all pages are scraped
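The flow above could be sketched roughly as follows. This is only an illustration under assumptions: `ScrapedProduct`, `PageResult`, and `CrawlLoop` are hypothetical names, `fetchAndParse` stands in for downloading a page and applying the site's template, and `save` stands in for the database insert against the schemas to be provided.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shapes; the real table schemas are to be supplied separately.
public class ScrapedProduct
{
    public string Name { get; set; }
    public string ProductUrl { get; set; }
}

public class PageResult
{
    public List<ScrapedProduct> Products = new List<ScrapedProduct>();
    public List<string> NextPageUrls = new List<string>();
}

public static class CrawlLoop
{
    // Walks from the starting URLs through every linked "next" page once,
    // saving each product found. Returns the number of products saved.
    public static int Run(IEnumerable<string> startUrls,
                          Func<string, PageResult> fetchAndParse,
                          Action<ScrapedProduct> save)
    {
        Queue<string> pending = new Queue<string>(startUrls);
        HashSet<string> visited = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        int saved = 0;

        while (pending.Count > 0)
        {
            string url = pending.Dequeue();
            if (!visited.Add(url)) continue; // already scraped; avoids pagination loops

            PageResult page = fetchAndParse(url);
            foreach (ScrapedProduct product in page.Products)
            {
                save(product);
                saved++;
            }
            foreach (string next in page.NextPageUrls)
            {
                if (!visited.Contains(next)) pending.Enqueue(next);
            }
        }
        return saved;
    }
}
```

The visited set is a design choice of this sketch, not a stated requirement; it keeps a site whose last page links back to the first from being scraped forever.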

The web-based program needs to have the following features:

-Be able to scrape just one product on a given page (a product detail page), or scrape a series of products on a page and then on all subsequent pages.

-Example of one product [url removed, login to view]

-Example of many products to scrape, with other pages to drill down into and scrape: [url removed, login to view]

-The program needs to be able to run on a schedule and also on-demand.

-Insert gathered data into an MS SQL 2005 database. We will provide the table schemas.

-The scraper should not insert duplicate items, but if the price, size, or color has changed it should add the item as a new entry while keeping a reference to the original item it duplicates. These updates should be flagged so we know they are new changes.
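One way to read the duplicate rule is as a three-way decision per scraped item. The sketch below is an assumption about how that might look: `ProductRecord`, `OriginalId`, `IsChange`, and `ChangeDetector` are hypothetical names, and the real columns will depend on the schemas provided.

```csharp
using System;

// Hypothetical record; only the fields needed for the comparison are shown.
public class ProductRecord
{
    public int Id { get; set; }
    public string Sku { get; set; }
    public decimal RetailPrice { get; set; }
    public decimal? SalePrice { get; set; }
    public string Sizes { get; set; }
    public string Colors { get; set; }
    public int? OriginalId { get; set; } // reference to the item this row duplicates
    public bool IsChange { get; set; }   // flag so new changes are easy to find
}

public enum SaveAction { InsertNew, InsertChange, SkipDuplicate }

public static class ChangeDetector
{
    // existing is the most recent stored row for the same item, or null if none.
    public static SaveAction Decide(ProductRecord existing, ProductRecord scraped)
    {
        if (existing == null) return SaveAction.InsertNew;

        bool same = existing.RetailPrice == scraped.RetailPrice
                 && existing.SalePrice == scraped.SalePrice
                 && string.Equals(existing.Sizes, scraped.Sizes, StringComparison.OrdinalIgnoreCase)
                 && string.Equals(existing.Colors, scraped.Colors, StringComparison.OrdinalIgnoreCase);
        if (same) return SaveAction.SkipDuplicate;

        // Always point back at the first version of the item, not the latest copy.
        scraped.OriginalId = existing.OriginalId ?? existing.Id;
        scraped.IsChange = true;
        return SaveAction.InsertChange;
    }
}
```

Matching the existing row (for example by SKU plus site) is left to the caller here, since the posting does not say which key identifies an item.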

-The scraper should be able to detect "bad" data or page layout changes so we know when the scraper needs updating.

-The scraper needs to be an asynchronous, multithreaded application. Because many sites and pages are being scraped, we need to be able to see progress while it is running, and because many page requests will be required, the work needs to be spread across multiple threads.
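A minimal sketch of the threading and progress requirement, using only the thread pool and `Interlocked` counters. `ScrapeProgress` and `ParallelScraper` are hypothetical names, and `scrapeOne` is a placeholder for the per-URL scraping work.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Thread-safe counters that a status page could poll while a run is active.
public class ScrapeProgress
{
    private int _completed;
    private int _failed;

    public int Total { get; private set; }
    public int Completed { get { return Interlocked.CompareExchange(ref _completed, 0, 0); } }
    public int Failed { get { return Interlocked.CompareExchange(ref _failed, 0, 0); } }

    public ScrapeProgress(int total) { Total = total; }

    internal void MarkDone(bool ok)
    {
        if (ok) Interlocked.Increment(ref _completed);
        else Interlocked.Increment(ref _failed);
    }
}

public static class ParallelScraper
{
    // Queues one work item per URL and returns immediately; 'finished' is
    // signalled after the last URL has been processed.
    public static ScrapeProgress Start(IList<string> urls,
                                       Action<string> scrapeOne,
                                       ManualResetEvent finished)
    {
        ScrapeProgress progress = new ScrapeProgress(urls.Count);
        if (urls.Count == 0) { finished.Set(); return progress; }

        int remaining = urls.Count;
        foreach (string url in urls)
        {
            string current = url; // per-iteration copy for the closure
            ThreadPool.QueueUserWorkItem(delegate
            {
                bool ok = true;
                try { scrapeOne(current); }
                catch (Exception) { ok = false; } // one bad page should not stop the run
                progress.MarkDone(ok);
                if (Interlocked.Decrement(ref remaining) == 0) finished.Set();
            });
        }
        return progress;
    }
}
```

Because `Start` returns before the work finishes, the caller can read `Completed` and `Failed` at any time to report progress, and wait on `finished` only when it needs the final result.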

-The scraper should be able to run behind a proxy server if necessary.

-Every site we scrape will need its own “template” that tells the scraper how to find the data to extract. This is where the Html Agility Pack will be used. If regular expressions turn out to be easier for a particular site, they can be used instead.

-We should be able to easily create new “templates” for other pages we want to scrape in the future, and the scraper should be able to detect when a template doesn't match the site it is scraping.

-Along with the scraping templates, we need a way to specify how the scraper gets to the next page and all following pages until they are all scraped. We must be able to specify this for each website.
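As an illustration of what a per-site template and a mismatch check might look like, here is a sketch under assumptions. `SiteTemplate` and `TemplateValidator` are hypothetical names; the XPath strings are meant to be the kind of expression passed to Html Agility Pack's `SelectSingleNode`, and the extraction step itself is left out so the example has no external dependency.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical template: one XPath per field, plus how to find the next page.
public class SiteTemplate
{
    public string Host { get; set; }
    public Dictionary<string, string> FieldXPaths { get; set; }
    public List<string> RequiredFields { get; set; }
    public string NextPageXPath { get; set; }
}

public static class TemplateValidator
{
    // Returns the required fields that came back missing or blank. A non-empty
    // result suggests the page layout changed or the wrong template was applied.
    public static List<string> MissingFields(SiteTemplate template,
                                             IDictionary<string, string> extracted)
    {
        List<string> missing = new List<string>();
        foreach (string field in template.RequiredFields)
        {
            string value;
            bool found = extracted.TryGetValue(field, out value);
            if (!found || value == null || value.Trim().Length == 0)
                missing.Add(field);
        }
        return missing;
    }
}
```

Keeping templates as plain data like this means a new site could be added by inserting a row or a config entry rather than writing new code, which is one possible reading of "easily create new templates".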

-Provide a function with the following signature that can figure out the domain being scraped, pick the appropriate “template” to use, and know how to get to subsequent pages. This assumes we have a predefined list of templates when the project is finished.

public WebScrape ScrapeWebsite(WebsiteUrl websiteUrl)

-WebScrape – a class or struct that represents all the data that was scraped

-WebsiteUrl – the URL of the website to scrape, defined as below

public struct WebsiteUrl
{
    public const string Zappos = "[url removed, login to view]";
    public const string Nordstrom = "[url removed, login to view]";
    public const string Gap = "[url removed, login to view]";

    // this struct also needs a GetEnumerator defined since we will need
    // to iterate through all the members in it
}
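One possible way to satisfy the GetEnumerator requirement is to enumerate the struct's public string constants through reflection, so new sites only need a new constant. This is a sketch, not the required design, and the URL values below are placeholders because the real ones are not shown in the posting.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Reflection;

public struct WebsiteUrl : IEnumerable<string>
{
    // Placeholder values for illustration only.
    public const string Zappos = "http://example.com/zappos";
    public const string Nordstrom = "http://example.com/nordstrom";
    public const string Gap = "http://example.com/gap";

    // Yields the value of every public string constant declared on this struct.
    public IEnumerator<string> GetEnumerator()
    {
        FieldInfo[] fields = typeof(WebsiteUrl).GetFields(BindingFlags.Public | BindingFlags.Static);
        foreach (FieldInfo field in fields)
        {
            if (field.IsLiteral && field.FieldType == typeof(string))
                yield return (string)field.GetRawConstantValue();
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
```

With this in place, `foreach (string url in new WebsiteUrl())` visits each site URL; the order in which reflection returns the fields is not something to rely on.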

The “ScrapeWebsite” method should be part of a class named “MyScraper”. This is how it should be called:

// the function should be able to figure out by the page structure it's a single product
[url removed, login to view]("[url removed, login to view]");

// the function should be able to figure out by the page structure it's a list of products
// with other pages to navigate through to
[url removed, login to view]("[url removed, login to view]");
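The domain lookup that ScrapeWebsite is expected to perform could be sketched like this. `TemplateRegistry` is a hypothetical helper, templates are represented only by name, and treating a leading "www." as insignificant is an assumption of this sketch.

```csharp
using System;
using System.Collections.Generic;

public static class TemplateRegistry
{
    private static readonly Dictionary<string, string> TemplatesByHost =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

    public static void Register(string host, string templateName)
    {
        TemplatesByHost[Normalize(host)] = templateName;
    }

    // Returns the template name registered for the URL's host, or null when
    // the URL is not absolute or no template is known for that host.
    public static string Find(string websiteUrl)
    {
        Uri uri;
        if (!Uri.TryCreate(websiteUrl, UriKind.Absolute, out uri)) return null;

        string name;
        return TemplatesByHost.TryGetValue(Normalize(uri.Host), out name) ? name : null;
    }

    // Assumption: "www.shop.example" and "shop.example" share one template.
    private static string Normalize(string host)
    {
        return host.StartsWith("www.", StringComparison.OrdinalIgnoreCase)
            ? host.Substring(4)
            : host;
    }
}
```

A null result gives the caller a clear signal that no template exists for the site, which is different from a template that exists but no longer matches the page.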

Skills: .NET, Anything Goes, SQL



Project ID: #1995093