This project consists of developing a script that parses a publicly available website and pulls blocks of content and related information from its pages into a local database.
## Deliverables
The script (a spider) parses and stores the lyrics published on the site http://www.metrolyrics.com. For each lyrics page, the spider must store the following in the database:
- URL
- Lyrics
- Artist Name
- Song Writer(s)
- Source (GN/ML), as there are two types of lyrics pages on this site
- Album
- Publisher Credit(s)
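
As one possible way to hold the fields listed above, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `lyrics`, the column names, and the helpers `open_db` and `save_page` are assumptions made for illustration; this brief does not prescribe a database engine or schema.

```python
import sqlite3

# Hypothetical schema: one row per lyrics page, keyed by its URL.
# Column names are illustrative; only the field list comes from the brief.
SCHEMA = """
CREATE TABLE IF NOT EXISTS lyrics (
    url        TEXT PRIMARY KEY,
    lyrics     TEXT NOT NULL,
    artist     TEXT,
    writers    TEXT,
    source     TEXT CHECK (source IN ('GN', 'ML')),
    album      TEXT,
    publishers TEXT
);
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_page(conn, record):
    """Insert one parsed page. Returns True if the row was new,
    False if a row with the same URL was already stored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO lyrics "
        "(url, lyrics, artist, writers, source, album, publishers) "
        "VALUES (:url, :lyrics, :artist, :writers, :source, :album, :publishers)",
        record,
    )
    conn.commit()
    return cur.rowcount == 1
```

Using the URL as the primary key means a page saved twice does not produce a duplicate row. Multiple song writers or publisher credits are stored here as a single text value; a separate table per credit would be an alternative if they need to be queried individually.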
It must be able to revisit the site and grab only the content added since previous runs (i.e., keep track of the URLs that have already been parsed and do not revisit those pages).
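
The skip-already-parsed requirement could be sketched roughly as follows, again with Python's built-in `sqlite3`. The `seen_urls` table and the functions `open_seen_db`, `new_urls`, and `mark_parsed` are hypothetical names, and how candidate URLs are discovered on the site is left out entirely.

```python
import sqlite3


def open_seen_db(path=":memory:"):
    """Open the database and create the (hypothetical) table of parsed URLs."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS seen_urls (url TEXT PRIMARY KEY)")
    return conn


def new_urls(conn, candidates):
    """Return the candidate URLs that have not been parsed yet,
    in their original order and without duplicates."""
    seen = {row[0] for row in conn.execute("SELECT url FROM seen_urls")}
    fresh = []
    for url in candidates:
        if url not in seen:
            seen.add(url)  # also drops repeats within this batch
            fresh.append(url)
    return fresh


def mark_parsed(conn, url):
    """Record a URL as parsed so later runs skip it."""
    conn.execute("INSERT OR IGNORE INTO seen_urls (url) VALUES (?)", (url,))
    conn.commit()
```

On each run, the spider would pass the URLs it discovers through `new_urls`, fetch only the result, and call `mark_parsed` after each page is stored. This sketch loads all known URLs into memory per call, which assumes the total number of pages is small enough for that to be acceptable.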
It should also run quickly and efficiently: a full parse of the site should not take weeks. This may affect your choice of programming language.
We should be able to run this either on our Windows XP machines or on servers. XP machines are preferred.