Title: Multithreaded External Website Source Parser (Regular Expressions)
Project Type: Programmed Software, Windows (x64) Application, (Including Source Code)
Language: C#.NET, C, or VB.NET (Visual Basic 2010), or another programming language capable of producing the executable application
Interface: Screenshots of the interface design are attached in the Project Resources and should be followed closely
Budget: $150 total, with one proposed milestone payment of $75 for a completed single-threaded version that is limited to saving only the first 100 results of each type of information collected and that uses only a single regular-expression input file. The first milestone payment will be made after the demonstration application is reviewed. Even the first milestone demonstration application must meet certain speed standards, and the programmer must know how speed and accuracy will increase or decrease compared to the full multithreaded version, given an approximately 25 Mbps internet connection and a PC with an approximately 5000 PassMark score running Win x64 with 8 GB of RAM.
Project Summary: This project description is for a program whose purpose is parsing external website source code. The user will import a list of regular expressions (one per line) from a file named [url removed, login to view] in the following format:
beginning text##!##ending text
Whatever text in the source code appears between the 'beginning text' and the 'ending text' (where '##!##' marks the split in the input file) will be appended to the file [url removed, login to view]
The application will also read a second file with a list of URLs (one per line) named [url removed, login to view]
If multiple matches of the regular expression are found within the same page's source code, each one will be appended to the output file ([url removed, login to view])
Getting Started: This project is best suited for someone who has already developed this application in part or in whole, though it is quite straightforward for anyone familiar with scraping, data mining, and crawling of websites.
Speed, accuracy, and scalability: This software will be run on an approximately 25 Mbps internet connection (megabits, not megabytes) and an approximately 5000 PassMark score CPU running Win x64 with 8 GB of RAM. The acceptable accuracy requirement is 95%, meaning that for a list of 100 URLs whose corresponding source pages contain 100 matches for the regular expressions, at least 95 matches should be found and appended as 95 lines to the [url removed, login to view] file. The software will make use of large flat files with several million entries in [url removed, login to view], so it must be able to read large [url removed, login to view] files and append to rapidly growing [url removed, login to view] files without issue.
The desired speed of the software, taking into account the 95% accuracy requirement as well as the internet and hardware specifications of its machine, is approximately 1800 URLs/minute under typical web server speed conditions. The only difficulty in developing this software should be the treatment of slowly responding websites and URLs that cannot be reached; how they are handled is left to your discretion.
Please take a moment to review the attached project resources, which contain screenshots of the recommended GUI (user interface) for the software. For any questions regarding the project, feel free to PM me any time (I will check often) and I can provide additional contact information or simply answer any inquiries you have there. Thank you and good luck.