While I am new to freelance work, I have designed large-scale data crawlers for two different employers, gathering data from a variety of sources ranging from APIs to infrastructure to Web scraping. Unfortunately, both crawlers are still in use at private companies, so I cannot share the code. I can, however, describe their architecture in greater detail if you like.
I am highly confident I can complete this quickly, as I have substantial prior experience in exactly this area.
Technology-wise, I highly recommend Python with WebKit GTK bindings (or a similar browser-engine binding). Manipulating the rendered DOM directly is much cleaner than parsing the raw HTML source, although both approaches work well.
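To give you a feel for the DOM-based approach, here is a minimal sketch using Selenium as a stand-in for whichever browser binding we settle on; the URL and CSS selector are placeholders, not the target site's real markup:

```python
# Minimal sketch of DOM-based extraction. Selenium stands in for
# whichever browser binding we choose; the URL and selector below
# are placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # any WebDriver-backed browser works
try:
    driver.get("https://example.com/some-profile")  # placeholder URL
    # Query the rendered DOM instead of parsing raw HTML source.
    for node in driver.find_elements(By.CSS_SELECTOR, ".item-title"):
        print(node.text)
finally:
    driver.quit()
```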
The process would be as follows:
- Design a relational schema in the database (MySQL, I presume) that incorporates the data points you intend to gather, correctly modelling all objects, events, and relationships (see the first sketch after this list).
- Build the data crawler in a bottom-up, modular fashion. Give the data extraction modules the same hierarchy as your data model (e.g., a user extractor as the primary unit, containing a board reader, a pin reader (which itself contains a comment reader), a follower/following reader, etc.); the second sketch below shows the shape.
- Build the data transformation layer, which is greatly simplified if the crawler and the data source are similar in structure.
- Finally, determine the launch point for the crawler. The more sophisticated solution is to build a hub service for scheduling and execution management; a Spring Batch webapp on Tomcat works well if you want the crawler to run on a schedule (a lighter-weight Python alternative is sketched last below).
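For the schema step, here is a minimal sketch using SQLAlchemy against MySQL; the table and column names are illustrative assumptions based on the user/board/pin hierarchy described above, not a final design:

```python
# Minimal schema sketch (SQLAlchemy); table and column names are
# illustrative assumptions, not a final design.
from sqlalchemy import Column, DateTime, ForeignKey, Integer, String, Text, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    username = Column(String(64), unique=True, nullable=False)

class Board(Base):
    __tablename__ = "boards"
    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, ForeignKey("users.id"), nullable=False)
    title = Column(String(255), nullable=False)

class Pin(Base):
    __tablename__ = "pins"
    id = Column(Integer, primary_key=True)
    board_id = Column(Integer, ForeignKey("boards.id"), nullable=False)
    created_at = Column(DateTime)

class Comment(Base):
    __tablename__ = "comments"
    id = Column(Integer, primary_key=True)
    pin_id = Column(Integer, ForeignKey("pins.id"), nullable=False)
    body = Column(Text)

class Follow(Base):
    __tablename__ = "follows"
    follower_id = Column(Integer, ForeignKey("users.id"), primary_key=True)
    followed_id = Column(Integer, ForeignKey("users.id"), primary_key=True)

# engine = create_engine("mysql+pymysql://user:pass@localhost/crawl")
# Base.metadata.create_all(engine)
```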
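The bottom-up, modular structure from the second and third steps might look like the following; the extractor classes mirror the data model, and the fetch_* helpers are hypothetical stand-ins for real DOM/API calls (here they return empty data so the sketch runs as-is):

```python
# Sketch of a bottom-up extractor hierarchy mirroring the data model.
# The fetch_* helpers are hypothetical stand-ins for real DOM/API
# calls; here they return empty data so the sketch runs as-is.

def fetch_boards(user_page): return []
def fetch_pins(board): return []
def fetch_comments(pin): return []
def fetch_followers(user_page): return []

class CommentExtractor:
    def extract(self, pin):
        # Transformation is trivial when extractor output mirrors the schema.
        return [{"body": c} for c in fetch_comments(pin)]

class PinExtractor:
    def __init__(self):
        self.comments = CommentExtractor()
    def extract(self, board):
        return [{"pin": p, "comments": self.comments.extract(p)}
                for p in fetch_pins(board)]

class BoardExtractor:
    def __init__(self):
        self.pins = PinExtractor()
    def extract(self, user_page):
        return [{"board": b, "pins": self.pins.extract(b)}
                for b in fetch_boards(user_page)]

class UserExtractor:
    """Primary unit: composes the board/pin/comment and follower readers."""
    def __init__(self):
        self.boards = BoardExtractor()
    def extract(self, user_page):
        return {"boards": self.boards.extract(user_page),
                "followers": fetch_followers(user_page)}

print(UserExtractor().extract(object()))  # {'boards': [], 'followers': []}
```

Because each extractor emits rows shaped like the schema, the transformation layer collapses to almost nothing, which is the simplification mentioned in the third step.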
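If a full Spring Batch deployment is more than you need, a lighter-weight option is a Python scheduler library; here is a minimal sketch assuming APScheduler is installed, with run_crawl as a placeholder for the crawler's entry point:

```python
# Minimal scheduling sketch using APScheduler as a lightweight
# alternative to a Spring Batch webapp; run_crawl is a placeholder
# for the crawler's real entry point.
from apscheduler.schedulers.blocking import BlockingScheduler

def run_crawl():
    print("launching crawl...")  # invoke UserExtractor etc. here

scheduler = BlockingScheduler()
scheduler.add_job(run_crawl, "interval", hours=6)  # every six hours
scheduler.start()  # blocks; stop with Ctrl+C
```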