Scrape text from pdf (to csv)
$30-250 USD
Paid on delivery
Data need to be extracted from a 'searchable' pdf.
Some things to know -
There are libraries in Python etc. that make it very simple to extract text.
Once the text is extracted, 'keywords' can serve as triggers to harvest the data. For instance, the name of each town is always followed by a hyphen. Similarly, the colon ":" and phrases like 'basic service' can be used to locate other data. See the sample pdf.
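The trigger idea above can be sketched roughly as follows. This is only an illustration, not a finished parser: the sample text is made up (it is not taken from the attached pdf), the function names are hypothetical, and it assumes each service name starts its own line. pypdf is one library that could do the text extraction; any other would work as well.

```python
import re

SERVICES = [
    "Basic Service", "Expanded Basic Service", "Pay-Per-View",
    *[f"Pay Service {n}" for n in range(1, 11)],
    "Internet Service",
]

# Longest names first so "Pay Service 10" is tried before "Pay Service 1".
_names = sorted(SERVICES, key=len, reverse=True)
# Assumption: a service name only acts as a trigger at the start of a line.
SERVICE_RE = re.compile(
    r"^(" + "|".join(re.escape(n) for n in _names) + r")\b",
    re.MULTILINE,
)

def extract_text(pdf_path):
    """Pull the text out of a searchable pdf (needs the third-party pypdf)."""
    from pypdf import PdfReader  # imported lazily; not needed for the demo below
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def split_services(town_text):
    """Return (header, {service name: text under that service})."""
    matches = list(SERVICE_RE.finditer(town_text))
    header = town_text[:matches[0].start()] if matches else town_text
    sections = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(town_text)
        sections[m.group(1)] = town_text[m.end():end].strip()
    return header.strip(), sections

def split_town(header):
    """Split 'TOWN- company info' at the first hyphen.
    Assumption: the town name itself contains no hyphen."""
    town, _, rest = header.partition("-")
    return town.strip(), rest.strip()

# Made-up text in the assumed layout, standing in for extract_text(...) output.
sample = """ADDISON- Example Cable Co., 1 Main St.
Basic Service
Subscribers: 1,200.
Fee: $20.00 monthly.
Expanded Basic Service
Subscribers: 900.
Pay Service 1
Pay Units: 150.
"""

header, sections = split_services(sample)
town, company = split_town(header)
print(town, "|", company)
print(sections)
```

Hyphenated town names and service names that wrap across lines would need extra handling once the real layout is known.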
Details about the data -
The data are about cable systems in various US towns. Information within each town starts with name of the cable company, its address and any other information. And then goes on to describe various packages that the cable company offers.
We are interested in getting information on the channels, and a few other characteristics of various cable packages, for instance 'Basic Service', 'Expanded Basic Service', 'Pay Service 1', 'Pay-Per-View', 'Pay Service 2', 'Pay Service 3', 'Pay Service 4', 'Pay Service 5', 'Pay Service 6', 'Pay Service 7', 'Pay Service 8', 'Internet Service'
Not all towns will have all these packages. For instance, Abbeville just has 'Basic Service' while Addison has 'Basic Service', 'Expanded Basic Service' and 'Pay Service 1'.
Each of these 'services' has further attributes (again, not all attributes will be present all the time) - subscribers, pay units, programming (received off-air), programming (via satellite), miles of plant, state manager, manager, ownership, fee, current originations, local advertising, city fee, tv market ranking, channel capacity, equipment, addressable homes, program guide, chief technician
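One possible way to pull these attributes out of the text of a single service, using the colon as the trigger, is sketched below. It assumes every attribute appears as "label: value" and that a value runs until the next known label; the sample text and the function name are made up for illustration and may not match the real pdf.

```python
import re

ATTRIBUTES = [
    "subscribers", "pay units", "programming (received off-air)",
    "programming (via satellite)", "miles of plant", "state manager",
    "manager", "ownership", "fee", "current originations",
    "local advertising", "city fee", "tv market ranking",
    "channel capacity", "equipment", "addressable homes",
    "program guide", "chief technician",
]

def _label_pattern(label):
    # Allow any whitespace (including a line wrap) between the words of a label.
    return r"\s+".join(re.escape(word) for word in label.split())

# Longest labels first so "state manager" wins over "manager" at the same spot.
_labels = sorted(ATTRIBUTES, key=len, reverse=True)
ATTR_RE = re.compile(
    r"\b(" + "|".join(_label_pattern(a) for a in _labels) + r")\s*:",
    re.IGNORECASE,
)

def parse_attributes(service_text):
    """Return {attribute: value} for the labels found; absent ones are omitted."""
    matches = list(ATTR_RE.finditer(service_text))
    found = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(service_text)
        key = " ".join(m.group(1).lower().split())  # normalise case and spacing
        # Keep the first occurrence if a label shows up twice.
        found.setdefault(key, service_text[m.end():end].strip())
    return found

# Made-up text for one service, in the assumed "label: value" layout.
block = (
    "Subscribers: 1,200.\n"
    "Programming (received off-air): WAAA (ABC); WBBB (NBC).\n"
    "Fee: $20.00 monthly.\n"
    "City fee: 3% of gross.\n"
    "State manager: Jane Doe. Manager: John Roe."
)
attrs = parse_attributes(block)
print(attrs)
```

A value that happens to contain one of the labels followed by a colon would be split in the wrong place, so the output is worth spot-checking against the pdf.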
Output file:
We want the data in a csv. Each row will represent one town. The first column will hold the information about the cable company. The following columns will hold the data for each service.
More on that -
For each service:
'Basic Service', 'Expanded Basic Service', 'Pay Service 1', 'Pay-Per-View', 'Pay Service 2', 'Pay Service 3', 'Pay Service 4', 'Pay Service 5', 'Pay Service 6', 'Pay Service 7', 'Pay Service 8', 'Pay Service 9', 'Pay Service 10', 'Internet Service'
Create columns corresponding to each of the attributes:
subscribers, pay units, programming (received off-air), programming (via satellite), miles of plant, state manager, manager, ownership, fee, current originations, local advertising, city fee, tv market ranking, channel capacity, equipment, addressable homes, program guide, chief technician
So final column names would be something like -
basic [login to view URL], basic [login to view URL] units, basic [login to view URL] ....[login to view URL],....
Each column will carry its corresponding information. If a service is missing - assign missing values to all of its attribute columns (leave them blank). If an attribute is missing within a service - assign it as missing (leave it blank).
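A rough sketch of how such a csv could be laid out with Python's csv module follows. The 'service.attribute' column naming and the 'company info' column name are assumptions made for the example (the exact names are up to the client), and the Abbeville values are invented.

```python
import csv
import io

SERVICES = [
    "Basic Service", "Expanded Basic Service", "Pay Service 1", "Pay-Per-View",
    *[f"Pay Service {n}" for n in range(2, 11)],
    "Internet Service",
]
ATTRIBUTES = [
    "subscribers", "pay units", "programming (received off-air)",
    "programming (via satellite)", "miles of plant", "state manager",
    "manager", "ownership", "fee", "current originations",
    "local advertising", "city fee", "tv market ranking",
    "channel capacity", "equipment", "addressable homes",
    "program guide", "chief technician",
]

# First column: cable company information; then one column per (service, attribute).
# The "service.attribute" naming is an assumption, not the client's final choice.
COLUMNS = ["company info"] + [
    f"{service.lower()}.{attr}" for service in SERVICES for attr in ATTRIBUTES
]

def town_row(company_info, services):
    """Flatten {service: {attribute: value}} into one csv row (a dict)."""
    row = {"company info": company_info}
    for service, attrs in services.items():
        for attr, value in attrs.items():
            row[f"{service.lower()}.{attr}"] = value
    return row

def write_csv(rows, out):
    # restval="" leaves every missing service or attribute column blank.
    writer = csv.DictWriter(out, fieldnames=COLUMNS, restval="")
    writer.writeheader()
    writer.writerows(rows)

# Invented example: a town that only has 'Basic Service'.
rows = [
    town_row(
        "Example Cable Co., 1 Main St., Abbeville",
        {"Basic Service": {"subscribers": "1,200", "fee": "$20.00 monthly"}},
    )
]
buf = io.StringIO(newline="")
write_csv(rows, buf)
parsed = list(csv.DictReader(io.StringIO(buf.getvalue())))
print(len(COLUMNS), parsed[0]["basic service.subscribers"])
```

With 14 services and 18 attributes this gives 253 columns per row. DictWriter raises an error on any key that is not a known column, which helps catch misspelled service or attribute names early.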
A sample of the pdf is attached alongside.
Project ID: #4533554
About the project
21 freelancers are bidding on average $147 for this job
I can help in your project, please check PMB and our ratings/reviews to get idea of our experience. Please let me know if you have any queries.
Hi. I'm a PHP programmer with experience in text parsing projects. Please provide a sample of the searchable file. Regards.
I have just completed a project involving a .csv file (with 17 columns and over 14000 rows). So I think I can help you to do it.
Hello, I am very interested about your project and I am ready to start now. Check the PMB please.
I can extract u the data to a csv file and will send a sample when u need, how many files are they,?thx
Dear Sir, I am with over 8 years of professional experience in the commercial world, I am very specialized in Adobe Live Cycle, Adobe Photoshop, Ms. Office, WordPress, CSS, HTML, Word, PowerPoint and Data Entry.