Scrape text from pdf (to csv)

Cancelled Posted May 18, 2013 Paid on delivery

Data need to be extracted from a 'searchable' pdf.

Some things to know -

There are libraries in Python etc. that make it very simple to extract text.
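As one illustration of that point, here is a minimal sketch using the third-party pypdf package. The posting does not name a library, so pypdf is an assumption, and the function names `join_pages` and `extract_text` are hypothetical.

```python
def join_pages(page_texts):
    """Join per-page strings into one text, treating empty/None pages as blank."""
    return "\n".join(text or "" for text in page_texts)


def extract_text(pdf_path):
    """Return the text layer of a searchable PDF as a single string."""
    # Assumption: pypdf is installed (pip install pypdf). It is imported here
    # so that the rest of the module can be used without it.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return join_pages(page.extract_text() for page in reader.pages)
```

How cleanly the text comes out (line breaks, column order) depends on the PDF itself, so the result should be checked against the sample pdf before any parsing rules are written.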

Once text is extracted, one can use some 'keywords' as triggers to harvest the data. For instance, the names of towns are always followed by a hyphen. Similarly, the colon ":" and phrases like 'basic service' can be used for other data. See the sample pdf.
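The trigger idea could be sketched roughly as follows. The layout of the sample text below is invented for illustration (the real pdf may differ), and the helper names `split_town` and `split_services` are hypothetical. The sketch assumes each service heading starts its own line.

```python
import re

SERVICE_NAMES = (
    ["Basic Service", "Expanded Basic Service", "Pay-Per-View", "Internet Service"]
    + [f"Pay Service {n}" for n in range(1, 11)]
)

# Longest names first, so "Expanded Basic Service" and "Pay Service 10"
# are tried before their shorter look-alikes.
_alternatives = "|".join(
    re.escape(name) for name in sorted(SERVICE_NAMES, key=len, reverse=True)
)
# Assumption: a service heading begins at the start of a line.
HEADING_RE = re.compile(rf"^({_alternatives})\b", re.MULTILINE)


def split_town(block):
    """Split a town block at the first hyphen into (town, remainder).

    Assumption: the first hyphen is the one that follows the town name;
    a town name that itself contains a hyphen would need extra handling.
    """
    town, _, rest = block.partition("-")
    return town.strip(), rest.strip()


def split_services(text):
    """Map 'company' and each service heading found to the text that follows it."""
    matches = list(HEADING_RE.finditer(text))
    first = matches[0].start() if matches else len(text)
    sections = {"company": text[:first].strip()}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[match.group(1)] = text[match.end():end].strip()
    return sections


# Invented sample text, not taken from the real pdf.
sample = """ADDISON- Example Cable Co., 1 Main St.
Basic Service
Subscribers: 120.
Expanded Basic Service
Subscribers: 80.
Pay Service 1
Pay Units: 15.
"""

town, rest = split_town(sample)
sections = split_services(rest)
```

Here `town` holds the text before the first hyphen and `sections` holds one entry per heading found, so a town with only 'Basic Service' simply yields fewer entries.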

Details about the data -

The data are about cable systems in various US towns. The information for each town starts with the name of the cable company, its address and any other information, and then goes on to describe the various packages that the cable company offers.

We are interested in getting information on the channels, and a few other characteristics, of the various cable packages, for instance 'Basic Service', 'Expanded Basic Service', 'Pay Service 1', 'Pay-Per-View', 'Pay Service 2', 'Pay Service 3', 'Pay Service 4', 'Pay Service 5', 'Pay Service 6', 'Pay Service 7', 'Pay Service 8', 'Internet Service'.

Not all towns will have all these packages. For instance, Abbeville just has 'Basic Service' while Addison has 'Basic Service', 'Expanded Basic Service' and 'Pay Service 1'.

Each of these 'services' has further attributes (again, not all attributes will be present all the time) - subscribers, pay units, programming (received off-air), programming (via satellite), miles of plant, state manager, manager, ownership, fee, current originations, local advertising, city fee, tv market ranking, channel capacity, equipment, addressable homes, program guide, chief technician
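A rough sketch of how these attributes might be pulled out of one service's text, using the attribute names followed by a colon as triggers. The "label: value" layout and the sample string are assumptions for illustration, and `parse_attributes` is a hypothetical helper name.

```python
import re

ATTRIBUTES = [
    "subscribers", "pay units", "programming (received off-air)",
    "programming (via satellite)", "miles of plant", "state manager",
    "manager", "ownership", "fee", "current originations",
    "local advertising", "city fee", "tv market ranking",
    "channel capacity", "equipment", "addressable homes",
    "program guide", "chief technician",
]

# Longest labels first, so "state manager" and "city fee" are tried
# before "manager" and "fee".
_labels = "|".join(
    re.escape(label) for label in sorted(ATTRIBUTES, key=len, reverse=True)
)
# Assumption: each attribute appears as "<label>:" (any letter case).
ATTR_RE = re.compile(rf"\b({_labels})\s*:", re.IGNORECASE)


def parse_attributes(section_text):
    """Return {attribute: value} for the labels present; absent ones are omitted."""
    values = {}
    matches = list(ATTR_RE.finditer(section_text))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(section_text)
        raw = section_text[match.end():end]
        # Collapse whitespace and drop trailing separators (a cosmetic choice).
        values[match.group(1).lower()] = " ".join(raw.split()).rstrip(" .;")
    return values


# Invented sample text, not taken from the real pdf.
example = "Subscribers: 120. Fee: $10.00 monthly. State Manager: Jane Doe. Manager: John Roe."
parsed = parse_attributes(example)
```

Each value runs from its label to the next recognised label, so attributes that are missing from a service are simply absent from the result.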

Output file:

We want the data in a csv. Each row will represent one town. The first column will hold the information about the cable company. The following columns will hold the data for each service.

More on that -

For each service:

'Basic Service', 'Expanded Basic Service', 'Pay Service 1', 'Pay-Per-View', 'Pay Service 2', 'Pay Service 3', 'Pay Service 4', 'Pay Service 5', 'Pay Service 6', 'Pay Service 7', 'Pay Service 8', 'Pay Service 9', 'Pay Service 10', 'Internet Service'

Create columns corresponding to each of the attributes:

subscribers, pay units, programming (received off-air), programming (via satellite), miles of plant, state manager, manager, ownership, fee, current originations, local advertising, city fee, tv market ranking, channel capacity, equipment, addressable homes, program guide, chief technician

So final column names would be something like -

basic [login to view URL], basic [login to view URL] units, basic [login to view URL] ....[login to view URL],....
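The full set of column names could be generated by pairing every service with every attribute, as in the sketch below. The exact separator between service and attribute is not recoverable from the line above, so the underscore used here is an assumption, as are the lower-casing and the `column_names` helper name.

```python
SERVICES = (
    ["Basic Service", "Expanded Basic Service", "Pay Service 1", "Pay-Per-View"]
    + [f"Pay Service {n}" for n in range(2, 11)]
    + ["Internet Service"]
)

ATTRIBUTES = [
    "subscribers", "pay units", "programming (received off-air)",
    "programming (via satellite)", "miles of plant", "state manager",
    "manager", "ownership", "fee", "current originations",
    "local advertising", "city fee", "tv market ranking",
    "channel capacity", "equipment", "addressable homes",
    "program guide", "chief technician",
]


def column_names(services, attributes, sep="_"):
    """Return 'company' followed by one column per (service, attribute) pair."""
    # Assumption: sep="_" and lower-cased service names; adjust to taste.
    columns = ["company"]
    for service in services:
        for attribute in attributes:
            columns.append(f"{service.lower()}{sep}{attribute}")
    return columns


columns = column_names(SERVICES, ATTRIBUTES)
```

With the 14 services and 18 attributes listed above, this gives 252 service columns plus the company column.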

Each column will carry its corresponding information. If a service is missing, leave all of its attribute columns blank. If an attribute is missing within a service, leave that column blank.
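One way to get the blank-when-missing behaviour is Python's `csv.DictWriter` with `restval=""`, which fills any column absent from a row with an empty string. The sketch below uses a shortened, hypothetical column list and invented values; the underscore naming and the helper names `town_row` and `write_csv` are assumptions.

```python
import csv
import io


def town_row(company, services):
    """Flatten {service: {attribute: value}} into one CSV row dict."""
    row = {"company": company}
    for service, attributes in services.items():
        for attribute, value in attributes.items():
            row[f"{service.lower()}_{attribute}"] = value
    return row


def write_csv(rows, columns, out):
    """Write rows to `out`; columns missing from a row are left blank."""
    # restval="" supplies the blank; a key not listed in `columns` raises
    # ValueError (the DictWriter default), which flags naming mistakes early.
    writer = csv.DictWriter(out, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)


# Shortened, hypothetical column list and invented values for illustration.
columns = [
    "company",
    "basic service_subscribers",
    "basic service_fee",
    "pay service 1_pay units",
]
rows = [
    town_row("Example Cable Co.", {"Basic Service": {"subscribers": "120"}}),
    town_row(
        "Sample Cable Inc.",
        {
            "Basic Service": {"subscribers": "300", "fee": "$10"},
            "Pay Service 1": {"pay units": "15"},
        },
    ),
]

buffer = io.StringIO()
write_csv(rows, columns, buffer)
lines = buffer.getvalue().splitlines()
```

To write a real file instead of an in-memory buffer, pass a file opened with `open(path, "w", newline="")` as `out`.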

A sample of the pdf is attached alongside.

PDF PHP Web Scraping

Project ID: #4533554

About the project

21 proposals Remote project Active May 22, 2013

21 freelancers are bidding on average $147 for this job

SigmaVisual

I can help in your project, please check PMB and our ratings/reviews to get idea of our experience. Please let me know if you have any queries.

$231 USD in 5 days
(284 Reviews)
8.2
tzo

Can help you on this. Have some prior experience with pdf parsing.

$158 USD in 3 days
(250 Reviews)
6.9
samitXI

Please check your inbox...Thanks

$185 USD in 3 days
(111 Reviews)
7.1
zeke

Available to start immediately and finish as soon as possible.

$206 USD in 2 days
(199 Reviews)
7.5
AlGordo

Experienced with scraping of data.

$100 USD in 3 days
(49 Reviews)
6.1
pablotorres

i can do it

$155 USD in 30 days
(122 Reviews)
6.0
thetidevw

Hi, I can do this for you, but using PHP along with XPDF/pdftotext.

$157 USD in 3 days
(46 Reviews)
5.2
esafeguard

Hi. I'm a PHP programmer with experience in text parsing projects. Please provide a sample of the searchable file. Regards.

$150 USD in 3 days
(13 Reviews)
4.6
suriyant

I have strong experience in Python and in extracting data from PDFs. I can do it.

$126 USD in 3 days
(9 Reviews)
4.4
hemi

I did an in-depth analysis of this project. Please see more details in the private message. Thanks

$333 USD in 10 days
(13 Reviews)
4.1
ideadezigner

Hello Sir, Please check your private mail box

$206 USD in 1 day
(9 Reviews)
3.7
mikecrosa

I can do it without using Python. What is the deadline?

$144 USD in 2 days
(1 Review)
2.8
samic

Ready to work on it.

$144 USD in 3 days
(5 Reviews)
2.7
HoneyITSolution

Cool! Let's start this job and get it done. Thank you

$111 USD in 5 days
(2 Reviews)
2.2
huuban

I have just completed a project involving a .csv file (with 17 columns and over 14000 rows). So I think I can help you to do it.

$111 USD in 3 days
(1 Review)
1.4
raghu6574

I can do this

$150 USD in 3 days
(0 Reviews)
0.0
sanjoydam

Hello, I am very interested about your project and I am ready to start now. Check the PMB please.

$100 USD in 3 days
(0 Reviews)
0.0
rubanajjar

I can extract the data to a csv file for you and will send a sample when you need it. How many files are there? Thanks

$45 USD in 2 days
(0 Reviews)
0.0
ljupcob

I have good experience on extracting data in csv. Please give me a chance.

$111 USD in 3 days
(0 Reviews)
0.0
datamaster3

Dear Sir, I have over 8 years of professional experience in the commercial world. I am specialized in Adobe Live Cycle, Adobe Photoshop, Ms. Office, WordPress, CSS, HTML, Word, PowerPoint and Data Entry.

$111 USD in 1 day
(0 Reviews)
0.0