
Develop a crawler/parser for Wikipedia according to rules

€30-250 EUR

In Progress
Posted over 7 years ago

Paid on delivery
SUMMARY:
--------
I want a parser/crawler combination that retrieves specific HTML wiki pages, strips out content, and removes unneeded sections, attributes, tags and text.

What I expect as Product Owner/Customer:
- A Java-based command-line application.
- You can use whatever frameworks assist you (like JSoup or SAX).
- You can use Java 1.8 features like Streams.
- You can use the Wikipedia API if you want.
- The encoding is always UTF-8 in order to support languages ...
- The crawler must work for different languages, like de, it, en, es, pt, ru.
- A logging framework like log4j, logback or similar.
- You can use mvn or gradle.

THE INPUT:
----------
As input you will get specific tuples of URL, target folder and target name:
[login to view URL] --> folder: de/ filename: [login to view URL]
[login to view URL] --> folder: de/ filename: [login to view URL]
or
[login to view URL] --> folder: en/ filename: [login to view URL]
[login to view URL] --> folder: en/ filename: [login to view URL]
or
[login to view URL]%D0%97%D0%BE%D0%BB%D0%BE%D1%82%D0%BE --> folder: ru/ filename: Золото.html
[login to view URL]%D0%97%D0%BE%D0%BB%D0%BE%D1%82%D0%BE --> folder: ru/ filename: Самарий.html

The method signature should look like this:
boolean extractRelevantContentFromUrl(URL, folder, file);
The return value indicates whether the processing/extraction was successful. The extracted content should be written to the given file in the given folder, e.g.:
extractRelevantContentFromUrl("[login to view URL]", "de", "[login to view URL]");

The resulting string should be checked, and generation should stop if a disallowed tag is found (see below). This is especially important once my processes are established and Wikipedia changes its DOM.
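The check on the resulting string could be sketched as follows. This is only an illustration using the plain JDK: the class name `OutputCheck`, the helper `writeIfClean`, and the exact whitelist are assumptions, not part of the posting. The whitelist is inferred from the example output and the rules about sup/sub and images, and would need confirming with the client. A regex scan is a rough guard, and a real implementation would more likely inspect the DOM with a parser such as JSoup.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OutputCheck {

    // ASSUMPTION: whitelist inferred from the posting's example output;
    // the real list of allowed tags is the client's decision.
    static final Set<String> ALLOWED_TAGS = new HashSet<>(
            Arrays.asList("h1", "h2", "h3", "h4", "p", "sup", "sub", "img"));

    // Captures the name of every opening or closing tag, e.g. "<p>", "</h2>".
    // Limitation: a raw, unescaped "<x" inside text would also be reported.
    private static final Pattern TAG_NAME =
            Pattern.compile("</?([a-zA-Z][a-zA-Z0-9]*)");

    /** Returns the first tag name outside the whitelist, or empty if none. */
    static Optional<String> firstDisallowedTag(String html) {
        Matcher m = TAG_NAME.matcher(html);
        while (m.find()) {
            String name = m.group(1).toLowerCase(Locale.ROOT);
            if (!ALLOWED_TAGS.contains(name)) {
                return Optional.of(name);
            }
        }
        return Optional.empty();
    }

    /**
     * Writes the cleaned HTML as UTF-8 to folder/fileName.
     * Returns false (and writes nothing) if a disallowed tag is present
     * or if the file cannot be written.
     */
    static boolean writeIfClean(String cleanedHtml, Path folder, String fileName) {
        if (firstDisallowedTag(cleanedHtml).isPresent()) {
            return false; // stop generation, as the posting requires
        }
        try {
            Files.createDirectories(folder);
            Files.write(folder.resolve(fileName),
                    cleanedHtml.getBytes(StandardCharsets.UTF_8));
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

A method like `extractRelevantContentFromUrl(URL, folder, file)` could fetch and clean the page and then return the result of `writeIfClean`, so that a changed Wikipedia DOM surfaces as `false` rather than as polluted output.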
The resulting HTML must look like this, preserving the original tags and their original order of appearance:
<h1>Header of Page</h1>
<p>Text element n</p>
<p>Text element n+1</p>
<h2>header 2</h2>
<p>Text element n</p>
<p>Text element n+1</p>
<h3>header 3</h3>
<p>Text element n</p>
<p>Text element n+1</p>
<h4>header 4</h4>

EXAMPLES:
---------
Examples are in the attachment, etc.

SOME RULES:
-----------
What I DON'T want in the resulting HTML string; these are [login to view URL] criteria:
- Tables (completely removed)
- Anchors (the anchor text is still needed): <a href="[login to view URL]">Text</a> becomes just "Text", without the double quotes
- sup (citations from the wiki are completely removed; other sups and subs must stay, as in the example)
- Images: paths have to be rewritten to allow hotlinking if a relative path is set (I did not see relative paths for images)
- Section: Bibliography. You will need to find a way to omit this information in all languages
- Section: External Links. You will need to find a way that is flexible for all languages
- Empty tags must be removed, like <p></p> or <div id="someid"></div>; take care of possible whitespace (trim)
- mw-editsection 's ...
- toc
- Polluted HTML in general: I just want clean tags
- The extraction process must work with all language setups

What I expect as deliverables:
- The complete source code and the pom/gradle file if frameworks/libraries are used; we can share via a private repo
- All libraries used in the pom/gradle dependencies that I cannot get via the default repositories (like private ones)
- Clean code
- Documentation where necessary, in English
- It is possible that the method signature boolean extractRelevantContentFromUrl(URL, folder, filename) is not sufficient. If that's the case, we talk (Mail, Phone, Skype, Facetime or whatever)

What I deliver to you:
- 10 languages consisting of
- 118 elements each. You will get a Java class/enum that holds these tuples (URL, folder, filename), which you can just copy and paste.
- The money :-)
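A few of the rules above (citation sups, anchors, image paths, empty tags) could be sketched like this. It is a simplified, JDK-only illustration: the class `WikiRules` and its methods are hypothetical names, and the sketch assumes that citation markers are rendered as `<sup ... class="reference">`, that attribute values are double-quoted, and that `src` values are valid URI references. Regex-based rewriting is fragile on real pages, so a DOM parser such as JSoup, which the posting allows, would be the more robust choice.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WikiRules {

    // ASSUMPTION: citation markers carry class="reference"; other sups stay.
    private static final Pattern CITATION_SUP = Pattern.compile(
            "<sup\\b[^>]*\\bclass=\"[^\"]*\\breference\\b[^\"]*\"[^>]*>.*?</sup>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // An anchor element; group 1 is the anchor text that must survive.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\b[^>]*>(.*?)</a>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // An element whose content is empty or whitespace only.
    private static final Pattern EMPTY_ELEMENT = Pattern.compile(
            "<([a-z][a-z0-9]*)\\b[^>]*>\\s*</\\1>", Pattern.CASE_INSENSITIVE);

    // The src attribute of an img tag; group 2 is the path to rewrite.
    private static final Pattern IMG_SRC = Pattern.compile(
            "(<img\\b[^>]*?\\ssrc=\")([^\"]*)(\")", Pattern.CASE_INSENSITIVE);

    /** Applies the four rules in order and returns the cleaned HTML. */
    static String clean(String html, URI pageUrl) {
        String out = CITATION_SUP.matcher(html).replaceAll("");
        out = ANCHOR.matcher(out).replaceAll("$1"); // keep only the anchor text
        out = absolutizeImages(out, pageUrl);
        String previous;
        do { // repeat so that elements emptied by an earlier pass also go
            previous = out;
            out = EMPTY_ELEMENT.matcher(out).replaceAll("");
        } while (!out.equals(previous));
        return out;
    }

    /** Resolves relative or protocol-relative image paths against the page URL. */
    static String absolutizeImages(String html, URI pageUrl) {
        Matcher m = IMG_SRC.matcher(html);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String absolute = pageUrl.resolve(m.group(2)).toString();
            m.appendReplacement(sb,
                    Matcher.quoteReplacement(m.group(1) + absolute + m.group(3)));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

Tables, the toc, mw-editsection spans, leftover attributes, and the language-dependent Bibliography and External Links sections are not covered here. Those depend on page structure, for example on which heading starts a section, and are easier to express as DOM operations than as string patterns.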
Project ID: 11344706

About the project

5 proposals
Remote project
Active 8 yrs ago

5 freelancers are bidding on average €187 EUR for this job
Wishing you peace in your life. I have data extraction experience with websites such as Facebook, StumbleUpon, YouTube, Amazon, LinkedIn, Twitter, eBay and Yelp. Web scraping (HttpClient, JSoup, HtmlUnit): 1 year 2 months. I have more than 8 years of experience in Java. A breakdown of my experience, to the best of my knowledge: Spring 2.x - 3.x: 2 years 1 month; Hibernate 3.x - 4.x: 2 years 4 months; Struts 1.x - 2.x: 14 months; HtmlUnit: 4 months; XSLT/XML/DOM/JAXB: 2 months; jQuery: 9 months; JavaScript/OOP JavaScript: 6 months; design patterns: 8 months; database design (Oracle/MySQL): 33 months. I work at a reputed company, Cognizant, in Pune, India. I consider myself a problem solver. "Honesty is the best policy", and I really mean it. I have worked with Oracle/MySQL databases and Java-related frameworks most of the time. Basically I am a web/desktop developer. Thanks & Regards
€200 EUR in 15 days
4.7 (18 reviews)
5.0
Java professional, rich experience in web scraping
€111 EUR in 5 days
5.0 (1 review)
1.9
Hello, I am a professional programmer working with PHP, MySQL and WordPress. I have 5 years of experience in web development programming, with excellent experience in the following areas:
- Designing, building and maintaining eCommerce websites
- Website redesign
- PSD to HTML
- HTML to WordPress
- Joomla to WordPress
- WordPress to Joomla
- Building eCommerce websites using raw PHP or a CMS
- SEO-optimized work
- Strong database knowledge, specifically MySQL and querying databases
- Extensive knowledge of JavaScript, CSS and (X)HTML
- PHP, jQuery, AJAX, HTML5 and CSS3 as well
I guarantee high-quality work, quick answers to your messages and responsibility. I would like to do the job first and then get paid. I am strong in PHP/MySQL, HTML5, CSS, CSS3, Bootstrap, jQuery, AJAX, WordPress with WooCommerce, Magento and the CodeIgniter framework too, so I look forward to hearing from you. Thank you for the opportunity!
Hello sir, I am a professional PHP and WordPress developer. I possess excellent communication skills and can liaise effectively with clients. Other strong points include an ability to work as part of a team or individually, multi-task, prioritize and work to deadlines under pressure. I have worked as a part-time freelance developer on many different projects in different companies.
- I have 5 years of experience in PHP, MySQL, WordPress and eCommerce site building
- I have worked both as an independent developer and as a team player. I know how to manage my time and complete the j
€222 EUR in 3 days
5.0 (2 reviews)
0.8

About the client

Worms, Germany
5.0
2
Payment method verified
Member since Aug 18, 2016

Freelancer ® is a registered Trademark of Freelancer Technology Pty Limited (ACN 142 189 759)
Copyright © 2024 Freelancer Technology Pty Limited (ACN 142 189 759)