Develop a crawler/parser for Wikipedia according to rules
€30-250 EUR
In Progress
Posted over 7 years ago
Paid on delivery
SUMMARY:
--------
I want a crawler/parser combination that retrieves specific Wikipedia HTML pages and strips them down to their content by removing unneeded sections, attributes, tags and text.
What I expect as Product Owner/Customer:
- Java-based command-line application.
- You may use any framework or library that helps you (such as JSoup or SAX).
- You may use Java 1.8 features such as Streams.
- You may use the Wikipedia API if you want.
- All processing is always done in UTF-8 in order to support languages ...
- The crawler must work for different languages, like de, it, en, es, pt, ru.
- Use a logging framework such as log4j, logback or similar.
- You may use Maven or Gradle.
THE INPUT:
----------
As input you will get specific tuples of URLs, target-folder and target name
[login to view URL] --> folder: de/ filename: [login to view URL]
[login to view URL] --> folder: de/ filename: [login to view URL]
or
[login to view URL] --> folder: en/ filename: [login to view URL]
[login to view URL] --> folder: en/ filename: [login to view URL]
or
[login to view URL]%D0%97%D0%BE%D0%BB%D0%BE%D1%82%D0%BE --> folder: ru/ filename: Золото.html
[login to view URL]%D0%97%D0%BE%D0%BB%D0%BE%D1%82%D0%BE --> folder: ru/ filename: Самарий.html
The method signature should look like this:
boolean extractRelevantContentFromUrl(URL, folder, file);
The return value indicates whether the processing/extraction was successful. The extracted content should be written to the given file in the given folder, e.g.:
extractRelevantContentFromUrl("[login to view URL]", "de", "[login to view URL]");
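A minimal sketch of the output half of such a method, assuming the cleaned HTML is already available as a string. The class name ResultWriter and the helper writeResult are hypothetical and not part of this posting; only the folder/file arguments, the UTF-8 requirement and the boolean result come from the description above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: name and shape are assumptions for illustration only.
final class ResultWriter {

    /**
     * Writes the cleaned HTML to folder/file as UTF-8.
     * Returns true on success and false if the path is invalid or an I/O error occurs.
     */
    static boolean writeResult(String cleanedHtml, String folder, String file) {
        try {
            Path dir = Paths.get(folder);
            Files.createDirectories(dir); // no-op if the folder already exists
            Files.write(dir.resolve(file), cleanedHtml.getBytes(StandardCharsets.UTF_8));
            return true;
        } catch (IOException | InvalidPathException e) {
            return false;
        }
    }
}
```

A call such as ResultWriter.writeResult(html, "ru", "Золото.html") would then create ru/ if needed and overwrite the file on each run. Fetching and cleaning the page are deliberately left out of this sketch.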
The resulting string should be checked, and the generation process should be stopped if a disallowed tag is found (see below).
This is especially important once my processes are established and Wikipedia later changes its DOM.
The resulting HTML must look like this, preserving the original tags and the original order of appearance:
<h1>Header of Page</h1>
<p>Text element n</p>
<p>Text element n+1</p>
<h2>header 2</h2>
<p>Text element n</p>
<p>Text element n+1</p>
<h3>header 3</h3>
<p>Text element n</p>
<p>Text element n+1</p>
<h4>header 4</h4>
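The check for disallowed tags mentioned above could be sketched as a simple whitelist scan over the result string. The set of allowed tags is an assumption, derived from the sample output plus the rules that sup, sub and images stay; TagWhitelistCheck and firstDisallowedTag are hypothetical names, not something this posting prescribes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical validator: names and the whitelist are assumptions.
final class TagWhitelistCheck {

    // Assumed whitelist; adjust it to whatever the final rules allow.
    private static final Set<String> ALLOWED = new HashSet<>(
            Arrays.asList("h1", "h2", "h3", "h4", "p", "sup", "sub", "img"));

    // Matches the name of every opening or closing tag.
    private static final Pattern TAG = Pattern.compile("</?\\s*([a-zA-Z][a-zA-Z0-9]*)");

    /** Returns the first tag name that is not whitelisted, or empty if the HTML is clean. */
    static Optional<String> firstDisallowedTag(String html) {
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(1).toLowerCase(Locale.ROOT);
            if (!ALLOWED.contains(name)) {
                return Optional.of(name);
            }
        }
        return Optional.empty();
    }
}
```

The extraction method could return false (and log the offending tag) whenever firstDisallowedTag yields a value, which would make a changed Wikipedia DOM show up as a failed run instead of as polluted output.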
EXAMPLES:
---------
Examples are in the attachment.
SOME RULES:
-----------
What I DON'T want in the resulting HTML string; these are [login to view URL] criteria:
- Tables (completely removed)
- Anchors (the anchor text is still needed: <a href="[login to view URL]">Text</a> becomes just "Text", without the double quotes)
- sup (citations from the wiki are completely removed; other sups and subs must stay, as in the example)
- Images: paths have to be rewritten to allow hot linking if a relative path is set (I did not see relative paths for images)
- Section: Bibliography; you will need to find a way to omit this information in all languages
- Section: External Links; you will need to find a way that is flexible for all languages
- Empty tags must be removed, like <p></p> or <div id="someid"></div>; take care of possible whitespace (trim).
- mw-editsection elements ...
- toc
- In general, polluted HTML: I just want clean tags.
- The extraction process must work with all language setups
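Three of the rules above (anchors, empty tags, omitted sections) can be illustrated with a small standard-library sketch. This is only an approximation under stated assumptions: it uses regular expressions on well-formed, flat HTML, whereas a DOM parser such as JSoup would be the more robust route for real Wikipedia pages. The class and method names are hypothetical, and the set of section titles to drop is assumed to be supplied per language by the caller.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical rule sketches; regex-based and therefore only suitable for simple input.
final class CleanupRules {

    private static final Pattern ANCHOR =
            Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a\\s*>");
    private static final Pattern EMPTY =
            Pattern.compile("(?is)<([a-z][a-z0-9]*)\\b[^>]*>\\s*</\\1\\s*>");
    private static final Pattern HEADING =
            Pattern.compile("(?is)<h([1-6])\\b[^>]*>(.*?)</h\\1\\s*>");

    /** Replaces every anchor element by its inner text. */
    static String unwrapAnchors(String html) {
        return ANCHOR.matcher(html).replaceAll("$1");
    }

    /** Removes elements that contain only whitespace, repeating until nothing changes. */
    static String removeEmptyElements(String html) {
        String previous;
        String current = html;
        do {
            previous = current;
            current = EMPTY.matcher(previous).replaceAll("");
        } while (!current.equals(previous));
        return current;
    }

    /** Drops each section whose heading text is in titles, up to the next heading of the same or a higher level. */
    static String dropSections(String html, Set<String> titles) {
        StringBuilder out = new StringBuilder();
        int skipLevel = 0; // 0 means "not skipping"
        for (String chunk : html.split("(?i)(?=<h[1-6]\\b)")) {
            Matcher m = HEADING.matcher(chunk);
            if (m.lookingAt()) {
                int level = Integer.parseInt(m.group(1));
                String title = m.group(2).replaceAll("<[^>]*>", "").trim();
                if (skipLevel != 0 && level <= skipLevel) {
                    skipLevel = 0; // the skipped section has ended
                }
                if (skipLevel == 0 && titles.contains(title)) {
                    skipLevel = level; // start skipping this section
                }
            }
            if (skipLevel == 0) {
                out.append(chunk);
            }
        }
        return out.toString();
    }
}
```

For German pages the caller might pass titles such as "Literatur" and "Weblinks"; how the titles for the other languages are obtained is left open here, as it is in the rules above.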
What I expect as deliverables:
- The complete source code and the pom/gradle file if frameworks/libraries are used; we can share via a private repo
- All libraries used in the pom/gradle dependencies that I cannot get via the default repositories (like private ones)
- Clean code
- Documentation where necessary, in English
- It is possible that the method signature boolean extractRelevantContentFromUrl(URL, folder, filename) is not sufficient. If that's the case, we talk (mail, phone, Skype, FaceTime or whatever)
What I deliver to you:
- 10 languages consisting of
- 118 elements each. You will get a Java class/enum that holds these tuples (URL, folder, filename), so you can just copy and paste it.
- The money :-)
WISHING YOU PEACE IN YOUR LIFE.
Data extraction experience from websites, to name a few: Facebook, StumbleUpon, YouTube, Amazon, LinkedIn, Twitter, eBay and Yelp.
Web scraping (HttpClient, JSoup, HtmlUnit): 1 year 2 months.
I have more than 8 years of experience in Java.
The breakdown of my experience, to the best of my knowledge:
Spring 2.x - 3.x: 2 years 1 month
Hibernate 3.x - 4.x: 2 years 4 months
Struts 1.x - 2.x: 14 months
HtmlUnit: 4 months
XSLT/XML/DOM/JAXB: 2 months
jQuery: 9 months
JavaScript/OOP JavaScript: 6 months
Design Patterns: 8 months
Database design (Oracle/MySQL): 33 months
Working at a reputed company, Cognizant, in Pune, India.
I believe myself to be a problem solver. "Honesty is the best policy"; I really mean it.
Worked with Oracle/MySQL databases and Java-related frameworks most of the time.
Basically I am a web/desktop developer.
Thanks & Regards
Hello
I am a professional PHP + MySQL + WordPress programmer with 5 years of experience in web development programming.
I have excellent experience in the following area
- Designing, building and maintaining eCommerce websites.
- Website Redesign
- PSD to Html
- Html To Wordpress
- Joomla to Wordpress
- Wordpress to Joomla
- Build e-commerce websites using raw PHP or a CMS
- SEO optimized work
- Strong database knowledge specifically MySQL and querying database.
- Extensive knowledge of JavaScript, CSS and (X) HTML.
- PHP, jQuery, AJAX, HTML5 and CSS3 as well.
I guarantee high-quality work, quick answers to your messages, and responsibility.
I would like to do the job first and then get paid.
I am great at PHP/MySQL, HTML5, CSS, CSS3, Bootstrap, jQuery, Ajax, WordPress with WooCommerce, Magento and the CodeIgniter framework too, so I look forward to hearing from you.
Thank you for opportunity!!!
Hello sir
I am a professional PHP and Wordpress Developer. I possess excellent communication skills and can liaise effectively with clients. Other strong points include an ability to work as part of a team or individually, multi-task, prioritize and work to deadlines under pressure.
I have worked as a part time freelance developer in many different projects in different companies.
- I have 5 years experience: PHP+MySQL+Wordpress+Ecommerce site building
- I worked as independent developer and like a team player. I know how to manage my time and complete the j