Performing tasks and queries in the Hortonworks environment using Hive and Pig
$30-250 AUD
Cancelled
Posted over 8 years ago
Paid on delivery
This assignment includes performing a number of tasks and queries in the Hortonworks
environment using Hive and Pig. The focus is on a comparative evaluation of Hive and
Pig based on performance (time taken, number of jobs, etc.).
The data files: I will provide them after confirmation; they are just CSV files.
Task 1 (2 marks):
− Upload the files using Hortonworks File Browser.
− You also need to take care of the encoding and delimiters so the data is populated
under the right columns. This can be done at different levels (e.g. at command line, in
HCatalog, or when loading data from the file into the table).
− Provide screenshots.
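As a sketch of how the delimiter and encoding can be handled at table-definition time, a possible Hive DDL follows. The table name, column names, and HDFS path are assumptions; adjust them to the actual CSV header and upload location.

```sql
-- Hypothetical Hive table for the books CSV; column names are assumed.
CREATE TABLE IF NOT EXISTS books (
  isbn      STRING,
  title     STRING,
  author    STRING,
  year_pub  INT,
  publisher STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','          -- match the delimiter the file actually uses
STORED AS TEXTFILE
TBLPROPERTIES ('skip.header.line.count'='1');

-- For non-UTF-8 files, the encoding can be declared on the SerDe:
-- ALTER TABLE books SET SERDEPROPERTIES ('serialization.encoding'='ISO-8859-1');

-- Load the uploaded file from HDFS into the table (path is an assumption):
LOAD DATA INPATH '/user/hue/books.csv' INTO TABLE books;
```

Equivalent cleanup could instead be done at the command line (e.g. with `iconv`) or when defining the table in HCatalog, as the task notes.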
Task 2 (6 marks):
− Find out the number of published books per publisher using Pig.
− Find out the publisher with the highest number of books published using Hive. Your
result should also show the number of books published by this publisher.
− Provide your Hive queries, Pig scripts, and the screenshots of the results (and tables).
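One possible shape for these two queries, assuming the fields `isbn`, `title`, `author`, and `publisher` (names and paths are assumptions, not given in the brief):

```pig
-- Books per publisher (Pig); field names are assumed
books  = LOAD '/user/hue/books.csv' USING PigStorage(',')
         AS (isbn:chararray, title:chararray, author:chararray, publisher:chararray);
by_pub = GROUP books BY publisher;
counts = FOREACH by_pub GENERATE group AS publisher, COUNT(books) AS num_books;
DUMP counts;
```

```sql
-- Publisher with the most books (Hive), including the count
SELECT publisher, COUNT(*) AS num_books
FROM   books
GROUP  BY publisher
ORDER  BY num_books DESC
LIMIT  1;
```

The ORDER BY + LIMIT approach adds an extra MapReduce job over the plain GROUP BY, which is exactly the kind of difference the log comparison should surface.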
Task 3 (6 marks):
− Find out the average rating for each book in both pig and hive. The results should also
show the book title, isbn, author and publisher.
− Record and compare Hive and Pig based on the information in the logs, including
the total time taken as well as other factors such as the number of jobs, maps, and reducers.
− Provide your hive queries, pig scripts and the screenshots of the results (and tables).
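A sketch of the average-rating query in both languages, assuming a separate ratings file keyed by ISBN (the `ratings` table/file, its columns, and the paths are assumptions):

```sql
-- Hive: average rating per book with title, isbn, author, and publisher
SELECT b.title, b.isbn, b.author, b.publisher,
       AVG(r.rating) AS avg_rating
FROM   books b
JOIN   ratings r ON b.isbn = r.isbn
GROUP  BY b.title, b.isbn, b.author, b.publisher;
```

```pig
-- Pig: same result via JOIN then GROUP; schemas are assumed
books   = LOAD '/user/hue/books.csv' USING PigStorage(',')
          AS (isbn:chararray, title:chararray, author:chararray, publisher:chararray);
ratings = LOAD '/user/hue/ratings.csv' USING PigStorage(',')
          AS (user_id:chararray, isbn:chararray, rating:int);
joined  = JOIN books BY isbn, ratings BY isbn;
grp     = GROUP joined BY (books::isbn, books::title, books::author, books::publisher);
avgs    = FOREACH grp GENERATE FLATTEN(group),
                               AVG(joined.ratings::rating) AS avg_rating;
DUMP avgs;
```

The job counters printed at the end of each run (total time, number of jobs, maps, reducers) are the figures to record for the comparison.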
Task 4 (3 marks):
− In the Hortonworks shell, execute the second part of Task 2 using Hive and record
your time without enabling Tez. Then enable Tez, perform the same query,
and compare your results.
− Disable Tez and use vectorization for the second part of Task 2 according to
Tutorial 6. Record your time and report your findings.
− Provide the results of your experiments in table/s along with the screenshots.
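The engine and vectorization switches are session-level Hive settings, so the experiment can be run as a sequence like the following (the query repeats the second part of Task 2; table and column names remain assumptions):

```sql
-- Baseline: MapReduce engine, Tez disabled
SET hive.execution.engine=mr;
SELECT publisher, COUNT(*) AS num_books
FROM books GROUP BY publisher
ORDER BY num_books DESC LIMIT 1;

-- Enable Tez and rerun the same query, recording the new time
SET hive.execution.engine=tez;

-- Back to MapReduce with vectorization enabled
-- (vectorized execution generally requires an ORC-backed table)
SET hive.execution.engine=mr;
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
```

Recording the elapsed time Hive prints after each run gives the three data points for the comparison table.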
Task 5 (3 marks):
− Investigate how performing Task 2 and Task 3 can be further improved in
terms of response time. This can differ between Pig and Hive (and between the two
tasks). Include a reference list of the online sources, journal articles, or conference
papers that you read and/or used.
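As a starting point for this investigation, a few widely used Hive tuning knobs can be toggled per session and timed (these are suggestions, not settings prescribed by the brief):

```sql
-- Hive settings commonly tried when tuning response time
SET hive.exec.parallel=true;   -- run independent stages in parallel
SET hive.cbo.enable=true;      -- cost-based optimization
SET hive.map.aggr=true;        -- map-side partial aggregation
```

On the Pig side, `SET default_parallel N;` controls the number of reducers for GROUP and JOIN, and a small table can be joined with `JOIN big BY k, small BY k USING 'replicated';` to avoid the reduce phase entirely.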
Submission Requirements:
All the following files should be uploaded to Moodle as a zip file and use the following
naming convention: FIT5043-A2-[StudentID].zip. There is a mark deduction for any
missing document.
1. An Assessment Cover Sheet for the group.
2. A report that includes all the documentation mentioned for each task
(Hive SQL code, Pig scripts, tables, screenshots, etc.) in a Word document, in the
order of the Tasks. Use headings and subheadings.
With extensive experience in end-to-end big data solution implementation, I can ensure detailed capture of execution statistics based on the performance parameters mentioned in the job.
6 years of experience in Hadoop and Java development.
Experienced in writing MapReduce programs and Hive, Pig, Sqoop, HAWQ, and Oozie scripts.
Working knowledge of NoSQL databases such as MongoDB, Neo4j, and DynamoDB.
Worked with ETL tools such as Pentaho.
Extensive knowledge of Hive and Pig.