
Big Data Systems with Hadoop, Hive & Sqoop

  • Writer: Tejaswi Rupa Neelapu
  • Apr 21
  • 1 min read

Tags: Hadoop Streaming, MapReduce, Hive, Sqoop, TF-IDF, Log Processing

Link: GitHub


🔍 Overview

A multi-module project that uses Hadoop Streaming, Hive, and Sqoop to process airline, text, and log data. It implements custom MapReduce logic, Hive queries, and a TF-IDF pipeline to simulate big data workflows across several use cases.


❓ Key Questions Answered

  • Which airline has the worst on-time performance?

  • How do Shakespeare and Austen differ in vocabulary richness?

  • Can we build a TF-IDF search index using MapReduce and Hive?


🛠️ Tools & Technologies Used

Hadoop, Hive, Sqoop, MapReduce (Streaming), HiveQL, Shell Scripting


⚙️ Methods & Implementation

  • Word count and text cleaning using streaming MapReduce

  • Airline delay data imported with Sqoop and analyzed in Hive

  • Multi-stage MapReduce pipeline to compute TF-IDF, followed by Hive integration

  • Hadoop log processing to count severity levels per minute


📊 Results & Insights

  • Found that Spirit Airlines had the worst average arrival delay (~18.6 mins)

  • Shakespeare’s texts had a larger vocabulary richness ratio than Austen's

  • Identified 10,000+ unique terms and ranked them across 18 documents