Big Data Systems with Hadoop, Hive & Sqoop
- Tejaswi Rupa Neelapu
- Apr 21
- 1 min read
Tags: Hadoop Streaming, MapReduce, Hive, Sqoop, TF-IDF, Log Processing
Link: GitHub
🔍 Overview
A multi-module project using Hadoop Streaming, Hive, and Sqoop to process airline, text, and log data. Implemented custom MapReduce logic, Hive queries, and TF-IDF pipelines to simulate big data workflows across various use cases.
❓ Key Questions Answered
Which airline has the worst on-time performance?
How do Shakespeare and Austen differ in vocabulary richness?
Can we build a TF-IDF search index using MapReduce and Hive?
🛠️ Tools & Technologies Used
Hadoop, Hive, Sqoop, MapReduce (Streaming), HiveQL, Shell Scripting
⚙️ Methods & Implementation
Implemented word count with text cleaning using Hadoop Streaming MapReduce
Imported airline delay data with Sqoop and analyzed it in Hive
Built a multi-stage MapReduce pipeline to compute TF-IDF, then integrated the results with Hive
Processed Hadoop logs to count log-severity levels per minute
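The TF-IDF stages above can be sketched in plain Python. This is only an illustration of the math each MapReduce stage carries out (the project ran it as chained Hadoop Streaming jobs); the document names and tokens below are invented for the example:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: dict of doc_id -> list of tokens.
    Returns dict of (doc_id, term) -> TF-IDF score."""
    n_docs = len(docs)
    # Stage 1: term frequency per document (a per-document word count)
    tf = {(d, t): c / len(tokens)
          for d, tokens in docs.items()
          for t, c in Counter(tokens).items()}
    # Stage 2: document frequency per term (how many docs contain it)
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    # Stage 3: join TF with IDF to rank terms per document
    return {(d, t): f * math.log(n_docs / df[t]) for (d, t), f in tf.items()}

# Invented mini-corpus, just to exercise the three stages
docs = {"hamlet": ["to", "be", "or", "not", "to", "be"],
        "emma": ["it", "is", "a", "truth", "to", "be", "told"]}
scores = tf_idf(docs)
```

Terms that appear in every document (like "to" here) get an IDF of zero, while terms unique to one document score highest for it; the real pipeline applies the same joins at corpus scale.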
📊 Results & Insights
Found that Spirit Airlines had the worst average arrival delay (~18.6 minutes)
Shakespeare's texts showed a higher vocabulary-richness ratio than Austen's
Identified 10,000+ unique terms and ranked them across 18 documents
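The vocabulary-richness comparison boils down to a type-token ratio: unique tokens divided by total tokens. A minimal sketch (the phrases are invented examples, not the project's corpora):

```python
def richness(tokens):
    """Type-token ratio: unique tokens / total tokens."""
    return len(set(tokens)) / len(tokens)

# Invented sample phrases; the project computed this over full texts
sample_a = "the slings and arrows of outrageous fortune".split()       # no repeats
sample_b = "it is a truth universally acknowledged that it is".split() # repeats
```

More repetition drives the ratio down, so a higher ratio indicates a more varied vocabulary over the same span of text.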