Big Data Systems with Hadoop, Hive & Sqoop
- Tejaswi Rupa Neelapu
- Apr 21
- 1 min read
Tags: Hadoop Streaming, MapReduce, Hive, Sqoop, TF-IDF, Log Processing
Link: GitHub
🔍 Overview
A multi-module project using Hadoop Streaming, Hive, and Sqoop to process airline, text, and log data. Implemented custom MapReduce logic, Hive queries, and TF-IDF pipelines to simulate big data workflows across various use cases.
❓ Key Questions Answered
Which airline has the worst on-time performance?
How do Shakespeare and Austen differ in vocabulary richness?
Can we build a TF-IDF search index using MapReduce and Hive?
🛠️ Tools & Technologies Used
Hadoop, Hive, Sqoop, MapReduce (Streaming), HiveQL, Shell Scripting
⚙️ Methods & Implementation
Implemented word count with text cleaning using Hadoop Streaming MapReduce
Imported airline delay data with Sqoop and analyzed it in Hive
Built a multi-stage MapReduce pipeline to compute TF-IDF, then integrated the results with Hive
Processed Hadoop logs to count log-severity levels per minute
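The TF-IDF stages above can be sketched in plain Python. This is only an illustration of the math each MapReduce stage carries out (the project ran it as chained Hadoop Streaming jobs); the document names and tokens below are invented for the example:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: dict of doc_id -> list of tokens.
    Returns dict of (doc_id, term) -> TF-IDF score."""
    n_docs = len(docs)
    # Stage 1: term frequency per document (a per-document word count)
    tf = {(d, t): c / len(tokens)
          for d, tokens in docs.items()
          for t, c in Counter(tokens).items()}
    # Stage 2: document frequency per term (how many docs contain it)
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    # Stage 3: join TF with IDF to rank terms per document
    return {(d, t): f * math.log(n_docs / df[t]) for (d, t), f in tf.items()}

# Invented mini-corpus, just to exercise the three stages
docs = {"hamlet": ["to", "be", "or", "not", "to", "be"],
        "emma": ["it", "is", "a", "truth", "to", "be", "told"]}
scores = tf_idf(docs)
```

Terms that appear in every document (like "to" here) get an IDF of zero, while terms unique to one document score highest for it; the real pipeline applies the same joins at corpus scale.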
📊 Results & Insights
Found that Spirit Airlines had the worst average arrival delay (~18.6 minutes)
Shakespeare's texts showed a higher vocabulary-richness ratio than Austen's
Identified 10,000+ unique terms and ranked them across 18 documents
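The vocabulary-richness comparison boils down to a type-token ratio: unique tokens divided by total tokens. A minimal sketch (the phrases are invented examples, not the project's corpora):

```python
def richness(tokens):
    """Type-token ratio: unique tokens / total tokens."""
    return len(set(tokens)) / len(tokens)

# Invented sample phrases; the project computed this over full texts
sample_a = "the slings and arrows of outrageous fortune".split()       # no repeats
sample_b = "it is a truth universally acknowledged that it is".split() # repeats
```

More repetition drives the ratio down, so a higher ratio indicates a more varied vocabulary over the same span of text.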