Advanced Distributed Data Processing with PySpark
- Tejaswi Rupa Neelapu
- Apr 21
- 1 min read
Tags: Apache Spark, PySpark, TF-IDF, Structured Streaming, S3, Spark SQL
Link: GitHub
🔍 Overview
This project combines batch and streaming data processing in PySpark: it builds a TF-IDF document index, exposes a search-engine-style query interface over that index, and runs real-time log analytics for server monitoring, simulating an end-to-end scalable data workflow.
❓ Key Questions Answered
How can we compute document relevance across large text corpora efficiently?
How can we use Spark Structured Streaming to process logs in real time?
Can Spark serve as a unified framework for search, ranking, and monitoring?
🛠️ Tools & Technologies Used
PySpark, Spark SQL, Structured Streaming, AWS S3, Apache Livy, and Python's math.log10 for IDF weighting
⚙️ Methods & Implementation
Built a document-term TF-IDF index over an 18-document corpus (sketched in the first snippet below)
Implemented a search-query scoring function based on normalized TF-IDF relevance (second snippet below)
Used Spark Structured Streaming to simulate and monitor log data from two servers (third snippet below)
Generated in-memory volume reports and persisted SEV0 logs to S3
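For reference, here is a minimal sketch of the indexing step. The S3 path, column names, and the use of raw term counts with a base-10 log IDF are assumptions; the actual implementation may differ.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tfidf-index").getOrCreate()

# Hypothetical corpus location: one plain-text file per document.
docs = (
    spark.read.text("s3://example-bucket/corpus/*.txt")
    .withColumn("doc", F.input_file_name())
)

# Tokenize each line into lowercase terms -> one (doc, term) row per token.
terms = (
    docs.select("doc", F.explode(F.split(F.lower("value"), r"\W+")).alias("term"))
    .filter(F.col("term") != "")
)

# Term frequency: raw count of each term within each document.
tf = terms.groupBy("doc", "term").count().withColumnRenamed("count", "tf")

# Inverse document frequency with a base-10 log, matching math.log10 above.
n_docs = docs.select("doc").distinct().count()
df = terms.groupBy("term").agg(F.countDistinct("doc").alias("df"))

tfidf = (
    tf.join(df, "term")
    .withColumn("tfidf", F.col("tf") * F.log10(F.lit(n_docs) / F.col("df")))
)
tfidf.orderBy(F.desc("tfidf")).show(5)
```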
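The query-scoring side could look like the sketch below. Dividing each document's TF-IDF weights by the document's L2 norm is one plausible reading of "normalized TF-IDF relevance" and is an assumption here, as are the helper name and the top-5 default.

```python
# Normalize each document's TF-IDF weights by the document's L2 norm
# (an assumed normalization scheme, not necessarily the one used).
doc_norms = tfidf.groupBy("doc").agg(
    F.sqrt(F.sum(F.pow(F.col("tfidf"), F.lit(2)))).alias("norm")
)
normalized = tfidf.join(doc_norms, "doc").withColumn(
    "norm_tfidf", F.col("tfidf") / F.col("norm")
)

def relevance(query, index=normalized, top_n=5):
    """Rank documents by summed normalized TF-IDF over the query terms."""
    q_terms = [t for t in query.lower().split() if t]
    return (
        index.filter(F.col("term").isin(q_terms))
        .groupBy("doc")
        .agg(F.sum("norm_tfidf").alias("score"))
        .orderBy(F.desc("score"))
        .limit(top_n)
    )

relevance("white rabbit").show()
```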
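Finally, a sketch of the streaming monitor. The log-line format, the regexes, and the S3 paths are assumptions; the in-memory sink and the Parquet-to-S3 sink mirror the volume reports and persistent SEV0 logs described above.

```python
# Stream raw log lines; the directory and line format are assumptions.
raw = spark.readStream.text("s3://example-bucket/server-logs/")

logs = raw.select(
    F.regexp_extract("value", r"^(s\d+)", 1).alias("server"),    # e.g. s1, s2
    F.regexp_extract("value", r"(SEV\d)", 1).alias("severity"),  # e.g. SEV0, SEV2
)

# In-memory volume report, queryable via Spark SQL while the stream runs.
volume_query = (
    logs.groupBy("server", "severity").count()
    .writeStream.outputMode("complete")
    .format("memory")
    .queryName("volume_report")
    .start()
)

# Persist SEV0 events durably to S3 (append mode requires a checkpoint).
sev0_query = (
    logs.filter(F.col("severity") == "SEV0")
    .writeStream.outputMode("append")
    .format("parquet")
    .option("path", "s3://example-bucket/sev0-logs/")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/sev0/")
    .start()
)

spark.sql("SELECT * FROM volume_report ORDER BY `count` DESC").show()
```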
📊 Results & Insights
Top TF-IDF entry: ('carroll-alice', 'alice', 11355.78), i.e. the term 'alice' in the document carroll-alice
Log-volume tracking of SEV2 events showed server s2 had ~2.5x the activity of s1
The relevance function retrieved the top 5 documents for arbitrary search queries