Advanced Distributed Data Processing with PySpark
- Tejaswi Rupa Neelapu
- Apr 21
- 1 min read
Tags: Apache Spark, PySpark, TF-IDF, Structured Streaming, S3, Spark SQL
Link: GitHub
🔍 Overview
This project combines batch and streaming data processing in PySpark: it builds a TF-IDF document index, exposes a search-engine-style query interface over that index, and runs real-time log analytics for server monitoring, simulating an end-to-end scalable data workflow.
❓ Key Questions Answered
How can we compute document relevance across large text corpora efficiently?
How can we use Spark Structured Streaming to process logs in real time?
Can Spark serve as a unified framework for search, ranking, and monitoring?
🛠️ Tools & Technologies Used
PySpark, Spark SQL, Structured Streaming, AWS S3, Apache Livy, and Python's math.log10 for IDF weighting
⚙️ Methods & Implementation
Built a document-term TF-IDF index over an 18-document corpus (sketched in the first snippet below)
Implemented a search-query scoring function based on normalized TF-IDF relevance (second snippet below)
Used Spark Structured Streaming to simulate and monitor log data from two servers (third snippet below)
Generated in-memory volume reports and persisted SEV0 logs to S3
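For reference, here is a minimal sketch of the indexing step. The S3 path, column names, and the use of raw term counts with a base-10 log IDF are assumptions; the actual implementation may differ.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tfidf-index").getOrCreate()

# Hypothetical corpus location: one plain-text file per document.
docs = (
    spark.read.text("s3://example-bucket/corpus/*.txt")
    .withColumn("doc", F.input_file_name())
)

# Tokenize each line into lowercase terms -> one (doc, term) row per token.
terms = (
    docs.select("doc", F.explode(F.split(F.lower("value"), r"\W+")).alias("term"))
    .filter(F.col("term") != "")
)

# Term frequency: raw count of each term within each document.
tf = terms.groupBy("doc", "term").count().withColumnRenamed("count", "tf")

# Inverse document frequency with a base-10 log, matching math.log10 above.
n_docs = docs.select("doc").distinct().count()
df = terms.groupBy("term").agg(F.countDistinct("doc").alias("df"))

tfidf = (
    tf.join(df, "term")
    .withColumn("tfidf", F.col("tf") * F.log10(F.lit(n_docs) / F.col("df")))
)
tfidf.orderBy(F.desc("tfidf")).show(5)
```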
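The query-scoring side could look like the sketch below. Dividing each document's TF-IDF weights by the document's L2 norm is one plausible reading of "normalized TF-IDF relevance" and is an assumption here, as are the helper name and the top-5 default.

```python
# Normalize each document's TF-IDF weights by the document's L2 norm
# (an assumed normalization scheme, not necessarily the one used).
doc_norms = tfidf.groupBy("doc").agg(
    F.sqrt(F.sum(F.pow(F.col("tfidf"), F.lit(2)))).alias("norm")
)
normalized = tfidf.join(doc_norms, "doc").withColumn(
    "norm_tfidf", F.col("tfidf") / F.col("norm")
)

def relevance(query, index=normalized, top_n=5):
    """Rank documents by summed normalized TF-IDF over the query terms."""
    q_terms = [t for t in query.lower().split() if t]
    return (
        index.filter(F.col("term").isin(q_terms))
        .groupBy("doc")
        .agg(F.sum("norm_tfidf").alias("score"))
        .orderBy(F.desc("score"))
        .limit(top_n)
    )

relevance("white rabbit").show()
```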
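Finally, a sketch of the streaming monitor. The log-line format, the regexes, and the S3 paths are assumptions; the in-memory sink and the Parquet-to-S3 sink mirror the volume reports and persistent SEV0 logs described above.

```python
# Stream raw log lines; the directory and line format are assumptions.
raw = spark.readStream.text("s3://example-bucket/server-logs/")

logs = raw.select(
    F.regexp_extract("value", r"^(s\d+)", 1).alias("server"),    # e.g. s1, s2
    F.regexp_extract("value", r"(SEV\d)", 1).alias("severity"),  # e.g. SEV0, SEV2
)

# In-memory volume report, queryable via Spark SQL while the stream runs.
volume_query = (
    logs.groupBy("server", "severity").count()
    .writeStream.outputMode("complete")
    .format("memory")
    .queryName("volume_report")
    .start()
)

# Persist SEV0 events durably to S3 (append mode requires a checkpoint).
sev0_query = (
    logs.filter(F.col("severity") == "SEV0")
    .writeStream.outputMode("append")
    .format("parquet")
    .option("path", "s3://example-bucket/sev0-logs/")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/sev0/")
    .start()
)

spark.sql("SELECT * FROM volume_report ORDER BY `count` DESC").show()
```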
📊 Results & Insights
Top TF-IDF entry: ('carroll-alice', 'alice', 11355.78), i.e. the term 'alice' in the document carroll-alice
Log-volume tracking of SEV2 events showed server s2 had ~2.5x the activity of s1
The relevance function retrieved the top 5 documents for arbitrary search queries