Advanced Distributed Data Processing with PySpark
- Tejaswi Rupa Neelapu
- Apr 21, 2025
- 1 min read
Tags: Apache Spark, PySpark, TF-IDF, Structured Streaming, S3, Spark SQL
Link: GitHub
🔍 Overview
This project combines batch and streaming data processing in PySpark to build a TF-IDF document index, serve a search-engine-style query interface over it, and run real-time log analytics for server monitoring, simulating an end-to-end scalable data workflow.
❓ Key Questions Answered
How can we compute document relevance across large text corpora efficiently?
How can we use Spark Structured Streaming to process logs in real-time?
Can Spark serve as a unified framework for search, ranking, and monitoring?
🛠️ Tools & Technologies Used
PySpark, Spark SQL, Structured Streaming, AWS S3, Apache Livy, Python's math.log10
⚙️ Methods & Implementation
Built a document-term TF-IDF index from 18 documents
Implemented a search query scoring system based on normalized TF-IDF relevance
Used Spark Structured Streaming to simulate and monitor log data from two servers
Generated in-memory volume reports and persistent SEV0 logs in S3
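The TF-IDF weighting and query scoring behind the index can be sketched in plain Python (the corpus, function names, and tokenization here are illustrative only; in the project this runs as PySpark transformations over the 18-document corpus):

```python
import math
from collections import Counter

def tf_idf_index(docs):
    """Build a {(doc_id, term): tf-idf} index.

    TF is the raw term count within a document; IDF is
    log10(N / document frequency), mirroring the math.log10
    weighting listed in the tools above.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    index = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        for term, count in tf.items():
            index[(doc_id, term)] = count * math.log10(n_docs / df[term])
    return index

def top_docs(index, query_terms, k=5):
    """Rank documents by summed TF-IDF of the query terms, top-k."""
    scores = Counter()
    for (doc_id, term), weight in index.items():
        if term in query_terms:
            scores[doc_id] += weight
    return scores.most_common(k)
```

In Spark the same two steps become a term-count aggregation joined against a document-frequency aggregation, with the query scoring expressed as a filtered group-by.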
📊 Results & Insights
Top TF-IDF pair: term 'alice' in document 'carroll-alice', with a score of 11355.78
Log volume tracking for SEV2 events showed server s2 had ~2.5x the activity of s1
Relevance function successfully retrieved the top 5 documents for any search query
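The per-severity volume report the streaming query maintains reduces to a grouped count plus a severity filter. A minimal plain-Python illustration of that aggregation logic (the log-line format and field names are assumptions, not taken from the project; the real pipeline does this with Structured Streaming's groupBy/count and writes SEV0 records to S3):

```python
from collections import Counter

def severity_volumes(log_lines):
    """Aggregate log volume per (server, severity) and collect SEV0 lines.

    Mimics a streaming groupBy('server', 'severity').count().
    Assumed line format: '<server> <severity> <message>'.
    """
    volumes = Counter()
    sev0 = []  # in the real pipeline these would be persisted to S3
    for line in log_lines:
        server, severity, _message = line.split(" ", 2)
        volumes[(server, severity)] += 1
        if severity == "SEV0":
            sev0.append(line)
    return volumes, sev0
```

Comparing the (server, 'SEV2') counts from this aggregation is what surfaces imbalances like the ~2.5x s2-vs-s1 activity gap reported above.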