Advanced Distributed Data Processing with PySpark

  • Writer: Tejaswi Rupa Neelapu
  • Apr 21
  • 1 min read

Tags: Apache Spark, PySpark, TF-IDF, Structured Streaming, S3, Spark SQL

Link: GitHub


🔍 Overview

This project combines batch and streaming data processing in PySpark: it builds a TF-IDF index over a text corpus, serves search-engine-style relevance queries against that index, and runs real-time log analytics for server monitoring — simulating an end-to-end scalable data workflow.


❓ Key Questions Answered

  • How can we compute document relevance across large text corpora efficiently?

  • How can we use Spark Structured Streaming to process logs in real-time?

  • Can Spark serve as a unified framework for search, ranking, and monitoring?


🛠️ Tools & Technologies Used

PySpark, Spark SQL, Structured Streaming, AWS S3, Livy, Python's math.log10


⚙️ Methods & Implementation

  • Built a document-term TF-IDF index over a corpus of 18 documents (first sketch below)

  • Implemented a search-query scoring function that ranks documents by normalized TF-IDF relevance (second sketch)

  • Used Spark Structured Streaming to simulate and monitor log data from two servers (streaming sketch)

  • Generated in-memory volume reports and persisted SEV0 logs to S3
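
A minimal sketch of the indexing step, using plain RDD transformations and math.log10 for the IDF term. The S3 prefix and the use of wholeTextFiles are assumptions — the post doesn't show where or how the 18 documents are loaded:

```python
import math
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tfidf-index").getOrCreate()
sc = spark.sparkContext

# Hypothetical corpus location; the project's 18 documents are assumed here.
docs = sc.wholeTextFiles("s3://your-bucket/corpus/")   # [(doc_path, text)]
N = docs.count()                                       # corpus size (18)

# Term frequency: one ((doc, term), 1) pair per occurrence, summed per key
tf = (docs.flatMap(lambda kv: (((kv[0], w.lower()), 1) for w in kv[1].split()))
          .reduceByKey(lambda a, b: a + b))            # ((doc, term), tf)

# Document frequency: tf has one record per (doc, term), so count docs per term
df = (tf.map(lambda kv: (kv[0][1], 1))
        .reduceByKey(lambda a, b: a + b))              # (term, df)

# TF-IDF weight: tf * log10(N / df)
tfidf = (tf.map(lambda kv: (kv[0][1], (kv[0][0], kv[1])))   # (term, (doc, tf))
           .join(df)                                        # (term, ((doc, tf), df))
           .map(lambda kv: ((kv[1][0][0], kv[0]),
                            kv[1][0][1] * math.log10(N / kv[1][1]))))

# Highest-weighted (doc, term) pairs
print(tfidf.takeOrdered(3, key=lambda kv: -kv[1]))
```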

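The query scorer can then reuse that index: filter it down to the query's terms and aggregate weights per document. The post's exact normalization isn't shown, so a plain summed weight stands in for it here:

```python
def search(query, tfidf, top_n=5):
    """Rank documents for a query against the ((doc, term), weight) index."""
    terms = set(query.lower().split())
    return (tfidf
            .filter(lambda kv: kv[0][1] in terms)        # keep only query terms
            .map(lambda kv: (kv[0][0], kv[1]))           # (doc, weight)
            .reduceByKey(lambda a, b: a + b)             # sum weights per doc
            .takeOrdered(top_n, key=lambda kv: -kv[1]))  # top 5 documents

print(search("alice rabbit", tfidf))
```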

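For the monitoring half, a sketch of the Structured Streaming pipeline — the CSV schema, S3 paths, and severity labels are assumptions, since the post doesn't show the log format. It writes per-server volume counts to an in-memory table and appends SEV0 events to S3 as Parquet:

```python
from pyspark.sql import functions as F

# Assumed layout: "timestamp,server,severity,message" CSV lines landing in S3
logs = (spark.readStream
             .schema("ts TIMESTAMP, server STRING, severity STRING, message STRING")
             .csv("s3://your-bucket/incoming-logs/"))

# In-memory volume report: event counts per server and severity
volume = (logs.groupBy("server", "severity").count()
              .writeStream
              .outputMode("complete")
              .format("memory")
              .queryName("log_volume")
              .start())

# Persist critical SEV0 events to S3 (Parquet sinks require a checkpoint)
sev0 = (logs.filter(F.col("severity") == "SEV0")
            .writeStream
            .outputMode("append")
            .format("parquet")
            .option("path", "s3://your-bucket/sev0-logs/")
            .option("checkpointLocation", "s3://your-bucket/checkpoints/sev0/")
            .start())

# Once a few micro-batches have run, compare SEV2 activity across servers
spark.sql("SELECT * FROM log_volume WHERE severity = 'SEV2'").show()
```
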
📊 Results & Insights

  • Top TF-IDF weight: the term 'alice' in the document 'carroll-alice', with a score of 11355.78

  • Log-volume tracking for SEV2 events showed server s2 had ~2.5x the activity of s1

  • The relevance function retrieved the top 5 ranked documents for any search query