Project overview
A system that transforms unstructured Reddit discussions into coherent, position-by-position matchup narratives for college basketball games.
- Scrapes Reddit comments from team-specific subreddits using PRAW (the Python Reddit API Wrapper)
- Stores structured data in PostgreSQL with tables for users, posts, comments, teams, and games
- Uses a ChromaDB vector database with SentenceTransformer embeddings for semantic search
- Generates matchup analyses with a DeepSeek LLM, grounded in fan sentiment
Data gathering process
Automated scraping pipeline that collects fan commentary from Reddit's CollegeBasketball subreddit.
PRAW
Reddit API
Web Scraping
Data Collection
The scraper parses game thread index posts to identify relevant discussions, then recursively collects comments, user flairs, and metadata. It handles rate limiting and stores data in structured formats for downstream processing.
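The recursive collection step can be sketched without touching the network. The `Comment` dataclass, its field names, and the `walk_comments` helper below are hypothetical stand-ins rather than the project's actual code; they only illustrate how a nested thread might be flattened into records that keep the author's flair and the parent link.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Comment:
    """Hypothetical stand-in for a scraped comment; field names are assumptions."""
    id: str
    author: str
    flair: Optional[str]
    body: str
    replies: List["Comment"] = field(default_factory=list)


def walk_comments(comments: List[Comment], parent_id: Optional[str] = None,
                  depth: int = 0) -> Iterator[dict]:
    """Yield one flat record per comment, depth-first, preserving thread links."""
    for c in comments:
        yield {
            "id": c.id,
            "parent_id": parent_id,  # None marks a top-level comment
            "depth": depth,
            "author": c.author,
            "flair": c.flair,
            "body": c.body,
        }
        # Recurse into the replies so the whole subtree is collected
        yield from walk_comments(c.replies, parent_id=c.id, depth=depth + 1)


# Toy thread with made-up users and flairs
thread = [
    Comment("c1", "fan_a", "Team A", "Our frontcourt should control the glass.", [
        Comment("c2", "fan_b", "Team B", "Not if the guards push the pace.", [
            Comment("c3", "fan_a", "Team A", "Fair, transition defense worries me."),
        ]),
    ]),
    Comment("c4", "fan_c", None, "Bench depth decides this one."),
]

rows = list(walk_comments(thread))
for row in rows:
    print("  " * row["depth"] + f"{row['author']} ({row['flair']}): {row['body']}")
```

With PRAW, a similar walk could presumably run over a submission's comment forest once `replace_more()` has expanded the collapsed branches. Flat records with a `parent_id` column are one way to keep threading available for the relational schema.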
Database design
Relational schema optimized for querying fan discussions by team, game, and user context.
PostgreSQL
SQLAlchemy
Data Modeling
Upsert Operations
Core tables include users (with flair affiliations), posts, comments (with threading), teams, and games. Foreign key relationships enable efficient joins for retrieving team-specific commentary, while upsert logic resolves conflicts when a record that already exists is written again.
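A minimal sketch of the upsert pattern follows. It uses Python's built-in `sqlite3` as a stand-in so it runs anywhere; PostgreSQL accepts the same `INSERT ... ON CONFLICT ... DO UPDATE` clause, though the placeholder style depends on the driver. The column names and the two-table subset are assumptions for illustration, not the project's actual schema.

```python
import sqlite3

# Assumed two-table subset of the schema; column names are illustrative only.
SCHEMA = """
CREATE TABLE users (
    username TEXT PRIMARY KEY,
    flair    TEXT                              -- team affiliation
);
CREATE TABLE comments (
    id        TEXT PRIMARY KEY,
    author    TEXT NOT NULL REFERENCES users(username),
    parent_id TEXT REFERENCES comments(id),    -- NULL for top-level comments
    body      TEXT NOT NULL,
    score     INTEGER NOT NULL DEFAULT 0
);
"""


def upsert_user(conn, username, flair):
    """Insert a user, or refresh the flair if the user already exists."""
    conn.execute(
        "INSERT INTO users (username, flair) VALUES (?, ?) "
        "ON CONFLICT (username) DO UPDATE SET flair = excluded.flair",
        (username, flair),
    )


def upsert_comment(conn, comment_id, author, parent_id, body, score):
    """Insert a comment, or update body and score if the id is already stored."""
    conn.execute(
        "INSERT INTO comments (id, author, parent_id, body, score) "
        "VALUES (?, ?, ?, ?, ?) "
        "ON CONFLICT (id) DO UPDATE SET body = excluded.body, score = excluded.score",
        (comment_id, author, parent_id, body, score),
    )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

upsert_user(conn, "fan_a", "Team A")
upsert_comment(conn, "c1", "fan_a", None, "Their center is a mismatch.", 3)

# Writing the same keys again updates the rows instead of failing or duplicating
upsert_comment(conn, "c1", "fan_a", None, "Their center is a mismatch.", 12)
upsert_user(conn, "fan_a", "Team B")

rows = conn.execute(
    "SELECT c.id, c.score, u.flair "
    "FROM comments c JOIN users u ON u.username = c.author"
).fetchall()
print(rows)
```

Keying the upsert on Reddit's own identifiers keeps repeated scraper runs idempotent: a second pass over the same thread refreshes scores and flairs without creating duplicate rows.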
RAG pipeline
Retrieval-augmented generation system that synthesizes fan insights into structured matchup reports.
ChromaDB
SentenceTransformers
DeepSeek LLM
Vector Search
Comments are embedded using the GTR-T5-XL model and stored in ChromaDB. For matchup queries, relevant fan discussions are retrieved via semantic similarity, then passed to a DeepSeek model with structured prompts to generate position-by-position analyses grounded in user sentiment.
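The retrieve-then-prompt flow can be illustrated with a dependency-free sketch. Here hand-written three-dimensional vectors stand in for GTR-T5-XL embeddings, a plain list stands in for the ChromaDB collection, and the LLM call is omitted entirely. The `retrieve` and `build_prompt` helpers, the record fields, and the prompt wording are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec: Sequence[float], store: List[Dict], k: int = 2) -> List[Dict]:
    """Return the k stored comments most similar to the query vector."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item["embedding"]),
                    reverse=True)
    return ranked[:k]


def build_prompt(team_a: str, team_b: str, position: str, comments: List[Dict]) -> str:
    """Assemble a structured prompt that grounds the model in retrieved comments."""
    evidence = "\n".join(f"- [{c['flair']}] {c['text']}" for c in comments)
    return (
        f"Compare {team_a} and {team_b} at the {position} position.\n"
        "Base the analysis only on the fan comments below.\n"
        f"{evidence}"
    )


# Toy store: made-up comments with hand-written vectors instead of real embeddings
store = [
    {"id": "a", "flair": "Team A", "text": "Our point guard rarely turns it over.",
     "embedding": [0.9, 0.1, 0.0]},
    {"id": "b", "flair": "Team B", "text": "Parking at the arena is terrible.",
     "embedding": [0.0, 1.0, 0.0]},
    {"id": "c", "flair": "Team B", "text": "Our guards struggle against ball pressure.",
     "embedding": [0.7, 0.7, 0.0]},
]

query_vec = [1.0, 0.0, 0.0]  # toy embedding of a "point guard matchup" query
top = retrieve(query_vec, store, k=2)
prompt = build_prompt("Team A", "Team B", "point guard", top)
print(prompt)
```

In the pipeline described above, the vector store would handle the similarity ranking and the resulting prompt would be sent to the DeepSeek model. Restricting the prompt to retrieved comments is what keeps the generated analysis tied to what fans actually wrote.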
Results
Delivered qualitative scouting reports that capture fan-driven narratives and complement traditional statistics.
Condensed complex fan discussions into position-by-position matchup breakdowns, highlighting the key battles and concerns that shape pre-game expectations.