Crowdsourced Sports Analysis

Generating college basketball matchup analyses by synthesizing Reddit fan commentary from team-specific subreddits, creating structured breakdowns grounded in fan perceptions.


Image: Duke and UNC athletics logos on opposite sides of the Reddit icon, with basketball diagrams and charts underneath.

Project overview

A system that transforms unstructured Reddit discussions into coherent, position-by-position matchup narratives for college basketball games.

Scrapes Reddit comments from team-specific subreddits using PRAW (the Python Reddit API Wrapper)
Stores structured data in PostgreSQL, with tables for users, posts, comments, teams, and games
Uses a ChromaDB vector database with SentenceTransformer embeddings for semantic search
Generates matchup analyses with a DeepSeek LLM, grounded in fan sentiment

Data gathering process

Automated scraping pipeline that collects fan commentary from Reddit's CollegeBasketball subreddit.

PRAW, Reddit API, Web Scraping, Data Collection

The scraper parses game thread index posts to identify relevant discussions, then recursively collects comments, user flairs, and metadata. It handles rate limiting and stores data in structured formats for downstream processing.
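The two steps described above (finding threads from an index post, then walking each comment tree) can be sketched as follows. This is a minimal illustration rather than the project's actual scraper: the names `thread_ids_from_index`, `walk_comments`, and `CommentRow` are hypothetical, and the code is duck-typed so it runs without network access. It reads only attributes that PRAW comment objects expose (`id`, `parent_id`, `author`, `author_flair_text`, `body`, `replies`); with PRAW, one would typically call `submission.comments.replace_more(limit=None)` first so that the tree contains only full comments.

```python
import re
from dataclasses import dataclass
from typing import Any, Iterable, Iterator, List, Optional

# Matches the base36 submission id in a Reddit thread URL such as
# https://www.reddit.com/r/CollegeBasketball/comments/abc123/some_title/
THREAD_ID_RE = re.compile(r"/comments/([a-z0-9]+)")


def thread_ids_from_index(index_body: str) -> List[str]:
    """Return the unique submission ids linked from an index post, in order."""
    seen: List[str] = []
    for thread_id in THREAD_ID_RE.findall(index_body):
        if thread_id not in seen:
            seen.append(thread_id)
    return seen


@dataclass
class CommentRow:
    """Hypothetical flat record for one comment, ready for storage."""
    comment_id: str
    parent_id: Optional[str]
    author: Optional[str]   # None when the account was deleted
    flair: Optional[str]    # team affiliation shown next to the username
    body: str
    depth: int              # 0 for top-level comments


def walk_comments(comments: Iterable[Any], depth: int = 0) -> Iterator[CommentRow]:
    """Recursively yield every comment in a tree, parents before replies."""
    for c in comments:
        if not hasattr(c, "body"):
            # Skip placeholders that are not full comments.
            continue
        author = getattr(c, "author", None)
        yield CommentRow(
            comment_id=c.id,
            parent_id=getattr(c, "parent_id", None),
            author=str(author) if author is not None else None,
            flair=getattr(c, "author_flair_text", None),
            body=c.body,
            depth=depth,
        )
        yield from walk_comments(getattr(c, "replies", ()), depth + 1)
```

Keeping the walk independent of the Reddit client makes it easy to test with stub objects, and the `depth` and `parent_id` fields preserve the threading that the database stores later.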

Database design

Relational schema optimized for querying fan discussions by team, game, and user context.

PostgreSQL, SQLAlchemy, Data Modeling, Upsert Operations

Core tables include users (with flair affiliations), posts, comments (with threading), teams, and games. Foreign key relationships enable efficient joins for retrieving team-specific commentary, and upsert logic resolves conflicts when a record being written already exists.
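The upsert-and-join pattern described above can be sketched with a cut-down schema. This is an illustration under stated assumptions, not the project's schema: the column names are hypothetical, only two of the five tables are shown, and the standard-library `sqlite3` module stands in for PostgreSQL and SQLAlchemy so the example runs without a database server. The `INSERT ... ON CONFLICT ... DO UPDATE` clause is also accepted by PostgreSQL, although the placeholder style depends on the driver.

```python
import sqlite3

# Hypothetical, reduced schema: users carry a flair, comments point at users.
SCHEMA = """
CREATE TABLE users (
    username TEXT PRIMARY KEY,
    flair    TEXT
);
CREATE TABLE comments (
    id        TEXT PRIMARY KEY,
    author    TEXT REFERENCES users(username),
    parent_id TEXT,
    body      TEXT NOT NULL,
    score     INTEGER NOT NULL DEFAULT 0
);
"""


def open_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with foreign keys enforced."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def upsert_user(conn, username, flair):
    """Insert a user, or refresh the flair if the user already exists."""
    conn.execute(
        """
        INSERT INTO users (username, flair) VALUES (?, ?)
        ON CONFLICT (username) DO UPDATE SET flair = excluded.flair
        """,
        (username, flair),
    )


def upsert_comment(conn, comment_id, author, parent_id, body, score):
    """Insert a comment, or update body and score when it is re-scraped."""
    conn.execute(
        """
        INSERT INTO comments (id, author, parent_id, body, score)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT (id) DO UPDATE
            SET body = excluded.body, score = excluded.score
        """,
        (comment_id, author, parent_id, body, score),
    )


def comments_by_flair(conn, flair):
    """Join comments to users to fetch commentary from one fan base."""
    return conn.execute(
        """
        SELECT c.id, c.body, c.score
        FROM comments AS c
        JOIN users AS u ON u.username = c.author
        WHERE u.flair = ?
        ORDER BY c.id
        """,
        (flair,),
    ).fetchall()
```

Because the conflict target is the primary key, scraping the same thread twice updates the existing rows instead of raising an integrity error or creating duplicates.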

RAG pipeline

Retrieval-augmented generation system that synthesizes fan insights into structured matchup reports.

ChromaDB, SentenceTransformers, DeepSeek LLM, Vector Search

Comments are embedded with the GTR-T5-XL model and stored in ChromaDB. For each matchup query, the most relevant fan discussions are retrieved by semantic similarity and passed to the DeepSeek model with structured prompts, which generates position-by-position analyses grounded in user sentiment.
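The retrieve-then-prompt flow can be sketched in a few lines. This is a simplified stand-in, not the project's pipeline: a toy bag-of-words vector replaces the GTR-T5-XL embeddings, a brute-force cosine ranking replaces the ChromaDB query, and the prompt wording and the names `embed`, `retrieve`, and `build_prompt` are hypothetical. The shape of the process is the same: embed the query, rank stored comments by similarity, and place the top hits in a structured prompt.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Dict[str, float]:
    """Toy stand-in embedding: L2-normalised bag-of-words counts."""
    counts = Counter(re.findall(r"[a-z0-9']+", text.lower()))
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {tok: v / norm for tok, v in counts.items()} if norm else {}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two already-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def retrieve(query: str, comments: List[str], k: int = 2) -> List[str]:
    """Return up to k comments most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in comments]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:k] if score > 0]


def build_prompt(team_a: str, team_b: str, position: str, hits: List[str]) -> str:
    """Assemble a structured prompt that cites the retrieved fan comments."""
    quoted = "\n".join(f"- {c}" for c in hits)
    return (
        f"Matchup: {team_a} vs {team_b}\n"
        f"Position: {position}\n"
        f"Fan comments:\n{quoted}\n"
        "Write a short analysis of this position battle using only the "
        "fan comments above."
    )
```

A usage example: `build_prompt("Duke", "UNC", "point guard", retrieve("point guard matchup", comments))` yields one prompt per position, and repeating it across positions gives the position-by-position structure. Restricting the model to the quoted comments is one way to keep the generated text grounded in what fans actually said.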

Results

Delivered qualitative scouting reports that capture fan-driven narratives, providing unique insights beyond traditional statistics.

Synthesized sprawling fan discussions into matchup breakdowns that highlight the key battles and concerns shaping pre-game expectations.