
Reddit Data Scraping Project

Overview

This project demonstrates how to scrape data from Reddit using Python's PRAW (Python Reddit API Wrapper) library. The script allows you to extract valuable information from subreddit posts, making it easy to analyze Reddit content programmatically.

Features

  • Scrape top posts from any subreddit
  • Extract key post information:
    • Post title
    • Score
    • URL
    • Number of comments
    • Post body text
    • Creation date
    • Author information
    • Post ID

Prerequisites

Requirements

  • Python 3.7+
  • PRAW library
  • pandas
  • python-dotenv

Reddit API Credentials

To use this script, you'll need to:

  1. Create a Reddit Account
  2. Set up a Reddit Developer Application
    • Go to https://www.reddit.com/prefs/apps
    • Click "Create App" or "Create Another App"
    • Choose "script" as the application type
    • Fill in the necessary details
    • Note down the following credentials:
      • Client ID
      • Client Secret
      • User Agent

Installation

  1. Clone the repository:
git clone https://github.com/koolgax99/reddit-scrapping-praw.git
cd reddit-scrapping-praw
  2. Create a virtual environment (optional but recommended):
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  3. Install the required packages:
pip install praw pandas python-dotenv

Configuration

Create a .env file in the project root with your Reddit API credentials:

REDDIT_CLIENT_ID=your_client_id
REDDIT_CLIENT_SECRET=your_client_secret
REDDIT_USER_AGENT=your_user_agent
REDDIT_USERNAME=your_reddit_username
REDDIT_PASSWORD=your_reddit_password

⚠️ Important Security Note:

  • Never share your .env file publicly
  • Add .env to your .gitignore
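A sketch of turning these variables into a PRAW client, assuming python-dotenv has already populated `os.environ` (the helper name `load_reddit_credentials` is illustrative); the dict keys match the keyword arguments `praw.Reddit` accepts:

```python
import os

def load_reddit_credentials():
    """Read Reddit API credentials from the environment into praw.Reddit kwargs."""
    return {
        "client_id": os.environ["REDDIT_CLIENT_ID"],
        "client_secret": os.environ["REDDIT_CLIENT_SECRET"],
        "user_agent": os.environ["REDDIT_USER_AGENT"],
        "username": os.environ["REDDIT_USERNAME"],
        "password": os.environ["REDDIT_PASSWORD"],
    }

# Typical use, after load_dotenv() has run:
#   reddit = praw.Reddit(**load_reddit_credentials())
```

Using `os.environ[...]` rather than `.get(...)` makes a missing variable fail immediately with a `KeyError` naming the variable, instead of surfacing later as an opaque authentication error.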

Usage

Basic Scraping

from reddit_scraper import RedditScraper

# Initialize scraper
scraper = RedditScraper()

# Scrape top posts from a subreddit
datascience_posts = scraper.scrape_subreddit(
    subreddit_name='datascience',
    sort_by='top',
    time_filter='all',
    limit=20
)

# Save scraped data
scraper.save_to_file(datascience_posts)

Advanced Usage

# Scrape multiple subreddits
multi_subreddit_data = scraper.scrape_multiple_subreddits(
    ['datascience', 'MachineLearning', 'learnpython'],
    limit=30
)

Customization

  • Change sort_by: 'top', 'hot', 'new'
  • Modify time_filter: 'all', 'year', 'month', 'week', 'day'
  • Adjust limit to control number of posts

Ethical Considerations

  • Respect Reddit's API Terms of Service
  • Be mindful of rate limits
  • Use scraping responsibly

Troubleshooting

  • Ensure all environment variables are correctly set
  • Check your internet connection
  • Verify Reddit API credentials
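For the first point, a quick standard-library check before touching the network can save a debugging session (`missing_credentials` is an illustrative helper, not part of the project):

```python
import os

REQUIRED_VARS = (
    "REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT",
    "REDDIT_USERNAME", "REDDIT_PASSWORD",
)

def missing_credentials():
    """Return the names of required environment variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not os.environ.get(name)]

# If this returns any names, fix your .env before investigating API errors.
```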

Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.

Disclaimer

This project is for educational purposes. Always respect Reddit's terms of service and API usage guidelines.

Contact

Your Name - [Your Email or LinkedIn]

Project Link: https://github.com/koolgax99/reddit-scrapping-praw
