Image Scraper

This is a simple image scraper built with Django, using beautifulsoup for scraping. It has a simple frontend for viewing the scraped images and downloading them.

Dependencies

See the Pipfile for details.

  • beautifulsoup4
  • lxml
  • requests
  • cssutils
  • pillow
  • Python 3.7

Setup

Clone the repository

$ git clone https://github.com/rachhek/imagescraper.git
$ cd imagescraper

Create a virtual environment and install the dependencies

$ pipenv shell
$ (imagescraper) pipenv install

Once pipenv has finished installing, run the Django migrations.

$ (imagescraper) python manage.py migrate

Run the server

$ (imagescraper) python manage.py runserver

Open the application at http://127.0.0.1:8000/

Walkthrough

Homepage
alt Screenshot 1

Example of scraping the homepage of http://unity.com
alt Screenshot 2

The URLs and images can be downloaded
alt Screenshot 4

The downloaded images and the txt file of URLs are stored on disk in

<path_to_project>/imagescraper/media/

alt Screenshot 3

Code

Scraper Tool

scraper_app/lib.py

Gallery

scraper_app/templates/scraper_app/scraper/index.html

Limitations

  • Cannot download images that are embedded as base64
  • Only scrapes "img" tags and "background-url" styles
  • Does not automatically scroll pages
  • Might not properly scrape images from highly dynamic websites
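
To make the first two limitations concrete, here is a minimal sketch of this kind of extraction. It is not the code in scraper_app/lib.py: it uses Python's standard html.parser instead of beautifulsoup so that it runs without dependencies, and the names ImageURLCollector and extract_image_urls are hypothetical. It collects URLs from "img" tags and from url(...) values in inline background styles, and skips base64 data URIs.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Matches url(...) inside an inline style, with optional quotes.
CSS_URL_RE = re.compile(r"url\(\s*['\"]?([^'\")]+?)['\"]?\s*\)", re.IGNORECASE)


class ImageURLCollector(HTMLParser):
    """Hypothetical helper: gathers image URLs from <img src> and inline styles."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        candidates = []
        if tag == "img" and attrs.get("src"):
            candidates.append(attrs["src"])
        style = attrs.get("style") or ""
        if "background" in style.lower():
            candidates.extend(CSS_URL_RE.findall(style))
        for candidate in candidates:
            candidate = candidate.strip()
            # base64 "data:" URIs are skipped, mirroring the first limitation.
            if candidate.lower().startswith("data:"):
                continue
            # Resolve relative paths against the page URL.
            absolute = urljoin(self.base_url, candidate)
            if absolute not in self.urls:
                self.urls.append(absolute)


def extract_image_urls(html, base_url):
    """Return de-duplicated absolute image URLs in document order."""
    parser = ImageURLCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.urls
```

Because only the HTML passed in is inspected, images that a page loads later through JavaScript or on scroll never appear in the result, which is the reason behind the last two limitations.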

Logs

The logs are stored in /imgscraper/debug.log
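
For reference, Django's LOGGING setting uses the standard logging.config dictConfig format, so a file log like this one is typically configured along the following lines. This is a sketch under assumptions, not the project's actual settings: the helper name build_logging_config, the formatter and the DEBUG level are all hypothetical.

```python
import logging
import logging.config


def build_logging_config(log_path):
    """Return a dictConfig-style mapping that writes DEBUG and above to log_path."""
    return {
        "version": 1,
        "disable_existing_loggers": False,
        "formatters": {
            "simple": {"format": "%(asctime)s %(levelname)s %(name)s: %(message)s"},
        },
        "handlers": {
            "file": {
                "class": "logging.FileHandler",
                "filename": log_path,
                "level": "DEBUG",
                "formatter": "simple",
                # Do not create the file until the first record is written.
                "delay": True,
            },
        },
        "root": {"handlers": ["file"], "level": "DEBUG"},
    }


# In a Django settings module the result would be assigned to LOGGING, e.g.
# LOGGING = build_logging_config("debug.log")  # path is an assumption
```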