This is homework III of the NCKU course WEB RESOURCE DISCOVERY AND EXPLOITATION. The target is to build a crawler application that crawls millions of webpages.
Image source: Medium Article
- Crawl millions of webpages
- Remove non-HTML pages (a filtering sketch follows this list)
- Performance optimization
  - How many pages can be crawled per hour
  - Total time to crawl millions of pages
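One way to remove non-HTML pages is to drop responses by their Content-Type in a downloader middleware. The sketch below is a hedged example of that idea, not necessarily how this repo implements it; the class name `HtmlOnlyMiddleware` is hypothetical, and it would still need to be registered in `DOWNLOADER_MIDDLEWARES`:

```python
# middlewares.py -- hypothetical sketch, not the repo's actual filter
from scrapy.exceptions import IgnoreRequest

class HtmlOnlyMiddleware:
    """Drop responses whose Content-Type is not HTML (PDFs, images, etc.)."""

    def process_response(self, request, response, spider):
        content_type = response.headers.get(b"Content-Type", b"").decode(errors="ignore")
        if "text/html" not in content_type:
            # IgnoreRequest tells Scrapy to discard this response instead of parsing it
            raise IgnoreRequest(f"skipping non-HTML response: {response.url}")
        return response
```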
Skip robots.txt
# edit settings.py
ROBOTSTXT_OBEY = False
Use a random User-Agent
pip install fake-useragent
# edit middlewares.py
from fake_useragent import UserAgent
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

class FakeUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # pick a fresh random User-Agent for every outgoing request
        ua = UserAgent()
        request.headers['User-Agent'] = ua.random
# edit settings.py
DOWNLOADER_MIDDLEWARES = {
    "millions_crawler.middlewares.FakeUserAgentMiddleware": 543,
}
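Optionally, fake-useragent can be sanity-checked in a Python shell before relying on the middleware; `ua.random` returns a browser User-Agent string on each access:

```python
# quick check that fake-useragent is installed and returning values
from fake_useragent import UserAgent

ua = UserAgent()
print(ua.random)  # e.g. a Chrome or Firefox User-Agent string
print(ua.random)  # likely a different one on the next access
```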
Single spider (2023/03/21)
| Spider | Total Pages | Total Time (hrs) | Pages per Hour |
| --- | --- | --- | --- |
| tweh | 152,958 | 1.3 | 117,409 |
| w8h | 4,759 | 0.1 | 32,203 |
| wiki* | 13,000,320 | 43 | 30,240 |
Distributed spiders (4 spiders, 2023/03/24)
| Spider | Total Pages | Total Time (hrs) | Pages per Hour |
| --- | --- | --- | --- |
| tweh | 153,288 | 0.52 | - |
| w8h | 4,921 | 0.16 | - |
| wiki* | 4,731,249 | 43.2 | 109,492 |
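The distributed run coordinates the 4 spiders through Redis (see the scrapy-redis reference at the end). Below is a minimal configuration sketch, assuming scrapy-redis is the coordination layer and Redis runs locally; the exact settings in this repo may differ:

```python
# settings.py -- scrapy-redis sketch for the distributed run (assumed values)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"               # share one request queue in Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"   # dedupe URLs across all workers
SCHEDULER_PERSIST = True                                     # keep the queue when a spider stops
REDIS_URL = "redis://localhost:6379"                         # assumed local Redis instance
```

With this in place, each worker runs the same `scrapy crawl` command and pulls requests from the shared Redis queue, which is how several processes (or machines) can work on one crawl frontier.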
Create a .env file
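The exact variables the spiders expect are not listed here; as a hedged example only, the .env could hold the Redis and MongoDB connection settings (all names below are hypothetical):

```
# .env -- example contents; variable names are assumptions, adjust to what the code reads
REDIS_URL=redis://localhost:6379
MONGO_URI=mongodb://localhost:27017
MONGO_DATABASE=millions_crawler
```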
Install Redis
sudo apt-get install redis-server
Install MongoDB
sudo apt-get install mongodb
Run Redis
sudo service redis-server start
Run MongoDB
sudo service mongod start
Run the spider
cd millions-crawler
scrapy crawl [$spider_name]  # $spider_name = tweh, w8h, wiki
Install requirements
pip install -r requirements.txt
Reference
- GitHub | fake-useragent
- GitHub | scrapy
- [Day 20] Anti-anti-crawling (反反爬蟲)
- Scrapy documentation
- Fixing Redis "MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist o..."
- Ubuntu Linux Redis installation and configuration tutorial with examples
- How to connect to a remote Linux + MongoDB server?
- Scrapy-redis: the final chapter