Respect robots txt #45

Mews · 2024-06-19T11:16:09Z

Closes #42

Changes

Added a new networking.robots_txt submodule.
Split some of the crawler logic into separate functions to avoid pylint's too-many-branches warning.
Spider.crawl now checks whether scraping a website is allowed and respects any crawl delay the site specifies.
Added new tests for the robots_txt submodule and for the respect_robots_txt option of Spider.
Increased max-attributes in .pylintrc.
- This should only be a temporary fix until "Use one or more Options classes" (#46) is addressed.
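
For readers unfamiliar with the check described above, here is a minimal sketch of a robots.txt permission and crawl-delay lookup using only the standard library's urllib.robotparser. It is not the code from this PR: the helper names (robots_txt_url_for, build_parser, check_url) and the example.com URLs are assumptions made for illustration, and the actual networking.robots_txt submodule may be structured differently.

```python
from typing import Tuple
from urllib.parse import urlparse, urlunparse
from urllib.robotparser import RobotFileParser


def robots_txt_url_for(url: str) -> str:
    """Hypothetical helper: robots.txt URL for the site hosting `url`."""
    parts = urlparse(url)
    return urlunparse((parts.scheme, parts.netloc, "/robots.txt", "", "", ""))


def build_parser(robots_txt: str) -> RobotFileParser:
    """Hypothetical helper: parse robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def check_url(parser: RobotFileParser, url: str, user_agent: str = "*") -> Tuple[bool, float]:
    """Return (allowed, delay_in_seconds) for `url`; delay is 0.0 if unspecified."""
    allowed = parser.can_fetch(user_agent, url)
    delay = parser.crawl_delay(user_agent)
    return allowed, float(delay) if delay is not None else 0.0


if __name__ == "__main__":
    rules = "User-agent: *\nCrawl-delay: 2\nDisallow: /private\n"
    parser = build_parser(rules)
    print(robots_txt_url_for("http://example.com/a/b?x=1"))
    print(check_url(parser, "http://example.com/private/page"))
    print(check_url(parser, "http://example.com/index.html"))
```

A crawler would typically sleep for the returned delay between requests to the same host and skip any URL for which the first value is False.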

tests/networking/test_robots_txt.py

indrajithi · 2024-06-19T17:26:49Z

tests/test_crawler.py

+    )
+
+    mock_urlopen.side_effect = lambda url: (
+        BytesIO(b"User-agent: *\nDisallow: /") if url == "http://notcrawlable.com/robots.txt" else


Should we make this into two separate tests and avoid this branch?

What do you mean?
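
One possible reading of the suggestion above, as a sketch only: instead of a single mock whose side_effect branches on the URL, each test gets its own mock that returns one fixed robots.txt body. The helper make_urlopen, the test names, and the crawlable.com host are hypothetical, and the assertions use urllib.robotparser directly rather than the project's Spider, whose API is not shown in this thread.

```python
from io import BytesIO
from unittest.mock import MagicMock
from urllib.robotparser import RobotFileParser


def make_urlopen(body: bytes) -> MagicMock:
    """Hypothetical helper: a urlopen stand-in that serves one fixed body."""
    # side_effect builds a fresh BytesIO per call so repeated reads work.
    return MagicMock(side_effect=lambda url: BytesIO(body))


def parse_robots(mock_urlopen: MagicMock, robots_url: str) -> RobotFileParser:
    """Fetch robots.txt through the mock and parse it."""
    text = mock_urlopen(robots_url).read().decode("utf-8")
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def test_disallowed_site_is_not_crawled():
    mock_urlopen = make_urlopen(b"User-agent: *\nDisallow: /")
    parser = parse_robots(mock_urlopen, "http://notcrawlable.com/robots.txt")
    assert not parser.can_fetch("*", "http://notcrawlable.com/page")


def test_allowed_site_is_crawled():
    # Assumed counterpart host; the original else-branch is not shown.
    mock_urlopen = make_urlopen(b"User-agent: *\nAllow: /")
    parser = parse_robots(mock_urlopen, "http://crawlable.com/robots.txt")
    assert parser.can_fetch("*", "http://crawlable.com/page")
```

With this split, neither test needs a conditional inside the mock, at the cost of some duplicated setup.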

Mews added 6 commits June 19, 2024 10:38

Added networking.robots_txt submodule

818128d

Increased max class attributes

003c30b

Check robots.txt file before crawling

448e2ec

Added test cases for new robots_txt submodule

8c3ac23

Added line to docstring for respect_robots_txt

6a67802

Fixed docstring for setup_robots_txt_parser

040770c

indrajithi reviewed Jun 19, 2024


Mews added 2 commits June 19, 2024 18:36

Use pytest parametrize on test_get_robots_txt_url

c566fb8

Use parametrize in test_format_url

83e30ef

indrajithi merged commit 6f5ea34 into DataCrawl-AI:master Jun 19, 2024
10 checks passed
