Add a flag to crawl only the root website #14

devavinothm · 2024-06-15T05:18:47Z

Explanation of Changes

Added stay_within_domain Parameter:

  This parameter is added to the __init__ method and controls whether the crawler follows external links.

Stored Root Domain:

  self.root_domain stores the domain parsed from root_url.

Filtered Links:

  In the crawl method, before a link is added to self.crawl_set, its domain is compared with the root domain when stay_within_domain is True. Links on other domains are skipped.
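The filtering step described above can be sketched roughly as follows. This is a minimal standalone sketch, not the code from this PR: `is_same_domain` and `filter_links` are hypothetical helper names, and the exact-host comparison (which treats subdomains as external) is an assumption.

```python
from urllib.parse import urljoin, urlparse


def is_same_domain(link: str, root_url: str) -> bool:
    """Return True when `link` resolves to the same host as `root_url`."""
    root_domain = urlparse(root_url).netloc.lower()
    # Resolve relative links against the root URL before comparing hosts.
    link_domain = urlparse(urljoin(root_url, link)).netloc.lower()
    return link_domain == root_domain


def filter_links(links, root_url, stay_within_domain=True):
    """Return the links that would be added to the crawl set (hypothetical helper)."""
    if not stay_within_domain:
        # Flag off: external links are kept as well.
        return list(links)
    return [link for link in links if is_same_domain(link, root_url)]
```

With `stay_within_domain=True`, a relative link such as `/about` is kept while `https://other.org/page` is dropped. Under this strict comparison a subdomain of the root host would also count as external, which may or may not be the desired behaviour.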

I hope this solves the issue.

indrajithi · 2024-06-15T05:27:06Z

tiny_web_crawler/crawler.py

@@ -53,6 +55,8 @@ def __init__(self,
        self.max_workers: int = max_workers
        self.delay: float = delay
        self.verbose: bool = verbose
+        self.stay_within_domain: bool = stay_within_domain


Could we name the flag internal_links_only instead? I want to add an option for crawling external links later, which could be named external_links_only. WDYT?

Sure, we can do that.

indrajithi · 2024-06-15T05:39:16Z

Also please check this

indrajithi · 2024-06-15T06:19:32Z

tiny_web_crawler/crawler.py

    print(Fore.GREEN + f"Crawling: {root_url}")
    crawler.start()


 if __name__ == '__main__':
-    main()
+    main()


Also, the missing newline at the end of the file is causing lint to fail.

Also please add a test case. Thanks!
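A test along these lines might cover the flag. This is only a sketch: `MiniCrawler` and its `add_link` method are hypothetical stand-ins that mirror the attributes named in the PR description (`root_domain`, `stay_within_domain`, `crawl_set`), not the real crawler class, and no default value for the flag is assumed.

```python
import unittest
from urllib.parse import urlparse


class MiniCrawler:
    """Hypothetical stand-in for the crawler, reduced to the link filter."""

    def __init__(self, root_url: str, stay_within_domain: bool):
        self.root_domain = urlparse(root_url).netloc
        self.stay_within_domain = stay_within_domain
        self.crawl_set = set()

    def add_link(self, link: str) -> None:
        # Skip links on other hosts only when the flag is on.
        if self.stay_within_domain and urlparse(link).netloc != self.root_domain:
            return
        self.crawl_set.add(link)


class StayWithinDomainTest(unittest.TestCase):
    def test_external_link_skipped_when_flag_set(self):
        crawler = MiniCrawler("https://example.com", stay_within_domain=True)
        crawler.add_link("https://example.com/about")
        crawler.add_link("https://other.org/page")
        self.assertEqual(crawler.crawl_set, {"https://example.com/about"})

    def test_external_link_kept_when_flag_unset(self):
        crawler = MiniCrawler("https://example.com", stay_within_domain=False)
        crawler.add_link("https://example.com/about")
        crawler.add_link("https://other.org/page")
        self.assertEqual(
            crawler.crawl_set,
            {"https://example.com/about", "https://other.org/page"},
        )
```

Saved to a file, this can be run with `python -m unittest`. A real test would exercise the crawler class itself, most likely with network requests mocked out.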

indrajithi · 2024-06-17T21:00:26Z

Closing due to inactivity. /cc @Mews @devavinothm

Add a flag to crawl only the root website

c305d8d

devavinothm mentioned this pull request Jun 15, 2024

Feature: Add option to return the crawled website body in the response #8

Closed

indrajithi reviewed Jun 15, 2024

indrajithi mentioned this pull request Jun 16, 2024

Feature: Support flag to crawl only the root website. Do not hop to external links #11

Closed

indrajithi closed this Jun 17, 2024
