Add a flag to crawl only the root website #14

Closed
wants to merge 1 commit into from


devavinothm

Explanation of Changes

  • Added stay_within_domain parameter:

      A new stay_within_domain parameter on the __init__ method controls whether the crawler follows external links.

  • Stored root domain:

      self.root_domain stores the domain parsed from root_url.

  • Filtered links:

      In the crawl method, before a link is added to self.crawl_set, its domain is compared with the root domain when stay_within_domain is True; links on other domains are skipped.
    

I hope this solves the issue.
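
For illustration, the three changes described above could fit together roughly as follows. This is a minimal sketch, not the code from this PR: the class name `CrawlerSketch` and the `add_link` method are hypothetical, and comparing `urlparse(...).netloc` values is an assumption about how the domain check is done.

```python
from urllib.parse import urljoin, urlparse


class CrawlerSketch:
    """Hypothetical stand-in for the crawler; only the link filtering is shown."""

    def __init__(self, root_url: str, stay_within_domain: bool = True):
        self.root_url: str = root_url
        self.stay_within_domain: bool = stay_within_domain
        # Domain parsed once from the root URL, e.g. "example.com".
        self.root_domain: str = urlparse(root_url).netloc
        self.crawl_set: set = set()

    def add_link(self, link: str) -> bool:
        """Queue a link unless the flag is set and it points to another domain."""
        # Resolve relative links such as "/about" against the root URL.
        absolute = urljoin(self.root_url, link)
        if self.stay_within_domain and urlparse(absolute).netloc != self.root_domain:
            return False  # external link, skipped
        self.crawl_set.add(absolute)
        return True
```

Because the comparison is on the full `netloc`, a subdomain such as `blog.example.com` would count as external under this sketch.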

@@ -53,6 +55,8 @@ def __init__(self,
         self.max_workers: int = max_workers
         self.delay: float = delay
         self.verbose: bool = verbose
+        self.stay_within_domain: bool = stay_within_domain
Collaborator

Could we name the flag internal_links_only instead? I want to add an option for crawling only external links later, which could be named external_links_only. WDYT?

Author

Sure, we can do that.

@indrajithi
Collaborator

Also please check this

     print(Fore.GREEN + f"Crawling: {root_url}")
     crawler.start()


 if __name__ == '__main__':
-    main()
+    main()
Collaborator

Also, the missing newline at the end of this file is causing lint to fail.
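
One quick way to confirm this locally, as a minimal sketch: the helper name `ends_with_newline` is hypothetical, and the project's linter may report the problem in its own way.

```python
from pathlib import Path


def ends_with_newline(path) -> bool:
    """Return True if the file is empty or its last byte is a newline."""
    data = Path(path).read_bytes()
    # An empty file is treated as fine here; that choice is an assumption.
    return data == b"" or data.endswith(b"\n")
```

Most editors can also be configured to insert the final newline on save.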

Collaborator

Also, please add a test case. Thanks!
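
A test for the flag might look something like the sketch below. It does not use the project's real crawler class or test setup: `filter_links` is a hypothetical stand-in for the filtering step, and the URLs are made up for the example.

```python
import unittest
from urllib.parse import urlparse


def filter_links(root_url, links, internal_links_only):
    """Hypothetical stand-in for the crawler's link filtering step."""
    if not internal_links_only:
        return set(links)
    root_domain = urlparse(root_url).netloc
    return {link for link in links if urlparse(link).netloc == root_domain}


class TestInternalLinksOnly(unittest.TestCase):
    ROOT = "https://example.com"
    LINKS = ["https://example.com/a", "https://other.org/b"]

    def test_flag_on_drops_external_links(self):
        kept = filter_links(self.ROOT, self.LINKS, internal_links_only=True)
        self.assertEqual(kept, {"https://example.com/a"})

    def test_flag_off_keeps_all_links(self):
        kept = filter_links(self.ROOT, self.LINKS, internal_links_only=False)
        self.assertEqual(kept, set(self.LINKS))
```

Saved as a test module, this can be run with `python -m unittest`.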

@indrajithi
Collaborator

Closing due to inactivity. /cc @Mews @devavinothm

@indrajithi indrajithi closed this Jun 17, 2024