Merge pull request #15 from indrajithi/feature/refactor-spider-class
Feature/refactor spider class and add pre-commit hooks
indrajithi authored Jun 16, 2024
2 parents 410727f + 653b720 commit 89ed1c9
Showing 7 changed files with 1,045 additions and 34 deletions.
6 changes: 3 additions & 3 deletions .github/workflows/ci.yml
@@ -17,7 +17,7 @@ jobs:
         run: curl -sSL https://install.python-poetry.org | python3 -
       - name: Install dependencies
         run: |
-          poetry install
+          poetry install --with dev
       - name: Run linter :pylint
         run: |
           poetry run pylint tiny_web_crawler
@@ -41,7 +41,7 @@
         run: curl -sSL https://install.python-poetry.org | python3 -
       - name: Install dependencies
         run: |
-          poetry install
+          poetry install --with dev
       - name: Run tests
         run: |
           poetry run pytest
@@ -61,7 +61,7 @@
         run: curl -sSL https://install.python-poetry.org | python3 -
       - name: Install dependencies
         run: |
-          poetry install
+          poetry install --with dev
       - name: Build package
         run: poetry build
       - name: Publish package
22 changes: 22 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,22 @@
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v2.3.0
    hooks:
      - id: check-yaml
      - id: end-of-file-fixer
      - id: trailing-whitespace
  - repo: local
    hooks:
      - id: pylint
        name: Run pylint
        entry: poetry run pylint
        language: system
        types: [python]
        args: ["tiny_web_crawler"]
        stages: [commit]
      - id: pytest
        name: Run pytest
        entry: poetry run pytest
        language: system
        types: [python]
        stages: [push]
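
The two `repo: local` hooks simply shell out to Poetry: `stages: [commit]` ties pylint to `git commit`, and `stages: [push]` ties pytest to `git push`. Below is a rough Python sketch of what gets executed at each stage, for illustration only; the function name is hypothetical, and real pre-commit additionally filters files by `types`, manages hook environments, and wires itself into git.

```python
# Approximation of the two local hooks above, not pre-commit's actual API.
import subprocess

def run_local_hooks(stage: str) -> None:
    if stage == "commit":
        # stages: [commit] -> lint before every `git commit`
        subprocess.run(["poetry", "run", "pylint", "tiny_web_crawler"], check=True)
    elif stage == "push":
        # stages: [push] -> run the test suite before every `git push`
        subprocess.run(["poetry", "run", "pytest"], check=True)

if __name__ == "__main__":
    run_local_hooks("commit")
```
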
18 changes: 18 additions & 0 deletions README.md
@@ -69,3 +69,21 @@ Crawled output sample for `https://github.com`
 ```
 
 
+## Contributing
+
+Thank you for considering contributing. If you are a first-time contributor, you can pick a `good-first-issue` and get started. Please feel free to ask questions.
+
+Before starting work on an issue, please get it assigned to you so that multiple people do not end up working on the same issue.
+
+### Dev setup
+
+- Install Poetry on your system: `pipx install poetry`
+- Clone the repo
+- Create a venv or use `poetry shell`
+- Run `poetry install --with dev`
+
+Before raising a PR, please make sure these checks are covered:
+
+1. An issue exists that the PR addresses
+2. Tests are written for the changes (see the sketch below)
+3. All lint and test checks pass
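
For item 2, here is a hypothetical sketch of what a small contributed test might look like, using `pytest` and `requests-mock` from the dev group. The URL, markup, and assertions are invented, and it assumes `Spider.fetch_url()` (shown in the crawler diff below) issues its request through `requests`:

```python
# Hypothetical contributor test; the URL and HTML are made up for illustration.
import requests_mock

from tiny_web_crawler.crawler import Spider

def test_fetch_url_returns_parsed_soup() -> None:
    spider = Spider(root_url="http://example.com")
    with requests_mock.Mocker() as mocked:
        mocked.get("http://example.com",
                   text="<html><a href='/about'>About</a></html>")
        soup = spider.fetch_url("http://example.com")
    assert soup is not None
    assert soup.find("a")["href"] == "/about"
```
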
973 changes: 973 additions & 0 deletions poetry.lock

Large diffs are not rendered by default.

13 changes: 11 additions & 2 deletions pyproject.toml
@@ -9,19 +9,28 @@ homepage = "http://github.com/indrajithi/tiny-web-crawler"
 repository = "http://github.com/indrajithi/tiny-web-crawler"
 documentation = "http://github.com/indrajithi/tiny-web-crawler"
 
+
+[tool.poetry.scripts]
+post_install = "scripts:post_install"
+
 [tool.poetry.dependencies]
 python = "^3.8"
 validators = "^0.28.3"
 beautifulsoup4 = "^4.12.3"
 lxml = "^5.2.2"
 colorama = "^0.4.6"
 requests = "^2.32.3"
+pylint = "3.0.2"
 
-[tool.poetry.dev-dependencies]
+[tool.poetry.group.dev.dependencies]
 pytest = "^6.2"
 responses = "^0.13.4"
-pylint = "^2.7"
+pylint = "^3.0.2"
+mypy = "^1.10.0"
+pytest-cov = "^5.0.0"
+requests-mock = "^1.12.1"
+pre-commit = ">=2.15,<3.0"
 
 
 [build-system]
 requires = ["poetry-core>=1.0.0"]
4 changes: 4 additions & 0 deletions script.py
@@ -0,0 +1,4 @@
import subprocess

def post_install() -> None:
    subprocess.run(["poetry", "run", "pre-commit", "install"], check=True)
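
For context, a usage sketch: calling the hook by hand is equivalent to running `poetry run pre-commit install` yourself. The import below follows the committed filename (`script.py`); in the PR the function is exposed through the `[tool.poetry.scripts]` entry in `pyproject.toml`:

```python
# Manual invocation sketch; normally this runs through the Poetry script entry.
from script import post_install

post_install()  # installs the git hooks declared in .pre-commit-config.yaml
```
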
43 changes: 14 additions & 29 deletions tiny_web_crawler/crawler.py
@@ -1,4 +1,5 @@
 from __future__ import annotations
+from dataclasses import dataclass, field
 import json
 import urllib.parse
 from typing import Dict, List, Optional, Set
@@ -14,7 +15,7 @@
 
 DEFAULT_SCHEME: str = 'http://'
 
-
+@dataclass
 class Spider():
     """
     A simple web crawler class.
@@ -26,33 +27,20 @@ class Spider():
         crawl_set (Set[str]): A set of URLs to be crawled.
         link_count (int): The current count of crawled links.
         save_to_file (Optional[str]): The file path to save the crawl results.
+        max_workers (int): Max count of concurrent workers
+        delay (float): request delay
     """
 
-    def __init__(self,
-                 root_url: str,
-                 max_links: int = 5,
-                 save_to_file: Optional[str] = None,
-                 max_workers: int = 1,
-                 delay: float = 0.5,
-                 verbose: bool = True) -> None:
-        """
-        Initializes the Spider class.
-
-        Args:
-            root_url (str): The root URL to start crawling from.
-            max_links (int): The maximum number of links to crawl.
-            save_to_file (Optional[str]): The file to save the crawl results to.
-        """
-        self.root_url: str = root_url
-        self.max_links: int = max_links
-        self.crawl_result: Dict[str, Dict[str, List[str]]] = {}
-        self.crawl_set: Set[str] = set()
-        self.link_count: int = 0
-        self.save_to_file: Optional[str] = save_to_file
-        self.scheme: str = DEFAULT_SCHEME
-        self.max_workers: int = max_workers
-        self.delay: float = delay
-        self.verbose: bool = verbose
+    root_url: str
+    max_links: int = 5
+    save_to_file: Optional[str] = None
+    max_workers: int = 1
+    delay: float = 0.5
+    verbose: bool = True
+    crawl_result: Dict[str, Dict[str, List[str]]] = field(default_factory=dict)
+    crawl_set: Set[str] = field(default_factory=set)
+    link_count: int = 0
+    scheme: str = field(default=DEFAULT_SCHEME, init=False)
 
     def fetch_url(self, url: str) -> Optional[BeautifulSoup]:
         """
@@ -201,9 +189,6 @@ def start(self) -> Dict[str, Dict[str, List[str]]]:
 
 
 def main() -> None:
-    """
-    The main function to initialize and start the crawler.
-    """
     root_url = 'https://pypi.org/'
     max_links = 5
 
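With the `@dataclass` refactor, construction goes through the generated `__init__`. Note that `scheme` is declared with `init=False`, so it cannot be passed to the constructor, and the mutable `crawl_result`/`crawl_set` containers use `field(default_factory=...)` because dataclasses disallow shared mutable defaults. A brief usage sketch, with illustrative argument values:

```python
# Illustrative usage of the refactored Spider; URL and values are arbitrary.
from tiny_web_crawler.crawler import Spider

spider = Spider(root_url="https://example.com", max_links=3, delay=0.5)
results = spider.start()  # Dict[str, Dict[str, List[str]]], per the signature above

# scheme is init=False, so passing it raises TypeError:
# Spider(root_url="https://example.com", scheme="https://")
```
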
