feature: HTML tag Dump file scanner #1

Open
wants to merge 1 commit into
base: master
80 changes: 80 additions & 0 deletions src/form_scanner/README.md
@@ -0,0 +1,80 @@
# Web Scraping Project

## Overview
This project is a Python tool that scrapes a list of URLs and extracts form-related information from each page, such as the form title, the institution that owns the form, and alternate-language URLs. Results are saved in both JSON and CSV formats.

## Features
- Scrapes web pages for specific information about the forms they contain.
- Handles various HTML structures and page layouts.
- Outputs data in both JSON and CSV formats for easy analysis.
- Tracks progress across large sets of URLs.
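
To make the extracted fields concrete, here is a minimal sketch of pulling a title, an owning institution, alternate-language URLs, and a form count out of one page. It is not the code in `main.py`. It uses the standard library's `html.parser` instead of `beautifulsoup4` so that it runs without extra dependencies. It also assumes that the institution appears in a `<meta name="dcterms.creator">` tag and that alternate languages appear in `<link rel="alternate" hreflang="...">` tags. The names `FormPageParser` and `scan_html` are hypothetical.

```python
from html.parser import HTMLParser


class FormPageParser(HTMLParser):
    """Collects a few page-level fields and counts <form> elements."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.institution = None   # assumption: taken from <meta name="dcterms.creator">
        self.alternate_urls = {}  # maps hreflang -> href
        self.form_count = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "form":
            self.form_count += 1
        elif tag == "meta" and attrs.get("name") == "dcterms.creator":
            self.institution = attrs.get("content")
        elif tag == "link" and attrs.get("rel") == "alternate" and attrs.get("hreflang"):
            self.alternate_urls[attrs["hreflang"]] = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text inside <title> is accumulated.
        if self._in_title:
            self.title += data


def scan_html(html):
    """Return a flat record describing one page (hypothetical field names)."""
    parser = FormPageParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "institution": parser.institution,
        "alternate_urls": parser.alternate_urls,
        "form_count": parser.form_count,
    }


SAMPLE = """<html><head><title> Contact us </title>
<meta name="dcterms.creator" content="Example Institution">
<link rel="alternate" hreflang="fr" href="https://example.org/fr/contact.html">
</head><body><form action="/submit"><input name="q"></form></body></html>"""

record = scan_html(SAMPLE)
```

A parser class keeps all per-page state in one place, so handling a different layout mostly means adding another branch to `handle_starttag`.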

## Prerequisites
Before you run the script, ensure you have the following installed:
- Python 3
- `requests`
- `beautifulsoup4`
- `tqdm`

You can install the required Python libraries using:

```bash
pip install requests beautifulsoup4 tqdm
```

## Installation
1. Clone the repository or download the script to your local machine.
2. Ensure you have the required libraries installed (see Prerequisites).

## Usage
1. Place your list of URLs in a text file, one URL per line (e.g., `forms.txt`).
2. Run the script from your command line:

```bash
python main.py
```

3. Check the output files (`scraped_data.json` and `scraped_data.csv`) for the results.
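
The input and output handling in the steps above could look roughly like the sketch below. It illustrates reading a one-URL-per-line file and writing the same records to JSON and CSV. It is not taken from `main.py`: the function names `read_urls` and `write_results` and the record keys are assumptions, and fetching and progress reporting are left out.

```python
import csv
import json


def read_urls(path):
    """Return the non-empty lines of a URL list file, stripped of whitespace."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def write_results(records, json_path, csv_path):
    """Write the same list of dict records to a JSON file and a CSV file."""
    with open(json_path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, ensure_ascii=False, indent=2)

    # Use the union of keys, in first-seen order, so that records with
    # missing fields still fit the CSV header.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(csv_path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(records)
```

Called as `write_results(records, "scraped_data.json", "scraped_data.csv")`, this would produce the two output files named in step 3. Nested values such as lists would need flattening before the CSV step.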

## Testing
To run the tests, execute the following command:

```bash
python -m unittest
```
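
By default, `python -m unittest` discovers test files whose names match `test*.py`. As an illustration of the kind of test it would pick up, here is a small sketch. The helper `count_forms` is a hypothetical stand-in defined inline, not a function from this project, and the HTML strings are made up so that no network access is needed.

```python
import re
import unittest


def count_forms(html):
    """Hypothetical stand-in for the scanner: count opening <form> tags."""
    return len(re.findall(r"<form\b", html, flags=re.IGNORECASE))


class CountFormsTest(unittest.TestCase):
    def test_page_with_one_form(self):
        self.assertEqual(count_forms("<body><form></form></body>"), 1)

    def test_page_without_forms(self):
        self.assertEqual(count_forms("<body><p>No forms here</p></body>"), 0)
```

Testing against fixed HTML strings rather than live URLs keeps the suite fast and repeatable even when the scanned pages change.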

## Contributing
Contributions to this project are welcome. Please fork the repository and submit a pull request with your updates.

## License
Unless otherwise noted, computer program source code of the Alpha Canada.ca is
covered under Crown Copyright, Government of Canada, and is distributed under the MIT License.

The Canada wordmark and related graphics associated with this distribution are protected under
trademark law and copyright law. No permission is granted to use them outside the parameters
of the Government of Canada's corporate identity program. For more information, see
https://www.tbs-sct.gc.ca/fip-pcim/index-eng.asp

Copyright title to all 3rd party software distributed with the Web Experience Toolkit (WET) is
held by the respective copyright holders as noted in those files. Users are asked to read the
3rd Party Licenses referenced with those assets.


MIT License

Copyright (c) 2014 Government of Canada

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and
associated documentation files (the "Software"), to deal in the Software without restriction,
including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial
portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT
NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
20 changes: 20 additions & 0 deletions src/form_scanner/forms-test.txt
@@ -0,0 +1,20 @@
https://www.canada.ca/en/accessibility-standards-canada/campaigns/annual-public-meeting/registration.html
https://www.canada.ca/en/accessibility-standards-canada/corporate/accessibility-standards/technical-committee-apply.html
https://www.canada.ca/en/accessibility-standards-canada/corporate/contact.html
https://www.canada.ca/en/accessibility-standards-canada/corporate/consultation/2020-priorities/form.html
https://www.canada.ca/en/administrative-tribunals-support-service/accessibility/accessibility-feedback-form.html
https://www.canada.ca/en/agriculture-agri-food/basic-search.html
https://www.canada.ca/en/agriculture-agri-food/news/2021/03/database-emergency-processing-fund-projects-in-quebec.html
https://www.canada.ca/en/agriculture-agri-food/news/2021/03/trouver-les-projets-du-fonds-durgence-pour-la-transformation-au-colombie-britannique.html
https://www.canada.ca/en/agriculture-agri-food/news/2021/04/database-emergency-processing-fund-projects-in-ontario.html
https://www.canada.ca/en/agriculture-agri-food/news/2022/02/the-government-of-canada-invests-in-clean-technology-to-support-sustainable-farming-practices.html
https://www.canada.ca/en/agriculture-agri-food/news/2022/03/backgrounder---list-of-local-food-infrastructure-program-projects.html
https://www.canada.ca/en/agriculture-agri-food/news/2022/12/database-list-of-local-food-infrastructure-fund-projects-fourth-intake.html
https://www.canada.ca/en/agriculture-agri-food/search.html
https://www.canada.ca/en/agriculture-agri-food/search/advanced-search.html
https://www.canada.ca/en/air-force/services/benefits-military/family-support/resources/child-care.html
https://www.canada.ca/en/air-force/services/benefits-military/family-support/resources/search.html
https://www.canada.ca/en/air-force/services/royal-canadian-air-force-speakers-bureau/request-speaker.html
https://www.canada.ca/en/analytics/pat-feedback.html
https://www.canada.ca/en/army/corporate/ie-map-divisions.html
https://www.canada.ca/en/army/programs/cae-initiatives.html