Table of Contents
- About
- Capabilities
- Usage
- Common Use Cases
- Installation
- Possibilities for future improvements
- Contributing
Profile Scout is a Python package for scraping and detecting profile pages on a given website, with support for information extraction. It crawls the provided URL and, using its search functionality and machine learning, identifies the URLs of profile pages within the website. This simplifies locating and accessing profile pages for data collection, web scraping, and analysis tasks. It also supports information extraction, so specific data can be pulled from the profile pages it finds.
Profile Scout can be useful to:
- Investigators and OSINT Specialists (information extraction, creating information graphs, ...)
- Penetration Testers and Ethical Hackers/Social Engineers (information extraction, reconnaissance, profile building)
- Scientists and researchers (data engineering, data science, social science, research)
- Companies (talent research, marketing, contact acquisition/harvesting)
- Organizations (contact acquisition/harvesting, data collecting, database updating)
Profile Scout is mainly a crawler. For a given URL, it crawls the site and performs the selected action on each visited page. If a file with URLs is provided, each URL is processed in a separate thread.
Main features:
- Flexible and controlled page scraping (HTML, page screenshot, or both)
- Detecting and scraping profile pages during the crawling process
- Locating the collective page from which all profile pages originate.
- Information extraction from HTML files
Options:
-h, --help
show this help message and exit
--url URL
URL of the website to crawl
-f URLS_FILE_PATH, --file URLS_FILE_PATH
Path to the file with URLs of the websites to crawl
-D DIRECTORY, --directory DIRECTORY
Extract data from HTML files in the directory. To avoid saving output, set '-ep'/'--export-path' to ''
-v, --version
print current version of the program
-a {scrape_pages,scrape_profiles,find_origin}, --action {scrape_pages,scrape_profiles,find_origin}
Action to perform at a time of visiting the page (default: scrape_pages)
-b, --buffer
Buffer errors and outputs until crawling of website is finished and then create logs
-br, --bump-relevant
Bump relevant links to the top of the visiting queue (based on RELEVANT_WORDS list)
-ep EXPORT_PATH, --export-path EXPORT_PATH
Path to destination directory for exporting
-ic {scooby}, --image-classifier {scooby}
Image classifier to be used for identifying profile pages (default: scooby)
-cs CRAWL_SLEEP, --crawl-sleep CRAWL_SLEEP
Time to sleep between each page visit (default: 2)
-d DEPTH, --depth DEPTH
Maximum crawl depth (default: 2)
-if, --include-fragment
Consider links with a URI fragment (e.g. http://example.com/some#fragment) as separate pages
-ol OUT_LOG_PATH, --output-log-path OUT_LOG_PATH
Path to output log file. Ignored if '-f'/'--file' is used
-el ERR_LOG_PATH, --error-log-path ERR_LOG_PATH
Path to error log file. Ignored if '-f'/'--file' is used
-so {all,html,screenshot}, --scrape-option {all,html,screenshot}
Data to be scraped (default: all)
-t MAX_THREADS, --threads MAX_THREADS
Maximum number of threads to use if '-f'/'--file' is provided (default: 4)
-mp MAX_PAGES, --max-pages MAX_PAGES
Maximum number of pages to scrape. A page is considered scraped if the action is performed successfully (default: unlimited)
-p, --preserve
Preserve whole URI (e.g. 'http://example.com/something/' instead of 'http://example.com/')
-r RESOLUTION, --resolution RESOLUTION
Resolution of headless browser and output images. Format: WIDTHxHEIGHT (default: 2880x1620)
Full input line format is: '[DEPTH [CRAWL_SLEEP]] URL'
DEPTH and CRAWL_SLEEP are optional. If only one number is present, it is treated as DEPTH.
For example, "3 https://example.com" means that the URL should be crawled to a depth of 3.
If DEPTH or CRAWL_SLEEP is present in the line, the corresponding command-line argument is ignored for that URL.
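The line format above can be illustrated with a small parser. This is only a sketch of the described rules, not Profile Scout's own code: `UrlEntry`, `parse_url_line`, and `effective_settings` are hypothetical names, and treating both numbers as integers is an assumption.

```python
from typing import NamedTuple, Optional


class UrlEntry(NamedTuple):
    url: str
    depth: Optional[int]        # None means "fall back to -d/--depth"
    crawl_sleep: Optional[int]  # None means "fall back to -cs/--crawl-sleep"


def parse_url_line(line: str) -> UrlEntry:
    """Parse one '[DEPTH [CRAWL_SLEEP]] URL' line (hypothetical helper)."""
    parts = line.split()
    if not parts or len(parts) > 3:
        raise ValueError(f"malformed line: {line!r}")
    *numbers, url = parts
    values = [int(n) for n in numbers]  # ValueError on non-numeric fields
    depth = values[0] if len(values) >= 1 else None
    crawl_sleep = values[1] if len(values) == 2 else None
    return UrlEntry(url, depth, crawl_sleep)


def effective_settings(entry: UrlEntry, cli_depth: int = 2, cli_crawl_sleep: int = 2):
    """Fields present in the line override the command-line arguments."""
    depth = entry.depth if entry.depth is not None else cli_depth
    sleep = entry.crawl_sleep if entry.crawl_sleep is not None else cli_crawl_sleep
    return depth, sleep


entry = parse_url_line("3 https://example.com")
print(entry, effective_settings(entry))
```

A single number is read as DEPTH, so the example line yields a depth of 3 and keeps the command-line value for the sleep time.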
Writing too much on the storage drive can reduce its lifespan. To mitigate this issue, if there are more than
30 links, informational and error messages will be buffered and written at the end of
the crawling process.
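As a rough illustration of the buffering behavior described above, the sketch below holds messages in memory and writes them in one go. `MessageLog` and `BUFFER_THRESHOLD` are hypothetical names, and the real implementation may differ in how it counts links and where it writes.

```python
import io

BUFFER_THRESHOLD = 30  # link count taken from the note above


class MessageLog:
    """Write messages immediately, or hold them in memory until flush()."""

    def __init__(self, stream, link_count, force_buffer=False):
        self.stream = stream
        # force_buffer mirrors the idea of the -b/--buffer switch
        self.buffered = force_buffer or link_count > BUFFER_THRESHOLD
        self._pending = []

    def write(self, message):
        if self.buffered:
            self._pending.append(message)
        else:
            self.stream.write(message + "\n")

    def flush(self):
        """Call once crawling is finished: one write instead of many."""
        if self._pending:
            self.stream.write("\n".join(self._pending) + "\n")
            self._pending.clear()


out = io.StringIO()  # stands in for a log file
log = MessageLog(out, link_count=45)
log.write("visited http://example.com")
log.write("visited http://example.com/team")
written_before_flush = out.getvalue()  # still empty at this point
log.flush()
```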
RELEVANT_WORDS=['profile', 'user', 'users', 'about-us', 'team', 'employees', 'staff', 'professor',
'profil', 'o-nama', 'zaposlen', 'nastavnik', 'nastavnici', 'saradnici', 'profesor', 'osoblje',
'запослен', 'наставник', 'наставници', 'сарадници', 'професор', 'особље']
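One way to picture how such a word list could drive link bumping (`-br`) is a double-ended queue in which relevant links are pushed to the front. This is a sketch under assumptions: `is_relevant` and `enqueue` are hypothetical helpers, the word list is shortened, and plain substring matching is a guess at the matching rule.

```python
from collections import deque

# Shortened copy of the list above, for illustration only.
RELEVANT_WORDS = ['profile', 'team', 'staff', 'professor', 'o-nama']


def is_relevant(url, words=RELEVANT_WORDS):
    """True if any relevant word occurs anywhere in the URL."""
    url = url.lower()
    return any(word in url for word in words)


def enqueue(queue, url, bump_relevant=True):
    """Add a link to the visiting queue; relevant links jump to the front."""
    if bump_relevant and is_relevant(url):
        queue.appendleft(url)
    else:
        queue.append(url)


queue = deque()
for link in ['http://example.com/news',
             'http://example.com/contact',
             'http://example.com/about/team']:
    enqueue(queue, link)

print(list(queue))
```

With bumping enabled, the `/about/team` link is visited before the two links that were discovered earlier.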
The program can be used:
- as a command line tool (profilescout)
- from another program, as a package (profilescout)
Command line example:
$ profilescout -a scrape_pages --url 'http://example.com'
Package example (simple):
from profilescout.crawl import Crawler
from profilescout.web.webpage import ScrapeOption
base_url = 'http://example.com'
# CrawlOptions must be imported as well; its module path is not shown in this example
crawl_options = CrawlOptions(max_depth=2, max_pages=20)
crawler = Crawler(crawl_options, '.')
scrape_option = ScrapeOption.ALL
for step in crawler.crawl(base_url):
    crawler.save(scrape_option)
Package example (advanced):
from profilescout.common.structures import OriginPageDetectionStrategy
from profilescout.crawl import Crawler
from profilescout.link.utils import is_valid_sublink
from profilescout.web.webpage import ScrapeOption
# Default values for `CrawlOptions` are provided below as an example.
# If the value of a parameter is not set, these default values will be used instead
crawl_options = CrawlOptions(
    max_depth=3,
    max_pages=None,  # there is no limit to the number of scraped pages
    crawl_sleep=2,
    include_fragment=False,
    bump_relevant=True,
    use_buffer=False,
    scraping=True,
    resolution=(2880, 1620))
base_url = 'http://example.com'
export_path = '.'
detection_strategy = OriginPageDetectionStrategy()
# image_classifier must be defined before this call; it is not set in this example
crawler = Crawler(crawl_options, export_path, detection_strategy, image_classifier)
scrape_option = ScrapeOption.ALL
for step in crawler.crawl(base_url):
    if detection_strategy.successful():
        result = detection_strategy.get_result()
        origin = result['origin']
        og_crawler = crawler.create_subcrawler()
        og_crawler.options.max_depth = result['depth'] + 1
        og_crawler.links_from_structure = True
        og_crawler.skip_sublinks = True
        og_crawler.skip_first_page = True
        og_crawler.sublink_filters = [lambda page_link: is_valid_sublink(page_link.url, result['most_common_format'], '####')]
        for og_step in og_crawler.crawl(origin, result['depth']):
            og_crawler.save(scrape_option)
            og_crawler.skip_sublinks = True
        crawler.mark_as_visited(og_crawler.get_visited_links(), og_crawler.get_scraped_count())
Note: Order of arguments/switches doesn't matter
Scrape the URL up to a depth of 2 (-d) or a maximum of 300 scraped pages (-mp), whichever comes first. Store scraped data at /data (-ep)
profilescout --url https://example.com -d 2 -mp 300 -ep /data
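To make the interaction of the two limits concrete, here is a minimal breadth-first crawl over an in-memory link graph. It is not Profile Scout's crawler: the `SITE` dictionary replaces real page requests, and both the breadth-first order and "depth 0 means only the start page" are assumptions.

```python
from collections import deque

# Toy site: page -> links found on it (stands in for real HTTP requests).
SITE = {
    '/': ['/a', '/b'],
    '/a': ['/a1', '/a2'],
    '/b': ['/b1'],
    '/a1': ['/deep'],
    '/a2': [], '/b1': [], '/deep': [],
}


def crawl(start, max_depth, max_pages=None):
    """Breadth-first crawl that stops at max_depth or max_pages."""
    visited, scraped = {start}, []
    queue = deque([(start, 0)])
    while queue:
        if max_pages is not None and len(scraped) >= max_pages:
            break                       # page limit reached first
        page, depth = queue.popleft()
        scraped.append(page)            # the "action" performed on the page
        if depth == max_depth:
            continue                    # do not follow links any deeper
        for link in SITE.get(page, []):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return scraped


print(crawl('/', max_depth=2))               # depth limit applies
print(crawl('/', max_depth=2, max_pages=3))  # page limit applies first
```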
Scrape HTML (-so html) for every page up to a depth of 2 for the list of URLs (-f). The number of threads to use is set with -t
profilescout -ep /data -t `nproc` -f links.txt -d 2 -so html
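The one-thread-per-URL model with an upper bound can be sketched with the standard library's thread pool. `crawl_site` is a placeholder for the real per-site work, and using `ThreadPoolExecutor` here is an assumption about the mechanism, not a description of the package internals.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_site(url):
    """Placeholder for the real per-site crawl; returns a short status."""
    return f"done: {url}"


def crawl_all(urls, max_threads=4):
    """Process each URL in its own worker, at most max_threads at a time."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # map() returns results in the same order as the input URLs
        return list(pool.map(crawl_site, urls))


results = crawl_all(['https://example.com', 'https://example.org'], max_threads=2)
print(results)
```

In Python, `os.cpu_count()` plays a role similar to `nproc` in the shell command above when choosing `max_threads`.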
Start scraping screenshots from a specific page (-p). It is important to note that without -p the program would ignore the path, that is, the /about-us/meet-the-team/ part
profilescout -p --url https://www.wowt.com/about-us/meet-the-team/ -mp 4 -so screenshot
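The effect of `-p` can be illustrated with `urllib.parse`. The helpers below are hypothetical and only mirror the behavior described in this document: `start_url` shows what preserving the URI means, and `page_key` shows how the `-if` switch from the options list could decide whether a fragment makes a link a separate page.

```python
from urllib.parse import urldefrag, urlsplit, urlunsplit


def start_url(url, preserve=False):
    """Return the URL a crawl would start from (illustrative only)."""
    if preserve:
        return url
    parts = urlsplit(url)
    # keep only scheme and host, drop path, query and fragment
    return urlunsplit((parts.scheme, parts.netloc, '/', '', ''))


def page_key(url, include_fragment=False):
    """Identity of a page: with or without its '#fragment' part."""
    return url if include_fragment else urldefrag(url).url


team = 'https://www.wowt.com/about-us/meet-the-team/'
print(start_url(team))                 # path is dropped without -p
print(start_url(team, preserve=True))  # full URI is kept with -p
print(page_key('http://example.com/some#fragment'))
```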
Scrape each website in the URLs list and postpone writing to the storage disk (by using the buffer, -b)
profilescout -b -t `nproc` -f links.txt -d 0 -ep /data
Scrape profile pages (-a scrape_profiles) and prioritize links that are relevant to a specific domain (-br). For example, when searching for profile pages of professors, we would like to give priority to links that contain related terms, since they are more likely to lead to a profile page. Note: the list of relevant words can be changed in the file constants.py
profilescout -br -t `nproc` -f links.txt -a scrape_profiles -mp 30
Find and screenshot a profile, store it as a 600x400 (-r) image, and then wait (-cs) 30 seconds before moving to the next profile
profilescout -br -t `nproc` -f links.txt -a scrape_profiles -mp 1000 -d 3 -cs 30 -r 600x400
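The `-r` value uses the WIDTHxHEIGHT format, while the package example passes the resolution as a tuple. A converter between the two could look like the sketch below; `parse_resolution` is a hypothetical helper, not a function of the package.

```python
def parse_resolution(value):
    """Turn 'WIDTHxHEIGHT' (e.g. '600x400') into a (width, height) tuple."""
    width, sep, height = value.lower().partition('x')
    if not sep or not width.isdecimal() or not height.isdecimal():
        raise ValueError(f"expected WIDTHxHEIGHT, got {value!r}")
    size = (int(width), int(height))
    if 0 in size:
        raise ValueError("width and height must be positive")
    return size


print(parse_resolution('600x400'))
print(parse_resolution('2880x1620'))
```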
Locate the origin page of profile pages (-a find_origin) with the classifier called scooby (-ic scooby). Note that visited pages are logged, so this can also be used for something like scanning the website
profilescout -t `nproc` -f links.txt -a find_origin -ic scooby
Extract information (-D) contained in profile HTMLs that are located at /data and store it at ~/results (-ep)
profilescout -D /data -ep ~/results
pip3 install profilescout
- Create virtual environment (optional, but recommended)
python3 -m venv /path/to/some/dir
- Activate virtual environment (skip if you skipped the first step)
source /path/to/some/dir/bin/activate
- Install requirements
pip3 install -r requirements.txt
- Install package locally
pip3 install -e .
- Explore
profilescout -h
- Create image and run container. Execute this in project's directory
mkdir "/path/to/screenshot/dir/" # if it does not exist
# this line may differ depending on your shell,
# so check the documentation for the equivalent file to .bashrc
echo 'export SS_EXPORT_PATH="/path/to/screenshot/dir/"' >> ~/.bashrc
docker build -t profilescout .
docker run -it -v "$SS_EXPORT_PATH":/data profilescout
Add --rm if you want it to be disposable (one-time task)
- Test deployment (inside docker container)
profilescout -mp 4 -t 1 -ep '/data' -p --url https://en.wikipedia.org/wiki/GNU
- Classification
- Profile classification based on existing data (without crawling)
- Classification using HTML and images, as well as the selection of appropriate classifiers
- Scraping
- Intelligent downloading of files through links available on the profile page
- Crawling
- Support for scraping using proxies.
- Crawling actions
- Ability to provide custom actions
- Actions before and after page loading.
- Multiple actions for each stage of page processing (before, during, and after access).
- Crawling strategy
- Ability to provide custom heuristics
- Ability to choose crawling strategy (link filters, etc.)
- Support for deeper link bump
- Selection of relevant words using CLI
- Usability
- Saving progress and the ability to resume
- Increased automation (if the profile is not found at depth DEPTH, increase the depth and continue).
- Extraction
- Support for national numbers, e.g. 011/123-4567
- Experiment with lightweight LLMs
- Experiment with Key-Value extraction and Layout techniques like LayoutLM
If you discover a bug or have a feature idea, feel free to open an issue or PR.
Any improvements or suggestions are welcome!