Add --robotspass shunt for records related to robots.txt #43
Example output: https://mirror.ikhoefgeen.nl/WIDE-20121115212638-00463.robots.warc.gz

Very rough analysis:

#!/usr/bin/env python3
from collections import defaultdict, Counter
import warcio
import sys
from pprint import pprint
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

domains = defaultdict(set)


def get_host(url: str) -> str:
    return urlparse(url).hostname


def parse_robots_txt(buffer):
    lines = buffer.read().decode('utf-8', errors='ignore').splitlines()
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


with open(sys.argv[1], 'rb') as fh:
    for record in warcio.ArchiveIterator(fh):
        domains[get_host(record.rec_headers.get_header('WARC-Target-URI'))].add((
            record.rec_type,
            record.rec_headers.get_header('WARC-Date'),
            record.http_headers.get_header('Content-Type') if record.http_headers is not None else None,
            len(parse_robots_txt(record.content_stream()).entries) > 0
            if record.rec_type == 'response' and record.http_headers.get_header('Content-Type') == 'text/plain'
            else None
        ))

print(f'{"domain":<40s} req res rev rob')
for domain, records in domains.items():
    types = Counter(record[0] for record in records)
    hits = sum(1 for record in records if record[3] is not None)
    print(f'{domain:<40s} {types["request"]: 4d} {types["response"]: 4d} {types["revisit"]: 4d} {hits: 4d}')

req = requests
|
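A possible refinement of the rough analysis above, offered only as a sketch: the exact comparison with 'text/plain' skips responses whose Content-Type carries parameters (for example a charset), and CPython's `RobotFileParser` stores the `User-agent: *` group in `default_entry` rather than in `entries`, so checking `entries` alone can under-report. The helper names `is_plain_text` and `has_rules` are hypothetical and not part of this PR.

```python
from urllib.robotparser import RobotFileParser


def is_plain_text(content_type):
    """True if the media type is text/plain, ignoring parameters such as charset."""
    if content_type is None:
        return False
    media_type = content_type.split(';', 1)[0].strip().lower()
    return media_type == 'text/plain'


def has_rules(body: bytes) -> bool:
    """True if the parser found at least one complete user-agent group."""
    parser = RobotFileParser()
    parser.parse(body.decode('utf-8', errors='ignore').splitlines())
    # CPython keeps the `User-agent: *` group in default_entry, not in entries,
    # so both places are checked here.
    return bool(parser.entries) or parser.default_entry is not None
```

Both `entries` and `default_entry` are implementation details of the standard library parser rather than documented API, so this check may need adjusting on other Python versions.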
So interesting realisation: I'm matching I'm removing that for now in the hope that it will remove noise from the robots.txt-only warc. |
Tried it by myself and it is working except it fails when adding
Truncated output using
The warc is |
Huh, odd. Lemme check what's going on |
It seems that is also failing in master, so probably not related to this. |
Fixes #41.
Necessary for #40.
I also did a little bit of clean-up of the code to make it easier to pass options to the WARCProcessor.
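To illustrate the idea of shunting robots.txt records into their own output, here is a minimal Python sketch. It is not the code in this PR: the function names `is_robots_txt` and `shunt` are hypothetical, records are simplified to (url, payload) pairs, and the rule assumed here (the URL path is exactly `/robots.txt`) may differ from what the PR actually matches.

```python
from urllib.parse import urlparse


def is_robots_txt(url):
    """Assumed rule: the URL path is exactly /robots.txt (query string ignored)."""
    if not url:
        return False
    return urlparse(url).path == '/robots.txt'


def shunt(records):
    """Split (url, payload) pairs into (robots_records, other_records)."""
    robots, others = [], []
    for url, payload in records:
        target = robots if is_robots_txt(url) else others
        target.append((url, payload))
    return robots, others


if __name__ == '__main__':
    sample = [
        ('http://example.com/robots.txt', b'User-agent: *\nDisallow:\n'),
        ('http://example.com/index.html', b'<html></html>'),
    ]
    robots, others = shunt(sample)
    print(len(robots), len(others))
```

In a real pipeline the first list would be written to the robots.txt-only WARC and the second would continue through normal processing.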