Add --robotspass shunt for records related to robots.txt #43

Merged
jelmervdl merged 6 commits into bitextor:master from shunt-robotstxt
Nov 2, 2023

@jelmervdl jelmervdl commented Oct 6, 2023

Fixes #41.

Necessary for #40.

I also did a little bit of clean-up of the code to make it easier to pass options to the WARCProcessor.

@jelmervdl

Example output: https://mirror.ikhoefgeen.nl/WIDE-20121115212638-00463.robots.warc.gz

Very rough analysis:

#!/usr/bin/env python3
"""Rough per-host summary of the records in a robots.txt-only WARC."""
from collections import defaultdict, Counter
import sys
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

import warcio


# host -> set of (record type, WARC-Date, Content-Type, has robots entries)
domains = defaultdict(set)


def get_host(url: str) -> str:
	return urlparse(url).hostname


def parse_robots_txt(buffer):
	lines = buffer.read().decode('utf-8', errors='ignore').splitlines()
	parser = RobotFileParser()
	parser.parse(lines)
	return parser


with open(sys.argv[1], 'rb') as fh:
	for record in warcio.ArchiveIterator(fh):
		# Records without HTTP headers have no Content-Type.
		content_type = record.http_headers.get_header('Content-Type') if record.http_headers is not None else None
		# Only text/plain responses are parsed; every other record gets None.
		if record.rec_type == 'response' and content_type == 'text/plain':
			has_entries = len(parse_robots_txt(record.content_stream()).entries) > 0
		else:
			has_entries = None
		domains[get_host(record.rec_headers.get_header('WARC-Target-URI'))].add((
			record.rec_type,
			record.rec_headers.get_header('WARC-Date'),
			content_type,
			has_entries,
		))

print(f'{"domain":<40s}  req  res  rev  rob')
for domain, records in domains.items():
	types = Counter(record[0] for record in records)
	# Counts every record that went through the parser (has_entries is True or False).
	hits = sum(1 for record in records if record[3] is not None)
	print(f'{domain:<40s} {types["request"]: 4d} {types["response"]: 4d} {types["revisit"]: 4d} {hits: 4d}')

req = requests
res = responses with body in warc
rev = response that was similar to previous, no body in warc
rob = parseable robots.txt

domain                                    req  res  rev  rob
www.carparts911.co.uk                      94   94    0    0
focusun.com.cn                            126  126    0    0
building.ua                                48   48    0    0
www.worldwidefred.com                      36   36    0    0
thisistoday.com                            47   47    0    0
farsibuy.com                              139  139    0    0
shakespearetheatre.org                      1    0    1    0
www.somerbyatstvincents.com                58   58    0    0
www.essen24.ch                             69   69    0    0
events.jeepforum.com                      137  137    0    0
0570ts.com                                107    0  107    0
livegoldprices.com                          1    0    1    0
securitor.com.au                          120  120    0    0
www.fogyas.info                            24   24    0    0
www.michaelsappliance.com                 101    0  101    0
www.1800wellness.org                      125  125    0    0
wetter.news.at                             68   68    0    0
www.amperoase.de                           96   96    0    0
yx56.cn                                   122  122    0    0
www.haiguangpharm.com                     119  119    0    0
media.nationaltimes.com.au                 88   88    0    0
galerias.autocosmos.com.co                 96   96    0    0
www.cini.com.br                            34   34    0    0
www.cockpig.com                             1    1    0    0
obivon.bandcamp.com                         1    1    0    1
www.britishsnoring.co.uk                    1    0    1    0
www.premioaberje.com.br                    65   65    0    0
www.bvoe.at                                68    0   68    0
m.thisistoday.com                          45   45    0   45
inkwin.cn                                 124  124    0    0
programma.ntr.nl                           81   81    0    0
articles9000.com                           21   21    0    0
www.starlight-express-musical.de           86   86    0    0
search.sterlingplumbing.com                 6    6    0    0
le-port.jp                                133  133    0    0
www.greedge.com                           121  121    0    0
www.inoprosport.ru                         43   43    0    0
www.donpepino.com                         134  134    0    0
jingcai.ourgame.com                         1    0    1    0
dollarmachine.com                           1    1    0    0
rumetal.ru                                 57   57    0    0
www.evdema.com                             80   80    0    0
maps.sakh.com                              56   56    0    0
corridadalongevidade.com.br                54   54    0    0
ekorat.pl                                  39   39    0    0
www.pontoabc.com                            1    1    0    0
liaoyuan.laoke.com                          1    0    1    0
www.cantastic.co.uk                         1    0    1    0
www.rubancollectif.fr                       1    1    0    0
localracing.nascar.com                      1    0    1    0
g8kn.rtsdn.nl                               1    0    0    0
inoprosport.ru                             48   48    0    0
parcuri.auto.ro                             1    0    1    0
www.balenciaga-handbags-replica.com         1    1    0    0
www.wiscbankandtrustonline.com              1    1    0    0
www.toolzone.ru                             1    1    0    0
www.supermicro.co.uk                       60   60    0    0
conexaoaeroporto.com.br                    55   55    0    0
toolzone.ru                                 1    1    0    0
www.inboedelverzekeringen.com               1    1    0    1
www.alfaromeo.cz                            1    0    1    0
www.supplementklinik.com                    1    1    0    0
www.rccrawler.jp                            1    1    0    0
kyliehuisman.waarbenjij.nu                  1    1    0    0
www.ahrd.gov.cn                             1    0    1    0
viagrafrance.eklablog.fr                    1    1    0    0
cnc.nb591.com                               1    1    0    0
www.articles9000.com                       33   33    0    0
paseosenuruguay.com.uy                      1    1    0    1
www.bonsaime.com                            1    0    1    0
mobilidola.com                              1    1    0    0
lagentille.illustrateur.org                 1    0    1    0
dl.ambiweb.de                               1    0    1    0
www.leisurefitness.com                      7    7    0    0
www.la-ferme-aux-anes.com                   1    1    0    0
lokin510051.net                             1    1    0    0
www.bristolcc.edu                           1    0    1    0
67635.zipslocal.com                         1    1    0    1
cdn.originationnews.com                     1    1    0    0
www.pillazorras.com                         1    0    1    0
transparencia.dgop.cl                       1    0    1    0
nurserybeddingforsale.com                 106  106    0    0
yiwu.laoke.com                              1    0    1    0
partnerjuarez.com                           1    1    0    0
www.lafca.net                               1    0    1    0
tom-wessels.waarbenjij.nu                   1    1    0    0
achatpolosralphlauren.webnode.fr            1    1    0    0
d1003kuyu2j7ac.cloudfront.net               1    1    0    0
19953.zipslocal.com                         1    1    0    1
i3-pack-viewer.win.soft32download.com       1    1    0    1
ylwfrnd.bandcamp.com                        1    1    0    1
www.topcollezionehogan.com                  1    1    0    0
ymj.66good.com                              1    0    1    0
baseballmusings.com                         1    1    0    0
fzfl.66good.com                             1    0    1    0
xie.66good.com                              1    0    1    0
www.ttarchive.com                           1    0    1    0
teebag.bandcamp.com                         1    0    1    0
snailo.com                                  1    1    0    0
www.snailo.com                              1    1    0    0
smdigitalrec.bandcamp.com                   1    1    0    1
fremt.vlsu.ru                              49   49    0    0
haberresim3.borsagundem.com                 1    0    1    0
img1.yxyun.com                              1    0    1    0
chapman-design.com                          1    1    0    0
jlmarclefebvre.com                          1    1    0    0
07763.zipslocal.com                         1    1    0    1
pleinnargent.site.voila.fr                  1    1    0    0
pleinnargent.voila.net                      1    1    0    0
therealside-b.bandcamp.com                  1    1    0    1
67030.zipslocal.com                         1    1    0    1
www.realestatejuice.com                    93   93    0    0
ros-bis.ru                                  1    0    1    0
andrewnacin.com                             1    0    1    0
www.lorealmenexpert.com.au                  1    1    0    0
67585.zipslocal.com                         1    1    0    1
66250.zipslocal.com                         1    1    0    1
profictking.com                             1    0    0    0
jp.bluebellgroup.com                        9    9    0    0
zswj.jmw.com.cn                             1    0    1    0
for-yota.ru                                 1    0    1    0
www.mkqs.us                                 1    0    1    0
osterbeatmaker.bandcamp.com                 1    0    1    0
www.on-camera-audiences.com                38   38    0    0
www2.lhric.org                              1    0    1    0
www.chefscatalog.com                        1    1    0    1
07303.zipslocal.com                         1    1    0    1
www.lik-m.ru                                1    1    0    1
www.spot5.ru                              108    0  108    0
galeri.borsagundem.com                      1    0    1    0
woodandrags.com                             1    1    0    0
www.flexaodebraco.info                      1    0    0    0
www.assembleia.go.gov.br                    4    4    0    0
08732.zipslocal.com                         1    1    0    1
nashville.craigslist.org                    1    0    1    0
www.photoschau.de                           1    0    0    0
eugene.craigslist.org                       1    0    1    0
phoenix.craigslist.org                      1    0    1    0
tampa.craigslist.org                        1    0    1    0
www.craigslist.org                          5    0    5    0
cafebook.free.fr                            1    0    1    0
portland.craigslist.org                     1    0    1    0
www.iranski.com                             1    0    1    0
hom-12.montadalhilal.com                    1    1    0    1
columbus.craigslist.org                     1    0    1    0
www.seedcoregroup.com                       1    0    1    0
www.shelkovitsa.ru                          1    0    1    0
66411.zipslocal.com                         1    1    0    1
3852125.i.tiexue.net                        1    0    1    0
schnusekater.de                             1    0    1    0
would-be-messiahs.bandcamp.com              1    1    0    1
pc7.2ch.net                                 1    0    1    0
designevilstarvirus.free.fr                 1    1    0    0
myrecipecollection.info                     1    0    1    0
www.furleypage.co.uk                        1    0    1    0
bsos.gratisforo.es                          1    1    0    1
www.rekatee.com                             1    0    1    0
urbanwavesrecords.bandcamp.com              1    0    1    0
www.deckyachting.at                        72   72    0    0
www.slovenskenovice.si                      1    0    1    0
d25a50wq0hgskv.cloudfront.net               1    1    0    0
3038779.ic98.com                            1    1    0    0
m.sodahead.com                              1    0    1    0
olimpo.gratisforo.es                        1    1    0    1
system.of.etienne.free.fr                   1    0    1    0
www.deckyachting.de                        58    0   58    0
3004567.ic98.com                            1    1    0    0
3175838.ic98.com                            1    1    0    0
estosipuedo.com                             1    1    0    0
guerrillerosclan.gratisforo.es              1    1    0    1
www.tilegoddess.com                         1    1    0    0
ethicsorganization.com                      1    1    0    0
www.insidehome.gr                           1    0    1    0
www.voli-aerei.it                           1    0    1    0
123-reg-suspended.co.uk                     1    1    0    0
little-arcadia.gratisforo.es                1    1    0    1
www.deckyachting.it                        65   65    0    0
www.securfaq.usenet.eu.org                 52   52    0    0
www.deckyachting.com                       55   55    0    0
otan.fractal.free.fr                        1    1    0    0
mohe-casm.edu.eg                            1    1    0    0
cdn.auto.ro                                 1    1    0    0
shop.web-deal.fr                            1    1    0    1
stephanemontabert.blog.24heures.ch          1    1    0    0
listmx.mindseyesociety.org                  1    0    1    0
de.gamelicker.com                           1    0    1    0
totalef1.fr.cr                              1    1    0    1
67747.zipslocal.com                         1    1    0    1
www.glowkitchen.com                         1    0    1    0
maisbeiramar.com                            1    0    1    0
jenome.bandcamp.com                         1    1    0    1
www.1000ps.de                               1    1    0    1
sexshopinhouse.com                          1    1    0    0
www.apdaparkinson.org                       1    0    1    0
www.portalmegasound.com.br                  1    1    0    1
thetechiris.com                             1    1    0    0
sc.offcn.com                                1    0    1    0
www.sgf-branson-airport.com                 1    0    1    0
67950.zipslocal.com                         1    1    0    1
accounts.wowcore.com                        1    0    1    0
renepaulhenry.free.fr                       1    0    1    0
blog.aloharag.com                           1    0    1    0
luzhou.offcn.com                            1    1    0    0
yaan.offcn.com                              1    1    0    0
mianyang.offcn.com                          1    1    0    0
www.deckyachting.ru                        58   58    0    0
salutesulweb.paginemediche.it               1    1    0    1
ganzi.offcn.com                             1    1    0    0
www.010huashi.com                           1    0    1    0
www.prace-doma.majestat.cz                  1    1    0    0
www.umograf.com                             1    0    1    0
www.energiehoch3.de                         1    0    1    0
www.naplescondoboutique.com                 1    1    0    0
joannespoelstra.waarbenjij.nu               1    1    0    0
www.adjyc.com                               1    0    1    0
live.charbroil.com                          1    0    1    0
coup2pouces.xooit.fr                        1    0    1    0
www.travel-monster.com                      1    0    1    0
add-anime.net                              16   16    0    0
08067.zipslocal.com                         1    1    0    1
caitlincanty.com                            1    1    0    0
princegeorgesplanning.org                   1    0    0    0
www.sru.edu                                 1    0    1    0
zigong.offcn.com                            1    1    0    0
www.kippetjetok.nl                          1    0    1    0
collectfavorites.com                        1    1    0    0
www.nsc-zwickau.de                          1    1    0    1
67124.zipslocal.com                         1    1    0    1
www.hnu.edu                                 1    1    0    1
tv8.2ch.net                                 1    0    1    0
busca.gospelmais.com.br                     1    1    0    1
www.momsarchive.com                         1    0    1    0
www.blacklooks.org                          1    0    1    0
www.euskalsurf.com                          3    3    0    0
myspc.southplainscollege.edu                1    0    1    0
www.moveoutcleaning.melbournebd.com.au      1    1    0    1
nascar.com                                  1    0    1    0
cbq.com.qa                                  4    4    0    0
doubfly.joinbbs.net                         1    1    0    0
jm-heysay.joinbbs.net                       1    1    0    0
links.gospelmais.com.br                     1    0    1    0
www.sudosupply.ru                           1    1    0    1
www.bobostory.cn                            1    1    0    0
www.feiraartigo.com.br                      1    1    0    0
redclassics-guild.com                       1    1    0    0
gratis-xxxporno.gratisxxxporno.nl           1    1    0    1
gratis-dildosex.gratisxxxporno.nl           1    1    0    1
www.kivacarbon.fr                           1    0    1    0
5007228.i.tiexue.net                        1    1    0    0
www.nestle-nespresso.com                    1    0    1    0
www.cheapslabs.com                          1    0    1    0
mikefreeman.myshopify.com                   1    0    1    0
www.forwardbyelysewalker.com                1    0    1    0
cncworld.org                                1    0    1    0
www.greensboronewschannel.com               1    1    0    0
zhaoqing.laoke.com                          1    0    1    0
07028.zipslocal.com                         1    1    0    1
www.southgatebath.com                       1    1    0    0
www.luxurywatches-boutique.com              1    1    0    0
www.ypresrally.com                          1    0    1    0
m.381382.com                                1    1    0    0
www.nishitacarrentals.com                   1    1    0    0
rubenaussie.waarbenjij.nu                   1    1    0    0
www.preisvergleich.ch                       1    0    1    0
mybabytrip.free.fr                          1    0    1    0
www.adlc.ca                                 1    0    1    0
mail.sid.edu.in                             1    0    1    0
www.mountainviewgrand.com                   4    4    0    0
08106.zipslocal.com                         1    1    0    1
photos.skylersamuels.org                    1    1    0    0
www.decorativecollective.com                1    0    1    0
airforceonebase.com                         1    1    0    0
www.article-directoryblog.com               1    1    0    0
store.gama-go.com                           1    0    1    0
kete.net.nz                                 1    0    1    0
www.stardustfurnishings.com                 1    0    1    0
paint.im                                    1    1    0    0
camillement.fan2.fr                         1    1    0    1
www.furniture-egoparis.com                  1    0    1    0
66402.zipslocal.com                         1    1    0    1
67638.zipslocal.com                         1    1    0    1
oshibori-art.com                            1    0    1    0
www.deratyzacja-dezynsekcja.pl              1    0    1    0
blackmilkbeatmakers.bandcamp.com            1    0    1    0
19731.zipslocal.com                         1    1    0    1
www.riverisland.com                         1    1    0    1
search.invasivespeciesinfo.gov              1    1    0    0
www.agara.co.jp                             1    0    1    0
www.zdnet.com                               1    0    1    0
08327.zipslocal.com                         1    1    0    1
www.pro-komm.de                             1    0    1    0
soa.uiyi.cn                                 1    1    0    0
www.autofacts.co.za                         1    1    0    0
kannapolis.wbtv.com                         1    0    1    0
ka.9v8v.com                                 1    0    1    0
comcastmi.shoutpost.com                     1    0    1    0
www.educasexo.com                           1    1    0    1
www.tblog.com                               1    0    1    0
gsph.com                                    1    1    0    0
static.jemontremalingerie.com               1    0    1    0
www.micromax.tv                             1    0    0    0
bestcomcastfl.shoutpost.com                 1    0    1    0
www.jemontremalingerie.com                  1    0    1    0
19938.zipslocal.com                         1    1    0    1
cecf.org                                    1    0    1    0
forums.familyfriendpoems.com                1    1    0    1
www.familyfriendpoems.com                   1    0    1    0
radiou.free.fr                              1    0    1    0
19962.zipslocal.com                         1    1    0    1
mofa.gov.bd                                 1    0    1    0
gypsyhawk.bandcamp.com                      1    0    1    0
www.huitongtian.com                         1    1    0    1
www.repowatch.org                           1    0    1    0
velvetconnect.com                           1    0    1    0
lemire.jb.free.fr                           1    0    1    0
www.bamteli.tv                              1    1    0    0
www.j-edu.cn                                1    1    0    0
www.moderne-frauen.de                       1    1    0    1
irish-music.fan2.fr                         1    1    0    1
www.parkersbarbershop.com                   1    1    0    0
www.cactusinfo.net                          1    0    1    0
www.leedsescortsservice.co.uk               1    1    0    0
l2-crash.ru                                 1    1    0    0
rodolphe.breton.free.fr                     1    1    0    0
www.ijstv.com                               1    0    1    0
www.dotcomresearch.com                      1    0    1    0
www.unitedwayyuma.org                       1    1    0    0
www.studiotech.hu                           1    0    1    0
nx12447a.hosting.net.vodafone.pt            2    2    0    0
www.busybeedaycare.homestead.com            1    0    0    0
busybeedaycare.homestead.com                1    0    1    0
07752.zipslocal.com                         1    1    0    1
www.bulledebb.com                           1    0    1    0
natashanassar.com                           1    0    1    0
www.deckyachting.hu                        19    0   19    0
www.elamaule.cl                             1    0    1    0
larahabold.waarbenjij.nu                    1    1    0    0
www.beyondsalmon.com                        1    0    1    0
024.rc51.com.cn                             1    0    1    0
www.holidaymunnar.com                       1    0    1    0
qq10086.joinbbs.net                         1    1    0    0
forumanimal.com                             1    1    0    1
weibo.joinbbs.net                           1    1    0    0
97504.zipslocal.com                         1    1    0    1
77027.zipslocal.com                         1    1    0    1
chrisbrownfrance.com                        1    0    1    0
by28.by28.com                               1    0    1    0
www.lupy-optika.cz                          1    0    1    0
milfsexxxtube.com                           1    0    1    0
19720.zipslocal.com                         1    1    0    1
bns07.joinbbs.net                           1    1    0    0
www.freesblogs.com                          1    0    1    0
15265.zipslocal.com                         1    1    0    1
5340174.i.tiexue.net                        1    1    0    0
www.coolnorthface.zoomshare.com             1    1    0    0
www.landbigfish.com                         1    0    1    0
pasjaswiata.pl                              1    1    0    0
us.bulgari.com                              1    0    1    0
98166.zipslocal.com                         1    1    0    1
www.mills-reeve.com                         1    0    1    0
5363617.i.tiexue.net                        1    1    0    0
78708.zipslocal.com                         1    1    0    1
xqhdw.joinbbs.net                           1    1    0    0
www.regus.com.mt                            1    1    0    1
rayman.ubi.com                              1    0    1    0
15276.zipslocal.com                         1    1    0    1
www.healthplusinsights.co.uk                1    1    0    1
betweenthecitiesarestars.bandcamp.com       1    0    1    0
mamievandorenshow.free.fr                   1    0    1    0
nordik.ch                                   1    1    0    0
15210.zipslocal.com                         1    1    0    1
www.hyuna-hk.joinbbs.net                    1    1    0    0
taesunhk.joinbbs.net                        1    1    0    0
78741.zipslocal.com                         1    1    0    1
hey-say-city.joinbbs.net                    1    1    0    0
98175.zipslocal.com                         1    1    0    1
www.jafcw.net                               1    1    0    0
19805.zipslocal.com                         1    1    0    1
5298693.i.tiexue.net                        1    1    0    0
1909845.i.tiexue.net                        1    0    1    0
78756.zipslocal.com                         1    1    0    1
www.michael-buser.de                        1    0    1    0
piwik.blaser.de                             1    0    0    0
krystalhkfc.joinbbs.net                     1    1    0    0
78748.zipslocal.com                         1    1    0    1
www.gem-tang-hk.joinbbs.net                 1    1    0    0
www.aece.ro                                 1    0    1    0
www.blog.designkoma.com                     1    0    1    0
15207.zipslocal.com                         1    1    0    1
pic.mytophome.com                           1    0    1    0
www.prestistolenguitars.com                 1    1    0    0
www.creativityworks.info                    1    1    0    0
78745.zipslocal.com                         1    1    0    1
taeyeon520.joinbbs.net                      1    1    0    0
88bei.joinbbs.net                           1    1    0    0
98181.zipslocal.com                         1    1    0    1
www.themuseat269.com                        1    0    1    0
pakdashtsport.com                           1    1    0    1
practice.joinbbs.net                        1    1    0    0
www.ryeowookhk.joinbbs.net                  1    1    0    0
55101.zipslocal.com                         1    1    0    1
gaytube-twinks.com                          1    1    0    0
thierry.stora.free.fr                       1    0    1    0
78713.zipslocal.com                         1    1    0    1
55145.zipslocal.com                         1    1    0    1
enterthespiral.bandcamp.com                 1    1    0    1
15290.zipslocal.com                         1    1    0    1
3898712.i.tiexue.net                        1    1    0    0
2415702.i.tiexue.net                        1    1    0    0
www.extremesexreel.com                      1    0    1    0
51jzs.joinbbs.net                           1    1    0    0
15218.zipslocal.com                         1    1    0    1
15242.zipslocal.com                         1    1    0    1
guneyyapiizolasyon.com.tr                   1    0    1    0
www.unichim.su                              1    1    0    1
northerntorment.bandcamp.com                1    1    0    1
15224.zipslocal.com                         1    1    0    1
www.surfernetwork.com                       1    1    0    1
19885.zipslocal.com                         1    1    0    1
3293839.i.tiexue.net                        1    1    0    0
39121.zipslocal.com                         1    1    0    1
www.inmybooks.com                           2    2    0    0
55130.zipslocal.com                         1    1    0    1
55112.zipslocal.com                         1    1    0    1
www.haywirecustomguitars.com                1    0    1    0
benetton-stv.ru                             1    0    0    0
www.ask-kurd.com                            1    1    0    0
arevera.ru                                  1    0    1    0
eindrah.bandcamp.com                        1    0    1    0
hardcorechristianity.com                    1    1    0    1
venetz.ru                                   1    0    0    0
15202.zipslocal.com                         1    1    0    1
15220.zipslocal.com                         1    1    0    1
elsemiekdouma.waarbenjij.nu                 1    1    0    0
98189.zipslocal.com                         1    1    0    1
hermesgraphite.zoomshare.com                1    1    0    0
52xdt.joinbbs.net                           1    1    0    0
kzddcr.joinbbs.net                          1    1    0    0
www.sefure.oplympus.com                     1    1    0    0
baiwei.joinbbs.net                          1    1    0    0
fighting.joinbbs.net                        1    1    0    0
55127.zipslocal.com                         1    1    0    1
www.gsmdoki.com                             1    1    0    1
m.williamtravisjewelry.com                  1    0    0    0
www.sepiva.es                               1    1    0    0
15277.zipslocal.com                         1    1    0    1
traders-company.com                         1    1    0    1
15233.zipslocal.com                         1    1    0    1
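
A note on the script above: it records `len(parser.entries) > 0` for each parsed response, while the `rob` column counts every `text/plain` response that went through the parser. In CPython's `urllib.robotparser`, as far as I can tell, the `User-agent: *` group is stored in `default_entry` rather than in `entries`, so a robots.txt with only a wildcard group has an empty `entries` list. A minimal sketch of a stricter check, assuming these undocumented attributes keep behaving this way (`has_rules` is a hypothetical helper, not part of the script or of warc2text):

```python
from urllib.robotparser import RobotFileParser


def has_rules(robots_txt: str) -> bool:
    """Return True if the text parsed into at least one user-agent group."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Assumption: `entries` and `default_entry` are undocumented CPython
    # attributes. Groups for named agents land in `entries`; the
    # `User-agent: *` group is kept separately in `default_entry`.
    return bool(parser.entries) or parser.default_entry is not None


wildcard_only = "User-agent: *\nDisallow: /private\n"
named_agent = "User-agent: examplebot\nDisallow: /\n"
not_robots = "<html><body>Not found</body></html>\n"

for sample in (wildcard_only, named_agent, not_robots):
    print(has_rules(sample))
```

Checking both attributes would treat wildcard-only files as parseable, while an HTML error page served as `text/plain` would still be rejected.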

@jelmervdl jelmervdl requested review from ZJaume and lpla October 10, 2023 14:58
@jelmervdl

So interesting realisation: I'm matching /robots.txt with a query string as well, but that doesn't make much sense. That would never be the case when it is requested for crawling purposes. It would only happen when it is directly linked somewhere.

I'm removing that for now in the hope that it will remove noise from the robots.txt-only warc.
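
The narrowing described above can be sketched roughly as follows. This is a Python illustration only, not the C++ code in src/warcpreprocessor.cc; `is_robots_txt_url` is a hypothetical name, and matching on the exact path `/robots.txt` is an assumption.

```python
from urllib.parse import urlsplit


def is_robots_txt_url(url: str) -> bool:
    """Return True only for a plain /robots.txt URL without a query string."""
    parts = urlsplit(url)
    # Assumption: crawlers request exactly "/robots.txt"; a query string
    # suggests the URL was linked from somewhere rather than fetched for crawling.
    return parts.path == "/robots.txt" and parts.query == ""


for url in (
    "http://example.com/robots.txt",
    "http://example.com/robots.txt?lang=en",
    "http://example.com/docs/robots.txt",
):
    print(url, is_robots_txt_url(url))
```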

Review comment on src/warcpreprocessor.cc (outdated, resolved)
@ZJaume

ZJaume commented Nov 2, 2023

I tried it myself and it works, except that it fails when --paragraph-identification is added.

./warc2text/build/bin/warc2text -o output/ --tag-filters warc2text-runner/mt-filter-list.annotated --url-filters warc2text-runner/url-filter-list.annotated --paragraph-identification --robotspass robots.warc.gz WIDE-20180405202949-00696.warc.gz
[2023-11-02 15:13:05.251716] [info] Processing WIDE-20180405202949-00696.warc.gz
terminate called after throwing an instance of 'std::logic_error'
  what():  basic_string::_M_construct null not valid
fish: “./warc2text/build/bin/warc2text…” terminated by signal SIGABRT (Abort)

Truncated output using -v:

[2023-11-02 15:12:40.164412] [trace] Processing HTML document http://perm.ekb-tuning.ru/brands/Aragon/a70101b271810b66fdc35a5d4ff5e058/371c48cea8e29e733dde6f48e63559cc
[2023-11-02 15:12:40.164499] [trace] Processing HTML document http://www.livingfoodz.com/recipes?combine=&field_cuisine_tid=All&field_type_tid=All&field_user_type_value=All&field_ingredients_field_item_tid=4442,1522
[2023-11-02 15:12:40.164569] [trace] Processing HTML document http://codestracker.com/code/LA373649996CN
[2023-11-02 15:12:40.164633] [trace] Processing HTML document http://www.pauls-praxis.de/index.html;jsessionid=7C50FB2C3B2C2A9FCBE5FD4E216704C7.tcn-6?_random=1027625027
[2023-11-02 15:12:40.166615] [trace] Processing HTML document http://www.sammler.ru/index.php?s=7448a11805cbcb31907d0e9149b46bcd&showforum=274
[2023-11-02 15:12:40.167768] [trace] Record http://www.sammler.ru/index.php?s=7448a11805cbcb31907d0e9149b46bcd&showforum=274: language not detected
[2023-11-02 15:12:40.168988] [trace] Processing HTML document http://www.weektube.com/ko_pool_longest_4/
[2023-11-02 15:12:40.175526] [trace] Processing HTML document https://www.livesex.com/sk/girls/prstovanie%20latin%20pornohviezdy/
[2023-11-02 15:12:40.176454] [trace] Processing HTML document https://onesoci.com/js/vendor/css/js/plugins/ng-flow/js/plugins/ng-flow/js/services/js/plugins/chartjs/js/plugins/dropzone/css/img/favIcon_oneSoci.ico
[2023-11-02 15:12:40.176601] [trace] Record https://onesoci.com/js/vendor/css/js/plugins/ng-flow/js/plugins/ng-flow/js/services/js/plugins/chartjs/js/plugins/dropzone/css/img/favIcon_oneSoci.ico: language not detected
fish: “./warc2text/build/bin/warc2text…” terminated by signal SIGSEGV (Address boundary error)

The warc is hplt/data/wide00016/WIDE-20180405202949-crawl813/WIDE-20180405202949-00696.warc.gz.

@jelmervdl

Huh, odd. Lemme check what's going on

@ZJaume

ZJaume commented Nov 2, 2023

It seems it also fails on master, so it is probably not related to this PR.

@jelmervdl jelmervdl merged commit 5ce5c74 into bitextor:master Nov 2, 2023
@jelmervdl jelmervdl deleted the shunt-robotstxt branch November 2, 2023 17:05
Successfully merging this pull request may close these issues.

Shunt robots.txt responses to separate warc