Investigate crawling sitemaps #41
What might modifications to ultimate-sitemap-parser look like?
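For context, a minimal sketch of how the library is driven today, based on usp's documented entry point (the homepage URL is a placeholder); any modification would presumably extend or replace this behavior:

```python
from usp.tree import sitemap_tree_for_homepage

# Discovers sitemaps via robots.txt and fetches/parses them during this call.
tree = sitemap_tree_for_homepage("https://example.com/")  # placeholder URL

for page in tree.all_pages():
    # news_story is populated when the <url> entry carried Google News tags
    print(page.url, page.last_modified, page.news_story)
```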
continuing... fetching/parsing is done by the
(Start of) Summary of issues:
NOTES:
Thank you for this!
I'm also reading that scheduling and de-duping are both going to involve a different approach than in the rss-fetcher. Is that a fair gloss?
Yes, I think you have the gist. My first observation was that, "to a zeroth-order approximation" (for what little that's worth), a sitemap page (of either kind) could be viewed as an RSS page to poll. Whether this means a single feeds table or not is an open question; feeds in the rss-fetcher are kept in 1-to-1 correspondence with feeds in the web-search (mcweb) database. A "bells and whistles" implementation might be that a human, using a web UI:
And initially, the above would be done manually, for a few sources, until we learn about the (currently) unknown unknowns.

Then, for sites that have LONG backlogs of old articles (decades), there's the question of whether we can eliminate "index" pages (feeds) for historical news we don't want to fetch, lightening our polling load, and/or whether it's possible to filter on the article URL alone (which requires that the date appear in the article URL).

I think the rss-fetcher polling infrastructure is adaptable, though again, whether sitemap fetching and rss fetching should be in a single process (in case a site has both, and we want to be kind across page types) or not is open... Code-wise, sitemap parsing is simple/small, and the scheduling infrastructure needs are very similar (with different parameters, like min and max interval).

Google news tags are (hopefully) a good indicator of links that should be considered news (and might not even require human intervention to vet the links), BUT if a site follows the best practices above, those pages would be more volatile (tags are kept for a limited time only) and would require faster polling than other sitemap pages.

And finally, yes, the "low end" dedup we currently perform is likely to behave badly for a site whose sitemap pages for historical articles are not COMPLETELY static (i.e., they change header data, so the page checksum changes). A "higher end" dedup database could be a MongoDB cluster co-resident on the ES servers; the minimum information would be an index (by initial URL) of trivial (size zero?) objects. A middle path (now that we have ES running) would be another set of ES indices (managed similarly to the "search" indices) keyed by initial URL, with no searchable/indexed fields (see the sketch below). An enhanced version of either of the above would be to keep some status/timestamp information for each initial URL: instead of dropping stories in the pipeline, we could queue them to a worker that updates the status (and "fork" a copy after an article is indexed). The ability to send a pipeline worker's output to multiple input queues is already built into the system.... Some past and recent thoughts about URL retention for dedup are in #25
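A hedged sketch of that "middle path": a dedicated ES index keyed by the initial URL, with mapping disabled so nothing is searchable/indexed and each document stays tiny. The index name and client wiring are illustrative assumptions (elasticsearch-py 8.x style), not the project's actual code:

```python
from hashlib import sha256
from elasticsearch import Elasticsearch, ConflictError

es = Elasticsearch("http://localhost:9200")  # assumed ES endpoint
DEDUP_INDEX = "mc_seen_urls"                 # hypothetical index name

def ensure_dedup_index() -> None:
    # "enabled": false stores _source but indexes no fields at all.
    if not es.indices.exists(index=DEDUP_INDEX):
        es.indices.create(index=DEDUP_INDEX, mappings={"enabled": False})

def first_sighting(initial_url: str) -> bool:
    """Return True if this URL has not been seen before; records it as seen."""
    doc_id = sha256(initial_url.encode()).hexdigest()
    try:
        # op_type="create" fails with 409 if the id already exists, making
        # check-and-record a single atomic call.
        es.index(index=DEDUP_INDEX, id=doc_id, document={}, op_type="create")
        return True
    except ConflictError:
        return False
```

Using the document id itself as the key means the "index by initial URL of trivial (size zero?) objects" costs little more than the id storage, and the same conflict-on-create trick would work with MongoDB's unique index.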
Some questions:
Many sites have a single google news tag enhanced (urlset) sitemap page, which is functionally equivalent to an rss feed:
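For illustration, a generic example of such an entry (not the snippet originally attached to this comment); the news: namespace tags are what distinguish it from a plain urlset:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://example.com/2024/06/01/some-story.html</loc>
    <news:news>
      <news:publication>
        <news:name>Example Gazette</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:publication_date>2024-06-01T08:00:00Z</news:publication_date>
      <news:title>Some Story Headline</news:title>
    </news:news>
  </url>
</urlset>
```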
What I've done:
The most likely next step would be:
This issue is for discussion/documentation of crawling site maps for story URL discovery.
I'm creating it in the rss-fetcher repo, since I think:
Past work on sitemap parsing exists in https://github.com/mediacloud/ultimate-sitemap-parser (so the issue could be there), but my impression from attempting to use it is that it:
Or the issue could be in story-indexer (since it's the big/active repo).