WARC/1.1 annotated: hopsFromSeed metadata field #59
ato committed Jun 22, 2022
1 parent bccb0e7 commit b8c4108
Showing 1 changed file with 19 additions and 0 deletions.
19 changes: 19 additions & 0 deletions specifications/warc-format/warc-1.1-annotated/index.md
@@ -1110,6 +1110,25 @@ optional.
- 'fetchTimeMs': time in milliseconds that it took to collect the
archived URI, starting from the initiation of network traffic.

> **Community recommendation:** #59
> The `hopsFromSeed` field comes from the [discovery path](https://heritrix.readthedocs.io/en/latest/glossary.html#discovery-path)
> concept in the Heritrix web crawler. The value is a string containing
> one character for each hop followed from the seed, where each character
> identifies the type of hop (see the table below). For example, "LLLE"
> might be an image embedded in a page that is three links away from
> a seed. The value of `hopsFromSeed` for a seed URI should be the
> empty string.
>
> | Symbol | Meaning | Examples |
> |--------|----------------------------------------------------------|----------------------------------------------------------|
> | `L` | Link | `<a href=...>` |
> | `E` | Embedded | `<img src=...>`<br>`<script src=...>` |
> | `X` | Speculative embed | `<script>var url = 'http://example.org/foo.js';</script>` |
> | `R` | Redirect | `HTTP/1.0 302 Found`<br>`Location: ...` |
> | `P` | Prerequisite | robots.txt, DNS lookup |
> | `I` | Implicit/Implied | favicon.ico |
> | `M` | Manifest | URLs listed in sitemap files |
> | `S` | Form submission | `<form action=...>` |
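
As an illustration of how a consumer of WARC metadata records might interpret this field, the following is a minimal Python sketch that decodes a `hopsFromSeed` value using the symbols in the table above. The names `HOP_TYPES`, `describe_hops` and `is_seed` are hypothetical and are not part of the WARC specification or of Heritrix. Rejecting symbols that are not in the table is an assumption made for this sketch, and a real tool may prefer to tolerate them.

```python
from typing import List

# Symbol meanings copied from the table above.
HOP_TYPES = {
    "L": "Link",
    "E": "Embedded",
    "X": "Speculative embed",
    "R": "Redirect",
    "P": "Prerequisite",
    "I": "Implicit/Implied",
    "M": "Manifest",
    "S": "Form submission",
}


def describe_hops(hops_from_seed: str) -> List[str]:
    """Return the meaning of each hop, ordered from the seed outwards.

    Assumption for this sketch: any character not listed in the table
    is treated as an error rather than silently ignored.
    """
    unknown = [c for c in hops_from_seed if c not in HOP_TYPES]
    if unknown:
        raise ValueError(f"unrecognised hop symbol(s): {unknown}")
    return [HOP_TYPES[c] for c in hops_from_seed]


def is_seed(hops_from_seed: str) -> bool:
    """A seed URI is recorded with the empty string."""
    return hops_from_seed == ""
```

Under these assumptions, `describe_hops("LLLE")` yields three `Link` entries followed by one `Embedded` entry, and `is_seed("")` is true.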

A 'metadata' record may be associated with other records derived from
the same capture event using the WARC-Concurrent-To header. A 'metadata'
record may be associated to another record which it describes, using the