-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathrvest.qmd
173 lines (129 loc) · 9.91 KB
/
rvest.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
# Scrape data from web pages {#sec-rvest}
## TODO
- Strongly point to R4DS for basics.
- Reconcile chapter with new LOs.
## Introduction
*(introduction will be written \~last)*
### Learning Objectives {.unnumbered}
After you read this chapter, you will be able to:
- Decide whether to scrape data from a web page.
- Use {polite} to responsibly scrape web pages.
- Scrape complex data structures from web pages.
- Scrape content that requires interaction.
### Prerequisites {.unnumbered}
*(prerequisites will be filled in as I write, if I decide to keep this section)*
## Should I scrape this data?
When you see data online that you think you could use, stop to answer these three questions:
- Am I legally allowed to scrape this data?
- Am I allowed to scrape this data automatically?
- Do I need to scrape this data with code?
### Am I legally allowed to scrape this data?
I am an R developer, not a lawyer, and none of this should be construed as legal advice.
If you're going to use the data in a commercial product, you may want to consult a lawyer.
That said, these guidelines should get you started in most cases.
If you're using the data for your own exploration or for nonprofit educational purposes, you're almost definitely free to use the data however you like.
Copyright cases tend to involve either making money off of the work, or making it harder for the owner of the work to use it to make money.
Also check the site for legal disclaimers.
These are usually located at the bottom of the page, or possibly somewhere on the first page of the site.
Look for words or phrases like "Legal," "License," "Code of Conduct", "Usage", or "Disclaimers." Sometimes the site explicitly grants the right to use the data, which will generally supersede any general legal protection.
If you're going to publish (or otherwise share) the data, and you can't find anything on the site given you permission, you'll have to decide if your use case is allowed.
In the United States, facts are not protected by copyright.[^rvest-1]
TODO: DIFFERENT EXAMPLE, EVERYONE USES RECIPES! That's why cook books and online recipes tend to devote as much space (or more) to stories than to the recipes themselves --- the recipes are facts and thus don't have copyright protection in the U.S.
However, a *collection* of facts (such as the data you're trying to scrape) *can* be protected in the U.S. if that collection was selected by a person.
[^rvest-1]: See the [U.S. Copyright Office Fair Use Index](https://www.copyright.gov/fair-use/) for a detailed discussion of the legal definition of fair use in the United States, and an index of related court decisions.
> "These choices as to selection and arrangement, so long as they are made independently by the compiler and entail a minimal degree of creativity, are sufficiently original that Congress may protect such compilations through the copyright laws"[^rvest-2]
[^rvest-2]: [Feist Publications, Inc. v. Rural Telephone Service Co., 499 U. S. 340 (1991)](https://www.law.cornell.edu/supremecourt/text/499/340)
Outside the U.S., protections may be stronger or weaker.
For example, the European Union established specific legal protections for databases in a [directive on the legal protection of databases](https://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=CELEX:31996L0009:EN:HTML).
If you're going to publish the data, investigate legal requirements in your location.
### Am I allowed to scrape this data automatically?
Even if it is *legal* for you to use the data, it might not be *polite* to do so.
For example, the site might have preferences about which pages can be accessed by code, or specific protections or guidelines about certain pages.
Most websites list these preference in a `robots.txt` file at the root of the site.
For example, the `robots.txt` file for the online version of this book is available at <https://jonthegeek.github.io/wapir/robots.txt>.
This file might contain one or a few lines.
```{js}
#| eval: false
#| filename: "https://wapir.io/robots.txt"
# TODO: Replace this with the final version for this book.
Sitemap: https://wapir.io/sitemap.xml
```
*Things to put into this section:*
- robots.txt (search for `"User-agent: *"` and the particular page you want to scrape)
- Licenses & legal usage. IANAL.
- Is it worth scraping? Will {datapasta} cover it?
- Quickly identifying whether it will be hard.
- Look for an API to use instead of scraping. Just introduce the idea then point to that chapter.
- Even when robots.txt allows scraping, give tips on how to be a good citizen.
- {[polite](https://dmi3kno.github.io/polite/index.html)} package? I think *probably* focus more on learning to read robots.txt and how to implement things in rvest (and later httr2), but I should at least mention {polite}.
## How can I scrape a table of data from a web page?
- Basic intro to {rvest}.
- Scrape a static table in this repo where html_table works out of the box. It looks like it'll get a single table without any fuss.
- BRIEFLY touch on data cleaning, but send them to R4DS for more details.
- Show an example with multiple tables & how to deal with that (pre-SelectorGadget).
## How can I scrape more complex data from a web page?
- Using SelectorGadget.
- Start by reproducing the table scrape, without SelectorGadget (basically just do what )
- A second table where html_table doesn't work as well (weird headers or something). Maybe; the next example might be enough.
- A web page with structured content that isn't in a table. Like the Star Wars example in rvest, but someting else. Maybe information about HTTP request methods? Oh, or Xpath rules!
- CSS selectors
- Start by digging into what SelectorGadget produced for the examples above/what it means.
- Add an example where it's way more straightforward to do things sequentially (structured data in a particular cell of a particular table, something like that).
- CSS selectors vs Xpath. Hmm. Which is more straightforward? Probably teach Xpath because it lets you do all the things. Nope, CSS selectors can also do all the things, and they overlap with actual CSS which is useful in its own right! Consider teaching both, at least briefly, and point to MDN or w3schools for more. FWIW, rvest translates everything to xpath via `selectr::css_to_xpath()`.
- If Xpath: Note the difference between `".//"` (search below the current node) and `"//"` (search anywhere in the document). You never want just `"//"` in a pipe, because it ignores previous steps (well, except MAYBE if you're using a selection to go back and find something else)! Probably put this in a call out.
- Maybe note that a lot of {rvest} is reexports from {xml2}?
- Also note the `flatten` argument for `xml2::xml_find_all()`! By default it de-duplicates, so watch out if you're trying to align lists. Make sure this behaves how I think it does, and, if so, provide an example where it matters. This appears to be the only place where I need to bring up {xml2} directly, but probably point it out for further reading.
- `html_attrs()` (list all of the attributes) vs `html_attr()` (get a specific attribute). Similar to `attributes()` vs `attr()`.
- A con of CSS selectors: Can't go back up the tree! I feel like this is probably important in some contexts. If there's a legitimate case where this is true, use that as the reason to focus on Xpath.
## Ideas to cover {.unnumbered}
- Advanced {rvest} techniques.
- Can I \~easily deploy something that requires a session?
- Is the session stuff in {rvest} relatively stable? How might the {chromote} PR impact it?
- Are there options other than {rvest} that I should explore? {RSelenium}, {chromote} worth mentions, or too different?
- Bridge into {httr2}. Don't write this yet, bridges should be written last!
- Larger-than-memory scraping, splitting scrapes, etc.
- Automation.
- Refining selections toward "what" vs "how" & "where", trying to make your selections less fragile.
- Deduplication.
- Error handling.
- Logging? Likely mention and point to do4ds for more details (or just a specific logging package).
- Mention packages for notification that a job is finished (from {beepr} to an email or Slack message... but maybe use those to tease later chapters).
- {targets}?
- At least mention.
- Dig into this to see how quickly I could introduce without it growing into a separate *thing.*
## Another Exploration {.unnumbered}
Suggested by Emir Bilim on DSLC, let them know when it's worked out!
Also include a case like this in the Appendix!
```{r yahoo-finance, eval = FALSE}
library(rvest)
url <- "https://finance.yahoo.com/quote/AAPL/financials?p=AAPL&guccounter=1"
## Annual only for now.
# Cache the read during experimentation so you don't hit it over and over.
raw_html <- rvest::read_html(url)
annual_cells <-
raw_html |>
rvest::html_element(
xpath = ".//div[@class='M(0) Whs(n) BdEnd Bdc($seperatorColor) D(itb)']"
) |>
rvest::html_text2() |>
# The text comes in as a single block separated by newlines (\n).
stringr::str_split_1(stringr::fixed("\n"))
# The first text to come in is the column headers. We want to go through the
# last one that's a date.
annual_date_cell_numbers <- stringr::str_which(
annual_cells, "^\\d{1,2}/\\d{1,2}/\\d{4}$"
)
last_date_cell_number <- max(annual_date_cell_numbers)
annual <-
annual_cells |>
# We're going to transpose the table as we go; rows will become columns,
# columns will become rows. It's tidier that way. The last date is our last
# column.
matrix(nrow = last_date_cell_number, byrow = FALSE) |>
as.data.frame() |>
janitor::row_to_names(row_number = 1) |>
tibble::as_tibble()
# That gets the a good start on the "Annual" data; it just needs normal cleaning
# from there. It does NOT include the breakdowns yet, though. I believe both
# that and Quarterly will require some session fanciness. To be continued!
```