Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create an index, style guide, handle remaining bugs #238

Merged
merged 65 commits into from
Sep 27, 2021
Merged

Conversation

trevorcampbell
Copy link
Contributor

@trevorcampbell trevorcampbell commented Sep 22, 2021

This PR is not done yet.

I've finished everything listed below except the index (waiting for wrangling PR).

To address:

Index

I made one! This required some fixes to enable compile to PDF:

Fixes to make the book compile to PDF

  • need to add figure units to out.width
    • done: added pt
  • the gentoo penguin image in clustering chapter breaks things with percent signs in its name
    • done: i stored a local copy of it instead
  • clustering two-column "center update / label update" spaces use unicode (non-latex) characters
    • done: no-op. xelatex handles unicode characters.
  • img/population_vs_sample.svg -- svgs break pdf output
    • done: i rendered this to png instead
  • unicode characters in the jupyter section
    • done: no-op. xelatex handles unicode characters.
  • duplicate entry in refs.bib (wilson2014best)
    • done: removed old entry
  • Lots of fig.retina = ... in inference.Rmd without actually setting out.width (causes LaTeX to generate width=468px which breaks xelatex because the unit should be pt
    • done: either set width explicitly or removed setting retina
  • Fixed @online entries in references.bib to be @misc

This work closes #120

Style Guide

I created one. It's in README.md. I just made executive decisions on everything in the interest of moving this along, but feel free to raise anything I did there as a discussion item. Also feel free to move the style guide into its own .md file and link from README.md if desired.

The work on the style guide closes #240 -- but note that we still need to do the actual formatting pass to fix it to conform to the style.

Clustering

Fixed the two examples of "failures" in clustering (bad init, bump in elbow plot).
This closes #236

Videos in chapter 2

No-op. These videos were removed (I assume it was the jupyter one). The only videos left are in the additional resources at the end, and those aren't about previewing data in jupyter. The jupyter chapter no longer has any videos.

I've added to the style guide a mention of making sure video links are usable in both pdf and html.
This work closes #189

Database in Chapter 2

Made it clearer that the DB host/username/password info that we use in ch2 is fake. This closes #187

Feedback on multicollinearity

The first paragraph uses terms we have not used in the book before (outliers and collinear predictors) without defining them. Could we roughly hint at what they are here, or say that we will explain them, and their impact below?

agree, done

In the outlier section, should we discuss the simple solution of dropping outliers? Why this can work sometimes, and why sometimes this should be avoided and instead other advanced methods (like median or quantile regression, should be used?)

Nope - I don't think we should go into it in this text. I don't think it's possible to convey it in such a way that it won't cause students to go "oh well I don't like this datapoint, it looks wrong to me" and then just arbitrarily throw out data.

What do unstable coefficients mean, we should explain this term.

I agree, unstable is a bit too technical -- replaced with "unreliable" + explanation instead!

We say "Therefore, when performing multivariate linear regression, it is important to avoid including very linearly related predictors." but this begs the question how do we avoid this? Could we suggest or show some simple ways of looking at this? And then what to do if they see this? For example, in the case of the two measurements, you could drop one, or you could create a new feature from these two where you average them?

Definitely out of scope. I just point the reader to the end of the chapter.

At the end of the multicollinearity section, can we point to resources where the can read more about this in case they are interested or have to deal with this in a more complex case than we present?

See above.

At the end of the outlier section, can we point to resources where the can read more about this in case they are interested or have to deal with this?

agreed, but not worth the effort before CRC subm

This work closes #195

Feedback on predictor design

We write "For example, a dataframe df with two variables—x and y—with a nonlinear relationship between the two variables will not be correctly captured by simple linear regression, as shown in Figure 9.11. " but is "correct" the right word? Maybe "fully" might be better?

agree, done

Can we make it clear that we drop X, when we add Z to the model?

I don't think we need to -- the plot axes are clear that we're plotting z now

Should we use captitals to refer to X, Y and Z here?

I don't see any reason to - lowercase seems fine

what is df, we never define it...

it was defined in the text, but I added a quick printout of df at the beginning

At the end of this section, can we point to resources where the can read more about this in case they are interested or have to deal with this in a more complex case than we present?

I agree, but not worth the effort prior to CRC subm.

This work closes #196

set.seed

  • seed/randomness was previously first introduced at the end of classification 1 for a single off-the-cuff usage in picking 3 observations to create imbalanced data; I thought it was a bit hidden and not clearly described
  • I removed that usage of randomess
  • now the first introduction of randomness is in classification2, where we can justify it clearly saying "OK now we need randomness since we're splitting data"
  • I have now devoted a whole section to randomness and seeds
  • it's now clear that you only set.seed once at the beginning of your analysis
  • I've hidden everywhere in the textbook where we "seed hack" to get results we want (for pedagogical reasons) and just put a seed at the start of the chapter when we load tidyverse, reminding the reader why we do this

This work closes #168

Miscellaneous

  • fixed y axis labels in inference chapter
  • fixed typos throughout
  • cleaned the appendix a bit
  • fixed weird learning objective spacing in chapter 2
  • fixed super long lines in ch2; 80ch line limit
  • moved faithful_plot.* images to img/ folder
  • fixed bad sample in the first example (small_sacramento) in regression1 (there was a point lying basically right on top of the dashed line)

@trevorcampbell trevorcampbell changed the title Makeindex Create an index Sep 22, 2021
@trevorcampbell trevorcampbell changed the title Create an index Create an index, formatting/style guide Sep 22, 2021
@trevorcampbell trevorcampbell mentioned this pull request Sep 23, 2021
@trevorcampbell trevorcampbell removed a link to an issue Sep 24, 2021
@trevorcampbell trevorcampbell linked an issue Sep 24, 2021 that may be closed by this pull request
Copy link
Contributor

@leem44 leem44 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@trevorcampbell Thank you for putting all of this together!! Looks great!

  • The page numbers in the index seem to be off by 2. E.g. "data science": in index says it's on page 5 but really its on page 7
  • I guess we'll need to do another pass and look at all the plots and code that fall off page, and also the plots that are in weird places (e.g. in the middle of text etc)
  • nice addition of the set.seed explanation!
  • a bunch of the figure references display "Figure ??" -- I started pointing them out and then realized that they are in a lot of places so likely this will need to be a part of the formatting pass

intro.Rmd Outdated Show resolved Hide resolved
intro.Rmd Outdated Show resolved Hide resolved
intro.Rmd Outdated Show resolved Hide resolved
intro.Rmd Outdated Show resolved Hide resolved
intro.Rmd Show resolved Hide resolved
inference.Rmd Outdated Show resolved Hide resolved
inference.Rmd Outdated Show resolved Hide resolved
inference.Rmd Show resolved Hide resolved
inference.Rmd Show resolved Hide resolved
version-control.Rmd Outdated Show resolved Hide resolved
@trevorcampbell trevorcampbell merged commit 4eff8b4 into dev Sep 27, 2021
@trevorcampbell trevorcampbell deleted the makeindex branch December 1, 2021 17:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment