choice randomization: better approximation of JR behaviour, fixes #49 #241

Open
wants to merge 11 commits into
base: main

brontolosone
Contributor

Closes #49

I have verified this PR works in these browsers (latest versions):

  • Chromium
  • Firefox

Some related problems remain to be solved:

This brings what Web Forms does more in line with what JavaRosa does (barring #240), and as such fixes the immediate problem of #49.

I felt it was worth it to be verbose with the comments here, so check those out.

This story is not over yet. Depending on whether we deem it OK to change the seed derivation algo, I'd like to make it value type/length agnostic and would just hash the input in its textual form and derive a seed from that hash - see getodk/javarosa#800. And in that case this code will need to be altered again.

changeset-bot bot commented Oct 14, 2024

🦋 Changeset detected

Latest commit: 6628199

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
@getodk/xpath Patch

@brontolosone brontolosone marked this pull request as draft October 14, 2024 17:40
@brontolosone brontolosone force-pushed the 49_randomization_seed_inputs branch 2 times, most recently from 69be0c6 to 6432f97 Compare October 15, 2024 09:26
@brontolosone brontolosone marked this pull request as ready for review October 15, 2024 09:53
Member

@eyelidlessness eyelidlessness left a comment

This is great. I really appreciate how the commentary here tells the why story!

As discussed a bit in Slack, I think a couple of adjustments would make the change clearer. They would either make some of the commentary unnecessary, or make the remaining commentary more useful.

  1. There's a clear "JavaRosa compatibility" responsibility here. While it is inherently coupled to the seededRandomize implementation, it is also very specifically a mapping to a more general concept: longValue. I think it would help immensely for future understanding of what's going on here if we make that an explicit function, with the same name.

  2. In general, I've found liberal use of JSDoc comments (i.e. /** ... */) really helpful. The comment style provides support for all sorts of editor functionality.

    Here, we'd get a lot of benefit from inline linking support (i.e. {@link $URL_OR_REFERENCE} and/or {@link $URL_OR_REFERENCE | more specific title}). In particular I think a permalink to the JavaRosa resolveRandomSeed method would be useful, and probably also links to the pertinent issues. This ties directly to point 1: giving the JavaRosa-compatible thing a name corresponding to the Java thing it emulates also gives a clear place for that JSDoc to reference it and clarify the nuances it's addressing.

    I think a few tweaks to this comment would be pretty much perfect.

  3. We can eliminate the divergences from JavaRosa by using BigInt values for several of these cases. This, combined with their usage in a clearly articulated longValue JR-equivalent would also eliminate the need for commentary on those cases. In my local exploration of this, what I found made the most sense with the least fuss was to change type Int = number to type Int = bigint | number, then have the longValue equivalent also produce that Int type. Pretty much everything else that would need to change falls out of that (i.e. any mixed-type operators producing fractional values do the appropriate explicit Number() casts to preserve those mathematical semantics).

    It'd also be useful for the Infinity/-Infinity cases to be bound to constants with clear names. Insofar as there's still benefit to commentary on those, JSDoc on those constants is a good place.
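
To make point 3 concrete, here is a minimal sketch of a longValue-style helper. It assumes the goal is to mirror Java's double-to-long narrowing conversion (NaN becomes 0, out-of-range values saturate at the long bounds). The constant names JAVA_LONG_MAX and JAVA_LONG_MIN and the exact signature are illustrative only; they are not taken from this PR or from JavaRosa.

```typescript
type Int = bigint | number;

/** Largest value a Java `long` can hold (2^63 - 1); `Infinity` saturates to this. */
const JAVA_LONG_MAX = BigInt('9223372036854775807');

/** Smallest value a Java `long` can hold (-2^63); `-Infinity` saturates to this. */
const JAVA_LONG_MIN = BigInt('-9223372036854775808');

/**
 * Hypothetical helper approximating Java's `(long) someDouble` cast:
 * NaN maps to 0, the fractional part is dropped (truncation toward zero),
 * and anything outside the 64-bit signed range is clamped to the bounds.
 */
const longValue = (value: number): Int => {
  if (Number.isNaN(value)) return BigInt(0);
  if (value === Infinity) return JAVA_LONG_MAX;
  if (value === -Infinity) return JAVA_LONG_MIN;

  // Math.trunc yields an integer-valued number, which BigInt() accepts.
  const truncated = BigInt(Math.trunc(value));

  if (truncated > JAVA_LONG_MAX) return JAVA_LONG_MAX;
  if (truncated < JAVA_LONG_MIN) return JAVA_LONG_MIN;
  return truncated;
};
```

Returning a bigint here keeps values beyond Number.MAX_SAFE_INTEGER exact; any caller that needs fractional arithmetic would convert explicitly with Number().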

Aside from making some of the intent clearer here, I suspect we may find there are other edge cases where we want to cordon off JavaRosa-compat/Java-isms in a general and reusable way. Even if that seems like a premature abstraction, doing it in this case is a direct, 1:1 linkable reference to the existing abstraction we'll be emulating.

Edit: oh, and this definitely feels like it deserves a changeset.

@brontolosone
Contributor Author

This needs updates for:

Drafting!

@brontolosone brontolosone marked this pull request as draft October 24, 2024 10:23
@brontolosone brontolosone force-pushed the 49_randomization_seed_inputs branch from 706f380 to 159f06c Compare December 10, 2024 14:50
@brontolosone brontolosone force-pushed the 49_randomization_seed_inputs branch 2 times, most recently from c435390 to b7cce27 Compare December 16, 2024 20:10
@brontolosone brontolosone force-pushed the 49_randomization_seed_inputs branch 2 times, most recently from 5fa9d75 to 9bb403c Compare December 16, 2024 21:02
@brontolosone brontolosone force-pushed the 49_randomization_seed_inputs branch from 9bb403c to c83fef2 Compare December 16, 2024 21:08
@brontolosone
Contributor Author

This needs updates for:

* incorporating the fallback for non-numeric-looking seed nodes to hash-based seed derivation of [hash un-numeric input when used as PRNG seed, fixes #800 javarosa#801](https://github.com/getodk/javarosa/pull/801)

* incorporating the followup thereof; the behaviour for zero-length strings of [Extend randomize seed tests javarosa#805](https://github.com/getodk/javarosa/pull/805)

Done!

@brontolosone brontolosone marked this pull request as ready for review December 17, 2024 10:20
Member

@eyelidlessness eyelidlessness left a comment

Thanks! I think this is really close. Most of my remaining feedback is around code clarity (naming, separation of responsibilities, accessibility and applicability of comments).

packages/xpath/src/functions/xforms/node-set.ts (outdated, resolved)
Comment on lines 410 to 417
function toBigIntHash(text: string): bigint {
// hash text with sha256, and interpret the first 64 bits of output
// (the first and second int32s ("words") of CryptoJS digest output)
// as a BigInt. Thus the entropy of the hash is reduced to 64 bits, which
// for some applications is sufficient.
// The underlying representations are big-endian regardless of the endianness
// of the machine this runs on, as is the equivalent JavaRosa implementation
// at https://github.com/getodk/javarosa/blob/ab0e8f4da6ad8180ac7ede5bc939f3f261c16edf/src/main/java/org/javarosa/xpath/expr/XPathFuncExpr.java#L718-L726
Member

Suggested change
function toBigIntHash(text: string): bigint {
// hash text with sha256, and interpret the first 64 bits of output
// (the first and second int32s ("words") of CryptoJS digest output)
// as a BigInt. Thus the entropy of the hash is reduced to 64 bits, which
// for some applications is sufficient.
// The underlying representations are big-endian regardless of the endianness
// of the machine this runs on, as is the equivalent JavaRosa implementation
// at https://github.com/getodk/javarosa/blob/ab0e8f4da6ad8180ac7ede5bc939f3f261c16edf/src/main/java/org/javarosa/xpath/expr/XPathFuncExpr.java#L718-L726
/**
 * Hash text with sha256, and interpret the first 64 bits of output (the first
 * and second int32s ("words") of CryptoJS digest output) as a BigInt. Thus the
 * entropy of the hash is reduced to 64 bits, which for some applications is
 * sufficient. The underlying representations are big-endian regardless of the
 * endianness of the machine this runs on, as is the
 * {@link https://github.com/getodk/javarosa/blob/ab0e8f4da6ad8180ac7ede5bc939f3f261c16edf/src/main/java/org/javarosa/xpath/expr/XPathFuncExpr.java#L718-L726 | equivalent JavaRosa implementation}.
 */
const toBigIntHash = (text: string): bigint => {

As a JSDoc comment, this allows the same documentation to be accessed at the call site.

Switching to an arrow function is somewhat of a nit, but it's generally preferable to avoid unnecessary function-keyword functions, as they have confusing behavior. (Maybe that's also a thing we could lint?)
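
For readers without the full diff at hand, the hashing step that comment describes can be sketched roughly as follows. This is not the PR's code: it swaps CryptoJS for Node's built-in crypto module, the name toBigIntHashSketch is made up for illustration, and reading the 64 bits as a signed value (to line up with a Java long) is an assumption.

```typescript
import { createHash } from 'node:crypto';

/**
 * Sketch only: SHA-256 the text, then interpret the first 8 bytes of the
 * digest as a big-endian 64-bit integer. Treating those bytes as signed
 * is an assumption made for this example.
 */
const toBigIntHashSketch = (text: string): bigint => {
  // digest() with no encoding argument returns a Buffer of 32 bytes.
  const digest = createHash('sha256').update(text, 'utf8').digest();

  // Buffer reads are explicitly big-endian here, so the result does not
  // depend on the endianness of the machine running the code.
  return digest.readBigInt64BE(0);
};
```

Because only 8 of the 32 digest bytes are used, distinct inputs can collide more easily than with the full hash; for deriving a PRNG seed that trade-off is usually acceptable.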

Contributor Author

ea5c499 removes the function keyword.

As for multiline comments: I don't like them. My editor is not supremely ergonomic with them, especially with the decorative * in front of each line. That decoration, anyway, diminishes the advantages of multiline comments: now one has to prefix each line with * instead of //, PLUS still manage the actual comment start and end markers. How is that a win over just plain simple // line comments, I wonder?
GitHub is also not super smart with them, look at the "keyword" syntax highlighting it applied to the diff just above! So I don't like to use that comment style myself but if someone else does, they're welcome to ;-)

As for JSDoc links, I don't like them. They move the description of the link to after the link (cf. Markdown). So then to read what the link is doing there, what it's for, I first need to scan to the end of a long URL. The hypothetical usability gain is that if you have an IDE that is smart with specifically JSDoc comments, you can click the link? Copy-pasting isn't so bad, and anyway most things (my editor, my terminal) already make http(s) URLs clickable (or ctrl-clickable). Not worth the disruption of the natural text reading flow to me, but I won't complain if someone else makes {@link https://asdsadsoewofihewofbcewnco.ewewrewrewrewconlnzc.cwefewr.few.cpoqwjeansls | these kinds of links}, I just won't emit them myself ;-)

Less is more!

Member

I'll try to make the case for JSDoc. If you're open to reconsidering here, that would be excellent. If not, I think we should come back to this discussion as a team.


JSDoc comments are a standard designed to encode structured documentation about any symbol they're attached to.

The editor support alone goes well beyond linking to URLs. For example, the ability to reference documentation across modules is invaluable. Linking to other symbols (both within and across modules) is also invaluable, both as a navigation tool and because those links can be kept up to date as the symbols change.
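
As a small illustration of symbol linking (the names EMPTY_STRING_SEED and seedForString are invented for this example and do not appear in the PR), a JSDoc comment can point at another symbol rather than at a URL:

```typescript
/** Seed used when the seed expression evaluates to a zero-length string. */
const EMPTY_STRING_SEED = 0;

/**
 * Picks a seed for a non-numeric string input.
 *
 * A zero-length string maps to {@link EMPTY_STRING_SEED}. Any other string is
 * passed to the supplied `hash` function (a stand-in for a real digest).
 */
const seedForString = (
  text: string,
  hash: (input: string) => bigint
): bigint | number => (text === '' ? EMPTY_STRING_SEED : hash(text));
```

In editors with JSDoc support, hovering seedForString at a call site shows this comment, and the {@link EMPTY_STRING_SEED} reference typically resolves to the constant's definition, so a rename refactor can keep it in sync.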

Beyond editor support, being a standard for structured documentation and association with symbols, JSDoc can be used for documentation output. We are already using this to generate documentation for @getodk/xforms-engine. I'd quite like to expand that to other use cases.

I share your distaste for some of the syntax minutiae of JSDoc, and @link is particularly weird (I suspect this is because inline tags are relatively rare). But that distaste doesn't outweigh the overwhelming benefits of an extensible documentation standard which is widely adopted in tooling we already use. It is also widely adopted throughout this project, and across the ecosystem; which is to say, it is both locally and globally idiomatic.

I also noticed GitHub's odd presentation in a couple diff suggestions in this PR. It's worth noting that:

  • That's not representative of how GitHub presents JSDoc in complete source
  • It's not representative of how GitHub presents JSDoc in diffs broadly
  • GitHub's syntax highlighting is notoriously inconsistent across various views
  • The highlighting is applied to an incomplete (i.e. syntactically invalid) chunk of code, which likely exacerbates potential issues

Lastly, I am sensitive to poor authoring ergonomics. I'm a bit surprised to hear that your editor doesn't make adding/editing JSDoc comments easier than single line comments, as that's my experience in the editors I'm familiar with. If this is a major hangup, I'd be happy to help look into ways to make the authoring experience nicer for you.

Contributor Author

I'll make the change hoping that the benefits will become apparent at some point 😆

Contributor Author

Updated!

packages/xpath/src/lib/collections/sort.ts (outdated, resolved)
packages/xpath/src/lib/collections/sort.ts (outdated, resolved)
packages/xpath/test/xforms/randomize.test.ts (outdated, resolved)
return seededRandomize(nodes, seed);
if (seedExpression === undefined) return seededRandomize(nodes);
const seed = seedExpression.evaluate(context);
const asNumber = seed.toNumber(); // TODO: There are some peculiarities to address: https://github.com/getodk/web-forms/issues/240
Member

I'm not sure this comment belongs here. It isn't specific to this cast, it's specific to casting to XPath number throughout. Fine to leave since we have an issue tracking it, but we'll probably just find it went stale some time after we address the issue.

Contributor Author

This is intended for someone reading the randomization code when trying to figure out why WF and JR still produce different sort orders. If it goes stale (when the issue is resolved) then following the link to the issue will make that apparent. I don't see a big problem.

Comment on lines +392 to +405
let finalSeed: number | bigint | undefined;
if (Number.isNaN(asNumber)) {
  // Specific behaviors for when a seed value is not interpretable as numeric.
  // We still want to derive a seed in those cases, see https://github.com/getodk/javarosa/issues/800
  const seedString = seed.toString();
  if (seedString === '') {
    finalSeed = 0; // special case: JR behaviour
  } else {
    // any other string, we'll convert to a number via a digest function
    finalSeed = toBigIntHash(seedString);
  }
} else {
  finalSeed = asNumber;
}
Member

Doesn't "special case: JR behavior" apply to all of this?

Contributor Author

Not really. Some of the behaviour is in the ODK spec. The "zero-length string becomes 0" behaviour was surprising though.

@brontolosone brontolosone force-pushed the 49_randomization_seed_inputs branch from b36d0a1 to cb6a021 Compare December 18, 2024 15:19
Successfully merging this pull request may close these issues.

Seed to randomize should not require an integer
2 participants