From 7e61f515d9e345e2664492c89968854f877a4916 Mon Sep 17 00:00:00 2001 From: Wendell Piez Date: Wed, 11 Sep 2024 17:21:18 -0400 Subject: [PATCH] Now with complete pipeline producing OSCAL from PDF via HTML and NISO STS intermediates --- projects/oscal-from-pdf/GRAB-FM6-22.xpl | 20 ++ projects/oscal-from-pdf/GRAB-RESOURCES.xpl | 37 +++ .../extract-FM6_22-chapter4.xpl | 253 +++++++++++++----- projects/oscal-from-pdf/readme.md | 40 +-- .../oscal-from-pdf/src/fm22-6-html-to-sts.xsl | 2 +- .../src/fm22-6_chapter4-sts.sch | 48 ++++ .../oscal-from-pdf/src/fm22-6_chapter4.sch | 13 - .../src/fm22-6_sts-enhance1.xsl | 183 +++++++++++++ .../src/fm22-6_sts-enhance2.xsl | 143 ++++++++++ .../src/fm22-6_sts-enhance3.xsl | 39 +++ .../src/fm22-6_sts-to-oscal.xsl | 252 +++++++++++++++++ projects/oscal-from-pdf/src/oscal-check.sch | 18 ++ .../oscal-from-pdf/src/xvrl-summarize.xpl | 50 ++++ 13 files changed, 992 insertions(+), 106 deletions(-) create mode 100644 projects/oscal-from-pdf/GRAB-FM6-22.xpl create mode 100644 projects/oscal-from-pdf/GRAB-RESOURCES.xpl create mode 100644 projects/oscal-from-pdf/src/fm22-6_chapter4-sts.sch delete mode 100644 projects/oscal-from-pdf/src/fm22-6_chapter4.sch create mode 100644 projects/oscal-from-pdf/src/fm22-6_sts-enhance1.xsl create mode 100644 projects/oscal-from-pdf/src/fm22-6_sts-enhance2.xsl create mode 100644 projects/oscal-from-pdf/src/fm22-6_sts-enhance3.xsl create mode 100644 projects/oscal-from-pdf/src/fm22-6_sts-to-oscal.xsl create mode 100644 projects/oscal-from-pdf/src/oscal-check.sch create mode 100644 projects/oscal-from-pdf/src/xvrl-summarize.xpl diff --git a/projects/oscal-from-pdf/GRAB-FM6-22.xpl b/projects/oscal-from-pdf/GRAB-FM6-22.xpl new file mode 100644 index 00000000..72ffda49 --- /dev/null +++ b/projects/oscal-from-pdf/GRAB-FM6-22.xpl @@ -0,0 +1,20 @@ + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/projects/oscal-from-pdf/GRAB-RESOURCES.xpl b/projects/oscal-from-pdf/GRAB-RESOURCES.xpl new file 
mode 100644 index 00000000..bb76b61d --- /dev/null +++ b/projects/oscal-from-pdf/GRAB-RESOURCES.xpl @@ -0,0 +1,37 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/projects/oscal-from-pdf/extract-FM6_22-chapter4.xpl b/projects/oscal-from-pdf/extract-FM6_22-chapter4.xpl index 99655a55..eb72ad7d 100644 --- a/projects/oscal-from-pdf/extract-FM6_22-chapter4.xpl +++ b/projects/oscal-from-pdf/extract-FM6_22-chapter4.xpl @@ -1,36 +1,94 @@ - + name="MINIMAL" xmlns="http://www.w3.org/1999/xhtml"> + + - + - + - + + + - + - + + + + + + + + + + + + - + + + + + + + + + + + + + + - + @@ -43,20 +101,11 @@ - - - - - - - - - + + + @@ -67,7 +116,7 @@ - + @@ -75,55 +124,126 @@ - - - + + - + - + - - - + + + - + - - + + - + - + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Schema { $sts-rng } not found - try running pipeline + GRAB-NISO_STS-RNG.xpl + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Schema { $oscal-xsd } not found - try running pipeline + GRAB-RESOURCES.xpl + + + + + + + + + ---> - - +--> + - + \ No newline at end of file diff --git a/projects/oscal-from-pdf/readme.md b/projects/oscal-from-pdf/readme.md index 6f92b6ec..fb242b0f 100644 --- a/projects/oscal-from-pdf/readme.md +++ b/projects/oscal-from-pdf/readme.md @@ -69,44 +69,46 @@ A sequence of transformations reproduces the data set with NISO STS (Standards T - Repairing broken callout boxes - Consolidating partial/broken tables into tables - Grouping bulleted items into lists + - Making spot corrections -Etc. Details can be traced in the [XProc pipeline file extract-FM6_22-chapter4.xpl](extract-FM6_22-chapter4.xpl). +The last category includes a number of tactical interventions to correct or enhance encoding where the text has been wrongly or inadequately tagged at earlier stages - note that these changes all serve to *reveal* not *hide* the original, and in every case the XSLT documents unambiguously any changes being made. 
-The result is Chapter 4 *only* in a correct and harmonious STS encoding. +Details can be traced in the [XProc pipeline file extract-FM6_22-chapter4.xpl](extract-FM6_22-chapter4.xpl). -As currently configured (still in development) this pipeline writes results - intermediate as well as final - to the [temp](temp) directory. +The result is Chapter 4 *only* in a correct and harmonious STS encoding. -In particular, the file [temp/t05_sts-corrected.xml](temp/t05_sts-corrected.xml) in that directory (produced by the pipeline) is valid NIST STS XML. +As currently configured (still in development) this pipeline writes results - intermediate as well as final - to the [temp](temp) directory. Using a 'diff' tool on any consecutive pair of files here shows the changes made by the transformation that produces the later from the earlier. View this file using NISO STS Tools such as the [NISO STS Viewer](https://pages.nist.gov/xslt-blender/sts-viewer/). NISO STS makes a good intermediate model for this enhancement, as it - -- Presents a comprehensive, retrospective encoding of the document as received, stabilizing a semantic representation, without entanglement in the next task, namely mapping this (or any) semantically adequate representation into OSCAL +- Presents a comprehensive, retrospective encoding of the document as received, stabilizing a semantic representation, without entanglement in the related but different problem of mapping this (or any) semantically adequate representation into OSCAL - Can be inspected, tested and validated on the way through, including with bespoke validation, STS display tools and other methods -- Produces a useful spin-off artifact (the STS instance itself) +- Produces a useful spin-off artifact: the STS instance itself With a clean STS representation of the document in hand, casting into OSCAL will be straightforward. The main advantages of this stepwise process are in transparency and traceability, duplicability, and debuggability. 
+#### NISO STS quality check -#### NISO STS to OSCAL +If it is adequate in its semantic description, we should be able to validate the h*ck out of the data in its STS form. (This is circular reasoning: if it is invalid, we call it inadequate.) -TODO: this is where we pause ... +The constraint set we use to assert this validation is in three tiers: -OSCAL will be free-form text (with links) followed by a control sequence +1. NISO STS validation - the pipeline [GRAB-NISO_STS-RNG.xpl](GRAB-NISO_STS-RNG.xpl) acquires a copy of an RNG schema for this purpose +1. Schematron validation - house rules (NIST RLM) - see [src/sts-check.sch](src/sts-check.sch) +1. Schematron validation - bespoke rules asserting regularities for this instance - see [src/fm22-6_chapter4.sch](src/fm22-6_chapter4.sch) + Note: this Schematron is used to report errors and anomalies in sources or intermediate STS files that are then remediated in subsequent phases -The XProc to perform this conversion is *TBD* +#### NISO STS to OSCAL + +- Tables 4-6 and on (the 'competency tables') become controls with parts +- The narrative sequence is dropped into its own free-flowing OSCAL (nested parts) -keep section sequence replacing tables with links to control structures - competency/attribute/component - -drill all the way to part[@class='item'] with line items - create control structures for tables 4-6 - 4-80 - -Retain tables 1-5 as plain-old tables - break numbered sections out into parts - separate out a formal part for capabilities / indicators +TODO: Schematron and check internal cross-referencing + +OSCAL will be free-form text (with links) followed by a control sequence ### Initial planning and survey diff --git a/projects/oscal-from-pdf/src/fm22-6-html-to-sts.xsl b/projects/oscal-from-pdf/src/fm22-6-html-to-sts.xsl index de8cc2f8..04144b5f 100644 --- a/projects/oscal-from-pdf/src/fm22-6-html-to-sts.xsl +++ b/projects/oscal-from-pdf/src/fm22-6-html-to-sts.xsl @@ -84,7 +84,7 @@

- bulleted + bullet

diff --git a/projects/oscal-from-pdf/src/fm22-6_chapter4-sts.sch b/projects/oscal-from-pdf/src/fm22-6_chapter4-sts.sch new file mode 100644 index 00000000..9db85232 --- /dev/null +++ b/projects/oscal-from-pdf/src/fm22-6_chapter4-sts.sch @@ -0,0 +1,48 @@ + + + + + + + + Punctuation check. + + + + + Table label '' is not distinctive + + + + table out of order (Strength Indicators label missing) + table out of order ('Need Indicators' label missing) + table out of order + table out of order ('Underlying Causes' label missing) + table out of order + table out of order ('Feedback' label missing) + table out of order ('Study' label missing) + table out of order ('Practice' label missing) + + + + + + + + We expect p element to start with a target. + + + + + \ No newline at end of file diff --git a/projects/oscal-from-pdf/src/fm22-6_chapter4.sch b/projects/oscal-from-pdf/src/fm22-6_chapter4.sch deleted file mode 100644 index 73820cd9..00000000 --- a/projects/oscal-from-pdf/src/fm22-6_chapter4.sch +++ /dev/null @@ -1,13 +0,0 @@ - - - - - - - We expect p element to start with a target. - - - - - \ No newline at end of file diff --git a/projects/oscal-from-pdf/src/fm22-6_sts-enhance1.xsl b/projects/oscal-from-pdf/src/fm22-6_sts-enhance1.xsl new file mode 100644 index 00000000..b8e9739d --- /dev/null +++ b/projects/oscal-from-pdf/src/fm22-6_sts-enhance1.xsl @@ -0,0 +1,183 @@ + + + + + + + + + + + href="../lib/NISO-STS-interchange-1-MathML3-RNG/NISO-STS-interchange-1-mathml3.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0" + href="../src/fm22-6_chapter4.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron" + + + + + + + + Determining Developmental Activities + +

Answer these to select appropriate developmental activities:

+ + +

Developmental Activity: How do I need to improve?

+
+ +

Desired Outcome: What do I hope to achieve?

+
+ +

Method: How am I going to do this? What resources do I need?

+
+ +

Time available: When will I do this? How will I monitor progress (such as identifying and monitoring milestones, rewarding success, or identifying accountability partners)?

+
+ +

Limits: What factors will affect or hinder successfully implementing this activity?

+
+ +

Controls: What minimizes or controls the factors that hinder implementing this activity?

+
+
+
+
+ + + + + + <xsl:apply-templates/> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
+
+ + + Unexpected unaccounted-for element(s) being grouped on key { current-grouping-key() } + +
+ +
+
+
+ + + + + + ^\s*4\-\d\d?\d?\. + + + + + + + + + + + + + + + + + + +

Need Indicators

+
+ + + + + + + + + + + + + + + + + + + + + + + +
\ No newline at end of file diff --git a/projects/oscal-from-pdf/src/fm22-6_sts-enhance2.xsl b/projects/oscal-from-pdf/src/fm22-6_sts-enhance2.xsl new file mode 100644 index 00000000..18f60241 --- /dev/null +++ b/projects/oscal-from-pdf/src/fm22-6_sts-enhance2.xsl @@ -0,0 +1,143 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + +

+ + + + + + +

+
+ +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + { if (ancestor::sec/title/lower-case(.)!tokenize(.,'\s+') = $competencies) then 'competency' else 'attribute' } + + + + + + + + + + + + + + + + + + + + Legend + + + + +

{ following-sibling::text()[1] ! normalize-space(.) }

+
+
+
+
+
+ + + + + + + + + + + + +
\ No newline at end of file diff --git a/projects/oscal-from-pdf/src/fm22-6_sts-enhance3.xsl b/projects/oscal-from-pdf/src/fm22-6_sts-enhance3.xsl new file mode 100644 index 00000000..65647c8a --- /dev/null +++ b/projects/oscal-from-pdf/src/fm22-6_sts-enhance3.xsl @@ -0,0 +1,39 @@ + + + + + + + + + + + tables + 4-4 + and + 4-5 + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/projects/oscal-from-pdf/src/fm22-6_sts-to-oscal.xsl b/projects/oscal-from-pdf/src/fm22-6_sts-to-oscal.xsl new file mode 100644 index 00000000..c15114e9 --- /dev/null +++ b/projects/oscal-from-pdf/src/fm22-6_sts-to-oscal.xsl @@ -0,0 +1,252 @@ + + + + + + + + + + + + + + Electronic Transcription - US Army Field Manual 6-22: Developing Leaders (November 2022) - Chapter 4: Learning and Developmental Activities + 2024-09-10T15:19:38.6995852-04:00 + 0.8 draft + 1.1.2 + +

This encoded representation of Field Manual 6-22 was produced from published source without explicit authorization from or any coordination with the document's originators or the US Army. Reliance on this representation without reference to the publication from which it is derived is not advised.

+

See the repository readme for further documentation.

+
+
+ + Learning and Developmental Activities + + + + Attributes + + + + Competencies + + +
+
+ + + + + + + + + + + + + + + + + + + <xsl:apply-templates/> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

+ +

+
+
+ + + { substring-after(.,'Tip: ') } + + + + + + + + + + + + + + + + + + + + + + + + <xsl:apply-templates/> + + + + + + + + + + + + + + + + + + + + + + + + { caption/title/normalize-space() } + + + + + + Strength Indicators + + + Need Indicators + + + Underlying Causes + + + + + Feedback + + + Study + + + Practice + + + + + + + Untitled + + { $title } + + + + + +
    + +
+
+ + +
  • + +
  • +
    + + +
    + +
    + + +

    + +

    +
    + + + + + + + + + + + + + + + + + + + + + + + + + + + Not matching { name() } + + + + + +
    \ No newline at end of file diff --git a/projects/oscal-from-pdf/src/oscal-check.sch b/projects/oscal-from-pdf/src/oscal-check.sch new file mode 100644 index 00000000..332509c8 --- /dev/null +++ b/projects/oscal-from-pdf/src/oscal-check.sch @@ -0,0 +1,18 @@ + + + + + + + + + + + Internal link href='' has no target + + + + \ No newline at end of file diff --git a/projects/oscal-from-pdf/src/xvrl-summarize.xpl b/projects/oscal-from-pdf/src/xvrl-summarize.xpl new file mode 100644 index 00000000..5c58bfc7 --- /dev/null +++ b/projects/oscal-from-pdf/src/xvrl-summarize.xpl @@ -0,0 +1,50 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + CONGRATULATIONS! No validation errors are reported against { substring-after($schema, $here) } + + + + + + + Uhoh . . . +Validating result with { $schema } - { $error-count } { + if ($error-count eq 1) then 'error' else 'errors' } reported + + + + + + \ No newline at end of file