Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Protobuf validation #130

Merged
merged 13 commits into from
Oct 27, 2023
Merged

Protobuf validation #130

merged 13 commits into from
Oct 27, 2023

Conversation

xfbs
Copy link
Contributor

@xfbs xfbs commented Oct 13, 2023

In this PR, I am introducing the validation crate, which allows for validating some properties of protobuf files (and this, of buffrs packages).

Functionality

Specifically, I am trying to validate:

  • File structure. We are considering having a standardized file structure. Similar to how Rust projects are standardized to have a lib.rs (if library) or main.rs (if binary), and files with names matching those of the modules they represent.
  • Package names. If you publish a buffrs package named my_lib, you'd expect the protobuf to publish types for my_lib or my_lib.submodule.
  • Breaking changes. Essentially, we are considering validating whether the buffrs package you are trying to publish will introduce new functionality (types, optional fields, messages) or break functionality (rename, remove, add required fields). In doing so, we can safeguard our buffrs package repository and prevent accidentally releasing breaking changes.

Implementation

The way this works is that I am using an existing protocol buffer parser, the protobuf_parse crate. I am then loading the parsed data into some custom types. This is a partially lossy operation, because there is some information that we do not care about (does not affect the validation process). I can do some simple validation based on this information already.

Additionally, I am able to pull a previous version and run it through the same process and compare the previous version's parsed information with this version's parsed information. To do this, I am using the diff-struct crate, which allows me to generate a set of differences between two structs. With this information, I am then (somewhat easily) able to decide if there has been breaking changes.

flowchart LR
    package_protobuf[Current sources<br/><i>*.proto</i>]
    previous_protobuf[Previous sources<br/><i>*.proto</i>]
   
    subgraph Validate current package
    direction LR
        package_parsing[Parsing protobuf<br/>Using <i>protobuf_parse</i>]
        package_loading[Loading into custom data structure<br/>in <i>buffrs_validation</i>]
        package_checking[Checking properties<br/>in <i>buffrs_validation</i>]
        package_parsing --> package_loading --> package_checking
    end
    
    subgraph Parse previous package
        direction LR
        previous_parsing[Parsing protobuf<br/>Using <i>protobuf_parse</i>]
        previous_loading[Loading into custom data structure<br/>in <i>buffrs_validation</i>]
        previous_parsing --> previous_loading
    end

    package_protobuf --> package_parsing
    previous_protobuf --> previous_parsing

    diff[Generate differences<br/>using <i>diff_struct</i>]

    package_loading --> diff
    previous_loading --> diff

    validate[Validate changes<br/>in <i>buffrs_validation</i>]

    diff --> validate
Loading

Samples

For example, currently (not all functionality is implemented) the file books.proto which looks like this:

syntax = "proto3";

package com.book;

message Book {
    int64 isbn = 1;
    string title = 2;
    string author = 3;
}

message GetBookRequest {
    int64 isbn = 1;
}

message GetBookViaAuthor {
    string author = 1;
}

service BookService {
    rpc GetBook (GetBookRequest) returns (Book) {}
    rpc GetBooksViaAuthor (GetBookViaAuthor) returns (stream Book) {}
    rpc GetGreatestBook (stream GetBookRequest) returns (Book) {}
    rpc GetBooks (stream GetBookRequest) returns (stream Book) {}
}

message BookStore {
    string name = 1;
    map<int64, string> books = 2;
}

enum EnumSample {
    option allow_alias = true;
    UNKNOWN = 0;
    STARTED = 1;
    RUNNING = 1;
}

Will parse as this:

{                                                                                                                                                                                                                                           
  "packages": {                                                                                                                                                                                                                             
    "com.book": {                                                                                                                                                                                                                           
      "entities": {                                                                                                                                                                                                                         
        "Book": {                                                                                                                                                                                                                           
          "kind": "message",                                                                                                                                                                                                                
          "fields": {           
            "1": {                
              "name": "isbn",
              "type_": "int64",
              "label": "optional",
              "default": null
            },
            "2": {
              "name": "title",
              "type_": "string",
              "label": "optional",          
              "default": null
            },
            "3": {
              "name": "author",
              "type_": "string",
              "label": "optional",
              "default": null
            }
          }
        },
        "BookService": {
          "kind": "service"
        },
        "BookStore": {
          "kind": "message",
          "fields": {
            "1": {
              "name": "name",
              "type_": "string",
              "label": "optional",
              "default": null
            },
            "2": {
              "name": "books",
              "type_": "message",
              "label": "repeated",
              "default": null
            }
          }
        },
        "EnumSample": {
          "kind": "enum",
          "values": {
            "0": {
              "name": "UNKNOWN"
            },
            "1": {
              "name": "RUNNING"
            }
          }
        },
        "GetBookRequest": {
          "kind": "message",
          "fields": {
            "1": {
              "name": "isbn",
              "type_": "int64",
              "label": "optional",
              "default": null
            }
          }
        },
        "GetBookViaAuthor": {
          "kind": "message",
          "fields": {
            "1": {
              "name": "author",
              "type_": "string",
              "label": "optional",
              "default": null
            }
          }
        }
      }
    }
  }
}

If I make some changes to the Book message, I can generate a diff output like this. It only includes things that are removed or changed (insertion counts as change).

PackagesDiff {
    packages: BTreeMapDiff {
        altered: {
            "com.book": PackageDiff {
                entities: BTreeMapDiff {
                    altered: {
                        "Book": Message(
                            MessageDiff {
                                fields: BTreeMapDiff {
                                    altered: {
                                        3: FieldDiff {
                                            name: None,
                                            type_: Int32,
                                            label: NoChange,
                                            default: NoChange,
                                        },
                                    },
                                    removed: {
                                        2,
                                    },
                                },
                            },
                        ),
                    },
                    removed: {},
                },
            },
        },
        removed: {},
    },
}

Try it out

You can play with it yourself. There are two CLI tools in the validation crate at the moment, which allow you to parse and diff arbitrary protocol buffer files. Note that not all functionality is implemented. Run these from within the validation folder.

# parse protoc file
cargo run --features tools --bin parse -- --format json path/to/file.proto

# parse protoc file (show raw data)
cargo run --features tools --bin parse -- --raw path/to/file.proto

# generate diff between two files
cargo run --features tools --bin diff -- path/to/a.proto path/to/b.proto

Further ideas

  • We could use the same parsing strategy and data structures for the frontend, which would allow us to render protocol buffer definitions and make them searchable.

This is a draft PR, feel free to look at it but understand that it is not quite ready for review.

@xfbs xfbs added component::cli Everything related to the buffrs cli component::registry Everything related to the buffrs registry complexity::high Issues or ideas that are highly complex. require discussion and may affect backwards compatibility type::feature Shipping or drafting a new feature for this product type::idea Rough idea or proposal for buffrs priority::medium This is not urgent, but we should do this. labels Oct 13, 2023
@xfbs xfbs self-assigned this Oct 13, 2023
@mara-schulke mara-schulke added this to the Stabilization milestone Oct 14, 2023
@xfbs
Copy link
Contributor Author

xfbs commented Oct 19, 2023

Note to self: we discussed this today. The idea is to get this to a state where it can be merged. The differential validation (comparing changes between two versions of packages) will not be finished or used. Instead, we will first focus on doing the simple validation (validating properties of the current candidate package). This should give us a lot of bang for our buck.

The action plan here is as follows:

  • Pretty-print validation errors (similar to how clippy formats things, just make them readable).
  • Implement a buffrs lint (or buffrs check) subcommand that only runs these checks. This should be very useful for local development where a fast feedback cycle is useful. Long-term, these validations should also run on the registry side, but it is terrible UX to have to publish a package to check warnings.
  • Write a discussion to figure out how the validations should work specifically. Specifically, standards for file naming need to be figured out so we can be consistent.
  • Implement a lint to check dependencies (making sure we don't depend on anything we don't declare)?

@xfbs xfbs force-pushed the pe/validation branch 2 times, most recently from c7f0cc6 to 994fc83 Compare October 25, 2023 12:06
For the time being, this will only be used to build informative lints.
@xfbs xfbs marked this pull request as ready for review October 25, 2023 21:43
Cargo.toml Outdated Show resolved Hide resolved
Cargo.toml Outdated Show resolved Hide resolved
Cargo.toml Outdated Show resolved Hide resolved
Cargo.toml Outdated Show resolved Hide resolved
src/command.rs Outdated Show resolved Hide resolved
src/main.rs Outdated Show resolved Hide resolved
src/validation.rs Outdated Show resolved Hide resolved
@mara-schulke
Copy link
Contributor

Okay so my understanding for the required actions to merge this are:

  1. Resolve code-style / algorithmic comments
  2. Condense public interface to the bare minimum (see comments)
  3. Eventually: Move validate away from the store (as we dont want to validate the store, but a Package!)
  4. Use static rules and simplify the implementation to only work with static impls. Over generalization is our enemy here i guess – it will come at the cost of high maintanance without providing any immediate benefit for us as we dont expect external people to implement rules for a long time.
  5. Add documentation in the book about this (both the buffrs lint synopsis and the general reasoning for linting in the buffrs reference!)

So 4 & 5 are the most important "blockers" for me; The rest is more style oriented but i would be happy if they are equally adressed.

Thank you! 😊

@mara-schulke
Copy link
Contributor

I wrote the documentation in #153 – feel free to specify (i added rule codes by the way, and now we have linkable urls to use in the violation reports).

Also can we print the violations to stderr?

@mara-schulke
Copy link
Contributor

src/validation.rs Outdated Show resolved Hide resolved
src/validation.rs Outdated Show resolved Hide resolved
src/validation.rs Outdated Show resolved Hide resolved
@xfbs xfbs merged commit 69c4e09 into main Oct 27, 2023
7 checks passed
@xfbs xfbs deleted the pe/validation branch October 27, 2023 11:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
complexity::high Issues or ideas that are highly complex. require discussion and may affect backwards compatibility component::cli Everything related to the buffrs cli component::registry Everything related to the buffrs registry priority::medium This is not urgent, but we should do this. type::feature Shipping or drafting a new feature for this product type::idea Rough idea or proposal for buffrs
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

2 participants