-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support determining mode based on shebang interpreter directive #47
base: master
Are you sure you want to change the base?
Conversation
Attempt to determine the appropriate mode based on the presence of an interpreter directive on the first line of the file. Interpreter directives are examined first, followed by the filename when determining the mode mirroring what Emacs does by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really awesome! Thanks so much, it's a very useful addition I wished I had before.
Sorry for the multiple comments and being pedantic about not loading the whole file, happy to help with any of them
@@ -17,6 +17,8 @@ pub struct ModeConfig { | |||
pub comment: Option<CommentConfig>, | |||
pub indentation: IndentationConfig, | |||
pub grammar: Option<GrammarConfig>, | |||
#[serde(default)] | |||
pub shebangs: Vec<String>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking whether this should be a new field, rather than a new variant of FilenamePattern
-- granted, we should probably rename it to FilePattern
as it will look inside the file too to determine if the mode applies. I.e. something like FilePattern::Shebang
.
The structure I suggest may make it harder to avoid reading the file when the filename would suffice.
zee/src/editor/buffer.rs
Outdated
let mode = text | ||
.line(0) | ||
.as_str() | ||
.and_then(|shebang| context.0.mode_by_shebang(shebang)) | ||
.or_else(|| { | ||
file_path | ||
.as_ref() | ||
.and_then(|path| context.0.mode_by_filename(path)) | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may potentially read the whole file in pathological cases, e.g. a minified file that has no new lines. One goal of zee
is to keep being fast for any kind of pathological file you can think of and do anything that is potentially blocking in the background (e.g. parsing syntax or writing the file to disk). I've also been trying to avoid doing anything linear in the length of a line in the UI thread.
I think the right solution here long term is to build buffers, i.e. call Buffer::new()
in a background thread, rather than in the main, UI thread.
For this PR though, I'd be happy if instead you just bound how much of the file you read, say 256 bytes at most and test the regex for that. You'll have to deal with potentially truncated utf-8...
Maybe a better solution is to read characters until you encounter either 1. a new line or 2. if you don't after X characters, you give you and don't check the shebang -- essentially we only test if shebangs apply up to a certain line length.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A 2nd comment is that I think we should test if "mode_by_filename" applies and only then check the shebangs to avoid having to read the file if the name already matches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great suggestions. I'm looking to see if there is any standards around whether white space is allowed before the #!
and what the maximum length is on most platforms. I have some outdated information that 127-512 bytes is the maximum but with a lack of a standard to point at, I think an overly large maximum might be best. FreeBSD for example historically supported 4096. If white space before the shebang is not allowed then a two pass strategy where only the first two characters are examined followed by the remainder of the line up to the maximum discussed might be the most efficient.
zee/src/editor/mod.rs
Outdated
pub fn mode_by_filename(&self, filename: impl AsRef<Path>) -> &Mode { | ||
pub fn mode_by_filename(&self, filename: impl AsRef<Path>) -> Option<&Mode> { | ||
self.modes | ||
.iter() | ||
.find(|&mode| mode.matches_by_filename(filename.as_ref())) | ||
.unwrap_or(&PLAIN_TEXT_MODE) | ||
} | ||
|
||
pub fn mode_by_shebang(&self, shebang: &str) -> Option<&Mode> { | ||
self.modes | ||
.iter() | ||
.find(|&mode| mode.matches_by_shebang(shebang)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If shebang is a variant of FilenamePattern
, there would be one function and we could continue to return &Mode
rather than Option<&Mode>
-- I would like to continue have Context
own coming up with a Mode
for any possible file whatsoever and not have a default PLAIN_TEXT_MODE
potentially duplicated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been thinking about merging these methods into something like mode_by_file_pattern
that would always return a &Mode
but I'm stuck on how to handle the difference in parameters. Two of the variants operate on the filename while the other needs a portion of the file content. To have one method we would have to always pass a portion of the file content to the function in addition to the filename which would require reading at least part of the content of every file.
All of that being said, prior to detecting the mode or creating the buffer open_file()
is calling Rope::from_reader()
which I think might be reading the entire file anyway.
Line 156 in 0b40784
Rope::from_reader(BufReader::new(File::open(&file_path)?))?, |
Buffer::new()
so reading a slice should be quite efficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is an updated version of this code with a limit on the length of text examined for the interpreter directive, an inversion of the order mode checks are performed (filename checks first, then shebang), and a merging of the mode_by_filename() and mode_by_shebang() methods into one, mode_by_file(), which always returns a &Mode
.
I ended up settling on 256 characters for the maximum length of the shebang directive, matching what linux has done since 2018 (https://lore.kernel.org/lkml/[email protected]/) (https://github.com/torvalds/linux/blob/master/include/uapi/linux/binfmts.h)
I haven’t found a satisfactory way of adding a shebang case to FilenamePattern, the biggest stumbling blocks being that the shebang line would need to be passed to the matches() method and the pattern list would need to be sorted/filtered if we wanted to ensure that filename patterns were always examined before shebang directives.
1) Constrain the number of characters examine when determining the interpreter directive to 256 to avoid performance penalty of scanning long lines unnecessarily. 2) Move to a single mode_by_file() function in Context. 3) Always attempt to match the mode based on the filename first, falling back to the shebang line only if no match is found.
Attempt to determine the appropriate mode based on the presence of an interpreter directive on the first line of the file. Interpreter directives are examined first, followed by the filename when determining the mode mirroring what Emacs does by default.
Reopening now that
highlight-with-queries
has been merged. Feel free to close if there is no interest in supporting this feature.