-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Disable ropey unicode_lines
feature
#50
Conversation
With the current configuration, Ropey recognises more EOL sequences than the Language Server Protocol. This mismatch can lead to errors when trying to maintain a mirror of the user's documents as the llm-ls might have more lines. See: https://docs.rs/ropey/1.6.0/ropey/index.html#a-note-about-line-breaks See: https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocuments
I think the fix here should rather be to attempt to set the encoding to utf8 in the initialization call where server and client exchange capabilities. async fn initialize(&self, params: InitializeParams) -> LspResult<InitializeResult> {
*self.workspace_folders.write().await = params.workspace_folders;
+ let position_encoding = params.capabilities.general.and_then(|general_cap| {
+ general_cap.position_encodings.and_then(|encodings| {
+ if encodings.contains(&PositionEncodingKind::UTF8) {
+ Some(PositionEncodingKind::UTF8)
+ } else {
+ self.client.show_message(MessageType::WARNING, "utf8 is not supported, defaulting to utf16 which may result in offset errors").await;
+ None
+ }
+ })
+ });
Ok(InitializeResult {
server_info: Some(ServerInfo {
name: "llm-ls".to_owned(),
version: Some(VERSION.to_owned()),
}),
capabilities: ServerCapabilities {
text_document_sync: Some(TextDocumentSyncCapability::Kind(
TextDocumentSyncKind::INCREMENTAL,
)),
+ position_encoding,
..Default::default()
},
})
} |
Hi! Thanks for the quick response. Ropey Line BreaksI think this is a separate issue from position encoding negotiation. Enforcing UTF8 position encodings will not prevent Ropey's line count from diverging from Tree Sitter's or VSCode's, rendering Issues with EncodingRegarding the After looking at
Here's a video showcasing 1) a crash and 2) llm-ls' mirror going out of sync due to some unicode characters. LLM-LS.Bug.Report.movHere's a test case that illustrates this: Long Test Case for `Document::change`Case: mod test {
use tower_lsp::lsp_types::Position;
use tree_sitter::Node;
#[allow(unused_imports)]
use super::*;
#[tokio::test]
async fn test_document_change_tree_consistency_medium() {
let a = "let a = '🥸 你好';\rfunction helloWorld() { return '🤲🏿'; }\nlet b = 'Hi, 😊';";
let mut document = Document::open("javascript", a).await.unwrap();
document
.change(Range::new(Position::new(0, 14), Position::new(2, 13)), ",")
.await
.unwrap();
let b = "let a = '🥸 你好,😊';";
assert_eq!(document.text.to_string(), b);
let mut parser = Parser::new();
parser
.set_language(tree_sitter_javascript::language())
.unwrap();
let b_tree = parser.parse(b, None).unwrap();
assert!(nodes_are_equal_recursive(
&document.tree.unwrap().root_node(),
&b_tree.root_node()
));
}
#[allow(dead_code)]
fn nodes_are_equal_recursive(node1: &Node, node2: &Node) -> bool {
if node1.kind() != node2.kind() {
return false;
}
if node1.start_byte() != node2.start_byte() {
return false;
}
if node1.end_byte() != node2.end_byte() {
return false;
}
if node1.start_position() != node2.start_position() {
return false;
}
if node1.end_position() != node2.end_position() {
return false;
}
if node1.child_count() != node2.child_count() {
return false;
}
for i in 0..node1.child_count() {
let child1 = node1.child(i).unwrap();
let child2 = node2.child(i).unwrap();
if !nodes_are_equal_recursive(&child1, &child2) {
return false;
}
}
true
}
} Output:
Enforcing UTF-8 Position Encoding
This is by far the most convenient option, and we tried to go that route as well. Unfortunately, position encoding kind negotiation is a relatively new feature of the Language Server Protocol. It was introduced in 3.17.0, in 2022. As of writing, UTF-16 is the default, mandatory encoding of the protocol and must always be supported by servers. I thought it was still worth a try but it turns out that the VSCode on my machine, the latest version, does not even support anything other than UTF-16.
In our case, we deemed it was necessary to stick to UTF-16 for the following reasons:
If you managed to successfully negotiate UTF-8 position encodings, we'd love to hear more! Once again, I might be wrong about some of the details of this. It's based off my current understanding of LSP and library implementations which is quite modest. Anyway, all this Unicode handling caused quite the headaches on our side so we'd be happy to upstream our tests and implementation which handles the translation between the different encodings if you want. We're also looking at other approaches to simplify this complexity and if you're interested, I'm happy to collaborate on this together! Bon Dimanche 🤗 |
Thanks for the detailed response. I'd be happy to take a look at your code and more than happy to have you contribute to llm-ls. My main concern is that the rope crate only supports utf8, so I'll have to check if there are other rope crates that do support utf16. I'm also not sure how other editors fair regarding Unicode encoding, I'll take a look. For now let's merge the current PR. |
Awesome! I'll kickstart a PR today or tomorrow. I see the CI is not passing, it seems it's failing on |
I'm not sure where it's coming from, just updated the CI's secret, let's see if that was the issue. |
Hi @McPatate, hope you're doing good! I was working on upstreaming our document syncing implementation but two things happened: 1. Didn't get around to thoroughly test itI'm pretty sure our current implementation is correct but I also wouldn't bet my hand on it. Currently we have a few unit tests that ensure the documents are kept in sync by simulating different kind of edits but it's hard to know whether it will stand the test of thousand of user files/actions once in production. I'd feel bad about sharing under-tested code hence I'm currently looking around the web to try to find suitable edit traces that could help us simulate real-world, long-lived, complex editing sessions in our tests and ensure that everything matches in the end. 2. Uncovered additional issues this time relating to
|
Hey @rojas-diego, thanks for the detailed message. I'm going to merge the PR, just ran testbed locally and lgtm. Don't hesitate to run testbed yourself in future PRs, this is the way I test Regarding Regarding Feel free to create a new PR and we can discuss the issues in more detail over there. Thanks again for the effort! |
Hey @rojas-diego, do you still want to contribute your gist to the project? |
Hi! First of all, thanks a lot for open-sourcing this.
I was working on an internal fork of the project and noticed a potential issue. I have little experience with the LSP and Ropey so this might not be relevant. Anyway, here's the issue:
With the current configuration, Ropey recognises more EOL sequences than the Language Server Protocol. This mismatch can lead to errors when trying to maintain a mirror of the user's documents as the llm-ls' representation might have more lines.
See: https://docs.rs/ropey/1.6.0/ropey/index.html#a-note-about-line-breaks
See: https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocuments