Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmt files as input and output #24

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

Conversation

rgrinberg
Copy link
Member

Copy link
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the feature proposal sensible, consistent with our recent support for .cmir-linear in the backend. It should also be fairly simple to implement. I'm in favor.

(We also have flags such as -stop-after typing, which in theory could be combined with -bin-annot to get .cmt[i] files, but having input/output targets is clearer and easier to specify precisely.)

rfcs/cmt_input.md Outdated Show resolved Hide resolved
Copy link
Member

@dra27 dra27 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also very much in favour, just with some suggested tweaks to the handling of -o

rfcs/cmt_input.md Outdated Show resolved Hide resolved
rfcs/cmt_input.md Outdated Show resolved Hide resolved
@lpw25
Copy link
Contributor

lpw25 commented Mar 28, 2021

The feature is easy to implement and maintain, so I don't object. However, I don't think it is the best way to achieve your aims in the medium to long term.

The main issue is that cmt files are huge and mostly useless. I think there are only two tools in the OCaml ecosystem that currently consume cmt files:

  1. Merlin uses them to do jump-to-definition, however there is some upcoming work on jump-to-definition that will allow it to work using a much smaller compiler output than cmt files.
  2. odoc uses them to generate documentation, however it mostly only wants the much smaller cmti files. cmt files are only needed when there is no mli file.

For dune's checking mode I think that the mode should not be trying to produce any output beyond the possible errors. So -stop-after typing would be a better option.

For sharing type-checking time, it would be much better to store the lambda code to disk. The translation from the typedtree to lambda is inexpensive and the result should be a lot smaller. At the moment there are some differences between the lambda code for bytecode and the lambda code for native, but I think these differences can and should be eliminated anyway.


## Speed up compilation when type checking is a bottle neck

This likely doesn't occur very often, but there are cases where type checking is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, type checking is consistently the majority of compilation time at Jane Street.

@rgrinberg
Copy link
Member Author

The main issue is that cmt files are huge and mostly useless. I think there are only two tools in the OCaml ecosystem that currently consume cmt files:

ocp-index should be added to that list. I don't know how alive libmonda is today, but I know it uses cmt files too.

I think it would be somewhat of a shame if cmt files were to go away. It's true that they're very bulky, but they've been pretty good at allowing people to experiment with tooling. For example, I know someone is experimenting with adding a workspace wide "go to symbol" using cmt files.

In the end, I could certainly manage if cmt files went away, but will be saddened that one of the few successful hooks for cool hacks is gone.

For dune's checking mode I think that the mode should not be trying to produce any output beyond the possible errors. So -stop-after typing would be a better option.

Doesn't that mean checking mode and lambda generation will not be able to reuse the type checking phase? If -stop-after doesn't produce a typed tree, generating lambda will have to type things again.

@nojb
Copy link

nojb commented Mar 28, 2021

The main issue is that cmt files are huge and mostly useless. I think there are only two tools in the OCaml ecosystem that currently consume cmt files:

ocp-index should be added to that list. I don't know how alive libmonda is today, but I know it uses cmt files too.

I think it would be somewhat of a shame if cmt files were to go away. It's true that they're very bulky, but they've been pretty good at allowing people to experiment with tooling. For example, I know someone is experimenting with adding a workspace wide "go to symbol" using cmt files.

Just as a data point, at LexiFi we have several refactoring tools that we use often and that make use of cmt files. Just to mention one, we have a typed "grep" tool where you can search for eg if __ then (__ : Foo.t) else __ to locate for all ocurrences of an if where the branches have type Foo.t (in general you can search for expressions with wildcards __ and type constraints). A second example is the tool that we used to migrate to floatarray https://github.com/nchataing/caml-migrate-floatarray, and so on.

All of this to say that for me it would be a mistake to make cmt files go away without having something to replace the same functionality...

@lpw25
Copy link
Contributor

lpw25 commented Mar 28, 2021

I certainly don't think we should stop supporting cmt files. They are occasionally useful for the kinds of things mentioned above. But I would like to get to a point where they are only built on request, rather than always being built during day-to-day development. Building them all the time has a noticeable cost and I don't think they are used enough to justify it.

Doesn't that mean checking mode and lambda generation will not be able to reuse the type checking phase?

Yeah. I'm assuming that people would either use checking mode or try to build things, not both. I would have thought that the loss of incrementality when changing modes was worth not having to write out the cmt files.

@c-cube
Copy link
Contributor

c-cube commented Mar 28, 2021 via email

@rgrinberg
Copy link
Member Author

But I would like to get to a point where they are only built on request, rather than always being built during day-to-day development. Building them all the time has a noticeable cost and I don't think they are used enough to justify it.

It would be good if this was possible, but I'm not sure how to make this work in practice. We need cmt files for installed libraries for many applications, and I'm not sure how to generate these "on demand" without just installing all the build rules and sources.

Yeah. I'm assuming that people would either use checking mode or try to build things, not both. I would have thought that the loss of incrementality when changing modes was worth not having to write out the cmt files.

The workflow I'm thinking of where this could matter is when one is editing sources and running the test suite in the background in watch mode. E.g. $ dune build -w @check @runtest. I'm not sure how good this is in practice, but it will be nice to experiment here if the RFC is accepted.

@EduardoRFS
Copy link

I was actually in process of implementing support for the complete Env.t on the cmt to do Typedtree -> Lambda after the fact without IO.

So this for me is a nice movement but also makes me wonder, there was any discussion of unifying the multiple files that we have today? I know that ocaml/ocaml#608 proposes adding the flambda to cmxa, maybe we could just have an incremental .cmt?

type t = {
  typedtree: typedtree option;
  lambda: lambda option;
  cmm: cmm option;
  flambda: flambda option;
}

Of course this could be compressed and storing only what was configured, so --include-lambda-in-cmt ... With this then we could include this assets on installed artifacts, which would allow whole program analysis, like checked extensions in userspace. And it also covers the goals of this RFC.

@lpw25
Copy link
Contributor

lpw25 commented Mar 29, 2021

I mean, the previous RFC, if merged, would also roughly divide the size of cmt files by two.

I'm not sure that the size is worth it even if divided in two.

With the previous RFC the core compilation pipeline for each file will go:

  • Take the typedtree
  • Marshall it
  • Compress it
  • Store it to disk
  • Read it from disk
  • Decompress it
  • Unmarshall it
  • Throw 90% of it away

This doesn't seem a good route to improving compilation times.

We need cmt files for installed libraries for many applications

I suspect this isn't actually true. Once merlin stops using them I think we will only need cmt files for ml files without mli files, for odoc. I think we are currently installing these files everywhere and then basically never using them.

However, even if it were true, the vast majority of builds are not installing builds. In a monorepo style approach, e.g. where you pull all dependencies into a single dune workspace to build them, you are never doing an installing build. Even if you still build cmt files whenever you are installing a library it will still be a big improvement.

The workflow I'm thinking of where this could matter is when one is editing sources and running the test suite in the background in watch mode.

For that workflow, I think building bytecode only is probably the best approach. I would have thought that building cmo files was already a very quick checking mode. I would not be surprised if building cmo files was quicker than building and compressing cmt files.

@rgrinberg
Copy link
Member Author

So this for me is a nice movement but also makes me wonder, there was any discussion of unifying the multiple files that we have today? I know that ocaml/ocaml#608 proposes adding the flambda to cmxa, maybe we could just have an incremental .cmt?

Mutation in place really complicates the implementation of build systems. At the very least, it would require to implement locking to write to this mega cmt file since it is now used for so many different things. Not to mention that this would probably make dune's cache useless as well.

However, even if it were true, the vast majority of builds are not installing builds. In a monorepo style approach, e.g. where you pull all dependencies into a single dune workspace to build them, you are never doing an installing build. Even if you still build cmt files whenever you are installing a library it will still be a big improvement.

That's a good compromise. Things that only need to be built for install builds are much less problematic.

For that workflow, I think building bytecode only is probably the best approach. I would have thought that building cmo files was already a very quick checking mode. I would not be surprised if building cmo files was quicker than building and compressing cmt files.

The problem is that users sometimes prefer to run their tests in native mode. I don't want to speak for all users here, but in my experience Iarge test suites can get slowish. In any case, I prefer to run things closer to how they are in production when testing.

@rgrinberg
Copy link
Member Author

On a more practical note, I had a brief look at implementing this and it isn't as easy as I thought.

To compile the typedtree to lambda, we need to obtain the module_coercion for the module we are compiling. To compute this coercion, we need the inferred and the written signature of the module. The written part obviously comes from the cmi, but the inferred part isn't saved anywhere as far as I can see. To get around this, we need to do one of:

  • Store the coercion in the cmt. Doesn't seem very attractive given that some people are already loathing the size of cmt's
  • Infer the signature for the module and calculate the coercion again. This looks like a lot of extra work - both in the amount of code this will require, and the work introduced at runtime.

@EduardoRFS
Copy link

EduardoRFS commented Apr 1, 2021

Yeah I did the same yesterday, seems like storing the coercion is the best idea as it is a simple structure, overall I think it will not be a significant upgrade in size.

@EduardoRFS
Copy link

This may be a really niche property, but because of this any PPX that is used through staged_pps on Dune, will be able to be executed only once instead of twice. So typed PPXs will probably be be faster

@xavierleroy
Copy link

Like @lpw25, I dont't like .cmt files much. It's data hoarding: we don't quite know what information external tools need, so we'll dump the whole typed AST to disk, no matter how big and redundant it is, no matter how much disk space it consumes in OPAM package installations, no matter how much time this adds to compilation times. (And it adds more now that marshaling took a big performance hit in preparation for Multicore OCaml.)

So, I'm skeptical that using .cmt files as an intermediate step in builds is a good idea. Expect noticeable slowdowns for essentially zero benefits. If you want typechecking + ability to run tests on the fly, compilation to bytecode does the job already and more efficiently.

@xavierleroy
Copy link

we'll dump the whole typed AST to disk, no matter how big and redundant it is

I remember reviewing a project on analysis and refactoring of source code that had problems scaling up to large code bases because their Java representation of the annotated AST consumed about 1 kilobyte of disk space for every line of source code. It sounded crazy to me at the time. But I'm not sure .cmt files are doing much better than that today.

@rgrinberg
Copy link
Member Author

So, I'm skeptical that using .cmt files as an intermediate step in builds is a good idea. Expect noticeable slowdowns for essentially zero benefits

It was pointed out that type checking is a bottle neck at jane street. So I'm not sure about the slowdowns. I've listed other benefits in the RFC.

I'm in full agreement with the concerns about data hoarding and bloat, but perhaps that discussion is not directly relevant to this RFC. The RFC does not suggest to add any bloat, it only proposes to use the already existing bloat to solve current problems. If there was a way to avoid cmt's and not lose any tooling support, I'm sure nobody would object. Unfortunately, that does not seem to be on the cards in the short term. So why not reuse what's already there to work around current pain points until whatever is going to replace cmt's is ready?

@vrotaru
Copy link

vrotaru commented Sep 19, 2021

To add to the tools/progams that use the .cmt files there is also OCamlEditor which uses .cmt(i) files for displaying an outline of OCaml files and displaying expression type on hover
image

Which is not shown on the screenshot are the type hovers, but you can probably believe me on that. It runs on current OCaml (4.12) version, and will run on 4.13, but on the other side there are probably just 3 people in the world using it (and I'm not sure who the third is), so it should not be such a big concern.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants