cmt files as input and output #24

rgrinberg · 2021-03-28T04:28:34Z

gasche

I find the feature proposal sensible, consistent with our recent support for .cmir-linear in the backend. It should also be fairly simple to implement. I'm in favor.

(We also have flags such as -stop-after typing, which in theory could be combined with -bin-annot to get .cmt[i] files, but having input/output targets is clearer and easier to specify precisely.)

rfcs/cmt_input.md

dra27

I'm also very much in favour, just with some suggested tweaks to the handling of -o

rfcs/cmt_input.md

lpw25 · 2021-03-28T16:57:27Z

The feature is easy to implement and maintain, so I don't object. However, I don't think it is the best way to achieve your aims in the medium to long term.

The main issue is that cmt files are huge and mostly useless. I think there are only two tools in the OCaml ecosystem that currently consume cmt files:

Merlin uses them to do jump-to-definition, however there is some upcoming work on jump-to-definition that will allow it to work using a much smaller compiler output than cmt files.
odoc uses them to generate documentation, however it mostly only wants the much smaller cmti files. cmt files are only needed when there is no mli file.

For dune's checking mode I think that the mode should not be trying to produce any output beyond the possible errors. So -stop-after typing would be a better option.

For sharing type-checking time, it would be much better to store the lambda code to disk. The translation from the typedtree to lambda is inexpensive and the result should be a lot smaller. At the moment there are some differences between the lambda code for bytecode and the lambda code for native, but I think these differences can and should be eliminated anyway.

lpw25 · 2021-03-28T17:01:13Z

rfcs/cmt_input.md

+
+## Speed up compilation when type checking is a bottle neck
+
+This likely doesn't occur very often, but there are cases where type checking is


FWIW, type checking is consistently the majority of compilation time at Jane Street.

rgrinberg · 2021-03-28T19:43:41Z

The main issue is that cmt files are huge and mostly useless. I think there are only two tools in the OCaml ecosystem that currently consume cmt files:

ocp-index should be added to that list. I don't know how alive libmonda is today, but I know it uses cmt files too.

I think it would be somewhat of a shame if cmt files were to go away. It's true that they're very bulky, but they've been pretty good at allowing people to experiment with tooling. For example, I know someone is experimenting with adding a workspace wide "go to symbol" using cmt files.

In the end, I could certainly manage if cmt files went away, but will be saddened that one of the few successful hooks for cool hacks is gone.

For dune's checking mode I think that the mode should not be trying to produce any output beyond the possible errors. So -stop-after typing would be a better option.

Doesn't that mean checking mode and lambda generation will not be able to reuse the type checking phase? If -stop-after doesn't produce a typed tree, generating lambda will have to type things again.

nojb · 2021-03-28T19:58:50Z

The main issue is that cmt files are huge and mostly useless. I think there are only two tools in the OCaml ecosystem that currently consume cmt files:

ocp-index should be added to that list. I don't know how alive libmonda is today, but I know it uses cmt files too.

I think it would be somewhat of a shame if cmt files were to go away. It's true that they're very bulky, but they've been pretty good at allowing people to experiment with tooling. For example, I know someone is experimenting with adding a workspace wide "go to symbol" using cmt files.

Just as a data point, at LexiFi we have several refactoring tools that we use often and that make use of cmt files. Just to mention one, we have a typed "grep" tool where you can search for eg if __ then (__ : Foo.t) else __ to locate for all ocurrences of an if where the branches have type Foo.t (in general you can search for expressions with wildcards __ and type constraints). A second example is the tool that we used to migrate to floatarray https://github.com/nchataing/caml-migrate-floatarray, and so on.

All of this to say that for me it would be a mistake to make cmt files go away without having something to replace the same functionality...

lpw25 · 2021-03-28T21:41:46Z

I certainly don't think we should stop supporting cmt files. They are occasionally useful for the kinds of things mentioned above. But I would like to get to a point where they are only built on request, rather than always being built during day-to-day development. Building them all the time has a noticeable cost and I don't think they are used enough to justify it.

Doesn't that mean checking mode and lambda generation will not be able to reuse the type checking phase?

Yeah. I'm assuming that people would either use checking mode or try to build things, not both. I would have thought that the loss of incrementality when changing modes was worth not having to write out the cmt files.

c-cube · 2021-03-28T21:44:13Z

I mean, the previous RFC, if merged, would also roughly divide the size of cmt files by two. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.

rgrinberg · 2021-03-28T21:55:44Z

But I would like to get to a point where they are only built on request, rather than always being built during day-to-day development. Building them all the time has a noticeable cost and I don't think they are used enough to justify it.

It would be good if this was possible, but I'm not sure how to make this work in practice. We need cmt files for installed libraries for many applications, and I'm not sure how to generate these "on demand" without just installing all the build rules and sources.

Yeah. I'm assuming that people would either use checking mode or try to build things, not both. I would have thought that the loss of incrementality when changing modes was worth not having to write out the cmt files.

The workflow I'm thinking of where this could matter is when one is editing sources and running the test suite in the background in watch mode. E.g. $ dune build -w @check @runtest. I'm not sure how good this is in practice, but it will be nice to experiment here if the RFC is accepted.

Signed-off-by: Rudi Grinberg <[email protected]>

EduardoRFS · 2021-03-29T04:57:39Z

I was actually in process of implementing support for the complete Env.t on the cmt to do Typedtree -> Lambda after the fact without IO.

So this for me is a nice movement but also makes me wonder, there was any discussion of unifying the multiple files that we have today? I know that ocaml/ocaml#608 proposes adding the flambda to cmxa, maybe we could just have an incremental .cmt?

type t = {
  typedtree: typedtree option;
  lambda: lambda option;
  cmm: cmm option;
  flambda: flambda option;
}

Of course this could be compressed and storing only what was configured, so --include-lambda-in-cmt ... With this then we could include this assets on installed artifacts, which would allow whole program analysis, like checked extensions in userspace. And it also covers the goals of this RFC.

lpw25 · 2021-03-29T06:40:10Z

I mean, the previous RFC, if merged, would also roughly divide the size of cmt files by two.

I'm not sure that the size is worth it even if divided in two.

With the previous RFC the core compilation pipeline for each file will go:

Take the typedtree
Marshall it
Compress it
Store it to disk
Read it from disk
Decompress it
Unmarshall it
Throw 90% of it away

This doesn't seem a good route to improving compilation times.

We need cmt files for installed libraries for many applications

I suspect this isn't actually true. Once merlin stops using them I think we will only need cmt files for ml files without mli files, for odoc. I think we are currently installing these files everywhere and then basically never using them.

However, even if it were true, the vast majority of builds are not installing builds. In a monorepo style approach, e.g. where you pull all dependencies into a single dune workspace to build them, you are never doing an installing build. Even if you still build cmt files whenever you are installing a library it will still be a big improvement.

The workflow I'm thinking of where this could matter is when one is editing sources and running the test suite in the background in watch mode.

For that workflow, I think building bytecode only is probably the best approach. I would have thought that building cmo files was already a very quick checking mode. I would not be surprised if building cmo files was quicker than building and compressing cmt files.

rgrinberg · 2021-03-31T23:46:21Z

So this for me is a nice movement but also makes me wonder, there was any discussion of unifying the multiple files that we have today? I know that ocaml/ocaml#608 proposes adding the flambda to cmxa, maybe we could just have an incremental .cmt?

Mutation in place really complicates the implementation of build systems. At the very least, it would require to implement locking to write to this mega cmt file since it is now used for so many different things. Not to mention that this would probably make dune's cache useless as well.

However, even if it were true, the vast majority of builds are not installing builds. In a monorepo style approach, e.g. where you pull all dependencies into a single dune workspace to build them, you are never doing an installing build. Even if you still build cmt files whenever you are installing a library it will still be a big improvement.

That's a good compromise. Things that only need to be built for install builds are much less problematic.

For that workflow, I think building bytecode only is probably the best approach. I would have thought that building cmo files was already a very quick checking mode. I would not be surprised if building cmo files was quicker than building and compressing cmt files.

The problem is that users sometimes prefer to run their tests in native mode. I don't want to speak for all users here, but in my experience Iarge test suites can get slowish. In any case, I prefer to run things closer to how they are in production when testing.

rgrinberg · 2021-03-31T23:56:37Z

On a more practical note, I had a brief look at implementing this and it isn't as easy as I thought.

To compile the typedtree to lambda, we need to obtain the module_coercion for the module we are compiling. To compute this coercion, we need the inferred and the written signature of the module. The written part obviously comes from the cmi, but the inferred part isn't saved anywhere as far as I can see. To get around this, we need to do one of:

Store the coercion in the cmt. Doesn't seem very attractive given that some people are already loathing the size of cmt's
Infer the signature for the module and calculate the coercion again. This looks like a lot of extra work - both in the amount of code this will require, and the work introduced at runtime.

EduardoRFS · 2021-04-01T11:03:46Z

Yeah I did the same yesterday, seems like storing the coercion is the best idea as it is a simple structure, overall I think it will not be a significant upgrade in size.

EduardoRFS · 2021-04-10T23:00:16Z

This may be a really niche property, but because of this any PPX that is used through staged_pps on Dune, will be able to be executed only once instead of twice. So typed PPXs will probably be be faster

xavierleroy · 2021-04-18T17:09:27Z

Like @lpw25, I dont't like .cmt files much. It's data hoarding: we don't quite know what information external tools need, so we'll dump the whole typed AST to disk, no matter how big and redundant it is, no matter how much disk space it consumes in OPAM package installations, no matter how much time this adds to compilation times. (And it adds more now that marshaling took a big performance hit in preparation for Multicore OCaml.)

So, I'm skeptical that using .cmt files as an intermediate step in builds is a good idea. Expect noticeable slowdowns for essentially zero benefits. If you want typechecking + ability to run tests on the fly, compilation to bytecode does the job already and more efficiently.

xavierleroy · 2021-04-18T17:14:10Z

we'll dump the whole typed AST to disk, no matter how big and redundant it is

I remember reviewing a project on analysis and refactoring of source code that had problems scaling up to large code bases because their Java representation of the annotated AST consumed about 1 kilobyte of disk space for every line of source code. It sounded crazy to me at the time. But I'm not sure .cmt files are doing much better than that today.

rgrinberg · 2021-05-01T23:04:38Z

So, I'm skeptical that using .cmt files as an intermediate step in builds is a good idea. Expect noticeable slowdowns for essentially zero benefits

It was pointed out that type checking is a bottle neck at jane street. So I'm not sure about the slowdowns. I've listed other benefits in the RFC.

I'm in full agreement with the concerns about data hoarding and bloat, but perhaps that discussion is not directly relevant to this RFC. The RFC does not suggest to add any bloat, it only proposes to use the already existing bloat to solve current problems. If there was a way to avoid cmt's and not lose any tooling support, I'm sure nobody would object. Unfortunately, that does not seem to be on the cards in the short term. So why not reuse what's already there to work around current pain points until whatever is going to replace cmt's is ready?

vrotaru · 2021-09-19T10:27:37Z

To add to the tools/progams that use the .cmt files there is also OCamlEditor which uses .cmt(i) files for displaying an outline of OCaml files and displaying expression type on hover

Which is not shown on the screenshot are the type hovers, but you can probably believe me on that. It runs on current OCaml (4.12) version, and will run on 4.13, but on the other side there are probably just 3 people in the world using it (and I'm not sure who the third is), so it should not be such a big concern.

rgrinberg force-pushed the cmt-io branch from af740b5 to 8a23ae1 Compare March 28, 2021 04:29

gasche reviewed Mar 28, 2021

View reviewed changes

rfcs/cmt_input.md Outdated Show resolved Hide resolved

dra27 reviewed Mar 28, 2021

View reviewed changes

rfcs/cmt_input.md Outdated Show resolved Hide resolved

rfcs/cmt_input.md Outdated Show resolved Hide resolved

lpw25 reviewed Mar 28, 2021

View reviewed changes

rgrinberg added 2 commits March 28, 2021 15:46

cmt files as input and output proposal

a8cf597

Signed-off-by: Rudi Grinberg <[email protected]>

change command line according to @dra27

489caa6

Signed-off-by: Rudi Grinberg <[email protected]>

rgrinberg force-pushed the cmt-io branch from 8a23ae1 to 489caa6 Compare March 28, 2021 22:48

Remove conjecturing about type check bottlenecks

c1e0aec

Signed-off-by: Rudi Grinberg <[email protected]>

rgrinberg mentioned this pull request Apr 17, 2021

Compile with cmt files as input ocaml/ocaml#10353

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmt files as input and output #24

cmt files as input and output #24

rgrinberg commented Mar 28, 2021

gasche left a comment

dra27 left a comment

lpw25 commented Mar 28, 2021

lpw25 Mar 28, 2021

rgrinberg commented Mar 28, 2021

nojb commented Mar 28, 2021

lpw25 commented Mar 28, 2021

c-cube commented Mar 28, 2021 via email

rgrinberg commented Mar 28, 2021

EduardoRFS commented Mar 29, 2021

lpw25 commented Mar 29, 2021

rgrinberg commented Mar 31, 2021

rgrinberg commented Mar 31, 2021

EduardoRFS commented Apr 1, 2021 •

edited

Loading

EduardoRFS commented Apr 10, 2021

xavierleroy commented Apr 18, 2021

xavierleroy commented Apr 18, 2021

rgrinberg commented May 1, 2021

vrotaru commented Sep 19, 2021


		## Speed up compilation when type checking is a bottle neck

		This likely doesn't occur very often, but there are cases where type checking is

cmt files as input and output #24

Are you sure you want to change the base?

cmt files as input and output #24

Conversation

rgrinberg commented Mar 28, 2021

gasche left a comment

Choose a reason for hiding this comment

dra27 left a comment

Choose a reason for hiding this comment

lpw25 commented Mar 28, 2021

lpw25 Mar 28, 2021

Choose a reason for hiding this comment

rgrinberg commented Mar 28, 2021

nojb commented Mar 28, 2021

lpw25 commented Mar 28, 2021

c-cube commented Mar 28, 2021 via email

rgrinberg commented Mar 28, 2021

EduardoRFS commented Mar 29, 2021

lpw25 commented Mar 29, 2021

rgrinberg commented Mar 31, 2021

rgrinberg commented Mar 31, 2021

EduardoRFS commented Apr 1, 2021 • edited Loading

EduardoRFS commented Apr 10, 2021

xavierleroy commented Apr 18, 2021

xavierleroy commented Apr 18, 2021

rgrinberg commented May 1, 2021

vrotaru commented Sep 19, 2021

EduardoRFS commented Apr 1, 2021 •

edited

Loading