-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathindustrial-use.html
570 lines (449 loc) · 20.8 KB
/
industrial-use.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
<!DOCTYPE html>
<HTML>
<!-- Mirrored from lamport.azurewebsites.net/tla/industrial-use.html by HTTrack Website Copier/3.x [XR&CO'2014], Thu, 26 Mar 2020 22:31:24 GMT -->
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<META NAME="GENERATOR" CONTENT="Mozilla/4.05 [en] (X11; I; OSF1 V4.0 alpha)
[Netscape]">
<!--
%&b&<b>#</b>&
%&c& <b>#</b> &
-->
<!-- The following loads the style sheet for the html files of
the tla web site -->
<link rel="stylesheet" type="text/css" href="tlaweb.css">
<script src="tlaweb.js"> </script>
<!-- The following causes the name of this page in the left-hand column
not to have a link -->
<SCRIPT>
noLinkName = "Industrial Use" ;
</SCRIPT>
<title>Industrial Use of TLA+</title>
</HEAD>
<BODY onload="initialize()">
<table id="main">
<tr>
<td id="main_leftcolumn" >
</td>
<td id="main_contentcolumn">
<table>
<tr >
<td style="vertical-alight:top">
<div id = "showleftcol" > </div>
<H1>Industrial Use of TLA+</H1>
<p style="margin-top:-8px; margin-bottom:-18px">
Leslie Lamport<p>
<font size=-1><I> Last modified on 4 December 2018</I></font>
</td>
<!-- td style="vertical-alight:top;width:auto" -->
<!-- img src="tla-logo.png" style="width:170px;margin-top:14px"> </img -->
<!-- /td -->
</tr>
</table>
<hr style="margin-bottom:-5px;margin-top:-11px">
<P style="margin-top:0px"> </P>
<DIV class="hidden-div" style="color:red;margin-bottom:-22px"><b>
You'll miss a lot on this web site unless you enable Javascript
in your browser. </b></DIV>
<H2 id ="h2intro"
class="show-hide" onclick="showHide('hide-intro','intro')">Introduction
<font
id="hide-intro" >
[show]</font>
</H2>
<DIV id="intro" class="hidden-div">
It's hard to find out about the use of TLA+ in industry.
Companies usually don't talk about their engineering process, and any
specifications they write are almost always proprietary. I
happened to have contacts at Intel and Amazon who were willing to
describe how they used TLA+, and Amazon even published a couple of
papers about it. And of course, I know about much of the TLA+ use
at Microsoft. I learned about its use in building OpenComRTOS
after its developers published a book about it. The
<i>Other Use</i>
section below describes a few smaller examples of TLA+ use that I've learned
about.
</DIV>
<H2 id="h2intel" class="show-hide"
onclick="showHide('hide-intel','intel')">Intel
<font
id="hide-intel" >
[show]</font>
</H2>
<DIV id="intel" class="hidden-div">
TLA+ was first used in industry to model hardware. Hardware engineers
are highly motivated to eliminate bugs, and they understand that this
requires thinking hard about what they're building before they begin
implementing it in silicon.
<p>
The early use of TLA+ at Intel is described
in
<a href=
"intel-excerpt0b02.html?back-link=industrial-use.html?unhideBut@EQhide-intel@AMPunhideDiv@EQintel">this section</a>
of a paper, written in 2002. It provides an excellent account of
how TLA+ fit into their design process.
<p>
A picture of TLA+ use in 2008 is provided by the paper:
<blockquote>
<b>Pre-RTL formal Verification:
An Intel Experience</b><br>
Robert Beers <br>
<em>DAC '08: Proceedings of the 45th annual Design Automation Conference,</em><br>
Pages 806-822<br>
<a href="http://dl.acm.org/citation.cfm?id=1391675">available on-line</a>
</blockquote>
It explains how a formal verification (FV) team used TLA+ in a project
to design and implement a new cache-coherence protocol for a processor
chip. The team validated the protocol and parts of the register
transfer language (RTL) implementation. Here is an excerpt from that
paper that describes the results of the pre-RTL FV effort.
<blockquote>
The FV team filed 45 <q>issues</q> that the project used before RTL to
track engineering problems requiring special attention to solve.
A design lead for one of the verified microarchitectures, when asked
for about the FV work, said, <q>They found hundreds of bugs before RTL
and played an important role in creating a stable
microarchitecture.</q> This microarchitecture had the lowest
ratio of bugs per line of RTL even though the unit was new. On
RTL and silicon there have been no coherence protocol or architecture
feature bugs. In the areas covered by pre-RTL FV, only one
microarchitecture bug has been found on silicon, attributed to
accepting an unfounded claim that the affected logic would never
interact with particular microarchitecture protocols.
</blockquote>
TLA+ was still being used at Intel in 2014, but I have no
information about what has happened there since then.
</DIV>
<H2 id="h2amazon" class="show-hide" onclick="showHide('hide-amazon','amazon')">Amazon
<font
id="hide-amazon" >
[show]</font>
</H2>
<DIV id="amazon" class="hidden-div">
Amazon Web Services (AWS) has been using TLA+ since 2011. In April,
2015, six engineers published an article describing how it was introduced
to AWS and how it was being used there. Here are the <q>key insights</q>
listed in the article.
<ul>
<li> Formal methods find bugs in system
designs that cannot be found through
any other technique we know of. </li>
<li style="margin-top:7px"> Formal methods are surprisingly feasible
for mainstream software development
and give good return on investment.</li>
<li style="margin-top:7px"> At Amazon, formal methods are routinely
applied to the design of complex real-world software, including public
cloud services.</li> </ul>
This article provides an excellent picture of TLA+ use in an
industrial environment, describing its benefits and what it can't do.
A highly abridged version of the article, a pointer to the complete
article, and a pointer to another article about TLA+ use at Amazon can
be found
<a href=
"amazon-excerpt4033.html?back-link=industrial-use.html?unhideBut@EQhide-amazon@AMPunhideDiv@EQamazon">here</a>.
<!-- a href="amazon-excerpt.html" target="_blank">here</a -->
James Hamilton has a
<a href="https://perspectives.mvdirona.com/2014/07/challenges-in-designing-at-scale-formal-methods-in-building-robust-distributed-systems/">blog
post</a> and Tim Rath gave a
<a href="https://www.infoq.com/presentations/aws-testing-tla">lecture</a>
about the use of TLA+ at AWS.
<p>
Here is what Chris Newcombe, the lead author of the articles, has
written
<a href="https://groups.google.com/forum/#!searchin/tlaplus/professional$20career/tlaplus/ZJCi-UF31fc/Mawvwi6U1CYJ">elsewhere</a>:
<blockquote>
TLA+ is the most valuable thing that I've learned in my professional
career. It has changed how I work, by giving me an immensely powerful
tool to find subtle flaws in system designs. It has changed how I
think, by giving me a framework for constructing new kinds of
mental-models, by revealing the precise relationship between
correctness properties and system designs, and by allowing me to move
from `plausible prose' to precise statements much earlier in the
software development process.
</blockquote>
</DIV>
<h2 id="h2microsoft" class="show-hide" onclick="showHide('hide-microsoft','microsoft')">
Microsoft
<font
id="hide-microsoft" >
[show]</font>
</H2>
<DIV id="microsoft" class="hidden-div">
TLA+ was used sporadically at Microsoft beginning around 2004.
Here is one of those uses as reported by the late
<a href="https://en.wikipedia.org/wiki/Charles_P._Thacker" target="_blank">Chuck
Thacker</a>.
<blockquote>
During the development of the Xbox 360, I was working
with the engineering group on the memory coherence protocol.
Working with an intern, we were developing a TLA+ model for the
protocol. In the course of this, we discovered a very subtle
bug. We reported it to IBM, and they told us it couldn't
happen. A couple of weeks later, they relented and told us that
not only was the bug real, but that their regression tests wouldn't
have found it. Had they not fixed it, they would have shipped us
chips that would have deadlocked after about four hours of use. Had
this [occurred], the schedule for a Christmas launch would almost
certainly have been missed.
</blockquote>
Starting around 2015, use at Microsoft increased—especially in
Azure, Microsoft's cloud service.
Here are some abridged and lightly edited descriptions of TLA+ use
written by members of the Azure team. Some of the
TLA+ specifications mentioned were actually written in PlusCal.
<p>
Albert Greenberg, Corporate
Vice President of Azure Networking, writes:
<blockquote>
<!-- In Azure Networking, our mission is to
build the world's most reliable cloud. Our components are at the
bottom of the cloud software stack, the infrastructure of the
infrastructure. In everything we build, we think about the
fundamentals (security, availability, supportability, and performance)
to scale the network, it's features and services.-->
Our engineering process moves validation as early as possible in the
engineering lifecycle — we strive to detect errors sooner to reduce
risk, increase agility, and provide fast feedback to developers.
<!-- That is, we strive to reduce the distance to flaws. -->
Among the key tools we use is <!--are TLA and--> the TLA+
specification language — enabling us to describe our components
formally with simple mathematics, with as little overhead as possible
beyond what is needed to write the mathematics precisely.
<p>
We have found TLA+ especially well-suited for writing high-level
specifications of concurrent and distributed systems: concretely
defining the problem, providing a specification close to the desired
implementation, and proving that successive refinements implement the
specification — or else finding the flaws and correcting the design.
Among the projects in Azure Networking where we have applied TLA+ are
Azure DNS (record propagation), RingMaster (distributed global
replication and checkpoint coordination), Distributed Load Shedding,
and Macsec (encryption key rollover orchestration). For example, in
RingMaster, <!-- Global Replication, --> we identified a bug that
showed up in intermittent failures of unit tests but was
devilishly hard to localize in the code. We <!-- implemented --> wrote
a specific TLA+ specification of the desired behavior, quickly found
the bug in the logical design, and fixed the code.
<!-- again simply. -->
</blockquote>
Dharma Shukla, Microsoft Technical Fellow, writes this about
TLA+ use in Azure Cosmos DB, Microsoft's globally distributed
database service:
<blockquote>
<!-- Azure Cosmos DB is Microsoft's globally distributed database service.
Cosmos DB offers elastic scalability of both writes and reads all
around the world with guaranteed single digit millisecond latency at
the 99th percentile and 99.999% high availability. The service
transparently replicates the data and provides a single system image
of globally distributed Cosmos database with a choice of five
well-defined consistency models (precisely specified using TLA+),
while users write and read to local replicas anywhere in the world.
As a globally distributed database service, Cosmos DB provides SLAs
for consistency, latency, throughput and availability to its users.
-->
The Cosmos DB engineering team have been using TLA+ for specifying and
validating the correctness of the core algorithms as well as the
high-level specifications of the five consistency models that the
system offers to our customers. We have found TLA+ extremely useful in
two fundamental ways:
<ol>
<li> <!-- TLA+ is an invaluable tool -->
For designing distributed algorithms.
For example,
<!-- last year, -->
while designing the algorithm for the multi-location
write capability, we found a critical correctness bug that violated an
important safety property of the system. The bug was found while
writing the TLA+ spec, and we were able to correct the issue
<!-- - even -->
before writing a single line of code.
</li>
<li>
Precisely specifying the guarantees provided by an algorithm or
system.
<!-- We have found TLA+ beneficial not only just for specifying
the high-level design of the system but also for specifying the
guarantees the system provides to our users.
-->
For example, Cosmos DB
has used TLA+ <!-- as a language -->
to specify the guarantees provided by the five consistency models the service
offers to its users. These specifications are made available to our
users. <!-- here-->
</li>
</ol>
</blockquote>
Cheng Huang, Principle Software Engineer Manager, writes:
<blockquote>
TLA+ uncovered a safety violation even in our most confident
implementation. We had a lock-free data structure implementation
which was carefully design & implemented, went through
thorough code review, and was tested under stress for many days. As a
result, we had high confidence about the implementation. We
eventually decided to write a TLA+ spec, not to verify correctness,
but to allow team members to learn and practice PlusCal. So, when the
model checker reported a safety violation, it really caught us by
surprise. This experience has become the aha moment for many team
members and their de facto testimonial about TLA+.
<p>
Our system manages many Paxos rings and dynamically reconfigures these
rings on the fly. TLA+ helped us realize that getting this correct is
way more subtle than it appears. We were very glad to have uncovered
a safety violation which would have caused data loss <!-- for cloud
storage business --> (for customers) upon a specific ordered sequence
of events during the recovery from <!-- DC-wide--> a power failure.
We couldn't imagine that this would have been discovered through any of our
testing efforts.
<p>
TLA+ is not yet a prerequisite for our hiring. However, a candidate's
knowledge of TLA+ is given significant weight in our evaluation. To
us, it is a great indicator of those who really care about quality and
correctness. <!-- deep in their heart. -->
</blockquote>
</DIV>
<h2 id="h2rtos" class="show-hide"
onclick="showHide('hide-rtos','rtos')"><a name="OpenComRTOS">
OpenComRTOS</a>
<font
id="hide-rtos" >
[show]</font>
</h2>
<DIV id="rtos" class="hidden-div">
The European Space Agency's <em>Rosetta</em>
<img src="verhulst-book.png" style="width:100px;float:right;margin-left:10px">
spacecraft, which flew to a comet, used a real-time operating system
called <em>Virtuoso</em> to control some of its instruments.
The next version of that operating system, called
<em>OpenComRTOS</em>, was developed using TLA+ as described in this book:
<blockquote>
<b><a href="https://www.springer.com/gp/book/9781441997357">Formal
Development of a Network-Centric RTOS</a></b><br>
Eric Verhulst,
Raymond T. Boute,
José Miguel Sampaio Faria,<br>
Bernard H. C. Sputh,
and Vitaliy Mezhuyev<br>
<i>Springer, 2011</i>
</blockquote>
Eric Verhulst, the head of the group that developed OpenComRTOS, wrote
this
in an email to me:
<blockquote>
The [TLA+] abstraction helped a lot in
coming to a much cleaner architecture (we witnessed first hand the
brain washing done by years of C programming). One of the results was
that the code size is about 10x less than [in Virtuoso].
</blockquote>
</DIV>
<h2 id="h2other" class="show-hide"
onclick="showHide('hide-other','other')"><a name="otherUse">Other Use</a>
<font
id="hide-other" >
[show]</font>
</H2>
<DIV id="other" class="hidden-div">
I have seen a number of reports of TLA+ (and PlusCal) being used
in real life. Some have been sent to me; some I found by simply
searching the web for <q>TLA+</q>. Here are a few of them.
<p>
A web posting titled
<a href="https://rix0r.nl/essays/2015/08/25/model-checking-for-the-working-man-mf/">Model
Checking for the Working Man (m/f)</a>
by a programmer identified only as <i>Rico</i> describes how he was led
to try TLA+ for the first time, and his experience using it.
Here is a brief extract.
<blockquote>
What I was expecting was to wrestle with this for a good
couple of days, but actually this tool chain is so easy to get started
with, I got our algorith[m] modeled, the problem detected and the
solution verified, all within half a day!
<p>
Let me say that again:
<blockquote>
TLA+ is fantastically easy to use!
</blockquote>
... This tool is so easy to use and so effective, you cant
afford to not use it in your development cycle somewhere.
</blockquote>
By <q>TLA+</q> he meant PlusCal. It's not as easy to solve a real
problem with TLA+ when you start by knowing nothing about it, and he
found TLA+ to be <q>cumbersome</q>. But it's worth learning TLA+ if
you're building a complex distributed system.
<p>
<a href="https://news.ycombinator.com/">Hacker News</a> has a <a
href="https://news.ycombinator.com/item?id=9601770">web page about
TLA+</a>. Here is part of a posting on it from 8 July 2015:
<blockquote>
One of my coworkers fleshed out an important and complex component
that absolutely <i>had</i> to be logically correct, by using TLA+ modelling.
It was the first time he'd used TLA+ ... at all, and it took a
bit of experimenting to get used to it, but he soon generated a solid
logical model and exposed a number of fringe flaws in the original
proposed solution.
<p>
From TLA+ model to completed Java application took very little time at
all, the logic was all there already, it just needed [to be] fleshed out in
the language. It was also easy to split the work out amongst multiple
programmers. He's argued that the total time, including learning
TLA+, took less time than writing the application from scratch in Java
and discovering the bugs as they went along.
</blockquote>
In November, 2016, Bogdan Munteanu sent me email describing TLA+
use at Dropbox. Here is an extract from that email.
<blockquote>
[An] engineer decided to learn/experiment with [TLA+], so I provided
some guidance and feedback. He wrote the formal spec for one of our
two-phase commit protocols that we knew would fail in certain
real-world situations. His spec found the issue ... and the engineer
found the process extremely valuable. He mentioned that learning TLA+
was not at all a steep curve — it took him longer to understand the
actual protocol.
<p>
About a month ago, another engineer designed a deadlock detection
algorithm for one of the new distributed protocols at Dropbox. After
hearing about the success of the previous TLA+ project, he decided to
write a spec for this new protocol with the help of the other
engineer. They found a bug in the protocol, fixed it, then confirmed
the fix by re-running the model checker. In both cases the spec was
written in PlusCal. Everyone was really impressed by the results, and
for most came as a surprise because few have heard of formal
verification before.
</blockquote>
In <a href="https://groups.google.com/forum/#!topic/tlaplus/unVsjY7Aefc">a
TLA+ Google group posting</a>, Hillel Wayne reports this story
of
<a href="https://github.com/elastic/elasticsearch/issues/31976#ISSUECOMMENT-404722753">a bug in Elastic Search (ES)</a>:
<blockquote>
<ol>
<li>ES has been running in production for a long time.</li>
<li>In March, David Turner wrote a TLA+ model and discovered
a bug with ES 6.2 that nobody's yet run into.</li>
<li>In April, they preemptively fix the bug and update the
version to 6.3.</li>
<li>Two days ago somebody on 6.2.4 reports a bizarre concurrency bug.</li>
<li>After investigation, the source of the bug exactly
matches the discovered TLA+ edge case.</li>
</ol>
</blockquote>
</DIV>
</td>
</tr>
<!-- Bottom Back button -->
<tr>
<td>
<a class="back-link" style="display:none" href="#">
<p style="margin-top:-50px"><b>Back</b>
</p>
</a>
</td>
</tr>
</table>
</BODY>
<!-- Mirrored from lamport.azurewebsites.net/tla/industrial-use.html by HTTrack Website Copier/3.x [XR&CO'2014], Thu, 26 Mar 2020 22:31:29 GMT -->
</HTML>