netgen performance issues #47

antonblanchard · 2022-01-14T23:17:39Z

I have a design where netgen is taking over 90 minutes to complete. I notice a few things:

We only ignore fill/tap/fake diode cells on GDS extraction (I'm using DEF extraction). Any reason for that?
https://github.com/RTimothyEdwards/open_pdks/blob/e4b453599ad0b112e90c89220145c8219dda3cb0/sky130/netgen/sky130_setup.tcl#L335
Even with GDS extraction, we don't ignore decap cells. If I do 1 and 2, netgen drops to taking about 2 minutes. I'm not sure if ignoring them all is dangerous, if so we could have a "go fast" option to use while working on a design, and switch to checking all cells on final tape out.
A lot of the extra time is in CombineParallel(), where for every insertion we walk the entire property list:

            spropfirst = sproplast = NULL;
            for (ob2 = sob; ob2->type > FIRSTPIN || ob2 == sob; ob2 = ob2->next)
               pob = ob2;
            if (ob2->type == PROPERTY) spropfirst = ob2;
            for (; ob2->type == PROPERTY; ob2 = ob2->next)            <------------ HERE
               sproplast = ob2;

I don't quite have my head around how CombineParallel() works, but couldn't we just insert at the front of the list?

The text was updated successfully, but these errors were encountered:

RTimothyEdwards · 2022-01-15T02:48:12Z

@antonblanchard :

(1) The dependence of the setup file on the environment variable MAGIC_EXT_USE_GDS was put in by the Openlane developers. I don't like it and I don't think it should be there. I guess if the layout is using abstract views of the standard cells, then these cells should appear in the layout because they're black boxes like all the abstract views. If the layout is using transistor-level views, then tap and fill cells get optimized out of the layout, so they have to be ignored or they'll fail to match since they exist in one netlist but not the other. Decap cells have devices in them, so they won't get optimized out.

(2) Most things that you could do, like placing the decap cell upside down or shorting across its power and ground, are going to be detected elsewhere, so it is probably okay to ignore decap as well. But it's not the same problem as the others, because all decap cells will be properly represented in both netlists. I want my default setup file to compare everything that is possible to be compared. If somebody wants to rewrite the setup script for their own purposes, they can do that.

(3) Almost certainly it can be optimized for speed. I hadn't noticed it being a bottleneck or I would have done something about it already. Mostly I'd like to get rid of all the atrociously inefficient linked lists in netgen's database, but that's a huge code re-write. But usually there are simple enough ways to rewrite the linked list handling so that doesn't blow up on big netlists.

d-m-bailey · 2022-01-15T03:42:51Z

@antonblanchard @RTimothyEdwards

I believe MAGIC_EXT_USE_GDS was an addition that I implemented for device level LVS and was merged into openlane for the reasons that Tim mentioned. Device level LVS can't compare tap and fill cells, but LEF LVS is the only way to verify that the tap and fill counts match (or maybe it doesn't do this? Tim, when merging parallel black box circuits, is the count important? It seems to me that in most cases, it should be).

What version of magic are you using? Do your fill/tap cell counts match? If the cell counts don't match, netgen may try to flatten them. Recently, there was a code improvement for flattening that reduced device level LVS time from around 90 minutes to 20 minutes for some designs. However, if you're using LEF, the decap cells shouldn't have any devices to flatten.

You mentioned the CombineParallel routine, but I was under the impression that LEF/DEF based LVS does not have any low level devices, and thus no property records. Am I missing something?

How many cells are in the design (including fill, tap, and decap)? netgen has a fixed hash size that may (or may not) be relevant.

There has been some discussion of changing MAGIC_EXT_USE_GDS to some other name since the LVS on mag views has a related problem. See RTimothyEdwards/open_pdks#206

Any suggestions?

RTimothyEdwards · 2022-01-15T17:10:57Z

@antonblanchard : The code can be tricky, but yes, it is possible to insert properties at the beginning of the list instead of at the end, and avoid the ever-increasing search for the end of the property list. I just pushed a correction to the code which should do this (netgen version 1.5.218, just pushed to the opencircuitdesign.com repository). Please try it out and let me know by how much it reduces the runtime for your example.

RTimothyEdwards · 2022-01-15T17:23:19Z

@d-m-bailey : The device count itself is a property, for the reason that you have to be able to match a transistor (W = 1, M = 2) to a transistor (W = 2, M = 1). The best way to handle that is to combine all parallel devices up front and keep a list of all unique sets of properties for those devices, including the number of devices sharing the same (other) properties. For simple subcircuits like standard cells that don't have any other properties, they just accumulate a bunch of records indicating a property "M = 1". These get combined later during device matching by adding them all up and replacing them with a single property. This may not be the most efficient way to do it, but as long as the runtime doesn't grow exponentially with the size of the circuit, I'm not going to spend the time figuring out if I can optimize it further.

RTimothyEdwards · 2022-01-15T17:36:31Z

@d-m-bailey : It might be helpful for the subcircuit summary to count out the total number of devices. For example, all digital synthesized circuits in sky130 will collapse to decap_X = 1 (e.g., sky130_fd_sc_hd__decap_3 (1) in the output). This output is misleading because it suggests that the subcircuit has only one decap_3 cell in it, instead of potentially thousands. It would probably be better to output something like sky130_fd_sc_hd__decap_3 (1) (M=1032) or even just sky130_fd_sc_hd__decap_3 (1032). The latter form could be misleading as well, though, because in the case I mentioned above for a transistor (W = 1, M = 2) vs. (W = 2, M = 1), the subcircuit summary would say there are two devices on the left and one on the right, but insist that this is a match. An alternative method would be where it outputs "Merged 1032 devices" to break that out into a count per device type ("Merged 1032 sky130_fd_sc_hd__decap_3 devices").

antonblanchard · 2022-01-15T22:52:22Z

@RTimothyEdwards @d-m-bailey thanks for the extensive replies. I tried the patch, and it does make a big difference - 95 minutes down to 17 minutes, over 5x faster. Very nice.

antonblanchard · 2022-01-16T11:06:47Z

I found another O(n^2) issue, this time in PropertyOptimize():

    if (comb == FALSE) {
       for (i = 0; i < run - 1; i++) {
         for (j = i + 1; j < run; j++) {

Could we exit out once we've run out of things to merge? Something like:

diff --git a/base/netcmp.c b/base/netcmp.c
index 9b7456d..76d7dcd 100644
--- a/base/netcmp.c
+++ b/base/netcmp.c
@@ -4860,6 +4860,7 @@ int PropertyOptimize(struct objlist *ob, struct nlist *tp, int run, int series,
    // Now combine records with same properties by summing M (S).
    if (comb == FALSE) {
       for (i = 0; i < run - 1; i++) {
+        int nr_empty = 0;
         for (j = i + 1; j < run; j++) {
            pmatch = 0;
            for (p = 1; p < pcount; p++) {
@@ -4982,9 +4983,14 @@ int PropertyOptimize(struct objlist *ob, struct nlist *tp, int run, int series,
               else if (vlist[0][j]->value.ival > 0) {
                  vlist[0][i]->value.ival += vlist[0][j]->value.ival;
                  vlist[0][j]->value.ival = 0;
+              } else {
+                 nr_empty++;
               }
            }
         }
+        // Were all potential matches empty?
+        if (nr_empty == (run-(i+1)))
+           break;
       }
    }

I'm not sure if that is safe to do, but my test case continues to work and it takes the rundown down from 17 minutes to just under 3 minutes. I can imagine there is a worst case where we have to iterate the list n^2 times, but it would seem a common case where we'll only take a few iterations (or even 1).

antonblanchard · 2022-01-16T11:31:02Z

What version of magic are you using? Do your fill/tap cell counts match? If the cell counts don't match, netgen may try to flatten them. Recently, there was a code improvement for flattening that reduced device level LVS time from around 90 minutes to 20 minutes for some designs. However, if you're using LEF, the decap cells shouldn't have any devices to flatten.

I'm using the most recent openlane images (eg 2022.01.13_01.51.43), so both magic and netgen are a month or two old. The fill and tap counts do look to match.

How many cells are in the design (including fill, tap, and decap)? netgen has a fixed hash size that may (or may not) be relevant.

There's 722953 devices, but it's overwhelmingly decap, tap and fill cells. After the optimisation phase I have 69363 devices.

There has been some discussion of changing MAGIC_EXT_USE_GDS to some other name since the LVS on mag views has a related problem. See RTimothyEdwards/open_pdks#206

Thanks. I didn't full appreciate the differences between transistor level extraction, and black box device level extraction, but this discussion has helped.

antonblanchard · 2022-01-16T11:35:47Z

(2) Most things that you could do, like placing the decap cell upside down or shorting across its power and ground, are going to be detected elsewhere, so it is probably okay to ignore decap as well. But it's not the same problem as the others, because all decap cells will be properly represented in both netlists. I want my default setup file to compare everything that is possible to be compared. If somebody wants to rewrite the setup script for their own purposes, they can do that.

That makes sense. I made the suggestion before understanding why GDS based extraction has to ignore some cells, and fixing the performance issues where we find them is the best way forward, not removing checking.

d-m-bailey · 2022-01-16T15:39:03Z

It would probably be better to output something like sky130_fd_sc_hd__decap_3 (1) (M=1032) or even just sky130_fd_sc_hd__decap_3 (1032).

@RTimothyEdwards How about sky130_fd_sc_hd__decap_3 (1032->1)?

RTimothyEdwards · 2022-01-16T19:20:07Z

@d-m-bailey : Sounds reasonable. I went ahead and implemented it as you suggest and pushed to the repository.

RTimothyEdwards · 2022-01-16T19:20:34Z

@antonblanchard : Re: PropertyOptimize(): I'll take a look at that.

RTimothyEdwards · 2022-01-16T19:47:08Z

@antonblanchard : I can't find anything wrong with your code change, so I'm going ahead and implementing it and pushing to the repository. From 90 minutes down to 3 minutes run time is good progress for a couple of days' work!

summary, so that the summary lists the total number of devices as well as the number of devices after parallel optimization, in the form "device_name (M->N)", where "M" is the total number of devices, and "N" is the number of devices after parallel combination. This makes the output somewhat more meaningful to the end user. Implementation as discussed in github issue #47.

by Anton Blanchard, which prevents the double-loop in the PropertyOptimize() routine from continuing the outer loop if all devices in the run have already been merged.

antonblanchard · 2022-01-17T10:59:55Z

Thanks @RTimothyEdwards!

antonblanchard · 2022-02-12T21:49:59Z

Considering the fixes for the identified issues are upstream, I'll close this. If I do uncover a new performance opportunity, I'll open a new bug.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

netgen performance issues #47

netgen performance issues #47

antonblanchard commented Jan 14, 2022

RTimothyEdwards commented Jan 15, 2022

d-m-bailey commented Jan 15, 2022

RTimothyEdwards commented Jan 15, 2022

RTimothyEdwards commented Jan 15, 2022

RTimothyEdwards commented Jan 15, 2022

antonblanchard commented Jan 15, 2022

antonblanchard commented Jan 16, 2022

antonblanchard commented Jan 16, 2022

antonblanchard commented Jan 16, 2022

d-m-bailey commented Jan 16, 2022 •

edited

Loading

RTimothyEdwards commented Jan 16, 2022

RTimothyEdwards commented Jan 16, 2022

RTimothyEdwards commented Jan 16, 2022

antonblanchard commented Jan 17, 2022

antonblanchard commented Feb 12, 2022

netgen performance issues #47

netgen performance issues #47

Comments

antonblanchard commented Jan 14, 2022

RTimothyEdwards commented Jan 15, 2022

d-m-bailey commented Jan 15, 2022

RTimothyEdwards commented Jan 15, 2022

RTimothyEdwards commented Jan 15, 2022

RTimothyEdwards commented Jan 15, 2022

antonblanchard commented Jan 15, 2022

antonblanchard commented Jan 16, 2022

antonblanchard commented Jan 16, 2022

antonblanchard commented Jan 16, 2022

d-m-bailey commented Jan 16, 2022 • edited Loading

RTimothyEdwards commented Jan 16, 2022

RTimothyEdwards commented Jan 16, 2022

RTimothyEdwards commented Jan 16, 2022

antonblanchard commented Jan 17, 2022

antonblanchard commented Feb 12, 2022

d-m-bailey commented Jan 16, 2022 •

edited

Loading