Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

netgen performance issues #47

Open
antonblanchard opened this issue Jan 14, 2022 · 15 comments
Open

netgen performance issues #47

antonblanchard opened this issue Jan 14, 2022 · 15 comments

Comments

@antonblanchard
Copy link
Contributor

I have a design where netgen is taking over 90 minutes to complete. I notice a few things:

  1. We only ignore fill/tap/fake diode cells on GDS extraction (I'm using DEF extraction). Any reason for that?
    https://github.com/RTimothyEdwards/open_pdks/blob/e4b453599ad0b112e90c89220145c8219dda3cb0/sky130/netgen/sky130_setup.tcl#L335
  2. Even with GDS extraction, we don't ignore decap cells. If I do 1 and 2, netgen drops to taking about 2 minutes. I'm not sure if ignoring them all is dangerous, if so we could have a "go fast" option to use while working on a design, and switch to checking all cells on final tape out.
  3. A lot of the extra time is in CombineParallel(), where for every insertion we walk the entire property list:
            spropfirst = sproplast = NULL;
            for (ob2 = sob; ob2->type > FIRSTPIN || ob2 == sob; ob2 = ob2->next)
               pob = ob2;
            if (ob2->type == PROPERTY) spropfirst = ob2;
            for (; ob2->type == PROPERTY; ob2 = ob2->next)            <------------ HERE
               sproplast = ob2;

I don't quite have my head around how CombineParallel() works, but couldn't we just insert at the front of the list?

@RTimothyEdwards
Copy link
Owner

@antonblanchard :

(1) The dependence of the setup file on the environment variable MAGIC_EXT_USE_GDS was put in by the Openlane developers. I don't like it and I don't think it should be there. I guess if the layout is using abstract views of the standard cells, then these cells should appear in the layout because they're black boxes like all the abstract views. If the layout is using transistor-level views, then tap and fill cells get optimized out of the layout, so they have to be ignored or they'll fail to match since they exist in one netlist but not the other. Decap cells have devices in them, so they won't get optimized out.

(2) Most things that you could do, like placing the decap cell upside down or shorting across its power and ground, are going to be detected elsewhere, so it is probably okay to ignore decap as well. But it's not the same problem as the others, because all decap cells will be properly represented in both netlists. I want my default setup file to compare everything that is possible to be compared. If somebody wants to rewrite the setup script for their own purposes, they can do that.

(3) Almost certainly it can be optimized for speed. I hadn't noticed it being a bottleneck or I would have done something about it already. Mostly I'd like to get rid of all the atrociously inefficient linked lists in netgen's database, but that's a huge code re-write. But usually there are simple enough ways to rewrite the linked list handling so that doesn't blow up on big netlists.

@d-m-bailey
Copy link
Contributor

@antonblanchard @RTimothyEdwards

I believe MAGIC_EXT_USE_GDS was an addition that I implemented for device level LVS and was merged into openlane for the reasons that Tim mentioned. Device level LVS can't compare tap and fill cells, but LEF LVS is the only way to verify that the tap and fill counts match (or maybe it doesn't do this? Tim, when merging parallel black box circuits, is the count important? It seems to me that in most cases, it should be).

What version of magic are you using? Do your fill/tap cell counts match? If the cell counts don't match, netgen may try to flatten them. Recently, there was a code improvement for flattening that reduced device level LVS time from around 90 minutes to 20 minutes for some designs. However, if you're using LEF, the decap cells shouldn't have any devices to flatten.

You mentioned the CombineParallel routine, but I was under the impression that LEF/DEF based LVS does not have any low level devices, and thus no property records. Am I missing something?

How many cells are in the design (including fill, tap, and decap)? netgen has a fixed hash size that may (or may not) be relevant.

There has been some discussion of changing MAGIC_EXT_USE_GDS to some other name since the LVS on mag views has a related problem. See RTimothyEdwards/open_pdks#206

Any suggestions?

@RTimothyEdwards
Copy link
Owner

@antonblanchard : The code can be tricky, but yes, it is possible to insert properties at the beginning of the list instead of at the end, and avoid the ever-increasing search for the end of the property list. I just pushed a correction to the code which should do this (netgen version 1.5.218, just pushed to the opencircuitdesign.com repository). Please try it out and let me know by how much it reduces the runtime for your example.

@RTimothyEdwards
Copy link
Owner

@d-m-bailey : The device count itself is a property, for the reason that you have to be able to match a transistor (W = 1, M = 2) to a transistor (W = 2, M = 1). The best way to handle that is to combine all parallel devices up front and keep a list of all unique sets of properties for those devices, including the number of devices sharing the same (other) properties. For simple subcircuits like standard cells that don't have any other properties, they just accumulate a bunch of records indicating a property "M = 1". These get combined later during device matching by adding them all up and replacing them with a single property. This may not be the most efficient way to do it, but as long as the runtime doesn't grow exponentially with the size of the circuit, I'm not going to spend the time figuring out if I can optimize it further.

@RTimothyEdwards
Copy link
Owner

@d-m-bailey : It might be helpful for the subcircuit summary to count out the total number of devices. For example, all digital synthesized circuits in sky130 will collapse to decap_X = 1 (e.g., sky130_fd_sc_hd__decap_3 (1) in the output). This output is misleading because it suggests that the subcircuit has only one decap_3 cell in it, instead of potentially thousands. It would probably be better to output something like sky130_fd_sc_hd__decap_3 (1) (M=1032) or even just sky130_fd_sc_hd__decap_3 (1032). The latter form could be misleading as well, though, because in the case I mentioned above for a transistor (W = 1, M = 2) vs. (W = 2, M = 1), the subcircuit summary would say there are two devices on the left and one on the right, but insist that this is a match. An alternative method would be where it outputs "Merged 1032 devices" to break that out into a count per device type ("Merged 1032 sky130_fd_sc_hd__decap_3 devices").

@antonblanchard
Copy link
Contributor Author

@RTimothyEdwards @d-m-bailey thanks for the extensive replies. I tried the patch, and it does make a big difference - 95 minutes down to 17 minutes, over 5x faster. Very nice.

@antonblanchard
Copy link
Contributor Author

I found another O(n^2) issue, this time in PropertyOptimize():

    if (comb == FALSE) {
       for (i = 0; i < run - 1; i++) {
         for (j = i + 1; j < run; j++) {

Could we exit out once we've run out of things to merge? Something like:

diff --git a/base/netcmp.c b/base/netcmp.c
index 9b7456d..76d7dcd 100644
--- a/base/netcmp.c
+++ b/base/netcmp.c
@@ -4860,6 +4860,7 @@ int PropertyOptimize(struct objlist *ob, struct nlist *tp, int run, int series,
    // Now combine records with same properties by summing M (S).
    if (comb == FALSE) {
       for (i = 0; i < run - 1; i++) {
+        int nr_empty = 0;
         for (j = i + 1; j < run; j++) {
            pmatch = 0;
            for (p = 1; p < pcount; p++) {
@@ -4982,9 +4983,14 @@ int PropertyOptimize(struct objlist *ob, struct nlist *tp, int run, int series,
               else if (vlist[0][j]->value.ival > 0) {
                  vlist[0][i]->value.ival += vlist[0][j]->value.ival;
                  vlist[0][j]->value.ival = 0;
+              } else {
+                 nr_empty++;
               }
            }
         }
+        // Were all potential matches empty?
+        if (nr_empty == (run-(i+1)))
+           break;
       }
    }

I'm not sure if that is safe to do, but my test case continues to work and it takes the rundown down from 17 minutes to just under 3 minutes. I can imagine there is a worst case where we have to iterate the list n^2 times, but it would seem a common case where we'll only take a few iterations (or even 1).

@antonblanchard
Copy link
Contributor Author

What version of magic are you using? Do your fill/tap cell counts match? If the cell counts don't match, netgen may try to flatten them. Recently, there was a code improvement for flattening that reduced device level LVS time from around 90 minutes to 20 minutes for some designs. However, if you're using LEF, the decap cells shouldn't have any devices to flatten.

I'm using the most recent openlane images (eg 2022.01.13_01.51.43), so both magic and netgen are a month or two old. The fill and tap counts do look to match.

How many cells are in the design (including fill, tap, and decap)? netgen has a fixed hash size that may (or may not) be relevant.

There's 722953 devices, but it's overwhelmingly decap, tap and fill cells. After the optimisation phase I have 69363 devices.

There has been some discussion of changing MAGIC_EXT_USE_GDS to some other name since the LVS on mag views has a related problem. See RTimothyEdwards/open_pdks#206

Thanks. I didn't full appreciate the differences between transistor level extraction, and black box device level extraction, but this discussion has helped.

@antonblanchard
Copy link
Contributor Author

(2) Most things that you could do, like placing the decap cell upside down or shorting across its power and ground, are going to be detected elsewhere, so it is probably okay to ignore decap as well. But it's not the same problem as the others, because all decap cells will be properly represented in both netlists. I want my default setup file to compare everything that is possible to be compared. If somebody wants to rewrite the setup script for their own purposes, they can do that.

That makes sense. I made the suggestion before understanding why GDS based extraction has to ignore some cells, and fixing the performance issues where we find them is the best way forward, not removing checking.

@d-m-bailey
Copy link
Contributor

d-m-bailey commented Jan 16, 2022

It would probably be better to output something like sky130_fd_sc_hd__decap_3 (1) (M=1032) or even just sky130_fd_sc_hd__decap_3 (1032).

@RTimothyEdwards How about sky130_fd_sc_hd__decap_3 (1032->1)?

@RTimothyEdwards
Copy link
Owner

@d-m-bailey : Sounds reasonable. I went ahead and implemented it as you suggest and pushed to the repository.

@RTimothyEdwards
Copy link
Owner

@antonblanchard : Re: PropertyOptimize(): I'll take a look at that.

@RTimothyEdwards
Copy link
Owner

@antonblanchard : I can't find anything wrong with your code change, so I'm going ahead and implementing it and pushing to the repository. From 90 minutes down to 3 minutes run time is good progress for a couple of days' work!

RTimothyEdwards added a commit that referenced this issue Jan 17, 2022
summary, so that the summary lists the total number of devices as well
as the number of devices after parallel optimization, in the form
"device_name (M->N)", where "M" is the total number of devices, and
"N" is the number of devices after parallel combination.  This makes
the output somewhat more meaningful to the end user.  Implementation
as discussed in github issue #47.
RTimothyEdwards added a commit that referenced this issue Jan 17, 2022
by Anton Blanchard, which prevents the double-loop in the
PropertyOptimize() routine from continuing the outer loop if
all devices in the run have already been merged.
@antonblanchard
Copy link
Contributor Author

Thanks @RTimothyEdwards!

@antonblanchard
Copy link
Contributor Author

Considering the fixes for the identified issues are upstream, I'll close this. If I do uncover a new performance opportunity, I'll open a new bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants