-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
netgen performance issues #47
Comments
(1) The dependence of the setup file on the environment variable MAGIC_EXT_USE_GDS was put in by the Openlane developers. I don't like it and I don't think it should be there. I guess if the layout is using abstract views of the standard cells, then these cells should appear in the layout because they're black boxes like all the abstract views. If the layout is using transistor-level views, then tap and fill cells get optimized out of the layout, so they have to be ignored or they'll fail to match since they exist in one netlist but not the other. Decap cells have devices in them, so they won't get optimized out. (2) Most things that you could do, like placing the decap cell upside down or shorting across its power and ground, are going to be detected elsewhere, so it is probably okay to ignore decap as well. But it's not the same problem as the others, because all decap cells will be properly represented in both netlists. I want my default setup file to compare everything that is possible to be compared. If somebody wants to rewrite the setup script for their own purposes, they can do that. (3) Almost certainly it can be optimized for speed. I hadn't noticed it being a bottleneck or I would have done something about it already. Mostly I'd like to get rid of all the atrociously inefficient linked lists in netgen's database, but that's a huge code re-write. But usually there are simple enough ways to rewrite the linked list handling so that doesn't blow up on big netlists. |
@antonblanchard @RTimothyEdwards I believe What version of magic are you using? Do your fill/tap cell counts match? If the cell counts don't match, netgen may try to flatten them. Recently, there was a code improvement for flattening that reduced device level LVS time from around 90 minutes to 20 minutes for some designs. However, if you're using LEF, the decap cells shouldn't have any devices to flatten. You mentioned the CombineParallel routine, but I was under the impression that LEF/DEF based LVS does not have any low level devices, and thus no property records. Am I missing something? How many cells are in the design (including fill, tap, and decap)? netgen has a fixed hash size that may (or may not) be relevant. There has been some discussion of changing Any suggestions? |
@antonblanchard : The code can be tricky, but yes, it is possible to insert properties at the beginning of the list instead of at the end, and avoid the ever-increasing search for the end of the property list. I just pushed a correction to the code which should do this (netgen version 1.5.218, just pushed to the opencircuitdesign.com repository). Please try it out and let me know by how much it reduces the runtime for your example. |
@d-m-bailey : The device count itself is a property, for the reason that you have to be able to match a transistor (W = 1, M = 2) to a transistor (W = 2, M = 1). The best way to handle that is to combine all parallel devices up front and keep a list of all unique sets of properties for those devices, including the number of devices sharing the same (other) properties. For simple subcircuits like standard cells that don't have any other properties, they just accumulate a bunch of records indicating a property "M = 1". These get combined later during device matching by adding them all up and replacing them with a single property. This may not be the most efficient way to do it, but as long as the runtime doesn't grow exponentially with the size of the circuit, I'm not going to spend the time figuring out if I can optimize it further. |
@d-m-bailey : It might be helpful for the subcircuit summary to count out the total number of devices. For example, all digital synthesized circuits in sky130 will collapse to decap_X = 1 (e.g., |
@RTimothyEdwards @d-m-bailey thanks for the extensive replies. I tried the patch, and it does make a big difference - 95 minutes down to 17 minutes, over 5x faster. Very nice. |
I found another O(n^2) issue, this time in
Could we exit out once we've run out of things to merge? Something like: diff --git a/base/netcmp.c b/base/netcmp.c
index 9b7456d..76d7dcd 100644
--- a/base/netcmp.c
+++ b/base/netcmp.c
@@ -4860,6 +4860,7 @@ int PropertyOptimize(struct objlist *ob, struct nlist *tp, int run, int series,
// Now combine records with same properties by summing M (S).
if (comb == FALSE) {
for (i = 0; i < run - 1; i++) {
+ int nr_empty = 0;
for (j = i + 1; j < run; j++) {
pmatch = 0;
for (p = 1; p < pcount; p++) {
@@ -4982,9 +4983,14 @@ int PropertyOptimize(struct objlist *ob, struct nlist *tp, int run, int series,
else if (vlist[0][j]->value.ival > 0) {
vlist[0][i]->value.ival += vlist[0][j]->value.ival;
vlist[0][j]->value.ival = 0;
+ } else {
+ nr_empty++;
}
}
}
+ // Were all potential matches empty?
+ if (nr_empty == (run-(i+1)))
+ break;
}
} I'm not sure if that is safe to do, but my test case continues to work and it takes the rundown down from 17 minutes to just under 3 minutes. I can imagine there is a worst case where we have to iterate the list n^2 times, but it would seem a common case where we'll only take a few iterations (or even 1). |
I'm using the most recent openlane images (eg 2022.01.13_01.51.43), so both magic and netgen are a month or two old. The fill and tap counts do look to match.
There's 722953 devices, but it's overwhelmingly decap, tap and fill cells. After the optimisation phase I have 69363 devices.
Thanks. I didn't full appreciate the differences between transistor level extraction, and black box device level extraction, but this discussion has helped. |
That makes sense. I made the suggestion before understanding why GDS based extraction has to ignore some cells, and fixing the performance issues where we find them is the best way forward, not removing checking. |
@RTimothyEdwards How about |
@d-m-bailey : Sounds reasonable. I went ahead and implemented it as you suggest and pushed to the repository. |
@antonblanchard : Re: PropertyOptimize(): I'll take a look at that. |
@antonblanchard : I can't find anything wrong with your code change, so I'm going ahead and implementing it and pushing to the repository. From 90 minutes down to 3 minutes run time is good progress for a couple of days' work! |
summary, so that the summary lists the total number of devices as well as the number of devices after parallel optimization, in the form "device_name (M->N)", where "M" is the total number of devices, and "N" is the number of devices after parallel combination. This makes the output somewhat more meaningful to the end user. Implementation as discussed in github issue #47.
by Anton Blanchard, which prevents the double-loop in the PropertyOptimize() routine from continuing the outer loop if all devices in the run have already been merged.
Thanks @RTimothyEdwards! |
Considering the fixes for the identified issues are upstream, I'll close this. If I do uncover a new performance opportunity, I'll open a new bug. |
I have a design where netgen is taking over 90 minutes to complete. I notice a few things:
https://github.com/RTimothyEdwards/open_pdks/blob/e4b453599ad0b112e90c89220145c8219dda3cb0/sky130/netgen/sky130_setup.tcl#L335
CombineParallel()
, where for every insertion we walk the entire property list:I don't quite have my head around how
CombineParallel()
works, but couldn't we just insert at the front of the list?The text was updated successfully, but these errors were encountered: