Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(xid): simplify xid component #321

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open

feat(xid): simplify xid component #321

wants to merge 6 commits into from

Conversation

cardyok
Copy link
Collaborator

@cardyok cardyok commented Jan 21, 2025

No description provided.

@cardyok cardyok marked this pull request as draft January 21, 2025 11:58
@cardyok cardyok force-pushed the simplify-xid branch 3 times, most recently from 18dac48 to 2310b28 Compare January 21, 2025 13:42
@cardyok cardyok force-pushed the simplify-xid branch 4 times, most recently from cc5d178 to a75939e Compare January 21, 2025 23:39
@codecov-commenter
Copy link

codecov-commenter commented Jan 21, 2025

Codecov Report

Attention: Patch coverage is 33.62319% with 229 lines in your changes missing coverage. Please review.

Project coverage is 21.15%. Comparing base (02d1814) to head (eddea4c).

Files with missing lines Patch % Lines
...mponents/accelerator/nvidia/error/xid/component.go 8.33% 132 Missing ⚠️
components/accelerator/nvidia/error/xid/helper.go 78.57% 19 Missing and 2 partials ⚠️
internal/session/serve.go 0.00% 20 Missing ⚠️
components/db/events.go 81.25% 4 Missing and 2 partials ⚠️
internal/server/server.go 0.00% 6 Missing ⚠️
components/metrics/metrics.go 0.00% 3 Missing ⚠️
...omponents/accelerator/nvidia/bad-envs/component.go 0.00% 1 Missing ⚠️
...onents/accelerator/nvidia/clock-speed/component.go 0.00% 1 Missing ⚠️
components/accelerator/nvidia/ecc/component.go 0.00% 1 Missing ⚠️
...nts/accelerator/nvidia/error-xid-sxid/component.go 0.00% 1 Missing ⚠️
... and 37 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #321      +/-   ##
==========================================
- Coverage   21.21%   21.15%   -0.06%     
==========================================
  Files         300      301       +1     
  Lines       27089    27302     +213     
==========================================
+ Hits         5747     5776      +29     
- Misses      20697    20895     +198     
+ Partials      645      631      -14     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

components/accelerator/nvidia/error/xid/helper.go Outdated Show resolved Hide resolved
components/accelerator/nvidia/error/xid/helper.go Outdated Show resolved Hide resolved
components/accelerator/nvidia/error/xid/helper.go Outdated Show resolved Hide resolved
components/accelerator/nvidia/error/xid/helper.go Outdated Show resolved Hide resolved
components/accelerator/nvidia/error/xid/helper.go Outdated Show resolved Hide resolved
@cardyok cardyok force-pushed the simplify-xid branch 4 times, most recently from 4cb7df8 to aa900f3 Compare January 24, 2025 00:33
@cardyok cardyok marked this pull request as ready for review January 24, 2025 00:39
gyuho added a commit that referenced this pull request Jan 24, 2025
@cardyok cardyok force-pushed the simplify-xid branch 2 times, most recently from c5c0ae7 to 9e03f54 Compare January 24, 2025 10:44
@cardyok cardyok force-pushed the simplify-xid branch 2 times, most recently from f4a90c6 to 1ab7366 Compare January 24, 2025 15:57
@cardyok cardyok force-pushed the simplify-xid branch 3 times, most recently from b5ece5b to 7ee6b83 Compare January 25, 2025 13:28
gyuho added a commit that referenced this pull request Jan 25, 2025
Signed-off-by: cardyok <[email protected]>
Signed-off-by: cardyok <[email protected]>
Signed-off-by: cardyok <[email protected]>
Signed-off-by: cardyok <[email protected]>
@gyuho
Copy link
Collaborator

gyuho commented Jan 25, 2025

Also need to use this events DB in:

// no need to check duplicate entries, assuming nvml event poller does not return old events
ctx, cancel := context.WithTimeout(inst.rootCtx, 10*time.Second)
werr := nvidia_xid_sxid_state.InsertEvent(ctx, inst.dbRW, nvidia_xid_sxid_state.Event{
UnixSeconds: event.Time.Unix(),
DataSource: "nvml",
EventType: "xid",
EventID: int64(event.Xid),
DeviceID: event.DeviceUUID,
EventDetails: string(eb),
})
cancel()
if werr != nil {

(this PR only covers dmesg ones, we also need the ones from NVML)


TODO: need to deprecate old xid matcher:

dmesgProcessMatched := func(ts time.Time, line []byte, matchedFilter *query_log_common.Filter) {
if ts.IsZero() {
return
}
if len(line) == 0 {
return
}
if matchedFilter == nil {
return
}
cctx, ccancel := context.WithTimeout(ctx, 10*time.Second)
defer ccancel()
for _, ref := range matchedFilter.OwnerReferences {
switch ref {
case nvidia_component_error_xid_id.Name:
ev, err := nvidia_query_xid.ParseDmesgLogLine(metav1.Time{Time: ts}, string(line))
if err != nil {
log.Logger.Errorw("failed to parse xid dmesg line", "line", string(line), "error", err)
continue
}
if ev.Detail == nil {
log.Logger.Errorw("failed to parse xid dmesg line", "line", string(line), "error", "no detail")
continue
}
eventToInsert := nvidia_xid_sxid_state.Event{
UnixSeconds: ts.Unix(),
DataSource: "dmesg",
EventType: "xid",
EventID: int64(ev.Detail.Xid),
DeviceID: ev.DeviceUUID,
EventDetails: ev.LogItem.Line,
}
if xidSxidEventDeduper.Get(eventToInsert) {
log.Logger.Debugw("xid event already exists in cache -- deduped", "event", eventToInsert)
continue
}
if err := xidSxidEventDeduper.Add(eventToInsert); err != nil {
log.Logger.Errorw("failed to add xid event to deduper", "error", err)
continue
}
// not found in cache, fallback to db lookup
found, err := nvidia_xid_sxid_state.FindEvent(cctx, dbRO, eventToInsert)
if err != nil {
log.Logger.Errorw("failed to find xid event in database", "error", err)
continue
}
if found {
log.Logger.Debugw("xid event already exists in database", "event", eventToInsert)
continue
}
if werr := nvidia_xid_sxid_state.InsertEvent(cctx, dbRW, eventToInsert); werr != nil {
log.Logger.Errorw("failed to insert xid event into database", "error", werr)
continue
}

@gyuho
Copy link
Collaborator

gyuho commented Jan 25, 2025

Discussed offline, we will do NVML in a separate PR

return fmt.Errorf("failed to marshal extra info: %w", err)
}
}
if ev.SuggestedActions != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch

@xiang90
Copy link
Contributor

xiang90 commented Jan 26, 2025

let us improve the test coverage to make the CI green.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants