
feat(ksymbols): reimplement ksymbols #4464

Open
wants to merge 4 commits into main from kallsyms

Conversation

@oshaked1 (Contributor) commented Dec 25, 2024

1. Explain what the PR does

The previous ksymbols implementation used a lazy lookup method, where only symbols marked as required ahead of time were stored. Trying to look up a symbol that was not stored resulted in /proc/kallsyms being read and parsed in its entirety.
While most symbols being looked up were registered as required ahead of time, some weren't (in particular, symbols needed for kprobe attachment), which incurred significant overhead while Tracee was being initialized.

This new implementation stores all symbols, or, if a requiredDataSymbolsOnly flag is used when creating the symbol table (the default), only non-data symbols (required data symbols must then be registered before updating). Some additional memory usage optimizations are included, for example encoding symbol owners as an index into a list of owner names, and lazy symbol name lookups, where the map from symbol name to symbol is populated only for symbols that have been looked up at least once.

From measurements I performed, the extra memory consumption is around 21MB (from ~159MB to ~180MB when running tracee with no arguments on my machine).

Under the hood, this ksymbols implementation uses a generic symbol table implementation that can be used by future code for managing executable file symbols.

A significant advantage gained by storing all non-data symbols is the ability to look up the function symbol that contains a given code address, a feature I plan to use in the future.

This PR closes #4463 and renders #4325 irrelevant (because /proc/kallsyms reads no longer happen "spontaneously").
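The containing-symbol lookup mentioned above can be sketched with a sorted slice and a binary search. This is a minimal illustration, not the PR's actual API; the names are made up, and the slice is sorted ascending for simplicity (the PR sorts descending):

```go
package main

import (
	"fmt"
	"sort"
)

type symbol struct {
	name    string
	address uint64
}

// lookupContaining returns the symbol with the greatest address <= addr,
// i.e. the function that a code address falls inside.
func lookupContaining(sorted []symbol, addr uint64) (symbol, bool) {
	// First index whose symbol starts strictly above addr.
	idx := sort.Search(len(sorted), func(i int) bool {
		return sorted[i].address > addr
	})
	if idx == 0 {
		return symbol{}, false // addr precedes every known symbol
	}
	return sorted[idx-1], true
}

func main() {
	syms := []symbol{{"funcA", 0x1000}, {"funcB", 0x2000}, {"funcC", 0x3000}}
	if s, ok := lookupContaining(syms, 0x2010); ok {
		fmt.Println(s.name) // 0x2010 falls inside funcB
	}
}
```

Note that this only gives the candidate symbol; a real implementation would also want to check the symbol's size (when known) before concluding the address is inside it.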

@NDStrahilevitz (Collaborator) left a comment

Nice work overall, though I do have some comments in mind.

Review threads: pkg/utils/symbol_table.go (resolved ×3, outdated ×2), pkg/utils/environment/kernel_symbols.go (outdated ×2), pkg/ebpf/tracee.go (resolved)
@oshaked1 oshaked1 force-pushed the kallsyms branch 7 times, most recently from a37c7cd to f3e5990 Compare December 29, 2024 10:58
@yanivagman yanivagman linked an issue Dec 29, 2024 that may be closed by this pull request
@oshaked1 oshaked1 force-pushed the kallsyms branch 2 times, most recently from e5c5324 to eaa12ab Compare January 1, 2025 15:41
@oshaked1 (Contributor, Author) commented Jan 1, 2025

I added an additional memory optimization: kernel symbols now store only the lower 48 bits of the address, on the assumption that all addresses begin with 0xffff. We ignore any symbols whose address doesn't start with 0xffff, which in practice is only percpu symbols. This lets us encode the address and owner index together, which saves 8 bytes per symbol for a total memory saving of around 3-4MB.
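The packing described here can be illustrated as follows. The function names and the 16-bit owner width are assumptions for the sketch, not the PR's exact layout:

```go
package main

import "fmt"

// Kernel text addresses all begin with 0xffff, so the top 16 bits of a
// uint64 can be reused to hold an owner index alongside the low 48 bits
// of the address.
const addrMask = 0x0000ffffffffffff

func pack(address uint64, ownerIdx uint16) uint64 {
	return (uint64(ownerIdx) << 48) | (address & addrMask)
}

func unpackAddr(packed uint64) uint64 {
	// Restore the fixed 0xffff prefix that was dropped when packing.
	return packed&addrMask | 0xffff000000000000
}

func unpackOwner(packed uint64) uint16 {
	return uint16(packed >> 48)
}

func main() {
	p := pack(0xffffffff81000000, 3)
	fmt.Printf("%#x %d\n", unpackAddr(p), unpackOwner(p))
}
```

This is exactly why percpu symbols (whose addresses do not start with 0xffff) have to be dropped: the prefix restoration in `unpackAddr` would corrupt them.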

@NDStrahilevitz (Collaborator) left a comment

I have one critical request: please avoid the mixed mutex transactions in the kernel symbols table. From experience this tends to cause interleavings where, for example, one write operation happens in the middle of another mixed read/write operation. Imagine two methods m1 and m2, where m1 takes the write lock and m2 takes a read lock followed by a write lock. The following could occur:

  1. m2 - r
  2. m1 - wait for w
  3. m2 - release r
  4. m1 - back for w
  5. m1 - release
  6. m2 - back to w, with the assumptions made under r now changed by m1.

Please pick either R or W for each method and make the lock last for the whole operation. You could even opt for a regular Mutex instead of an RWMutex; I don't think we have very frequent reads or writes to this struct anyway.

@oshaked1 (Contributor, Author) commented Jan 2, 2025

Symbol lookups could be very frequent in the future (stack trace processing). Using a write lock for the entire duration of KernelSymbolTable.UpdateFromReader would block symbol lookups for a significant duration. The same applies to SymbolTable.LookupByName.

In both functions, the write operations only add new data, they don't change or remove existing data. In the case of SymbolTable.LookupByName, the worst case scenario for an outdated assumption means we add the same name to symbol mapping twice (the added data will always be the same). For KernelSymbolTable.UpdateFromReader, the worst case scenario is the same owner gets added twice to idxToSymbolOwner, but all data remains valid.

I could solve the issue with UpdateFromReader by adding a third lock that makes sure only a single update operation can happen at a time (which prevents 2 concurrent goroutines from wanting to add a new symbol owner).

@oshaked1 (Contributor, Author) commented Jan 2, 2025

I could also change the API of the kernel symbol table so that reading /proc/kallsyms happens once, when the symbol table is created; a user who wants to update it must create a new one instead. This removes the need for locks entirely.

It also solves a race condition where if a lookup happens between kst.symbols.Clear() and kst.symbols.AddSymbols(symbols) the lookup will fail.

@oshaked1 (Contributor, Author) commented Jan 2, 2025

Wouldn't this be an issue for your stack trace feature if you miss a symbol in the initial request?

Do you mean with the current implementation? If so then yes, I only noticed it now.

So this is a valid concern... I can't think of a better solution, and it sounds reasonable that these internal structures should have separate access control. So I'm ok with that. Make the change and I'll +1.

Which solution do you prefer? IMO the second one (an RO symbol table, creating a new one instead of updating) is favorable.

Additionally, I could change SymbolTable.symbolsByName to an LRU which would solve the concurrency issue. It would also allow us to control the memory usage of name mappings.

@NDStrahilevitz (Collaborator) commented Jan 2, 2025

Additionally, I could change SymbolTable.symbolsByName to an LRU which would solve the concurrency issue. It would also allow us to control the memory usage of name mappings.

That is just equivalent to giving it a separate mutex, and I don't see why we would want to limit the owner number. It might be slightly more readable. BTW, you may want to use the set.Set data structure for this instead, as the owner symbols are just a set, and I believe I've included a mutex with it. Though it may not fit with the int<->string translation scheme you've made.

Wouldn't this be an issue for your stack trace feature if you miss a symbol in the initial request?

Do you mean with the current implementation? If so then yes, I only noticed it now.

I mean that if you make it RO and require full regeneration when a new symbol appears, I think you may easily encounter multiple regenerations per stack trace extraction.

So this is a valid concern... I can't think of a better solution, and it sounds reasonable that these internal structures should have separate access control. So I'm ok with that. Make the change and I'll +1.

Which solution do you prefer? IMO the second one (an RO symbol table, creating a new one instead of updating) is favorable.

Which is why the RO symbol table seems like a bad choice for your future needs, unless I am missing something in what you've suggested. Therefore I suggest you use a set, an LRU, or a mutex on the owners, such that the concurrency control is limited to the data it is relevant to.

@oshaked1 (Contributor, Author) commented Jan 2, 2025

Additionally, I could change SymbolTable.symbolsByName to an LRU which would solve the concurrency issue. It would also allow us to control the memory usage of name mappings.

That is just equivalent to giving it a separate mutex, and I don't see why we would want to limit the owner number. It might be slightly more readable. BTW, you may want to use the set.Set data structure for this instead, as the owner symbols are just a set, and I believe I've included a mutex with it. Though it may not fit with the int<->string translation scheme you've made.

symbolsByName is not for the owners, it's for the cached symbol-name-to-symbol mapping (which probably should be size limited). The owners cannot be a set, for the reason you mentioned: there is no way to index into it.

I mean that if you make it RO and require full regeneration when a new symbol appears, I think you may easily encounter multiple regenerations per stack trace extraction.

There is no regeneration on a failed request, only manually (used in processDoInitModule). Creating a new kernel symbol table instead of updating it ensures that, for its users, the update is atomic (a new structure gets assigned to t.kernelSymbols).

@geyslan (Member) left a comment

First pass so far. I've only reviewed kernelSymbolInternal, with some suggestions that could make the code easier to read and change - I'll get to the rest later.

Review threads: pkg/utils/environment/kernel_symbols.go (outdated ×5)
@geyslan (Member) left a comment

I did one pass more.

Besides the thoughts I've put in, I would also recommend adding a test file for concurrency like this one: https://github.com/aquasecurity/tracee/blob/main/pkg/capabilities/capabilities_test.go

I haven't reviewed KernelSymbolTable yet; it will wait for the next pass. 👍🏼

}
}

// Sort the symbols slice by address in descending order
Member:

Perhaps sorting it in ascending order would bring a small performance gain and extra API features (such as the found boolean, or the ordered index where a value should be inserted when not found) by using slices.BinarySearch() or slices.BinarySearchFunc(). WDYT?

Contributor Author:

I'm not sure I understand the benefit of sorting in ascending order. I already perform binary searches (using sort.Search).

Member:

slices.BinarySearchFunc() returns the index and a found boolean, so you'd be able to shortcut the later check in LookupByAddressExact(), for instance.
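For illustration, slices.BinarySearchFunc (Go 1.21+) indeed returns both pieces at once; the struct, field names, and values below are made up for the sketch:

```go
package main

import (
	"cmp"
	"fmt"
	"slices"
)

type sym struct {
	name string
	addr uint64
}

// searchAddr wraps slices.BinarySearchFunc: it returns the index where
// target is (or would be inserted, keeping the slice sorted) plus a found
// boolean, removing the need for a separate equality re-check afterwards.
func searchAddr(syms []sym, target uint64) (int, bool) {
	return slices.BinarySearchFunc(syms, target, func(s sym, t uint64) int {
		return cmp.Compare(s.addr, t)
	})
}

func main() {
	// Sorted ascending by address.
	syms := []sym{{"a", 0x1000}, {"b", 0x2000}, {"c", 0x3000}}

	idx, found := searchAddr(syms, 0x2000)
	fmt.Println(idx, found) // 1 true

	// A miss reports the insertion position instead.
	idx, found = searchAddr(syms, 0x2500)
	fmt.Println(idx, found) // 2 false
}
```

With duplicates present, BinarySearchFunc reports the earliest matching index, which still leaves the caller to scan forward for the remaining matches, as the reply below points out.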

Contributor Author:

I'm not sure how this function behaves when there are multiple symbols with the same requested address; I need to find all of them. sort.Search guarantees to find the first match, and then I iterate forward to find the rest. I could also iterate backwards, but I don't think this would end up simpler than the current method.
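The duplicate-address handling described here can be sketched as follows (ascending order for simplicity, whereas the PR sorts descending; all names are illustrative):

```go
package main

import (
	"fmt"
	"sort"
)

type sym struct {
	name string
	addr uint64
}

// lookupAllAtAddr uses sort.Search to find the first symbol at exactly
// addr, then scans forward to collect every symbol sharing that address.
func lookupAllAtAddr(sorted []sym, addr uint64) []sym {
	// Smallest index whose address is >= addr; len(sorted) if none is.
	idx := sort.Search(len(sorted), func(i int) bool {
		return sorted[i].addr >= addr
	})
	out := []sym{}
	for ; idx < len(sorted) && sorted[idx].addr == addr; idx++ {
		out = append(out, sorted[idx])
	}
	return out
}

func main() {
	// Two symbols at one address, as in the freed-init-function scenario
	// described in the commit messages below.
	syms := []sym{{"old_init", 0x1000}, {"new_mod", 0x1000}, {"other", 0x2000}}
	fmt.Println(len(lookupAllAtAddr(syms, 0x1000)))
}
```

Because sort.Search returns the first index satisfying the predicate, the forward scan is guaranteed to see every duplicate exactly once.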

Comment on lines +90 to +112
st.mu.RLock()
// We call RUnlock manually and not using defer because we may need to upgrade to a write lock later

// Lookup the name in the name to symbol mapping
if symbols, found := st.symbolsByName[name]; found {
st.mu.RUnlock()
return symbols, nil
}

// Lazy name lookup is disabled, the lookup failed
if !st.lazyNameLookup {
st.mu.RUnlock()
return nil, ErrSymbolNotFound
}

// Lazy name lookup is enabled, perform a linear search to find the requested name
symbols := []*T{}
for _, symbol := range st.sortedSymbols {
if (*symbol).Name() == name {
symbols = append(symbols, symbol)
}
}
Member:

WDYT about moving this reading logic into a private method, so all read-mutex control would be managed by it, i.e.: symbols := st.lookupByName()

Then we should take the write lock only if symbols has len > 0.

If you agree, add comments about the locking in the inner call to warn about unexpected deadlocks if it's used out of context.

Contributor Author:

This has the opposite effect of what @NDStrahilevitz commented regarding race conditions with mixed mutex transactions (the assumption from the read lock is outdated by the time we acquire the write lock), because it places the read unlock and the write lock further apart.

As I commented above, the worst case scenario for such a race condition is duplicate work with no negative effect on the data (the same name is looked up linearly more than once). WDYT? Is such a race condition even worth consideration?

Member:

Consider that you're only checking for the existence of symbols with the given name. I believe the worst case is that holding the read lock delays a pending write that would have updated symbolsByName, so the lookup misses even though a writer is already waiting for the read lock to be released. What are the odds?

If that is likely to happen very often, I would use a write lock for the entire scope instead.

Contributor Author:

I'm not sure I understood what you mean. The odds of two lookups for the same name at very similar times resulting in a duplicate linear search are very low (there aren't many name lookups). IMO it's not worth accounting for.

Anyway, regarding your proposal: we would need st.lookupByName to return an indication of whether st.symbolsByName needs updating, so I'm not sure there is much point in separating the two.

pkg/utils/symbol_table.go (resolved)
Comment on lines +139 to +143
// Not found or not exact match
if idx == len(st.sortedSymbols) || (*st.sortedSymbols[idx]).Address() != address {
return nil, ErrSymbolNotFound
}
Member:

As commented above, I suppose that by using BinarySearch this check might not be necessary.

@oshaked1 (Contributor, Author) commented Jan 13, 2025:

As I said above, I do use a binary search. A failed search returns idx == len(st.sortedSymbols). The search condition is that the address of the current symbol is smaller than or equal to the address being looked up, so we have to verify that it's actually equal.

pkg/utils/symbol_table.go (resolved)
pkg/utils/symbol_table_test.go (outdated)
@geyslan (Member) left a comment

Final pass, I believe. If anything comes up, just ping me.

return &KernelSymbol{
Name: symbol.Name(),
Address: symbol.Address(),
Owner: kst.idxToSymbolOwner[symbol.owner()],
Member:

If this is the unique access, how about moving it above, reducing the RLock time even more? I mean:

    kst.ownersMutex.RLock()
    owner := kst.idxToSymbolOwner[symbol.owner()]
    kst.ownersMutex.RUnlock()

pkg/utils/environment/kernel_symbols.go Outdated Show resolved Hide resolved
pkg/utils/environment/kernel_symbols.go Outdated Show resolved Hide resolved
func (k *KernelSymbolTable) GetSymbolByAddr(addr uint64) ([]KernelSymbol, error) {
k.updateLock.Lock()
defer k.updateLock.Unlock()
func (kst *KernelSymbolTable) symbolFromInternal(symbol *kernelSymbolInternal) *KernelSymbol {
Member:

Well, this is called from loops, so it will lock/unlock many times when the iteration count is greater than one. Is that behaviour expected? If it isn't, consider moving the locking calls up a level.

@oshaked1 oshaked1 force-pushed the kallsyms branch 2 times, most recently from a346008 to 31b26bc Compare January 13, 2025 13:23
@oshaked1 (Contributor, Author) commented Jan 13, 2025

@geyslan some of your remaining comments are regarding lock behavior in kernel_symbols.go. As I proposed in a previous comment, I could change the API such that KernelSymbolTable is RO and when it needs to be updated (currently only happens when a kernel module is loaded) a new one is created instead. This would completely remove the need for locks in this file.

WDYT? @NDStrahilevitz @yanivagman it would be great if you could weigh in as well

@NDStrahilevitz (Collaborator):

@oshaked1 I worry that it would expose an easy attack against tracee: simply load and unload a module many times. Of course this is already suspicious behavior, and not the only "smoke screen" tactic possible, but adding another one isn't great. That said, considering it is not the only such tactic, maybe it shouldn't be a blocker for the option.

A way to circumvent this altogether would be to cache the delta per module load: if we know for sure that the module in a load loop is the same one, there's no need to calculate the differences each time. Although this might be too memory heavy.

Overall no strong opinion on going that way, going RO has many obvious benefits, but the exploit surface slightly worries me.

@oshaked1 (Contributor, Author):

This attack is also a problem with the current implementation, because we clear the underlying symbol table and add the symbols from /proc/kallsyms again. The RO implementation is actually more resistant to this attack, because lookups can still proceed normally while a new kernel symbol table is being created, and only when it's ready is t.kernelSymbols replaced atomically.

@NDStrahilevitz (Collaborator):

This attack is also a problem with the current implementation, because we clear the underlying symbol table and add the symbols from /proc/kallsyms again. The RO implementation is actually more resistant to this attack, because lookups can still proceed normally while a new kernel symbol table is being created, and only when it's ready is t.kernelSymbols replaced atomically.

True, you're right. So, on my end, I see no issue with moving to an RO implementation.

@geyslan (Member) commented Jan 13, 2025

This attack is also a problem with the current implementation, because we clear the underlying symbol table and add the symbols from /proc/kallsyms again. The RO implementation is actually more resistant to this attack, because lookups can still proceed normally while a new kernel symbol table is being created, and only when it's ready is t.kernelSymbols replaced atomically.

True, you're right. So, on my end, I see no issue with moving to an RO implementation.

Me neither. 👍🏼

@oshaked1 oshaked1 force-pushed the kallsyms branch 2 times, most recently from ce508b3 to 5008a8b Compare January 13, 2025 16:28
@oshaked1 (Contributor, Author):

@NDStrahilevitz @geyslan I changed KernelSymbolTable to RO and guarded t.kernelSymbols behind an atomic getter and setter.

@oshaked1 oshaked1 force-pushed the kallsyms branch 4 times, most recently from cd5b576 to 47d48e1 Compare January 14, 2025 10:17
@@ -123,6 +122,9 @@ type Tracee struct {
policyManager *policy.Manager
// The dependencies of events used by Tracee
eventsDependencies *dependencies.Manager
// A reference to a environment.KernelSymbolTable that might change at runtime.
// This should only be accessed using t.getKernelSymbols() and t.setKernelSymbols()
kernelSymbols unsafe.Pointer
Collaborator:

Why using unsafe.Pointer and not *environment.KernelSymbolTable?

@oshaked1 (Contributor, Author) commented Jan 15, 2025:

It's just to ensure that it's only accessed using the safe getter and setter.

Comment on lines 142 to 149
func (t *Tracee) getKernelSymbols() *environment.KernelSymbolTable {
return (*environment.KernelSymbolTable)(atomic.LoadPointer(&t.kernelSymbols))
}

func (t *Tracee) setKernelSymbols(kernelSymbols *environment.KernelSymbolTable) {
atomic.StorePointer(&t.kernelSymbols, unsafe.Pointer(kernelSymbols))
}

Collaborator:

I don't understand why we need those... What does the atomic protect against?

Contributor Author:

@yanivagman The atomic operations protect against a read of t.kernelSymbols happening simultaneously with an update that replaces it. While there is probably no actual risk, Go defines such simultaneous access as a data race, so it's just to be safe.

Collaborator:

Shouldn't we place this file under environment? I don't think this logic will have any use outside the ksymbols context, will it?

Contributor Author:

I plan to use it in the future for ELF symbols

This implementation stores all symbols, or if a `requiredDataSymbolsOnly`
flag is used when creating the symbol table, only non-data symbols are saved
(and required data symbols must be registered before updating).
This new implementation uses a generic symbol table implementation that is
responsible for managing symbol lookups, and can be used by future code for
managing executable file symbols.
After running the init function of a kernel module, the kernel frees the memory that was allocated for it but doesn't remove its symbol from kallsyms.
This results in a scenario where a subsequently loaded module can be allocated in the same area as the freed init function of the previous module.
This can leave 2 symbols at the same address: one the freed init function, the other from the newly loaded module.
This caused nondeterminism in which symbol was used by the hooked_syscall event, which only used the first symbol found, resulting in random test failures.
This commit changes the hooked_syscall event to emit one event for each found symbol.
@NDStrahilevitz (Collaborator) left a comment

LGTM after all the changes, nice work.

The thread stack area was extracted by finding the VMA containing the SP of the new thread, but because the SP might be just past the end of its allocated VMA (at the top of the stack), sometimes the correct VMA was not found.
4 participants