IOC extraction process slow #422

Open
regulartim opened this issue Jan 9, 2025 · 1 comment

@regulartim
Collaborator

The Celery tasks that extract data from Elasticsearch and write it to the GreedyBear DB have become much slower over time. This seems to correlate with a growing GreedyBear DB. On an empty DB, all extraction processes combined took about 20 seconds. Now, with about 500k IOCs in the DB, they take over 2 minutes.
One of the performance killers seems to be the _check_first_time_run method in attacks.py:

def _check_first_time_run(self, honeypot_flag, general=False):
    all_ioc = IOC.objects.all()
    if not all_ioc:
        # plus, we extract the sensors addresses so we can whitelist them
        ExtractSensors().execute()
        self.first_time_run = True
    else:
        # if this is not the overall first time, it could that honeypot first time
        # FEEDS for a general honeypot it needs to be checked if it's in the list
        if not general:
            honeypot_ioc = IOC.objects.filter(**{f"{honeypot_flag}": True})
        else:
            honeypot_ioc = IOC.objects.filter(**{"general_honeypot__name__iexact": honeypot_flag})
        if not honeypot_ioc:
            # first time we execute this project.
            # So we increment the time range to get the data from the last 3 days
            self.first_time_run = True

In lines 100 and 112 (the "if not all_ioc" and "if not honeypot_ioc" checks), QuerySet evaluation is forced by testing the QuerySet in a boolean context. To my understanding, this means that all matching IOCs are fetched from the DB just to find out whether any exist, which is very expensive. These checks should use QuerySet.exists() instead. I will investigate this further and open a PR.
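The difference between the two checks can be sketched without Django, using only the standard-library sqlite3 module. This is an illustration under assumptions, not GreedyBear code: the table layout, the "cowrie" flag column, and both helper functions are hypothetical stand-ins. The first helper mimics what testing a QuerySet in a boolean context amounts to (load every matching row, then check for emptiness), the second mimics the kind of query QuerySet.exists() issues (ask for at most one row).

```python
import sqlite3

# Hypothetical stand-in for the IOC table; the real schema is not shown in this issue.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ioc (id INTEGER PRIMARY KEY, name TEXT, cowrie INTEGER)")
conn.executemany(
    "INSERT INTO ioc (name, cowrie) VALUES (?, ?)",
    ((f"192.0.2.{i % 256}", i % 2) for i in range(10_000)),
)


def has_rows_by_fetching(conn):
    """Roughly what `if not queryset:` does: load every matching row, then test."""
    rows = conn.execute("SELECT * FROM ioc WHERE cowrie = 1").fetchall()
    return len(rows) > 0, len(rows)


def has_rows_by_exists(conn):
    """Roughly what `queryset.exists()` does: request at most one row."""
    rows = conn.execute("SELECT 1 FROM ioc WHERE cowrie = 1 LIMIT 1").fetchall()
    return len(rows) > 0, len(rows)


# Both helpers give the same answer, but transfer very different amounts of data.
found_fetch, transferred_fetch = has_rows_by_fetching(conn)
found_exists, transferred_exists = has_rows_by_exists(conn)
print(found_fetch, transferred_fetch)
print(found_exists, transferred_exists)
```

In the method quoted above, the corresponding change would presumably be along the lines of replacing "if not all_ioc:" with "if not IOC.objects.exists():" and "if not honeypot_ioc:" with "if not honeypot_ioc.exists():". The exact form is up to the PR.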

@regulartim regulartim self-assigned this Jan 9, 2025
@drosetti
Contributor

drosetti commented Jan 9, 2025

I'll assign you this issue, thanks!

drosetti pushed a commit that referenced this issue Jan 20, 2025
* create index on name field of IOC model to speed up _add_ioc function

* use QuerySet.exists() for better performance

* hand over previously added IOC record to _get_sessions method to reduce number of DB queries

* fix returning wrong IOC object

* add more error-resistant time window calculation

* document additional_lookback argument

* minor improvements to get_time_window function

* add test cases for get_time_window function

* fix error in docstring

* remove argument from function that is already a configuration setting and adapt tests accordingly
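The first item in the commit message, an index on the name field, can be illustrated in the same spirit. The sketch below again uses the standard-library sqlite3 module rather than Django. The table, the "times_seen" column, the index name, and the query_plan helper are all hypothetical, and how the index was actually declared on the IOC model is not shown in this thread. It only demonstrates the general effect: a lookup by name goes from a full table scan to an index search.

```python
import sqlite3

# Hypothetical table standing in for the IOC model; the column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ioc (id INTEGER PRIMARY KEY, name TEXT, times_seen INTEGER)")
conn.executemany(
    "INSERT INTO ioc (name, times_seen) VALUES (?, ?)",
    ((f"host-{i}.example", 1) for i in range(1_000)),
)


def query_plan(conn):
    """Return SQLite's plan description for a lookup by name."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM ioc WHERE name = ?",
        ("host-42.example",),
    ).fetchall()
    # The last column of each row holds the human-readable plan detail.
    return " ".join(row[-1] for row in rows)


plan_before = query_plan(conn)  # without an index: a scan over the whole table
conn.execute("CREATE INDEX idx_ioc_name ON ioc (name)")
plan_after = query_plan(conn)   # with the index: a search that uses idx_ioc_name
print(plan_before)
print(plan_after)
```

In a Django model, an index like this is typically requested with db_index=True on the field or through Meta.indexes, followed by a migration.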