Remove whitelist and aliases (#61)
* remove whitelist and aliases #57

* remove associated include files

* removing old whitelist alias info

* remove wl refs

---------

Co-authored-by: David G <[email protected]>
fqrious and himynamesdave authored Nov 13, 2024
1 parent 8eaf275 commit 87c764d
Showing 38 changed files with 28 additions and 6,830 deletions.
19 changes: 7 additions & 12 deletions README.md
@@ -11,13 +11,11 @@ The general design goal of txt2stix was to keep it flexible, but simple, so that
In short, txt2stix:

1. takes a txt file input
2. (OPTIONAL) rewrites file with enabled aliases
3. extracts observables for enabled extractions (ai, pattern, or lookup)
4. (OPTIONAL) removes any extractions that match whitelists
5. converts extracted observables to STIX 2.1 objects
6. generates the relationships between extracted observables (ai, standard)
7. converts extracted relationships to STIX 2.1 SRO objects
8. outputs a STIX 2.1 bundle
2. extracts observables for enabled extractions (ai, pattern, or lookup)
3. converts extracted observables to STIX 2.1 objects
4. generates the relationships between extracted observables (ai, standard)
5. converts extracted relationships to STIX 2.1 SRO objects
6. outputs a STIX 2.1 bundle

## tl;dr

@@ -88,8 +86,6 @@ How the extractions are performed
* `--use_extractions` (REQUIRED): if you only want to use certain extraction types, you can pass their slug found in either `ai/config.yaml`, `lookup/config.yaml` or `regex/config.yaml` (e.g. `regex_ipv4_address_only`). Default if not passed, no extractions applied.
  * Important: if using any AI extractions, you must set an OpenAI API key in your `.env` file
  * Important: if you are using any MITRE ATT&CK, CAPEC, CWE, ATLAS or Location extractions you must set the `CTIBUTLER` settings, and if you are using any NVD CPE or CVE extractions you must set the `VULMATCH` settings, in your `.env` file
* `--use_aliases` (OPTIONAL): if you want to apply aliasing to the input doc (find and replace strings) you can pass their slug found in `aliases/config.yaml` (e.g. `country_iso3_to_iso2`). Default if not passed, no aliases applied.
* `--use_whitelist` (OPTIONAL): if you want to get the script to ignore certain values that might create extractions you can specify using `whitelist/config.yaml` (e.g. `alexa_top_1000`) related to the whitelist file you want to use. Default if not passed, no whitelists applied.
* `--relationship_mode` (REQUIRED): either:
  * `ai`: AI provider must be enabled. Extractions performed by either regex or AI for extractions user selected. Rich relationships created from AI provider from extractions.
  * `standard`: extractions performed by either regex or AI (AI provider must be enabled) for extractions user selected. Basic relationships created from extractions back to master Report object generated.
@@ -108,13 +104,12 @@ If any AI extractions, or AI relationship mode is set, you must set the followin
* similar to `ai_settings_extractions` but defines the model used to generate relationships. Only one model can be provided. Passed in same format as `ai_settings_extractions`
* See `tests/manual-tests/cases-ai-relationships.md` for some examples

## Adding new extractions/lookups/aliases
## Adding new extractions

It is very likely you'll want to extend txt2stix to include new extractions, aliases, and/or lookups. The following is possible:
It is very likely you'll want to extend txt2stix to include new extractions. The following is possible:

* Add a new lookup extraction: add your lookup to `includes/lookups` as a `.txt` file. Lookups should be a list of items separated by new lines to be searched for in documents. Once this is added, update `includes/extractions/lookup/config.yaml` with a new record pointing to your lookup. You can now use this lookup at script run-time.
* Add a new AI extraction: Edit `includes/extractions/ai/config.yaml` with a new record for your extraction. You can craft the prompt used in the config to control how the LLM performs the extraction.
* Add a new alias: add a your alias to `includes/aliases` as a `.csv` file. Alias files should have two columns `value,alias`, where `value` is the document in the original document to replace and `alias` is the value it should be replaced with. Once this is added, update `includes/extractions/alias/config.yaml` with a new record pointing to your alias. You can now use this lookup time at script run-time.

Currently it is not possible to easily add any other types of extractions (without modifying the logic at a code level).
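
As a rough illustration of the lookup file format described above, the sketch below parses newline-separated lookup contents and reports which terms appear in a document. The function names (`load_lookup_terms`, `find_lookup_terms`) and the sample values are hypothetical and not part of txt2stix, and the plain substring check is a simplifying assumption rather than txt2stix's actual matching logic.

```python
def load_lookup_terms(lookup_text: str) -> list[str]:
    """Parse the contents of a lookup .txt file: one term per line, blank lines ignored."""
    return [line.strip() for line in lookup_text.splitlines() if line.strip()]


def find_lookup_terms(document: str, terms: list[str]) -> list[str]:
    """Return the lookup terms that occur in the document (simplified substring check)."""
    return [term for term in terms if term in document]


# Hypothetical lookup file contents and input document
lookup_text = "Emotet\nCobalt Strike\n\nTrickBot\n"
document = "The actor deployed Cobalt Strike after an Emotet infection."

terms = load_lookup_terms(lookup_text)
matches = find_lookup_terms(document, terms)
print(matches)
```

In a real lookup, each term found would then become an extraction; here the result is only a list of matched strings.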

53 changes: 3 additions & 50 deletions docs/README.md
@@ -32,56 +32,9 @@ Here is an overview of how the txt2stix processes txt files into STIX 2.1 bundle

https://miro.com/app/board/uXjVKEyFzB8=/

### Aliases

In many cases two extractions might be related to the same thing. For example, the extraction `USA` and `United States` and `United States of America` are all referring to the same thing.

Aliases normalise the input text before extractions happen so that the same extraction is used. e.g. changing `United States` -> `USA`.

Aliases are applied before extractions. Essentially the first step of processing is to replace the alias values, with the desired value.

The aliases are set in the `includes/extractions/config.yaml`

To demonstrate, let's say the alias config file (in `aliases/`) looks like so;

```yaml
country_iso3_to_iso2:
  name: Turns Country ISO3 values into ISO2
  description:
  created: 2020-01-01
  modified: 2020-01-01
  created_by: signalscorps
  version: 1.0.0
  file: /aliases/default/country_iso3_alias.csv
```
This alias uses the alias file `country_iso3_alias.csv`.

The contents of an alias file has two columns, `value` and `alias`

```csv
value,alias
AFG,AF
ALA,AX
ALB,AL
DZA,DZ
```

This will turn all references of AFG in the inp

### Whitelists

In many cases files will have IoC extractions that are not malicious, e.g. `google.com`, and users will not want them to appear in the final bundle.

Whitelists provide a list of values to be compared to extractions. If a whitelist value matches an extraction, that extraction is removed, and any relationships where it is the `source_ref` or `target_ref` are also removed, so that a user does not see them.

Design decision: This is done after extractions to save tokens with AI providers (otherwise might be easily passing 10000+ more tokens to the AI).

Note, whitelists are designed to be simplistic in txt2stix. If you want more advanced removal of potential benign extractions you should use another tool, like a Threat Intelligence Platform.

### Extractions

After aliasing has been applied, extractions happen. There are 3 types of extractions in txt2stix.
There are 3 types of extractions in txt2stix.

1. Pattern: Pattern extraction type works by using regex patterns to extract data from the inputted document.
* when to use: for pattern based extractions that are easy to detect
@@ -97,7 +50,7 @@ A user can use a mix of all extractions in any request.

#### A note on extraction logic

When searching in written reports, extractions/aliasing is not always obvious to a machine (when pattern matching).
When searching in written reports, extractions are not always obvious to a machine (when pattern matching).

e.g. let's say `MITRE ATT&CK` was in a report, and the country code alpha2 extraction was on (`IT` and `AT` might be extracted incorrectly as countries).

@@ -123,7 +76,7 @@ To make it clear, the above formats will all extract. The logic for txt2stix ext

As you can see, this logic would avoid the issue shown in the `MITRE ATT&CK` example.
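
The exact boundary rules fall outside the lines shown in this diff, so the following is only a sketch under an assumed rule: a value is extracted only when it is surrounded by whitespace or the start/end of the text. The function names and the three-code list are hypothetical. It contrasts that rule with naive substring matching on the `MITRE ATT&CK` example.

```python
import re

COUNTRY_CODES = ["IT", "AT", "US"]  # hypothetical subset of alpha2 country codes


def naive_extract(text: str, codes: list[str]) -> list[str]:
    """Substring match: reports a code wherever its letters appear."""
    return [code for code in codes if code in text]


def bounded_extract(text: str, codes: list[str]) -> list[str]:
    """Match a code only when whitespace or the text boundary sits on both sides (assumed rule)."""
    return [
        code
        for code in codes
        # (?<!\S) / (?!\S): no non-whitespace character directly before / after the code
        if re.search(r"(?<!\S)" + re.escape(code) + r"(?!\S)", text)
    ]


report = "MITRE ATT&CK techniques were observed in campaigns targeting US banks"

naive = naive_extract(report, COUNTRY_CODES)
bounded = bounded_extract(report, COUNTRY_CODES)
print(naive)
print(bounded)
```

With the naive approach, `IT` (inside `MITRE`) and `AT` (inside `ATT&CK`) are picked up; with the boundary check only the standalone `US` remains.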

Design decision: this does not apply to AI mode extractions (but still applies for aliasing before extractions) because the assumption is that the AI model is smart enough to deal with extracting data in a more intelligent manner.
Design decision: this does not apply to AI mode extractions because the assumption is that the AI model is smart enough to deal with extracting data in a more intelligent manner.

### Relationship modes

2 changes: 1 addition & 1 deletion docs/stix-mapping.md
@@ -49,7 +49,7 @@ All files uploaded are represented as a unique [STIX Report SDO](https://docs.oa
"created": "<ITEM INGEST DATE>",
"modified": "<ITEM INGEST DATE>",
"name": "<NAME ENTERED ON UPLOAD>",
"description": "<FULL BODY OF TEXT FROM FILE WITH ALIASES APPLIED>",
"description": "<FULL BODY OF TEXT FROM FILE>",
"confidence": "<CONFIDENCE VALUE PASSED AT CLI, IF EXISTS, ELSE NOT PRINTED>",
"published": "<ITEM INGEST DATE>",
"object_marking_refs": [
99 changes: 0 additions & 99 deletions includes/aliases/_README.md

This file was deleted.
