Remove whitelist and aliases (#61)

* remove whitelist and aliases #57
* remove associated include files
* removing old whitelist alias info
* remove wl refs

---------

Co-authored-by: David G <[email protected]>
muchdogesec · Nov 13, 2024 · 87c764d
1 parent 8eaf275
commit 87c764d

Showing 38 changed files with 28 additions and 6,830 deletions.
diff --git a/README.md b/README.md
@@ -11,13 +11,11 @@ The general design goal of txt2stix was to keep it flexible, but simple, so that
 In short txt2stix;
 
 1. takes a txt file input
-2. (OPTIONAL) rewrites file with enabled aliases
-3. extracts observables for enabled extractions (ai, pattern, or lookup)
-4. (OPTIONAL) removes any extractions that match whitelists
-5. converts extracted observables to STIX 2.1 objects
-6. generates the relationships between extracted observables (ai, standard)
-7. converts extracted relationships to STIX 2.1 SRO objects
-8. outputs a STIX 2.1 bundle
+2. extracts observables for enabled extractions (ai, pattern, or lookup)
+3. converts extracted observables to STIX 2.1 objects
+4. generates the relationships between extracted observables (ai, standard)
+5. converts extracted relationships to STIX 2.1 SRO objects
+6. outputs a STIX 2.1 bundle
 
 ## tl;dr
 
@@ -88,8 +86,6 @@ How the extractions are performed
 * `--use_extractions` (REQUIRED): if you only want to use certain extraction types, you can pass their slug found in either `ai/config.yaml`, `lookup/config.yaml` `regex/config.yaml` (e.g. `regex_ipv4_address_only`). Default if not passed, no extractions applied.
 	* Important: if using any AI extractions, you must set an OpenAI API key in your `.env` file
 	* Important: if you are using any MITRE ATT&CK, CAPEC, CWE, ATLAS or Location extractions you must set `CTIBUTLER` or NVD CPE or CVE extractions you must set `VULMATCH` settings in your `.env` file
-* `--use_aliases` (OPTIONAL): if you want to apply aliasing to the input doc (find and replace strings) you can pass their slug found in `aliases/config.yaml` (e.g. `country_iso3_to_iso2`). Default if not passed, no aliases applied.
-* `--use_whitelist` (OPTIONAL): if you want to get the script to ignore certain values that might create extractions you can specify using `whitelist/config.yaml` (e.g. `alexa_top_1000`) related to the whitelist file you want to use. Default if not passed, no whitelists applied.
 * `--relationship_mode` (REQUIRED): either.
 	* `ai`: AI provider must be enabled. extractions performed by either regex or AI for extractions user selected. Rich relationships created from AI provider from extractions.
 	* `standard`: extractions performed by either regex or AI (AI provider must be enabled) for extractions user selected. Basic relationships created from extractions back to master Report object generated.
@@ -108,13 +104,12 @@ If any AI extractions, or AI relationship mode is set, you must set the followin
 	* similar to `ai_settings_extractions` but defines the model used to generate relationships. Only one model can be provided. Passed in same format as `ai_settings_extractions`
 	* See `tests/manual-tests/cases-ai-relationships.md` for some examples
 
-## Adding new extractions/lookups/aliases
+## Adding new extractions
 
-It is very likely you'll want to extend txt2stix to include new extractions, aliases, and/or lookups. The following is possible:
+It is very likely you'll want to extend txt2stix to include new extractions to;
 
 * Add a new lookup extraction: add your lookup to `includes/lookups` as a `.txt` file. Lookups should be a list of items seperated by new lines to be searched for in documents. Once this is added, update `includes/extractions/lookup/config.yaml` with a new record pointing to your lookup. You can now use this lookup time at script run-time.
 * Add a new AI extraction: Edit `includes/extractions/ai/config.yaml` with a new record for your extraction. You can craft the prompt used in the config to control how the LLM performs the extraction.
-* Add a new alias: add a your alias to `includes/aliases` as a `.csv` file. Alias files should have two columns `value,alias`, where `value` is the document in the original document to replace and `alias` is the value it should be replaced with. Once this is added, update `includes/extractions/alias/config.yaml` with a new record pointing to your alias. You can now use this lookup time at script run-time.
 
 Currently it is not possible to easily add any other types of extractions (without modifying the logic at a code level).
 

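The "Add a new lookup extraction" step kept in the README hunk above describes a lookup as a `.txt` file of newline-separated items that are searched for in documents. As a rough illustration of that idea only, here is a minimal sketch. It is not txt2stix's implementation: `load_lookup`, `find_lookup_hits`, the file name `my_lookup.txt`, and the use of plain case-sensitive substring search are all assumptions made for the example.

```python
import tempfile
from pathlib import Path


def load_lookup(path):
    """Read a lookup file: one term per line, blank lines skipped."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def find_lookup_hits(text, terms):
    """Return the lookup terms that occur in text, in file order."""
    # Assumption: plain, case-sensitive substring search.
    return [term for term in terms if term in text]


# Demo with a throwaway lookup file (hypothetical name, illustrative terms).
with tempfile.TemporaryDirectory() as tmp:
    lookup_file = Path(tmp) / "my_lookup.txt"
    lookup_file.write_text("Emotet\nTrickBot\n", encoding="utf-8")
    demo_terms = load_lookup(lookup_file)

demo_hits = find_lookup_hits("The report mentions Emotet twice.", demo_terms)
print(demo_hits)
```

In txt2stix itself the lookup file still has to be registered in `includes/extractions/lookup/config.yaml`, as the README text states; the sketch skips that step.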
diff --git a/docs/README.md b/docs/README.md
@@ -32,56 +32,9 @@ Here is an overview of how the txt2stix processes txt files into STIX 2.1 bundle
 
 https://miro.com/app/board/uXjVKEyFzB8=/
 
-### Aliases
-
-In many cases two extractions might be related to the same thing. For example, the extraction `USA` and `United States` and `United States of America` are all referring to the same thing.
-
-Aliases normalise the input text before extractions happen so that the same extraction is used. e.g. changing `United States` -> `USA`.
-
-Aliases are applied before extractions. Essentially the first step of processing is to replace the alias values, with the desired value.
-
-The aliaases are set in the `includes/extractions/config.yaml`
-
-To demonstrate, lets say the alias config file (in `aliases/`) looks like so;
-
-```yaml
-country_iso3_to_iso2:
-  name: Turns Country ISO2 values into ISO3
-  description:
-  created: 2020-01-01
-  modified:  2020-01-01
-  created_by: signalscorps
-  version: 1.0.0
-  file: /aliases/default/country_iso3_alias.csv
-```
-
-This aliases uses the alias file `country_iso3_alias.csv`.
-
-The contents of an alias file has two columns, `value` and `alias`
-
-```csv
-value,alias
-AFG,AF
-ALA,AX
-ALB,AL
-DZA,DZ
-```
-
-This will turn all references of AFG in the inp
-
-### Whitelists
-
-In many cases files will have IoC extractions that are not malicious. e.g. `google.com` (and thus they don't want them to appear in final bundle).
-
-Whitelists provide a list of values to be compared to extractions. If a whitelist value matches an extraction, that extraction is removed and any relationships where is the `source_ref` or `target_ref` are also removed so that a user does not see them.
-
-Design decision: This is done after extractions to save tokens with AI providers (otherwise might be easily passing 10000+ more tokens to the AI).
-
-Note, whitelists are designed to be simplistic in txt2stix. If you want more advanced removal of potential benign extractions you should use another tool, like a Threat Intelligence Platform.
-
 ### Extractions
 
-After aliasing has been applied, extractions happen. There are 3 types of extractions in txt2stix.
+There are 3 types of extractions in txt2stix.
 
 1. Pattern: Pattern extraction type works by using regex patterns to extract data from the inputted document.
     * when to use: for pattern based extractions that are easy to detect
@@ -97,7 +50,7 @@ A user can use a mix of all extractions in any request.
 
 #### A note on extraction logic
 
-When searching in written reports, extractions/aliasing is not always obvious to a machine (when pattern matching).
+When searching in written reports, extractions are not always obvious to a machine (when pattern matching).
 
 e.g. lets say `MITRE ATT&CK` was in a report, and country code alpha2 extraction was on (IT and AT might be extracted incorrectly as countries).
 
@@ -123,7 +76,7 @@ To make it clear, the above formats will all extract. The logic for txt2stix ext
 
 As you can see, this logic would avoid the issue shown in the `MITRE ATT&CK` example.
 
-Design decision: this does not apply to AI mode extractions (but still applies for aliasing before extractions) because assumption is AI model is smart enough to deal with extracting data in a more intelligent manner.
+Design decision: this does not apply to AI mode extractions because assumption is AI model is smart enough to deal with extracting data in a more intelligent manner.
 
 ### Relationship modes
 

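The "A note on extraction logic" hunks above explain why naive pattern matching would pull `IT` and `AT` out of `MITRE ATT&CK`. The exact boundary rule txt2stix applies sits in lines this diff does not show, so the sketch below only illustrates the general problem. It assumes a whitespace-delimited match as a stand-in rule, and `naive_hits` and `bounded_hits` are hypothetical helper names, not part of txt2stix.

```python
import re


def naive_hits(text, codes):
    """Substring search: matches codes even inside longer words."""
    return [code for code in codes if code in text]


def bounded_hits(text, codes):
    """Match a code only when no non-whitespace character touches it."""
    hits = []
    for code in codes:
        # (?<!\S) and (?!\S): whitespace or string edge on both sides.
        pattern = r"(?<!\S)" + re.escape(code) + r"(?!\S)"
        if re.search(pattern, text):
            hits.append(code)
    return hits


codes = ["IT", "AT"]
report = "Mapped to MITRE ATT&CK techniques"

print(naive_hits(report, codes))    # ['IT', 'AT'] (false positives)
print(bounded_hits(report, codes))  # []
print(bounded_hits("Servers were located in IT and AT", codes))  # ['IT', 'AT']
```

A stricter or looser boundary (for example allowing quotes or brackets around a value) would change which strings extract; the sketch picks the simplest rule that avoids the `MITRE ATT&CK` false positives.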
diff --git a/docs/stix-mapping.md b/docs/stix-mapping.md
@@ -49,7 +49,7 @@ All files uploaded are represented as a unique [STIX Report SDO](https://docs.oa
     "created": "<ITEM INGEST DATE>",
     "modified": "<ITEM INGEST DATE>",
     "name": "<NAME ENTERED ON UPLOAD>",
-    "description": "<FULL BODY OF TEXT FROM FILE WITH ALIASES APPLIED>",
+    "description": "<FULL BODY OF TEXT FROM FILE>",
     "confidence": "<CONFIDENCE VALUE PASSED AT CLI, IF EXISTS, ELSE NOT PRINTED>",
     "published": "<ITEM INGEST DATE>",
     "object_marking_refs": [

diff --git a/includes/aliases/_README.md b/includes/aliases/_README.md