V0.3.3 (#37)
* Change rule and revert

- change the rule length limit of 93 to be exclusive rather than inclusive
- revert to the state prior to the fs loading changes and instead investigate extending the bypass flag for stdin

* Test: Read buffer

Testing a read buffer implementation to increase the reading speed of large files and to see how memory use could be optimized in different scenarios.

* Test: Update buffer to 2 GB

After testing buffer sizes between 1 GB and 5 GB, there doesn't seem to be much difference past an estimated 2 GB.

* Test: Update buffer size

It seems like the unused buffer is freed pretty quickly, so a larger buffer only helps further with large files. This implementation is faster than the original in all cases; what remains is fine-tuning the default buffer size at either 2 GB or 4 GB.

Leaning towards 4 GB because there have been runs almost 30 seconds faster than with the 2 GB buffer, and I would expect users of -f to be on a system with at least 8 GB of RAM.

* Add new regram mode

Added a new mode called regram.
JakeWnuk authored Sep 17, 2024
1 parent 649811a commit 02c0eba
Showing 7 changed files with 190 additions and 37 deletions.
8 changes: 5 additions & 3 deletions README.md
@@ -40,7 +40,7 @@ git clone https://github.com/JakeWnuk/ptt && cd ptt && docker build -t ptt . &&

### Usage:
```
Usage of Password Transformation Tool (ptt) version (0.3.2):
Usage of Password Transformation Tool (ptt) version (0.3.3):
ptt [options] [...]
Accepts standard input and/or additional arguments.
@@ -64,7 +64,7 @@ These modify or filter the transformation mode.
-m int
Minimum numerical frequency to include in output.
-n int
Maximum number of items to return in output.
Maximum number of items to return in output.
-o string
Output to JSON file in addition to stdout.
-p int
@@ -87,7 +87,7 @@ These modify or filter the transformation mode.
-vvv
Show verbose statistics output when possible.
-w int
Number of words to generate for passphrases if applicable.
Number of words to use for a transformation if applicable.
-------------------------------------------------------------------------------------------------------------
Transformation Modes:
These create or alter based on the selected mode.
@@ -114,6 +114,8 @@ These create or alter based on the selected mode.
Transforms input by swapping tokens from a partial mask file and an input file.
-t passphrase -w [words] -tf [file]
Transforms input by randomly generating passphrases with a given number of words and separators from a file.
-t regram -w [words]
Transforms input by 'regramming' sentences into new n-grams with a given number of words.
-t replace-all -tf [file]
Transforms input by replacing all strings with all matches from a ':' separated file.
-t rule-append
75 changes: 55 additions & 20 deletions docs/USAGE.md
@@ -1,5 +1,5 @@
# Password Transformation Tool (PTT) Usage Guide
## Version 0.3.0
## Version 0.3.3

### Table of Contents
#### Getting Started
@@ -37,6 +37,7 @@
2. [Encoding and Decoding](#encoding-and-decoding)
3. [Hex and Dehex](#hex-and-dehex)
4. [Substrings](#substrings)
5. [Regram](#regram)

## Getting Started

@@ -112,25 +113,46 @@ their collective values combined. The rest of the flags can only be used once.
These flags work with files and directories.

#### Options:
- `-b`: Bypass map creation and use stdout as primary output.
- `-d`: Enable debug mode with verbosity levels [0-2].
- `-f`: Read additional files for input.
- `-i`: Starting index for transformations if applicable. Accepts ranges separated by '-'.
- `-k`: Only keep items in a file.
- `-l`: Only output items of a certain length (does not adjust for rules). Accepts ranges separated by '-'.
- `-m`: Minimum numerical frequency to include in output.
- `-n`: Maximum number of items to return in output.
- `-o`: Output to JSON file in addition to stdout.
- `-p`: Change parsing mode for URL input. [0 = Strict, 1 = Permissive, 2 = Maximum].
- `-r`: Only keep items not in a file.
- `-rm`: Replacement mask for transformations if applicable. (default "uldsbt")
- `-t`: Transformation to apply to input.
- `-tf`: Read additional files for transformations if applicable.
- `-tp`: Read a template file for multiple transformations and operations.
- `-u`: Read additional URLs for input.
- `-v`: Show verbose output when possible.
- `-vv`: Show statistics output when possible.
- `-vvv`: Show verbose statistics output when possible.
```
-b Bypass map creation and use stdout as primary output.
-d int
Enable debug mode with verbosity levels [0-2].
-f value
Read additional files for input.
-i value
Starting index for transformations if applicable. Accepts ranges separated by '-'.
-k value
Only keep items in a file.
-l value
Only output items of a certain length (does not adjust for rules). Accepts ranges separated by '-'.
-m int
Minimum numerical frequency to include in output.
-n int
Maximum number of items to return in output.
-o string
Output to JSON file in addition to stdout.
-p int
Change parsing mode for URL input. [0 = Strict, 1 = Permissive, 2 = Maximum] [0-2].
-r value
Only keep items not in a file.
-rm string
Replacement mask for transformations if applicable. (default "uldsbt")
-t string
Transformation to apply to input.
-tf value
Read additional files for transformations if applicable.
-tp value
Read a template file for multiple transformations and operations.
-u value
Read additional URLs for input.
-v Show verbose output when possible.
-vv
Show statistics output when possible.
-vvv
Show verbose statistics output when possible.
-w int
Number of words to use for a transformation if applicable.
```

#### Transformations:
The following transformations can be used with the `-t` flag:
@@ -157,6 +179,8 @@
Transforms input by swapping tokens from a partial mask file and an input file.
-t passphrase -w [words] -tf [file]
Transforms input by randomly generating passphrases with a given number of words and separators from a file.
-t regram -w [words]
Transforms input by 'regramming' sentences into new n-grams with a given number of words.
-t replace-all -tf [file]
Transforms input by replacing all strings with all matches from a ':' separated file.
-t rule-append
@@ -650,3 +674,14 @@ changed to the length of the input.
This transformation can be used to extract specific parts of the input for
further processing.

### Regram
This mode allows 'regramming' sentences into new n-grams with a given number of words. The syntax is as follows:
```
ptt -f <input_file> -t regram -w <word_count>
```
The `regram` transformation generates new n-grams by recombining words from the
input. The number of words per n-gram is specified by the `-w` flag, and the
output is the set of new n-grams generated from the input.
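As a hypothetical illustration (the file name is made up, the exact output formatting depends on the output and verbosity flags in use, and this assumes the underlying n-gram helper slides a window of `-w` words across each line), regramming a file containing the line `the quick brown fox` with `-w 2` would produce word pairs along the lines of:
```
ptt -f sentences.txt -t regram -w 2
the quick
quick brown
brown fox
```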


5 changes: 3 additions & 2 deletions main.go
@@ -15,7 +15,7 @@ import (
"github.com/jakewnuk/ptt/pkg/utils"
)

var version = "0.3.2"
var version = "0.3.3"
var wg sync.WaitGroup
var mutex = &sync.Mutex{}
var retain models.FileArgumentFlag
@@ -65,6 +65,7 @@ func main() {
"passphrase -w [words] -tf [file]": "Transforms input by randomly generating passphrases with a given number of words and separators from a file.",
"substring -i [index]": "Transforms input by extracting substrings starting at index and ending at index.",
"replace-all -tf [file]": "Transforms input by replacing all strings with all matches from a ':' separated file.",
"regram -w [words]": "Transforms input by 'regramming' sentences into new n-grams with a given number of words.",
}

// Sort and print transformation modes
@@ -93,7 +94,7 @@ func main() {
bypassMap := flag.Bool("b", false, "Bypass map creation and use stdout as primary output.")
debugMode := flag.Int("d", 0, "Enable debug mode with verbosity levels [0-2].")
URLParsingMode := flag.Int("p", 0, "Change parsing mode for URL input. [0 = Strict, 1 = Permissive, 2 = Maximum] [0-2].")
passPhraseWords := flag.Int("w", 0, "Number of words to generate for passphrases if applicable.")
passPhraseWords := flag.Int("w", 0, "Number of words to use for a transformation if applicable.")
flag.Var(&retain, "k", "Only keep items in a file.")
flag.Var(&remove, "r", "Only keep items not in a file.")
flag.Var(&readFiles, "f", "Read additional files for input.")
42 changes: 42 additions & 0 deletions pkg/models/models.go
@@ -3,6 +3,7 @@ package models

import (
"fmt"
"io"
"os"
"strings"
)
@@ -97,13 +98,49 @@ func (p PairList) Swap(i, j int) { p[i], p[j] = p[j], p[i] }
// or from a mock file system for testing
type FileSystem interface {
ReadFile(filename string) ([]byte, error)
Open(filename string) (File, error)
}

// File is an interface that represents a file
type File interface {
Read(p []byte) (n int, err error)
Close() error
}

// MockFileSystem is used to read files from the mock file system
type MockFileSystem struct {
Files map[string][]byte
}

// MockFile represents a mock file
type MockFile struct {
Data []byte
Offset int64
}

// Read reads data from the mock file
func (m *MockFile) Read(p []byte) (n int, err error) {
if m.Offset >= int64(len(m.Data)) {
return 0, io.EOF
}
n = copy(p, m.Data[m.Offset:])
m.Offset += int64(n)
return n, nil
}

// Close closes the mock file (no-op for mock)
func (m *MockFile) Close() error {
return nil
}

// Open opens a mock file and returns a File interface
func (m *MockFileSystem) Open(filename string) (File, error) {
if data, ok := m.Files[filename]; ok {
return &MockFile{Data: data}, nil
}
return nil, fmt.Errorf("file not found: %s", filename)
}

// ReadFile Implements the ReadFile method of the FileSystem interface for the MockFileSystem
func (m *MockFileSystem) ReadFile(filename string) ([]byte, error) {
if data, ok := m.Files[filename]; ok {
@@ -120,6 +157,11 @@ func (r *RealFileSystem) ReadFile(filename string) ([]byte, error) {
return os.ReadFile(filename)
}

// Open opens a file and returns a File interface
func (fs RealFileSystem) Open(filename string) (File, error) {
return os.Open(filename)
}

// Scanner is an interface that is used to read lines from a file
type Scanner interface {
Scan() bool
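For illustration only, here is a minimal sketch of how the new Open/File plumbing might be exercised in a unit test. The test, its file name, and its contents are hypothetical and not part of this commit; the import path follows the module path used in main.go.

```go
package models_test

import (
	"io"
	"testing"

	"github.com/jakewnuk/ptt/pkg/models"
)

// Hypothetical test: read a mock file back through the File interface.
func TestMockFileSystemOpen(t *testing.T) {
	fs := &models.MockFileSystem{Files: map[string][]byte{
		"words.txt": []byte("alpha\nbeta\n"),
	}}

	f, err := fs.Open("words.txt")
	if err != nil {
		t.Fatalf("unexpected error opening mock file: %v", err)
	}
	defer f.Close()

	// models.File satisfies io.Reader, so io.ReadAll drains it until io.EOF.
	data, err := io.ReadAll(f)
	if err != nil {
		t.Fatalf("unexpected error reading mock file: %v", err)
	}
	if string(data) != "alpha\nbeta\n" {
		t.Errorf("unexpected contents: %q", string(data))
	}
}
```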
4 changes: 2 additions & 2 deletions pkg/rule/rule.go
@@ -122,7 +122,7 @@ func FormatCharToRuleOutput(strs ...string) (output string) {
output = output[:len(output)-1] + ":"
}

if output != "" && len(output) <= 93 {
if output != "" && len(output) < 93 {
return strings.TrimSpace(output)
}

@@ -157,7 +157,7 @@ func FormatCharToIteratingRuleOutput(index int, strs ...string) (output string)
}
}

if output != "" && len(output) <= 93 {
if output != "" && len(output) < 93 {
return strings.TrimSpace(output)
}

52 changes: 52 additions & 0 deletions pkg/transform/transform.go
@@ -5,6 +5,7 @@ import (
"fmt"
"math/rand"
"os"
"strings"

"github.com/jakewnuk/ptt/pkg/format"
"github.com/jakewnuk/ptt/pkg/mask"
@@ -129,6 +130,12 @@ func TransformationController(input map[string]int, mode string, startingIndex i
os.Exit(1)
}
output = ReplaceAllKeysInMap(input, transformationFilesMap, bypass, functionDebug)
case "regram":
if passphraseWords == 0 {
fmt.Fprintf(os.Stderr, "[!] Regram operations require use of the -w flag to specify the number of words to use\n")
os.Exit(1)
}
output = GenerateNGramMap(input, passphraseWords, bypass, functionDebug)
default:
output = input
}
@@ -308,3 +315,48 @@ func GeneratePassphrase(passWords map[string]int, transformationFilesMap map[str

return newKeyPhrase
}

// GenerateNGramMap takes a map of keys and values and generates a new map
// using the utils.GenerateNGrams function and combines the results. This
// function is used to generate n-grams from the input map for the regram
// transformation mode.
//
// Args:
//
// input (map[string]int): The original map to generate n-grams from
// ngramSize (int): The size of the n-grams to generate
// bypass (bool): If true, the map is not used for output or filtering
// debug (bool): If true, print additional debug information to stderr
//
// Returns:
//
// (map[string]int): A new map with the n-grams generated
func GenerateNGramMap(input map[string]int, ngramSize int, bypass bool, debug bool) map[string]int {
newMap := make(map[string]int)
for key, value := range input {
newKeyArray := utils.GenerateNGrams(key, ngramSize)
for _, newKey := range newKeyArray {

if debug {
fmt.Fprintf(os.Stderr, "Key: %s\n", key)
fmt.Fprintf(os.Stderr, "New Key: %s\n", newKey)
}

newKey = strings.TrimSpace(newKey)
newKey = strings.TrimLeft(newKey, ",")
newKey = strings.TrimRight(newKey, ",")
newKey = strings.TrimLeft(newKey, " ")

if !bypass {
if newMap[newKey] == 0 {
newMap[newKey] = value
} else {
newMap[newKey] += value
}
} else {
fmt.Println(newKey)
}
}
}
return newMap
}
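GenerateNGramMap depends on utils.GenerateNGrams, which is not touched by this diff. The sketch below shows one way such a helper could behave, assuming sliding-window word n-grams of the requested size; it is an illustration, not the committed implementation.

```go
package utils

import "strings"

// GenerateNGrams (sketch): split the input into whitespace-separated words
// and emit every contiguous run of n words, joined by single spaces.
func GenerateNGrams(s string, n int) []string {
	words := strings.Fields(s)
	if n <= 0 || len(words) < n {
		return nil
	}
	ngrams := make([]string, 0, len(words)-n+1)
	for i := 0; i+n <= len(words); i++ {
		ngrams = append(ngrams, strings.Join(words[i:i+n], " "))
	}
	return ngrams
}
```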
41 changes: 31 additions & 10 deletions pkg/utils/utils.go
@@ -40,6 +40,8 @@ import (
// (map[string]int): A map of words from the files
func ReadFilesToMap(fs models.FileSystem, filenames []string) map[string]int {
wordMap := make(map[string]int)
// 4 GB read buffer
chunkSize := int64(4 * 1024 * 1024 * 1024)

i := 0
for i < len(filenames) {
@@ -52,21 +54,40 @@ func ReadFilesToMap(fs models.FileSystem, filenames []string) map[string]int {
}
filenames = append(filenames, files...)
} else {
data, err := fs.ReadFile(filename)
file, err := fs.Open(filename)
if err != nil {
fmt.Fprintf(os.Stderr, "[!] Error reading file %s\n", filename)
fmt.Fprintf(os.Stderr, "[!] Error opening file %s\n", filename)
os.Exit(1)
}
defer file.Close()

buffer := make([]byte, chunkSize)
for {
bytesRead, err := file.Read(buffer)
if err != nil && err != io.EOF {
fmt.Fprintf(os.Stderr, "[!] Error reading file %s\n", filename)
os.Exit(1)
}
if bytesRead == 0 {
break
}

err = json.Unmarshal(data, &wordMap)
if err == nil {
fmt.Fprintf(os.Stderr, "[*] Detected ptt JSON output. Importing...\n")
continue
}
data := buffer[:bytesRead]

fileWords := strings.Split(string(data), "\n")
for _, word := range fileWords {
wordMap[word]++
err = json.Unmarshal(data, &wordMap)
if err == nil {
fmt.Fprintf(os.Stderr, "[*] Detected ptt JSON output. Importing...\n")
continue
}

fileWords := strings.Split(string(data), "\n")
for _, word := range fileWords {
wordMap[word]++
}

if err == io.EOF {
break
}
}
}
i++
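For context, a minimal usage sketch of the buffered reader above. The file name is hypothetical and the import paths follow the ones shown earlier in this commit; error handling inside ReadFilesToMap (exiting on failure) is unchanged.

```go
package main

import (
	"fmt"

	"github.com/jakewnuk/ptt/pkg/models"
	"github.com/jakewnuk/ptt/pkg/utils"
)

func main() {
	// RealFileSystem implements the extended FileSystem interface (ReadFile + Open).
	fs := &models.RealFileSystem{}

	// Hypothetical wordlist; ReadFilesToMap counts newline-separated items per file.
	wordMap := utils.ReadFilesToMap(fs, []string{"wordlist.txt"})

	for word, count := range wordMap {
		fmt.Printf("%d %s\n", count, word)
	}
}
```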
