
Clean #40

Merged — Zffu merged 15 commits into master from clean on Dec 24, 2024
Conversation

@Zffu Zffu (Member) commented Dec 24, 2024

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced new assembly function for executing machine code.
    • Added a new function for parsing function invocations.
  • Bug Fixes

    • Corrected error messages for already defined assembly functions.
  • Refactor

    • Updated multiple type definitions for lexer results and tokens for consistency.
    • Simplified function signatures across various components.
  • Chores

    • Removed unused header files and utility functions to streamline the codebase.
    • Enhanced lexer functionality to accept input length as a parameter.

@Zffu Zffu self-assigned this Dec 24, 2024

coderabbitai bot commented Dec 24, 2024

Walkthrough

This pull request introduces a comprehensive refactoring of type definitions and naming conventions across the Quickfall compiler's codebase. The primary focus is on standardizing type representations, particularly for lexer results, tokens, and AST nodes. The changes involve replacing struct-based types with typedef definitions, updating function signatures, and ensuring consistent type usage across multiple components of the compiler. These modifications aim to improve code readability and maintainability without significantly altering the underlying functionality.

Changes

File Change Summary
benchmarks/main.c Updated result type from struct LexerResult to LEXER_RESULT
examples/assembly.qf Added new main function with assembly function myAssemblyFunction
examples/helloWorld.qf Deleted file
src/cli/main.c Removed code_section array, updated type references
src/compiler/compiler.c Updated parameter documentation
src/compiler/ir.h Standardized type definitions with typedefs
src/compiler/pe/format.h Removed include directive for units.h
src/compiler/units.h Deleted file with type definitions
src/compiler/utils.c Deleted utility functions for file writing
src/lexer/* Comprehensive type updates for lexer result and token types
src/parser/* Updated function signatures and type definitions for AST nodes
tests/* Updated test files to use new type definitions

Suggested Labels

ast, refactoring, type-system

Poem

🐰 Hop, hop, through the code we go,
Structs and types in a rhythmic flow
Typedef magic, names refined
Compiler's dance, elegantly designed!
Refactoring rabbit's joyful leap 🌈


@Zffu Zffu merged commit 9802cf3 into master Dec 24, 2024
1 of 3 checks passed
@Zffu Zffu deleted the clean branch December 24, 2024 12:09

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (7)
examples/assembly.qf (1)

1-3: Consider capturing the return value from the assembly function.
Although this simple example may not need a return value, in real scenarios, you may want to explicitly handle or communicate the result of your assembly function.

src/parser/asts/math.h (1)

7-8: Header guard naming check.
It's good to see header guards. However, to avoid potential conflicts with the standard math library header <math.h>, consider renaming your guard to something more specific, like QF_AST_MATH_H.

src/parser/ast.c (1)

13-13: Check for a failed memory allocation.
malloc may return NULL if the allocation fails. Consider adding a check for node == NULL to prevent subsequent operations on a null pointer.

 AST_NODE* createASTNode(AST_NODE_TYPE type) {
     AST_NODE* node = malloc(sizeof(AST_NODE));
+    if (!node) {
+        // Handle allocation failure (e.g., log and return NULL or exit)
+        return NULL;
+    }
     node->valueSize = 0;
     ...
 }
src/parser/asts/functions.h (1)

40-45: New function addition looks good.

  1. The choice to use LEXER_RESULT result gives this new function consistency with existing ones.
  2. Ensure you add matching documentation or usage references in .c files that call parseFunctionInvoke, so that external callers or future maintainers can easily locate and adopt it.
src/lexer/lexer.c (1)

27-27: Memory allocation check.

result.tokens = malloc(sizeof(TOKEN) * 1024); works but consider:

  1. Checking for a NULL pointer to avoid potential crashes.
  2. Supporting dynamic resizing if 1024 tokens is not guaranteed to be sufficient.
src/lexer/tokens.h (2)

Line range hint 4-26: Consider enhancing enum documentation and debugging support

While the token types are well-organized, consider these improvements for better maintainability:

  1. Add documentation comments for each token type
  2. Consider adding explicit values for easier debugging
  3. Group tokens more systematically (e.g., keywords, operators, delimiters)

Example enhancement:

 typedef enum {
-    ASM_FUNCTION,
-    FUNCTION,
+    // Assembly function declaration
+    ASM_FUNCTION = 1000,
+    // Regular function declaration
+    FUNCTION = 1001,
     // ... other tokens with similar documentation and values
 } TOKEN_TYPE;

28-30: Enhance structure documentation

The current documentation is quite basic. Consider adding more details about:

  • The purpose and lifecycle of tokens
  • The ownership semantics of the value field
  • Example usage

Example enhancement:

 /**
- * An lexer token generated by the Lexer.
+ * Represents a token generated during lexical analysis.
+ * 
+ * The token consists of its type and an optional string value.
+ * For literals and identifiers, the value field contains the actual text.
+ * For keywords and operators, the value field may be NULL.
+ * 
+ * Note: The caller is responsible for managing the memory of the value field.
+ * 
+ * Example:
+ * TOKEN number_token = {
+ *     .type = NUMBER,
+ *     .value = "42"
+ * };
  */
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 25d1634 and f39393b.

📒 Files selected for processing (25)
  • benchmarks/main.c (1 hunks)
  • examples/assembly.qf (1 hunks)
  • examples/helloWorld.qf (0 hunks)
  • src/cli/main.c (2 hunks)
  • src/compiler/compiler.c (1 hunks)
  • src/compiler/ir.h (2 hunks)
  • src/compiler/pe/format.h (0 hunks)
  • src/compiler/units.h (0 hunks)
  • src/compiler/utils.c (0 hunks)
  • src/lexer/lexer.c (1 hunks)
  • src/lexer/lexer.h (1 hunks)
  • src/lexer/tokens.h (2 hunks)
  • src/parser/ast.c (1 hunks)
  • src/parser/ast.h (2 hunks)
  • src/parser/asts/functions.c (7 hunks)
  • src/parser/asts/functions.h (1 hunks)
  • src/parser/asts/math.c (1 hunks)
  • src/parser/asts/math.h (1 hunks)
  • src/parser/asts/variables.c (2 hunks)
  • src/parser/asts/variables.h (1 hunks)
  • src/parser/parser.c (1 hunks)
  • src/parser/parser.h (1 hunks)
  • tests/ir.c (2 hunks)
  • tests/lexer.c (1 hunks)
  • tests/parser.c (3 hunks)
💤 Files with no reviewable changes (4)
  • src/compiler/pe/format.h
  • src/compiler/units.h
  • examples/helloWorld.qf
  • src/compiler/utils.c
🔇 Additional comments (36)
examples/assembly.qf (1)

5-7: Validate the assembly instructions.
The provided hex instructions (0xB8 0x4C 0x00 0x00 0x00 0xC3) appear to move a literal into a register and then return. Confirm this is the intended behavior for your use case and complies with your target architecture constraints.

src/parser/asts/math.h (1)

15-15: Ensure consistent usage of typedefed lexer result.
Using LEXER_RESULT is consistent with other refactors in the PR. Just ensure all references to the old struct LexerResult are successfully removed and replaced.

tests/lexer.c (1)

5-5: String library inclusion is appropriate.
Including <string.h> is necessary for strlen(buff). This looks correct.

src/parser/parser.h (1)

17-17: Ensure consistent usage across the codebase.

The function’s signature transition from struct LexerResult to LEXER_RESULT and from enum ASTNodeType to AST_NODE_TYPE is aligned with the new typedef-based style. Confirm that all calls to parseNodes have been updated accordingly to avoid compatibility issues.

src/parser/asts/variables.h (2)

17-17: Good standardization of the lexer result type.

This change to parseVariableValue(LEXER_RESULT result, int index) makes the function signature consistent with the new typedef approach introduced in the lexer component. No further issues spotted here.


24-24: Confirm updates in all usage sites.

Similarly, replacing struct LexerResult with LEXER_RESULT ensures consistency. Make sure that any code invoking parseVariableDeclaration is updated to match this signature.

src/lexer/lexer.h (3)

10-13: Clear definition with typedef.

Defining a LEXER_RESULT struct via typedef is a clean approach to standardize how lexer output is represented. The structure is concise and improves readability. No immediate issues detected.


19-19: Interface update in runLexer.

The function now returns the newly typedef'd LEXER_RESULT. Ensure all references to runLexer have been updated to handle the second argument (input length) properly across the codebase.


26-26: Revising token insertion logic.

Changing the signature to void pushToken(LEXER_RESULT* result, TOKEN_TYPE type) aligns with the new typedef approach. Validate that any token creation or usage within pushToken still functions correctly alongside this updated type.

tests/ir.c (2)

5-5: String handling library inclusion.

Including <string.h> is necessary for strlen. This resolves any potential missing declaration warnings and ensures correct compilation.


19-19: Confirm proper lexer call usage.

Switching from struct LexerResult to LEXER_RESULT and calling runLexer(buff, strlen(buff)) is consistent with the new interface. Please verify that no other test files still invoke the old one-parameter version.

src/compiler/ir.h (3)

12-20: Well-defined enum type naming.

Changing from enum IRNodeType to a typedef enum { ... } IR_TYPE aligns well with the new naming conventions, improving clarity and consistency across the codebase.


Line range hint 22-45: Consistent IR_NODE struct definition.

Renaming struct IRNode to IR_NODE harmonizes the naming scheme throughout the project. The usage of struct IR_NODE** variables for nested IR nodes is a clear way to reference child or linked nodes.


47-58: Uniform IR_CTX struct usage.

Defining the IR context as a typedef struct { ... } IR_CTX matches the new approach used elsewhere. This enhanced naming structure makes the IR context definition more approachable.

src/parser/asts/math.c (1)

17-17: Verified parameter type transition.

Switching from struct LexerResult to LEXER_RESULT is consistent with the broader codebase shift. Ensure that any references to parseMathematicalOpNode use the updated type to maintain consistency.

tests/parser.c (3)

5-5: Justified header inclusion.

Including <string.h> is appropriate when using strlen, ensuring the function is declared before usage.


17-17: Consistent lexer invocation.

Defining LEXER_RESULT result = runLexer(buff, strlen(buff)); reflects the standardized lexer result type and properly calculates the buffer length for the lexer.


26-26: Parameter type updated for AST dumping.

Switching the dumpASTTree function parameter from struct ASTNode* to AST_NODE* aligns with the new typedef-based approach, enhancing readability.

src/parser/ast.h (4)

5-6: Clear macro guard naming.

Renaming from AST_2_H to AST_H clarifies the header guard and is consistent with the file’s purpose.


Line range hint 11-36: Enum redefinition for AST_NODE_TYPE.

Transitioning from enum ASTNodeType to typedef enum { ... } AST_NODE_TYPE is a sensible enhancement for readability and code uniformity.


41-53: AST_NODE struct redesign.

Renaming struct ASTNode to AST_NODE adheres to the new naming scheme. The updated field structure remains intuitive, and the references to child and sibling pointers (left, right, next) remain comprehensible under the new typedef.


59-59: Method signature alignment.

Adjusting createASTNode to accept AST_NODE_TYPE instead of enum ASTNodeType completes the transition to the newly defined enum, maintaining consistency across declarations and implementations.

src/parser/asts/functions.h (4)

17-17: Consistent type usage.

Replacing struct LexerResult with LEXER_RESULT keeps the code aligned with the new typedef approach. This change should help maintain uniformity throughout the codebase.


24-24: Standardized parameter type.

Continuing the alignment with LEXER_RESULT reinforces consistency. No issues found here.


31-31: Uniform naming across function signatures.

Again, this is consistent with the rest of the refactor. Good job keeping the naming and type usage uniform.


38-38: ASM function declaration consistency.

Swapping to LEXER_RESULT seamlessly integrates with the new type system. No concerns here.

src/parser/parser.c (2)

20-20: Refined parseNodes function signature.

Switching to LEXER_RESULT and AST_NODE_TYPE demonstrates consistent naming conventions. Keep verifying that all references to this function have been updated in related .h or test files.


25-25: Potential out-of-bounds access concern.

Here TOKEN t = result.tokens[index] is retrieved. Although the loop checks index <= result.size, be mindful of subsequent lookups at index + 1. Confirm that all such references remain within bounds to avoid out-of-bounds errors.

src/parser/asts/variables.c (2)

12-13: Variable value parsing updates.

  1. The updated type usage (LEXER_RESULT and TOKEN) is consistent with the refactor.
  2. Ensure that index is valid to avoid referencing result.tokens[index] when index > result.size.

53-53: Variable declaration refactor alignment.

Utilizing LEXER_RESULT clarifies the intent in variable-related parsing. This maintains uniform handling of token arrays.

src/lexer/lexer.c (3)

18-18: Function signature consistency.

Changing pushToken to use LEXER_RESULT* and TOKEN_TYPE is consistent with the updated typedef approach, ensuring codebase-wide uniformity.


23-23: Refined lexer entry point.

Switching runLexer to return LEXER_RESULT is a clean way to unify type usage. The function name and purpose remain clear.


24-24: Initialization details.

Declaring a LEXER_RESULT result; is straightforward. Confirm that all fields in result are appropriately initialized, especially for error handling or postprocessing.

src/cli/main.c (1)

128-128: Ensure the file pointer is properly closed or flushed.
While this line is blank, it may be indicative of intermediate logic or cleanup that was removed. Verify that other code paths handle closing the file pointer.

✅ Verification successful

Let me verify the file handling by checking for file operations and error handling patterns.


File handling is properly implemented with necessary closure

The file operations in src/cli/main.c are properly handled:

  • Input file is opened with error checking and closed after reading
  • Output file is opened for writing in a separate operation

The blank line at 128 is not related to any missing file handling logic, as all file operations have proper open/close pairs:

  1. Input file: fopen → error check → read operations → fclose
  2. Output file: fopen → compile operation
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Searching for file closure references in this file
rg -A 2 $'fclose\\(' src/cli/main.c

Length of output: 100


Script:

#!/bin/bash
# Search for file operations and related patterns
rg -A 5 'fopen|fptr|FILE \*' src/cli/main.c

Length of output: 819

src/compiler/compiler.c (1)

138-138: Doc comment updated successfully.
Renaming the parameter name from char to out clarifies its usage. This change improves the clarity of the documentation.

src/parser/asts/functions.c (1)

23-23: Type alias usage aligns with the new codebase conventions.
Switching from struct LexerResult to LEXER_RESULT is consistent with the broader refactoring initiative. This update enhances maintainability and readability.

Also applies to: 80-80, 109-109, 149-149, 200-200

#include "../src/lexer/lexer.h"
#include "../src/lexer/tokens.h"

/**
* Runs the lexer test.
*/
int runLexerTest(char* buff) {
-struct LexerResult result = runLexer(buff);
+LEXER_RESULT result = runLexer(buff, strlen(buff));

💡 Codebase verification

Buffer validation is necessary, as the lexer implementation assumes valid input

Based on the code analysis:

  1. The lexer implementation in src/lexer/lexer.c directly accesses the input buffer without validation
  2. The lexer allocates memory and iterates through the buffer assuming it's valid
  3. While main.c has proper buffer handling for file reads, the test function doesn't validate its input

Adding the suggested validation is important because:

  • A NULL buffer would cause a segmentation fault in strlen()
  • An empty buffer would waste resources allocating tokens array
  • The lexer implementation assumes the buffer is valid and directly accesses it
🔗 Analysis chain

Consider validating the buffer length before passing to the lexer.
If buff is empty or NULL, strlen(buff) might cause undefined behavior or an immediate zero-length parse. Safeguard the call by checking buff and its length beforehand.

 if (!buff || strlen(buff) == 0) {
     fprintf(stderr, "Error: input buffer is empty\n");
     return -1;
 }
 LEXER_RESULT result = runLexer(buff, strlen(buff));

Let me gather more information about the lexer implementation and its usage to verify this concern.

Let me also check the test file context and any existing validation patterns:


Let me check the lexer implementation to understand how it handles buffer validation:


Let me check the lexer implementation more specifically to see how it handles the input buffer:

🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found


Script:

#!/bin/bash
# Check the lexer implementation and how it handles NULL/empty buffers
ast-grep --pattern 'runLexer($_, $_)'

# Look for existing buffer validations
rg -A 3 "strlen\(.*buff.*\)" tests/

Length of output: 635


Script:

#!/bin/bash
# Get context around the lexer test
rg -B 5 -A 5 "LEXER_RESULT.*runLexer" tests/

# Look for any NULL checks on buff
rg -B 2 -A 2 "if.*\!.*buff" tests/

Length of output: 1336


Script:

#!/bin/bash
# Check the lexer implementation
rg -B 5 -A 5 "runLexer" src/

# Look for NULL checks with correct regex
rg -B 2 -A 2 "if.*(!|! ).*buff" tests/

# Check lexer.h for any documentation about buffer requirements
fd -e h -x cat {} \; -E "node_modules|build|dist" | rg -A 5 "LEXER_RESULT"

Length of output: 3312


Script:

#!/bin/bash
# Check the lexer.c implementation for buffer handling
rg -B 10 -A 10 "LEXER_RESULT runLexer" src/lexer/lexer.c

# Look for any validation in main.c where the file is read
rg -B 5 -A 5 "fread.*buff" src/

Length of output: 1036

Comment on lines +116 to +117
LEXER_RESULT result = runLexer(buff, size);
AST_NODE* root = parseNodes(result, 0, AST_ROOT);

💡 Codebase verification

Potential null pointer and memory safety issues detected

The code has several safety issues that need to be addressed:

  1. runLexer allocates memory for tokens but doesn't check if the allocation succeeded:

    result.tokens = malloc(sizeof(TOKEN) * 1024);  // No NULL check
  2. parseNodes is called without validating if result.tokens is valid, which could lead to a null pointer dereference.

  3. The fixed size allocation of 1024 tokens could be insufficient for larger inputs, potentially leading to buffer overflows.

Recommendations:

  • Add a NULL check after the malloc in runLexer and return an error state if allocation fails
  • Validate result.tokens before calling parseNodes
  • Consider using dynamic allocation or at least add size validation
🔗 Analysis chain

Check potential null pointer usage.
If runLexer(buff, size) fails or returns an invalid structure, we must ensure parseNodes can handle it gracefully. Consider validating the result before proceeding with parseNodes.

You can confirm correct handling by searching for additional checks of result.tokens:


Let me gather more information about the LEXER_RESULT structure and error handling in runLexer.


Let me check how parseNodes handles the tokens array and any potential error cases.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Searching for any checks on the result structure or parseNodes usage 
# to ensure safety when runLexer fails or returns invalid data.
rg -A 5 $'parseNodes' 

Length of output: 1983


Script:

#!/bin/bash
# Check the LEXER_RESULT structure definition and runLexer implementation
ast-grep --pattern 'typedef struct $_ {
  $$$
} LEXER_RESULT;'

rg -A 5 'runLexer\(' 

Length of output: 1780


Script:

#!/bin/bash
# Check parseNodes implementation and token access
rg -A 10 'for\(.+; index <= result\.size;' 

# Check if there's any error handling in lexer.c
rg -A 5 'malloc.*TOKEN.*1024' src/lexer/lexer.c

Length of output: 186


AST_NODE* root = createASTNode(AST_PARAMETER);
AST_NODE* current = root;

int stack = 0;

for(; index < result.size + 1; ++index) {
-struct Token t = result.tokens[index];
+TOKEN t = result.tokens[index];

💡 Codebase verification

Array bounds validation needs improvement in multiple locations

The verification reveals several issues with array bounds checking:

  • In src/parser/asts/functions.c, there are multiple unsafe accesses:

    • Line 31: result.tokens[index] is used without bounds check
    • Line 49: result.tokens[index + 1] is accessed in a switch statement without prior bounds validation
    • Line 85: Unsafe access in result.tokens[index + 1].type
    • Line 176: Another unchecked access to result.tokens[index + 1]
    • The loop condition index < result.size + 1 is problematic as it allows accessing one position beyond the array bounds
  • In contrast, src/parser/asts/variables.c and src/parser/asts/math.c show proper bounds checking:

    • Example of good practice: if(result.size >= index + 1 && result.tokens[index + 1].type == MATH_OP)
    • Example of good practice: if(result.size >= index + 2 && result.tokens[index + 2].type == DECLARE)

The code should be modified to include proper bounds checking before each array access, following the pattern used in the variables.c and math.c files.

🔗 Analysis chain

Validate array index boundaries.
When accessing tokens via result.tokens[index + 1] or iterating up to result.size + 1, ensure you’re not reading out of bounds.

Also applies to: 49-49, 85-85, 176-176


Let me gather more information about the bounds checking.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Searching for usage of index+1 across the code to ensure no out-of-bounds occur 
# when result.size is close to index+1
rg -A 3 $'tokens\\[\\s*index\\s*\\+\\s*1' src/parser/asts/

Length of output: 1940


Script:

#!/bin/bash
# Search for size checks before accessing tokens[index + 1]
rg -B 3 'result\.size' src/parser/asts/

# Search for the function definitions to understand the context
ast-grep --pattern 'AST_NODE* parse$_($_, $_) {
  $$$
}'

Length of output: 1187

@@ -123,7 +123,7 @@ void main(int argc, char* argv[]) {
endTimer(i, 0);
startTimer();

-struct LexerResult result = runLexer(buff, size);
+LEXER_RESULT result = runLexer(buff, size);

💡 Codebase verification

Memory leak detected in lexer implementation
The code review has revealed two memory leaks:

  1. result.tokens is allocated with malloc in runLexer but never freed after use
  2. For each token, a buff is allocated with malloc but never freed

Specific locations:

  • src/lexer/lexer.c: result.tokens = malloc(sizeof(TOKEN) * 1024)
  • src/lexer/lexer.c: char* buff = malloc(buffLen) inside the loop
  • benchmarks/main.c:126: The result from runLexer is used but its memory is not freed
  • tests/lexer.c: The test also doesn't free the allocated memory
🔗 Analysis chain

Confirm memory deallocation.
After using LEXER_RESULT result, confirm that any dynamically allocated fields within result are freed, if relevant. This ensures we avoid memory leaks in multiple runs.


Let me gather more information about the LEXER_RESULT structure and its memory management.


Let me try a different approach to find the structure and memory management patterns.


Let me check the lexer implementation files to understand the memory management better.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Searching for usage of result tokens disposal or free calls for the LEXER_RESULT structure
rg -A 5 $'free\\(result\\.tokens\\)' benchmarks/ src/ tests/

Length of output: 58


Script:

#!/bin/bash
# First, let's find the definition of LEXER_RESULT structure
ast-grep --pattern 'typedef struct $_$ LEXER_RESULT;'

# Also search for the runLexer function implementation
ast-grep --pattern 'LEXER_RESULT runLexer($$$)'

# Let's also look for any cleanup or free functions related to lexer results
rg -i "free.*lexer|lexer.*free" -A 5

Length of output: 144


Script:

#!/bin/bash
# Look for struct definitions that might be related to lexer results
rg -A 10 "struct.*[Ll]exer.*{|typedef.*[Ll]exer"

# Search for any memory management patterns in the lexer implementation
rg -A 5 "free\(|malloc\(" src/lexer

# Look for the implementation file containing runLexer
fd -e c -e h "lexer"

Length of output: 735


Script:

#!/bin/bash
# Check the content of lexer files
cat src/lexer/lexer.h
cat src/lexer/lexer.c

# Also check the test file for any cleanup patterns
cat tests/lexer.c

Length of output: 3969

Comment on lines +31 to +34
typedef struct {
TOKEN_TYPE type;
char* value;
-};
+} TOKEN;

💡 Codebase verification

Memory management issues confirmed in token handling

The code review reveals several critical memory management issues:

  1. Memory leaks in token value handling:

    • malloc is used to allocate buffers for token values (char* buff = malloc(buffLen)) but these are never freed
    • The token array itself (result.tokens = malloc(sizeof(TOKEN) * 1024)) is never freed
    • No cleanup function exists for LEXER_RESULT or TOKEN structures
  2. Buffer overflow risks:

    • Fixed buffer size of 32 bytes for token values without bounds checking
    • Hard-coded token array size of 1024 without overflow protection
  3. String handling issues:

    • String tokenization doesn't properly handle escape sequences
    • No NULL checks after malloc calls
    • Potential buffer overrun in string literal parsing loop

Note: The codebase needs a proper memory management strategy with:

  • A cleanup function to free token values and arrays
  • Dynamic buffer resizing or proper size checks
  • Safe string handling with bounds checking
  • Clear ownership semantics for token values
🔗 Analysis chain

Consider adding safeguards for string value management

The value field being a raw char* raises some concerns:

  1. No indication of string ownership
  2. No size field to prevent buffer overflows
  3. No distinction between NULL and empty string

Let's check how the value field is managed in the codebase:


Let's search for more specific patterns in the implementation to understand how tokens are handled.


Let's examine the lexer implementation to understand how tokens are created and managed.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for TOKEN value field allocations and deallocations
echo "Searching for value field memory management..."
rg -A 2 "token\.(value|->value)\s*=\s*" 
rg -A 2 "free\s*\(\s*token\.(value|->value)\s*\)"

Length of output: 196


Script:

#!/bin/bash
# Let's try a different approach to find TOKEN usage patterns
echo "Searching for TOKEN struct usage..."
rg -A 5 "TOKEN\s+\w+\s*=" 

echo -e "\nSearching for malloc/strdup with TOKEN..."
rg -A 3 "(malloc|strdup).*TOKEN"

echo -e "\nSearching for free with TOKEN..."
rg -B 2 -A 2 "free.*TOKEN"

echo -e "\nLooking for lexer implementation files..."
fd "lexer.*\.(c|h)$"

Length of output: 2440


Script:

#!/bin/bash
# Let's examine the lexer implementation
echo "Checking lexer.c content..."
cat src/lexer/lexer.c

echo -e "\nChecking lexer.h content..."
cat src/lexer/lexer.h

echo -e "\nLooking for token cleanup/destruction..."
rg -B 3 -A 3 "cleanup|destroy|free" src/lexer/

Length of output: 3845
