Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compiler V3 #39

Merged
merged 53 commits into from
Dec 23, 2024
Merged

Compiler V3 #39

merged 53 commits into from
Dec 23, 2024

Conversation

Zffu
Copy link
Member

@Zffu Zffu commented Dec 23, 2024

Summary by CodeRabbit

  • New Features

    • Introduced a new clean target for build cleanup.
    • Added support for compiling programs into the Portable Executable (PE) format.
    • Enhanced lexer to handle specified input sizes and improved token parsing.
    • Added functionality for handling function invocations in the parser.
  • Bug Fixes

    • Ensured proper initialization of valueSize in AST nodes.
    • Improved error handling for file operations.
  • Documentation

    • Updated function signatures and parameter descriptions in headers.
  • Chores

    • Added utility functions for file operations and data manipulation.

Zffu and others added 30 commits December 19, 2024 11:57
…fHeader, writeWinSTDFields, writeWinSpecificFields and writeWinSection
@Zffu Zffu added compiler Signifies a compiler change. experiment Includes an experimental feature ast Signifies a AST / parser change. labels Dec 23, 2024
@Zffu Zffu self-assigned this Dec 23, 2024
Copy link

coderabbitai bot commented Dec 23, 2024

Walkthrough

This pull request introduces significant enhancements to the Quickfall toolchain, focusing on improving the compiler's infrastructure for generating Portable Executable (PE) files. The changes span multiple components, including the lexer, parser, compiler, and build system. Key modifications include updating function signatures, adding PE file format support, introducing new utility functions for file operations, and refining the AST and IR representations. The overall goal appears to be establishing a more robust and flexible compilation process with explicit support for Windows executable generation.

Changes

File Changes
Makefile - Updated prepare_build target message
- Added new clean target to remove build artifacts
benchmarks/main.c - Updated runLexer function signature to include buffer size
- Modified compilation process using new IR context
src/cli/main.c - Added code_section global variable
- Updated file reading and lexer processing
src/compiler/compiler.c - Modified makeContext and compile function logic
- Updated main function validation
src/lexer/lexer.c - Updated runLexer to accept input size
- Enhanced token parsing logic
src/parser/asts/functions.c - Added parseFunctionInvoke function
- Updated function parsing logic
Multiple new files - Added pe/format.h, pe/pe.c, pe/pe.h
- Added compiler/units.h, compiler/utils.h, compiler/utils.c

Sequence Diagram

sequenceDiagram
    participant CLI as CLI
    participant Lexer
    participant Parser
    participant Compiler
    participant PEGenerator

    CLI->>Lexer: runLexer(input, size)
    Lexer-->>Parser: Return lexer result
    Parser->>Parser: Parse AST
    Parser-->>Compiler: Generate IR Context
    Compiler->>PEGenerator: compilePE(file, program, size)
    PEGenerator->>PEGenerator: Generate PE headers
    PEGenerator->>PEGenerator: Write program data
    PEGenerator-->>Compiler: Return compiled executable
Loading

Possibly Related PRs

Poem

🐰 Compiling code with rabbit might,
PE headers dancing in binary light,
Lexers parsing, parsers gleam,
A toolchain built on coding's dream!
Quickfall rises, executable and bright! 🚀


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 13

🔭 Outside diff range comments (1)
src/compiler/compiler.c (1)

Buffer reallocation has incorrect size calculation

The reallocation code has a critical issue:

  • The realloc(ctx->nodes, buffSize) call uses raw buffSize value instead of multiplying it by sizeof(IR_NODE), unlike the initial allocation malloc(sizeof(IR_NODE) * 32)
  • This mismatch will lead to buffer overflow as the reallocation size is much smaller than required

The correct reallocation should be:

ctx->nodes = realloc(ctx->nodes, sizeof(IR_NODE) * buffSize);
🔗 Analysis chain

Line range hint 32-92: Potential buffer resizing issue for variables and function nodes.

The code grows "buffSize" by 1.5 when the nodeIndex surpasses the current threshold, but then reuses the same pointer "ctx->nodes" with "realloc" for IR_NODE array. Verify that the new allocation size (buffSize * 1.5) meets memory alignment requirements and ensures no out-of-bounds writes.


Let me gather more information about the memory allocation and buffer handling in the codebase.

This will help us understand:

  1. How buffSize is initialized and managed
  2. The IR_NODE structure definition and alignment requirements
  3. The node creation and memory allocation patterns

Let me check for the IR_NODE structure definition and the size calculations.

This will help us understand:

  1. The exact structure layout of IR_NODE
  2. How the size calculations are done
  3. Any predefined buffer size constants

Let me try one more search with a modified pattern to find the IR_NODE structure definition.

This will help us understand:

  1. The IR_NODE structure definition that we missed earlier
  2. The complete context around the realloc operation
🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found


Script:

#!/bin/bash
# Search for buffer size and memory allocation patterns
rg -U "buffSize|realloc|malloc" src/compiler/compiler.c

# Look for IR_NODE struct definition and related memory handling
rg -U "struct\s+IR_NODE|IR_NODE\s*\*" src/compiler/

# Search for createIRNode implementation
ast-grep --pattern 'IR_NODE* createIRNode($$$)'

Length of output: 1166


Script:

#!/bin/bash
# Look for IR_NODE structure definition
rg -U "struct\s+IR_NODE\s*{[^}]*}" src/compiler/

# Check for sizeof usage with IR_NODE
rg -U "sizeof\s*\(\s*IR_NODE\s*\)" src/compiler/

# Look for buffer size constants or defines
rg -U "#define.*BUFF|BUFFER|SIZE" src/compiler/

Length of output: 408


Script:

#!/bin/bash
# Look for IR_NODE structure definition with a simpler pattern
rg -U "struct.*IR_NODE" -A 20 src/compiler/

# Check the realloc usage more carefully
rg -U "realloc.*buffSize" -B 2 -A 2 src/compiler/

Length of output: 1111

🧹 Nitpick comments (17)
src/lexer/lexer.c (2)

23-23: Ensure consistent naming for parameters.

The new parameter "int size" is clear, but consider using a more descriptive name such as "length" or "inputLen" for clarity.


45-60: Confirm numeric token parsing logic.

Including 'x' for numeric sequences can expand the range of valid tokens. Verify that invoking "foundX" after recognizing an 'x' meets the intended design and doesn't conflict with parsing hex or other numeric formats.

src/lexer/lexer.h (1)

19-19: Synchronize documentation with new signature.

Update doc comments to mention the "size" parameter, ensuring that it's explained as the string's length or a subset length to parse.

src/compiler/utils.h (2)

13-13: Consider basic error handling for file pointer.
Before writing data, it might be safer to check if the file pointer is valid (e.g., not NULL) to avoid segmentation faults.


38-38: Clarify the origin for file seeking.
Consider specifying (in documentation or in an additional parameter) the origin (SEEK_SET, SEEK_CUR, SEEK_END) to prevent ambiguity.

src/compiler/ir.h (2)

34-35: Validate memory usage for flexible 'value' pointer.
Switching from char* to void* increases flexibility but also complexity. Ensure there's a robust plan for memory allocation, ownership, and cleanup (especially with valueSize).


54-54: Add short documentation for mainFunc.
The new mainFunc pointer is helpful to directly access the main function node. Consider adding clarifying comments (e.g., how it is set, used, or if it can be NULL).

src/parser/asts/functions.h (1)

40-40: Add docstring for parseFunctionInvoke.
All other parser functions have a docstring. This function should follow suit for consistency and clarity.

 /**
  * Parses a function invocation.
  * @param result the lexer result.
  * @param index the current index in the tokens.
  * @return the function invocation AST node.
  */
+AST_NODE* parseFunctionInvoke(struct LexerResult result, int index);
src/compiler/utils.c (1)

16-16: Optional: Remove or refine debug log.

Line 16 prints the numeric value twice in different formats. If this is a temporary debug log, consider wrapping it in a debug flag or removing it for production.

src/compiler/pe/pe.c (1)

14-70: Validate section size and alignment.

The code sets the section’s VirtualSize to 0x1000 and overall SizeOfImage to 0x2000. For larger programs, these fixed values may be too small. Consider computing the section size dynamically or adjusting SizeOfImage to accommodate bigger sections.

src/compiler/pe/format.h (1)

2-2: Correct the typo in the file documentation.

"Defintion" should be "Definition."

- * Quickfall PE Format Defintion.
+ * Quickfall PE Format Definition.
src/cli/main.c (1)

130-133: Refactor commented-out compilePE usage
The commented-out line “compilePE(fptr, code_section, sizeof(code_section))” indicates partial usage of the new PE functionality. Consider enabling or removing it after confirming the correct use case and ensuring cross-platform safety.

src/parser/asts/functions.c (3)

68-68: Remove debug print
Printing "Type: %d" in production code might clutter logs. Consider upgrading it to a more descriptive message or removing it.


115-115: Replace vague debug print
"jd" is not self-explanatory. Consider refining the debug message to clearly indicate the unexpected condition.


201-213: New parseFunctionInvoke implementation
This new function is a good addition for parsing function calls. Verify whether support for complex expressions within function arguments is required.

src/compiler/compiler.h (1)

22-24: compile now accepts FILE instead of filename*
Switching to a FILE* parameter provides more flexibility in writing the output directly to a stream. Ensure that all error paths (e.g., NULL file pointer) are covered in the implementation.

Makefile (1)

85-85: Consider keeping build clearing indication in the message

The message change from "Clearing old builds" to "Starting building logic" makes the prepare_build target's purpose less clear. Consider using a message that indicates both actions, e.g., "[INFO] Clearing old builds and starting build logic"

-	@echo [INFO] Starting building logic
+	@echo [INFO] Clearing old builds and starting build logic
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9cee137 and f99bc8a.

📒 Files selected for processing (20)
  • Makefile (1 hunks)
  • benchmarks/main.c (2 hunks)
  • src/cli/main.c (3 hunks)
  • src/compiler/compiler.c (8 hunks)
  • src/compiler/compiler.h (1 hunks)
  • src/compiler/ir.h (2 hunks)
  • src/compiler/pe/format.h (1 hunks)
  • src/compiler/pe/pe.c (1 hunks)
  • src/compiler/pe/pe.h (1 hunks)
  • src/compiler/units.h (1 hunks)
  • src/compiler/utils.c (1 hunks)
  • src/compiler/utils.h (1 hunks)
  • src/lexer/lexer.c (2 hunks)
  • src/lexer/lexer.h (1 hunks)
  • src/parser/ast.c (1 hunks)
  • src/parser/ast.h (2 hunks)
  • src/parser/asts/functions.c (4 hunks)
  • src/parser/asts/functions.h (1 hunks)
  • src/parser/parser.c (2 hunks)
  • src/utils/hashmap.c (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • src/utils/hashmap.c
  • src/compiler/units.h
🧰 Additional context used
🪛 cppcheck (2.10-2)
src/parser/asts/functions.c

[error] 193-193: Common realloc mistake

(memleakOnRealloc)

src/compiler/utils.c

[error] 41-41: printf format string requires 5 parameters but only 4 are given.

(wrongPrintfScanfArgNum)

src/compiler/compiler.c

[error] 149-149: Memory leak

(memleak)


[error] 156-156: Memory leak

(memleak)

🔇 Additional comments (19)
src/compiler/compiler.c (1)

105-105: Initialize node->valueSize carefully.

Setting "node->valueSize = tree->valueSize" can be correct, but confirm that "tree->valueSize" is always defined for the relevant ASM function AST nodes, to avoid undefined behavior.

✅ Verification successful

Let me gather more information about the AST node initialization and value size handling in the compiler code.


Let me search for more context about the AST node structure and initialization patterns.


Based on the code context gathered, I can now provide a final response:

The valueSize initialization is safe in this context.

The code shows that node->valueSize = tree->valueSize is specifically used within the AST_ASM_FUNCTION_DECLARATION case, where node is created as an IR_ASM_FUNCTION type node. The createIRNode function properly allocates the node structure, and this assignment is part of a well-defined initialization sequence for assembly function nodes, along with other required properties like value and function arguments.

🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found


Script:

#!/bin/bash
# Let's examine the context around this line and how tree->valueSize is handled
rg -B 10 -A 10 "node->valueSize = tree->valueSize" src/compiler/compiler.c

# Let's also look for where valueSize is initialized in the tree structure
rg "valueSize" src/compiler/compiler.c

# Let's check the struct definition for both node and tree
rg -B 5 -A 5 "struct\s+\w*Node" src/compiler/compiler.c

Length of output: 869


Script:

#!/bin/bash
# Look for tree structure definition
rg -B 5 -A 5 "struct\s+(AST|Tree|IR)" src/compiler/

# Look for createIRNode function implementation to understand initialization
rg -B 10 -A 10 "createIRNode" src/compiler/

# Look for where valueSize is set in the tree
rg -B 5 -A 5 "->valueSize\s*=" src/compiler/

Length of output: 7066

src/parser/ast.c (1)

16-16: Clarify initialization usage.

Initializing "node->valueSize = 0" is a solid safeguard against uninitialized data. Confirm that subsequent code updates "valueSize" appropriately for AST nodes that store data.

src/compiler/utils.h (4)

1-4: Great documentation and clear purpose.
The introductory comments are concise, clearly describing the utilities' purpose for the Quickfall Compiler.


18-18: Same file pointer validation suggestion.
As with write8, adding checks on the file pointer helps ensure robust error handling throughout these utility functions.


23-23: Same file pointer validation suggestion.
Ensuring the file pointer is valid is part of standardized, defensive programming, especially for library utilities.


33-33: Same string validation suggestion.
For writestr16, ensure that the string pointer is also validated.

src/parser/ast.h (2)

26-30: Enums for function invocation and root are well-defined.
These new enum fields expand AST capabilities to represent function invocations and function roots. This is a clear, organized approach.


48-49: Initialize or validate valueSize usage.
You introduced valueSize as an additional field in ASTNode. Ensure it’s consistently set, used, and documented for all relevant node types to avoid confusion or misuse.

✅ Verification successful

Let me gather information about the usage and initialization of valueSize in the codebase.


Let me check for the initialization and usage patterns of valueSize in more detail.


Let me check one more thing to see how valueSize is used in the IR code since we saw it's also defined there.


Let's check the actual usage in those files we found to make a final determination.


valueSize is properly initialized and used consistently

The field is properly managed throughout the codebase:

  • Initialized to 0 in createASTNode (src/parser/ast.c)
  • Set to buffer size in function AST nodes (src/parser/asts/functions.c)
  • Correctly propagated to IR nodes during compilation (src/compiler/compiler.c)
🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found


Script:

#!/bin/bash
# First, let's look at the complete ASTNode definition and context
rg -B5 -A10 "valueSize" src/parser/ast.h

# Then, let's search for all occurrences of valueSize across the codebase
rg "valueSize" --type cpp

# Let's also check for any constructor or initialization patterns
ast-grep --pattern 'struct ASTNode {
  $$$
}'

Length of output: 451


Script:

#!/bin/bash
# Check for ASTNode creation and initialization
rg -B2 -A5 "Creates a new AST Node" src/parser/ast.h src/parser/ast.c

# Look for valueSize assignments or usage
rg -B2 -A2 "->valueSize" src/parser/ast.c src/compiler/ir.c

# Check if there's any documentation about valueSize in comments
rg -B2 -A2 "valueSize" src/parser/ast.c

Length of output: 1264


Script:

#!/bin/bash
# Check IR code for valueSize usage and initialization
rg -B5 -A5 "valueSize" src/compiler/ir.h src/compiler/ir.c

# Look for any value assignments to node->valueSize
rg -l "node.*valueSize.*=" 

Length of output: 513


Script:

#!/bin/bash
# Check the actual valueSize assignments in the identified files
rg -B2 -A2 "node.*valueSize.*=" src/compiler/compiler.c src/parser/ast.c src/parser/asts/functions.c

# Also check for any value-related functions that might use valueSize
rg -B2 -A2 "value.*Size" src/parser/asts/

Length of output: 1072

src/parser/parser.c (3)

62-71: Ensure correct handling of function invocations for KEYWORD tokens.

The new KEYWORD case checks if the next token is PAREN_OPEN, then calls parseFunctionInvoke. This is a valid approach for capturing function invocations such as "returnValue(...)". Verify that all KEYWORD tokens that might require different handling (e.g., conditionals or loops) are accounted for elsewhere or are not impacted by this logic.


37-37: Validate skipping the FUNCTION token.

Parsing begins at index + 1, effectively skipping the FUNCTION token. Confirm that this is intentional and that no relevant token logic is missed.

✅ Verification successful

Let me gather information about the function parsing logic to verify if skipping the FUNCTION token is intentional.


Let me check the token types and function parsing implementation more thoroughly.


From the gathered context, I can now generate a final response as we have enough information about the function parsing implementation.

Skipping the FUNCTION token is correct and intentional.

The implementation shows that:

  • The lexer identifies 'func' keyword and creates a FUNCTION token
  • When the parser encounters a FUNCTION token, it passes index + 1 to parseFunctionDeclaration
  • Inside parseFunctionDeclaration, it expects a KEYWORD token at the passed index, which would be the function name
  • This design correctly separates the function keyword from its name in the parsing process
🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found


Script:

#!/bin/bash
# Let's look at the parser implementation and surrounding context
rg -B 5 -A 10 "parseFunctionDeclaration" src/

Length of output: 2402


Script:

#!/bin/bash
# Let's look at the token types and lexer implementation
rg -B 2 -A 2 "FUNCTION" src/

# Also check for any test cases related to function parsing
rg -B 2 -A 2 "test.*function" src/

Length of output: 5870


30-34: Consider implications of early return.

When encountering BRACKETS_CLOSE while parsing an AST_FUNCTION_ROOT, we immediately set endingIndex and return. This approach is correct if the function definition always terminates on the next BRACKETS_CLOSE token. However, ensure you’ve accounted for all possible nested structures or tokens that may appear before the function truly ends.

✅ Verification successful

Let me gather more information about the parser implementation and token handling to verify this concern.


Let me check how function declarations are parsed and how nested structures are handled.


Let me check the AST type definitions and parser implementation to understand how nested structures are handled.


Based on the gathered information, I can now provide a final response:

Early return is correctly implemented for function body parsing

The implementation is correct because:

  1. parseNodes is called with AST_FUNCTION_ROOT type specifically for function bodies
  2. The AST structure shows AST_FUNCTION_ROOT is a distinct type meant for function body blocks
  3. The parser correctly handles nested structures by:
    • Using a recursive descent approach where parseFunctionDeclaration calls parseNodes for the body
    • Processing all tokens sequentially until a matching BRACKETS_CLOSE is found
    • Maintaining proper scope through the AST node structure
🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found


Script:

#!/bin/bash
# Let's first look at the parser implementation to understand the token handling
rg -A 10 "case BRACKETS_CLOSE:" src/parser/parser.c

# Also check for any nested structure handling
rg "AST_FUNCTION_ROOT" src/parser/parser.c

# Look for token definitions to understand BRACKETS_CLOSE
rg "BRACKETS_CLOSE" src/

Length of output: 648


Script:

#!/bin/bash
# Check the function declaration parsing implementation
cat src/parser/asts/functions.c

# Look for any AST node type definitions
rg -l "enum.*AST" src/

# Check if there are any test cases for nested functions
rg -l "test.*function" src/tests/

Length of output: 4681


Script:

#!/bin/bash
# Check AST type definitions
cat src/parser/ast.h

# Look at the main parser implementation
cat src/parser/parser.h

# Check the parseNodes implementation which is called for function body
rg -A 10 "parseNodes" src/parser/parser.c

Length of output: 1982

src/cli/main.c (5)

12-12: Include <stdint.h> for fixed-width integer types
Including <stdint.h> ensures consistent behavior of fixed-width types and is a good practice for cross-platform development.


116-116: Validate the size of bytes read
Assigning the result of fread back into size ensures that the true number of bytes read is tracked. This is a good approach to avoid processing uninitialized data.


120-120: Passing correct size to the lexer
The lexer is now receiving the exact data size. This prevents out-of-bounds reads. No further action is needed.


28-30: New global code_section array
These hardcoded bytes appear to be machine instructions. Verify that they match the intended architecture and are only used in the correct context (e.g., Windows PE).

✅ Verification successful

Let me gather more context about how these machine code bytes are being used by examining the surrounding code and related files.


Let me check the specific usage of this code_section array to understand its purpose and context better.


Based on the gathered context, I can now provide a final response about the machine code bytes:

Machine code appears to be Windows x64 PE-specific but is currently unused
The bytes 0xB8, 0x4C, 0x00, 0x00, 0x00, 0xC3 represent x64 assembly (mov eax, 76; ret) and while the codebase has PE file generation support for Windows x64 (as seen in pe.c), this code_section is currently commented out in its usage (//compilePE(fptr, code_section, sizeof(code_section));). Since the code is not actively used and the PE generation is properly configured for x64 Windows executables, there are no immediate concerns.

🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found


Script:

#!/bin/bash
# Check the content of the main.c file to understand the context
rg -B 5 -A 10 "code_section\[" src/cli/main.c

# Look for other occurrences of this byte sequence
rg -l "0xB8.*0x4C.*0x00.*0x00.*0x00.*0xC3"

# Find any architecture-specific code or platform checks
rg -i "architecture|x86|amd64|windows|platform"

Length of output: 1449


Script:

#!/bin/bash
# Look for where code_section is used
rg -B 5 -A 15 "code_section" --type c

# Check PE-related code since we see Windows references
rg -B 5 -A 10 "pe\.c" --type c

# Look for any assembly or machine code related functions
rg -B 3 -A 3 "0x[0-9A-F]{2}" --type c

Length of output: 6020


20-20: Importing PE-related functions
Including "../compiler/pe/pe.h" aligns with the new PE compilation feature. Ensure that the introduced functionality is conditionally invoked only where necessary to avoid potential cross-platform issues.

✅ Verification successful

Let me verify the concern about conditional invocation of PE-related functionality by examining the codebase.

This will help us understand:

  1. If PE-related code is properly guarded with platform-specific conditionals
  2. Where PE-related functions are being used in main.c
  3. How the PE header file itself handles platform specificity

Let me check how this PE functionality is used in the actual implementation.

This will help us understand:

  1. How PE compilation is integrated into the main flow
  2. If there are platform-specific build configurations
  3. How the PE compiler implementation handles platform specificity

Let me check one final thing to see how the PE compilation is invoked in the main flow.

This will help us understand:

  1. The context in which PE compilation is triggered
  2. If there are any platform-specific guards in the PE format definitions

PE compilation is currently commented out and properly isolated

The PE compilation functionality is already properly isolated:

  • The PE-related code is commented out in main.c (//compilePE(fptr, code_section, sizeof(code_section));)
  • The PE header files have proper include guards (#ifndef COMPILER_PE)
  • The implementation is self-contained in the PE-specific directory

No further action is needed as the code already follows good practices for cross-platform compatibility.

🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found


Script:

#!/bin/bash
# Check for PE-related conditional compilation directives
rg -B2 -A2 "#if.*WIN|#ifdef.*WIN" src/cli/main.c

# Look for PE-related function calls
rg "pe_" src/cli/main.c

# Check the PE header file for any platform-specific guards
cat src/compiler/pe/pe.h

Length of output: 337


Script:

#!/bin/bash
# Check the main function implementation for PE-related logic
rg -A 10 "int main" src/cli/main.c

# Look for platform-specific compilation flags in build files
fd -e mk -e makefile -e cmake . -x cat {}

# Check PE compiler implementation
cat src/compiler/pe/pe.c

Length of output: 2830


Script:

#!/bin/bash
# Search for compile-related code in main.c to see how PE compilation is triggered
rg -B5 -A10 "compilePE" src/cli/main.c

# Check if there are any platform-specific guards in the format header
cat src/compiler/pe/format.h

Length of output: 2851

src/parser/asts/functions.c (2)

183-183: Rejecting non-hex data
The updated check for t.type != NUMBER ensures only numeric tokens are accepted. Verify that the compiler supports the correct numeric formats (e.g., hex, decimal).


65-66: Avoid silent skip for PAREN_OPEN
Continuing on PAREN_OPEN without further logic might mask potential parsing errors. Confirm if this is intentional (e.g., nested parentheses).

✅ Verification successful

Let me analyze the code context around this PAREN_OPEN handling to understand its purpose and potential implications.


Let me try to find the token definitions and examine the function parsing logic more thoroughly.


After analyzing the code context, I can now provide a final response:

PAREN_OPEN skip is correct in parameter parsing context
The continue on PAREN_OPEN is intentional and correct in this context. This is part of the parseParameters function which processes function parameter declarations. The opening parenthesis is just a delimiter for the parameter list, and skipping it is the right behavior as the actual parameter parsing logic handles the content between parentheses. The error handling is properly implemented through other cases, including the default case that returns NULL for unexpected tokens.

🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found


Script:

#!/bin/bash
# Let's look at the surrounding context of this PAREN_OPEN handling
rg -B 10 -A 10 "case PAREN_OPEN:" src/parser/asts/functions.c

Length of output: 909


Script:

#!/bin/bash
# Let's also check if there are other PAREN_OPEN handlers in the codebase for comparison
rg "PAREN_OPEN" src/parser/

Length of output: 196


Script:

#!/bin/bash
# Let's examine the token definitions and related enums
fd "token" -e h -e c src/parser/ --exec cat {}

Length of output: 48


Script:

#!/bin/bash
# Let's find and examine token-related header files
fd -e h . --exec cat {} 2>/dev/null

Length of output: 10469


Script:

#!/bin/bash
# Let's examine the entire functions.c file for better context
cat src/parser/asts/functions.c

Length of output: 4495


Script:

#!/bin/bash
# Let's also check for any test files that might show the expected behavior
fd -e test -e spec src/parser/ --exec cat {}

Length of output: 398

benchmarks/main.c (1)

126-126: Passing size to runLexer
Ensuring runLexer processes only valid buffer contents helps avoid off-by-one or out-of-bounds conditions.

Comment on lines +71 to +72
++i;
c = string[i];
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codebase verification

Now I have enough information to provide a final response based on the code analysis:

Buffer overflow and loop boundary issues detected

  1. The buff is allocated with size buffLen = 32 but there are no bounds checks in any of the loops:

    • Number parsing loop can overflow if input > 32 chars
    • String parsing loop can overflow if string > 32 chars
    • Keyword parsing loop can overflow if identifier > 32 chars
  2. The string parsing loop has a potential infinite loop issue - it only checks for a closing quote (c != '"') without checking for end of input, which could lead to buffer overflow and out-of-bounds access if the string is not properly terminated.

  3. The increment of i inside loops combined with the outer loop's ++i can cause skipping characters in certain cases, particularly after processing keywords where there's a manual decrement (--i) to compensate.

🔗 Analysis chain

Validate string & keyword loops.

  1. Carefully check that incrementing "i" inside these loops doesn't overshoot the main for loop in certain edge cases.
  2. The null terminator appending is correct, but consider verifying that "buff" stays within bounds when processing very long strings or identifiers.

Also applies to: 84-88, 91-92


Let me analyze the lexer code to verify these concerns about loop bounds and buffer safety.


Let me try different commands to better understand the code structure.

🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found


Script:

#!/bin/bash
# Let's examine the lexer.c file to understand the loop structure and buffer handling
rg -A 10 -B 10 "++i;" src/lexer/lexer.c

Length of output: 121


Script:

#!/bin/bash
# Look for buffer size definitions and declarations
rg "buff\[" src/lexer/lexer.c

Length of output: 176


Script:

#!/bin/bash
# Check for any buffer size constants or definitions
rg "#define.*BUFF" src/lexer/lexer.c

Length of output: 38


Script:

#!/bin/bash
# Look for the main loop structure
ast-grep --pattern 'for ($init; $cond; $update) { $$$ }'  src/lexer/lexer.c

Length of output: 76


Script:

#!/bin/bash
# Let's examine the lexer.c file content to understand the loop structure
cat src/lexer/lexer.c

Length of output: 3021


Script:

#!/bin/bash
# Look for buffer declarations to understand its size
rg "char.*buff" src/lexer/lexer.c

Length of output: 67


Script:

#!/bin/bash
# Search for string and keyword related loops with proper escaping
rg "while.*\(.*c.*==.*" src/lexer/lexer.c

Length of output: 133

Comment on lines +31 to +60
for(int i = 0; i < size; ++i) {

c = string[i];

int buffLen = 32;
char* buff = malloc(buffLen);
buff[0] = '\0';

if(c == ' ' || c == '\t' || c == '\n') {
continue;
} else if (isdigit(c)) {
}
else if(c == '\0') {
break;
}
else if (isdigit(c)) {
int numLen = 0;

while(isdigit(c) || c == 'x') {
int foundX = 0;
while(isdigit(c) || c == 'x' || (foundX == 1 && c != '\0' && c != ' ' && c != '\n')) {
buff[numLen] = c;
numLen++;

c = *string++;
if(c == 'x') foundX = 1;

++i;
c = string[i];
}

buff[numLen] = '\0';

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codebase verification

Based on the gathered information, I can now provide the final response:

Buffer overflow risk confirmed - action required

The code has multiple issues that need to be addressed:

  • Fixed buffer size of 32 bytes (buffLen = 32) with no overflow checks when writing to buff using numLen as index
  • No dynamic buffer resizing logic in the lexer while processing numeric tokens
  • Token array is pre-allocated with fixed size (malloc(sizeof(struct Token) * 1024)
  • No bounds checking between numLen and buffLen in the numeric token processing loop

Required changes:

  • Add bounds checking: if (numLen >= buffLen - 1) { /* handle overflow */ }
  • Implement buffer resizing when needed: buff = realloc(buff, buffLen * 2)
  • Consider using a dynamic array for tokens instead of fixed 1024 size
🔗 Analysis chain

Validate index ranges & handle dynamic buffer growth.

  1. The for loop correctly bounds iteration by "size," which prevents reading beyond the input buffer. Ensure that the calling code passes a valid size.
  2. "buff" is allocated with a fixed size of 32; consider dynamic resizing if a token can exceed this length.

Let me verify the concerns by analyzing the codebase further.

The scripts will help verify:

  1. How the lexer is called and if size validation exists
  2. If there's any buffer overflow protection or dynamic resizing
  3. If token length is checked against buffer size

Let me analyze the results to provide a concrete verification of the concerns.


Let me try to find more specific information about the lexer implementation and its usage.

This will help us understand:

  1. The complete lexer function signature and context
  2. Token structure and any size constraints
  3. Any existing buffer size validations
  4. Where tokens are used, to assess potential buffer overflow risks
🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found


Script:

#!/bin/bash
# Search for the function/method that calls this lexer code
ast-grep --pattern 'lex($$$)'

# Search for any buffer overflow checks or dynamic buffer resizing logic
rg -A 5 "malloc.*buff" 
rg -A 5 "realloc.*buff"

# Look for token length validation
rg "numLen.*buff" 

Length of output: 1199


Script:

#!/bin/bash
# Find the lexer function definition and its context
rg -B 5 -A 10 "struct Token\* lex"

# Look for token struct definition to understand size constraints
rg -B 2 -A 5 "struct Token"

# Search for buffer size checks in lexer
rg -B 2 -A 5 "buffLen|numLen" src/lexer/

# Find where tokens are processed/consumed
rg -l "Token.*tokens"

Length of output: 4538

Comment on lines +147 to +150
if(hashGet(ctx->nodeMap, h) == NULL) {
printf("Error: the main function wasn't defined!\n");
return;
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix memory leaks on early returns.

When returning at lines 149 or 157, the allocated buffer "buff" is never freed.

Apply this minimal patch to free memory before returning:

 if(hashGet(ctx->nodeMap, h) == NULL) {
   printf("Error: the main function wasn't defined!\n");
+  free(buff);
   return;
 }

 if(node->nodeType != IR_FUNCTION) {
   printf("Error: main must be a function!\n");
+  free(buff);
   return;
 }

Also applies to: 155-157

🧰 Tools
🪛 cppcheck (2.10-2)

[error] 149-149: Memory leak

(memleak)

Comment on lines +186 to +187
//todo: change format based on user
compilePE(out, buff, i);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Deallocate "buff" after compiling.

Add a free call to avoid leaking memory in the normal completion path:

 compilePE(out, buff, i);
+free(buff);
}

Committable suggestion skipped: line range outside the PR's diff.

/**
* Compiles into PE format.
*/
void compilePE(FILE* fptr, uint8_t program[], int programSize);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codebase verification

I/O error handling is missing in PE compilation process

The code review reveals several portability and reliability concerns:

  • The compilePE function performs multiple fwrite operations without checking their return values
  • The calling code in compiler.c doesn't handle potential I/O errors from compilePE
  • The function closes the file handle internally, which should be the caller's responsibility
  • No error status is returned to indicate success/failure of the PE compilation

Key locations that need attention:

  • src/compiler/pe/pe.c: Add error checking for all fwrite operations and return status
  • src/compiler/pe/pe.h: Change signature to return an error status
  • src/compiler/compiler.c: Add error handling for compilePE call
  • Remove fclose from compilePE as it violates the single responsibility principle
🔗 Analysis chain

Confirm portability of the PE-building API.

The "compilePE" function signature is straightforward; confirm that calling code properly checks return values or handles potential I/O errors.


Let me search for the usage of compilePE function to verify how it's being called and if error handling is implemented properly.


Let me check the implementation and error handling around the compilePE function calls.

🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found


Script:

#!/bin/bash
# Search for the implementation of compilePE function
ast-grep --pattern 'void compilePE(FILE* $_, uint8_t $_, int $_) {
  $$$
}'

# Search for calls to compilePE
rg "compilePE\(" -A 3

Length of output: 735


Script:

#!/bin/bash
# Get the context of compilePE implementation
cat src/compiler/pe/pe.c

# Check error handling in compiler.c where compilePE is called
cat src/compiler/compiler.c

Length of output: 6566

Comment on lines +193 to +197
buff = realloc(buff, sizeof(uint8_t) * buffIndex);

node->valueSize = buffIndex;
node->value = buff;

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Handle realloc carefully to avoid memory leaks
Realloc can fail, returning NULL and discarding the original pointer if not handled properly. Consider using a temporary pointer:

 uint8_t* tmp = realloc(buff, sizeof(uint8_t) * buffIndex);
-if(tmp == NULL) {
-    return NULL; // Handle error
-}
-buff = tmp;
 node->valueSize = buffIndex;
 node->value = buff;

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 cppcheck (2.10-2)

[error] 193-193: Common realloc mistake

(memleakOnRealloc)

return NULL;
}

buff[buffIndex] = strtol(t.value, NULL, 16);
buffIndex++;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codebase verification

Buffer overflow possible - bounds check missing
The code allocates a fixed buffer of 32 bytes but increments buffIndex without checking against buffSize. This can lead to a buffer overflow when processing more than 32 numbers. Add a bounds check before incrementing buffIndex.

  • In src/parser/asts/functions.c, add a check before writing to buff[buffIndex]:
if (buffIndex >= buffSize) {
    // Either reallocate with larger size or handle error
    buffSize *= 2;
    buff = realloc(buff, sizeof(uint8_t) * buffSize);
    if (!buff) return NULL;
}
buff[buffIndex] = strtol(t.value, NULL, 16);
buffIndex++;
🔗 Analysis chain

Increment buffIndex carefully
Ensure buffIndex never exceeds buffSize. Otherwise, the memory write might go out of bounds.


Let me analyze the code around this line to verify the bounds checking.


Let me check how buff is allocated and if there's a buffSize variable to verify bounds checking.

🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found


Script:

#!/bin/bash
# Get the context around buffIndex increment
rg -B 10 -A 5 "buffIndex\+\+" src/parser/asts/functions.c

Length of output: 301


Script:

#!/bin/bash
# Get the context around buff allocation and buffSize
rg -B 5 -A 15 "buff\s*=\s*malloc" src/parser/asts/functions.c

Length of output: 434

Comment on lines +143 to +145
node->right = parseNodes(result, params->endingIndex, AST_FUNCTION_ROOT);

node->endingIndex = node->right->endingIndex;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codebase verification

Based on the code analysis, I can now provide a final response:

Add NULL check for parseNodes return value
The parseNodes function can return NULL in various scenarios as seen in other parsing functions in the codebase. The current code assumes node->right is always valid when assigning node->endingIndex = node->right->endingIndex. Add a NULL check before accessing node->right to prevent potential crashes:

node->right = parseNodes(result, params->endingIndex, AST_FUNCTION_ROOT);
if (node->right == NULL) return NULL;
node->endingIndex = node->right->endingIndex;
🔗 Analysis chain

Ensure node->right and node->endingIndex are valid
Assuming node->right and node->right->endingIndex are non-null can cause crashes if parseNodes fails. Consider adding safety checks before assigning node->endingIndex.


Let me analyze the parseNodes function and its error handling to verify this concern.


Let me check the implementation of parseNodes and its error handling to better understand if it can return NULL.


Let me try a different approach to find the error handling in the parser.

🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found


Script:

#!/bin/bash
# Search for parseNodes function definition and its return value handling
ast-grep --pattern 'ASTNode* parseNodes($$$)'

# Also search for any NULL checks on parseNodes result
rg "parseNodes\(" -A 5 -B 5

Length of output: 3584


Script:

#!/bin/bash
# Look for NULL returns in parseNodes implementation
rg "return NULL" -B 10 src/parser/parser.c

# Look for the createASTNode function since it's used in parseNodes
ast-grep --pattern 'AST_NODE* createASTNode($$$)'

Length of output: 96


Script:

#!/bin/bash
# Search for createASTNode definition and usage
rg "createASTNode" -A 5 -B 5

# Look for error handling patterns in functions.c
rg "return NULL|return node" -B 10 src/parser/asts/functions.c

Length of output: 12766

Comment on lines +139 to +146
fptr = fopen("output.txt", "w");

endTimer(i, 3);

startTimer();

fptr = fopen("output.txt", "w");
fprintf(fptr, compiled);
fclose(fptr);
IR_CTX* ctx = makeContext(node);
compile(ctx, fptr);

endTimer(i, 4);
fclose(fptr);

free(compiled);
endTimer(i, 3);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codebase verification

Add error handling for IR context creation and file operations
The codebase shows inconsistent error handling patterns. While src/cli/main.c properly checks for both file operations (fopen) and IR context creation (makeContext), the benchmark code lacks these safety checks. Add the following error handling:

  • Check if fopen returns NULL for the output file
  • Verify if makeContext returns NULL, as seen in tests/ir.c and src/cli/main.c
🔗 Analysis chain

Consolidate compilation steps
Opening the file, creating the IR context, compiling, and closing the file are well-structured. Double-check error handling (e.g., verify IR_CTX != NULL) for production readiness.


Let me gather information about error handling in the codebase, particularly around IR_CTX creation and file operations.

🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found


Script:

#!/bin/bash
# Check IR_CTX creation and error handling patterns
ast-grep --pattern 'makeContext($_)'

# Look for error handling patterns around fopen
rg "fopen.*\(" -A 3

# Check IR_CTX struct definition and related error handling
rg "IR_CTX" -A 3

Length of output: 2457

Comment on lines +87 to 91
clean:
$(RM) build
$(RM) $(TARGET)
$(RM) $(TEST_TARGET)
$(RM) $(BENCH_TARGET)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add .PHONY declaration for the clean target

The clean target should be declared as .PHONY to ensure it runs even if a file named 'clean' exists in the directory. This is a common Makefile best practice.

Add this line at the beginning of the Makefile:

+.PHONY: clean

Committable suggestion skipped: line range outside the PR's diff.

@Zffu Zffu merged commit 25d1634 into master Dec 23, 2024
2 of 3 checks passed
@Zffu Zffu linked an issue Dec 24, 2024 that may be closed by this pull request
@Zffu Zffu deleted the comp branch January 12, 2025 23:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ast Signifies a AST / parser change. compiler Signifies a compiler change. experiment Includes an experimental feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Compiler v3
1 participant