Compiler V3 #39

Zffu · 2024-12-23T21:52:33Z

Summary by CodeRabbit

New Features
- Introduced a new clean target for build cleanup.
- Added support for compiling programs into the Portable Executable (PE) format.
- Enhanced lexer to handle specified input sizes and improved token parsing.
- Added functionality for handling function invocations in the parser.
Bug Fixes
- Ensured proper initialization of valueSize in AST nodes.
- Improved error handling for file operations.
Documentation
- Updated function signatures and parameter descriptions in headers.
Chores
- Added utility functions for file operations and data manipulation.

…fHeader, writeWinSTDFields, writeWinSpecificFields and writeWinSection

…riteWinExecutableHeader)

…rs of writeWinExecutable

coderabbitai · 2024-12-23T21:52:41Z

Walkthrough

This pull request introduces significant enhancements to the Quickfall toolchain, focusing on improving the compiler's infrastructure for generating Portable Executable (PE) files. The changes span multiple components, including the lexer, parser, compiler, and build system. Key modifications include updating function signatures, adding PE file format support, introducing new utility functions for file operations, and refining the AST and IR representations. The overall goal appears to be establishing a more robust and flexible compilation process with explicit support for Windows executable generation.

Changes

File	Changes
`Makefile`	- Updated `prepare_build` target message - Added new `clean` target to remove build artifacts
`benchmarks/main.c`	- Updated `runLexer` function signature to include buffer size - Modified compilation process using new IR context
`src/cli/main.c`	- Added `code_section` global variable - Updated file reading and lexer processing
`src/compiler/compiler.c`	- Modified `makeContext` and `compile` function logic - Updated main function validation
`src/lexer/lexer.c`	- Updated `runLexer` to accept input size - Enhanced token parsing logic
`src/parser/asts/functions.c`	- Added `parseFunctionInvoke` function - Updated function parsing logic
Multiple new files	- Added `pe/format.h`, `pe/pe.c`, `pe/pe.h` - Added `compiler/units.h`, `compiler/utils.h`, `compiler/utils.c`

Sequence Diagram

sequenceDiagram
    participant CLI as CLI
    participant Lexer
    participant Parser
    participant Compiler
    participant PEGenerator

    CLI->>Lexer: runLexer(input, size)
    Lexer-->>Parser: Return lexer result
    Parser->>Parser: Parse AST
    Parser-->>Compiler: Generate IR Context
    Compiler->>PEGenerator: compilePE(file, program, size)
    PEGenerator->>PEGenerator: Generate PE headers
    PEGenerator->>PEGenerator: Write program data
    PEGenerator-->>Compiler: Return compiled executable

Possibly Related PRs

Compiler V2 #30: Modifications to the Makefile build process
[feat] Better IR #37: Additional Makefile improvements for testing support

Poem

🐰 Compiling code with rabbit might,
PE headers dancing in binary light,
Lexers parsing, parsers gleam,
A toolchain built on coding's dream!
Quickfall rises, executable and bright! 🚀

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 13

🔭 Outside diff range comments (1)

src/compiler/compiler.c (1)
Buffer reallocation has incorrect size calculation

The reallocation code has a critical issue:

The realloc(ctx->nodes, buffSize) call uses raw buffSize value instead of multiplying it by sizeof(IR_NODE), unlike the initial allocation malloc(sizeof(IR_NODE) * 32)

This mismatch will lead to buffer overflow as the reallocation size is much smaller than required

The correct reallocation should be:
ctx->nodes = realloc(ctx->nodes, sizeof(IR_NODE) * buffSize);
🔗 Analysis chain

Line range hint 32-92: Potential buffer resizing issue for variables and function nodes.

The code grows "buffSize" by 1.5 when the nodeIndex surpasses the current threshold, but then reuses the same pointer "ctx->nodes" with "realloc" for IR_NODE array. Verify that the new allocation size (buffSize * 1.5) meets memory alignment requirements and ensures no out-of-bounds writes.

Let me gather more information about the memory allocation and buffer handling in the codebase.

This will help us understand:

How buffSize is initialized and managed

The IR_NODE structure definition and alignment requirements

The node creation and memory allocation patterns

Let me check for the IR_NODE structure definition and the size calculations.

This will help us understand:

The exact structure layout of IR_NODE

How the size calculations are done

Any predefined buffer size constants

Let me try one more search with a modified pattern to find the IR_NODE structure definition.

This will help us understand:

The IR_NODE structure definition that we missed earlier

The complete context around the realloc operation
🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found

Script:
#!/bin/bash
# Search for buffer size and memory allocation patterns
rg -U "buffSize|realloc|malloc" src/compiler/compiler.c

# Look for IR_NODE struct definition and related memory handling
rg -U "struct\s+IR_NODE|IR_NODE\s*\*" src/compiler/

# Search for createIRNode implementation
ast-grep --pattern 'IR_NODE* createIRNode($$$)'
Length of output: 1166

Script:
#!/bin/bash
# Look for IR_NODE structure definition
rg -U "struct\s+IR_NODE\s*{[^}]*}" src/compiler/

# Check for sizeof usage with IR_NODE
rg -U "sizeof\s*$\s*IR_NODE\s*$" src/compiler/

# Look for buffer size constants or defines
rg -U "#define.*BUFF|BUFFER|SIZE" src/compiler/
Length of output: 408

Script:
#!/bin/bash
# Look for IR_NODE structure definition with a simpler pattern
rg -U "struct.*IR_NODE" -A 20 src/compiler/

# Check the realloc usage more carefully
rg -U "realloc.*buffSize" -B 2 -A 2 src/compiler/
Length of output: 1111

🧹 Nitpick comments (17)

src/lexer/lexer.c (2)

23-23: Ensure consistent naming for parameters.

The new parameter "int size" is clear, but consider using a more descriptive name such as "length" or "inputLen" for clarity.

45-60: Confirm numeric token parsing logic.

Including 'x' for numeric sequences can expand the range of valid tokens. Verify that invoking "foundX" after recognizing an 'x' meets the intended design and doesn't conflict with parsing hex or other numeric formats.

src/lexer/lexer.h (1)

19-19: Synchronize documentation with new signature.

Update doc comments to mention the "size" parameter, ensuring that it's explained as the string's length or a subset length to parse.

src/compiler/utils.h (2)

13-13: Consider basic error handling for file pointer.
Before writing data, it might be safer to check if the file pointer is valid (e.g., not NULL) to avoid segmentation faults.

38-38: Clarify the origin for file seeking.
Consider specifying (in documentation or in an additional parameter) the origin (SEEK_SET, SEEK_CUR, SEEK_END) to prevent ambiguity.

src/compiler/ir.h (2)

34-35: Validate memory usage for flexible 'value' pointer.
Switching from char* to void* increases flexibility but also complexity. Ensure there's a robust plan for memory allocation, ownership, and cleanup (especially with valueSize).

54-54: Add short documentation for mainFunc.
The new mainFunc pointer is helpful to directly access the main function node. Consider adding clarifying comments (e.g., how it is set, used, or if it can be NULL).
src/parser/asts/functions.h (1)
40-40: Add docstring for parseFunctionInvoke.
All other parser functions have a docstring. This function should follow suit for consistency and clarity.
 /**
  * Parses a function invocation.
  * @param result the lexer result.
  * @param index the current index in the tokens.
  * @return the function invocation AST node.
  */
+AST_NODE* parseFunctionInvoke(struct LexerResult result, int index);
src/compiler/utils.c (1)

16-16: Optional: Remove or refine debug log.

Line 16 prints the numeric value twice in different formats. If this is a temporary debug log, consider wrapping it in a debug flag or removing it for production.

src/compiler/pe/pe.c (1)

14-70: Validate section size and alignment.

The code sets the section’s VirtualSize to 0x1000 and overall SizeOfImage to 0x2000. For larger programs, these fixed values may be too small. Consider computing the section size dynamically or adjusting SizeOfImage to accommodate bigger sections.
src/compiler/pe/format.h (1)
2-2: Correct the typo in the file documentation.

"Defintion" should be "Definition."
- * Quickfall PE Format Defintion.
+ * Quickfall PE Format Definition.
src/cli/main.c (1)

130-133: Refactor commented-out compilePE usage
The commented-out line “compilePE(fptr, code_section, sizeof(code_section))” indicates partial usage of the new PE functionality. Consider enabling or removing it after confirming the correct use case and ensuring cross-platform safety.

src/parser/asts/functions.c (3)

68-68: Remove debug print
Printing "Type: %d" in production code might clutter logs. Consider upgrading it to a more descriptive message or removing it.

115-115: Replace vague debug print
"jd" is not self-explanatory. Consider refining the debug message to clearly indicate the unexpected condition.

201-213: New parseFunctionInvoke implementation
This new function is a good addition for parsing function calls. Verify whether support for complex expressions within function arguments is required.

src/compiler/compiler.h (1)

22-24: compile now accepts FILE instead of filename*
Switching to a FILE* parameter provides more flexibility in writing the output directly to a stream. Ensure that all error paths (e.g., NULL file pointer) are covered in the implementation.
Makefile (1)
85-85: Consider keeping build clearing indication in the message

The message change from "Clearing old builds" to "Starting building logic" makes the prepare_build target's purpose less clear. Consider using a message that indicates both actions, e.g., "[INFO] Clearing old builds and starting build logic"
-	@echo [INFO] Starting building logic
+	@echo [INFO] Clearing old builds and starting build logic

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9cee137 and f99bc8a.

📒 Files selected for processing (20)

Makefile (1 hunks)
benchmarks/main.c (2 hunks)
src/cli/main.c (3 hunks)
src/compiler/compiler.c (8 hunks)
src/compiler/compiler.h (1 hunks)
src/compiler/ir.h (2 hunks)
src/compiler/pe/format.h (1 hunks)
src/compiler/pe/pe.c (1 hunks)
src/compiler/pe/pe.h (1 hunks)
src/compiler/units.h (1 hunks)
src/compiler/utils.c (1 hunks)
src/compiler/utils.h (1 hunks)
src/lexer/lexer.c (2 hunks)
src/lexer/lexer.h (1 hunks)
src/parser/ast.c (1 hunks)
src/parser/ast.h (2 hunks)
src/parser/asts/functions.c (4 hunks)
src/parser/asts/functions.h (1 hunks)
src/parser/parser.c (2 hunks)
src/utils/hashmap.c (1 hunks)

✅ Files skipped from review due to trivial changes (2)

src/utils/hashmap.c
src/compiler/units.h

🧰 Additional context used

🪛 cppcheck (2.10-2)

src/parser/asts/functions.c

[error] 193-193: Common realloc mistake

(memleakOnRealloc)

src/compiler/utils.c

[error] 41-41: printf format string requires 5 parameters but only 4 are given.

(wrongPrintfScanfArgNum)

src/compiler/compiler.c

[error] 149-149: Memory leak

(memleak)

[error] 156-156: Memory leak

(memleak)

🔇 Additional comments (19)

src/compiler/compiler.c (1)

105-105: Initialize node->valueSize carefully.

Setting "node->valueSize = tree->valueSize" can be correct, but confirm that "tree->valueSize" is always defined for the relevant ASM function AST nodes, to avoid undefined behavior.

✅ Verification successful

Let me gather more information about the AST node initialization and value size handling in the compiler code.

Let me search for more context about the AST node structure and initialization patterns.

Based on the code context gathered, I can now provide a final response:

The valueSize initialization is safe in this context.

The code shows that node->valueSize = tree->valueSize is specifically used within the AST_ASM_FUNCTION_DECLARATION case, where node is created as an IR_ASM_FUNCTION type node. The createIRNode function properly allocates the node structure, and this assignment is part of a well-defined initialization sequence for assembly function nodes, along with other required properties like value and function arguments.

🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found

Script:

#!/bin/bash
# Let's examine the context around this line and how tree->valueSize is handled
rg -B 10 -A 10 "node->valueSize = tree->valueSize" src/compiler/compiler.c

# Let's also look for where valueSize is initialized in the tree structure
rg "valueSize" src/compiler/compiler.c

# Let's check the struct definition for both node and tree
rg -B 5 -A 5 "struct\s+\w*Node" src/compiler/compiler.c

Length of output: 869

Script:

#!/bin/bash
# Look for tree structure definition
rg -B 5 -A 5 "struct\s+(AST|Tree|IR)" src/compiler/

# Look for createIRNode function implementation to understand initialization
rg -B 10 -A 10 "createIRNode" src/compiler/

# Look for where valueSize is set in the tree
rg -B 5 -A 5 "->valueSize\s*=" src/compiler/

Length of output: 7066

src/parser/ast.c (1)

16-16: Clarify initialization usage.

Initializing "node->valueSize = 0" is a solid safeguard against uninitialized data. Confirm that subsequent code updates "valueSize" appropriately for AST nodes that store data.

src/compiler/utils.h (4)

1-4: Great documentation and clear purpose.
The introductory comments are concise, clearly describing the utilities' purpose for the Quickfall Compiler.

18-18: Same file pointer validation suggestion.
As with write8, adding checks on the file pointer helps ensure robust error handling throughout these utility functions.

23-23: Same file pointer validation suggestion.
Ensuring the file pointer is valid is part of standardized, defensive programming, especially for library utilities.

33-33: Same string validation suggestion.
For writestr16, ensure that the string pointer is also validated.

src/parser/ast.h (2)

26-30: Enums for function invocation and root are well-defined.
These new enum fields expand AST capabilities to represent function invocations and function roots. This is a clear, organized approach.

48-49: Initialize or validate valueSize usage.
You introduced valueSize as an additional field in ASTNode. Ensure it’s consistently set, used, and documented for all relevant node types to avoid confusion or misuse.

✅ Verification successful

Let me gather information about the usage and initialization of valueSize in the codebase.

Let me check for the initialization and usage patterns of valueSize in more detail.

Let me check one more thing to see how valueSize is used in the IR code since we saw it's also defined there.

Let's check the actual usage in those files we found to make a final determination.

valueSize is properly initialized and used consistently

The field is properly managed throughout the codebase:

Initialized to 0 in createASTNode (src/parser/ast.c)
Set to buffer size in function AST nodes (src/parser/asts/functions.c)
Correctly propagated to IR nodes during compilation (src/compiler/compiler.c)

🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found

Script:

#!/bin/bash
# First, let's look at the complete ASTNode definition and context
rg -B5 -A10 "valueSize" src/parser/ast.h

# Then, let's search for all occurrences of valueSize across the codebase
rg "valueSize" --type cpp

# Let's also check for any constructor or initialization patterns
ast-grep --pattern 'struct ASTNode {
  $$$
}'

Length of output: 451

Script:

#!/bin/bash
# Check for ASTNode creation and initialization
rg -B2 -A5 "Creates a new AST Node" src/parser/ast.h src/parser/ast.c

# Look for valueSize assignments or usage
rg -B2 -A2 "->valueSize" src/parser/ast.c src/compiler/ir.c

# Check if there's any documentation about valueSize in comments
rg -B2 -A2 "valueSize" src/parser/ast.c

Length of output: 1264

Script:

#!/bin/bash
# Check IR code for valueSize usage and initialization
rg -B5 -A5 "valueSize" src/compiler/ir.h src/compiler/ir.c

# Look for any value assignments to node->valueSize
rg -l "node.*valueSize.*="

Length of output: 513

Script:

#!/bin/bash
# Check the actual valueSize assignments in the identified files
rg -B2 -A2 "node.*valueSize.*=" src/compiler/compiler.c src/parser/ast.c src/parser/asts/functions.c

# Also check for any value-related functions that might use valueSize
rg -B2 -A2 "value.*Size" src/parser/asts/

Length of output: 1072

src/parser/parser.c (3)

62-71: Ensure correct handling of function invocations for KEYWORD tokens.

The new KEYWORD case checks if the next token is PAREN_OPEN, then calls parseFunctionInvoke. This is a valid approach for capturing function invocations such as "returnValue(...)". Verify that all KEYWORD tokens that might require different handling (e.g., conditionals or loops) are accounted for elsewhere or are not impacted by this logic.

37-37: Validate skipping the FUNCTION token.

Parsing begins at index + 1, effectively skipping the FUNCTION token. Confirm that this is intentional and that no relevant token logic is missed.

✅ Verification successful

Let me gather information about the function parsing logic to verify if skipping the FUNCTION token is intentional.

Let me check the token types and function parsing implementation more thoroughly.

From the gathered context, I can now generate a final response as we have enough information about the function parsing implementation.

Skipping the FUNCTION token is correct and intentional.

The implementation shows that:

The lexer identifies 'func' keyword and creates a FUNCTION token
When the parser encounters a FUNCTION token, it passes index + 1 to parseFunctionDeclaration
Inside parseFunctionDeclaration, it expects a KEYWORD token at the passed index, which would be the function name
This design correctly separates the function keyword from its name in the parsing process

🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found

Script:

#!/bin/bash
# Let's look at the parser implementation and surrounding context
rg -B 5 -A 10 "parseFunctionDeclaration" src/

Length of output: 2402

Script:

#!/bin/bash
# Let's look at the token types and lexer implementation
rg -B 2 -A 2 "FUNCTION" src/

# Also check for any test cases related to function parsing
rg -B 2 -A 2 "test.*function" src/

Length of output: 5870

30-34: Consider implications of early return.

When encountering BRACKETS_CLOSE while parsing an AST_FUNCTION_ROOT, we immediately set endingIndex and return. This approach is correct if the function definition always terminates on the next BRACKETS_CLOSE token. However, ensure you’ve accounted for all possible nested structures or tokens that may appear before the function truly ends.

✅ Verification successful

Let me gather more information about the parser implementation and token handling to verify this concern.

Let me check how function declarations are parsed and how nested structures are handled.

Let me check the AST type definitions and parser implementation to understand how nested structures are handled.

Based on the gathered information, I can now provide a final response:

Early return is correctly implemented for function body parsing

The implementation is correct because:

parseNodes is called with AST_FUNCTION_ROOT type specifically for function bodies
The AST structure shows AST_FUNCTION_ROOT is a distinct type meant for function body blocks
The parser correctly handles nested structures by:
- Using a recursive descent approach where parseFunctionDeclaration calls parseNodes for the body
- Processing all tokens sequentially until a matching BRACKETS_CLOSE is found
- Maintaining proper scope through the AST node structure

🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found

Script:

#!/bin/bash
# Let's first look at the parser implementation to understand the token handling
rg -A 10 "case BRACKETS_CLOSE:" src/parser/parser.c

# Also check for any nested structure handling
rg "AST_FUNCTION_ROOT" src/parser/parser.c

# Look for token definitions to understand BRACKETS_CLOSE
rg "BRACKETS_CLOSE" src/

Length of output: 648

Script:

#!/bin/bash
# Check the function declaration parsing implementation
cat src/parser/asts/functions.c

# Look for any AST node type definitions
rg -l "enum.*AST" src/

# Check if there are any test cases for nested functions
rg -l "test.*function" src/tests/

Length of output: 4681

Script:

#!/bin/bash
# Check AST type definitions
cat src/parser/ast.h

# Look at the main parser implementation
cat src/parser/parser.h

# Check the parseNodes implementation which is called for function body
rg -A 10 "parseNodes" src/parser/parser.c

Length of output: 1982

src/cli/main.c (5)

12-12: Include <stdint.h> for fixed-width integer types
Including <stdint.h> ensures consistent behavior of fixed-width types and is a good practice for cross-platform development.

116-116: Validate the size of bytes read
Assigning the result of fread back into size ensures that the true number of bytes read is tracked. This is a good approach to avoid processing uninitialized data.

120-120: Passing correct size to the lexer
The lexer is now receiving the exact data size. This prevents out-of-bounds reads. No further action is needed.

28-30: New global code_section array
These hardcoded bytes appear to be machine instructions. Verify that they match the intended architecture and are only used in the correct context (e.g., Windows PE).

✅ Verification successful

Let me gather more context about how these machine code bytes are being used by examining the surrounding code and related files.

Let me check the specific usage of this code_section array to understand its purpose and context better.

Based on the gathered context, I can now provide a final response about the machine code bytes:

Machine code appears to be Windows x64 PE-specific but is currently unused
The bytes 0xB8, 0x4C, 0x00, 0x00, 0x00, 0xC3 represent x64 assembly (mov eax, 76; ret) and while the codebase has PE file generation support for Windows x64 (as seen in pe.c), this code_section is currently commented out in its usage (//compilePE(fptr, code_section, sizeof(code_section));). Since the code is not actively used and the PE generation is properly configured for x64 Windows executables, there are no immediate concerns.

🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found

Script:

#!/bin/bash
# Check the content of the main.c file to understand the context
rg -B 5 -A 10 "code_section\[" src/cli/main.c

# Look for other occurrences of this byte sequence
rg -l "0xB8.*0x4C.*0x00.*0x00.*0x00.*0xC3"

# Find any architecture-specific code or platform checks
rg -i "architecture|x86|amd64|windows|platform"

Length of output: 1449

Script:

#!/bin/bash
# Look for where code_section is used
rg -B 5 -A 15 "code_section" --type c

# Check PE-related code since we see Windows references
rg -B 5 -A 10 "pe\.c" --type c

# Look for any assembly or machine code related functions
rg -B 3 -A 3 "0x[0-9A-F]{2}" --type c

Length of output: 6020

20-20: Importing PE-related functions
Including "../compiler/pe/pe.h" aligns with the new PE compilation feature. Ensure that the introduced functionality is conditionally invoked only where necessary to avoid potential cross-platform issues.

✅ Verification successful

Let me verify the concern about conditional invocation of PE-related functionality by examining the codebase.

This will help us understand:

If PE-related code is properly guarded with platform-specific conditionals
Where PE-related functions are being used in main.c
How the PE header file itself handles platform specificity

Let me check how this PE functionality is used in the actual implementation.

This will help us understand:

How PE compilation is integrated into the main flow
If there are platform-specific build configurations
How the PE compiler implementation handles platform specificity

Let me check one final thing to see how the PE compilation is invoked in the main flow.

This will help us understand:

The context in which PE compilation is triggered
If there are any platform-specific guards in the PE format definitions

PE compilation is currently commented out and properly isolated

The PE compilation functionality is already properly isolated:

The PE-related code is commented out in main.c (//compilePE(fptr, code_section, sizeof(code_section));)
The PE header files have proper include guards (#ifndef COMPILER_PE)
The implementation is self-contained in the PE-specific directory

No further action is needed as the code already follows good practices for cross-platform compatibility.

🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found

Script:

#!/bin/bash
# Check for PE-related conditional compilation directives
rg -B2 -A2 "#if.*WIN|#ifdef.*WIN" src/cli/main.c

# Look for PE-related function calls
rg "pe_" src/cli/main.c

# Check the PE header file for any platform-specific guards
cat src/compiler/pe/pe.h

Length of output: 337

Script:

#!/bin/bash
# Check the main function implementation for PE-related logic
rg -A 10 "int main" src/cli/main.c

# Look for platform-specific compilation flags in build files
fd -e mk -e makefile -e cmake . -x cat {}

# Check PE compiler implementation
cat src/compiler/pe/pe.c

Length of output: 2830

Script:

#!/bin/bash
# Search for compile-related code in main.c to see how PE compilation is triggered
rg -B5 -A10 "compilePE" src/cli/main.c

# Check if there are any platform-specific guards in the format header
cat src/compiler/pe/format.h

Length of output: 2851

src/parser/asts/functions.c (2)

183-183: Rejecting non-hex data
The updated check for t.type != NUMBER ensures only numeric tokens are accepted. Verify that the compiler supports the correct numeric formats (e.g., hex, decimal).

65-66: Avoid silent skip for PAREN_OPEN
Continuing on PAREN_OPEN without further logic might mask potential parsing errors. Confirm if this is intentional (e.g., nested parentheses).

✅ Verification successful

Let me analyze the code context around this PAREN_OPEN handling to understand its purpose and potential implications.

Let me try to find the token definitions and examine the function parsing logic more thoroughly.

After analyzing the code context, I can now provide a final response:

PAREN_OPEN skip is correct in parameter parsing context
The continue on PAREN_OPEN is intentional and correct in this context. This is part of the parseParameters function which processes function parameter declarations. The opening parenthesis is just a delimiter for the parameter list, and skipping it is the right behavior as the actual parameter parsing logic handles the content between parentheses. The error handling is properly implemented through other cases, including the default case that returns NULL for unexpected tokens.

🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found

Script:

#!/bin/bash
# Let's look at the surrounding context of this PAREN_OPEN handling
rg -B 10 -A 10 "case PAREN_OPEN:" src/parser/asts/functions.c

Length of output: 909

Script:

#!/bin/bash
# Let's also check if there are other PAREN_OPEN handlers in the codebase for comparison
rg "PAREN_OPEN" src/parser/

Length of output: 196

Script:

#!/bin/bash
# Let's examine the token definitions and related enums
fd "token" -e h -e c src/parser/ --exec cat {}

Length of output: 48

Script:

#!/bin/bash
# Let's find and examine token-related header files
fd -e h . --exec cat {} 2>/dev/null

Length of output: 10469

Script:

#!/bin/bash
# Let's examine the entire functions.c file for better context
cat src/parser/asts/functions.c

Length of output: 4495

Script:

#!/bin/bash
# Let's also check for any test files that might show the expected behavior
fd -e test -e spec src/parser/ --exec cat {}

Length of output: 398

benchmarks/main.c (1)

126-126: Passing size to runLexer
Ensuring runLexer processes only valid buffer contents helps avoid off-by-one or out-of-bounds conditions.

coderabbitai · 2024-12-23T21:57:13Z

src/lexer/lexer.c

+				++i;
+				c = string[i];


💡 Codebase verification

Now I have enough information to provide a final response based on the code analysis:

Buffer overflow and loop boundary issues detected

The buff is allocated with size buffLen = 32 but there are no bounds checks in any of the loops:

Number parsing loop can overflow if input > 32 chars

String parsing loop can overflow if string > 32 chars

Keyword parsing loop can overflow if identifier > 32 chars

The string parsing loop has a potential infinite loop issue - it only checks for a closing quote (c != '"') without checking for end of input, which could lead to buffer overflow and out-of-bounds access if the string is not properly terminated.

The increment of i inside loops combined with the outer loop's ++i can cause skipping characters in certain cases, particularly after processing keywords where there's a manual decrement (--i) to compensate.

🔗 Analysis chain

Validate string & keyword loops.

Carefully check that incrementing "i" inside these loops doesn't overshoot the main for loop in certain edge cases.

The null terminator appending is correct, but consider verifying that "buff" stays within bounds when processing very long strings or identifiers.

Also applies to: 84-88, 91-92

Let me analyze the lexer code to verify these concerns about loop bounds and buffer safety.

Let me try different commands to better understand the code structure.

🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found

Script:

#!/bin/bash # Let's examine the lexer.c file to understand the loop structure and buffer handling rg -A 10 -B 10 "++i;" src/lexer/lexer.c

Length of output: 121

Script:

#!/bin/bash # Look for buffer size definitions and declarations rg "buff\[" src/lexer/lexer.c

Length of output: 176

Script:

#!/bin/bash # Check for any buffer size constants or definitions rg "#define.*BUFF" src/lexer/lexer.c

Length of output: 38

Script:

#!/bin/bash # Look for the main loop structure ast-grep --pattern 'for ($init; $cond; $update) { $$$ }' src/lexer/lexer.c

Length of output: 76

Script:

#!/bin/bash # Let's examine the lexer.c file content to understand the loop structure cat src/lexer/lexer.c

Length of output: 3021

Script:

#!/bin/bash # Look for buffer declarations to understand its size rg "char.*buff" src/lexer/lexer.c

Length of output: 67

Script:

#!/bin/bash # Search for string and keyword related loops with proper escaping rg "while.*\(.*c.*==.*" src/lexer/lexer.c

Length of output: 133

coderabbitai · 2024-12-23T21:57:14Z

src/lexer/lexer.c

+	for(int i = 0; i < size; ++i) {
+
+		c = string[i];

 		int buffLen = 32;
 		char* buff = malloc(buffLen);
+		buff[0] = '\0';

 		if(c == ' ' || c == '\t' || c == '\n') {
 			continue;
-		} else if (isdigit(c)) {
+		} 
+		else if(c == '\0') {
+			break;
+		}
+		else if (isdigit(c)) {
 			int numLen = 0;

-			while(isdigit(c) || c == 'x') {
+			int foundX = 0;
+			while(isdigit(c) || c == 'x' || (foundX == 1 && c != '\0' && c != ' ' && c != '\n')) {
 				buff[numLen] = c;
 				numLen++;

-				c = *string++;
+				if(c == 'x') foundX = 1;
+
+				++i;
+				c = string[i];
 			}

+			buff[numLen] = '\0';
+


💡 Codebase verification

Based on the gathered information, I can now provide the final response:

Buffer overflow risk confirmed - action required

The code has multiple issues that need to be addressed:

Fixed buffer size of 32 bytes (buffLen = 32) with no overflow checks when writing to buff using numLen as index

No dynamic buffer resizing logic in the lexer while processing numeric tokens

Token array is pre-allocated with fixed size (malloc(sizeof(struct Token) * 1024)

No bounds checking between numLen and buffLen in the numeric token processing loop

Required changes:

Add bounds checking: if (numLen >= buffLen - 1) { /* handle overflow */ }

Implement buffer resizing when needed: buff = realloc(buff, buffLen * 2)

Consider using a dynamic array for tokens instead of fixed 1024 size

🔗 Analysis chain

Validate index ranges & handle dynamic buffer growth.

The for loop correctly bounds iteration by "size," which prevents reading beyond the input buffer. Ensure that the calling code passes a valid size.

"buff" is allocated with a fixed size of 32; consider dynamic resizing if a token can exceed this length.

Let me verify the concerns by analyzing the codebase further.

The scripts will help verify:

How the lexer is called and if size validation exists

If there's any buffer overflow protection or dynamic resizing

If token length is checked against buffer size

Let me analyze the results to provide a concrete verification of the concerns.

Let me try to find more specific information about the lexer implementation and its usage.

This will help us understand:

The complete lexer function signature and context

Token structure and any size constraints

Any existing buffer size validations

Where tokens are used, to assess potential buffer overflow risks

🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found

Script:

#!/bin/bash # Search for the function/method that calls this lexer code ast-grep --pattern 'lex($$$)' # Search for any buffer overflow checks or dynamic buffer resizing logic rg -A 5 "malloc.*buff" rg -A 5 "realloc.*buff" # Look for token length validation rg "numLen.*buff"

Length of output: 1199

Script:

#!/bin/bash # Find the lexer function definition and its context rg -B 5 -A 10 "struct Token\* lex" # Look for token struct definition to understand size constraints rg -B 2 -A 5 "struct Token" # Search for buffer size checks in lexer rg -B 2 -A 5 "buffLen|numLen" src/lexer/ # Find where tokens are processed/consumed rg -l "Token.*tokens"

Length of output: 4538

coderabbitai · 2024-12-23T21:57:14Z

src/compiler/compiler.c

+	if(hashGet(ctx->nodeMap, h) == NULL) {
+		printf("Error: the main function wasn't defined!\n");
+		return;
+	}


⚠️ Potential issue

Fix memory leaks on early returns.

When returning at lines 149 or 157, the allocated buffer "buff" is never freed.

Apply this minimal patch to free memory before returning:

if(hashGet(ctx->nodeMap, h) == NULL) { printf("Error: the main function wasn't defined!\n"); + free(buff); return; } if(node->nodeType != IR_FUNCTION) { printf("Error: main must be a function!\n"); + free(buff); return; }

Also applies to: 155-157

🧰 Tools

🪛 cppcheck (2.10-2)

[error] 149-149: Memory leak

(memleak)

coderabbitai · 2024-12-23T21:57:14Z

src/compiler/compiler.c

+	//todo: change format based on user
+	compilePE(out, buff, i);


🛠️ Refactor suggestion

Deallocate "buff" after compiling.

Add a free call to avoid leaking memory in the normal completion path:

compilePE(out, buff, i); +free(buff); }

Committable suggestion skipped: line range outside the PR's diff.

coderabbitai · 2024-12-23T21:57:14Z

src/compiler/pe/pe.h

+/**
+ * Compiles into PE format.
+ */
+void compilePE(FILE* fptr, uint8_t program[], int programSize);


💡 Codebase verification

I/O error handling is missing in PE compilation process

The code review reveals several portability and reliability concerns:

The compilePE function performs multiple fwrite operations without checking their return values

The calling code in compiler.c doesn't handle potential I/O errors from compilePE

The function closes the file handle internally, which should be the caller's responsibility

No error status is returned to indicate success/failure of the PE compilation

Key locations that need attention:

src/compiler/pe/pe.c: Add error checking for all fwrite operations and return status

src/compiler/pe/pe.h: Change signature to return an error status

src/compiler/compiler.c: Add error handling for compilePE call

Remove fclose from compilePE as it violates the single responsibility principle

🔗 Analysis chain

Confirm portability of the PE-building API.

The "compilePE" function signature is straightforward; confirm that calling code properly checks return values or handles potential I/O errors.

Let me search for the usage of compilePE function to verify how it's being called and if error handling is implemented properly.

Let me check the implementation and error handling around the compilePE function calls.

🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found

Script:

#!/bin/bash # Search for the implementation of compilePE function ast-grep --pattern 'void compilePE(FILE* $_, uint8_t $_, int $_) { $$$ }' # Search for calls to compilePE rg "compilePE\(" -A 3

Length of output: 735

Script:

#!/bin/bash # Get the context of compilePE implementation cat src/compiler/pe/pe.c # Check error handling in compiler.c where compilePE is called cat src/compiler/compiler.c

Length of output: 6566

coderabbitai · 2024-12-23T21:57:14Z

src/parser/asts/functions.c

+	buff = realloc(buff, sizeof(uint8_t) * buffIndex);
+
+	node->valueSize = buffIndex;
+	node->value = buff;
+


⚠️ Potential issue

Handle realloc carefully to avoid memory leaks
Realloc can fail, returning NULL and discarding the original pointer if not handled properly. Consider using a temporary pointer:

uint8_t* tmp = realloc(buff, sizeof(uint8_t) * buffIndex); -if(tmp == NULL) { - return NULL; // Handle error -} -buff = tmp; node->valueSize = buffIndex; node->value = buff;

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools

🪛 cppcheck (2.10-2)

[error] 193-193: Common realloc mistake

(memleakOnRealloc)

coderabbitai · 2024-12-23T21:57:15Z

src/parser/asts/functions.c

 			return NULL;
 		}

 		buff[buffIndex] = strtol(t.value, NULL, 16);
+		buffIndex++;


💡 Codebase verification

Buffer overflow possible - bounds check missing
The code allocates a fixed buffer of 32 bytes but increments buffIndex without checking against buffSize. This can lead to a buffer overflow when processing more than 32 numbers. Add a bounds check before incrementing buffIndex.

In src/parser/asts/functions.c, add a check before writing to buff[buffIndex]:

if (buffIndex >= buffSize) { // Either reallocate with larger size or handle error buffSize *= 2; buff = realloc(buff, sizeof(uint8_t) * buffSize); if (!buff) return NULL; } buff[buffIndex] = strtol(t.value, NULL, 16); buffIndex++;

🔗 Analysis chain

Increment buffIndex carefully
Ensure buffIndex never exceeds buffSize. Otherwise, the memory write might go out of bounds.

Let me analyze the code around this line to verify the bounds checking.

Let me check how buff is allocated and if there's a buffSize variable to verify bounds checking.

🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found

Script:

#!/bin/bash # Get the context around buffIndex increment rg -B 10 -A 5 "buffIndex\+\+" src/parser/asts/functions.c

Length of output: 301

Script:

#!/bin/bash # Get the context around buff allocation and buffSize rg -B 5 -A 15 "buff\s*=\s*malloc" src/parser/asts/functions.c

Length of output: 434

coderabbitai · 2024-12-23T21:57:15Z

src/parser/asts/functions.c

+	node->right = parseNodes(result, params->endingIndex, AST_FUNCTION_ROOT);
+
+	node->endingIndex = node->right->endingIndex;


💡 Codebase verification

Based on the code analysis, I can now provide a final response:

Add NULL check for parseNodes return value
The parseNodes function can return NULL in various scenarios as seen in other parsing functions in the codebase. The current code assumes node->right is always valid when assigning node->endingIndex = node->right->endingIndex. Add a NULL check before accessing node->right to prevent potential crashes:

node->right = parseNodes(result, params->endingIndex, AST_FUNCTION_ROOT); if (node->right == NULL) return NULL; node->endingIndex = node->right->endingIndex;

🔗 Analysis chain

Ensure node->right and node->endingIndex are valid
Assuming node->right and node->right->endingIndex are non-null can cause crashes if parseNodes fails. Consider adding safety checks before assigning node->endingIndex.

Let me analyze the parseNodes function and its error handling to verify this concern.

Let me check the implementation of parseNodes and its error handling to better understand if it can return NULL.

Let me try a different approach to find the error handling in the parser.

🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found

Script:

#!/bin/bash # Search for parseNodes function definition and its return value handling ast-grep --pattern 'ASTNode* parseNodes($$$)' # Also search for any NULL checks on parseNodes result rg "parseNodes\(" -A 5 -B 5

Length of output: 3584

Script:

#!/bin/bash # Look for NULL returns in parseNodes implementation rg "return NULL" -B 10 src/parser/parser.c # Look for the createASTNode function since it's used in parseNodes ast-grep --pattern 'AST_NODE* createASTNode($$$)'

Length of output: 96

Script:

#!/bin/bash # Search for createASTNode definition and usage rg "createASTNode" -A 5 -B 5 # Look for error handling patterns in functions.c rg "return NULL|return node" -B 10 src/parser/asts/functions.c

Length of output: 12766

coderabbitai · 2024-12-23T21:57:15Z

benchmarks/main.c

+	fptr = fopen("output.txt", "w");

-        endTimer(i, 3);
-
-        startTimer();
-
-        fptr = fopen("output.txt", "w");
-	fprintf(fptr, compiled);
-	fclose(fptr);
+        IR_CTX* ctx = makeContext(node);
+	compile(ctx, fptr);

-        endTimer(i, 4);
+	fclose(fptr);

-	free(compiled);
+        endTimer(i, 3);


💡 Codebase verification

Add error handling for IR context creation and file operations
The codebase shows inconsistent error handling patterns. While src/cli/main.c properly checks for both file operations (fopen) and IR context creation (makeContext), the benchmark code lacks these safety checks. Add the following error handling:

Check if fopen returns NULL for the output file

Verify if makeContext returns NULL, as seen in tests/ir.c and src/cli/main.c

🔗 Analysis chain

Consolidate compilation steps
Opening the file, creating the IR context, compiling, and closing the file are well-structured. Double-check error handling (e.g., verify IR_CTX != NULL) for production readiness.

Let me gather information about error handling in the codebase, particularly around IR_CTX creation and file operations.

🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found

Script:

#!/bin/bash # Check IR_CTX creation and error handling patterns ast-grep --pattern 'makeContext($_)' # Look for error handling patterns around fopen rg "fopen.*\(" -A 3 # Check IR_CTX struct definition and related error handling rg "IR_CTX" -A 3

Length of output: 2457

coderabbitai · 2024-12-23T21:57:15Z

Makefile

+clean:
 	$(RM) build
 	$(RM) $(TARGET)
 	$(RM) $(TEST_TARGET)
 	$(RM) $(BENCH_TARGET)


🛠️ Refactor suggestion

Add .PHONY declaration for the clean target

The clean target should be declared as .PHONY to ensure it runs even if a file named 'clean' exists in the directory. This is a common Makefile best practice.

Add this line at the beginning of the Makefile:

+.PHONY: clean

Committable suggestion skipped: line range outside the PR's diff.

Zffu and others added 30 commits December 19, 2024 11:57

feat: added compiler utils header

f04018a

feat: added write8, write16 and write32 into compiler utils header

8228e15

chore: added clean target

09f6fb2

feat: added docs for previously added functions

19fa7f2

feat: added writestr8 and writestr16 functions in header

5c6553b

feat: added seek in header

b117b32

feat: added align_to in header

a659532

feat: added implementations of compiler utils

0a991ed

fix: added missing includes

a2ebf86

fix: misc impl errors

d80c897

feat: started working on compiler win header

0e0f010

feat: added win compiling constants

b1bbe29

feat: added writWinExecutableHeader, writeWinPESignature, writeWinCof…

56e1f10

…fHeader, writeWinSTDFields, writeWinSpecificFields and writeWinSection

feat: started working on the win writing functions implementations (w…

9187c95

…riteWinExecutableHeader)

fix: fixed header constants names

3939f06

feat: added writeWinExecutable in header

aef6ef9

feat: added two implementations of win comp header

537d402

refactor: changed functions to pass calculated vars

914f9d6

feat: writeCoffHeader impl

2834230

feat: added writeWinSTDFields

0dad446

feat: added win specific fields

edfed2a

feat: added windows data fields

290f135

feat: added section headers

a3b33d7

tables

8a995c5

feat: added win .idata tables sizes

b5d4d0c

feat: added import table for win

959d83d

feat: added win hint table

b04bf02

feat: started working on passing imports size and imports in paramete…

d44e242

…rs of writeWinExecutable

feat: made num_imports change behaviors

da0ee1f

fix: typo at WIN_NAME_TABLE_ENTRY_SZ

62fbc62

Zffu added 20 commits December 22, 2024 14:18

chore: removed old not working windows pe format

e07d3e7

feat: added compiler PE format datatypes

720c21c

refactor: moved datatypes into def

0d108a5

fix: fixed WORD and added PE_DOS_HEADER

7b4a80a

feat: added PE_HEADER_SIGNATURE

07c367e

refactor: started moving pe format around

a263300

feat: added units in header file

0883149

refactor: finished moving pe format around

5c07010

fix: struct def

acfad1f

feat: added PE_HEADER

f35c273

feat: added PE_SECTION_HEADER

aba326a

fix: types

ad8c1fb

feat: added new structs

4de307e

feat: added potentially working PE compiling

445a236

feat: it works WWW

c9f1567

feat: made .text section the same size as the program instead of 0x200

e26945c

feat: added function invoke parsing

2e4d9c7

fix: lexer fixes

c06611b

fix: final fixes

fb81854

fix: final fixes

f99bc8a

Zffu added compiler Signifies a compiler change. experiment Includes an experimental feature ast Signifies a AST / parser change. labels Dec 23, 2024

Zffu self-assigned this Dec 23, 2024

Zffu requested a review from thealternatedev as a code owner December 23, 2024 21:52

coderabbitai bot reviewed Dec 23, 2024

View reviewed changes

Zffu merged commit 25d1634 into master Dec 23, 2024
2 of 3 checks passed

Zffu linked an issue Dec 24, 2024 that may be closed by this pull request

Compiler v3 #38

Closed

Zffu deleted the comp branch January 12, 2025 23:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Compiler V3 #39

Compiler V3 #39

Zffu commented Dec 23, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 23, 2024 •

edited

Loading

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

coderabbitai bot left a comment

coderabbitai bot Dec 23, 2024

coderabbitai bot Dec 23, 2024

coderabbitai bot Dec 23, 2024

coderabbitai bot Dec 23, 2024

coderabbitai bot Dec 23, 2024

coderabbitai bot Dec 23, 2024

coderabbitai bot Dec 23, 2024

coderabbitai bot Dec 23, 2024

coderabbitai bot Dec 23, 2024

coderabbitai bot Dec 23, 2024

		node->right = parseNodes(result, params->endingIndex, AST_FUNCTION_ROOT);

		node->endingIndex = node->right->endingIndex;

Compiler V3 #39

Compiler V3 #39

Conversation

Zffu commented Dec 23, 2024 • edited by coderabbitai bot Loading

Summary by CodeRabbit

coderabbitai bot commented Dec 23, 2024 • edited Loading

Walkthrough

Changes

Sequence Diagram

Possibly Related PRs

Poem

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot Dec 23, 2024

Choose a reason for hiding this comment

coderabbitai bot Dec 23, 2024

Choose a reason for hiding this comment

coderabbitai bot Dec 23, 2024

Choose a reason for hiding this comment

coderabbitai bot Dec 23, 2024

Choose a reason for hiding this comment

coderabbitai bot Dec 23, 2024

Choose a reason for hiding this comment

coderabbitai bot Dec 23, 2024

Choose a reason for hiding this comment

coderabbitai bot Dec 23, 2024

Choose a reason for hiding this comment

coderabbitai bot Dec 23, 2024

Choose a reason for hiding this comment

coderabbitai bot Dec 23, 2024

Choose a reason for hiding this comment

coderabbitai bot Dec 23, 2024

Choose a reason for hiding this comment

Zffu commented Dec 23, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 23, 2024 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)