Suspicious

Catching bugs in code with AI, fully local CLI app. No data leaves your computer.

GitHub PyPi

🤔 Overview • 🪄 Demos • 🔧 Installation • 💻 Usage • 🧠 How it works


Overview

This is a CLI application that analyzes a source code file using an AI model. It then shows you parts that look suspicious to it.

It does not use rules or static analysis the way a linter tool would. Instead, the model generates its own code suggestions based on the surrounding context. Check out how it works.

NB: All processing is done on your hardware and no data is transmitted to the Internet.

Example output:

example results

Demo

Here's the output of running the application on its own source files (so meta).

Have I seen this before?

There was this post on Hacker News, "AI found a bug in my code", which was pretty cool. I wanted to try it on my own code, so I went ahead and built my own implementation of the idea.

Installation

You can install sus via pip or from source.

Pip (MacOS, Linux, Windows)

pip3 install suspicious

From source

git clone git@github.com:sturdy-dev/suspicious.git
cd suspicious
python -m pip install .

Usage

You can run the program like this:

sus /path/to/file.py

Note that the first time you run this, the application needs to download a model (~500 MB); see the Model section for more info.

This will generate and open an .html file with the results.

  • grey means prediction is the same as the original
  • light grey means the model had a different prediction but with super low confidence
  • light red means things are looking a little sus
  • red means there was a different prediction and confidence was higher
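A rough sketch of how the legend above could map onto code. The function name and the two thresholds (0.5 and 0.9) are made up for illustration; the cut-offs sus actually uses aren't stated here.

```python
def highlight_class(original: str, predicted: str, probability: float,
                    low: float = 0.5, high: float = 0.9) -> str:
    """Pick a highlight colour for one token.

    `low` and `high` are hypothetical confidence thresholds, not values
    taken from sus itself.
    """
    if predicted == original:
        return "grey"        # model agrees with what was written
    if probability < low:
        return "light grey"  # disagrees, but with very low confidence
    if probability < high:
        return "light red"   # disagrees with moderate confidence
    return "red"             # disagrees with high confidence


if __name__ == "__main__":
    print(highlight_class("foo", "bar", 0.92))
```

Only tokens where the prediction differs can end up red; a confident prediction that matches the original stays grey.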

Practical usage

Unclear. You run sus on a file and skim over the red stuff; maybe it spots something you missed. Ping me on Twitter if you catch something cool with it.

How does it work?

In a nutshell, it feeds a tokenized representation of your source text into a Transformer model and asks the model to predict one token at a time using Masked Language Modelling.

For a general overview of Transformer models, check out The Illustrated Transformer article by Jay Alammar, which helped me understand the core ideas.

sus uses a model called UniXcoder, which has been trained on the CodeSearchNet dataset. To do masked language modelling (MLM), we add an lm_head layer.

When sus processes your code, it first tokenizes the text. A token can be a special character, a programming language keyword, an English word, or part of a word.

Before feeding the sequence of token ids to the model, one or multiple tokens are replaced with a special <mask> token. After feeding the input through the network, we extract just the value at the masked location. This masking is done in a loop for each token to generate individual predictions.
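To make "extract just the value at the masked location" a bit more concrete, here is a minimal sketch in plain Python. It assumes the network returns one row of vocabulary scores (logits) per input position, which is the usual shape for an MLM head; `softmax` and `predict_at` are hypothetical helper names, not functions from sus.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_at(logits_per_position: Sequence[Sequence[float]],
               masked_position: int) -> Tuple[int, float]:
    """Return (predicted token id, probability) at one masked position."""
    probs = softmax(logits_per_position[masked_position])
    best_id = max(range(len(probs)), key=probs.__getitem__)
    return best_id, probs[best_id]


if __name__ == "__main__":
    # Toy "model output": 2 positions, vocabulary of 3 tokens.
    toy_logits = [[0.0, 0.0, 0.0], [1.0, 3.0, 0.5]]
    print(predict_at(toy_logits, masked_position=1))
```

The predicted id would then be decoded back to text by the tokenizer, and the probability is the model's confidence in that prediction.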

Masking one token at a time is impractically slow, so sus instead masks 10% of the tokens per pass, making sure that the masked locations are spread out (so that there is sufficient context around each prediction site).
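One simple way to get evenly spread mask positions is a fixed stride: with a 10% mask rate, pass 0 masks tokens 0, 10, 20, ..., pass 1 masks tokens 1, 11, 21, ..., and after 10 passes every token has been masked exactly once. This is only a sketch of that idea; the function names are hypothetical and sus may pick positions differently.

```python
from typing import Iterator, List, Sequence


def masking_passes(num_tokens: int, mask_fraction: float = 0.1) -> Iterator[List[int]]:
    """Yield the token positions to mask in each pass.

    Positions within a pass are `stride` tokens apart, so each masked
    token keeps unmasked neighbours as context.
    """
    if not 0 < mask_fraction <= 1:
        raise ValueError("mask_fraction must be in (0, 1]")
    stride = max(1, round(1 / mask_fraction))
    for offset in range(min(stride, num_tokens)):
        yield list(range(offset, num_tokens, stride))


def apply_mask(tokens: Sequence[str], positions: Sequence[int],
               mask_token: str = "<mask>") -> List[str]:
    """Return a copy of `tokens` with the given positions masked."""
    masked = list(tokens)
    for pos in positions:
        masked[pos] = mask_token
    return masked


if __name__ == "__main__":
    tokens = [f"t{i}" for i in range(25)]
    for positions in masking_passes(len(tokens)):
        print(apply_mask(tokens, positions))
```

Compared to masking one token per pass, this cuts the number of forward passes from one per token to roughly 1 / mask_fraction.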

The output of this entire process is a list of structs that contain the original and predicted values for each token. Example:

{
    "idx": 0, // position in sequence
    "original": "foo", // as originally written in the source file
    "predicted": "bar", // what the model predicted
    "cosine_similarity": 0.23, // how similar the prediction is to the original in the vector space
    "probability": 0.92, // how confident the model is in its prediction
}
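For reference, cosine similarity itself is easy to compute. The sketch below assumes the original and predicted tokens are each available as an embedding vector (how sus obtains those vectors isn't covered here) and uses a hypothetical helper name.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors, in the range [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: zero vector matches nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
```

Values near 1 mean the two vectors point in almost the same direction; values near 0 mean they have little in common.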

This is then fed into an html template to be rendered for the user. Easy-peasy.

Model

sus uses the decoder of UniXcoder, specifically the unixcoder-base-nine checkpoint. What's cool is that it's only 500 MB and ~120M parameters, which means it's quick to download and fast enough to run locally.

Larger models produce higher-quality output, but they typically need a server to run inference.

Supported languages

You can try sus on any source file, but you can expect best results with the following languages:

  • java
  • ruby
  • python
  • php
  • javascript
  • go
  • c
  • c++
  • c#

Bugs and limitations

  • Accuracy: sus is meant to be executed locally (aka not sending code to a server), which puts some constraints on the AI model size. Larger models will produce higher-quality results, but they can be tens of GB in size and, without a beefy GPU, could take a long time to generate the output. Because of this, sus uses a modestly sized model.
  • Large files: The model also puts constraints on the input size (analyzed file size). sus works around this by batching the input, but as a result, batches are not aware of the 'context' / code that is in other batches. Files are split into batches of 2500 characters, which is super crude and is meant to correspond to ~1024 tokens.
  • Masking: This is done on a per-token basis. It could be interesting to first generate a syntax tree from the code and then mask an entire node instead.
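The fixed-size batching described under large files can be sketched in a few lines. The function name is hypothetical and this is not lifted from the sus source; it just shows why a batch can start or end mid-statement and know nothing about its neighbours.

```python
from typing import List


def split_into_batches(source: str, batch_size: int = 2500) -> List[str]:
    """Cut `source` into consecutive chunks of at most `batch_size` characters."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [source[i:i + batch_size] for i in range(0, len(source), batch_size)]


if __name__ == "__main__":
    text = "x = 1\n" * 1000  # 6000 characters
    print([len(batch) for batch in split_into_batches(text)])
```

Each chunk is analyzed on its own, so a variable defined in one chunk is invisible to predictions made in the next.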

License

Suspicious is distributed under AGPL-3.0-only. For Apache-2.0 exceptions, contact [email protected]