Using LLMs to reverse JavaScript variable name minification

This blog introduces a novel way to reverse minified Javascript using large language models (LLMs) like ChatGPT and llama2 while keeping the code semantically intact. The code is open source and available at Github project Humanify.

What is minification?

Minification is a process of reducing the size of a Javascript file in order to optimize for fast network transfer. Looking from the reverse engineering standpoint, there are a few different categories of minification that present increasing challenges:

Lossless minification

Most minification is lossless; There's no data lost when true is converted to its minified alternative !0. It's straightforward to write a Babel transformation to reverse this process. There are many tools that are specifically designed to reverse these kind of lossless transformations.

Unimportant data loss

Some data is lost during the minification, but that data may be trivial to recreate. A good example is whitespace; it's trivial to use Prettier (or a similar tool) to reformat minified code's indentation and whitespace to a human-readable format. Most of the time the developers have used similar tools with the original code too, so the whitespace data can be recreated with high confidence.

Variable names

The most important information that's lost during the minification process is the loss of variable and function names. When you run a minifier, it completely replaces all possible variable and function names to save bytes.

Until now, there has not been any good way to reverse this process; when you rename a variable from crossProduct to a, there's not much you can do to reverse that process.

How would a human reverse minification?

Many reverse engineers train their eyes to recognize some heuristics from the context of the code so that they form educated guesses about the code's purpose. Let's have a look at a simple example:

function a(b) {
  return b * b;
}

How would you rename the function a? Given the context, we can make a pretty good guess that the original name was something like square. But this requires the knowledge of the internal workings of the function.

Let's try to codify the process of renaming the function:

Read the function's body
Describe what the function does
Try to come up with a name that fits that description

For a classical computer program it would be very difficult to make the leap from "multiply b with itself" to "squaring a number". Fortunately recent advances in LLMs have made this leap not only possible, but almost trivial.

Essentially the step 2. is called "rephrasing" (or "translating" if you consider Javascript as its own natural language), and LLMs are known to be very good at that.

Another task where LLMs really shine is summarization, which is pretty much what we're doing in step 3. The only specialization here is that the output needs to be short enough and formatted to the camel case.

Controlling LLMs

The issue with using output from LLMs is that they are not deterministic. In a nutshell a LLM is a very complex markov chain; it tries its best to guess the next word in a text based on the previous words.

This means that even if we have a good prompt like:

Are all roses red? Please answer only "Yes" or "No".

An LLM might still answer "No, but ...", "I don't know" or the famous "I'm sorry, but as an AI language model I cannot...".

This used to be a problem, but fortunately there are now ways to control the LLM's output like guidance and outlines. These tools use different techniques to ensure that the LLM's output matches a desired format.

Fortunately Javascript variables can only have a specific format, so can match the output with a regular expression to ensure that the output is a valid Javascript variable name.

Don't let AI touch the code

Now while LLMs are very good at rephrasing and summarizing, they are not very good at coding (yet). They have inherent randomness, which makes them unsuitable for performing the actual renaming and modification of the code.

Fortunately renaming a Javascript variable within its scope is a solved problem with traditional tools like Babel. Babel first parses the code into an abstract syntax tree (AST, a machine representation of the code), which is easy to modify using well behaving algorithms.

This is much better than letting the LLM modify the code on a text level; it ensures that only very specific transformations are carried out so the code's functionality does not change after the renaming. The code is guaranteed to have the original functionality and to be runnable by the computer.

Putting it all together

So, how do we un-minify Javascript? Let's put it all together:

Unbundle webpack bundles with webcrack
Run the code through transform-beautifier and a few custom Babel plugins that reverse lossless minification
Loop through all variables in the code, asking LLM to describe their intent and generate a better name based on that description
Rename the variables using Babel
Run a final round of Prettier to ensure nice whitespace

And that's it! Given the following code:

function a(e,t){var n=[];var r=e.length;var i=0;for(;i<r;i+=t){if(i+t<r){n.push(e.substring(i,i+t))}else{n.push(e.substring(i,r))}}return n}

The tool outputs the following:

function chunkedString(inputStringToBeSliced, chunk) {
  var chunkBuffer = [];
  var sliceSize = inputStringToBeSliced.length;
  var currentCharIndex = 0;
  for (; currentCharIndex < sliceSize; currentCharIndex += chunk) {
    if (currentCharIndex + chunk < sliceSize) {
      chunkBuffer.push(
        inputStringToBeSliced.substring(
          currentCharIndex,
          currentCharIndex + chunk
        )
      );
    } else {
      chunkBuffer.push(
        inputStringToBeSliced.substring(currentCharIndex, sliceSize)
      );
    }
  }
  return chunkBuffer;
}

Try it yourself!

The tool is called Humanify and it's available at Github. Do check it out and see if it works for you!