Towards `ssdeep` – part 1

Disclaimer: The following post has not gone through any kind of peer review process – take it with a grain of salt.

Introduction

To brush up on my RE skills, I’ve recently been checking out TrueBot – a loader that functions as a botnet, and spreads through phishing (like Emotet). Following herrcore, I checked out CISA’s Cybersecurity Advisory from July 6, on TrueBot, and skimmed through their malware analysis report MAR-10445155-1.v1. In this, I found mentioned:

ssdeep Matches
No matches found.

Intriguing. I’ve never heard of this before – what might this be?

Down the rabbit hole

So, we do a quick Google, and we’re lead to a a webpage seemingly run by a real OG, in collaboration with some other OGs. We find that ssdeep is a utility for:

ssdeep is a program for computing context triggered piecewise hashes (CTPH). Also called fuzzy hashes, CTPH can match inputs that have homologies. Such inputs have sequences of identical bytes in the same order, although bytes in between these sequences may be different in both content and length.

Ah, that’s an interesting use of the word “homology”. I’m more used to it in the context of topology – chain complexes and all that stuff. However, I guess this works too. Mathematically, what is described is the transitive closure of $\sim$, given by

$$ [x]\sim [a,x]\text{ and }[x]\sim[x,a] $$

and reflexivity. Let’s denote it by $\sim_H$. (Here, $a$ and $x$ are bytes, and $[a,x]$ denotes a list – let’s not be too picky on the details.)

However, this is a fairly boring relation. After all, the sequence

$$ S_1=[\mathtt{00},\mathtt{01},\mathtt{02},\mathtt{03}], $$

isn’t very similar to

$$ S_2=[\mathtt{00},\mathtt{de},\mathtt{ad},\mathtt{be},\mathtt{ff}]. $$

even though they’re homologous, i. e. $S_1\sim_HS_2$.¹

With some more mathematical terminology, what we’re the ssdeep-folks seem to be thinking of something like a metric. And indeed, something like the Levenshstein distance seems to fit the bill.

And since I like getting side-tracked, let’s take a look at that!

The Levenshtein distance

For brevity, we hereafter refer to the Levenshtein distance by LVSD. Our starting point, is the following sentence on the Wikipedia page – it gives us some great intuition.

Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.

So, Billy and Willy should have an LVSD of $1$, and Billy and Willie should have an LVSD of $3$. Let’s define it formally, by bringing the notion of “edit sequences”.

Let $A$ be an alphabet, and let $A^\ast$ be the set of finite-length strings on $A)$. Then we define a triple of partial functions $(\mathtt{i},\mathtt{r},\mathtt{m})$ by

$\mathtt{i}:A^\ast\times A\times\mathbb{N}\to A^\ast$ by $\mathtt{i}(a[1\ldots n],i,x)=a[1\ldots i-1]\,x\,a[i\ldots n]$ – we call this insertion of $x$ at $i$,
$\mathtt{r}:A^\ast\times\mathbb{N}\to A^\ast$ by $\mathtt{r}(a[1\ldots n],i)=a[1\ldots i-1]\,a[i+1\ldots n]$ – we call this removal of element $i$,
$\mathtt{m}:A^\ast\times A\times\mathbb{N}\to A^\ast$ by $\mathtt{m}(a[1\ldots n],i,x)=a[1\ldots i-1]\,x\,a[i+1\ldots n]$ – we call this modification of element $i$ to $x$.

If $i\notin\{1,\ldots,n\}$, the functions are not defined. Collectively, we call these “edits”. Let $$ \mathtt{E}=\{\mathtt{i}(\cdot,x,i):x\in A, i\in\mathbb{N}\}% \cup% \{\mathtt{r}(\cdot,i):i\in\mathbb{N}\}% \cup% \{\mathtt{m}(\cdot,x,i):x\in A,i\in\mathbb{N}\} $$ be the set of all edits. We define (partial) functions $\mathtt{index}:\mathtt{E}\to\mathbb{N}$ and $\mathtt{char}:\mathtt{E}\to A$ by $$ \mathtt{index}(\mathtt{i}(\cdot,x,i))=i\text{,}\quad% \mathtt{index}(\mathtt{r}(\cdot,i))=i\text{,}\quad\text{and}\quad% \mathtt{index}(\mathtt{m}(\cdot,x,i))=m\text{,} $$ and $$ \mathtt{char}(\mathtt{i}(\cdot,x,i))=x\text{,}\quad% \mathtt{char}(\mathtt{r}(\cdot,i))=\mathtt{undefined}\text{,}\quad\text{and}\quad% \mathtt{char}(\mathtt{m}(\cdot,x,i))=x\text{.} $$

Now, given strings $X,Y\in A^\ast$, we call a finite sequence $(f_1,\dots,f_n)\in \mathtt{E}^\ast$ satisfying that $$ Y=f_n\circ f_{n-1}\circ\dots\circ f_1(X), $$ an “edit sequence” from $X$ to $Y$. (We allow empty sequences.) Let $$ \mathtt{E}(X,Y)=\{\mathbf{f}\in\mathtt{E}^\ast:\mathbf{f}\text{ is an edit sequence from }X\text{ to }Y\} $$ We then set $$ \mathtt{LVSD}(X,Y)=\min\{|\mathbf{f}|:\mathbf{f}\in\mathtt{E}(X,Y)\}, $$ where $|\mathbf{f}|$ denotes sequence length.

Given a string $X\in A^\ast$ with $n=|X|$, we write $h(X)=X[1\ldots n-1]$, for the “head” of $X$. It turns out that the $\mathtt{LVSD}$ can be computed through a simple recurrence formula.

Theorem: Let $X,Y\in A^\ast$. Then $$ \mathtt{LVSD}(X,Y)=% \begin{cases} \max(|X|,|Y|)&\text{if }|X|=0\text{ or }|Y|=0\\ \mathtt{LVSD}(h(X),h(Y))&\text{if }X[|X|]=Y[|Y|]\\ 1+\min\begin{cases}% \mathtt{LVSD}(h(X),h(Y))\\ \mathtt{LVSD}(X,h(Y))\\ \mathtt{LVSD}(h(X),Y) \end{cases}&\text{otherwise.} \end{cases} $$

The proof is straight-forward, given that we have a crucial lemma – which is however not at all straight-forward.

Lemma: Given strings $X,Y\in A^\ast$, and an edit sequence $(f_1,\dots,f_n)$ from $X$ to $Y$, there exists an edit sequence $(g_1,\dots,g_m)$ from $X$ to $Y$ with $m\leq n$ and where $g_m$ is either

an insertion at the end,

a deletion at the end, or

a modification at the end.

We will prove this lemma constructively, by providing an algorithm that computes the $(g_1,\dots,g_m)$.

However, let’s first prove the theorem.

Proof of the Theorem

The case of $|X|=0$ or $|Y|=0$ is immediate. As for the case of $\mathtt{end}(X)=\mathtt{end}(Y)$, we notice that any edit $$ [X,z_1]\overset{f}{\mapsto}[Y,z_2] $$ yields a valid edit $$ X\overset{f'}{\mapsto} Y $$ of the same type (including the identity edit), and any edit $$ X\overset{g}{\mapsto}Y $$ yields a valid edit $$ [X,z]\overset{g'}{\mapsto}[Y,z] $$ of the same type (including the identity edit). Suppose therefore that $s=(f_1,\dots,f_n)$ is a minimal (in terms of length) edit sequence from $X$ to $Y$. By removing the last characters, we obtain an edit sequence from $h(X)$ to $h(Y)$, of the same length – implying that $\mathtt{LVSD}(h(X),h(Y))\leq\mathtt{LVSD}(X,Y)$. Conversely, given a minimal edit sequence from $h(X)$ to $h(Y)$, we obtain an edit sequence of the same length from $X$ to $Y$, by appending $\mathtt{end}(X)=\mathtt{end}(Y)$ – implying that $\mathtt{LVSD}(X,Y)\leq\mathtt{LVSD}(h(X),h(Y))$.

Let us therefore focus on the case when $\mathtt{end}(X)\neq\mathtt{end}(Y)$. We let $s=(f_1,\dots,f_n)$ be a minimal edit sequence from $X$ to $Y$. Let $Z=f_{n-1}\circ\cdots\circ f_1(X)$. By the lemma, we can assume that either:

$f_n$ is an insertion and $\mathtt{index}(f)=|Z|+1$,
$f_n$ is a modification and $\mathtt{index}(f)=|Z|$, or
$f_n$ is a deletion and $\mathtt{index}(f)=|Z|$.

Case 1: In this case we have that $Z=h(Y)$. Since $(f_1,\dots,f_{n-1})$ is an edit sequence from $X$ to $h(Y)$, we have that $\mathtt{LVSD}(X,h(Y))\leq n-1$. Conversely, given any edit sequence from $X$ to $h(Y)$, we can create an edit sequence from $X$ to $Y$ by adding an insertion at the end, hence we obtain $\mathtt{LVSD}(X,Y)\leq 1+\mathtt{LVSD}(X,h(Y))$.

Case 2: In this case, we have that $Z=[h(Y),z]$, where $z\neq Y[|Y|]$. Applying the trick we used in the case of $\mathtt{end}(X)=\mathtt{end}(Y)$, we see that the edit sequence $(f_1,\dots,f_{n-1})$ from $X$ to $Z$ can be mapped to an edit sequence $(f_1',\dots,f'_{n-1})$ from $h(X)$ to $h(Y)$. Hence $\mathtt{LVSD}(h(X),h(Y))\leq n-1$. Conversely, given any edit sequence from $h(X)$ to $h(Y)$, we can create an edit sequence from $X$ to $Y$, by appending $X[|X|]$ to make an edit sequence from $X$ to $[h(Y),X[|X|]]$ of the same length, and then adding a modification at the end. This implies that $\mathtt{LVSD}(X,Y)\leq 1+\mathtt{LVSD}(h(X),h(Y))$.

Case 3: In this case, we have that $Z=[Y,z]$. By the same argument as in case 2, we find that $\mathtt{LVSD}(h(X),Y)\leq n-1$ and $\mathtt{LVSD}(X,Y)\leq 1+\mathtt{LVSD}(h(X),Y)$.

In conclusion, in case 1, we have that:

$$ \mathtt{LVSD}(X,Y)=1+\mathtt{LVSD}(X,h(Y))\text{,} $$

in case 2, we have that:

$$ \mathtt{LVSD}(X,Y)=1+\mathtt{LVSD}(h(X),h(Y))\text{,} $$

and in case 3, we have that:

$$ \mathtt{LVSD}(X,Y)=1+\mathtt{LVSD}(h(X),Y)\text{,} $$

By minimality of LVSD, we find that:

$$ \mathtt{LVSD}(X,Y)=1+% \min\begin{cases}% \mathtt{LVSD}(X,h(Y))\\ \mathtt{LVSD}(h(X),h(Y))\\ \mathtt{LVSD}(h(X),Y) \end{cases} $$ We are done.

Proof of the Lemma

My proof of the lemma proceeds by mapping a given edit sequence $s$ from $X$ to $Y$ to a multi-dimensional array called an “edit array”, which, when traversed in a specific order can be mapped back to an edit sequence $s^\ast$ from $X$ to $Y$ that goes from left to right.²

Edit arrays are defined as follows.

Definition: Let $A$ be an alphabet with $\sqcup,\texttt{!}\in A$. Then an edit array is an array of arrays with elements in $A$ satisfying:

for $1\leq i\leq |T|$ it holds that $T[i][1]\neq\texttt{!}$,

for $1\leq i\leq |T|$ and $2\leq j\leq|T[i]|$, it holds that $T[i][j]\neq\sqcup$

for $1\leq i\leq |T|$, if $T[i][j]=\texttt{!}$ then either $j=|T[i]|$ or $T[i][j+1]\neq\texttt{!}$.

Given an edit array $T$, we write $$ \mathtt{top}(T)=[T[i][1]:1\leq i\leq |T|\text{ and }T[i][1]\neq\sqcup]\text{,} $$ and $$ \mathtt{bottom}(T)=[\mathtt{end}(T[i]):1\leq i\leq|T|\text{ and }\mathtt{end}(T[i])\neq\texttt{!}] $$

Let us begin by transforming an edit sequence to an edit array.

Algorithm:

Input: an edit sequence $s$ from a string $X$ to another string $Y$, with letters in the same alphabet, and where $\mathtt{char}(f)\notin\{\sqcup,\texttt{!}\}$ for all edits $f$ in $s$.

Output: an edit array $T$ with $$ X=\mathtt{top}(T)\text{,} $$ and $$ Y=\mathtt{bottom}(T)\text{.} $$

Initialize $T$ to an empty array.

for every $i$, do $T[i]\gets [X[i]]$, endfor.

for every edit $f$ in $s$, do

$\mathtt{on\_cols}\gets[i:\mathtt{end}(T[i])\neq\texttt{!}]$ – this is an array.

Comment: the $\mathtt{on\_cols}$ array keeps track of the elements

that are not marked as deleted.

if $f$ is an insertion:

$\mathtt{ins\_col}\gets 1$.

Comment: $\mathtt{ins\_col}$ denotes the index in $T$ where we want record the insertion.

if $\mathtt{index}(f)=|\mathtt{on\_cols}|+1$:

$\mathtt{ins\_col}\gets\mathtt{end}(\mathtt{on\_cols})+1$,

else:

$\mathtt{ins\_col}\gets\mathtt{on\_cols}[\mathtt{index}(f)]$.

endif

Comment: there are three cases – insert at the front, at the end, or in the middle.

Comment: at front.

if $\mathtt{ins\_col}=1$:

if $\mathtt{end}(T[1])=\texttt{!}$:

append $\mathtt{char}(f)$ to $T[1]$,

else:

insert $[\sqcup,\mathtt{char}(f)]$ to the beginning of $T$,

endif

Comment: at end.

else-if $\mathtt{ins\_col}=|T|+1$:

append $[\sqcup,\mathtt{char}(f)]$ to the end of $T$,

Comment: at middle.

else:

if $\mathtt{end}(T[\mathtt{ins\_col}])=\texttt{!}$:

append $\mathtt{char}(f)$ to $T[\mathtt{ins\_col}]$,

else-if $\mathtt{end}(T[\mathtt{ins\_col}-1])=\texttt{!}$:

append $\mathtt{char}(f)$ to $T[\mathtt{ins\_col}-1]$,

else:

insert $[\sqcup,\mathtt{char}(f)]$ at index $\mathtt{ins\_col}$ of $T$.

endif

endif

else-if $f$ is a modification:

append $\mathtt{char}(f)$ to $T[\mathtt{on\_cols}[\mathtt{index}(f)]]$,

else-if $f$ is a deletion:

append $\texttt{!}$ to $T[\mathtt{on\_cols}[\mathtt{index}(f)]]$.

endif

endfor

return $T$

Ok, let’s prove this algorithm is correct. It is enough to show:

that $T$ is an edit array after every iteration in the outermost for-loop,
that $Y=\mathtt{bottom}(T)$ when the algorithm terminates.

Let’s start by proving 1. with induction. Since $T$ is initialized to $[[X[1]],[X[2]],\dots,[X[|X|]]$, the base case is clear. Hence, suppose that we’re processing an edit sequence $s=(f_1,\dots,f_m)$ where $m\geq 2$. We denote the value of $T$ after $f_i$ has been processed by $T_i$. As an inductive assumption, we assume that $T_{i-1}$ is an edit array. We want to show that $T_i$ is an edit array.

If $f_i$ is a modification, no change affecting the defining properties of an edit array is made, and so $T_i$ remains an edit array. If $f_{i}$ is a deletion, we need to make sure $T_{i-1}[\mathtt{on\_cols}[\mathtt{index}(f)]]\neq\texttt{!}$. But this follows by definition of $\mathtt{on\_cols}$, and hence $T_i$ remains an edit array. If $f_i$ is an insertion, we see that either a character, not equal to $\texttt{!}$ or $\sqcup$, is added after a $\texttt{!}$, or a new element $[\sqcup,x]$ where again $x\notin\{\texttt{!},\sqcup\}$ is added to $T_{i-1}$. Again, none of this violates the defining properties of an edit array, and thus $T_i$ remains an edit array.

The proof for 2 is entirely analogous, so we leave it to the reader.

Now, let’s see how we can map an edit array $T$, to an edit sequence from $\mathtt{top}(T)$ to $\mathtt{bottom}(T)$, proceeding from left to right. Again, I’ve opted for an algorithmic approach.

Algorithm:

Input: An edit array $T$.

Output: An edit sequence $s$ from $\mathtt{top}(T)$ to $\mathtt{bottom}(T)$ proceeding from left to right.

Initialize $s$ to an empty sequence.

$p\gets 1$

for $1\leq i\leq |T|$, do:

for $1\leq j\leq |T[i]|-1$, do:

label: START_INNER

$x\gets T[i][j+1]$

if $T[i][j]\neq\texttt{!}$ and $x=\texttt{!}$:

append $\mathtt{r}(\cdot,p)$ to $s$

else-if $T[i][j]=\texttt{!}$ and $x\neq\texttt{!}$:

append $\mathtt{i}(\cdot, x, p)$ to $s$

else-if $T[i][j]=\sqcup$:

append $\mathtt{i}(\cdot, x, p)$ to $s$

else:

append $\mathtt{m}(\cdot, x, p)$ to $s$

endif

label: ADDED_EDIT

endfor

if $\mathtt{end}(T[i])\neq\texttt{!}$, then $p\gets p+1$, endif.

endfor

return $s$

To prove that this algorithm is correct, we again use induction. For convenience, we write $X=\mathtt{top}(T)$. Let $f_{i,j}$ be the edit appended to $s$, on the $i$th iteration in the outer for-loop, on the $j$th iteration in the inner for-loop, at the label ADDED_EDIT. Let $p_i$ be the value of $p$ at $i$th iteration on the outer for-loop, at the label START_INNER. Note that this implies that³ $p_i-1$ is equal to the number of indices $1\leq j\leq i-1$ for which $\mathtt{end}(T[j])\neq\texttt{!}$. For convenience, we also set $p_{|T|+1}=|\mathtt{bottom}(T)|+1$.

Let also $$ A_{i,j}=% f_{i,j}\circ\cdots f_{i,1}% \circ\cdots\circ% f_{1,|T[1]|}\circ\cdots\circ f_{1,1}(X) $$ be the effect of applying all the edits until and including $f_{i,j}$ (in the order of creation) to $X$. Note that $A_{i,j}$ may a priori be undefined. Since the edits $f_{i,j}$ proceed from left to right by construction, it is enough to show that $A_{|T|,|T[|T|]|}=\mathtt{bottom}(T)$. We prove this by induction, and show more generally that $$ A_{i,|T[i]|}[1,\dots,p_{i+1}-1]=\mathtt{bottom}(T)[1,\dots,p_{i+1}-1]\text{,} $$ and $|A_{|T|,|T[|T|]|}|=|\mathtt{bottom}(T)|$.

Since $T$ is an edit array – there are no removals in sequence, and no insertions in sequence. This means that $A_{1,|T[1]|}$ takes on one of four forms:

$[\mathtt{end}(T[1]),X]$ and then $p_2=2$ – when the first edit is an insertion, and edit $|T[1]|$ is not a removal,
$[\mathtt{end}(T[1]),X[2],\dots,\mathtt{end}(X)]$ and then $p_2=2$ – when the first edit is not an insertion, and edit $|T[1]|$ is not a removal,
$[X[2],X[3],\dots,\mathtt{end}(X)]$ and then $p_2=1$ – when the first edit is not an insertion, and edit $|T[1]|$ is a removal,
$X$ and then $p_2=1$ – when the first edit is an insertion, and edit $|T[1]|$ is a removal.

It is clear that in all of these cases, whatever is to the left of $p_2$ is $\mathtt{bottom}(T)[1,\dots,p_2-1]$ – which of course is nothing when $p_2-1=0$. This settles the base case.

Let’s continue on the inductive hypothesis. We assume that $$ A_{i,|T[i]|}=[\mathtt{bottom}(T)[1,\dots,p_{i+1}-1],z_1,\dots,z_n]\text{,} $$ where the $z_1$ is at index $p_{i+1}$. This implies that $A_{i+1,|T[i+1]|}$ takes on one of four forms:

$[\mathtt{bottom}(T)[1,\dots,p_{i+1}-1],\mathtt{end}(T[i+1]),z_1,\dots,z_n]$ and then $p_{i+2}=p_{i+1}+1$,
$[\mathtt{bottom}(T)[1,\dots,p_{i+1}-1],\mathtt{end}(T[i+1]),z_2,\dots,z_n]$ and then $p_{i+2}=p_{i+1}+1$,
$[\mathtt{bottom}(T)[1,\dots,p_{i+1}-1],z_2,\dots,z_n]$ and then $p_{i+2}=p_{i+1}$,
$[\mathtt{bottom}(T)[1,\dots,p_{i+1}-1],z_1,z_2,\dots,z_n]$ and then $p_{i+2}=p_{i+1}$,

under conditions analogous to the ones in the base case. We are done if we can show that

$$ A_{i+1,|T[i+1]|}=[\mathtt{bottom}(T)[1,\dots,p_{i+2}-1],w_1,\dots,w_m] $$

for some characters $w_1,\dots,w_m$ where $m\leq n$. In case 3 and 4, this is immediate. For case 1 and 2, we notice that, by definition of $p_i$ and $\mathtt{bottom}(T)$, we have that

$$ \mathtt{bottom}(T)[1,\dots,p_i-1]=[\mathtt{end}(T[j]):1\leq j\leq i-1\text{ and }\mathtt{end}(T[j])\neq\texttt{!}] $$

Clearly, this implies that

$$ \mathtt{bottom}(T)[1,\dots,p_{i+2}-1]=[\mathtt{end}(T[j]):1\leq j\leq i+1\text{ and }\mathtt{end}(T[j])\neq\texttt{!}], $$

In case 1 and 2, we then have that

$$ \mathtt{bottom}(T)[1,\dots,p_{i+2}-1]\\ =[[\mathtt{end}(T[j]):1\leq j\leq i\text{ and }\mathtt{end}(T[j])\neq\texttt{!}],\mathtt{end}(T[i+1])]\\ =[\mathtt{bottom}(T)[1,\dots,p_{i+1}-1],\mathtt{end}(T[i+1])] $$

So, we have shown that

$$ A_{i+1,|T[i+1]|}[1,\dots, p_{i+2}-1]=\mathtt{bottom}(T)[1,\dots,p_{i+2}-1]\text{.} $$

Finally, to see that $|A_{|T|,|T[|T|]|}|=|\mathtt{bottom}(T)|$, note that

$$ |A_{i,|T[i]|}|=|\mathtt{bottom}(T)[1,\dots,p_{i+1}-1]|+|X[i+1,\dots,|X|]|\text{.} $$

We are done.

Combining these two algorithms, we immediately obtain the lemma. I think the reader will agree that this is quite enough for now.

Appendix

I have provided implementations of the two algorithms defined above in Python. Check it out if you wish.

E. g. because $S_i\sim[\mathtt{00}]$ for $i\in\{1,2\}$. ↩︎
To be specific, this means that if $s^\ast=(g_1,\dots,g_m)$, then the following holds: for $1\leq j\leq m-1$, if $g_j$ is an edit at $i$, then $g_{j+1}$ is an edit at $i'$ where $i'\in\{i,i+1\}$. ↩︎
Note that $p$ starts at $1$, thereof the subtraction. ↩︎