Advanced Binary File Similarity Detection: Identifying Duplicate and Related Files

Question

How can I detect similarity between binary files, especially when dealing with different versions or slightly modified executables? What advanced techniques go beyond simple byte-by-byte comparison?

Bobbie.Morgan · Accepted Answer

Advanced Binary File Similarity Detection 🔍
Detecting similarity between binary files, such as executables or compiled code, is a complex task that goes beyond simple byte-by-byte comparison. Advanced techniques are needed to handle variations like different compiler versions, minor code changes, or added resources. Here are some effective methods:

1. Hashing Techniques 🧮
Fuzzy Hashing (SSDEEP):

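Fuzzy hashing, like SSDEEP, computes a hash that changes only slightly when the input changes slightly. Unlike cryptographic hashes, where a one-bit change produces an unrelated digest, similar files produce similar hash values.
SSDEEP splits the file into chunks whose boundaries depend on the content itself (context-triggered piecewise hashing), hashes each chunk, and concatenates the results into a signature.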

import ssdeep

file1_path = "file1.exe"
file2_path = "file2.exe"

with open(file1_path, "rb") as f:
    file1_data = f.read()

with open(file2_path, "rb") as f:
    file2_data = f.read()

hash1 = ssdeep.hash(file1_data)
hash2 = ssdeep.hash(file2_data)

print(f"SSDEEP Hash for {file1_path}: {hash1}")
print(f"SSDEEP Hash for {file2_path}: {hash2}")

# Calculate similarity score: an integer from 0 (no match) to 100 (near-identical)
score = ssdeep.compare(hash1, hash2)
print(f"Similarity score: {score}")
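To make the chunking idea concrete, here is a toy piecewise hash using only the standard library. It is a sketch of the general principle, not the actual SSDEEP algorithm: `toy_fuzzy_hash`, `toy_similarity`, the CRC32-based trigger, and the default `block_size` are all assumptions chosen for illustration.

```python
import zlib
from difflib import SequenceMatcher

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def toy_fuzzy_hash(data: bytes, block_size: int = 512, window: int = 7) -> str:
    """Toy piecewise hash: one signature character per content-defined chunk."""
    signature = []
    chunk_start = 0
    for i in range(window, len(data) + 1):
        # A hash of the last `window` bytes decides where a chunk ends,
        # so boundaries follow local content rather than absolute offsets.
        if zlib.crc32(data[i - window:i]) % block_size == block_size - 1:
            chunk_hash = zlib.crc32(data[chunk_start:i])
            signature.append(ALPHABET[chunk_hash % 64])
            chunk_start = i
    if chunk_start < len(data):
        # Bytes after the last boundary form the final chunk.
        signature.append(ALPHABET[zlib.crc32(data[chunk_start:]) % 64])
    return "".join(signature)


def toy_similarity(sig_a: str, sig_b: str) -> float:
    """Ratio in [0, 1] of signature characters that line up."""
    return SequenceMatcher(None, sig_a, sig_b, autojunk=False).ratio()
```

Because boundaries are content-defined, inserting a few bytes in the middle of a file should only disturb the chunk around the insertion; the rest of the signature stays the same, which is why the two signatures remain comparable.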

2. Control Flow Graph (CFG) Analysis ⚙️

CFG Extraction: Disassemble the binary and construct a control flow graph for each function.
 Graph Similarity: Compare the CFGs to identify similar functions. This can be done using graph matching algorithms.

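from capstone import Cs, CS_ARCH_X86, CS_MODE_64, CS_GRP_JUMP, CS_GRP_RET  # Disassembler
from capstone.x86 import X86_INS_JMP, X86_OP_IMM
import networkx as nx  # Graph library

def disassemble(binary_path, offset, size):
    # Read `size` bytes starting at a raw file offset (not a virtual address)
    with open(binary_path, 'rb') as f:
        f.seek(offset)
        code = f.read(size)
    md = Cs(CS_ARCH_X86, CS_MODE_64)  # Or CS_MODE_32 for 32-bit
    md.detail = True  # Needed for instruction groups and operands
    return list(md.disasm(code, offset))

def build_cfg(instructions):
    # Instruction-level CFG: one node per instruction address.
    # Still simplified: indirect jumps and calls are not followed,
    # and instructions are not merged into basic blocks.
    graph = nx.DiGraph()
    addresses = {instr.address for instr in instructions}
    for i, instr in enumerate(instructions):
        graph.add_node(instr.address)
        # Edge to the target of a direct jump, if it lies in the disassembled range
        if instr.group(CS_GRP_JUMP) and instr.operands and instr.operands[0].type == X86_OP_IMM:
            target = instr.operands[0].imm
            if target in addresses:
                graph.add_edge(instr.address, target)
        # Fall-through edge, except after a return or an unconditional jump
        falls_through = not instr.group(CS_GRP_RET) and instr.id != X86_INS_JMP
        if falls_through and i + 1 < len(instructions):
            graph.add_edge(instr.address, instructions[i + 1].address)
    return graph

# Example usage:
binary_file = "example.exe"
entry_point = 0x1000  # Example file offset of the code
code_size = 0x2000  # Example code size

instructions = disassemble(binary_file, entry_point, code_size)
cfg = build_cfg(instructions)

# Now you can compare CFGs using graph similarity algorithms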
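One cheap way to compare two CFGs is by shape rather than by address, since addresses shift between builds. The sketch below is a coarse heuristic under my own assumptions, not a full graph matching algorithm: `degree_profile` and `profile_similarity` are hypothetical helper names, and the input is a list of (source, target) pairs, which is what `cfg.edges()` yields for a networkx graph.

```python
from collections import Counter


def degree_profile(edges):
    """Multiset of (in_degree, out_degree) pairs, one entry per node."""
    in_deg, out_deg = Counter(), Counter()
    nodes = set()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
        nodes.update((src, dst))
    return Counter((in_deg[n], out_deg[n]) for n in nodes)


def profile_similarity(edges_a, edges_b):
    """Weighted Jaccard similarity of two degree profiles, in [0, 1]."""
    a, b = degree_profile(edges_a), degree_profile(edges_b)
    union = sum((a | b).values())
    if union == 0:
        return 1.0  # two empty graphs are trivially identical
    return sum((a & b).values()) / union


# The same diamond-shaped flow at two different load addresses
diamond = [(0, 1), (0, 2), (1, 3), (2, 3)]
relocated = [(100, 101), (100, 102), (101, 103), (102, 103)]
print(profile_similarity(diamond, relocated))
```

Equal profiles do not prove the graphs are isomorphic, so treat a high score as a reason to look closer, not as a confirmed match.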

3. Feature Extraction and Machine Learning 🧠

Feature Extraction: Extract features from the binary files, such as opcode sequences, string constants, or byte n-grams.
 Machine Learning: Train a machine learning model to classify files as similar or dissimilar based on these features.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def extract_opcodes(binary_path, num_opcodes=1000):
    with open(binary_path, 'rb') as f:
        code = f.read(num_opcodes)
    # Simplification: each raw byte is treated as a token. Real opcode
    # extraction requires disassembling the code section first.
    opcodes = [hex(byte) for byte in code]
    return ' '.join(opcodes)

# Example usage:
file1_opcodes = extract_opcodes("file1.exe")
file2_opcodes = extract_opcodes("file2.exe")

# Vectorize the opcodes
vectorizer = CountVectorizer()
vectorizer.fit([file1_opcodes, file2_opcodes])
vectors = vectorizer.transform([file1_opcodes, file2_opcodes])

# Calculate cosine similarity
similarity = cosine_similarity(vectors[0], vectors[1])[0][0]
print(f"Cosine Similarity: {similarity}")
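Byte n-grams, mentioned above as a feature, can also be compared without any ML library. The following is a minimal standard-library sketch; `byte_ngrams` and `cosine` are names I made up for illustration, and n=2 is an arbitrary default.

```python
import math
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count every overlapping n-byte sequence in data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty input shares nothing with anything
    return dot / (norm_a * norm_b)


# Two short byte strings that share a common prefix
first = byte_ngrams(b"\x55\x48\x89\xe5\x5d\xc3")
second = byte_ngrams(b"\x55\x48\x89\xe5\x90\xc3")
print(f"n-gram cosine similarity: {cosine(first, second):.3f}")
```

Larger n captures more context per feature but makes the vectors sparser, so it is worth trying a few values on your own files.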

4. Byte-Level Entropy Analysis 📊

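Analyze the distribution of byte values in the binary. High-entropy regions (close to 8 bits per byte) often indicate compressed or encrypted data, while low-entropy regions typically indicate padding, sparse tables, or plain text; machine code usually falls somewhere in between.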
 Compare entropy profiles to identify similar regions.

import math
from collections import Counter

def calculate_entropy(data):
    # Shannon entropy in bits per byte (0.0 to 8.0)
    if not data:
        return 0
    entropy = 0
    for count in Counter(data).values():
        p_x = count / len(data)
        entropy -= p_x * math.log2(p_x)
    return entropy

file_path = "example.exe"
with open(file_path, "rb") as f:
    file_data = f.read()

entropy = calculate_entropy(file_data)
print(f"Entropy of {file_path}: {entropy}")
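A single entropy value for the whole file hides where the interesting regions are. To compare profiles, one option is to compute entropy per fixed-size window and then measure how far two profiles are apart. This is a rough sketch: `entropy_profile`, `profile_distance`, and the 256-byte window are assumptions of mine, and it ignores the fact that matching regions may sit at different offsets in the two files.

```python
import math
from collections import Counter


def window_entropy(window: bytes) -> float:
    """Shannon entropy of one window, in bits per byte."""
    total = len(window)
    probabilities = (count / total for count in Counter(window).values())
    return sum(-p * math.log2(p) for p in probabilities)


def entropy_profile(data: bytes, window_size: int = 256) -> list:
    """Entropy of each consecutive, non-overlapping window."""
    return [
        window_entropy(data[i:i + window_size])
        for i in range(0, len(data), window_size)
    ]


def profile_distance(a: list, b: list) -> float:
    """Mean absolute difference over the windows both profiles have."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / n


# One all-zero window followed by one window containing every byte value once
profile = entropy_profile(bytes(256) + bytes(range(256)))
print(profile)
```

The last window can be shorter than the others, which caps its entropy at log2 of its length, so very short tails are best ignored when comparing.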

5. Import/Export Table Comparison 📦

Compare the imported and exported functions. Similar binaries often share many of the same imports and exports.
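Once the import names are extracted (for instance with a PE parser), the comparison itself is a set overlap. Here is a small sketch using Jaccard similarity; the `jaccard` helper and both import lists are hypothetical, written in a "dll!function" form I picked for the example.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union, in [0, 1]."""
    if not a and not b:
        return 1.0  # two empty import tables are trivially identical
    return len(a & b) / len(a | b)


# Hypothetical import tables for two versions of the same program
imports_v1 = {
    "kernel32.dll!CreateFileW",
    "kernel32.dll!ReadFile",
    "ws2_32.dll!connect",
    "advapi32.dll!RegOpenKeyExW",
}
imports_v2 = {
    "kernel32.dll!CreateFileW",
    "kernel32.dll!ReadFile",
    "ws2_32.dll!connect",
    "kernel32.dll!WriteFile",
}

print(f"Import overlap: {jaccard(imports_v1, imports_v2):.2f}")
```

Very common imports add little signal, so in practice you may want to weight rare functions more heavily than ones nearly every binary uses.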

6. Resource Comparison 🖼️

Examine resources such as icons, strings, and other embedded data. Similar binaries may share common resources.
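For embedded strings specifically, a quick approximation is to pull printable ASCII runs out of each file (similar in spirit to the `strings` utility) and intersect the results. This is only a sketch: `printable_strings` is a name I invented, the 4-character minimum is arbitrary, and it will miss UTF-16 strings, which are common in Windows resources.

```python
import re


def printable_strings(data: bytes, min_len: int = 4) -> set:
    """Return the set of printable-ASCII runs of at least min_len bytes."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return set(re.findall(pattern, data))


# Hypothetical file contents: binary noise around a few readable strings
file_a = b"\x00\x01MyApp v1.0\x00\xffusage: myapp\x00ab\x00"
file_b = b"\x7f\x02MyApp v1.1\x00usage: myapp\x00\x03"

shared = printable_strings(file_a) & printable_strings(file_b)
print(shared)
```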

7. Hybrid Approach

Combine multiple techniques for more accurate results. For example, use fuzzy hashing to quickly identify potential matches, then use CFG analysis or feature extraction to confirm the similarity.
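The screen-then-confirm idea can be sketched as a two-stage filter. Everything here is illustrative: `find_related`, the thresholds, and the two stand-in scorers (`histogram_overlap` as the cheap check, `sequence_ratio` as the expensive one) are assumptions, and in a real pipeline you would plug in fuzzy hashing and CFG or feature comparison instead.

```python
from collections import Counter
from difflib import SequenceMatcher


def histogram_overlap(a: bytes, b: bytes) -> float:
    """Cheap stand-in: fraction of bytes shared when order is ignored."""
    total = max(len(a), len(b))
    if total == 0:
        return 1.0
    return sum((Counter(a) & Counter(b)).values()) / total


def sequence_ratio(a: bytes, b: bytes) -> float:
    """Expensive stand-in: order-aware similarity of the two byte sequences."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def find_related(target, candidates, cheap_score, precise_score,
                 screen_at=0.3, confirm_at=0.7):
    """Return {name: precise score} for candidates that pass both stages."""
    confirmed = {}
    for name, data in candidates.items():
        if cheap_score(target, data) < screen_at:
            continue  # rejected by the fast screen, skip the costly check
        score = precise_score(target, data)
        if score >= confirm_at:
            confirmed[name] = score
    return confirmed


# Hypothetical inputs standing in for file contents
target = b"push rbp; mov rbp, rsp; call init; ret"
candidates = {
    "patched": b"push rbp; mov rbp, rsp; call init2; ret",
    "unrelated": bytes(range(200, 240)),
}
print(find_related(target, candidates, histogram_overlap, sequence_ratio))
```

Passing the scorers in as arguments keeps the pipeline independent of any one technique, so each stage can be swapped out as you learn which checks work best on your files.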

By employing these advanced techniques, you can effectively detect similarity between binary files, even in the presence of variations and modifications.
