🌝

HashGNN: Min-Hashing & Locality Sensitive Hashing

Easier to understand when sections are read in this order: min-hashing → locality sensitive min hashing → HashGNN

HashGNN (Presentation)

Contrary to what the name suggests, HashGNN is not a supervised neural learning model. It is in fact an unsupervised algorithm. Its name comes from the fact that the algorithm design is inspired by that of graph neural networks, in that it does message passing interleaved with transformations on each node. But instead of doing neural transformations like most GNNs, its transformations are done by locality sensitive min-hashing. Since the hash functions used are randomly chosen independent of the input data, there is no need for training.

It can also be part of NN models, read below!

HashGNN architecture overview: message passing with locality sensitive hashing instead of neural transformations.

Graph Encoder: 2 layer GCN (flexible choice).
Hash Layers: Generate binary hash codes for nodes in $G$.
- $z_i = \tanh(W^T u_i + b)$ — intermediate representation of size $K$.
To enforce the embedding to be binary, we use $\tanh$, which returns values from -1 to 1, then simply each value is converted to 1 if value ≥ 0; or 0 if value < 0.
We can interpret:
- $P(A_{ij}|h_i,h_j) = \sigma(\text{dist}(h_i, h_j))$ if $A_{ij}=1$
- $P(A_{ij}|h_i,h_j) = 1 - \sigma(\text{dist}(h_i, h_j))$ if $A_{ij}=0$
- ($\sigma$ = sigmoid function)
- (here be dragons: step function in NN → non-differentiable → STE, read more below)
The relationship between Hamming distance and inner product for binary hash codes:
- Hamming distance = number of bits that differ between two hash codes.
- Inner product = number of bits that are the same between two hash codes.
- The Hamming distance is a measure of dissimilarity; the inner product is a measure of similarity.
The smaller the Hamming distance, the larger $P(1|h_i, h_j)$!

Cross-Entropy Loss

$$L_{\text{cross}} = \sum_{A_{ij}} A_{ij} \cdot \log(\sigma(\text{inner}(h_i,h_j))) + (1 - A_{ij}) \cdot \log(1 - \sigma(\text{inner}(h_i,h_j)))$$

By minimizing this loss, we obtain binary hash codes that share similar values between linked nodes, so the network topological information is stored in the Hamming space.

Ranking-Preserving Loss

$$L_{\text{rank}} = \sum_{(v_i,v_j,v_m)} \max(0, -\sigma(\text{inner}(h_i,h_j)) + \sigma(\text{inner}(h_i,h_m)) + \alpha)$$

Where $\alpha$ is the margin parameter to control difference between similar and dissimilar hash codes (fixed to 0.2 by trial and error). Minimizing this will mean binary hash codes for neighbors of $v_i$ will be in a smaller radius near $v_i$, vs non-connected nodes.

End-to-End Loss

$L = L_{\text{cross}} + \lambda L_{\text{rank}}$, where $\lambda$ is a parameter to balance between entropy and ranking loss.

Inference

Forward pass → binary hash h, embedding z for each node.
Based on a given query vector, get either Hamming Space Retrieval (using h only) or Hierarchical Search (use h for sourcing, then z for ranking).

Since the final loss function is non-smoothing, the gradient is 0, so STE (Straight-Through Estimator) is used to approximate the gradient.

Experimental Results

IMDB (HashGNN + logistic regression)

Why HashGNN?

Can generate high quality node embeddings
Can encode the heterogeneity of a graph
A lot faster than neural models
Simpler than neural models and does not require training
Does not require GPU
Scales very well with increased CPU concurrency
Does not require node feature input
Has inductive capabilities

Locality Sensitive Min-Hashing

The basic idea behind LSH is that sets with a high Jaccard similarity are more likely to have many hash values in common than sets with a low Jaccard similarity. Therefore, if we can find a way to group sets with similar hash values together, we can identify sets that are likely to have a high Jaccard similarity.

In LSH, the data is first transformed into a set of binary signatures using the min-hashing technique. Min-hashing is a method of estimating the Jaccard similarity between two sets (i.e., the size of their intersection divided by the size of their union) by hashing the sets and comparing the resulting hashes.

LSH then applies multiple hash functions to the signatures, and groups the signatures that collide on at least one hash function into the same bucket. The probability of two similar signatures colliding on at least one hash function is high, while the probability of two dissimilar signatures colliding is low. This makes it possible to quickly prune the search space by only considering candidates in the same buckets as the query signature.

LSH can be used with various distance metrics, such as Euclidean distance, cosine similarity, or Jaccard distance. The choice of distance metric depends on the type of data and the specific application.

Overall, LSH is a powerful technique for approximate nearest neighbor search, allowing us to efficiently search large datasets for comparable items with high accuracy. It has many practical applications, such as image search, recommendation systems, and natural language processing.

Min-Hashing

Min-hashing is a technique used to estimate the Jaccard similarity between two sets, which is defined as the size of their intersection divided by the size of their union. It is a useful tool in information retrieval and data mining, where it is often used for clustering, indexing, and nearest neighbor search.

Here is a step-by-step explanation of how min-hashing works:

1. Convert the sets into binary vectors

Given two sets A and B, we can represent them as binary vectors of length N, where N is the total number of items. Each entry in the vector corresponds to an item, and is set to 1 if the item is present in the set, and 0 otherwise. For example, if N = 7 and A = {1, 2, 4, 5, 6}, and B = {2, 3, 5, 7}:

A = [1, 0, 1, 0, 1, 1, 0]
B = [0, 1, 1, 0, 1, 0, 1]

2. Generate k hash functions

We randomly select k hash functions, each of which maps a binary vector to a single integer between 1 and N. One common approach is to use hash functions of the form h(x) = (a * x + b) % p, where a and b are random parameters, and p is a large prime number:

h1(x) = (2x + 1) % 7
h2(x) = (3x + 2) % 7
h3(x) = (5x + 3) % 7

3. Compute the min-hash signature

For each hash function, we compute the minimum hash value for each set. This gives us a signature of length k for each set:

h1(1) = (2*1 + 1) % 7 = 3
h1(2) = (2*2 + 1) % 7 = 5
h1(4) = (2*4 + 1) % 7 = 2  <-- minhash for A using h1 is 2
h1(5) = (2*5 + 1) % 7 = 4
h1(6) = (2*6 + 1) % 7 = 6

# Min-hash signatures:
# Set A: [2, 0, 0]
# Set B: [0, 1, 0]

4. Estimate the Jaccard similarity

To estimate the Jaccard similarity between sets A and B, we compare their min-hash signatures, and count the number of times the corresponding entries are equal:

J(A, B) = (1/k) * (number of matching entries in s(A) and s(B))

signature1 = {2,0,0}
signature2 = {0,1,0}
intersection = signature1.intersection(signature2)
union = signature1.union(signature2)
jaccard_similarity = len(intersection) / len(union)

signature1 = [2,0,0]
signature2 = [0,1,0]
num_hashes = 3
num_matches = sum(1 for i in range(num_hashes) if signature1[i] == signature2[i])
estimated_similarity = num_matches / num_hashes

print('Jaccard similarity:', jaccard_similarity,
      'Estimated similarity:', estimated_similarity)
# > Jaccard similarity: 0.333 Estimated similarity: 0.333

The intuition behind min-hashing is that, by hashing the items in a set and taking the minimum hash value, we are effectively selecting a "representative" item that captures the essence of the set. By comparing the min-hash signatures of two sets, we can estimate their Jaccard similarity with high probability, even if we don't know the exact contents of the sets. This makes min-hashing a useful tool for dealing with large, high-dimensional datasets where traditional similarity measures are computationally infeasible.

import random

# Define the sets to be compared
set1 = {1, 2, 3, 4, 5}
set2 = {3, 4, 5, 6, 7}

# Define the number of hash functions to use
num_hashes = 100

# Generate random hash functions
hash_funcs = []
for i in range(num_hashes):
    a = random.randint(1, 100)
    b = random.randint(1, 100)
    hash_funcs.append(lambda x: (a*x + b) % 100000000)

# Define a function to create a min-hash signature for a set
def min_hash_signature(s):
    signature = [float('inf')] * num_hashes
    for element in s:
        for i, hash_func in enumerate(hash_funcs):
            hash_val = hash_func(element)
            signature[i] = min(signature[i], hash_val)
    return signature

# Create min-hash signatures for the sets
signature1 = min_hash_signature(set1)
signature2 = min_hash_signature(set2)

# Estimate Jaccard similarity as the fraction of matching min-hash values
num_matches = sum(1 for i in range(num_hashes)
                  if signature1[i] == signature2[i])
jaccard_similarity = num_matches / num_hashes

print('~ Jaccard similarity:', jaccard_similarity)

Code for STE (Straight-Through Estimator)

class BinaryLayer(torch.autograd.Function):
    def forward(self, input):
        return torch.sign(input)

    def backward(self, grad_output):
        input = self.saved_tensors
        grad_output[input>1]=0
        grad_output[input<-1]=0
        return grad_output

input = torch.randn(4,4)
input = torch.autograd.Variable(input, requires_grad=True)

model = BinaryLayer()
output = model(input)
loss = output.mean()
loss.backward()

# With STE:
# input.grad =
#   6.2500e-02 for all entries

# Without STE:
# input.grad =
#   0 for all entries