Explaining the HybridSimilarity Algorithm
In this article, we will delve into the HybridSimilarity algorithm, a custom neural-network-based model for measuring the similarity between two pieces of text. The model combines lexical, phonetic, semantic, and syntactic signals into a single, comprehensive similarity score.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sentence_transformers import SentenceTransformer
from Levenshtein import ratio as levenshtein_ratio
from phonetics import metaphone
import torch
import torch.nn as nn
class HybridSimilarity(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = SentenceTransformer('all-MiniLM-L6-v2')  # 384-dim sentence embeddings
        self.tfidf = TfidfVectorizer()
        self.attention = nn.MultiheadAttention(embed_dim=384, num_heads=4)
        # Aggregates the six hand-crafted similarity features into one score
        self.fc = nn.Sequential(
            nn.Linear(6, 256),
            nn.ReLU(),
            nn.LayerNorm(256),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )
    def _extract_features(self, text1, text2):
        # Collect multiple hand-crafted similarity features
        features = {}
        # Lexical similarity
        features['levenshtein'] = levenshtein_ratio(text1, text2)
        features['jaccard'] = len(set(text1.split()) & set(text2.split())) / len(set(text1.split()) | set(text2.split()))
        # Phonetic similarity
        features['metaphone'] = 1.0 if metaphone(text1) == metaphone(text2) else 0.0
        # Semantic embedding (BERT); dim=0 because the encoded sentences are 1-D tensors
        emb1 = self.bert.encode(text1, convert_to_tensor=True)
        emb2 = self.bert.encode(text2, convert_to_tensor=True)
        features['semantic_cosine'] = nn.CosineSimilarity(dim=0)(emb1, emb2).item()
        # Syntactic similarity (LSA-TFIDF), normalized into a cosine value
        tfidf_matrix = self.tfidf.fit_transform([text1, text2])
        svd = TruncatedSVD(n_components=1)
        lsa = svd.fit_transform(tfidf_matrix)
        features['lsa_cosine'] = float(
            np.dot(lsa[0], lsa[1]) /
            (np.linalg.norm(lsa[0]) * np.linalg.norm(lsa[1]) + 1e-9)
        )
        # Attention patterns; inputs shaped (seq_len, batch, embed_dim) = (1, 1, 384)
        att_output, _ = self.attention(
            emb1.unsqueeze(0).unsqueeze(0),
            emb2.unsqueeze(0).unsqueeze(0),
            emb2.unsqueeze(0).unsqueeze(0)
        )
        features['attention_score'] = att_output.mean().item()
        # Stack the six scalar features into a (1, 6) tensor for the fully connected head
        return torch.tensor(list(features.values()), dtype=torch.float32).unsqueeze(0)
    def forward(self, text1, text2):
        features = self._extract_features(text1, text2)
        return self.fc(features).item()

def similarity_coefficient(text1, text2):
    model = HybridSimilarity()
    return model(text1, text2)
Key Components of the Algorithm
The HybridSimilarity model utilizes the following libraries and technologies:
- SentenceTransformers: For semantic embedding generation using pre-trained transformer models.
- Levenshtein Ratio: To calculate lexical similarity.
- Phonetics (Metaphone): For phonetic similarity.
- TF-IDF and TruncatedSVD: For syntactic similarity through Latent Semantic Analysis (LSA).
- PyTorch: To define a custom neural network with attention mechanisms and fully connected layers.
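To run the code above, the dependencies can be installed as shown below. The package names are an assumption based on the imports; the Levenshtein bindings in particular are published under more than one name on PyPI.

pip install sentence-transformers scikit-learn python-Levenshtein phonetics torch numpy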
Step-by-Step Explanation
1. Model Initialization
The HybridSimilarity class inherits from nn.Module and initializes:
- A BERT-based sentence embedding model (all-MiniLM-L6-v2).
- A TF-IDF vectorizer for text vectorization.
- A multi-head attention mechanism to capture interdependencies between text pairs.
- A fully connected neural network for aggregating features and producing the final similarity score.
self.bert = SentenceTransformer('all-MiniLM-L6-v2')
self.tfidf = TfidfVectorizer()
self.attention = nn.MultiheadAttention(embed_dim=384, num_heads=4)
self.fc = nn.Sequential(
    nn.Linear(6, 256),
    nn.ReLU(),
    nn.LayerNorm(256),
    nn.Linear(256, 1),
    nn.Sigmoid()
)
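The embed_dim=384 passed to the attention layer matches the output dimensionality of all-MiniLM-L6-v2, which can be verified with a quick standalone check:

from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer('all-MiniLM-L6-v2')
print(encoder.get_sentence_embedding_dimension())  # 384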
2. Feature Extraction
The _extract_features method calculates multiple similarity features:
- Lexical Similarity
  - Levenshtein ratio: Measures character-level edits needed to convert one text into another.
  - Jaccard index: Compares sets of unique words in both texts.
features['levenshtein'] = levenshtein_ratio(text1, text2)
features['jaccard'] = len(set(text1.split()) & set(text2.split())) / len(set(text1.split()) | set(text2.split()))
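To make the Jaccard part concrete, here is a tiny standalone sketch (the sentences are made up for illustration):

a = "the quick brown fox"
b = "the lazy brown dog"
tokens_a, tokens_b = set(a.split()), set(b.split())
print(len(tokens_a & tokens_b) / len(tokens_a | tokens_b))  # 2 shared words / 6 unique words = 0.333...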
- Phonetic Similarity
  - Metaphone encoding: Checks if the phonetic representation of both texts matches.
features['metaphone'] = 1.0 if metaphone(text1) == metaphone(text2) else 0.0
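As an illustration (the exact codes depend on the Metaphone implementation in the phonetics package), spellings that sound alike tend to produce the same code:

from phonetics import metaphone

print(metaphone("night"), metaphone("knight"))    # near-homophones are expected to share a code
print(metaphone("night") == metaphone("knight"))  # True if the codes match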
- Semantic Similarity
  - Sentence embeddings are generated using BERT, and cosine similarity is calculated between them.
emb1 = self.bert.encode(text1, convert_to_tensor=True)
emb2 = self.bert.encode(text2, convert_to_tensor=True)
features['semantic_cosine'] = nn.CosineSimilarity(dim=0)(emb1, emb2).item()
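The same cosine value can also be obtained with the cos_sim helper that ships with sentence-transformers, shown here as a self-contained sketch:

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer('all-MiniLM-L6-v2')
emb1 = encoder.encode("The quick brown fox", convert_to_tensor=True)
emb2 = encoder.encode("A fast brown fox", convert_to_tensor=True)
print(util.cos_sim(emb1, emb2).item())  # cosine similarity, typically in [-1, 1]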
- Syntactic Similarity
  - TF-IDF is used to vectorize the text, and Latent Semantic Analysis (LSA) is applied via TruncatedSVD.
tfidf_matrix = self.tfidf.fit_transform([text1, text2])
svd = TruncatedSVD(n_components=1)
lsa = svd.fit_transform(tfidf_matrix)
features['lsa_cosine'] = float(
    np.dot(lsa[0], lsa[1]) /
    (np.linalg.norm(lsa[0]) * np.linalg.norm(lsa[1]) + 1e-9)
)
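A self-contained version of the same computation looks like this. Note that with n_components=1 each document is projected to a single number, so the cosine collapses to roughly ±1 and mainly captures the sign of the projection:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["The quick brown fox jumps over the lazy dog",
        "A fast brown fox leaps over a sleepy hound"]
tfidf_matrix = TfidfVectorizer().fit_transform(docs)            # shape: (2, n_terms)
lsa = TruncatedSVD(n_components=1).fit_transform(tfidf_matrix)  # shape: (2, 1)
cos = np.dot(lsa[0], lsa[1]) / (np.linalg.norm(lsa[0]) * np.linalg.norm(lsa[1]) + 1e-9)
print(cos)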
- Attention Mechanism
  - A multi-head attention mechanism is applied to the embeddings, and the average attention score is used as a feature.
att_output, _ = self.attention(
    emb1.unsqueeze(0).unsqueeze(0),
    emb2.unsqueeze(0).unsqueeze(0),
    emb2.unsqueeze(0).unsqueeze(0)
)
features['attention_score'] = att_output.mean().item()
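Since nn.MultiheadAttention defaults to batch_first=False, its inputs are expected in the shape (sequence_length, batch_size, embed_dim); the double unsqueeze turns each 384-dimensional embedding into a (1, 1, 384) tensor. A minimal shape check, independent of the class:

import torch
import torch.nn as nn

attention = nn.MultiheadAttention(embed_dim=384, num_heads=4)
q = torch.randn(384).unsqueeze(0).unsqueeze(0)   # query from text1, shape (1, 1, 384)
kv = torch.randn(384).unsqueeze(0).unsqueeze(0)  # key/value from text2, shape (1, 1, 384)
out, weights = attention(q, kv, kv)
print(out.shape, weights.shape)  # torch.Size([1, 1, 384]) torch.Size([1, 1, 1])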
3. Neural Network Aggregation
The extracted features are collected into a single feature vector and passed through a fully connected neural network, which predicts a similarity score between 0 and 1.
def forward(self, text1, text2):
    features = self._extract_features(text1, text2)
    return self.fc(features).item()
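One caveat: the fully connected layers are randomly initialized, so the aggregated score is only meaningful once the network has been trained on labeled text pairs. A minimal training sketch, assuming a hypothetical list of (text1, text2, label) triples and using the internal helpers directly (forward() converts its output to a plain float, which would break gradient flow):

import torch
import torch.nn as nn

model = HybridSimilarity()
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

# pairs: hypothetical list of (text1, text2, label) with label 1.0 for similar, 0.0 for dissimilar
for text1, text2, label in pairs:
    features = model._extract_features(text1, text2)
    prediction = model.fc(features)  # shape (1, 1), in (0, 1) thanks to the sigmoid
    loss = loss_fn(prediction, torch.tensor([[label]]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()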
Example Usage
The similarity_coefficient function initializes the model and calculates the similarity between two input texts.
text_a = "The quick brown fox jumps over the lazy dog"
text_b = "A fast brown fox leaps over a sleepy hound"
print(f"Similarity coefficient: {similarity_coefficient(text_a, text_b):.4f}")
This function calls the HybridSimilarity model and outputs a similarity score, a float between 0 (completely dissimilar) and 1 (identical).
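Note that each call to similarity_coefficient constructs (and potentially downloads) a fresh SentenceTransformer, so for repeated comparisons it is cheaper to reuse a single model instance:

model = HybridSimilarity()
for a, b in [("hello world", "hello there"), ("cats purr", "dogs bark")]:
    print(a, "|", b, "->", round(model(a, b), 4))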
Conclusion
The HybridSimilarity algorithm is a robust solution that combines multiple dimensions of text similarity into a unified model. By integrating lexical, phonetic, semantic, and syntactic features, this hybrid approach ensures a nuanced and comprehensive similarity analysis. This makes it suitable for tasks such as duplicate detection, text clustering, and recommendation systems.