Raj Kundalia
Exploration of Different AI Techniques

LLMs, AI, ML, and the rest have become hugely popular lately, and everyone is jumping in to have their share of fun. Some folks are genuinely good at this and have been producing impressive use cases and solutions.

We have one of those problems too, one we think could make use of the buzzwords that have been floating around since ChatGPT arrived.

Problem Statement: We have code in one language, spread across multiple files; we want to convert it to a different language and a different format altogether. Call it translation + transformation.

I had to research this, and I figured: why not write it up? We did come up with a POC that partially helped in getting some code generated, but it required a lot of manual effort.

Anyway, I had to research some AI techniques for it, and I am listing them here. They are:

1. Prompt Chaining

Prompt chaining is a prompt engineering technique that uses a sequence of prompts to guide a generative AI model through a complex task. By breaking the task into smaller, more manageable steps, it lets LLMs tackle problems that a single prompt would struggle with. A minimal code sketch of the three common chaining styles follows the list below.

Common Chaining Techniques

1. Sequential Chaining:

  • The output of one prompt is directly fed as input to the next.
  • Suitable for tasks with a clear linear progression.
  • Short example: Translate English to Spanish -> Summarize the Spanish text
  • Examples are mentioned below: #1 and #2

2. Iterative Chaining:

  • The output of one prompt is refined through multiple iterations of the same or similar prompts.
  • Useful for tasks requiring optimization or refinement.
  • Short Example: Write a poem -> Improve the poem’s rhyme scheme -> Make the poem more concise
  • Examples are mentioned below: #3

3. Hierarchical Chaining:

  • Involves creating a tree-like structure of prompts, where the output of one prompt can be used as input for multiple subsequent prompts.
  • Ideal for tasks with multiple sub-tasks or branching logic.
  • Short Example: Write a story -> Develop 3 characters -> Create a plot for character A
  • Examples are mentioned below: #4
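To make the three variants concrete, here is a minimal orchestration sketch; `call_llm` is a hypothetical placeholder for whichever LLM client you actually use, so the chaining structure is the point, not the API:

```python
# Minimal sketch of the three chaining styles.
# `call_llm` is a hypothetical placeholder for a real LLM client.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

# Sequential chaining: each output feeds directly into the next prompt.
def sequential_chain(text: str) -> str:
    spanish = call_llm(f"Translate this English text to Spanish:\n{text}")
    return call_llm(f"Summarize this Spanish text:\n{spanish}")

# Iterative chaining: the same artifact is refined over several passes.
def iterative_chain(task: str, refinements: list[str]) -> str:
    result = call_llm(task)
    for instruction in refinements:
        result = call_llm(f"{instruction}\n\nCurrent version:\n{result}")
    return result

# Hierarchical chaining: one output fans out into several sub-prompts.
def hierarchical_chain(premise: str) -> dict[str, str]:
    story = call_llm(f"Write a short story about: {premise}")
    characters = call_llm(f"List 3 characters from this story:\n{story}")
    return {
        "story": story,
        "characters": characters,
        "plot_a": call_llm(f"Create a plot for the first character in:\n{characters}"),
    }
```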

Examples of Chaining:

1. Sequential Chaining — Data-Driven Chaining

Prompt 1:

Provide examples of Python functions that implement sorting algorithms (e.g., bubble sort, insertion sort, merge sort).

Prompt 2:

Based on the provided examples, identify common patterns and structures in sorting algorithms.

Prompt 3:

Write a Python function that implements a new sorting algorithm, using the identified patterns and structures.

2. Sequential Chaining — Contextual Chaining

Prompt 1:

Given the following Python code for a linked list class:

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.next = None


class LinkedList:
    def __init__(self):
        self.head = None
```

Write a Python function that inserts a new node at the beginning of the linked list.

Prompt 2:

Given the same linked list class and the insertion function from the previous prompt, write a Python function that deletes the first node of the linked list.

Prompt 3:

Given the same linked list class and the insertion and deletion functions, write a Python function that searches for a specific value in the linked list.
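For reference, one plausible set of outputs for these three prompts, building on the `Node` and `LinkedList` classes above (the function names are my own choice, not canonical):

```python
def insert_at_beginning(linked_list, data):
    """Insert a new node at the head of the list (Prompt 1)."""
    node = Node(data)
    node.next = linked_list.head
    linked_list.head = node


def delete_first(linked_list):
    """Delete the head node, if any (Prompt 2)."""
    if linked_list.head is not None:
        linked_list.head = linked_list.head.next


def search(linked_list, value):
    """Return True if `value` exists in the list (Prompt 3)."""
    current = linked_list.head
    while current is not None:
        if current.data == value:
            return True
        current = current.next
    return False
```

Because each prompt carries forward the class definition and the previously generated functions, the model stays consistent with names like `head` and `next` across the chain.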

3. Iterative Chaining:

Prompt 1:

Generate code for bubble sort

Output 1:

Code for bubble sort with some missing condition

Prompt 2:

You have missed a certain condition in the provided code; can you correct it? (Code provided.)

Output 2:

Code with fixed condition is provided

Prompt 3:

The code is correct, but can you optimize it by removing unnecessary conditions?

Output 3:

Improved or optimized code

4. Hierarchical Chaining: Top-Down Decomposition

Prompt 1:

Write a function that calculates the factorial of a given number. The factorial of a number n is the product of all positive integers less than or equal to n. For example, the factorial of 5 is 5! = 5 × 4 × 3 × 2 × 1 = 120.

Prompt 2:

Define a base case for the factorial function. The base case should return 1 when the input number is 0, since 0! = 1.

Prompt 3:

Write the recursive step for the factorial function. This step should multiply the current number by the factorial of the number minus one.

Prompt 4:

Combine the base case and recursive step to create the complete factorial function.
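Assembled per Prompt 4, the end result of this chain would be something like:

```python
def factorial(n: int) -> int:
    """Factorial combining the base case (Prompt 2) and recursive step (Prompt 3)."""
    if n == 0:  # base case: 0! = 1
        return 1
    return n * factorial(n - 1)  # recursive step


print(factorial(5))  # 120
```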

Code Generation: Chaining can be used to generate code in multiple steps, starting with high-level specifications and gradually refining the code based on intermediate outputs.

References: Link 1, Link 2

Final Thought and Idea:

We should design our prompts such that we can use sequential as well as hierarchical chaining to generate the different classes we want, and then combine the results at the end. This approach will, however, have to be iterative in terms of designing the system and writing the prompts.

Find all the required metadata and add it, including sample classes of all types, old and new. The classes we need to convert should be fed in one by one; we then generate outputs for each while maintaining the relevant context, i.e. class names and summaries of already-converted classes kept in some data structure for reference.

2. RAG and Code Chunking

Image Source: [Apoorva Joshi’s Blogs](https://www.mongodb.com/developer/author/apoorva-joshi/)

Basics of RAG:

Retrieval-augmented generation (RAG), as the name suggests, aims to improve the quality of pre-trained LLM generation using data retrieved from a knowledge base. The success of RAG lies in retrieving the most relevant results from the knowledge base.

Indexing the Code-base: The first step is to index the existing code-base using a retrieval system. This involves converting the code into a format suitable for retrieval, such as embedding the code or extracting relevant information like function/class names, Javadocs, and comments.

Retrieval: When prompted to generate new code, the system (the RAG system, or whatever service is orchestrating) can query the indexed code-base to retrieve code snippets, functions, or components that are potentially relevant to the task at hand. The retrieval system can use similarity measures or other techniques to identify the most relevant parts of the code-base.

Augmented Generation: The retrieved code snippets or information can then be provided as additional context to the LLM, along with the original prompt or requirements. The LLM can use this augmented context to generate new code while being aware of the existing code-base and potentially reusing or adapting relevant parts.
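As a rough sketch of how those three stages fit together (the `embed`, `call_llm`, and `VectorStore` names below are hypothetical stand-ins for a real embedding model, LLM client, and vector database):

```python
from typing import List

def embed(text: str) -> List[float]:
    """Hypothetical embedding model client."""
    raise NotImplementedError("plug in your embedding model")

def call_llm(prompt: str) -> str:
    """Hypothetical LLM client."""
    raise NotImplementedError("plug in your LLM client")

class VectorStore:
    """Hypothetical stand-in for a real vector database."""
    def add(self, key: str, vector: List[float]) -> None: ...
    def search(self, vector: List[float], top_k: int) -> List[str]: ...

def index_codebase(files: dict, store: VectorStore) -> None:
    """Indexing: embed each source file and store the vector."""
    for path, source in files.items():
        store.add(path, embed(source))

def generate_with_rag(task: str, store: VectorStore, k: int = 3) -> str:
    """Retrieval + augmented generation for one task."""
    # Retrieval: find the k snippets most similar to the task.
    snippets = store.search(embed(task), top_k=k)
    # Augmented generation: prepend the retrieved context to the prompt.
    context = "\n\n".join(snippets)
    return call_llm(f"Relevant existing code:\n{context}\n\nTask: {task}")
```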

A Note on Embeddings and Embedding Models:

An embedding is an array of numbers (a vector) representing a piece of information, such as text, images, audio, video, etc. Together, these numbers capture semantics and other important features of the data. The immediate consequence of doing this is that semantically similar entities map close to each other while dissimilar entities map farther apart in the vector space.

In the context of natural language processing (NLP), embedding models are algorithms designed to learn and generate embeddings for a given piece of information.

In RAG systems, embeddings play a crucial role in both storing and retrieving data:

Storing Data:

Document Embeddings: Each document in the knowledge base is converted into an embedding vector. This vector represents the core meaning and context of the document.

Storing Embeddings: These embeddings are typically stored in a vector database, which is optimized for efficient similarity search.

Retrieving Data:

Query Embedding: When a user query is received, it is also converted into an embedding vector.

Similarity Search: The query embedding is then compared to the stored document embeddings using a similarity metric (e.g., cosine similarity).

Retrieval: The documents with the highest similarity scores to the query are retrieved and used as context for the language model.
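The similarity search step reduces to vector math. A minimal sketch with NumPy and made-up toy vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions.
documents = {
    "linked_list.py": np.array([0.9, 0.1, 0.0]),
    "bubble_sort.py": np.array([0.1, 0.8, 0.2]),
    "http_client.py": np.array([0.0, 0.2, 0.9]),
}
query = np.array([0.85, 0.2, 0.05])  # embedding of the user query

# Rank documents by similarity; the best matches become LLM context.
ranked = sorted(documents.items(),
                key=lambda kv: cosine_similarity(query, kv[1]),
                reverse=True)
print(ranked[0][0])  # -> linked_list.py
```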

The RAG approach can be particularly useful when dealing with large code-bases because it allows our system to selectively access and incorporate relevant information from the code-base without having to process it all at once. This helps mitigate context window limitations and improves the relevance and coherence of the generated code.

Additionally, RAG can be combined with other strategies like code chunking or hierarchical prompting. For example, you could index and retrieve relevant code snippets or components for each chunk or hierarchical level, providing the LLM with more focused and relevant context at each step.

While RAG can be beneficial, it’s important to note that it introduces additional complexity in terms of indexing and retrieval setup, as well as potential challenges in ensuring the retrieved information is correctly interpreted and integrated by the LLM:

Retrieval Efficiency: Efficiently retrieving relevant information from a large knowledge base can be computationally challenging, especially for complex queries.

Knowledge Integration: Integrating retrieved information seamlessly with the LLM’s internal knowledge representation can be tricky, potentially leading to inconsistencies or inaccuracies.

Model Bias: If the knowledge base contains biased or inaccurate information, it can negatively impact the LLM’s responses.

Chunking Strategies:

Chunking is the process of breaking down large pieces of text into smaller segments, or chunks. In the context of RAG, embedding smaller chunks instead of entire documents means that, given a user query, you only have to retrieve the most relevant document chunks, resulting in fewer input tokens and more targeted context for the LLM to work with. Different techniques and parameters for chunking (a minimal chunker sketch follows this list):

Splitting technique: Determines where the chunk boundaries will be placed — based on paragraph boundaries, programming language-specific separators, tokens, or even semantic boundaries

Chunk size: The maximum number of characters or tokens allowed for each chunk

Chunk overlap: Number of overlapping characters or tokens between chunks; overlapping chunks can help preserve cross-chunk context; the degree of overlap is typically specified as a percentage of the chunk size
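A minimal sliding-window chunker, measuring size in characters for simplicity (token- or syntax-aware splitters follow the same shape), might look like this:

```python
def chunk_text(text: str, chunk_size: int = 400, overlap_pct: float = 0.1) -> list[str]:
    """Split `text` into fixed-size chunks with a percentage overlap."""
    overlap = int(chunk_size * overlap_pct)
    step = chunk_size - overlap  # each chunk starts `step` chars after the last
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

# 10% of each chunk repeats at the start of the next,
# preserving context across chunk boundaries.
pieces = chunk_text(open("some_source_file.py").read())
```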

References: Link 1, Link 2

Final Thoughts:

This is complex and may require substantial experimentation to implement. Retrieving the relevant material and then integrating it into generation requires finding a sweet spot, which may take a good amount of iteration.

3. Small Language Model

Small Language Models (SLMs) are a subset of artificial intelligence models designed for natural language processing (NLP). They are characterised by their smaller size and fewer parameters compared to Large Language Models (LLMs), making them more efficient and easier to deploy in resource-constrained environments. [What are parameters?]

Because SLMs can be fine-tuned on specific datasets, they can be tailored to generate code relevant to particular domains or programming languages, enhancing their utility for developers working in specialized fields.
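As an illustration only, fine-tuning a small causal model with the Hugging Face transformers library might look roughly like this; the model name, data file, and hyperparameters are placeholder assumptions, not recommendations:

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

model_name = "distilgpt2"  # placeholder small model; pick one suited to code
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Assumes a plain-text file of domain-specific code samples.
dataset = load_dataset("text", data_files={"train": "our_code_samples.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="slm-finetuned",
                           num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```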

Reference: Link 1

Challenges:

An SLM might struggle with inputs that are more complex and lack a close-enough training dataset, so training data quality should be top notch. It might also not have the breadth of information necessary to handle a wide range of topics effectively, and SLMs can produce more repetitive and less creative responses. We would need an expert, or have to get up to speed ourselves, on training and fine-tuning models.

Final thoughts:

We have to find a good model for code generation and then train it on our data, which requires knowledge of both fine-tuning and preparing training data. This would require deeper exploration to make more sense of.

4. GAN

GANs (Generative Adversarial Networks) are a type of deep learning model that works by pitting two neural networks against each other in a game-like setting. One network, the generator, is responsible for creating new data, while the other network, the discriminator, is responsible for determining whether the data is real or fake. The generator is constantly trying to improve its ability to create realistic data, while the discriminator is constantly trying to improve its ability to distinguish between real and fake data. This back-and-forth competition drives both networks to become better and better at their jobs. A toy version of this training loop is sketched below.
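Heavily simplified, the adversarial loop looks like this in PyTorch; the dimensions, architectures, and hyperparameters are arbitrary toy choices:

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 64  # arbitrary toy dimensions

generator = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                          nn.Linear(128, data_dim))
discriminator = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(),
                              nn.Linear(128, 1), nn.Sigmoid())

loss_fn = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

for step in range(1000):
    real = torch.randn(32, data_dim)  # stand-in for a batch of real data
    fake = generator(torch.randn(32, latent_dim))

    # Discriminator: learn to label real as 1 and fake as 0.
    opt_d.zero_grad()
    d_loss = (loss_fn(discriminator(real), torch.ones(32, 1)) +
              loss_fn(discriminator(fake.detach()), torch.zeros(32, 1)))
    d_loss.backward()
    opt_d.step()

    # Generator: learn to make the discriminator label fakes as 1.
    opt_g.zero_grad()
    g_loss = loss_fn(discriminator(fake), torch.ones(32, 1))
    g_loss.backward()
    opt_g.step()
```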

GANs have been used to achieve state-of-the-art results in a variety of tasks, including image generation, text generation, and translation. For example, GANs have been used to create realistic images of people, animals, and objects that are indistinguishable from real images. They have also been used to generate realistic text, such as news articles, blog posts, and even creative writing. Additionally, GANs have been used to translate languages with a high degree of accuracy.

Challenges:

We found that while code generation using GANs has shown promising results, several challenges remain unresolved, including the generation of code that is both syntactically and semantically correct, as well as the need for large amounts of training data. — Source: Link

Also, code often depends on long-range dependencies (e.g., variables defined earlier affect later code). GANs don’t handle these dependencies as effectively as other models.

Final thoughts:

GANs are better suited for data where slight imperfections aren’t critical (like an image with a tiny anomaly); code, which breaks on a single wrong token, is not that kind of data.

PS: I am by no means an expert; corrections and suggestions are welcome. I hope this helps somebody.
