Dmitrii
Hiring the Best AI Talent: Interview Questions in 2025

The first part of the article focuses on the characteristics and personality traits of developers (soft skills).
The second part covers topics to discuss during an interview.

Disclaimer

This article is focused on providing practical questions for companies aiming to integrate AI into their traditional products and businesses with minimal effort and high-quality outcomes. Examples include using AI chatbots for retail, analyzing patient data in healthcare, and delivering personalized experiences in educational platforms. It is not intended for research-oriented firms (e.g., Mistral, Anthropic, ElevenLabs) or enterprises (e.g., Google, Amazon, Microsoft).

The Value of Soft Skills

A trustworthy developer who is eager to learn and experiment can adapt more effectively to the rapidly evolving AI landscape. The ability to learn and pivot quickly is more valuable than proficiency in a specific framework or programming language.

If you know such a developer, you're in luck — they can solve problems without relying on guides like this.

The Changing AI Landscape: From NLP to LLMs and Multimodal Models

Previously, solving specific NLP problems required specialized tools and deep expertise in areas such as: Sentiment Analysis, Spam Detection, Topic Classification, Named Entity Recognition, Text Summarization, Translation, Duplicate Detection, Recommendation Systems, Intent Detection, Grammar Correction, Audio/Image Recognition.

Today, with the advent of large language models (LLMs) and multimodal models, many of these tasks can be addressed more efficiently and comprehensively. The focus has shifted from building custom models/pipelines to applying pre-trained models to real-world use cases.

Avoid Overengineering

When building your first product or MVP, simplicity should be the priority. If you're unsure where to start, begin with straightforward solutions.

For instance, if you're building an MVP with limited usage (say, 10 requests per day for customer support) and someone suggests training a BERT or ModernBERT model, hosting it locally, and managing the entire setup, that's likely overengineering. You'll invest significantly more time, and without the scale to justify 1,000 RPS or a dedicated tech team to maintain the system, it's not a practical approach.

Getting started with conversational chatbots no longer requires an in-depth understanding of NLP concepts like encoder-only vs. decoder models. An analogy: You don't need to write assembly code to develop most business applications. Instead, nowadays you can rely on high-level languages, frameworks, or even no-code/low-code solutions to efficiently solve business problems.


Topics to Discuss During an Interview

From the business problem to implementation.
Each topic also includes links to specialized resources for deeper exploration.

Dataset Management

Working with data is often the most challenging aspect of AI development. It's critical for fine-tuning and evaluations. Key topics include:

  • Preprocessing: Cleaning and preparing raw data for training. This includes transformations, combining data from multiple sources, and reducing dimensionality while preserving relevant information. Developers should also understand concepts like training, validation, and testing sets:
    • Training set: Your textbook and practice problems (you actively learn from these).
    • Validation set: Practice tests that are different from the practice problems (you use these to gauge your understanding and identify areas needing improvement, adjusting your study methods accordingly).
    • Testing set: The actual exam (this is the final, unseen evaluation of your knowledge).
  • Labeling and Annotation: Labeling entities (e.g., person, organization, location) in text, annotating sentiment (positive, negative, neutral) in reviews, and other critical tasks for fine-tuning models.
  • Tooling: Tools for exploration, visualization, and annotation.
  • Data Pipelines and Workflows: Automation, data sources, responsibilities in the team.
  • Security and Privacy: Encryption, anonymization, pseudonymization, and access control to ensure data security and compliance with regulations.
  • Versioning
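The textbook/exam analogy above maps directly onto a split routine. A minimal sketch in plain Python (the 80/10/10 ratios and the `split_dataset` name are illustrative choices, not a standard):

```python
import random

def split_dataset(records, train=0.8, valid=0.1, seed=42):
    """Shuffle and split records into train/validation/test sets.
    Whatever is left after train + valid becomes the test set."""
    rng = random.Random(seed)      # fixed seed -> reproducible splits
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train)
    n_valid = int(n * valid)
    return (shuffled[:n_train],                   # textbook: learn from these
            shuffled[n_train:n_train + n_valid],  # practice test: tune on these
            shuffled[n_train + n_valid:])         # exam: final unseen evaluation
```

Splitting before any training or tuning is what keeps the "exam" honest: if test records leak into training, every metric you report afterward is inflated.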

Read more Datasets and Tagging


AI Agents Architecture

Consider the level of control and custom logic required for your application: Do you want the LLM to make decisions autonomously, or will it work in tandem with traditional bytecode programming logic? Then choose a framework or approach based on your specific requirements:

  • Low-code agent builders (e.g., n8n, Langflow).
  • Multi-agent systems (e.g., CrewAI, AutoGen).
  • Multi-actor systems (e.g., LangGraph).
  • Custom architectures tailored to specific use cases (e.g., Semantic Kernel).

Also decide how responses reach the user: streamed token by token, or delivered as complete messages. Streaming provides a more dynamic and responsive user experience but requires careful implementation to ensure that all system components support it. Messaging, on the other hand, is easier to debug and troubleshoot but feels less seamless to the user.
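This trade-off can be softened: forward chunks to the client immediately while still assembling the full message for logging and debugging. A minimal sketch (`chunks` stands in for whatever token iterator your model SDK yields; the log list is a placeholder for a real sink):

```python
def stream_with_log(chunks, log):
    """Forward each chunk to the caller as it arrives (streaming UX),
    but also buffer the pieces and log the complete reply at the end
    (messaging-style debuggability)."""
    buffer = []
    for chunk in chunks:
        buffer.append(chunk)
        yield chunk                   # forwarded immediately to the client
    log.append("".join(buffer))       # full reply recorded once complete
```

The client consumes the generator for a responsive UI, while the log keeps whole messages that are easy to inspect when something goes wrong.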

Read more


Memory and State Management

Effective chatbot memory management involves balancing precision and recall with considerations of accuracy, latency, and cost. The principle is simple: "Garbage In, Garbage Out." The ultimate goal is to equip the agent with exactly what is needed—no more, no less.

Short-Term Memory (Conversation Thread)

  • Message Buffering: Retain the last N messages or a specific time window to maintain conversational context.
  • Summarization: Condense previous interactions to preserve relevance without overwhelming the system.
  • Session Timeout: Define when a session should expire (e.g., after 30 minutes of inactivity).
  • Tools History: Determine whether interactions with external tools should be included in the conversation history.
  • State Passing: Ensure seamless state transfer between agents or modules.
  • Entity Storage: Capture and update entities, facts, and IDs relevant to the conversation.
  • Update Timing: Decide when and how memory updates should occur.
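Message buffering from the list above can be as simple as a bounded deque. A minimal sketch (the class name and message format are illustrative; real agent frameworks ship their own memory abstractions):

```python
from collections import deque

class ShortTermMemory:
    """Rolling window over the last `max_messages` conversation turns."""

    def __init__(self, max_messages=10):
        # deque with maxlen silently drops the oldest turn on overflow
        self.buffer = deque(maxlen=max_messages)

    def add(self, role, content):
        self.buffer.append({"role": role, "content": content})

    def context(self):
        # what actually gets sent to the model on the next turn
        return list(self.buffer)
```

Summarization slots in naturally here: instead of dropping the oldest turns, condense them into a single synthetic message before they fall out of the window.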

Long-Term Memory

  • Diverse Storage: Leverage various storage solutions to retrieve relevant information efficiently.
  • Update Mechanisms: Implement robust processes for updating long-term memory with new data.
  • Few-Shot Prompting: Use stored conversations as context for dynamic prompting.
  • Data Masking: Ensure sensitive information is appropriately masked or anonymized.
  • Context-Dependent Instructions: Tailor memory behavior to the specific use case or scenario.

Storage Solutions

  • SQL Databases: Best suited for structured data and simple, predefined queries.
  • Vector Databases: Optimal for storing embeddings and performing similarity searches.
  • Document Databases: Ideal for unstructured data, such as conversation history, and flexible schemas.
  • Graph Databases: Perfect for representing and querying intricate relationships within data.

Read more


RAG, Embeddings, and Vector Stores

A deep technical topic to discuss.

  • Embedding Dimensions: Balance between detail capture and computational efficiency.
  • Sparse vs. Dense: Sparse embeddings for discrete features; dense embeddings for semantic relationships.
  • Model Selection: Pre-trained models (e.g., BERT, GPT) for general tasks; fine-tuned models for specialized domains.
  • Language Support: Coverage of all target languages; additional training data for less common ones.
  • Input Type: Embedding of text, non-text data, or both based on chatbot needs.
  • Vector Stores: Scalable, efficient databases with metadata integration for enhanced retrieval.
  • Data Retrieval: Use of varied agents, approaches, and workflows for effective data access.
  • Reranking and Filtering: Reranking, filtering, and scoping techniques to refine results and improve relevance.
  • Ingest Workflows: Seamless data ingestion and transformation for embedding and storage preparation.
  • Quality Assurance: Regular fine-tuning of retrieval processes to maintain accuracy.
  • Large Datasets: Document chunking and relevance ranking for extensive data handling.
  • Embedding Updates: Periodic refreshing of embeddings to ensure relevance.
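At its core, dense retrieval is a nearest-neighbor search over embeddings. A toy sketch with pure-Python cosine similarity (a real system would use a vector store and a proper embedding model; the tiny vectors below are placeholders):

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, store, top_k=2):
    """store: list of (text, embedding) pairs.
    Returns the top_k texts ranked by similarity to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:top_k]]
```

Reranking and filtering from the list above then operate on the candidates this first pass returns, trading a little latency for noticeably better relevance.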

Read more


Integrations

Yes, we aim to replace bytecode with tokens in most cases. However, integration with external systems remains one of the most time-consuming tasks, so it needs to be discussed as well.

  • APIs: REST, gRPC, GraphQL for standardized input/output interactions with AI models.
  • Webhooks: Real-time, event-driven communication between systems.
  • Two-Way Integrations: AI sending and retrieving relevant data in real time (e.g., chatbots accessing CRM data).
  • Data Synchronization: Consistent, up-to-date data through queues or pipelines.
  • Retries and Fallbacks: Failure management with retry mechanisms and default responses.
  • Error Handling: Input validation, error logging, and debugging alerts.
  • Performance Optimization: Batched API calls and caching for reduced latency.
  • Low-Code Platforms: Zapier, Make for workflow automation.
  • Data Integration Tools: Airbyte, Apache Kafka for streaming and event processing.
  • Authentication: API keys, OAuth2, two-factor authentication.
  • Encryption: Vaults, protocols, keys.
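Retries and fallbacks from the list above can be sketched as a small wrapper with exponential backoff (the function name and delays are illustrative; production code would also distinguish retryable errors from fatal ones instead of catching everything):

```python
import time

def call_with_retries(fn, max_attempts=3, base_delay=0.1, fallback=None):
    """Retry a flaky external call with exponential backoff; return a
    default response if every attempt fails."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:                 # broad for the sketch only
            if attempt == max_attempts - 1:
                return fallback           # graceful default response
            time.sleep(base_delay * 2 ** attempt)  # 0.1s, 0.2s, 0.4s, ...
```

The same shape works for CRM lookups, webhook deliveries, and model API calls alike; only the fallback differs per integration.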

Models

Before diving into prompt engineering, it's crucial to choose the right model for your needs:

  • Open Source: Is the model proprietary or open source?
  • Modality: Supported input/output types (e.g., text, images, audio).
  • Batching: Cost-efficient processing in the background (e.g., hours).
  • Caching: Support for response reuse and optimization.
  • Fine-Tuning: Support and ease.
  • Cost: Pricing structure and affordability.
  • Context Length: Maximum tokens supported per input/output.
  • SDK Frameworks: Availability of developer tools and APIs.
  • Ecosystem: Compatibility with libraries, plugins, or platforms.
  • Scaling & Throughput: Limits and quota.
  • Latency: Average response time (ms, sec, or min; compare o1 vs. Gemini Flash).
  • Built-in Tools: Features like reasoning, code interpretation, or search.

Every week, a new model emerges that surpasses all previous ones. So, just open Twitter (x.com) and follow OpenAI, Gemini, Anthropic, Mistral, Hugging Face, LLaMA, DeepSeek, Qwen, Gemma, and Phi.


Prompts

Prompt engineering comes down to personal experience and applying proven tips. LLMs are non-deterministic, meaning the same prompt can produce different results.

  • Prioritize Longform Data: Place detailed context at the start, instructions and examples at the bottom.
  • Prompt Chaining: Break tasks into steps (e.g., Extract → Transform → Analyze → Visualize).
    • Accuracy: Each step gets full attention.
    • Clarity: Simple tasks = clear outputs.
    • Traceability: Spot and fix issues easily.
  • Chain of Thought: Encourage step-by-step reasoning.
  • Multishot Prompting: Provide multiple examples for better learning.
  • Adopt a Persona: Specify the model’s role for focused responses.
  • Use Delimiters: Separate distinct input parts clearly.
  • Prompt Caching: Reuse prompts for efficiency.
  • Structured Outputs: Request organized formats like JSON or tables.
  • Directional Cues: Add hints, keywords, or formatting (e.g., JSON) to focus the LLM on the required problem.
  • ReAct Approach: Combine reasoning and action in problem-solving.
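Prompt chaining can be sketched as a few focused calls, each consuming the previous step's output (`llm` is a placeholder for any prompt-to-completion callable; the prompt wording is illustrative):

```python
def chain(llm, raw_text):
    """Prompt chaining: each step gets one simple instruction plus the
    previous step's output, so every step gets the model's full attention."""
    extracted = llm(f"Extract the key facts from:\n{raw_text}")       # Extract
    transformed = llm(f"Rewrite these facts as JSON:\n{extracted}")   # Transform
    analysis = llm(f"Summarize the main trend in:\n{transformed}")    # Analyze
    return analysis
```

Because each intermediate result is a plain string, you can log it, which is exactly the traceability benefit listed above: when the final answer is wrong, you can see which step went off the rails.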

Read more OpenAI Guide and Anthropic Guide and Prompt Engineering Guide


Security / PII

Basic hygiene to protect customers' data:

  • Minimizing PII by removing or masking sensitive data.
  • Trying pseudonymization techniques.
  • Monitoring access and usage logs for unauthorized activity.
  • Implementing real-time security breach alerts.
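PII minimization can start with simple pattern-based masking before text reaches a model or a log line. A minimal sketch (the regexes below catch only obvious emails and phone numbers; real systems use dedicated PII-detection tooling on top of this):

```python
import re

# Deliberately simple patterns -- they will miss edge cases.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def mask_pii(text):
    """Replace obvious PII with placeholder tokens so downstream
    prompts and logs never see the raw values."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text
```

Masking at the ingestion boundary is cheaper than scrubbing logs after the fact, and it pairs naturally with the pseudonymization techniques mentioned above.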

Read more


LLM Evaluations

Huge topic to discuss.

  • Performing regression testing and testing across different models.
  • LLM-as-a-judge: using one model to grade another model's outputs.
  • Evaluating model performance in live environments (e.g., helpfulness) and offline using established gold-standard datasets.
  • What specific metrics would you use to measure response accuracy in different contexts (e.g., question answering, summarization, dialogue)?
  • How do you balance competing evaluation objectives (e.g., accuracy vs. fluency, helpfulness vs. harmlessness)?
  • What are the advantages/disadvantages of different evaluation methods (human evaluation, automated metrics, adversarial testing)?
  • How would you detect context loss or contradictory statements across turns?
  • Efficiency and Performance: What specific metrics would you use to measure LLM efficiency, and how would you optimize for them in production? Consider latency, throughput, and memory usage.
  • Hallucination Detection: What specific techniques/tools would you use to detect hallucinations in LLMs? How would you distinguish between factual errors, creative interpretations, and genuine hallucinations?
  • Human Evaluation: What specific criteria would guide human evaluators assessing LLM output quality? How would you ensure inter-rater reliability and minimize subjective bias?
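An offline evaluation against a gold-standard dataset can start from something as basic as normalized exact match. A minimal sketch (real setups layer on task-specific metrics such as F1, BLEU, or rubric-based judging):

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions matching the gold answer after light
    normalization -- a baseline metric for QA-style regression tests."""
    def norm(s):
        return " ".join(s.lower().split())  # case- and whitespace-insensitive
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(references)
```

Running the same gold set against every candidate model (or every prompt revision) is what turns "it feels better" into a number you can track over time.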

Read more


Observability

The first step is recognizing that an issue or hallucination exists. Then, you need to find the root cause, troubleshoot, and ensure it's resolved.

Your application/framework should send all necessary information to an observability platform. This includes:

  • Metrics: Performance and cost data.
  • Alerting: Automated alerts for performance issues or downtime.
  • Logs and Traces: To help identify hallucinations and analyze prompt and response variance.
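Instrumenting an agent for an observability platform can start with a decorator that records spans. A minimal sketch (the field names are illustrative, and `log` is a stand-in for a platform SDK that would export traces for you):

```python
import functools
import time

def traced(log):
    """Decorator that records each call's name, latency, and output size
    so latency spikes and odd responses can be traced back to a span."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            log.append({
                "span": fn.__name__,
                "latency_s": time.perf_counter() - start,
                "output_chars": len(str(result)),
            })
            return result
        return inner
    return wrap
```

Wrapping every model call and tool call this way gives you the raw material for the metrics, alerts, and trace analysis described above.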

Compare observability platforms, their SDKs, integrations, additional features, and cost.

Read more


Final Thoughts

Chatbot and AI agent development is no longer just about math and NLP. Today's best AI developers are:

  • Versatile Problem-Solvers: They combine business sense with technical skills and learn quickly.
  • Good Communicators: They work well with others and think critically.
  • Practical Technologists: They know how to use existing models and tools efficiently.

Use the provided topics to determine whether your candidate is a good fit. Good luck!
