A Promising Methodology for Testing GenAI Applications in Java

In the vast universe of programming, the era of generative artificial intelligence (GenAI) has marked a turning point, opening up a plethora of possibilities for developers.

Tools such as LangChain4j and Spring AI have democratized access to the creation of GenAI applications in Java, allowing Java developers to dive into this fascinating world. With LangChain4j, for instance, setting up and interacting with large language models (LLMs) has become exceptionally straightforward. Consider the following Java code snippet:

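import dev.langchain4j.model.openai.OpenAiChatModel;

// The wrapper class name is only illustrative.
public class HelloLlm {
    public static void main(String[] args) {
        var llm = OpenAiChatModel.builder()
                .apiKey("demo")
                .modelName("gpt-3.5-turbo")
                .build();
        // Send a single prompt and print the generated reply.
        System.out.println(llm.generate("Hello, how are you?"));
    }
}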

This example illustrates how a developer can quickly instantiate an LLM within a Java application. By simply configuring the model with an API key and specifying the model name, developers can begin generating text responses immediately. This accessibility is pivotal for fostering innovation and exploration within the Java community. More than that, we have a wide range of models that can be run locally, and various vector databases for storing embeddings and performing semantic searches, among other technological marvels.

Despite this progress, however, we are faced with a persistent challenge: the difficulty of testing applications that incorporate artificial intelligence. This aspect seems to be a field where there is still much to explore and develop.

In this article, I will share a methodology that I find promising for testing GenAI applications.

Project overview

The example project focuses on an application that provides an API for interacting with two AI agents capable of answering questions. 

An AI agent is a software entity designed to perform tasks autonomously, using artificial intelligence to simulate human-like interactions and responses. 

In this project, one agent uses direct knowledge already contained within the LLM, while the other leverages internal documentation to enrich the LLM through retrieval-augmented generation (RAG). This approach allows the agents to provide precise and contextually relevant answers based on the input they receive.

I prefer to omit the technical details about RAG, as ample information is available elsewhere. I’ll simply note that this example employs a particular variant of RAG, which simplifies the traditional process of generating and storing embeddings for information retrieval.

Instead of dividing documents into chunks and making embeddings of those chunks, in this project, we will use an LLM to generate a summary of the documents. The embedding is generated based on that summary.

When the user writes a question, an embedding of the question will be generated and a semantic search will be performed against the embeddings of the summaries. If a match is found, the user’s message will be augmented with the original document.

This way, there’s no need to deal with the configuration of document chunks, worry about setting the number of chunks to retrieve, or worry about whether the way of augmenting the user’s message makes sense. If there is a document that talks about what the user is asking, it will be included in the message sent to the LLM.
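The retrieval flow described above can be sketched in plain Java. This is a minimal illustration, not the project's code: the class and method names, the `minScore` cut-off, and the wording used to augment the message are all assumptions, and the embeddings are passed in as ready-made vectors of equal length instead of being produced by a real embedding model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Sketch of summary-based retrieval. All names here are hypothetical, not the project's.
class SummaryRetrievalSketch {

    // One entry per document: the full text plus the embedding of its LLM-generated summary.
    record IndexedDocument(String originalText, double[] summaryEmbedding) {}

    private final List<IndexedDocument> index = new ArrayList<>();
    private final double minScore; // assumed cut-off for "a match is found"

    SummaryRetrievalSketch(double minScore) {
        this.minScore = minScore;
    }

    void add(String originalText, double[] summaryEmbedding) {
        index.add(new IndexedDocument(originalText, summaryEmbedding));
    }

    // Returns the original document whose summary is closest to the question, if close enough.
    Optional<String> findDocument(double[] questionEmbedding) {
        IndexedDocument best = null;
        double bestScore = minScore;
        for (IndexedDocument doc : index) {
            double score = cosine(questionEmbedding, doc.summaryEmbedding());
            if (score >= bestScore) {
                best = doc;
                bestScore = score;
            }
        }
        return Optional.ofNullable(best).map(IndexedDocument::originalText);
    }

    // Appends the whole matching document to the user's message; no chunking involved.
    String augment(String question, double[] questionEmbedding) {
        return findDocument(questionEmbedding)
                .map(doc -> question + "\n\nAnswer using this document:\n" + doc)
                .orElse(question);
    }

    // Cosine similarity between two vectors of the same length.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Because the search runs against one summary embedding per document, the only tunable value in this sketch is the match cut-off; there is no chunk size or number of chunks to configure.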

Technical stack

The project is developed in Java and utilizes a Spring Boot application with Testcontainers and LangChain4j.

For setting up the project, I followed the steps outlined in Local Development Environment with Testcontainers and Spring Boot Application Testing and Development with Testcontainers.

I also use Testcontainers Desktop to facilitate database access, verify the generated embeddings, and review the container logs.

The challenge of testing

The real challenge arises when trying to test the responses generated by language models. Traditionally, we could settle for verifying that the response includes certain keywords, which is insufficient and prone to errors.

static String question = "How I can install Testcontainers Desktop?";

@Test
void verifyRaggedAgentSucceedToAnswerHowToInstallTCD() {
    String answer = restTemplate.getForObject("/chat/rag?question={question}", ChatController.ChatResponse.class, question).message();
    assertThat(answer).contains("https://testcontainers.com/desktop/");
}

This approach is not only fragile but also lacks the ability to assess the relevance or coherence of the response.
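To make the fragility concrete, here is a small sketch of the same keyword check applied to two answers. Both answers are hypothetical strings written for illustration, not real model output, and the class and method names are assumptions.

```java
// Hypothetical answers, written for illustration; they are not real model output.
class KeywordCheckSketch {

    static final String EXPECTED_URL = "https://testcontainers.com/desktop/";

    // The same kind of check as the contains() assertion in the test above.
    static boolean passesKeywordCheck(String answer) {
        return answer.contains(EXPECTED_URL);
    }

    public static void main(String[] args) {
        // Misleading answer that happens to include the exact URL.
        String misleading = "You cannot install Testcontainers Desktop; ignore https://testcontainers.com/desktop/.";
        // Helpful answer that writes the address slightly differently.
        String helpful = "Download the installer from the Testcontainers Desktop page at testcontainers.com/desktop.";

        System.out.println(passesKeywordCheck(misleading)); // prints true
        System.out.println(passesKeywordCheck(helpful));    // prints false
    }
}
```

The check accepts the misleading answer and rejects the helpful one, which is exactly the kind of false positive and false negative a keyword assertion cannot avoid.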

An alternative is to employ cosine similarity to compare the embeddings of a “reference” response and the actual response, providing a more semantic form of evaluation. 

This method measures the similarity between two vectors (embeddings) by calculating the cosine of the angle between them. A value close to 1 means both vectors point in nearly the same direction, which suggests that the “reference” response and the actual response are semantically close.
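For reference, the cosine computation itself is short. The following is a minimal sketch on raw vectors; the class and method names are assumptions, and the step that turns each text into an embedding (which a helper taking two strings would need to do first) is left out.

```java
// Sketch of cosine similarity on two embedding vectors. Names are illustrative only.
class CosineSimilaritySketch {

    static double cosineSimilarity(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            throw new IllegalArgumentException("Cosine similarity is undefined for a zero vector");
        }
        // dot(a, b) / (|a| * |b|): 1 for the same direction, 0 for orthogonal, -1 for opposite.
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Note that the result depends only on direction, not on vector length, so `{1, 2, 3}` and `{2, 4, 6}` score as identical.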

static String question = "How I can install Testcontainers Desktop?";
static String reference = """
    - Answer must indicate to download Testcontainers Desktop from https://testcontainers.com/desktop/
    - Answer must indicate to use brew to install Testcontainers Desktop in MacOS
    - Answer must be less than 5 sentences
    """;

@Test
void verifyRaggedAgentSucceedToAnswerHowToInstallTCD() {
    String answer = restTemplate.getForObject("/chat/rag?question={question}", ChatController.ChatResponse.class, question).message();
    double cosineSimilarity = getCosineSimilarity(reference, answer);
    assertThat(cosineSimilarity).isGreaterThan(0.8);
}

However, this method introduces the problem of selecting an appropriate threshold to determine the acceptability of the response, in addition to the opacity of the evaluation process.

Toward a more effective method

The real problem here arises from the fact that answers provided by the LLM are in natural language and non-deterministic. Because of this, using current testing methods to verify them is difficult, as these methods are better suited to testing predictable values. 

However, we already have a great tool for understanding non-deterministic answers in natural language: LLMs themselves. Thus, the key may lie in using one LLM to evaluate the adequacy of responses generated by another LLM. 

This proposal involves defining detailed validation criteria and using an LLM as a “Validator Agent” to determine if the responses meet the specified requirements. This approach can be applied to validate answers to specific questions, drawing on both general knowledge and specialized information.

By incorporating detailed instructions and examples, the Validator Agent can provide accurate and justified evaluations, offering clarity on why a response is considered correct or incorrect.

static String question = "How I can install Testcontainers Desktop?";
static String reference = """
    - Answer must indicate to download Testcontainers Desktop from https://testcontainers.com/desktop/
    - Answer must indicate to use brew to install Testcontainers Desktop in MacOS
    - Answer must be less than 5 sentences
    """;

@Test
void verifyStraightAgentFailsToAnswerHowToInstallTCD() {
    String answer = restTemplate.getForObject("/chat/straight?question={question}", ChatController.ChatResponse.class, question).message();
    ValidatorAgent.ValidatorResponse validate = validatorAgent.validate(question, answer, reference);
    assertThat(validate.response()).isEqualTo("no");
}

@Test
void verifyRaggedAgentSucceedToAnswerHowToInstallTCD() {
    String answer = restTemplate.getForObject("/chat/rag?question={question}", ChatController.ChatResponse.class, question).message();
    ValidatorAgent.ValidatorResponse validate = validatorAgent.validate(question, answer, reference);
    assertThat(validate.response()).isEqualTo("yes");
}

We can even test more complex responses where the LLM should suggest a better alternative to the user’s question.

static String question = "How I can find the random port of a Testcontainer to connect to it?";
static String reference = """
    - Answer must not mention using getMappedPort() method to find the random port of a Testcontainer
    - Answer must mention that you don't need to find the random port of a Testcontainer to connect to it
    - Answer must indicate that you can use the Testcontainers Desktop app to configure fixed port
    - Answer must be less than 5 sentences
    """;

@Test
void verifyRaggedAgentSucceedToAnswerHowToDebugWithTCD() {
    String answer = restTemplate.getForObject("/chat/rag?question={question}", ChatController.ChatResponse.class, question).message();
    ValidatorAgent.ValidatorResponse validate = validatorAgent.validate(question, answer, reference);
    assertThat(validate.response()).isEqualTo("yes");
}

Validator Agent

The configuration for the Validator Agent doesn’t differ from that of other agents. It is built using the LangChain4j AI Service and a list of specific instructions:

import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;
import dev.langchain4j.service.V;

public interface ValidatorAgent {

    @SystemMessage("""
        ### Instructions
        You are a strict validator.
        You will be provided with a question, an answer, and a reference.
        Your task is to validate whether the answer is correct for the given question, based on the reference.

        Follow these instructions:
        - Respond only 'yes', 'no' or 'unsure' and always include the reason for your response
        - Respond with 'yes' if the answer is correct
        - Respond with 'no' if the answer is incorrect
        - If you are unsure, simply respond with 'unsure'
        - Respond with 'no' if the answer is not clear or concise
        - Respond with 'no' if the answer is not based on the reference

        Your response must be a json object with the following structure:
        {
            "response": "yes",
            "reason": "The answer is correct because it is based on the reference provided."
        }

        ### Example
        Question: Is Madrid the capital of Spain?
        Answer: No, it's Barcelona.
        Reference: The capital of Spain is Madrid
        ###
        Response: {
            "response": "no",
            "reason": "The answer is incorrect because the reference states that the capital of Spain is Madrid."
        }
        """)
    @UserMessage("""
        ###
        Question: {{question}}
        ###
        Answer: {{answer}}
        ###
        Reference: {{reference}}
        ###
        """)
    ValidatorResponse validate(@V("question") String question, @V("answer") String answer, @V("reference") String reference);

    // The JSON verdict returned by the LLM is mapped onto this record.
    record ValidatorResponse(String response, String reason) {}
}

As you can see, I’m using Few-Shot Prompting to guide the LLM on the expected responses. I also request a JSON format for responses to facilitate parsing them into objects, and I specify that the reason for the answer must be included, to better understand the basis of its verdict.
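Since the validator always returns a reason, a test can surface that reason when it fails. The following is a minimal sketch of one way to do that in plain Java; `VerdictSketch`, `Verdict`, and `assertVerdict` are hypothetical names, and the nested `ValidatorResponse` record only mirrors the one defined above so the example is self-contained.

```java
import java.util.Locale;

// Sketch of verdict handling. All names are hypothetical and not part of LangChain4j.
class VerdictSketch {

    // Mirrors the ValidatorResponse record from the Validator Agent.
    record ValidatorResponse(String response, String reason) {}

    enum Verdict {
        YES, NO, UNSURE;

        // Tolerates surrounding whitespace and different casing in the LLM output.
        static Verdict parse(String raw) {
            String normalized = raw == null ? "" : raw.trim().toLowerCase(Locale.ROOT);
            return switch (normalized) {
                case "yes" -> YES;
                case "no" -> NO;
                case "unsure" -> UNSURE;
                default -> throw new IllegalArgumentException("Unexpected verdict: " + raw);
            };
        }
    }

    // Fails with the validator's own explanation, so the test report shows why.
    static void assertVerdict(Verdict expected, ValidatorResponse actual) {
        Verdict verdict = Verdict.parse(actual.response());
        if (verdict != expected) {
            throw new AssertionError(
                    "Expected " + expected + " but validator said " + verdict + ": " + actual.reason());
        }
    }
}
```

In this sketch an 'unsure' verdict never equals an expected YES or NO, so it fails the test with its reason attached; whether to fail, retry, or skip in that case is a design choice left to the test author.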

Conclusion

The evolution of GenAI applications brings with it the challenge of developing testing methods that can effectively evaluate the complexity and subtlety of responses generated by advanced artificial intelligences. 

The proposal to use an LLM as a Validator Agent represents a promising approach, paving the way towards a new era of software development and evaluation in the field of artificial intelligence. Over time, we hope to see more innovations that allow us to overcome the current challenges and maximize the potential of these transformative technologies.

Learn more

Check out the GenAI Stack to get started with adding AI to your apps. 


Source: https://blog.docker.com/feed/

AWS HealthOmics announces support for reading sequence stores via Amazon S3 APIs

We are pleased to announce that AWS HealthOmics now supports reading sequence store objects using Amazon S3 APIs. AWS HealthOmics is a fully managed service that enables healthcare and life sciences organizations to store, query, and analyze omics data and to generate insights that improve health and drive scientific discovery. With this release, customers can more easily integrate HealthOmics data stores into their bioinformatics systems while benefiting from domain-specific metadata, cost savings, and scalability.
Source: aws.amazon.com

Introducing Phi-3: Redefining what’s possible with SLMs

We are excited to introduce Phi-3, a family of open AI models developed by Microsoft. Phi-3 models are the most capable and cost-effective small language models (SLMs) available, outperforming models of the same size and next size up across a variety of language, reasoning, coding, and math benchmarks. This release expands the selection of high-quality models for customers, offering more practical choices as they compose and build generative AI applications.

Starting today, Phi-3-mini, a 3.8B language model, is available on Microsoft Azure AI Studio, Hugging Face, and Ollama.

Phi-3-mini is available in two context-length variants—4K and 128K tokens. It is the first model in its class to support a context window of up to 128K tokens, with little impact on quality.

It is instruction-tuned, meaning that it’s trained to follow different types of instructions reflecting how people normally communicate. This ensures the model is ready to use out-of-the-box.

It is available on Azure AI to take advantage of the deploy-eval-finetune toolchain, and is available on Ollama for developers to run locally on their laptops.

It has been optimized for ONNX Runtime with support for Windows DirectML along with cross-platform support across graphics processing unit (GPU), CPU, and even mobile hardware.

It is also available as an NVIDIA NIM microservice with a standard API interface that can be deployed anywhere, and it has been optimized for NVIDIA GPUs.

In the coming weeks, additional models will be added to the Phi-3 family to offer customers even more flexibility across the quality-cost curve. Phi-3-small (7B) and Phi-3-medium (14B) will be available in the Azure AI model catalog and other model gardens shortly.

Microsoft continues to offer the best models across the quality-cost curve and today’s Phi-3 release expands the selection of models with state-of-the-art small models.


Groundbreaking performance at a small size 

Phi-3 models significantly outperform language models of the same and larger sizes on key benchmarks (see benchmark numbers below, higher is better). Phi-3-mini does better than models twice its size, and Phi-3-small and Phi-3-medium outperform much larger models, including GPT-3.5T.  

All reported numbers are produced with the same pipeline to ensure that the numbers are comparable. As a result, these numbers may differ from other published numbers due to slight differences in the evaluation methodology. More details on benchmarks are provided in our technical paper. 

Note: Phi-3 models do not perform as well on factual knowledge benchmarks (such as TriviaQA) as the smaller model size results in less capacity to retain facts. 

Safety-first model design 


Phi-3 models were developed in accordance with the Microsoft Responsible AI Standard, which is a company-wide set of requirements based on the following six principles: accountability, transparency, fairness, reliability and safety, privacy and security, and inclusiveness. Phi-3 models underwent rigorous safety measurement and evaluation, red-teaming, sensitive use review, and adherence to security guidance to help ensure that these models are responsibly developed, tested, and deployed in alignment with Microsoft’s standards and best practices.  

Building on our prior work with Phi models (“Textbooks Are All You Need”), Phi-3 models are also trained using high-quality data. They were further improved with extensive safety post-training, including reinforcement learning from human feedback (RLHF), automated testing and evaluations across dozens of harm categories, and manual red-teaming. Our approach to safety training and evaluations is detailed in our technical paper, and we outline recommended uses and limitations in the model cards. See the model card collection.

Unlocking new capabilities 

Microsoft’s experience shipping copilots and enabling customers to transform their businesses with generative AI using Azure AI has highlighted the growing need for different-size models across the quality-cost curve for different tasks. Small language models, like Phi-3, are especially great for: 

Resource-constrained environments, including on-device and offline inference scenarios.

Latency-bound scenarios where fast response times are critical.

Cost-constrained use cases, particularly those with simpler tasks.

For more on small language models, see our Microsoft Source Blog.

Thanks to their smaller size, Phi-3 models can be used in compute-limited inference environments. Phi-3-mini, in particular, can be used on-device, especially when further optimized with ONNX Runtime for cross-platform availability. The smaller size of Phi-3 models also makes fine-tuning or customization easier and more affordable. In addition, their lower computational needs make them a lower cost option with much better latency. The longer context window enables taking in and reasoning over large text content—documents, web pages, code, and more. Phi-3-mini demonstrates strong reasoning and logic capabilities, making it a good candidate for analytical tasks. 

Customers are already building solutions with Phi-3. One example where Phi-3 is already demonstrating value is in agriculture, where internet might not be readily accessible. Powerful small models like Phi-3 along with Microsoft copilot templates are available to farmers at the point of need and provide the additional benefit of running at reduced cost, making AI technologies even more accessible.  

ITC, a leading business conglomerate based in India, is leveraging Phi-3 as part of their continued collaboration with Microsoft on the copilot for Krishi Mitra, a farmer-facing app that reaches over a million farmers.

“Our goal with the Krishi Mitra copilot is to improve efficiency while maintaining the accuracy of a large language model. We are excited to partner with Microsoft on using fine-tuned versions of Phi-3 to meet both our goals—efficiency and accuracy!”   
Saif Naik, Head of Technology, ITCMAARS

Originating in Microsoft Research, Phi models have been broadly used, with Phi-2 downloaded over 2 million times. The Phi series has achieved remarkable performance through strategic data curation and innovative scaling. It progressed from Phi-1, a model used for Python coding, to Phi-1.5, which enhanced reasoning and understanding, and then to Phi-2, a 2.7 billion-parameter model that outperforms models up to 25 times its size in language comprehension (see footnote 1). Each iteration has leveraged high-quality training data and knowledge transfer techniques to challenge conventional scaling laws.

Get started today 

To experience Phi-3 for yourself, start by trying the model in the Azure AI Playground. You can also find the model on the Hugging Chat playground. Start building with and customizing Phi-3 for your scenarios using Azure AI Studio. Join us to learn more about Phi-3 during a special live stream of the AI Show.

1 Microsoft Research Blog, Phi-2: The surprising power of small language models, December 12, 2023.
The post Introducing Phi-3: Redefining what’s possible with SLMs appeared first on Azure Blog.
Source: Azure