Welcome to the Data Engineering category of DZone, where you will find all the information you need for AI/ML, big data, data, databases, and IoT. As you determine the first steps for new systems or reevaluate existing ones, you're going to require tools and resources to gather, store, and analyze data. The Zones within our Data Engineering category contain resources that will help you expertly navigate through the SDLC Analysis stage.
Artificial intelligence (AI) and machine learning (ML) are two fields that work together to create computer systems capable of perception, recognition, decision-making, and translation. Separately, AI is the ability for a computer system to mimic human intelligence through math and logic, and ML builds off AI by developing methods that "learn" through experience and do not require instruction. In the AI/ML Zone, you'll find resources ranging from tutorials to use cases that will help you navigate this rapidly growing field.
Big data comprises datasets that are massive, varied, complex, and can't be handled traditionally. Big data can include both structured and unstructured data, and it is often stored in data lakes or data warehouses. As organizations grow, big data becomes increasingly more crucial for gathering business insights and analytics. The Big Data Zone contains the resources you need for understanding data storage, data modeling, ELT, ETL, and more.
Data is at the core of software development. Think of it as information stored in anything from text documents and images to entire software programs, and these bits of information need to be processed, read, analyzed, stored, and transported throughout systems. In this Zone, you'll find resources covering the tools and strategies you need to handle data properly.
A database is a collection of structured data that is stored in a computer system, and it can be hosted on-premises or in the cloud. As databases are designed to enable easy access to data, our resources are compiled here for smooth browsing of everything you need to know from database management systems to database languages.
IoT, or the Internet of Things, is a technological field that makes it possible for users to connect devices and systems and exchange data over the internet. Through DZone's IoT resources, you'll learn about smart devices, sensors, networks, edge computing, and many other technologies — including those that are now part of the average person's daily life.
Enterprise AI
In recent years, artificial intelligence has become less of a buzzword and more of an adopted process across the enterprise. With that comes a growing need to increase operational efficiency and keep up with rising customer demands. AI platforms have become increasingly sophisticated, creating a need to establish guidelines and ownership. In DZone's 2022 Enterprise AI Trend Report, we explore MLOps, explainability, and how to select the best AI platform for your business. We also share a tutorial on how to create a machine learning service using Spring Boot and how to deploy AI with an event-driven platform. The goal of this Trend Report is to better inform the developer audience about practical tools and design paradigms, new technologies, and the overall operational impact of AI within the business. This is a technology space that's constantly shifting and evolving. As part of our December 2022 re-launch, we've added new articles pertaining to knowledge graphs, a solutions directory for popular AI tools, and more.
Kafka Event Streaming AI and Automation
The Wiz Research team recently discovered that an overprovisioned SAS token had been lying exposed on GitHub for nearly three years. This token granted access to a massive 38-terabyte trove of private data. This Azure storage contained additional secrets, such as private SSH keys, hidden within the disk backups of two Microsoft employees. This revelation underscores the importance of robust data security measures.

What Happened?

Wiz Research disclosed a data exposure incident found on Microsoft's AI GitHub repository on June 23, 2023. The researchers managing the repository used an Azure Storage sharing feature, an SAS token, to give access to a bucket of open-source AI training data. This token was misconfigured, granting access to the account's entire cloud storage rather than the intended bucket. The storage comprised 38TB of data, including a disk backup of two employees' workstations with secrets, private keys, passwords, and more than 30,000 internal Microsoft Teams messages.

SAS (Shared Access Signature) tokens are signed URLs for sharing Azure Storage resources. They are configured with fine-grained controls over how a client can access the data: which resources are exposed (full account, container, or selection of files), with what permissions, and for how long. See the Azure Storage documentation.

After the incident was disclosed to Microsoft, the SAS token was invalidated. From its first commit to GitHub (July 20, 2020) to its revocation, nearly three years elapsed, as shown in the timeline presented by the Wiz Research team.

Data Exposure

As emphasized by the Wiz Research team, the root cause was a misconfigured Shared Access Signature (SAS). The token allowed anyone to access an additional 38TB of data, including sensitive data such as secret keys, personal passwords, and over 30,000 internal Microsoft Teams messages from hundreds of Microsoft employees. The researchers shared an excerpt of some of the most sensitive data they recovered. As they highlighted, this could have allowed an attacker to inject malicious code into the storage blob that would then execute automatically with every download by a user (presumably an AI researcher) trusting Microsoft's reputation, potentially leading to a supply chain attack.

Security Risks

According to the researchers, account SAS tokens such as the one presented in their research pose a high security risk because they are highly permissive, long-lived tokens that escape the monitoring perimeter of administrators. When a user generates a new token, it is signed by the browser and doesn't trigger any Azure event. To revoke a token, an administrator needs to rotate the signing account key, thereby revoking all the other tokens at once. Ironically, the security risk of a Microsoft product feature (Azure SAS tokens) caused an incident for a Microsoft research team, a risk recently referenced in the second version of the Microsoft threat matrix for storage services.

Secrets Sprawl

This example perfectly underscores the pervasive issue of secrets sprawl within organizations, even those with advanced security measures. It highlights how an AI research team, or any data team, can independently create tokens that could potentially jeopardize the organization and sidestep the security safeguards designed to shield the environment.
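To make the scoping issue concrete before turning to mitigations, here is a minimal sketch, assuming the Azure Storage Blob SDK for Java (com.azure:azure-storage-blob), of generating a SAS that is limited to a single container, read/list only, and short-lived, instead of covering the whole account. The container name and environment variable are placeholders, and the exact class and builder names should be verified against the current SDK documentation.

Java
import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.BlobServiceClientBuilder;
import com.azure.storage.blob.sas.BlobContainerSasPermission;
import com.azure.storage.blob.sas.BlobServiceSasSignatureValues;

import java.time.OffsetDateTime;

public class ScopedSasExample {

    public static void main(String[] args) {
        // Assumes a connection string in an environment variable; never hard-code credentials.
        BlobContainerClient container = new BlobServiceClientBuilder()
                .connectionString(System.getenv("AZURE_STORAGE_CONNECTION_STRING"))
                .buildClient()
                .getBlobContainerClient("public-ai-training-data"); // hypothetical container name

        // Read/list only, expiring in 7 days: far narrower than a full-account, long-lived token.
        BlobContainerSasPermission permission = new BlobContainerSasPermission()
                .setReadPermission(true)
                .setListPermission(true);
        BlobServiceSasSignatureValues sasValues =
                new BlobServiceSasSignatureValues(OffsetDateTime.now().plusDays(7), permission);

        // The returned string contains only the SAS query parameters for this one container.
        String sasQueryParameters = container.generateSas(sasValues);
        System.out.println(container.getBlobContainerUrl() + "?" + sasQueryParameters);
    }
}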
Mitigation Strategies

For Azure Storage users:

1 - Avoid account SAS tokens. The lack of monitoring makes this feature a security hole in your perimeter. A better way to share data externally is using a service SAS with a stored access policy, which binds a SAS token to a policy and provides the ability to manage token policies centrally. Better still, if you don't need this Azure Storage sharing feature, simply disable SAS access for each account you own.

2 - Enable Azure Storage Analytics. Active SAS token usage can be monitored through the Storage Analytics logs for each of your storage accounts. Azure Metrics allows the monitoring of SAS-authenticated requests and identifies storage accounts that have been accessed through SAS tokens, for up to 93 days.

For all:

1 - Audit your GitHub perimeter for sensitive credentials. With around 90 million developer accounts, 300 million hosted repositories, and 4 million active organizations, including 90% of Fortune 100 companies, GitHub presents a much larger attack surface than meets the eye. Last year, GitGuardian uncovered 10 million leaked secrets on public repositories, up 67% from the previous year. GitHub must be actively monitored as part of any organization's security perimeter. Incidents involving leaked credentials on the platform continue to cause massive breaches for large companies, and this hole in Microsoft's protective shell is reminiscent of the Toyota data breach from a year ago. On October 7, 2022, Toyota, the Japan-based automotive manufacturer, revealed it had accidentally exposed a credential allowing access to customer data in a public GitHub repository for nearly five years: the code was public from December 2017 through September 2022. If your company has development teams, it is likely that some of your company's secrets (API keys, tokens, passwords) have ended up on public GitHub. It is therefore highly recommended to audit your GitHub attack surface as part of your attack surface management program.

Final Words

Every organization, regardless of size, needs to be prepared to tackle a wide range of emerging risks. These risks often stem from insufficient monitoring of extensive software operations within today's modern enterprises. In this case, an AI research team inadvertently created and exposed a misconfigured cloud storage sharing link, bypassing security guardrails. But how many other departments - support, sales, operations, or marketing - could find themselves in a similar situation? The increasing dependence on software, data, and digital services amplifies cyber risks on a global scale. Combatting the spread of confidential information and its associated risks necessitates reevaluating security teams' oversight and governance capabilities.
In the ever-evolving world of AI and Natural Language Processing (NLP), large language models and generative AI have become powerful tools for various applications. Achieving the desired results from these models involves different approaches that can be broadly classified into three categories: prompt engineering, fine-tuning, and creating a new model. As we progress from one level to the next, the requirements in terms of resources and costs increase significantly. In this blog post, we'll explore these approaches and focus on an efficient technique known as Parameter Efficient Fine-Tuning (PEFT) that allows us to fine-tune models with minimal infrastructure while maintaining high performance.

Prompt Engineering with Existing Models

At the basic level, achieving expected outcomes from large language models involves careful prompt engineering: crafting suitable prompts and inputs to elicit the desired responses from the model. Prompt engineering is an essential technique for various use cases, especially when general responses suffice.

Creating a New Model

At the highest level, creating a new model involves training a model from scratch, specifically tailored for a particular task or domain. This approach provides the highest level of customization, but it demands substantial computational power, extensive data, and time.

Fine-Tuning Existing Models

When dealing with domain-specific use cases that require model adaptation, fine-tuning becomes essential. Fine-tuning allows us to leverage existing pre-trained foundation models and adapt them to specific tasks or domains. By training the model on domain-specific data, we can tailor it to perform well on targeted tasks. However, this process can be resource-intensive and costly, as we would be modifying all of the model's millions of parameters during training. Full fine-tuning requires a lot of training data, substantial infrastructure, and effort, and it carries a risk of catastrophic forgetting, where previously acquired knowledge from pretraining is lost. Applying complete fine-tuning to a single model for different domain-specific tasks often results in large, task-specific models that lack modularity. What we require is a modular approach that avoids altering all parameters while demanding fewer infrastructure resources and less data. Techniques such as Parameter Efficient Fine-Tuning (PEFT) provide a way to perform modular fine-tuning with optimal resources and cost.

Parameter Efficient Fine-Tuning (PEFT)

PEFT is a technique designed to fine-tune models while minimizing the need for extensive resources and cost. It is a great choice when dealing with domain-specific tasks that necessitate model adaptation. By employing PEFT, we can strike a balance between retaining valuable knowledge from the pre-trained model and adapting it effectively to the target task with fewer parameters. Among the various ways of achieving parameter-efficient fine-tuning, Low-Rank Adaptation (LoRA) and its quantized variant, QLoRA, are the most widely used and effective.

Low-Rank Parameters

This is one of the most widely used methods: a set of parameters is added modularly to the network in a lower-dimensional space. Instead of modifying the whole network, only this modular low-rank network is modified to achieve the results.
Let's take a deep dive into the two most popular techniques: LoRA and QLoRA.

Low-Rank Adaptation (LoRA)

Low-Rank Adaptation provides a modular approach to fine-tuning a model for domain-specific tasks and provides the capability of transfer learning. The LoRA technique can be implemented with fewer resources and is memory efficient. In the following picture, you can see the dimension/rank decomposition, which reduces the memory footprint considerably. We apply this by augmenting the existing feed-forward networks with a LoRA adapter: the original feed-forward networks are frozen, and the LoRA network is used for training. Refer to the picture below for more details.

LoRA can be implemented as an adapter designed to enhance and expand the existing neural network layers. It introduces an additional layer of trainable parameters (weights) while maintaining the original parameters in a frozen state. These trainable parameters possess a substantially reduced rank (dimension) compared to the dimensions of the original network. This is the mechanism through which LoRA simplifies and expedites the process of adapting the original models for domain-specific tasks. Now, let's take a closer look at the components within the LoRA adapter network.

The pre-trained parameters of the original model (W) are frozen; during training, these weights are not modified. A new set of parameters, WA and WB, is added to the network concurrently. These networks use low-rank weight vectors with dimensions d x r and r x d, where 'd' stands for the dimension of the original frozen network's parameter vector and 'r' signifies the chosen low rank, or lower dimension. The value of 'r' is always smaller, and the smaller the 'r', the more expedited and simplified the model training process becomes. Determining the appropriate value for 'r' is a pivotal decision in LoRA. Opting for a lower value results in faster and more cost-effective model training, though it may not yield optimal results. Conversely, selecting a higher value for 'r' extends the training time and cost but enhances the model's capability to handle more complex tasks.

The outputs of the original network and the low-rank network are combined (the product of WA and WB is added to the frozen weights), resulting in a weight matrix of the original dimension that is used to generate the result. During training, this result is compared with the expected output to calculate the loss, and the WA and WB weights are adjusted through backpropagation, as in standard neural networks.

Let's explore how this approach contributes to the reduction of the memory footprint and minimizes infrastructure requirements. Consider a scenario where we have a 512 x 512 parameter matrix within the feed-forward network, amounting to a total of 262,144 parameters that would need to undergo training. If we freeze these parameters during the training process and introduce a LoRA adapter with a rank of 2, the outcome is as follows: WA will have 512 x 2 parameters, and WB will also have 512 x 2 parameters, summing up to a total of 2,048 trainable parameters. These are the only parameters that undergo training with domain-specific data. This represents a significant enhancement in computational efficiency, substantially reducing the number of computations required during backpropagation, and it is pivotal in achieving accelerated training.
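To make the parameter arithmetic concrete, here is a small, self-contained Java sketch (my own illustration, not code from the article) that reproduces the 512 x 512 / rank-2 count and shows the shape of the LoRA forward pass. The class, method names, and the toy matrix-vector helpers are assumptions made purely for illustration.

Java
public class LoraParameterCount {

    public static void main(String[] args) {
        int d = 512;  // dimension of the frozen weight matrix W (d x d)
        int r = 2;    // chosen low rank

        long frozenParams = (long) d * d;   // 262,144 parameters, frozen and never updated
        long loraParams = 2L * d * r;       // WA plus WB: 2,048 trainable parameters

        System.out.printf("Frozen W: %,d parameters%n", frozenParams);
        System.out.printf("LoRA adapter (rank %d): %,d trainable parameters (%.2f%% of W)%n",
                r, loraParams, 100.0 * loraParams / frozenParams);
    }

    // Toy forward pass: output = W*x + WB*(WA*x).
    // WA projects the d-dimensional input down to rank r; WB projects back up to d.
    // Only WA and WB receive gradient updates during training.
    static double[] forward(double[][] w, double[][] wa, double[][] wb, double[] x) {
        return add(matVec(w, x), matVec(wb, matVec(wa, x)));
    }

    static double[] matVec(double[][] m, double[] v) {
        double[] out = new double[m.length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < v.length; j++)
                out[i] += m[i][j] * v[j];
        return out;
    }

    static double[] add(double[] a, double[] b) {
        double[] out = new double[a.length];
        for (int i = 0; i < a.length; i++) out[i] = a[i] + b[i];
        return out;
    }
}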
The most advantageous aspect of this approach is that the trained LoRA adapter can be preserved independently and employed as a distinct module. By constructing domain-specific modules in this manner, we effectively achieve a high level of modularity. Additionally, by refraining from altering the original weights, we successfully circumvent the issue of catastrophic forgetting. Now, let's delve into further enhancements that can be implemented on top of LoRA, particularly through the utilization of QLoRA, in order to elevate the optimization to the next level.

Quantized Low-Rank Adaptation (QLoRA)

QLoRA extends LoRA to enhance efficiency by quantizing the weight values of the original network from high-resolution data types, such as Float32, to lower-resolution data types like int4. This leads to reduced memory demands and faster calculations. There are three key optimizations that QLoRA brings on top of LoRA, which make QLoRA one of the best PEFT methods.

4-bit NF4 Quantization

4-bit NormalFloat4 is an optimized data type that can be used to store weights, which brings down the memory footprint considerably. 4-bit NormalFloat4 quantization involves three steps: normalization, quantization, and, later, dequantization.

Normalization and quantization: The weights are adjusted to zero mean and constant unit variance. A 4-bit data type can only store 16 distinct values, so as part of normalization the weights are mapped onto 16 zero-centered positions, and instead of storing each weight, the nearest position is stored. Here is an example: say we have an FP32 weight with the value 0.2121. A 4-bit split of the range between -1 and 1 gives 16 number positions; 0.2121 is closest to 0.1997, which is the 10th position, so instead of saving the FP32 value 0.2121, we store 10. The typical formula is:

int4Tensor = round((totalNumberOfPositions / absmax(inputXTensor)) * FP32WeightsTensor)

In the above example, totalNumberOfPositions = 16. The value totalNumberOfPositions / absmax(inputXTensor) is called the quantization constant. Obviously, there is some loss of information when we normalize and quantize, since we move from FP32, a high-resolution data type, to a low-resolution data type. The loss is not huge as long as there are no outliers in the input tensor, which might affect absmax() and eventually upset the distribution. To avoid that issue, we generally quantize the weights independently in smaller blocks, which contains the effect of outliers.

Dequantization: To dequantize the values, we do exactly the reverse:

dequantizedTensor = int4Tensor / (totalNumberOfPositions / absmax(inputXTensor))

The 4-bit NormalFloat quantization is applied to the weights of the original model; the LoRA adapter weights remain FP32, as all of the training happens on these weights. Once all the training is done, the original weights are dequantized.
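Here is a minimal Java sketch of the block-wise absmax idea described above. It is my own illustration, not the article's code: it uses uniform 4-bit levels scaled to the signed range rather than the non-uniform NormalFloat4 levels, and the variable names are assumptions chosen for readability.

Java
public class BlockAbsmaxQuantization {

    // Largest positive level of a signed 4-bit range; real implementations quantize
    // per block of e.g. 64 weights so a single outlier cannot distort the whole scale.
    static final int MAX_POSITION = 7;

    public static void main(String[] args) {
        float[] weights = {0.2121f, -0.8f, 0.05f, 0.31f};   // one toy block of FP32 weights
        float constant = MAX_POSITION / absmax(weights);     // the quantization constant

        // Quantize: scale by the constant and round to the nearest integer level.
        byte[] quantized = new byte[weights.length];
        for (int i = 0; i < weights.length; i++) {
            quantized[i] = (byte) Math.round(constant * weights[i]);
        }

        // Dequantize: divide by the same constant; a little precision is lost, but the
        // error stays small as long as the block contains no extreme outliers.
        for (int i = 0; i < weights.length; i++) {
            float restored = quantized[i] / constant;
            System.out.printf("%.4f -> %d -> %.4f%n", weights[i], quantized[i], restored);
        }
    }

    static float absmax(float[] block) {
        float max = 0f;
        for (float w : block) max = Math.max(max, Math.abs(w));
        return max;
    }
}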
Double Quantization

Double quantization further reduces the memory footprint by quantizing the quantization constants themselves. In the previous 4-bit quantization step, we calculated a quantization constant; even that can be quantized for better efficiency, and that is what double quantization does. Since quantization is done in blocks to avoid outliers, typically 64 weights per block, we end up with one quantization constant per block, and these constants can be quantized further.

Say we group 64 parameters/weights per block, and each quantization constant takes 32 bits, as it is FP32. That adds 0.5 bits per parameter on average, which means at least 500,000 extra bits for a typical 1-million-parameter model. With double quantization, we apply quantization to these quantization constants, which further optimizes memory usage: we can take a group of 256 quantization constants and apply 8-bit quantization, achieving approximately 0.127 bits per parameter, or roughly 127,000 bits for the same 1-million-parameter model. Here is the calculation: the second-level FP32 constant covers 256 blocks of 64 weights, contributing 32/(64*256) = 0.001953125 bits per parameter; the 8-bit quantized constant per block of 64 weights contributes 8/64 = 0.125 bits per parameter; adding them up gives 0.125 + 0.001953125, which is approximately 0.127.

Unified Memory Paging

Coupled with the above techniques, QLoRA also utilizes NVIDIA's unified memory feature, which allows seamless GPU-to-CPU page transfers when the GPU runs out of memory, thus managing sudden memory spikes on the GPU and helping avoid memory overflow/overrun issues.

LoRA and QLoRA are two of the most widely used emerging techniques for parameter-efficient fine-tuning. In the next part, we will implement QLoRA; until then, have fun with LLMs. Hope this was useful; leave your comments and feedback. Bye for now...
Many libraries for AI app development are primarily written in Python or JavaScript. The good news is that several of these libraries have Java APIs as well. In this tutorial, I'll show you how to build a ChatGPT clone using Spring Boot, LangChain, and Hilla. The tutorial will cover simple synchronous chat completions and a more advanced streaming completion for a better user experience.

Completed Source Code

You can find the source code for the example in my GitHub repository.

Requirements

Java 17+
Node 18+
An OpenAI API key in an OPENAI_API_KEY environment variable

Create a Spring Boot and React project, Add LangChain

First, create a new Hilla project using the Hilla CLI. This will create a Spring Boot project with a React frontend.

Shell
npx @hilla/cli init ai-assistant

Open the generated project in your IDE. Then, add the LangChain4j dependency to the pom.xml file:

XML
<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j</artifactId>
    <version>0.22.0</version> <!-- TODO: use latest version -->
</dependency>

Simple OpenAI Chat Completions With Memory Using LangChain

We'll begin exploring LangChain4j with a simple synchronous chat completion. In this case, we want to call the OpenAI chat completion API and get a single response. We also want to keep track of up to 1,000 tokens of the chat history. In the com.example.application.service package, create a ChatService.java class with the following content:

Java
@BrowserCallable
@AnonymousAllowed
public class ChatService {

    @Value("${openai.api.key}")
    private String OPENAI_API_KEY;

    private Assistant assistant;

    interface Assistant {
        String chat(String message);
    }

    @PostConstruct
    public void init() {
        var memory = TokenWindowChatMemory.withMaxTokens(1000, new OpenAiTokenizer("gpt-3.5-turbo"));

        assistant = AiServices.builder(Assistant.class)
                .chatLanguageModel(OpenAiChatModel.withApiKey(OPENAI_API_KEY))
                .chatMemory(memory)
                .build();
    }

    public String chat(String message) {
        return assistant.chat(message);
    }
}

@BrowserCallable makes the class available to the front end.
@AnonymousAllowed allows anonymous users to call the methods.
@Value injects the OpenAI API key from the OPENAI_API_KEY environment variable.
Assistant is the interface that we will use to call the chat API.
init() initializes the assistant with a 1,000-token memory and the gpt-3.5-turbo model.
chat() is the method that we will call from the front end.

Start the application by running Application.java in your IDE, or with the default Maven goal:

Shell
mvn

This will generate TypeScript types and service methods for the front end. Next, open App.tsx in the frontend folder and update it with the following content:

TypeScript-JSX
export default function App() {
  const [messages, setMessages] = useState<MessageListItem[]>([]);

  async function sendMessage(message: string) {
    setMessages((messages) => [
      ...messages,
      {
        text: message,
        userName: "You",
      },
    ]);

    const response = await ChatService.chat(message);
    setMessages((messages) => [
      ...messages,
      {
        text: response,
        userName: "Assistant",
      },
    ]);
  }

  return (
    <div className="p-m flex flex-col h-full box-border">
      <MessageList items={messages} className="flex-grow" />
      <MessageInput onSubmit={(e) => sendMessage(e.detail.value)} />
    </div>
  );
}

We use the MessageList and MessageInput components from the Hilla UI component library. sendMessage() adds the message to the list of messages and calls the chat() method on the ChatService class. When the response is received, it is added to the list of messages.
You now have a working chat application that uses the OpenAI chat API and keeps track of the chat history. It works great for short messages, but it is slow for long answers. To improve the user experience, we can use a streaming completion instead, displaying the response as it is received.

Streaming OpenAI Chat Completions With Memory Using LangChain

Let's update the ChatService class to use a streaming completion instead:

Java
@BrowserCallable
@AnonymousAllowed
public class ChatService {

    @Value("${openai.api.key}")
    private String OPENAI_API_KEY;

    private Assistant assistant;

    interface Assistant {
        TokenStream chat(String message);
    }

    @PostConstruct
    public void init() {
        var memory = TokenWindowChatMemory.withMaxTokens(1000, new OpenAiTokenizer("gpt-3.5-turbo"));

        assistant = AiServices.builder(Assistant.class)
                .streamingChatLanguageModel(OpenAiStreamingChatModel.withApiKey(OPENAI_API_KEY))
                .chatMemory(memory)
                .build();
    }

    public Flux<String> chatStream(String message) {
        Sinks.Many<String> sink = Sinks.many().unicast().onBackpressureBuffer();

        assistant.chat(message)
                .onNext(sink::tryEmitNext)
                .onComplete(sink::tryEmitComplete)
                .onError(sink::tryEmitError)
                .start();

        return sink.asFlux();
    }
}

The code is mostly the same as before, with some important differences:

Assistant now returns a TokenStream instead of a String.
init() uses streamingChatLanguageModel() instead of chatLanguageModel().
chatStream() returns a Flux<String> instead of a String.

Update App.tsx with the following content:

TypeScript-JSX
export default function App() {
  const [messages, setMessages] = useState<MessageListItem[]>([]);

  function addMessage(message: MessageListItem) {
    setMessages((messages) => [...messages, message]);
  }

  function appendToLastMessage(chunk: string) {
    setMessages((messages) => {
      const lastMessage = messages[messages.length - 1];
      lastMessage.text += chunk;
      return [...messages.slice(0, -1), lastMessage];
    });
  }

  async function sendMessage(message: string) {
    addMessage({
      text: message,
      userName: "You",
    });

    let first = true;
    ChatService.chatStream(message).onNext((chunk) => {
      if (first && chunk) {
        addMessage({
          text: chunk,
          userName: "Assistant",
        });
        first = false;
      } else {
        appendToLastMessage(chunk);
      }
    });
  }

  return (
    <div className="p-m flex flex-col h-full box-border">
      <MessageList items={messages} className="flex-grow" />
      <MessageInput onSubmit={(e) => sendMessage(e.detail.value)} />
    </div>
  );
}

The template is the same as before, but the way we handle the response is different. Instead of waiting for the response to be received, we start listening for chunks of the response. When the first chunk is received, we add it as a new message. When subsequent chunks are received, we append them to the last message. Re-run the application, and you should see that the response is displayed as it is received.

Conclusion

As you can see, LangChain makes it easy to build LLM-powered AI applications in Java and Spring Boot. With the basic setup in place, you can extend the functionality by chaining operations, adding external tools, and more, following the examples on the LangChain4j GitHub page, linked earlier in this article. Learn more about Hilla in the Hilla documentation.
Vector technology in AI, often encountered through its implementations in vector indexes and vector search, offers a robust mechanism to index and query high-dimensional data entities spanning images, text, audio, and video. Its prowess becomes evident across diverse spectrums like similarity-driven searches, multi-modal retrieval, dynamic recommendation engines, and platforms leveraging the Retrieval Augmented Generation (RAG) paradigm. Due to its potential impact on a multitude of use cases, vectors have emerged as a hot topic. As one delves deeper, attempting to demystify the essence of "what precisely is vector search?", they are often greeted by a barrage of terms — AI, LLM, generative AI — to name a few. This article aims to paint a clearer picture (quite literally) by likening the concept to something we all know: colors.

Infinite hues bloom,
A million shades dance and play,
Colors light our world.

Just the so-called "official colors" span three long Wikipedia pages. While it's straightforward to store and search these colors by their names using conventional search indices like those in Elasticsearch or Couchbase FTS, there's a hitch. Think about the colors Navy and Ocean. Intuitively, they feel closely related, evoking images of deep, serene waters. Yet, linguistically, they share no common ground. This is where traditional search engines hit a wall. The typical workaround? Synonyms. You could map Navy to a plethora of related terms: blue, azure, ocean, turquoise, sky, and so on. But now consider the gargantuan task of doing this for every color name. Moreover, these lists don't give us a measure of the closeness between colors. Is azure closer to navy than to sky? A list won't tell you that. To put it simply, seeking similarities among colors is a daunting task. Trying to craft relationships between colors to gauge their similarity? Even more challenging.

The simple solution is the well-known RGB model. Encoding colors in the RGB vector scheme solves both the similarity and the distance problem. When we talk about a color's RGB values, we're essentially referencing its coordinates in a 3D space where each dimension can take values ranging from 0 to 255, totaling 256 values. The vector (R, G, B) is defined by three components: the intensity of red (R), the intensity of green (G), and the intensity of blue (B). Each of these components ranges from 0 to 255, allowing for over 16 million (16,777,216, to be exact) unique combinations, each representing a distinct color. For instance, the vector (255, 0, 0) signifies the full intensity of red with no contributions from green or blue, resulting in the color red. Here are sample RGB values for some colors:

Navy: (0, 0, 128)
Turquoise: (64, 224, 208)
Orange: (255, 165, 0)
Green: (0, 128, 0)
Gray: (128, 128, 128)

Each of these triples can be seen as a vector representing a unique point in a color space containing 16,777,216 colors. Visualizing RGB values as vectors offers a profound advantage: the spatial proximity of two vectors gives a measure of color similarity. Colors that are close in appearance will have vectors that are close in the RGB space. This vector representation, therefore, not only provides a means to encode colors but also allows for an intuitive understanding of color relationships and similarities.
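As a tiny, self-contained illustration of that point (my own sketch, not code from the article), the following Java snippet treats RGB triples as vectors and scores similarity with plain Euclidean distance. The "medium blue" value (a standard CSS color, 0, 0, 205) is an assumption added here only to contrast a near neighbor of Navy with a distant one.

Java
public class ColorDistance {

    // Euclidean distance between two RGB vectors: smaller means more similar colors.
    static double distance(int[] a, int[] b) {
        double sum = 0;
        for (int i = 0; i < 3; i++) sum += Math.pow(a[i] - b[i], 2);
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        int[] navy       = {0, 0, 128};
        int[] mediumBlue = {0, 0, 205};    // visually close to navy
        int[] orange     = {255, 165, 0};  // visually very different

        System.out.printf("navy - mediumBlue: %.1f%n", distance(navy, mediumBlue)); // about 77
        System.out.printf("navy - orange:     %.1f%n", distance(navy, orange));     // about 330
    }
}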
Similarity Searching

To find colors close to (148, 201, 44) in the RGB space, we vary each R, G, and B value by one step up and one step down to create the search space. This method generates 3 x 3 x 3 = 27 color combinations and gives us a list of similar colors with specific distances. This is like identifying a small cube inside the larger RGB cube:

Plain Text
(147, 200, 43), (147, 200, 44), (147, 200, 45)
(147, 201, 43), (147, 201, 44), (147, 201, 45)
(147, 202, 43), (147, 202, 44), (147, 202, 45)
(148, 200, 43), (148, 200, 44), (148, 200, 45)
(148, 201, 43), (148, 201, 44) <- This is the original color, (148, 201, 45)
(148, 202, 43), (148, 202, 44), (148, 202, 45)
(149, 200, 43), (149, 200, 44), (149, 200, 45)
(149, 201, 43), (149, 201, 44), (149, 201, 45)
(149, 202, 43), (149, 202, 44), (149, 202, 45)

All 27 of these colors are similar to our original color (148, 201, 44). This principle can be expanded to larger distances and to multiple ways of calculating the distance. If we were to store, index, and search RGB values in a database, this is how it would be done.

Similarity search on colors via the RGB model

Hopefully, this gave you a good understanding of how RGB models the color scheme and solves the similarity search problem. Now, let's replace the RGB model with an LLM model and input text and images about tennis. We then search for "French Open." Even though the input text and images didn't include "French Open" directly, the effect of the similarity search is that Djokovic and the two tennis images will still be returned! That's the magic of the LLM model and vector search. Vector indexing and vector search follow the same path. RGB encodes 16 million colors in 3 bytes, but real-world data is more complicated: languages, images, and videos are much more complicated. Hence, vector databases use not three but 300, 3,000, or more dimensions to encode data. Because of this, we need novel methods to store, index, and perform similarity searches efficiently. However, the core principle is the same. More on how vector indexing and searching are done in a future blog!
Bloom filters are probabilistic data structures that allow for efficient testing of an element's membership in a set. They effectively filter out unwanted items from extensive data sets while maintaining a small probability of false positives. Since their invention in 1970 by Burton H. Bloom, these data structures have found applications in various fields such as databases, caching, networking, and more. In this article, we will delve into the concept of Bloom filters and their functioning, explore a contemporary real-world application, and illustrate their workings with a practical example.

Understanding Bloom Filters

A Bloom filter consists of an array of m bits, initially set to 0. It employs k independent hash functions, each mapping an element to one of the m positions in the array. To add an element to the filter, it is hashed using each of the k hash functions, and the corresponding positions in the array are set to 1. To verify if an element is present in the filter, the element is hashed again using the same k hash functions, and if all the corresponding positions are set to 1, the element is considered present. However, there is a possibility of false positives, i.e., the Bloom filter may indicate that an element is present when it is not. This occurs when the positions for a non-present element have been set to 1 by other elements. The rate of false positives can be controlled by adjusting the parameters m and k. Generally, the larger the bit array and the more hash functions used, the lower the probability of false positives.

Practical Example: Spell Checker

To illustrate how Bloom filters work, let's consider a simple example of a spell checker that uses a Bloom filter to store a dictionary of valid words. The goal is to quickly determine whether a given word is in the dictionary while minimizing memory usage.

Initialize the Bloom filter: First, we create an empty Bloom filter with an array of m bits, all set to 0. For this example, let's assume m = 20. We also choose k independent hash functions that map each word to one of the 20 positions in the array.

Add words to the Bloom filter: Now, let's add three words to our dictionary: "apple", "banana", and "orange". We pass each word through the k hash functions, which generate indices corresponding to positions in the bit array. Suppose the hash functions generate the following indices for each word:

"apple": 3, 7, 12
"banana": 5, 12, 17
"orange": 2, 7, 19

We set the bits at these positions to 1. Our Bloom filter now looks like this:

0 0 1 1 0 1 0 1 0 0 0 0 1 0 0 0 0 1 0 1

Query the Bloom filter: To check if a word is in the dictionary, we hash the word using the same k hash functions and check the corresponding positions in the bit array. If all the positions have a value of 1, we assume the word is in the dictionary.

Let's check if the word "grape" is in the dictionary. The hash functions generate the following indices for "grape": 2, 5, 19. In our Bloom filter, these positions have the values 1, 1, and 1, so the Bloom filter indicates that "grape" is in the dictionary. However, we know that "grape" was not added to the dictionary, so this is a false positive.

Now, let's check if the word "mango" is in the dictionary. The hash functions generate the following indices for "mango": 4, 8, 18. In our Bloom filter, these positions have the values 0, 0, and 0. Since not all positions have the value 1, the Bloom filter correctly indicates that "mango" is not in the dictionary.
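To make the mechanics above concrete, here is a compact Java sketch of a Bloom filter (my own illustration, not production code). Deriving the k hash positions from a single hashCode mixed with different seeds is an assumption chosen for brevity; real implementations use stronger, independent hash functions.

Java
import java.util.BitSet;

public class BloomFilter {

    private final BitSet bits;
    private final int m; // number of bits in the array
    private final int k; // number of hash functions

    public BloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Derive k different positions from one element by mixing its hash with a seed.
    private int position(String element, int seed) {
        int h = element.hashCode() * 31 + seed * 0x9E3779B9;
        return Math.floorMod(h, m);
    }

    public void add(String element) {
        for (int i = 0; i < k; i++) bits.set(position(element, i));
    }

    public boolean mightContain(String element) {
        for (int i = 0; i < k; i++)
            if (!bits.get(position(element, i))) return false; // definitely not in the set
        return true; // in the set, or a false positive
    }

    // Expected false-positive rate after inserting n elements: (1 - e^(-k*n/m))^k
    public double estimateFalsePositiveRate(int n) {
        return Math.pow(1 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        BloomFilter dictionary = new BloomFilter(20, 3);
        dictionary.add("apple");
        dictionary.add("banana");
        dictionary.add("orange");

        System.out.println(dictionary.mightContain("banana")); // true
        System.out.println(dictionary.mightContain("mango"));  // very likely false
        System.out.printf("Estimated false-positive rate: %.2f%n",
                dictionary.estimateFalsePositiveRate(3));
    }
}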
In this example, the Bloom filter allows us to efficiently check if a word is in the dictionary using minimal memory. However, there is a possibility of false positives, as demonstrated by the "grape" example. By adjusting the size of the bit array (m) and the number of hash functions (k), we can control the probability of false positives to suit our specific use case.

Contemporary Application Example: Distributed Systems and Content Delivery Networks

A recent and advanced application of Bloom filters is in distributed systems, particularly in content delivery networks (CDNs). CDNs are networks of servers strategically placed around the world to distribute content, such as web pages, videos, images, and other resources, to users more efficiently. CDNs rely on caching to store copies of content temporarily on edge servers, which are closer to the users, to reduce latency and improve user experience. In this context, Bloom filters can be used to optimize cache eviction policies. When an edge server's cache reaches its capacity, it must evict some content to make room for new content. To minimize the impact on user experience, it is crucial to evict content that is least likely to be requested in the near future.

A Bloom filter can be employed to track the recent access history of content. When a user requests content, the CDN checks the Bloom filter to see if the content has been accessed recently. If the Bloom filter indicates that the content is not present, the CDN fetches the content from the origin server, caches it on the edge server, and adds the content identifier to the Bloom filter. If the Bloom filter indicates that the content is present, the CDN assumes the content has been recently accessed and retrieves it from the cache. When the cache becomes full, the CDN can evict content that is not present in the Bloom filter, as it is less likely to be requested again soon. The occasional false positives introduced by the Bloom filter do not significantly impact the overall performance of the CDN, as the benefits of reduced memory consumption and optimized cache eviction far outweigh the drawbacks.

Conclusion

Bloom filters offer a sophisticated solution to the challenge of filtering large data sets with minimal memory usage. While the possibility of false positives exists, it can be mitigated by carefully selecting the parameters of the filter. With practical examples like the spell checker and cutting-edge applications like CDNs, Bloom filters demonstrate their practicality and efficiency in managing massive amounts of data. As data volumes continue to grow exponentially, Bloom filters will remain an indispensable tool for efficient data filtering and management in modern systems.
Relational Database Management Systems (RDBMS) represent the state of the art, thanks in part to their well-established ecosystem of surrounding technologies, tools, and widespread professional skills. During this era of technological revolution encompassing both Information Technology (IT) and Operational Technology (OT), it is widely recognized that significant challenges arise concerning performance, particularly in specific use cases where NoSQL solutions outperform traditional approaches. Indeed, the market offers many NoSQL DBMS solutions interpreting and exploiting a variety of different data models:

Key-value store (e.g., the simplest storage, where access to persisted data must be instantaneous and retrieval is made by keys, like a hash map or a dictionary);
Document-oriented (e.g., widely adopted in serverless solutions and lambda function architectures where clients need a well-structured DTO directly from the database);
Graph-oriented (e.g., useful for knowledge management, the semantic web, or social networks);
Column-oriented (e.g., providing highly optimized "ready-to-use" data projections in query-driven modeling approaches);
Time series (e.g., for handling sensors and sample data in Internet of Things scenarios);
Multi-model store (e.g., combining different types of data models for mixed functional purposes).

"Errors using inadequate data are much less than those using no data at all." - Charles Babbage

A less-explored concern is the ability of software architectures relying on relational solutions to flexibly adapt to rapid and frequent changes in the software domain and functional requirements. This challenge is exacerbated by Agile-like software development methodologies that aim to satisfy the customer by dealing with continuously emerging demands led by its business market. In particular, RDBMS, by their very nature, may suffer when software requirements change over time, inducing rapid effects on database tabular schemas by introducing new association tables -also replacing pre-existent foreign keys- and producing new JOIN clauses in SQL queries, thus resulting in more complex and less maintainable solutions. In our enterprise experience, we have successfully implemented and experimented with a graph-oriented DBMS solution based on the Neo4j Graph Database so as to attenuate the architectural consequences of requirements changes within an operational context typical of a digital social community with different users and roles. In this article, we:

Exemplify how graph-oriented DBMSs are more resilient to changes in functional requirements;
Discuss the feasibility of adopting graph-oriented DBMSs in a classic N-tier (layered) architecture, proposing some approaches for overcoming the main difficulties;
Highlight advantages, disadvantages, and threats to their adoption in various contexts and use cases.

The Neo4j Graph Database

The idea behind graph-oriented data models is to adopt a native approach for handling entities (i.e., nodes) and the relationships between them (i.e., edges) so as to query the knowledge base (namely, the knowledge graph) by navigating relationships between entities. The Neo4j Graph Database works on oriented property graphs where both nodes and edges own different kinds of property attributes.
We chose it as our DBMS primarily for:

Its "native" implementation, concretely modeled through a digital graph meta-model whose runtime instance is composed of nodes (containing the entities of the domain with their attributes) and edges (representing navigable relationships among the interconnected concepts). In this way, relationships are traversed in O(1);
The Cypher query language, adopted as a very powerful and intuitive query system for the knowledge persisted within the graph.

Furthermore, the Neo4j Graph Database also offers Java libraries for Object Graph Mapping (OGM), which help developers in the automated process of mapping, persisting, and managing model entities, nodes, and relationships. Practically, OGM plays, for graph-oriented DBMSs, the same role that the Object Relational Mapping (ORM) pattern has for relational persistence layers. Comparable to the ORM pattern designed for RDBMS, the OGM pattern serves to streamline the implementation of Data Access Objects (DAOs). Its primary function is to enable semi-automated elaboration in persisting domain model entities that are properly configured and annotated within the source code. With respect to the Java Persistence API (JPA)/Hibernate, widely recognized as a leading ORM technology, Neo4j's OGM library operates in a distinctive manner:

Write operations: OGM propagates persistence changes across all relationships of a managed entity (analyzing the whole tree of object relationships starting from the managed object); JPA performs updates table by table, starting from the managed entity and handling relationships based on cascade configurations.
Read operations: OGM retrieves an entire "tree of relationships" with a fixed depth per query, starting from the specified node, which acts as the "root of the tree"; JPA allows relationships to be configured between an EAGER and a LAZY loading approach.

Solution Benefits of an Exemplary Case Study

To exemplify the meaning of our analysis, we introduce a simple operative scenario: the UML Class Diagram of Fig. 1.1 depicts an entity User, which has a 1-to-N relationship with the entity Auth (abbr. of Authorization), which defines permissions and grants inside the application. This Domain Model may be supported in an RDBMS by a schema like that of Tab. 1.1 and Tab. 1.2 or, in a graph-oriented DBMS, as in the knowledge graph of Fig. 1.2.

Fig. 1.1: UML Class Diagram of the Domain Model.

users table
id | firstName | lastName
... | ... | ...
Tab. 1.1: Table mapped within the RDBMS schema for the User entity.

auths table
id | name | level | user_fk
... | ... | ... | ...
Tab. 1.2: Table mapped within the RDBMS schema for the Auth entity.

Fig. 1.2: Knowledge graph related to the Domain Model of Fig. 1.1.

Now, imagine that a new requirement emerges during the production lifecycle of the application: the customer, for administrative reasons, needs to bind authorizations to specific time periods (i.e., from and until dates of validity), as in Fig. 2.1, transforming the relationship between User and Auth into an N-to-N. This Domain Model may be supported in an RDBMS by a schema like that of Tab. 2.1, Tab. 2.2, and Tab. 2.3 or, in a graph-oriented DBMS, as in the knowledge graph of Fig. 2.2.

Fig. 2.1: UML Class Diagram of the Domain Model after the definition of the new requirements.

users table
id | firstName | lastName
... | ... | ...
Tab. 2.1: Table mapped within the RDBMS schema for the User entity.

users_auths table
user_fk | auth_fk | from | until
... | ... | ... | ...
Tab. 2.2: Table mapped within the RDBMS schema for storing associations between User and Auth entities.
auths table
id | name | level
... | ... | ...
Tab. 2.3: Table mapped within the RDBMS schema for the Auth entity.

Fig. 2.2: Knowledge graph related to the Domain Model of Fig. 2.1.

The advantage is already clear at the schema level: the graph-oriented approach did not change the schema at all but only prescribes the definition of two new properties on the edge (modeling the relationship), while the RDBMS approach required creating the new association table users_auths, replacing the foreign key in the auths table that referenced the users table. Proceeding further with a deeper analysis, we can compare a SQL query with a query written in the Cypher query language under the two approaches: we'd like to identify users with the first name "Paul" having an Auth named "admin" with a level greater than or equal to 3. On the one hand, in SQL, the required queries (the first for retrieving data from Tab. 1.1 and Tab. 1.2, the second for Tab. 2.1, Tab. 2.2, and Tab. 2.3) are:

SQL
SELECT users.*
FROM users
INNER JOIN auths ON users.id = auths.user_fk
WHERE users.firstName = 'Paul'
  AND auths.name = 'admin'
  AND auths.level >= 3

SQL
SELECT users.*
FROM users
INNER JOIN users_auths ON users.id = users_auths.user_fk
INNER JOIN auths ON auths.id = users_auths.auth_fk
WHERE users.firstName = 'Paul'
  AND auths.name = 'admin'
  AND auths.level >= 3

On the other hand, in the Cypher query language, the required query (for both cases) is:

Cypher
MATCH (u:User)-[:HAS_AUTH]->(auth:Auth)
WHERE u.firstName = 'Paul'
  AND auth.name = 'admin'
  AND auth.level >= 3
RETURN u

While the SQL query needs one more JOIN clause, the query written in the Cypher query language not only avoids an additional clause or a variation of the MATCH path but, in this specific case, remains completely identical. No changes were necessary to the "query system" of the backend!

Conclusions

Wedge Engineering contributed as the technological partner within an international project where a collaborative social platform was designed as a decoupled web application in a 3-tier architecture composed of:

A backend module, a layered RESTful architecture, leveraging the Jakarta EE framework;
A knowledge graph, the NoSQL store provided by the Neo4j Graph Database;
A frontend module, a single-page app based on HTML, CSS, and JavaScript, exploiting the Angular framework.

The most challenging design choice we had to face was between using a driver that natively exploits the Cypher query language and leveraging the OGM library to simplify DAO implementations: we discovered that building an entire application with custom queries written in the Cypher query language is neither feasible nor scalable, while OGM may not be efficient enough when dealing with large data hierarchies that involve a significant number of relationships to referenced external entities. We finally opted for a custom approach, exploiting OGM as the reference solution for mapping nodes and edges in an ORM-like perspective and for supporting the implementation of ad hoc DAOs, then optimizing with custom query methods exactly those operations that did not perform well. In conclusion, we can claim that the adopted software architecture responded well to changes in the knowledge graph schema and completely fulfilled customer needs while easing the efforts made by the Wedge Engineering development team.
Nevertheless, some threats have to be considered before adopting this architecture:

SQL expertise is far more common than Cypher query language expertise → it is much easier to find -and thus to include within a development team- experts able to maintain code for an RDBMS rather than for the Neo4j Graph Database;
Neo4j system requirements for on-premise production are significant (i.e., for server-based environments, at least 8 GB of RAM are recommended) → this solution may not be the best fit for limited-resource scenarios and low-cost implementations;
Despite our best efforts, we didn't find any open-source editor that is "ready and easy to use" for navigating through the Neo4j Graph Database data structure (the official Neo4j data browser does not allow data modifications through the GUI without custom MERGE/CREATE queries), as there are many for RDBMSs → this may be intrinsically caused by the characteristic data model, which hardens the realization of tabular views of the data.
The coming wave of generative AI will be more revolutionary than any technology innovation that's come before in our lifetime, or maybe any lifetime. - Marc Benioff, CEO of Salesforce

In today's data-driven landscape, organizations are constantly seeking innovative ways to derive value from their vast and ever-expanding datasets. Data Lakes have emerged as a cornerstone of modern data architecture, providing a scalable and flexible foundation for storing and managing diverse data types. Simultaneously, Generative Artificial Intelligence (AI) has been making waves, enabling machines to mimic human creativity and generate content autonomously. The convergence of Data Lake Houses and Generative AI opens up exciting possibilities for businesses and developers alike. It empowers them to harness the full potential of their data resources by creating AI-driven applications that generate content, insights, and solutions dynamically. However, navigating this dynamic landscape requires the right set of tools and strategies. In this blog, we'll explore the essential tools and techniques that empower developers and data scientists to leverage the synergy between these two transformative technologies. Below are the basic capabilities and tools needed on top of your data lake to support Generative AI apps.

Vector Database

Grounding Large Language Models (LLMs) with generative AI using vector search is a cutting-edge approach aimed at mitigating one of the most significant challenges in AI-driven content generation: hallucinations. LLMs, such as GPT, are remarkable for their ability to generate human-like text, but they can occasionally produce information that is factually incorrect or misleading. This issue, known as hallucination, arises because LLMs generate content based on patterns and associations learned from vast text corpora, sometimes without a factual basis. Vector search, a powerful technique rooted in machine learning and information retrieval, plays a pivotal role in grounding LLMs by aligning generated content with reliable sources, real-world knowledge, and factual accuracy.

AutoML

AutoML helps you automatically apply machine learning to a dataset. You provide the dataset and identify the prediction target, while AutoML prepares the dataset for model training. AutoML then performs and records a set of trials that create, tune, and evaluate multiple models. You can further streamline the process by integrating AutoML platforms like Google AutoML or Azure AutoML, which can automate the process of training and tuning AI models, reducing the need for extensive manual configuration.

Model Serving

Model serving is the process of making a trained model available to users so that they can make predictions on new data. In the context of generative AI apps on data lake houses, model serving plays a critical role in enabling users to generate creative text formats, translate languages, and answer questions in an informative way. Here are some of the key benefits of using model serving in generative AI apps on data lake houses:

Scalability: Model serving systems can be scaled to handle any volume of traffic. This is important for generative AI apps, which can be very popular and generate a lot of traffic.
Reliability: Model serving systems are designed to be highly reliable. This is important for generative AI apps, which need to be available to users 24/7.
Security: Model serving systems can be configured to be very secure.
This is important for generative AI apps, which may be processing sensitive data. At the same time, the costs of in-house model serving can be prohibitive for smaller companies, which is why many of them choose to outsource their model serving needs to a third-party provider.

LLM Gateway

An LLM gateway is a system that makes it easier for people to use different large language models (LLMs) from different providers. It does this by providing a single interface for interacting with all of the different LLMs and by encapsulating best practices for using them. It also manages data by tracking what data is sent to and received from the LLMs and by running PII-scrubbing heuristics on the data before it is sent. In other words, an LLM gateway is a one-stop shop for using LLMs: it makes it easy to get started and helps people use LLMs safely and efficiently. LLM gateways serve the following purposes:

Simplify the process of integrating these powerful language models into various applications.
Provide user-friendly APIs and SDKs, reducing the barrier to entry for leveraging LLMs.
Enable prediction caching to track repeated prompts.
Apply rate limiting to manage costs.

Prompt Tools

Prompt tools can help you write better prompts for generative AI tools, which can lead to improved responses in a number of ways:

Reduced ambiguity: Prompt tools can help you communicate your requests more clearly and precisely, which helps reduce ambiguity in the AI's responses.
Consistent tone and style: Prompt tools can help you specify the tone and style of the desired output, ensuring that generated content is consistent and on-brand.
Mitigated bias: Prompt tools can help you instruct the AI to avoid sensitive topics or adhere to ethical guidelines, which helps mitigate bias and promote fairness.
Improved relevance: Prompt tools can help you set the context and goals for the AI, ensuring that generated content stays on-topic and relevant.

Here are some specific examples of how prompt tools can be used to address these challenges:

Avoiding ambiguous or unintended responses: Instead of simply saying, "Write me a blog post about artificial intelligence," you could use a prompt tool to generate a more specific prompt, such as "Write a 1000-word blog post about the different types of artificial intelligence and their potential applications."
Ensuring consistent tones and styles: If you are writing an email to clients, you can use a prompt tool to specify a formal and informative tone. If you are writing a creative piece, you can use a prompt tool to specify a more playful or experimental tone.
Producing unbiased and politically correct content: If you are writing about a sensitive topic, such as race or religion, you can use a prompt tool to instruct the AI to avoid certain subjects or viewpoints. You can also use a prompt tool to remind the AI to adhere to your organization's ethical guidelines.
Staying on-topic and generating relevant information: If you are asking the AI to generate a report on a specific topic, you can use a prompt tool to provide the AI with the necessary context and goals. This will help the AI stay on-topic and generate relevant information.

Overall, prompt tools are valuable for anyone who uses generative AI tools. By using them, you can write better prompts and get the most out of your generative AI tools.
Monitoring

Generative AI models have transformed various industries by enabling machines to generate human-like text, images, and more. When integrated with Lake Houses, these models become even more powerful, leveraging vast amounts of data to generate creative content. However, monitoring such models is crucial to ensure their performance, reliability, and ethical use. Here are some monitoring tools and practices tailored for Generative AI on top of Lake Houses:

Model Performance Metrics
Data Quality and Distribution
Cost Monitoring
Anomaly Detection

Conclusion

In conclusion, the convergence of Data Lake Houses and Generative AI marks a groundbreaking era in data-driven innovation. These transformative technologies, when equipped with the right tools and capabilities, empower organizations to unlock the full potential of their data resources. Vector databases and grounding LLMs with vector search address the challenge of hallucinations, ensuring content accuracy. AutoML streamlines machine learning model deployment, while LLM gateways simplify integration. Prompt tools enable clear communication with AI models, mitigating ambiguity and bias. Robust monitoring ensures model performance and ethical use.
In production systems, new features sometimes need a data migration to be implemented. Such a migration can be done with different tools. For simple migrations, SQL can be used: it is fast and easily integrated into Liquibase or other tools that manage database migrations. This solution is for use cases that cannot be done in SQL scripts.

The Use Case

The MovieManager project stores the keys to access TheMovieDB in the database. To improve the project, the keys should now be stored encrypted with Tink. The existing keys need to be encrypted during the data migration, and new keys need to be encrypted during the sign-in process. The movie import service needs to decrypt the keys to use them during the import.

The Data Migration

Update the Database Table

To mark migrated rows in the "user1" table, a "migration" column is added in this Liquibase script:

<changeSet id="41" author="angular2guy">
    <addColumn tableName="user1">
        <column defaultValue="0" type="bigint" name="migration"/>
    </addColumn>
</changeSet>

The changeSet adds the "migration" column to the "user1" table and sets its default value to "0".

Executing the Data Migration

The data migration is started with the startMigrations() method in the CronJobs class:

...
private static volatile boolean migrationsDone = false;
...
@Scheduled(initialDelay = 2000, fixedRate = 36000000)
@SchedulerLock(name = "Migrations_scheduledTask",
    lockAtLeastFor = "PT2H", lockAtMostFor = "PT3H")
public void startMigrations() {
    LOG.info("Start migrations.");
    if (!migrationsDone) {
        this.dataMigrationService.encryptUserKeys().thenApplyAsync(result -> {
            LOG.info("Users migrated: {}", result);
            return result;
        });
    }
    migrationsDone = true;
}

The startMigrations() method is annotated with @Scheduled because that enables the use of @SchedulerLock. The @SchedulerLock annotation sets a database lock to limit the execution to one instance, which enables horizontal scalability. The startMigrations() method is called 2 seconds after startup and then repeatedly at the fixedRate configured in the @Scheduled annotation. The encryptUserKeys() method returns a CompletableFuture, which enables the use of thenApplyAsync(...) to log the number of migrated users without blocking. The static variable migrationsDone makes sure that each application instance calls the dataMigrationService only once and makes the other calls essentially free.

Migrating the Data

To query the users, the JpaUserRepository has the method findOpenMigrations:

public interface JpaUserRepository extends CrudRepository<User, Long> {
    ...
    @Query("select u from User u where u.migration < :migrationId")
    List<User> findOpenMigrations(@Param(value = "migrationId") Long migrationId);
}

The method searches for entities whose migration property has not yet been increased to the migrationId that marks them as migrated. The DataMigrationService contains the encryptUserKeys() method to do the migration:

@Service
@Transactional(propagation = Propagation.REQUIRES_NEW)
public class DataMigrationService {
    ...
    @Async
    public CompletableFuture<Long> encryptUserKeys() {
        List<User> migratedUsers = this.userRepository.findOpenMigrations(1L)
            .stream().map(myUser -> {
                myUser.setUuid(Optional.ofNullable(myUser.getUuid())
                    .filter(myStr -> !myStr.isBlank())
                    .orElse(UUID.randomUUID().toString()));
                myUser.setMoviedbkey(this.userDetailService
                    .encrypt(myUser.getMoviedbkey(), myUser.getUuid()));
                myUser.setMigration(myUser.getMigration() + 1);
                return myUser;
            }).collect(Collectors.toList());
        this.userRepository.saveAll(migratedUsers);
        return CompletableFuture.completedFuture(
            Integer.valueOf(migratedUsers.size()).longValue());
    }
}

The service is annotated with Propagation.REQUIRES_NEW to make sure that each method gets wrapped in its own transaction. The encryptUserKeys() method has the @Async annotation to avoid any timeouts on the calling side. The findOpenMigrations(...) method of the repository returns the not-yet-migrated entities, which are then transformed with map. In the map step, it is first checked whether the user's UUID is set; if not, one is created and set. Then the encrypt(...) method of the UserDetailService is used to encrypt the user key, and the migration property is increased to show that the entity has been migrated. The migrated entities are collected into a list and saved with the repository. Then the result CompletableFuture is created to return the number of migrations done. If the migrations are already done, findOpenMigrations(...) returns an empty collection and nothing is mapped or saved. The UserDetailServiceBase does the encryption in its encrypt() method:

...
@Value("${tink.json.key}")
private String tinkJsonKey;

private DeterministicAead daead;
...
@PostConstruct
public void init() throws GeneralSecurityException {
    DeterministicAeadConfig.register();
    KeysetHandle handle = TinkJsonProtoKeysetFormat.parseKeyset(
        this.tinkJsonKey, InsecureSecretKeyAccess.get());
    this.daead = handle.getPrimitive(DeterministicAead.class);
}
...
public String encrypt(String movieDbKey, String uuid) {
    byte[] cipherBytes;
    try {
        cipherBytes = daead.encryptDeterministically(
            movieDbKey.getBytes(Charset.defaultCharset()),
            uuid.getBytes(Charset.defaultCharset()));
    } catch (GeneralSecurityException e) {
        throw new RuntimeException(e);
    }
    String cipherText = new String(Base64.getEncoder().encode(cipherBytes),
        Charset.defaultCharset());
    return cipherText;
}

The tinkJsonKey is a secret and must be injected into the application as an environment variable or Helm chart value for security reasons. The init() method is annotated with @PostConstruct to run at initialization; it registers the config and creates the KeysetHandle from the tinkJsonKey. Then the primitive is initialized. The encrypt(...) method creates the cipherBytes with encryptDeterministically(...) and the parameters of the method. The UUID is used to produce unique cipherBytes for each user. The result is Base64 encoded and returned as a String.

Conclusion: Data Migration

This migration needs to run as an application and not as a script. The trade-off is that the migration code is now in the application, and after the migration has run, it is dead code. That code should then be removed, but in the real world, the time to do this is limited, and after a while it is forgotten. The alternative is to use something like Spring Batch, but that takes more effort and time because the JPA entities and repositories cannot be reused as easily. A TODO to clean up the method in the DataMigrationService should do the trick sooner or later.
One operational constraint has to be considered: during the migration, the database is in an inconsistent state, and user access to the application should be stopped.

Finally Using the Keys

The MovieService contains the decrypt(...) method:

@Value("${tink.json.key}")
private String tinkJsonKey;

private DeterministicAead daead;
...
@PostConstruct
public void init() throws GeneralSecurityException {
    DeterministicAeadConfig.register();
    KeysetHandle handle = TinkJsonProtoKeysetFormat
        .parseKeyset(this.tinkJsonKey, InsecureSecretKeyAccess.get());
    this.daead = handle.getPrimitive(DeterministicAead.class);
}
...
private String decrypt(String cipherText, String uuid)
        throws GeneralSecurityException {
    String result = new String(daead.decryptDeterministically(
        Base64.getDecoder().decode(cipherText),
        uuid.getBytes(Charset.defaultCharset())));
    return result;
}

The properties and the init() method are the same as for the encryption. The decrypt(...) method first Base64-decodes the cipherText and then uses the result and the UUID to decrypt the key and return it as a String. That key string is used with the movieDbRestClient methods to import movie data into the database.

Conclusion

The Tink library makes using encryption easy enough. The tinkJsonKey has to be injected at runtime and should not live in a repository file or the application jar. A tinkJsonKey can be created with the createKeySet() method of the EncryptionTest. The ShedLock library enables horizontal scalability, and Spring provides the toolbox that is used. The solution tries to balance the trade-offs of a horizontally scalable data migration that cannot be done in a script.
In a previous lab titled "Building News Sentiment and Stock Price Performance Analysis NLP Application With Python," I briefly touched upon the concept of algorithmic trading using automated market news sentiment analysis and its correlation with stock price performance. Market movements, especially in the short term, are often influenced by investors' sentiment. One of the main components of sentiment analysis trading strategies is the algorithmic computation of a sentiment score from raw text and the incorporation of that score into the trading strategy. The more accurate the sentiment score, the better the likelihood of the algorithmic trading strategy predicting potential stock price movements. In that previous lab, I used the vaderSentiment library. This time, I've decided to explore another NLP contender, the FinBERT NLP algorithm, and compare its sentiment score accuracy against Vader's, with the intent of improving trading strategy returns. The primary data source remains unchanged: leveraging the Yahoo Finance API available on RapidAPI Hub, I've sourced the news data for our sentiment analysis exercise. I used a Python Jupyter Notebook as the development playground for this experiment. In my Jupyter notebook, I first call the API class that retrieves market data from Yahoo and converts the JSON response into a Pandas data frame. You can find this code in my previous lab or the GitHub repo. I then apply the Vader and FinBERT ML algorithms to the "Headline" column in the data frame to compute the corresponding sentiment scores and add them as a new sentiment score column for each NLP ML algorithm. A manual comparison of these scores shows that the FinBERT ML algorithm is more accurate. I have also introduced a significant code restructure by incorporating the following SOLID principles:

Single responsibility principle: Market news preparation logic has been consolidated into the API class.
Open/closed principle: Both the Vader- and FinBERT-specific logic reside in subclasses of SentimentAnalysisBase.

Python
import nltk
import plotly.express as px
import plotly.graph_objects as go
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from transformers import pipeline


class SentimentAnalysisBase():

    def calc_sentiment_score(self):
        pass

    def plot_sentiment(self) -> go.Figure:
        column = 'sentiment_score'
        # Drop rows with a zero score so the chart only shows signal.
        df_plot = self.df.drop(
            self.df[self.df[f'{column}'] == 0].index)
        fig = px.bar(data_frame=df_plot, x=df_plot['Date Time'], y=f'{column}',
                     title=f"{self.symbol} Hourly Sentiment Scores")
        return fig


class FinbertSentiment(SentimentAnalysisBase):

    def __init__(self):
        self._sentiment_analysis = pipeline(
            "sentiment-analysis", model="ProsusAI/finbert")
        super().__init__()

    def calc_sentiment_score(self):
        self.df['sentiment'] = self.df['Headline'].apply(
            self._sentiment_analysis)
        # Map the FinBERT label to a sign (negative -> -1, positive -> 1,
        # neutral -> 0) and scale it by the model's confidence score.
        self.df['sentiment_score'] = self.df['sentiment'].apply(
            lambda x: {x[0]['label'] == 'negative': -1,
                       x[0]['label'] == 'positive': 1}.get(True, 0) * x[0]['score'])
        super().calc_sentiment_score()


class VaderSentiment(SentimentAnalysisBase):

    nltk.downloader.download('vader_lexicon')

    def __init__(self) -> None:
        self.vader = SentimentIntensityAnalyzer()
        super().__init__()

    def calc_sentiment_score(self):
        self.df['sentiment'] = self.df['Headline'].apply(
            self.vader.polarity_scores)
        self.df['sentiment_score'] = self.df['sentiment'].apply(
            lambda x: x['compound'])
        super().calc_sentiment_score()

I hope this article was worth your time. You can find the code in this GitHub repo. Happy coding!!!
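For readers who want to see how the pieces above fit together, here is a rough usage sketch. The hard-coded headlines and the direct assignment of df and symbol are assumptions for illustration only; the real wiring, including the Yahoo Finance API call, lives in the GitHub repo.

Python
import pandas as pd

# Hypothetical headlines standing in for the Yahoo Finance API response.
df = pd.DataFrame({
    "Date Time": pd.to_datetime(["2023-09-01 09:00", "2023-09-01 10:00"]),
    "Headline": ["Acme Corp beats earnings expectations",
                 "Acme Corp faces regulatory probe"],
})

sentiment = FinbertSentiment()    # or VaderSentiment() for comparison
sentiment.df = df                 # assumed: the base class reads self.df
sentiment.symbol = "ACME"         # assumed: only used for the chart title
sentiment.calc_sentiment_score()  # adds 'sentiment' and 'sentiment_score' columns
sentiment.plot_sentiment().show() # bar chart of the non-zero hourly scores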
Our industry is in the early days of an explosion in software using LLMs, as well as (separately, but relatedly) a revolution in how engineers write and run code, thanks to generative AI. Many software engineers are encountering LLMs for the very first time, while many ML engineers are being exposed directly to production systems for the very first time. Both types of engineers are finding themselves plunged into a disorienting new world — one where a particular flavor of production problem they may have encountered occasionally in their careers is now front and center. Namely, that LLMs are black boxes that produce nondeterministic outputs and cannot be debugged or tested using traditional software engineering techniques. Hooking these black boxes up to production introduces reliability and predictability problems that can be terrifying. It's important to understand this, and why.

100% Debuggable? Maybe Not

Software is traditionally assumed to be testable, debuggable, and reproducible, depending on the flexibility and maturity of your tooling and the complexity of your code. The original genius of computing was one of constraint: by radically constraining language and mathematics to a defined set, we could create algorithms that would run over and over and always return the same result. In theory, all software is debuggable. However, there are lots of things that can chip away at that beauteous goal and make your software mathematically less than 100% debuggable, like:

Adding concurrency and parallelism
Certain types of bugs
Stacking multiple layers of abstractions (e.g., containers)
Randomness
Using JavaScript (HA HA)

There is a much longer list of things that make software less than 100% debuggable in practice. Some of them are cost/benefit trade-offs, but most come down to weak telemetry, instrumentation, and tooling. If you have only instrumented your software with metrics, for example, you have no way of verifying whether a spike in api_requests and an identical spike in 503 errors are for the same events (i.e., a lot of api_requests are returning 503) or for a disjoint set of events (the spike in api_requests is causing general congestion, causing a spike in 503s across ALL events). It is mathematically impossible; all you can do is guess. But if you have a log line that emits both the request_path and the error_code, and a tool that lets you break down and group by arbitrary dimensions, this question would be extremely easy to answer. Or if you emit a lot of events or wide log lines but cannot trace them, or determine what order things executed in, there will be lots of other questions you won't be able to answer. There is another category of software errors that are logically possible to debug, but prohibitively expensive in practice. Any time you see a report from a big company that tracked down some obscure error in a kernel or an ethernet device, you're looking at one of the rare entities with 1) enough traffic for these one-in-a-billion errors to be meaningful, and 2) enough raw engineering power to dedicate to something most of us just have to live with. But software is typically understandable because we have given it structure and constraints. IF (); THEN (); ELSE () is testable and reproducible. Natural languages, on the other hand, are infinitely more expressive than programming languages, query languages, or even a UI that users interact with.
The most common and repeated patterns may be fairly predictable, but the long tail your users will create is very long, and they expect meaningful results there, as well. For complex reasons that we won't get into here, LLMs tend to have a lot of randomness in the long tail of possible results. So with software, if you ask the exact same question, you will always get the exact same answer. With LLMs, you might not.

LLMs Are Their Own Beast

Unit testing involves asserting predictable outputs for defined inputs, but this obviously cannot be done with LLMs. Instead, ML teams typically build evaluation systems to evaluate the effectiveness of the model or prompt. However, to get an effective evaluation system bootstrapped in the first place, you need quality data based on real use of an ML model. With software, you typically start with tests and graduate to production. With ML, you have to start with production to generate your tests. Even bootstrapping with early access programs or limited user testing can be problematic. It might be OK for launching a brand-new feature, but it's not good enough for a real production use case. Early access programs and user testing often fail to capture the full range of user behavior and potential edge cases that arise in real-world usage with a wide range of users. All these programs do is delay the inevitable failures you'll encounter when an uncontrolled and unprompted group of end users does things you never expected them to do. Instead of relying on an elaborate test harness to give you confidence in your software a priori, it's a better idea to embrace a "ship to learn" mentality and release features earlier, then systematically learn from what is shipped and wrap that back into your evaluation system. And once you have a working evaluation set, you also need to figure out how quickly the result set is changing. Phillip gives this list of things to be aware of when building with LLMs:

Failure will happen — it's a question of when, not if.
Users will do things you can't possibly predict.
You will ship a "bug fix" that breaks something else.
You can't really write unit tests for this (nor practice TDD).
Latency is often unpredictable.
Early access programs won't help you.

Sound at all familiar?

Observability-Driven Development Is Necessary With LLMs

Over the past decade or so, teams have increasingly come to grips with the reality that the only way to write good software at scale is by looping in production via observability — not by test-driven development, but observability-driven development. This means shipping sooner, observing the results, and wrapping your observations back into the development process. Modern applications are dramatically more complex than they were a decade ago. As systems get increasingly more complex, and nondeterministic outputs and emergent properties become the norm, the only way to understand them is by instrumenting the code and observing it in production. LLMs are simply on the far end of a spectrum that has become ever more unpredictable and unknowable. Observability — both as a practice and a set of tools — tames that complexity and allows you to understand and improve your applications. We have written a lot about what differentiates observability from monitoring and logging, but the most important bits are 1) the ability to gather and store telemetry as very wide events, ordered in time as traces, and 2) the ability to break down and group by any arbitrary, high-cardinality dimension.
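As a loose illustration of what "very wide events" means in practice, the sketch below emits one structured event per unit of work, carrying every dimension you might later want to break down by. The field names and the emit() sink are invented for this sketch; in a real system, the event would go to your observability backend rather than stdout.

Python
import json
import time
import uuid

def emit(event: dict) -> None:
    """Placeholder sink: in practice, send this to your observability backend."""
    print(json.dumps(event))

def handle_request(request_path: str, user_id: str, model: str, prompt: str) -> None:
    event = {
        "trace_id": str(uuid.uuid4()),   # lets events be ordered and correlated as traces
        "timestamp": time.time(),
        "request_path": request_path,
        "user_id": user_id,              # high-cardinality dimensions are the point
        "model": model,
        "prompt_length": len(prompt),
        "error_code": None,
    }
    try:
        ...  # call the LLM, parse and validate the response, etc.
    except Exception as exc:
        event["error_code"] = type(exc).__name__
        raise
    finally:
        emit(event)  # one wide event per request, with every dimension attached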
These capabilities allow you to explore your data and group by frequency, input, or result. In the past, we used to warn developers that their software usage patterns were likely to be unpredictable and change over time; now we inform you that if you use LLMs, your data set is going to be unpredictable, it will absolutely change over time, and you must have a way of gathering, aggregating, and exploring that data without locking it into predefined data structures. With good observability data, you can use that same data to feed back into your evaluation system and iterate on it in production. The first step is to use this data to evaluate how representative your production data set is, which you can derive from the quantity and diversity of use cases. You can make a surprising amount of improvements to an LLM-based product without even touching any prompt engineering, simply by examining user interactions, scoring the quality of the response, and acting on the correctable errors (mainly data model mismatches and parsing/validation checks). You can fix or handle these manually in the code, which will also give you a bunch of test cases proving that your corrections actually work. These tests will not verify that a particular input always yields a correct final output, but they will verify that a correctable LLM output can indeed be corrected. You can go a long way in the realm of pure software without reaching for prompt engineering. But ultimately, the only way to improve LLM-based software is by adjusting the prompt, scoring the quality of the responses (or relying on scores provided by end users), and readjusting accordingly. In other words, improving software that uses LLMs can only be done through observability and experimentation. Tweak the inputs, evaluate the outputs, and every now and again, check your dataset for representativity drift. Software engineers who are used to boolean/discrete math and TDD now need to concern themselves with data quality, representativity, and probabilistic systems. ML engineers need to spend more time learning how to develop products and concern themselves with user interactions and business use cases. Everyone needs to think more holistically about business goals and product use cases. There's no such thing as an LLM that gives good answers that don't serve the business reason it exists, after all.

So, What Do You Need to Get Started With LLMs?

Do you need to hire a bunch of ML experts in order to start shipping LLM software? Not necessarily. You cannot (there aren't enough of them), you should not (this is something everyone needs to learn), and you don't want to (these are changes that will make software engineers categorically more effective at their jobs). Obviously, you will need ML expertise if your goal is to build something complex or ambitious, but entry-level LLM usage is well within the purview of most software engineers. It is definitely easier for software engineers to dabble in using LLMs than it is for ML engineers to dabble in writing production applications. But learning to write and maintain software in the era of LLMs is going to transform your engineers and your engineering organizations. And not a minute too soon. The hardest part of software has always been running it, maintaining it, and understanding it — in other words, operating it. But this reality has been obscured for many years by the difficulty and complexity of writing software.
We can't help but notice the upfront cost of writing software, while the cost of operating it gets amortized over many years, people, and teams, which is why we have historically paid and valued software engineers who write code more than those who own and operate it. When people talk about the 10x engineer, everyone automatically assumes it means someone who churns out 10x as many lines of code, not someone who can operate 10x as much software. But generative AI is about to turn all of these assumptions upside down. All of a sudden, writing software is as easy as sneezing. Anyone can use ChatGPT or other tools to generate reams of code in seconds. But understanding it, owning it, operating it, extending and maintaining it... all of these are more challenging than ever, because in the past, the way most of us learned to understand software was by writing it. What can we possibly do to make sure our code makes sense and works, and is extendable and maintainable (and our code base is consistent and comprehensible) when we didn't go through the process of writing it? Well, we are in the early days of figuring that out, too.

If you're an engineer who cares about your craft:

Do code reviews.
Follow coding standards and conventions.
Write (or generate) tests for it.

But ultimately, the only way you can know for sure whether or not it works is to ship it to production and watch what happens. This has always been true, by the way. It's just more true now.

If you're an engineer adjusting to the brave new era: Take some of that time you used to spend writing lines of code and reinvest it back into understanding, shipping under controlled circumstances, and observing. This means instrumenting your code with intention, and inspecting its output. This means shipping as soon as possible into the production environment. This means using feature flags to decouple deploys from releases and gradually roll new functionality out in a controlled fashion. Invest in these — and other — guardrails to make the process of shipping software more safe, fine-grained, and controlled. Most of all, it means developing the habit of looking at your code in production, through the lens of your telemetry, and asking yourself: Does this do what I expected it to do? Does anything else look weird? Or maybe I should say "looking at your systems" instead of "looking at your code," since people might confuse the latter with an admonition to "read the code." The days when you could predict how your system would behave simply by reading lines of code are long, long gone. Software behaves in unpredictable, emergent ways, and the important part is observing your code as it's running in production, while users are using it. Code in a buffer can tell you very little.

This Future Is a Breath of Fresh Air

This, for once, is not a future I am afraid of. It's a future I cannot wait to see manifest. For years now, I've been giving talks on modern best practices for software engineering — developers owning their code in production, testing in production, observability-driven development, continuous delivery in a tight feedback loop, separating deploys from releases using feature flags. No one really disputes that life is better, code is better, and customers are happier when teams adopt these practices. Yet, only 11% of teams can deploy their code in less than a day, according to the DORA report. Only a tiny fraction of teams are operating in the way everybody agrees we all should! Why?
The answers often boil down to organizational roadblocks, absurd security and compliance policies, or a lack of buy-in and prioritization. Saddest of all are the ones who say something like, "our team just isn't that good," or "our people just aren't that smart," or "that only works for world-class teams like the Googles of the world." Completely false. Do you know what's hard? Trying to build, run, and maintain software on a two-month delivery cycle. Running with a tight feedback loop is so much easier.

Just Do the Thing

So how do teams get over this hump and prove to themselves that they can have nice things? In my experience, only one thing works: someone joins the team who has seen it work before, has confidence in the team's abilities, and is empowered to start making progress against those metrics (which they tend to try to do, because people who have tried writing code the modern way become extremely unwilling to go back to the bad old ways). And why is this relevant? I hypothesize that over the course of the next decade, developing with LLMs will stop being anything special and will simply be one skill set of many, alongside mobile development, web development, etc. I bet most engineers will be writing code that interacts with an LLM. I bet it will become not quite as common as databases, but up there. And while they're doing that, they will have to learn how to develop using short feedback loops, testing in production, observability-driven development, etc. And once they've tried it, they too may become extremely unwilling to go back. In other words, LLMs might ultimately be the Trojan Horse that drags software engineering teams into the modern era of development best practices. (We can hope.) In short, LLMs demand we modify our behavior and tooling in ways that will benefit even ordinary, deterministic software development. Ultimately, these changes are a gift to us all, and the sooner we embrace them, the better off we will be.