A database is an organized collection of structured data stored in a computer system, hosted either on-premises or in the cloud. Because databases exist to make data easy to access, we have compiled our resources here so you can browse everything you need to know, from database management systems to database languages.
In part one of this two-part series, we looked at how walletless dApps smooth the web3 user experience by abstracting away the complexities of blockchains and wallets. Thanks to account abstraction from Flow and the Flow Wallet API, we can easily build walletless dApps that enable users to sign up with credentials that they're accustomed to using (such as social logins or email accounts). We began our walkthrough by building the backend of our walletless dApp. Here in part two, we'll wrap up our walkthrough by building the front end. Here we go! Create a New Next.js Application Let's use the Next.js framework so we have the frontend and backend in one application. On our local machine, we will use create-next-app to bootstrap our application. This will create a new folder for our Next.js application. We run the following command: Shell $ npx create-next-app flow_walletless_app Some options will appear; you can mark them as follows (or as you prefer!). Make sure to choose No for using Tailwind CSS and the App Router. This way, your folder structure and style references will match what I demo in the rest of this tutorial. Shell ✔ Would you like to use TypeScript with this project? ... Yes ✔ Would you like to use ESLint with this project? ... No ✔ Would you like to use Tailwind CSS with this project? ... No <-- IMPORTANT ✔ Would you like to use `src/` directory with this project? ... No ✔ Use App Router (recommended)? ... No <-- IMPORTANT ✔ Would you like to customize the default import alias? ... No Start the development server. Shell $ npm run dev The application will run on port 3001 because the default port (3000) is occupied by our wallet API running through Docker. Set Up Prisma for Backend User Management We will use the Prisma library as an ORM to manage our database. When a user logs in, we store their information in a database at a user entity. This contains the user's email, token, Flow address, and other information. The first step is to install the Prisma dependencies in our Next.js project: Shell $ npm install prisma --save-dev To use Prisma, we need to initialize the Prisma Client. Run the following command: Shell $ npx prisma init The above command will create two files: prisma/schema.prisma: The main Prisma configuration file, which will host the database configuration .env: Will contain the database connection URL and other environment variables Configure the Database Used by Prisma We will use SQLite as the database for our Next.js application. Open the schema.prisma file and change the datasource db settings as follows: Shell datasource db { provider = "sqlite" url = env("DATABASE_URL") } Then, in our .env file for the Next.js application, we will change the DATABASE_URL field. Because we’re using SQLite, we need to define the location (which, for SQLite, is a file) where the database will be stored in our application: Shell DATABASE_URL="file:./dev.db" Create a User Model Models represent entities in our app. The model describes how the data should be stored in our database. Prisma takes care of creating tables and fields. Let’s add the following User model in out schema.prisma file: Shell model User { id Int @id @default(autoincrement()) email String @unique name String? flowWalletJobId String? flowWalletAddress String? createdAt DateTime @default(now()) updatedAt DateTime @updatedAt } With our model created, we need to synchronize with the database. 
For this, Prisma offers a command: Shell $ npx prisma db push Environment variables loaded from .env Prisma schema loaded from prisma/schema.prisma Datasource "db": SQLite database "dev.db" at "file:./dev.db" SQLite database dev.db created at file:./dev.db -> Your database is now in sync with your Prisma schema. Done in 15ms After successfully pushing our users table, we can use Prisma Studio to track our database data. Run the command: Shell $ npx prisma studio Set up the Prisma Client That's it! Our entity and database configuration are complete. Now let's go to the client side. We need to install the Prisma client dependencies in our Next.js app. To do this, run the following command: Shell $ npm install @prisma/client Generate the client from the Prisma schema file: Shell $ npx prisma generate Create a folder named lib in the root folder of your project. Within that folder, create a file entitled prisma.ts. This file will host the client connection. Paste the following code into that file: TypeScript // lib/prisma.ts import { PrismaClient } from '@prisma/client'; let prisma: PrismaClient; if (process.env.NODE_ENV === "production") { prisma = new PrismaClient(); } else { let globalWithPrisma = global as typeof globalThis & { prisma: PrismaClient; }; if (!globalWithPrisma.prisma) { globalWithPrisma.prisma = new PrismaClient(); } prisma = globalWithPrisma.prisma; } export default prisma; Build the Next.js Application Frontend Functionality With our connection on the client part finalized, we can move on to the visual part of our app! Replace the code inside pages/index.tsx file, delete all lines of code and paste in the following code: TypeScript # pages/index.tsx import styles from "@/styles/Home.module.css"; import { Inter } from "next/font/google"; import Head from "next/head"; const inter = Inter({ subsets: ["latin"] }); export default function Home() { return ( <> <Head> <title>Create Next App</title> <meta name="description" content="Generated by create next app" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <link rel="icon" href="/favicon.ico" /> </Head> <main className={styles.main}> <div className={styles.card}> <h1 className={inter.className}>Welcome to Flow Walletless App!</h1> <div style={{ display: "flex", flexDirection: "column", gap: "20px", margin: "20px", } > <button style={{ padding: "20px", width: 'auto' }>Sign Up</button> <button style={{ padding: "20px" }>Sign Out</button> </div> </div> </main> </> ); } In this way, we have the basics and the necessities to illustrate the creation of wallets and accounts! The next step is to configure the Google client to use the Google API to authenticate users. Set up Use of Google OAuth for Authentication We will need Google credentials. For that, open your Google console. Click Create Credentials and select the OAuth Client ID option. Choose Web Application as the application type and define a name for it. We will use the same name: flow_walletless_app. Add http://localhost:3001/api/auth/callback/google as the authorized redirect URI. Click on the Create button. A modal should appear with the Google credentials. We will need the Client ID and Client secret to use in our .env file shortly. Next, we’ll add the next-auth package. 
To do this, run the following command: Shell $ npm i next-auth Open the .env file and add the following new environment variables to it: Shell GOOGLE_CLIENT_ID= <GOOGLE CLIENT ID> GOOGLE_CLIENT_SECRET=<GOOGLE CLIENT SECRET> NEXTAUTH_URL=http://localhost:3001 NEXTAUTH_SECRET=<YOUR NEXTAUTH SECRET> Paste in your copied Google Client ID and Client Secret. The NextAuth secret can be generated via the terminal with the following command: Shell $ openssl rand -base64 32 Copy the result, which should be a random string of letters, numbers, and symbols. Use this as your value for NEXTAUTH_SECRET in the .env file. Configure NextAuth to Use Google Next.js allows you to create serverless API routes without creating a full backend server. Each file under api is treated like an endpoint. Inside the pages/api/ folder, create a new folder called auth. Then create a file in that folder, called [...nextauth].ts, and add the code below: TypeScript // pages/api/auth/[...nextauth].ts import NextAuth from "next-auth" import GoogleProvider from "next-auth/providers/google"; export default NextAuth({ providers: [ GoogleProvider({ clientId: process.env.GOOGLE_CLIENT_ID as string, clientSecret: process.env.GOOGLE_CLIENT_SECRET as string, }) ], }) Update _app.tsx file to use NextAuth SessionProvider Modify the _app.tsx file found inside the pages folder by adding the SessionProvider from the NextAuth library. Your file should look like this: TypeScript // pages/_app.tsx import "@/styles/globals.css"; import { SessionProvider } from "next-auth/react"; import type { AppProps } from "next/app"; export default function App({ Component, pageProps }: AppProps) { return ( <SessionProvider session={pageProps.session}> <Component {...pageProps} /> </SessionProvider> ); } Update the Main Page To Use NextAuth Functions Let us go back to our index.tsx file in the pages folder. We need to import the functions from the NextAuth library and use them to log users in and out. Our update index.tsx file should look like this: TypeScript // pages/index.tsx import styles from "@/styles/Home.module.css"; import { Inter } from "next/font/google"; import Head from "next/head"; import { useSession, signIn, signOut } from "next-auth/react"; const inter = Inter({ subsets: ["latin"] }); export default function Home() { const { data: session } = useSession(); console.log("session data",session) const signInWithGoogle = () => { signIn(); }; const signOutWithGoogle = () => { signOut(); }; return ( <> <Head> <title>Create Next App</title> <meta name="description" content="Generated by create next app" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <link rel="icon" href="/favicon.ico" /> </Head> <main className={styles.main}> <div className={styles.card}> <h1 className={inter.className}>Welcome to Flow Walletless App!</h1> <div style={{ display: "flex", flexDirection: "column", gap: "20px", margin: "20px", } > <button onClick={signInWithGoogle} style={{ padding: "20px", width: "auto" }>Sign Up</button> <button onClick={signOutWithGoogle} style={{ padding: "20px" }>Sign Out</button> </div> </div> </main> </> ); } Build the “Create User” Endpoint Let us now create a users folder underneath pages/api. Inside this new folder, create a file called index.ts. 
This file is responsible for: Creating a user (first we check if this user already exists) Calling the Wallet API to create a wallet for this user Calling the Wallet API and retrieving the jobId data if the User entity does not yet have the address created These actions are performed within the handle function, which calls the checkWallet function. Paste the following snippet into your index.ts file: TypeScript // pages/api/users/index.ts import { User } from "@prisma/client"; import { BaseNextRequest, BaseNextResponse } from "next/dist/server/base-http"; import prisma from "../../../lib/prisma"; export default async function handle( req: BaseNextRequest, res: BaseNextResponse ) { const userEmail = JSON.parse(req.body).email; const userName = JSON.parse(req.body).name; try { const user = await prisma.user.findFirst({ where: { email: userEmail, }, }); if (user == null) { await prisma.user.create({ data: { email: userEmail, name: userName, flowWalletAddress: null, flowWalletJobId: null, }, }); } else { await checkWallet(user); } } catch (e) { console.log(e); } } const checkWallet = async (user: User) => { const jobId = user.flowWalletJobId; const address = user.flowWalletAddress; if (address != null) { return; } if (jobId != null) { const request: any = await fetch(`http://localhost:3000/v1/jobs/${jobId}`, { method: "GET", }); const jsonData = await request.json(); if (jsonData.state === "COMPLETE") { const address = await jsonData.result; await prisma.user.update({ where: { id: user.id, }, data: { flowWalletAddress: address, }, }); return; } if (request.data.state === "FAILED") { const request: any = await fetch("http://localhost:3000/v1/accounts", { method: "POST", }); const jsonData = await request.json(); await prisma.user.update({ where: { id: user.id, }, data: { flowWalletJobId: jsonData.jobId, }, }); return; } } if (jobId == null) { const request: any = await fetch("http://localhost:3000/v1/accounts", { method: "POST", }); const jsonData = await request.json(); await prisma.user.update({ where: { id: user.id, }, data: { flowWalletJobId: jsonData.jobId, }, }); return; } }; POST requests to the api/users path will result in calling the handle function. We’ll get to that shortly, but first, we need to create another endpoint for retrieving existing user information. Build the “Get User” Endpoint We’ll create another file in the pages/api/users folder, called getUser.ts. This file is responsible for finding a user in our database based on their email. Copy the following snippet and paste it into getUser.ts: TypeScript // pages/api/users/getUser.ts import prisma from "../../../lib/prisma"; export default async function handle( req: { query: { email: string; }; }, res: any ) { try { const { email } = req.query; const user = await prisma.user.findFirst({ where: { email: email, }, }); return res.json(user); } catch (e) { console.log(e); } } And that's it! With these two files in the pages/api/users folder, we are ready for our Next.js application frontend to make calls to our backend. Add “Create User” and “Get User” Functions to Main Page Now, let’s go back to the pages/index.tsx file to add the new functions that will make the requests to the backend. 
Replace the contents of index.tsx file with the following snippet: TypeScript // pages/index.tsx import styles from "@/styles/Home.module.css"; import { Inter } from "next/font/google"; import Head from "next/head"; import { useSession, signIn, signOut } from "next-auth/react"; import { useEffect, useState } from "react"; import { User } from "@prisma/client"; const inter = Inter({ subsets: ["latin"] }); export default function Home() { const { data: session } = useSession(); const [user, setUser] = useState<User | null>(null); const signInWithGoogle = () => { signIn(); }; const signOutWithGoogle = () => { signOut(); }; const getUser = async () => { const response = await fetch( `/api/users/getUser?email=${session?.user?.email}`, { method: "GET", } ); const data = await response.json(); setUser(data); return data?.flowWalletAddress != null ? true : false; }; console.log(user) const createUser = async () => { await fetch("/api/users", { method: "POST", body: JSON.stringify({ email: session?.user?.email, name: session?.user?.name }), }); }; useEffect(() => { if (session) { getUser(); createUser(); } }, [session]); return ( <> <Head> <title>Create Next App</title> <meta name="description" content="Generated by create next app" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <link rel="icon" href="/favicon.ico" /> </Head> <main className={styles.main}> <div className={styles.card}> <h1 className={inter.className}>Welcome to Flow Walletless App!</h1> <div style={{ display: "flex", flexDirection: "column", gap: "20px", margin: "20px", } > {user ? ( <div> <h5 className={inter.className}>User Name: {user.name}</h5> <h5 className={inter.className}>User Email: {user.email}</h5> <h5 className={inter.className}>Flow Wallet Address: {user.flowWalletAddress ? user.flowWalletAddress : 'Creating address...'}</h5> </div> ) : ( <button onClick={signInWithGoogle} style={{ padding: "20px", width: "auto" } > Sign Up </button> )} <button onClick={signOutWithGoogle} style={{ padding: "20px" }> Sign Out </button> </div> </div> </main> </> ); } We have added two functions: getUser searches the database for a user with the email logged in. createUser creates a user or updates it if it does not have an address yet. We also added a useEffect that checks if the user is logged in with their Google account. If so, the getUser function is called, returning true if the user exists and has a registered email address. If not, we call the createUser function, which makes the necessary checks and calls. Test Our Next.js Application Finally, we restart our Next.js application with the following command: Shell $ npm run dev You can now sign in with your Google account, and the app will make the necessary calls to our wallet API to create a Flow Testnet address! This is the first step in the walletless Flow process! By following these instructions, your app will create users and accounts in a way that is convenient for the end user. But the wallet API does not stop there. You can do much more with it, such as execute and sign transactions, run scripts to fetch data from the blockchain, and more. Conclusion Account abstraction and walletless onboarding in Flow offer developers a unique solution. By being able to delegate control over accounts, Flow allows developers to create applications that provide users with a seamless onboarding experience. This will hopefully lead to greater adoption of dApps and a new wave of web3 users.
Vector technology in AI — often encountered through implementations such as vector indexes and vector search — offers a robust mechanism to index and query high-dimensional data spanning images, text, audio, and video. Its strengths become evident across diverse use cases like similarity-driven search, multi-modal retrieval, dynamic recommendation engines, and platforms leveraging the Retrieval Augmented Generation (RAG) paradigm. Because of this potential impact on a multitude of use cases, vectors have emerged as a hot topic. As you delve deeper and try to demystify the question "what precisely is vector search?", you are often greeted by a barrage of terms — AI, LLM, generative AI — to name a few. This article aims to paint a clearer picture (quite literally) by likening the concept to something we all know: colors.

Infinite hues bloom,
A million shades dance and play,
Colors light our world.

The so-called "official" colors alone span three long Wikipedia pages. While it's straightforward to store and search these colors by their names using conventional search indices like those in Elasticsearch or Couchbase FTS, there's a hitch. Think about the colors Navy and Ocean. Intuitively, they feel closely related, evoking images of deep, serene waters. Yet, linguistically, they share no common ground. This is where traditional search engines hit a wall. The typical workaround? Synonyms. You could map Navy to a plethora of related terms: blue, azure, ocean, turquoise, sky, and so on. But now consider the gargantuan task of doing this for every color name. Moreover, these lists don't give us a measure of the closeness between colors. Is azure closer to navy than sky is? A list won't tell you that. To put it simply, seeking similarities among colors is a daunting task. Trying to craft relationships between colors to gauge their similarity? Even more challenging.

The simple solution is the well-known RGB model. Encoding colors in the RGB vector scheme solves both the similarity and the distance problem. When we talk about a color's RGB values, we're essentially referencing its coordinates in a 3D space where each dimension can take values from 0 to 255, totaling 256 values. The vector (R, G, B) is defined by three components: the intensity of Red (R), the intensity of Green (G), and the intensity of Blue (B). Each component ranges from 0 to 255, allowing for over 16 million (16,777,216, to be exact) unique combinations, each representing a distinct color. For instance, the vector (255, 0, 0) signifies the full intensity of red with no contribution from green or blue, resulting in the color red. Here are sample RGB values for some colors:

Navy: (0, 0, 128)
Turquoise: (64, 224, 208)
Orange: (255, 165, 0)
Green: (0, 128, 0)
Gray: (128, 128, 128)

Each of these triples can be seen as a vector identifying a unique point in a color space containing 16,777,216 colors. Visualizing RGB values as vectors offers a profound advantage: the spatial proximity of two vectors gives a measure of color similarity. Colors that are close in appearance will have vectors that are close in RGB space. This vector representation, therefore, not only provides a means to encode colors but also allows for an intuitive understanding of color relationships and similarities.
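To make the idea of spatial proximity concrete, here is a minimal Java sketch (the class and method names are purely illustrative) that computes the Euclidean distance between two of the sample colors listed above. The smaller the distance, the more similar the colors.

Java
import java.util.Map;

public class ColorDistance {

    // Sample RGB values taken from the list above
    static final Map<String, int[]> COLORS = Map.of(
            "Navy", new int[]{0, 0, 128},
            "Turquoise", new int[]{64, 224, 208},
            "Orange", new int[]{255, 165, 0});

    // Euclidean distance between two colors in RGB space
    static double distance(int[] a, int[] b) {
        int dr = a[0] - b[0], dg = a[1] - b[1], db = a[2] - b[2];
        return Math.sqrt(dr * dr + dg * dg + db * db);
    }

    public static void main(String[] args) {
        System.out.printf("Navy -> Turquoise: %.1f%n",
                distance(COLORS.get("Navy"), COLORS.get("Turquoise")));
        System.out.printf("Navy -> Orange:    %.1f%n",
                distance(COLORS.get("Navy"), COLORS.get("Orange")));
    }
}

Running it shows that Navy is noticeably closer to Turquoise (roughly 246) than to Orange (roughly 330), which matches our intuition about the colors.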
Similarity Searching

To find colors close to (148, 201, 44) in RGB space — here, within a distance of 1 along each of the R, G, and B axes — we vary each value by one up and one down to create the search space. This generates 3 x 3 x 3 = 27 color combinations and gives us a list of similar colors along with their specific distances. This is like identifying a small cube inside the larger RGB cube...

Plain Text
(147, 200, 43), (147, 200, 44), (147, 200, 45)
(147, 201, 43), (147, 201, 44), (147, 201, 45)
(147, 202, 43), (147, 202, 44), (147, 202, 45)
(148, 200, 43), (148, 200, 44), (148, 200, 45)
(148, 201, 43), (148, 201, 44) <- This is the original color, (148, 201, 45)
(148, 202, 43), (148, 202, 44), (148, 202, 45)
(149, 200, 43), (149, 200, 44), (149, 200, 45)
(149, 201, 43), (149, 201, 44), (149, 201, 45)
(149, 202, 43), (149, 202, 44), (149, 202, 45)

All 27 of these colors are similar to our original color (148, 201, 44). This principle can be expanded to larger distances and to multiple ways of calculating the distance. Now let's see how this works if we store, index, and search RGB values in a database.

Similarity search on colors via the RGB model

Hopefully, this gave you a good understanding of how the RGB model encodes the color scheme and solves the similarity search problem. Now let's replace the RGB model with an LLM model and input text and images about tennis. We then search for "French open." Even though the input text or images didn't include "French open" directly, the effect of the similarity search is that Djokovic and the two tennis images will still be returned! That's the magic of the LLM model and vector search. Vector indexing and vector search follow the same path. RGB encodes the 16 million colors in 3 bytes, but real-world data is more complicated. Languages, images, and videos are much more complicated. Hence, vector databases use not three but 300, 3,000, or more dimensions to encode data. Because of this, we need novel methods to store, index, and run similarity searches efficiently. However, the core principle is the same. More on how vector indexing and searching is done in a future blog!
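For completeness, here is a small Java sketch of the search space described above. One detail worth noting: varying each channel by at most one unit keeps every candidate within one step per axis, but the straight-line (Euclidean) distances from the original color range from 0 (the color itself) up to √3 ≈ 1.73 for the corner points of the small cube. The class name is illustrative only.

Java
public class ColorNeighborhood {

    public static void main(String[] args) {
        int[] center = {148, 201, 44};   // the original color from the example

        // Vary each channel by -1, 0, +1 to build the 3 x 3 x 3 search space
        for (int dr = -1; dr <= 1; dr++) {
            for (int dg = -1; dg <= 1; dg++) {
                for (int db = -1; db <= 1; db++) {
                    int r = center[0] + dr, g = center[1] + dg, b = center[2] + db;
                    double dist = Math.sqrt(dr * dr + dg * dg + db * db);
                    System.out.printf("(%d, %d, %d) distance %.2f%n", r, g, b, dist);
                }
            }
        }
    }
}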
In today's data-driven world, the quest for efficient and flexible database solutions is an ongoing pursuit for developers and businesses alike. One such solution is HarperDB. HarperDB is a modern and versatile database management system with simplicity, speed, and scalability. In this article, we will delve into the world of HarperDB, exploring why it has gained popularity and what makes it a compelling choice for developers and organizations. Additionally, we will take our first steps towards integrating HarperDB with the Java programming language. Java is a widely adopted, robust, and platform-independent programming language known for its reliability in building diverse applications. By bridging the gap between HarperDB and Java, we will unlock many possibilities for managing and accessing data seamlessly. So, join us on this journey as we unravel HarperDB and embark on our first integration with plain Java. Discover how this combination can empower you to build efficient and responsive applications, streamline data management, and take your development projects to the next level. HarperDB: A Modern Database Solution HarperDB blends the simplicity of traditional functionality with the power and flexibility required by modern applications. Essentially, HarperDB is a globally distributed edge application platform comprised of an edge database, streaming broker, and user-defined applications, with near-zero latency, huge cost savings, and a superior developer experience. This versatility makes it an option for businesses and developers grappling with the complexities of managing diverse data sources. HarperDB can run anywhere from edge to cloud, with a user-friendly management interface that enables developers of any skill level to get up and running quickly. Unlike many traditional databases that require extensive setup, configuration, and database administration expertise, HarperDB streamlines these processes. This simplicity reduces the learning curve and saves valuable development time, allowing teams to focus on building applications rather than managing the database. Performance is critical to any database system, especially in today's real-time and data-intensive applications. HarperDB's architecture is designed for speed and scale, ensuring that data retrieval and processing happens at lightning speed. HarperDB offers horizontal scalability, allowing you to add resources seamlessly as your data grows. HarperDB goes beyond pigeonholing data into predefined structures. This flexibility is precious in today's data landscape, where information comes in diverse formats. With HarperDB, you can store, query, and analyze data in a way that aligns with your application's unique requirements without being constrained by rigid schemas. HarperDB enables cost savings in numerous ways. The ease of use and low maintenance requirements translate into reduced operational expenses. Additionally, HarperDB delivers the same throughput as existing solutions with less hardware (or enables you to use the same amount of hardware and have greater throughput). As we delve deeper into HarperDB's integration with Java, we will unlock the potential of this database system and explore how it can elevate your data projects to new heights. Installing HarperDB Locally In our exploration of HarperDB and its integration with Java, one of the first steps is to install HarperDB locally. While a cloud version is available, this article focuses on the local installation to provide you with hands-on experience. 
You can choose your preferred flavor and installation method from the official documentation here. However, for simplicity, we’ll demonstrate how to set up HarperDB using Docker, a popular containerization platform. Docker Installation Docker simplifies the process of installing and running HarperDB in a containerized environment. Please note that the following Docker command is for demonstration purposes and should not be used in production. In production, you should follow best practices for securing your database credentials. Here’s how to run HarperDB in a Docker container with a simple username and password: Shell docker run -d \ -e HDB_ADMIN_USERNAME=root \ -e HDB_ADMIN_PASSWORD=password \ -e HTTP_THREADS=4 \ -p 9925:9925 \ -p 9926:9926 \ harperdb/harperdb Let’s break down what this command does: -d: Runs the container in detached mode (in the background) -e HDB_ADMIN_USERNAME=root: Sets the admin username to root (you can change this) -e HDB_ADMIN_PASSWORD=password: Sets the admin password to password (remember to use a robust and secure password in production) -e HTTP_THREADS=4: Configures the number of HTTP threads for handling requests -p 9925:9925 and -p 9926:9926: Maps the container’s internal ports 9925 and 9926 to the corresponding ports on your host machine This local installation will serve as the foundation for exploring HarperDB’s capabilities and its integration with Java. In subsequent sections, we will dive deeper into using HarperDB and connecting it with Java to leverage its features for building robust and data-driven applications. Creating Schema, Table, and Fields in HarperDB Now that we have HarperDB running locally, let’s create a schema and table and define the fields for our “dev” schema and “person” table. We’ll perform these operations using HTTP requests. Please note that the authorization header in these requests uses a primary authentication method with the username “root” and password “password”. In a production environment, always ensure secure authentication methods. To start working with HarperDB locally, we must create a schema, define a table, and specify its fields. These operations can be performed through HTTP requests. In our example, we’ll create a dev schema and a "person" table with "id", "name", and "age" columns. We’ll use curl commands for this purpose. Before running these commands, ensure that your HarperDB Docker container is up and running, as explained earlier. Creating a Schema (‘dev’): Shell curl --location --request POST 'http://localhost:9925/' \ --header 'Authorization: Basic cm9vdDpwYXNzd29yZA==' \ --header 'Content-Type: application/json' \ --data-raw '{ "operation": "create_schema", "schema": "dev" }' This command sends an HTTP POST request to create a dev schema. The authorization header includes the basic authentication credentials (Base64 encoded username and password). Replace cm9vdDpwYXNzd29yZA== with your base64-encoded credentials. Creating a "person" Table With "id" as the Hash Attribute: Shell curl --location 'http://localhost:9925' \ --header 'Authorization: Basic cm9vdDpwYXNzd29yZA==' \ --header 'Content-Type: application/json' \ --data '{ "operation": "create_table", "schema": "dev", "table": "person", "hash_attribute": "id" }' This command creates a "person" table within the "dev" schema and designates the "id" column as the hash attribute. The "hash_attribute" is used for distributed data storage and retrieval. 
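As an aside, you don't have to use curl for these administrative calls. Below is a minimal, illustrative Java sketch that issues the same create_schema operation with the JDK's built-in HttpClient, under the same assumptions as above (HarperDB listening on localhost:9925 with the root/password credentials from the Docker example). It also shows how the Base64 Authorization value used in the curl commands is produced.

Java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class HarperDbAdmin {

    public static void main(String[] args) throws Exception {
        // Same credentials as the Docker example above -- do not hard-code these in production
        String credentials = Base64.getEncoder()
                .encodeToString("root:password".getBytes());

        // The same "create_schema" operation sent by the first curl command
        String body = "{\"operation\": \"create_schema\", \"schema\": \"dev\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9925/"))
                .header("Authorization", "Basic " + credentials)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + ": " + response.body());
    }
}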
Creating "name" and "age" Columns in the "person" Table: Shell curl --location 'http://localhost:9925' \ --header 'Authorization: Basic cm9vdDpwYXNzd29yZA==' \ --header 'Content-Type: application/json' \ --data '{ "operation": "create_attribute", "schema": "dev", "table": "person", "attribute": "name" }' curl --location 'http://localhost:9925' \ --header 'Authorization: Basic cm9vdDpwYXNzd29yZA==' \ --header 'Content-Type: application/json' \ --data '{ "operation": "create_attribute", "schema": "dev", "table": "person", "attribute": "age" }' These two commands create "name" and "age" columns within the "person" table. These columns define the structure of your data. With these HTTP requests, you’ve set up the schema, table, and columns in your local HarperDB instance. You are now ready to start working with data and exploring how to integrate HarperDB with Java for powerful data-driven applications. Exploring the Java Code for HarperDB Integration This session will explore the Java code to integrate HarperDB into a plain Java SE (Standard Edition) application. We will create a simple “Person” entity with “id”, “name”, and “age” fields. We must set up a Maven project and include the HarperDB JDBC driver to start. Step 1: Create a Maven Project Begin by creating a new Maven project using the Maven Quickstart Archetype. You can use the following command to create the project: Shell mvn archetype:generate -DgroupId=com.example -DartifactId=harperdb-demo -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false This command will generate a basic Maven project structure. Navigate to the project root directory. Step 2: Include the HarperDB JDBC Driver Download the HarperDB JDBC driver from the official HarperDB resources page: HarperDB Drivers. Extract the contents of the downloaded ZIP file. Create a new folder named lib in your project root directory. Copy the HarperDB JDBC driver JAR file from the extracted contents and paste it into the lib folder. Step 3: Update the Maven POM File Open the pom.xml file in your project. Add the following Maven dependency to include the HarperDB JDBC driver. Make sure to adjust the <version> and <systemPath> to match your JAR file: XML <dependency> <groupId>cdata.jdbc.harperdb</groupId> <artifactId>cdata.jdbc.harperdb</artifactId> <scope>system</scope> <version>1.0</version> <systemPath>${project.basedir}/lib/cdata.jdbc.harperdb.jar</systemPath> </dependency> This dependency instructs Maven to include the HarperDB JDBC driver JAR file as a system dependency for your project. Creating a Person Record and a PersonDAO Class for HarperDB Integration We will create a Person record, an immutable class introduced in Java for data modeling. We will also implement a PersonDAO class to interact with HarperDB using direct JDBC API calls. 1. Create the Person Record First, we define the Person record with three attributes: id, name, and age. We also provide a static factory method of for creating Person instances. This record simplifies data modeling and reduces code by automatically generating a constructor, accessor methods, and equals() and hashCode() implementations. Java public record Person(String id, String name, Integer age) { public static Person of(String name, Integer age) { return new Person(null, name, age); } } 2. Create the PersonDAO Class Next, we create the PersonDAO class, responsible for database operations using the HarperDB JDBC driver. 
This class provides methods for inserting, finding by ID, deleting, and retrieving all Person records from the database. Java import java.sql.Connection; import java.sql.DriverManager; import java.sql.SQLException; import java.util.ArrayList; import java.util.List; import java.util.Optional; import java.util.Properties; public class PersonDAO { private static final String INSERT = "INSERT INTO dev.person (name, age) VALUES (?, ?)"; private static final String SELECT = "select * From dev.person"; private static final String FIND_ID = "select * From dev.person where id = ?"; private static final String DELETE = "delete From dev.person where id = ?"; public void insert(Person person) throws SQLException { try(Connection connection = createConnection()){ var statement = connection.prepareStatement(INSERT); statement.setString(1, person.name()); statement.setInt(2, person.age()); statement.execute(); } } public Optional<Person> findById(String id) throws SQLException { try(Connection connection = createConnection()) { var statement = connection.prepareStatement(FIND_ID); statement.setString(1, id); var resultSet = statement.executeQuery(); if(resultSet.next()) { var name = resultSet.getString("name"); var age = resultSet.getInt("age"); return Optional.of(new Person(id, name, age)); } return Optional.empty(); } } public void delete(String id) throws SQLException { try(Connection connection = createConnection()) { var statement = connection.prepareStatement(DELETE); statement.setString(1, id); statement.execute(); } } public List<Person> findAll() throws SQLException { List<Person> people = new ArrayList<>(); try(Connection connection = createConnection()) { var statement = connection.prepareStatement(SELECT); var resultSet = statement.executeQuery(); while (resultSet.next()) { var id = resultSet.getString("id"); var name = resultSet.getString("name"); var age = resultSet.getInt("age"); people.add(new Person(id, name, age)); } } return people; } static Connection createConnection() throws SQLException { var properties = new Properties(); properties.setProperty("Server","http://localhost:9925/"); properties.setProperty("User","root"); properties.setProperty("Password","password"); return DriverManager.getConnection("jdbc:harperdb:", properties); } } With the Person record and PersonDAO class in place, you can now interact with HarperDB using Java, performing operations such as inserting, finding by ID, deleting, and retrieving Person records from the database. Adjust the database connection properties in the createConnection method to match your HarperDB setup. Executing the Java Application with HarperDB Integration With your Person record and PersonDAO class in place, you can execute the Java application to interact with HarperDB. Here’s your App class for implementing the application: Java import java.sql.SQLException; import java.util.List; public class App { public static void main(String[] args) throws SQLException { PersonDAO dao = new PersonDAO(); dao.insert(Person.of( "Ada", 10)); dao.insert(Person.of("Poliana", 20)); dao.insert(Person.of("Jhon", 30)); List<Person> people = dao.findAll(); people.forEach(System.out::println); System.out.println("Find by id: "); var id = people.get(0).id(); dao.findById(id).ifPresent(System.out::println); dao.delete(id); System.out.println("After delete: is present? " + dao.findById(id).isPresent()); } private App() { } } In this App class: We create an instance of the PersonDAO class to interact with the database. 
We insert sample Person records using the dao.insert(...) method. We retrieve all Person records using dao.findAll() and print them. We find a Person by ID and print it using dao.findById(...). We delete a Person by ID using dao.delete(...) and then check if it’s still in the database. Executing this App class will perform these operations against your HarperDB database, demonstrating how your Java application can interact with HarperDB using the Person record and PersonDAO class for database operations. Make sure to have HarperDB running and the HarperDB JDBC driver adequately configured in your project, as mentioned earlier in the article. Conclusion In our journey to explore HarperDB and its integration with Java, we’ve discovered a versatile and modern database solution that combines simplicity, speed, and flexibility to meet a wide range of data management needs. In our conclusion, we recap what we’ve learned and highlight the resources available for further exploration. Next Steps Documentation: For a deeper dive into HarperDB’s features and capabilities, consult the official documentation at HarperDB Documentation (linked earlier in this article). Sample Code: Explore practical examples and sample code for integrating HarperDB with Java in the HarperDB Samples GitHub Repository. Incorporating HarperDB into your Java applications empowers you to manage data efficiently, make informed decisions in real time, and build robust, data-driven solutions. Whether you’re developing IoT applications, web and mobile apps, or a global gaming solution, HarperDB is a modern and accessible choice.
This is an article from DZone's 2023 Database Systems Trend Report.For more: Read the Report Good database design is essential to ensure data accuracy, consistency, and integrity and that databases are efficient, reliable, and easy to use. The design must address the storing and retrieving of data quickly and easily while handling large volumes of data in a stable way. An experienced database designer can create a robust, scalable, and secure database architecture that meets the needs of modern data systems. Architecture and Design A modern data architecture for microservices and cloud-native applications involves multiple layers, and each one has its own set of components and preferred technologies. Typically, the foundational layer is constructed as a storage layer, encompassing one or more databases such as SQL, NoSQL, or NewSQL. This layer assumes responsibility for the storage, retrieval, and management of data, including tasks like indexing, querying, and transaction management. To enhance this architecture, it is advantageous to design a data access layer that resides above the storage layer but below the service layer. This data access layer leverages technologies like object-relational mapping or data access objects to simplify data retrieval and manipulation. Finally, at the topmost layer lies the presentation layer, where the information is skillfully presented to the end user. The effective transmission of data through the layers of an application, culminating in its presentation as meaningful information to users, is of utmost importance in a modern data architecture. The goal here is to design a scalable database with the ability to handle a high volume of traffic and data while minimizing downtime and performance issues. By following best practices and addressing a few challenges, we can meet the needs of today's modern data architecture for different applications. Figure 1: Layered architecture Considerations By taking into account the following considerations when designing a database for enterprise-level usage, it is possible to create a robust and efficient system that meets the specific needs of the organization while ensuring data integrity, availability, security, and scalability. One important consideration is the data that will be stored in the database. This involves assessing the format, size, complexity, and relationships between data entities. Different types of data may require specific storage structures and data models. For instance, transactional data often fits well with a relational database model, while unstructured data like images or videos may require a NoSQL database model. The frequency of data retrieval or access plays a significant role in determining the design considerations. In read-heavy systems, implementing a cache for frequently accessed data can enhance query response times. Conversely, the emphasis may be on lower data retrieval frequencies for data warehouse scenarios. Techniques such as indexing, caching, and partitioning can be employed to optimize query performance. Ensuring the availability of the database is crucial for maintaining optimal application performance. Techniques such as replication, load balancing, and failover are commonly used to achieve high availability. Additionally, having a robust disaster recovery plan in place adds an extra layer of protection to the overall database system. As data volumes grow, it is essential that the database system can handle increased loads without compromising performance. 
Employing techniques like partitioning, sharding, and clustering allows for effective scalability within a database system. These approaches enable the efficient distribution of data and workload across multiple servers or nodes. Data security is a critical consideration in modern database design, given the rising prevalence of fraud and data breaches. Implementing robust access controls, encryption mechanisms for sensitive personally identifiable information, and conducting regular audits are vital for enhancing the security of a database system. In transaction-heavy systems, maintaining consistency in transactional data is paramount. Many databases provide features such as appropriate locking mechanisms and transaction isolation levels to ensure data integrity and consistency. These features help to prevent issues like concurrent data modifications and inconsistencies. Challenges Determining the most suitable tool or technology for our database needs can be a challenge due to the rapid growth and evolving nature of the database landscape. With different types of databases emerging daily and even variations among vendors offering the same type, it is crucial to plan carefully based on your specific use cases and requirements. By thoroughly understanding our needs and researching the available options, we can identify the right tool with the appropriate features to meet our database needs effectively. Polyglot persistence is a consideration that arises from the demand of certain applications, leading to the use of multiple SQL or NoSQL databases. Selecting the right databases for transactional systems, ensuring data consistency, handling financial data, and accommodating high data volumes pose challenges. Careful consideration is necessary to choose the appropriate databases that can fulfill the specific requirements of each aspect while maintaining overall system integrity. Integrating data from different upstream systems, each with its own structure and volume, presents a significant challenge. The goal is to achieve a single source of truth by harmonizing and integrating the data effectively. This process requires comprehensive planning to ensure compatibility and future-proofing the integration solution to accommodate potential changes and updates. Performance is an ongoing concern in both applications and database systems. Every addition to the database system can potentially impact performance. To address performance issues, it is essential to follow best practices when adding, managing, and purging data, as well as properly indexing, partitioning, and implementing encryption techniques. By employing these practices, you can mitigate performance bottlenecks and optimize the overall performance of your database system. Considering these factors will contribute to making informed decisions and designing an efficient and effective database system for your specific requirements. Advice for Building Your Architecture Goals for a better database design should include efficiency, scalability, security, and compliance. In the table below, each goal is accompanied by its corresponding industry expectation, highlighting the key aspects that should be considered when designing a database for optimal performance, scalability, security, and compliance. GOALS FOR DATABASE DESIGN Goal Industry Expectation Efficiency Optimal performance and responsiveness of the database system, minimizing latency and maximizing throughput. Efficient handling of data operations, queries, and transactions. 
Scalability Ability to handle increasing data volumes, user loads, and concurrent transactions without sacrificing performance. Scalable architecture that allows for horizontal or vertical scaling to accommodate growth. Security Robust security measures to protect against unauthorized access, data breaches, and other security threats. Implementation of access controls, encryption, auditing mechanisms, and adherence to industry best practices and compliance regulations. Compliance Adherence to relevant industry regulations, standards, and legal requirements. Ensuring data privacy, confidentiality, and integrity. Implementing data governance practices and maintaining audit trails to demonstrate compliance. Table 1 When building your database architecture, it's important to consider several key factors to ensure the design is effective and meets your specific needs. Start by clearly defining the system's purpose, data types, volume, access patterns, and performance expectations. Consider clear requirements that provide clarity on the data to be stored and the relationships between the data entities. This will help ensure that the database design aligns with quality standards and conforms to your requirements. Also consider normalization, which enables efficient storage use by minimizing redundant data, improves data integrity by enforcing consistency and reliability, and facilitates easier maintenance and updates. Selecting the right database model or opting for polyglot persistence support is crucial to ensure the database aligns with your specific needs. This decision should be based on the requirements of your application and the data it handles. Planning for future growth is essential to accommodate increasing demand. Consider scalability options that allow your database to handle growing data volumes and user loads without sacrificing performance. Alongside growth, prioritize data protection by implementing industry-standard security recommendations and ensuring appropriate access levels are in place and encourage implementing IT security measures to protect the database from unauthorized access, data theft, and security threats. A good back-up system is a testament to the efficiency of a well-designed database. Regular backups and data synchronization, both on-site and off-site, provide protection against data loss or corruption, safeguarding your valuable information. To validate the effectiveness of your database design, test the model using sample data from real-world scenarios. This testing process will help validate the performance, reliability, and functionality of the database system you are using, ensuring it meets your expectations. Good documentation practices play a vital role in improving feedback systems and validating thought processes and implementations during the design and review phases. Continuously improving documentation will aid in future maintenance, troubleshooting, and system enhancement efforts. Primary and secondary keys contribute to data integrity and consistency. Use indexes to optimize database performance by indexing frequently queried fields and limiting the number of fields returned in queries. Regularly backing up the database protects against data loss during corruption, system failure, or other unforeseen circumstances. Data archiving and purging practices help remove infrequently accessed data, reducing the size of the active dataset. Proper error handling and logging aid in debugging, troubleshooting, and system maintenance. 
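As a small illustration of the indexing, parameterized access, and error-handling/logging advice above, here is a hedged Java/JDBC sketch. The table, column names, and JDBC URL are hypothetical placeholders; the point is the pattern: query only the fields you need on an indexed key, use try-with-resources so connections are always released, and log failures with enough context to troubleshoot.

Java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.logging.Level;
import java.util.logging.Logger;

public class CustomerLookup {

    private static final Logger LOG = Logger.getLogger(CustomerLookup.class.getName());

    // Query an indexed, frequently accessed field and limit the returned columns
    private static final String QUERY =
            "SELECT customer_id, name FROM customers WHERE customer_id = ?";

    public static void findCustomer(String jdbcUrl, long customerId) {
        try (Connection connection = DriverManager.getConnection(jdbcUrl);
             PreparedStatement statement = connection.prepareStatement(QUERY)) {

            statement.setLong(1, customerId);
            try (ResultSet resultSet = statement.executeQuery()) {
                if (resultSet.next()) {
                    LOG.info("Found customer: " + resultSet.getString("name"));
                } else {
                    LOG.warning("No customer with id " + customerId);
                }
            }
        } catch (SQLException e) {
            // Log enough context to troubleshoot without leaking sensitive data
            LOG.log(Level.SEVERE, "Customer lookup failed for id " + customerId, e);
        }
    }
}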
Regular maintenance is crucial for growing database systems. Plan and schedule regular backups, perform performance tuning, and stay up to date with software upgrades to ensure optimal database performance and stability. Conclusion Designing a modern data architecture that can handle the growing demands of today's digital world is not an easy job. However, if you follow best practices and take advantage of the latest technologies and techniques, it is very much possible to build a scalable, flexible, and secure database. It just requires the right mindset and your commitment to learning and improving with a proper feedback loop. Additional reading: Semantic Modeling for Data: Avoiding Pitfalls and Breaking Dilemmas by Panos Alexopoulos Learn PostgreSQL: Build and manage high-performance database solutions using PostgreSQL 12 and 13 by Luca Ferrari and Enrico Pirozzi Designing Data-Intensive Applications by Martin Kleppmann This is an article from DZone's 2023 Database Systems Trend Report.For more: Read the Report
This is an article from DZone's 2023 Database Systems Trend Report.For more: Read the Report Database design is a critical factor in microservices and cloud-native solutions because a microservices-based architecture results in distributed data. Instead of data management happening in a single process, multiple processes can manipulate the data. The rise of cloud computing has made data even more distributed. To deal with this complexity, several data management patterns have emerged for microservices and cloud-native solutions. In this article, we will look at the most important patterns that can help us manage data in a distributed environment. The Challenges of Database Design for Microservices and the Cloud Before we dig into the specific data management patterns, it is important to understand the key challenges with database design for microservices and the cloud: In a microservices architecture, data is distributed across different nodes. Some of these nodes can be in different data centers in completely different geographic regions of the world. In this situation, it is tough to guarantee consistency of data across all the nodes. At any given point in time, there can be differences in the state of data between various nodes. This is also known as the problem of eventual consistency. Since the data is distributed, there's no central authority that manages data like in single-node monolithic systems. It's important for the various participating systems to use a mechanism (e.g., consensus algorithms) for data management. The attack surface for malicious actors is larger in a microservices architecture since there are multiple moving parts. This means we need to establish a more robust security posture while building microservices. The main promise of microservices and the cloud is scalability. While it becomes easier to scale the application processes, it is not so easy to scale the database nodes horizontally. Without proper scalability, databases can turn into performance bottlenecks. Diving Into Data Management Patterns Considering the associated challenges, several patterns are available to manage data in microservices and cloud-native applications. The main job of these patterns is to facilitate the developers in addressing the various challenges mentioned above. Let's look at each of these patterns one by one. Database per Service As the name suggests, this pattern proposes that each microservices manages its own data. This implies that no other microservices can directly access or manipulate the data managed by another microservice. Any exchange or manipulation of data can be done only by using a set of well-defined APIs. The figure below shows an example of a database-per-service pattern. Figure 1: Database-per-service pattern At face value, this pattern seems quite simple. It can be implemented relatively easily when we are starting with a brand-new application. However, when we are migrating an existing monolithic application to a microservices architecture, the demarcation between services is not so clear. Most of the functionality is written in a way where different parts of the system access data from other parts informally. Two main areas that we need to focus on when using a database-per-service pattern: Defining bounded contexts for each service Managing business transactions spanning multiple microservices Shared Database The next important pattern is the shared database pattern. 
Though this pattern supports microservices architecture, it adopts a much more lenient approach by using a shared database accessible to multiple microservices. For existing applications transitioning to a microservices architecture, this is a much safer pattern, as we can slowly evolve the application layer without changing the database design. However, this approach takes away some benefits of microservices: Developers across teams need to coordinate schema changes to tables. Runtime conflicts may arise when multiple services are trying to access the same database resources. CQRS and Event Sourcing In the command query responsibility segregation (CQRS) pattern, an application listens to domain events from other microservices and updates a separate database for supporting views and queries. We can then serve complex aggregation queries from this separate database while optimizing the performance and scaling it up as needed. Event sourcing takes it a bit further by storing the state of the entity or the aggregate as a sequence of events. Whenever we have an update or an insert on an object, a new event is created and stored in the event store. We can use CQRS and event sourcing together to solve a lot of challenges around event handling and maintaining separate query data. This way, you can scale the writes and reads separately based on their individual requirements. Figure 2: Event sourcing and CQRS in action together On the downside, this is an unfamiliar style of building applications for most developers, and there are more moving parts to manage. Saga Pattern The saga pattern is another solution for handling business transactions across multiple microservices. For example, placing an order on a food delivery app is a business transaction. In the saga pattern, we break this business transaction into a sequence of local transactions handled by different services. For every local transaction, the service that performs the transaction publishes an event. The event triggers a subsequent transaction in another service, and the chain continues until the entire business transaction is completed. If any particular transaction in the chain fails, the saga rolls back by executing a series of compensating transactions that undo the impact of all the previous transactions. There are two types of saga implementations: Orchestration-based saga Choreography-based saga Sharding Sharding helps in building cloud-native applications. It involves separating rows of one table into multiple different tables. This is also known as horizontal partitioning, but when the partitions reside on different nodes, they are known as shards. Sharding helps us improve the read and write scalability of the database. Also, it improves the performance of queries because a particular query must deal with fewer records as a result of sharding. Replication Replication is a very important data management pattern. It involves creating multiple copies of the database. Each copy is identical and runs on a different server or node. Changes made to one copy are propagated to the other copies. This is known as replication. There are several types of replication approaches, such as: Single-leader replication Multi-leader replication Leaderless replication Replication helps us achieve high availability and boosts reliability, and it lets us scale out read operations since read requests can be diverted to multiple servers. Figure 3 below shows sharding and replication working in combination. 
Figure 3: Using sharding and replication together Best Practices for Database Design in a Cloud-Native Environment While these patterns can go a long way in addressing data management issues in microservices and cloud-native architecture, we also need to follow some best practices to make life easier. Here are a few best practices: We must try to design a solution for resilience. This is because faults are inevitable in a microservices architecture, and the design should accommodate failures and recover from them without disrupting the business. We must implement proper migration strategies when transitioning to one of the patterns. Some of the common strategies that can be evaluated are schema first versus data first, blue-green deployments, or using the strangler pattern. Don't ignore backups and well-tested disaster recovery systems. These things are important even for single-node databases. However, in a distributed data management approach, disaster recovery becomes even more important. Constant monitoring and observability are equally important in microservices or cloud-native applications. For example, techniques like sharding can lead to unbalanced partitions and hotspots. Without proper monitoring solutions, any reactions to such situations may come too late and may put the business at risk. Conclusion We can conclude that good database design is absolutely vital in a microservices and cloud-native environment. Without proper design, an application will face multiple problems due to the inherent complexity of distributed data. Multiple data management patterns exist to help us deal with data in a more reliable and scalable manner. However, each pattern has its own challenges and set of advantages and disadvantages. No pattern fits all the possible scenarios, and we should select a particular pattern only after managing the various trade-offs. This is an article from DZone's 2023 Database Systems Trend Report.For more: Read the Report
I have recently been working on a self-paced learning course for Spring Data Neo4j and wanted users to be able to test the database connection. Typically, in a Spring Boot application with Spring Data, you set the database credentials as properties in the application.properties file. You can run the application with just these details, and it will only fail if the database URI has improper syntax. The application does not actually test the connection to see if it is valid and successfully connects. In this blog post, I will show you how to test the connection to a Neo4j database from a Spring Boot application using the verifyConnectivity() method from the Driver class. Ways To Test the Connection You might ask, "Why doesn't the application test the connection?" This is because it isn't a config property, so we have to test it at runtime. There are a few different ways to go about this. Use a CommandLineRunner with the driver and use the verifyConnectivity() method. Move the CommandLineRunner to its own config class (cleaner). Write a test that uses the verifyConnectivity() method. Write application functionality (domain, repository, controller classes) that utilizes the connection. The last option is what I have done in the past because I wasn't focused solely on the connectivity step. However, it is not ideal for testing the connection because it requires you to write a lot of code that you don't need. If the connection is wrong, then we have to troubleshoot a lot more code when something else might actually be causing the problem. We want to deal only with the database connection. The first and second options were my next approach; they are pretty good, but they require you to run the whole application. Once you have the test method in place, it either runs every time the application starts, or you have to comment out/remove that piece of code. The third option is the best because it is a test that you can run at any time. It doesn't increase the overhead of the actual application, and you can run individual tests only when desired. This will be our goal, but I will show you how to write the first and second options, as well. The verifyConnectivity() Method First, let's look at the verifyConnectivity() method. I didn't realize this existed until now, so I did a bit of research. The info in the Java API docs says that it verifies the driver can connect to the database and throws an exception if it fails to connect. This is exactly what we want! The method is part of the Driver class, which is part of the Neo4j Java Driver. So, in order to execute the verifyConnectivity() method, we will need to create a driver object. Setup: Create a Spring Boot Project Let's start by creating a Spring Boot project. I like to do this through the Spring Initializr site. I will use the following settings: Project: Maven Project Language: Java Spring Boot: Latest stable release (currently 3.1.3) Project Metadata: Group: com.jmhreif Artifact: verify-connectivity Dependencies: Spring Data Neo4j Spring Initializr settings Once you have downloaded the project, open it in your preferred IDE. The first thing we will need to do is to set the database credentials in the application.properties file. This will give us something to test. If you don't already have an instance of Neo4j running, you can spin up a free cloud instance of Neo4j Aura in a few minutes. Neo4j Aura is a fully managed cloud database service. Once you have an instance, you can get the connection URI from the Aura console.
Next, open the application.properties file and add the following properties: Properties files spring.neo4j.uri=neo4j+s://dbhash.databases.neo4j.io spring.neo4j.authentication.username=neo4j spring.neo4j.authentication.password=test spring.data.neo4j.database=neo4j Note that you will need to update at least the URI and password fields to match your instance (the username and database fields are defaulted unless you customize them later). Now, we can create a CommandLineRunner class to test the connection. Each of the options we will cover in this post is in a separate branch in the accompanying GitHub project. You can follow along by checking out the branch for the option as we walk through each one. The main branch is the preferred solution using a test in the test class. Option 1: Method in main application class Option 2: Method in config class Option 3 (main): Test in test class Option 1: Use CommandLineRunner With our project ready, we can start adding code to test the database connection. Open the main application class (VerifyConnectivityApplication.java, if your project name is verify-connectivity) and add code so it matches the class below: Java @SpringBootApplication public class VerifyConnectivityApplication implements CommandLineRunner { public static void main(String[] args) { SpringApplication.run(VerifyConnectivityApplication.class, args); } final Driver driver; public VerifyConnectivityApplication(@Autowired Driver driver) { this.driver = driver; } public final void run(String... args) { driver.verifyConnectivity(); } } We have our class implement the CommandLineRunner interface so that the bean we create to test our connection is run on application startup. Next, we declare a Driver field and inject the driver through the constructor. The connection test itself happens in the run() method, which uses the Driver object to call its verifyConnectivity() method. If the connection is successful, the run completes and the application exits with a 0 status code. If the connection fails, the method will throw an exception and the application will exit with an error code. We can test this by running the application. If it exits with the 0 status code, then it works as it's supposed to. You can also test to make sure it fails by putting some bad data into the database properties in the application.properties file and running the app again. Testing the connection in the main application class isn't the best solution because we have cluttered up our main class with the test code. We can make this a bit cleaner by moving this code to its own config class. Option 2: Set Up a Config Class We are not really changing any functionality with this option, but are rather moving a chunk of configuration code to a separate class. This will allow us to keep our main application class clean and focused on the application's main functionality. First, we need to create a new Java class. You can name it anything you like, but I called it Config.java. Open the class and copy/paste the code from the main application class so that your config class looks like the following: Java @Configuration public class Config implements CommandLineRunner { final Driver driver; public Config(@Autowired Driver driver) { this.driver = driver; } public final void run(String... args) { driver.verifyConnectivity(); } } Ensure you remove the copied code from the main class, and then test the application again.
It should still work the same as before, with a 0 status code meaning success, but now we have separated the connection test code into its own configuration part of the application. This solution also isn't ideal because we still have to run the whole application to test the connection. We can do better by writing a test in the test class so that it only runs when we need to check that piece of functionality. Option 3: Write a Test The third option is the best one. It doesn't increase the overhead of the actual application, and we can run an individual test as needed. To do this, we need to open the VerifyConnectivityApplicationTests.java file and add the following code: Java @SpringBootTest class VerifyConnectivityApplicationTests { final Driver driver; public VerifyConnectivityApplicationTests(@Autowired Driver driver) { this.driver = driver; } @Test final void testConnection() { driver.verifyConnectivity(); } } You will also need to remove the Config.java class, as we don't need it anymore. Now, we can run the test, and it will verify the connection. If the connection is successful, then the test will pass. If the connection fails, then the test will fail. You can alter the values in the application.properties file to verify you get the expected results for both success and failure. This version of the code is much cleaner, and since we want to test a connection, it makes sense to put this in the test class. For more rigorous and comprehensive application testing, we could improve this further by using tools such as the Neo4j test harness or Testcontainers, but that is out of the scope of this blog post. In our case, it is sufficient to create a plain test that verifies our application can connect to the database. Wrap Up! In today's post, we saw how to use the verifyConnectivity() method to test the connection to a Neo4j database from a Spring Boot application. We saw three different ways to do this, and the pros and cons of each. We also discussed why the best option is to utilize the test class and write a test. If the connection succeeds, the test passes. If the connection fails, the test fails, and we can troubleshoot connection details. Happy debugging! Resources Documentation: Java API verifyConnectivity() method
In today’s dynamic world of web development, the foundation upon which we build our applications is crucial. At the heart of many modern web applications lies the unsung hero: the database. But how we interact with this foundation — how we query, shape, and manipulate our data — can mean the difference between an efficient, scalable app and one that buckles under pressure. Enter the formidable trio of Node.js, Knex.js, and PostgreSQL. Node.js, with its event-driven architecture, promises speed and efficiency. Knex.js, a shining gem in the Node ecosystem, simplifies database interactions, making them more intuitive and less error-prone. And then there’s PostgreSQL — a relational database that’s stood the test of time, renowned for its robustness and versatility. So, why this particular blend of technologies? And how can they be harnessed to craft resilient and reliable database models? Journey with us as we unpack the synergy of Node.js, Knex.js, and PostgreSQL, exploring the myriad ways they can be leveraged to elevate your web development endeavors. Initial Setup In a previous article, I delved into the foundational setup and initiation of services using Knex.js and Postgres. However, this article hones in on the intricacies of the model aspect in service development. I won’t be delving into Node.js setups or explaining the intricacies of Knex migrations and seeds in this piece, as all that information is covered in the previous article. Postgres Connection Anyway, let’s briefly create a database using docker-compose: YAML version: '3.6' volumes: data: services: database: build: context: . dockerfile: postgres.dockerfile image: postgres:latest container_name: postgres environment: TZ: Europe/Paris POSTGRES_DB: ${DB_NAME} POSTGRES_USER: ${DB_USER} POSTGRES_PASSWORD: ${DB_PASSWORD} networks: - default volumes: - data:/var/lib/postgresql/data ports: - "5432:5432" restart: unless-stopped Docker Compose Database Setup And in your .env file, add the values for the connection: Plain Text DB_HOST="localhost" DB_PORT=5432 DB_NAME="modeldb" DB_USER="testuser" DB_PASSWORD="DBPassword" Those environment variables will be used in the docker-compose file for launching your Postgres database. When all values are ready, we can start it with docker-compose up. Knex Setup Before diving into the Knex.js setup, note that we’ll be using Node.js version 18. To begin crafting models, we only need the following dependencies: "dependencies": { "dotenv": "^16.3.1", "express": "^4.18.2", "knex": "^2.5.1", "pg": "^8.11.3" } Create knexfile.ts and add the following content: TypeScript require('dotenv').config(); require('ts-node/register'); import type { Knex } from 'knex'; const environments: string[] = ['development', 'test', 'production']; const connection: Knex.ConnectionConfig = { host: process.env.DB_HOST as string, database: process.env.DB_NAME as string, user: process.env.DB_USER as string, password: process.env.DB_PASSWORD as string, }; const commonConfig: Knex.Config = { client: 'pg', connection, migrations: { directory: './database/migrations', }, seeds: { directory: './database/seeds', } }; export default Object.fromEntries(environments.map((env: string) => [env, commonConfig])); Knex File Configuration Next, in the root directory of your project, create a new folder named database. Within this folder, add an index.ts file. This file will serve as our main database connection handler, utilizing the configurations from knexfile.
Here's what the content index.ts should look like: TypeScript import Knex from 'knex'; import configs from '../knexfile'; export const database = Knex(configs[process.env.NODE_ENV || 'development']); Export database with applied configs This setup enables a dynamic database connection based on the current Node environment, ensuring that the right configuration is used whether you’re in a development, test, or production setting. Within your project directory, navigate to src/@types/index.ts. Here, we'll define a few essential types to represent our data structures. This will help ensure consistent data handling throughout our application. The following code outlines an enumeration of user roles and type definitions for both a user and a post: TypeScript export enum Role { Admin = 'admin', User = 'user', } export type User = { email: string; first_name: string; last_name: string; role: Role; }; export type Post = { title: string; content: string; user_id: number; }; Essential Types These types act as a blueprint, enabling you to define the structure and relationships of your data, making your database interactions more predictable and less prone to errors. After those setups, you can do migrations and seeds. Run npx knex migrate:make create_users_table: TypeScript import { Knex } from "knex"; import { Role } from "../../src/@types"; const tableName = 'users'; export async function up(knex: Knex): Promise<void> { return knex.schema.createTable(tableName, (table: Knex.TableBuilder) => { table.increments('id'); table.string('email').unique().notNullable(); table.string('password').notNullable(); table.string('first_name').notNullable(); table.string('last_name').notNullable(); table.enu('role', [Role.User, Role.Admin]).notNullable(); table.timestamps(true, true); }); } export async function down(knex: Knex): Promise<void> { return knex.schema.dropTable(tableName); } Knex Migration File for Users And npx knex migrate:make create_posts_table: TypeScript import { Knex } from "knex"; const tableName = 'posts'; export async function up(knex: Knex): Promise<void> { return knex.schema.createTable(tableName, (table: Knex.TableBuilder) => { table.increments('id'); table.string('title').notNullable(); table.string('content').notNullable(); table.integer('user_id').unsigned().notNullable(); table.foreign('user_id').references('id').inTable('users').onDelete('CASCADE'); table.timestamps(true, true); }); } export async function down(knex: Knex): Promise<void> { return knex.schema.dropTable(tableName); } Knex Migration File for Posts After setting things up, proceed by running npx knex migrate:latest to apply the latest migrations. Once this step is complete, you're all set to inspect the database table using your favorite GUI tool: Created Table by Knex Migration We are ready for seeding our tables. 
Run npx knex seed:make 01-users with the following content: TypeScript import { Knex } from 'knex'; import { faker } from '@faker-js/faker'; import { User, Role } from '../../src/@types'; const tableName = 'users'; export async function seed(knex: Knex): Promise<void> { await knex(tableName).del(); const users: User[] = [...Array(10).keys()].map(key => ({ email: faker.internet.email().toLowerCase(), first_name: faker.person.firstName(), last_name: faker.person.lastName(), role: Role.User, })); await knex(tableName).insert(users.map(user => ({ ...user, password: 'test_password' }))); } Knex Seed Users And for posts run npx knex seed:make 02-posts with the content: TypeScript import { Knex } from 'knex'; import { faker } from '@faker-js/faker'; import type { Post } from '../../src/@types'; const tableName = 'posts'; export async function seed(knex: Knex): Promise<void> { await knex(tableName).del(); const usersIds: Array<{ id: number }> = await knex('users').select('id'); const posts: Post[] = []; usersIds.forEach(({ id: user_id }) => { const randomAmount = Math.floor(Math.random() * 10) + 1; for (let i = 0; i < randomAmount; i++) { posts.push({ title: faker.lorem.words(3), content: faker.lorem.paragraph(), user_id, }); } }); await knex(tableName).insert(posts); } Knex Seed Posts The naming convention we’ve adopted for our seed files, 01-users and 02-posts, is intentional. This sequential naming ensures the proper order of seeding operations. Specifically, it prevents posts from being seeded before users, which is essential to maintain relational integrity in the database. Models and Tests As the foundation of our database is now firmly established with migrations and seeds, it’s time to shift our focus to another critical component of database-driven applications: models. Models act as the backbone of our application, representing the data structures and relationships within our database. They provide an abstraction layer, allowing us to interact with our data in an object-oriented manner. In this section, we’ll delve into the creation and intricacies of models, ensuring a seamless bridge between our application logic and stored data. In the src/models/Model/index.ts directory, we'll establish the foundational setup: TypeScript import { database } from 'root/database'; export abstract class Model { protected static tableName?: string; private static get table() { if (!this.tableName) { throw new Error('The table name must be defined for the model.'); } return database(this.tableName); } } Initial Setup for Model To illustrate how to leverage our Model class, let's consider the following example using TestModel: TypeScript class TestModel extends Model { protected static tableName = 'test_table'; } Usage of Extended Model This subclass, TestModel, extends our base Model and specifies the database table it corresponds to as 'test_table'. To truly harness the potential of our Model class, we need to equip it with methods that can seamlessly interact with our database. These methods would encapsulate common database operations, making our interactions not only more intuitive but also more efficient. 
Let's delve into and enhance our Model class with some essential methods: TypeScript import { database } from 'root/database'; export abstract class Model { protected static tableName?: string; private static get table() { if (!this.tableName) { throw new Error('The table name must be defined for the model.'); } return database(this.tableName); } protected static async insert<Payload>(data: Payload): Promise<{ id: number; }> { const [result] = await this.table.insert(data).returning('id'); return result; } protected static async findOneById<Result>(id: number): Promise<Result> { return this.table.where('id', id).select("*").first(); } protected static async findAll<Item>(): Promise<Item[]> { return this.table.select('*'); } } Essential Methods of Model In the class, we’ve added methods to handle the insertion of data (insert), fetch a single entry based on its ID (findOneById), and retrieve all items (findAll). These foundational methods will streamline our database interactions, paving the way for more complex operations as we expand our application. How should we verify its functionality? By crafting an integration test for our Model. Let's dive into it. Yes, I'm going to use Jest for integration tests since I already use the same tool for unit tests. Of course, Jest is primarily known as a unit testing framework, but it’s versatile enough to be used for integration tests as well. Ensure that your Jest configuration aligns with the following: TypeScript import type { Config } from '@jest/types'; const config: Config.InitialOptions = { clearMocks: true, preset: 'ts-jest', testEnvironment: 'node', coverageDirectory: 'coverage', verbose: true, modulePaths: ['./'], transform: { '^.+\\.ts?$': 'ts-jest', }, testRegex: '.*\\.(spec|integration\\.spec)\\.ts$', testPathIgnorePatterns: ['\\\\node_modules\\\\'], moduleNameMapper: { '^root/(.*)$': '<rootDir>/$1', '^src/(.*)$': '<rootDir>/src/$1', }, }; export default config; Jest Configurations Within the Model directory, create a file named Model.integration.spec.ts. TypeScript import { Model } from '.'; import { database } from 'root/database'; const testTableName = 'test_table'; class TestModel extends Model { protected static tableName = testTableName; } type TestType = { id: number; name: string; }; describe('Model', () => { beforeAll(async () => { process.env.NODE_ENV = 'test'; await database.schema.createTable(testTableName, table => { table.increments('id').primary(); table.string('name'); }); }); afterEach(async () => { await database(testTableName).del(); }); afterAll(async () => { await database.schema.dropTable(testTableName); await database.destroy(); }); it('should insert a row and fetch it', async () => { await TestModel.insert<Omit<TestType, 'id'>>({ name: 'TestName' }); const allResults = await TestModel.findAll<TestType>(); expect(allResults.length).toEqual(1); expect(allResults[0].name).toEqual('TestName'); }); it('should insert a row and fetch it by id', async () => { const { id } = await TestModel.insert<Omit<TestType, 'id'>>({ name: 'TestName' }); const result = await TestModel.findOneById<TestType>(id); expect(result.name).toEqual('TestName'); }); }); Model Integration Test The test showcases the model's ability to seamlessly interact with a database. I've designed a specialized TestModel class that inherits from our foundational Model, utilizing test_table as its designated test table. Throughout the tests, I'm emphasizing the model's core functions: inserting data and subsequently retrieving it, be it in its entirety or via specific IDs.
To maintain a pristine testing environment, I've incorporated mechanisms to set up the table prior to testing, cleanse it post each test, and ultimately dismantle it once all tests are concluded. Here, we leveraged the Template Method design pattern. This pattern is characterized by having a base class (often abstract) with defined methods like a template, which can then be overridden or extended by derived classes. Following the pattern you’ve established with the Model class, we can create a UserModel class to extend and specialize for user-specific behavior. In our Model, change the private table getter to protected so that sub-classes can reuse it. TypeScript protected static get table() { /* same body as before */ } And then create UserModel in src/models/UserModel/index.ts, like we did for the base Model, with the following content: TypeScript import { Model } from 'src/models/Model'; import { Role } from 'src/@types'; type UserType = { id: number; email: string; first_name: string; last_name: string; role: Role; } class UserModel extends Model { protected static tableName = 'users'; public static async findByEmail(email: string): Promise<UserType | null> { return this.table.where('email', email).select('*').first(); } } UserModel class To conduct rigorous testing, we need a dedicated test database where table migrations and deletions can occur. Recall our configuration in the knexfile, where we utilized the same database name across environments with this line: TypeScript export default Object.fromEntries(environments.map((env: string) => [env, commonConfig])); To have both development and test databases, we must adjust the docker-compose configuration for database creation and ensure the correct connection settings. The necessary connection adjustments should also be made in the knexfile. TypeScript // ... configs of knexfile.ts export default { development: { ...commonConfig, }, test: { ...commonConfig, connection: { ...connection, database: process.env.DB_NAME_TEST as string, } } } knexfile.ts With the connection established, setting process.env.NODE_ENV to "test" ensures that we connect to the appropriate database. Next, let's craft a test for the UserModel. TypeScript import { UserModel, UserType } from '.'; import { database } from 'root/database'; import { faker } from '@faker-js/faker'; import { Role } from 'src/@types'; const test_user: Omit<UserType, 'id'> = { email: faker.internet.email().toLowerCase(), first_name: faker.person.firstName(), last_name: faker.person.lastName(), password: 'test_password', role: Role.User, }; describe('UserModel', () => { beforeAll(async () => { process.env.NODE_ENV = 'test'; await database.migrate.latest(); }); afterEach(async () => { await database(UserModel.tableName).del(); }); afterAll(async () => { await database.migrate.rollback(); await database.destroy(); }); it('should insert and retrieve user', async () => { await UserModel.insert<typeof test_user>(test_user); const allResults = await UserModel.findAll<UserType>(); expect(allResults.length).toEqual(1); expect(allResults[0].first_name).toEqual(test_user.first_name); }); it('should insert user and retrieve by email', async () => { const { id } = await UserModel.insert<typeof test_user>(test_user); const result = await UserModel.findOneById<UserType>(id); expect(result.first_name).toEqual(test_user.first_name); }); }); UserModel Integration Test Initially, this mock user is inserted into the database, after which a retrieval operation ensures that the user was successfully stored, as verified by matching their first name.
In another segment of the test, once the mock user finds its way into the database, we perform a retrieval using the user’s ID, further confirming the integrity of our insertion mechanism. Throughout the testing process, it’s crucial to maintain an isolated environment. To this end, before diving into the tests, the database is migrated to the most recent structure. Post each test, the user entries are cleared to avoid any data residue. Finally, as the tests wrap up, a migration rollback cleans the slate, and the database connection gracefully closes. Using this approach, we can efficiently extend each of our models to handle precise database interactions. TypeScript import { Model } from 'src/models/Model'; export type PostType = { id: number; title: string; content: string; user_id: number; }; export class PostModel extends Model { public static tableName = 'posts'; protected static async findAllByUserId(user_id: number): Promise<PostType[]> { if (!user_id) return []; return this.table.where('user_id', user_id).select('*'); } } PostModel.ts The PostModel specifically targets the 'posts' table in the database, as indicated by the static tableName property. Moreover, the class introduces a unique method, findAllByUserId, designed to fetch all posts associated with a specific user. This method checks the user_id attribute, ensuring posts are only fetched when a valid user ID is provided. If necessary to have a generic method for updating, we can add an additional method in the base Model: TypeScript public static async updateOneById<Payload>( id: number, data: Payload ): Promise<{ id: number; } | null> { const [result] = await this.table.where({ id }).update(data).returning('id'); return result; } Update by id in base Model So, this method updateOneById can be useful for all model sub-classes. Conclusion In wrapping up, it’s evident that a modular approach not only simplifies our development process but also enhances the maintainability and scalability of our applications. By compartmentalizing logic into distinct models, we set a clear path for future growth, ensuring that each module can be refined or expanded upon without causing disruptions elsewhere. These models aren’t just theoretical constructs — they’re practical tools, effortlessly pluggable into controllers, ensuring streamlined and reusable code structures. So, as we journey through, let’s savor the transformative power of modularity, and see firsthand its pivotal role in shaping forward-thinking applications. I welcome your feedback and am eager to engage in discussions on any aspect. References GitHub Repository Knex.js
In the world of data management, graph databases have emerged as a powerful tool that revolutionizes the way we handle and analyze complex relationships. Unlike traditional relational databases, which rely on tables and columns, graph databases excel in capturing and representing connections between data points. This article explores the fundamental concepts of graph databases and highlights their applications and benefits. What Is a Graph Database? A graph database, at its core, is a particular kind of database created to store and manage interconnected data. It uses graph theory, a branch of mathematics that focuses on understanding relationships between objects, to model and represent the data structure. Data elements are shown as nodes (also known as vertices) in a graph database, which are connected by edges (also known as relationships or arcs). Because this graph-like structure enables efficient querying and traversal of complex relationships, it opens the door to in-depth insights and analysis. Key Concepts and Terminology To understand graph databases, it’s essential to familiarize yourself with key concepts and terminology associated with them. Here are the fundamental concepts: Graph: A graph is a data structure composed of nodes/vertices and edges/relationships. It represents the connections between different data elements. Node/Vertex: A node or vertex represents an entity or object in the graph database. It can store properties or attributes related to the entity it represents. For example, in a social network graph, a node can represent a person. Edge/Relationship: An edge or relationship defines the connection between nodes in the graph. It signifies the relationship or interaction between entities. Edges can have properties to provide additional information about the relationship. For instance, a friendship relationship between two users in a social network graph. Direction: Edges can be directed or undirected. In a directed graph, edges have a specific direction, indicating the flow or nature of the relationship. In an undirected graph, the relationship is bidirectional, and the edges have no specified direction. Label: Labels are used to categorize or classify nodes based on their properties or types. They provide a way to group similar nodes together. For instance, labels like “person,” “product,” or “location” can be used to categorize nodes based on their entity type. Property: Properties are attributes or key-value pairs associated with nodes or edges. They store additional information about the entities or relationships they represent. For example, a person node may have properties such as name, age, or occupation. Path: A path is a sequence of connected nodes and edges that represent a specific route or connection in the graph. It allows traversal from one node to another through the relationships defined by the edges. Graph Query Language: Graph databases often have their own query languages optimized for traversing and querying graph data. These query languages allow you to perform operations like creating, reading, updating, and deleting nodes, edges, and properties, as well as querying the relationships and patterns within the graph. Understanding these key concepts and terminology provides a solid foundation for working with graph databases and harnessing their power to model and analyze complex relationships in your data.
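To ground these terms, here is a small, hypothetical Python sketch of a property graph held in memory. It only illustrates the vocabulary above (labeled nodes, directed edges with properties, and a path found by traversal); it is not how an actual graph database implements storage, indexing, or querying.
Python
# Nodes: id -> label and properties
nodes = {
    1: {"label": "Person", "props": {"name": "Alice", "age": 34}},
    2: {"label": "Person", "props": {"name": "Bob"}},
    3: {"label": "Product", "props": {"name": "Laptop"}},
}

# Directed, labeled edges, each with its own properties
edges = [
    {"from": 1, "to": 2, "type": "FRIENDS_WITH", "props": {"since": 2019}},
    {"from": 2, "to": 3, "type": "PURCHASED", "props": {"year": 2023}},
]

def neighbors(node_id):
    """Follow outgoing edges from a node (one traversal step)."""
    return [e["to"] for e in edges if e["from"] == node_id]

def find_path(start, goal, seen=None):
    """Depth-first search for a path of node ids from start to goal."""
    seen = seen or set()
    if start == goal:
        return [start]
    seen.add(start)
    for nxt in neighbors(start):
        if nxt not in seen:
            rest = find_path(nxt, goal, seen)
            if rest:
                return [start] + rest
    return None

print(find_path(1, 3))  # [1, 2, 3]: Alice -> Bob -> Laptop
A real graph database adds indexing, persistence, and a query language such as Cypher or Gremlin on top of this basic shape, so that traversals like the one above stay fast even across billions of nodes and edges.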
Applications of Graph Databases Due to their capacity to efficiently manage and analyze complex relationships, graph databases have a wide range of applications in a variety of industries. The following are some important uses and advantages of graph databases: Social Networks: Graph databases are exceptionally well-suited for modeling and analyzing social networks. They can represent users as nodes and friendships or connections as edges, enabling efficient querying and exploration of social relationships. Graph databases can power social network platforms, recommendation systems, and targeted advertising based on social connections. Recommendation Systems: Graph databases excel in generating personalized recommendations by analyzing relationships and patterns. By leveraging the connections between users, items, or content, graph databases can identify similar users, discover relevant items, and provide accurate recommendations. This application is widely used in e-commerce, content streaming platforms, and personalized marketing. Fraud Detection: Graph databases are valuable in fraud detection and prevention. By modeling relationships among entities such as customers, transactions, and accounts, graph databases can uncover suspicious patterns, detect fraud networks, and identify anomalies in real time. The ability to traverse relationships quickly and perform complex queries makes graph databases a powerful tool in fraud analysis. Knowledge Graphs: Knowledge graphs capture and represent complex relationships among various entities, enabling rich semantic connections and knowledge representation. Graph databases are commonly used to build and query knowledge graphs, which find applications in semantic search, question-answering systems, natural language processing, and recommendation engines. Logistics and Supply Chain Management: Graph databases can optimize logistics and supply chain management by representing the interconnected nature of the supply chain. Nodes can represent locations, products, or transportation hubs, while edges capture relationships such as transportation routes, dependencies, or delivery timelines. Graph databases enable efficient route planning, supply chain visibility, and optimization of operations. Network and IT Operations: Graph databases can be used for network and IT operations management, enabling efficient representation and analysis of network infrastructure, dependencies, and service relationships. They can facilitate network troubleshooting, impact analysis, and root cause analysis by modeling the relationships between network components, devices, and services. Data Integration and Master Data Management: Graph databases can assist in data integration and master data management (MDM) scenarios. By representing relationships between various data sources, systems, and entities, graph databases enable data mapping, data lineage tracking, and data quality management. They facilitate efficient data integration and synchronization in complex data landscapes. Benefits of Graph Databases Graph databases offer several benefits compared to traditional database models. Here are the key advantages of using graph databases: Relationship Focus: Graph databases excel at managing and analyzing relationships between data elements. They are specifically designed to efficiently store, traverse, and query complex interconnections, making them ideal for applications that heavily rely on relationships. 
Performance: Graph databases provide fast and efficient query performance when it comes to navigating relationships. They use graph-specific algorithms and indexing techniques to optimize traversal operations, allowing for quick retrieval of connected data. Flexibility: Graph databases offer schema flexibility, allowing the database structure to evolve over time. New nodes, relationships, and properties can be added without requiring significant changes to the existing data model. This flexibility facilitates agile development and accommodates changing business requirements. Scalability: Graph databases can scale horizontally by distributing data across multiple servers or nodes. This architecture enables them to handle large and growing datasets with ease while maintaining high performance. The distributed nature of graph databases also supports high availability and fault tolerance. Deeper Insights: Graph databases enable the discovery of hidden patterns, dependencies, and insights that may not be immediately apparent in other database models. By analyzing relationships, graph databases uncover valuable insights that can drive informed decision-making, facilitate recommendations, and power advanced analytics. Natural Representation of Data: Graph databases align well with the way data is naturally structured, especially in domains where relationships play a crucial role. The graph model closely mirrors real-world scenarios, making it intuitive for developers and analysts to work with. Real-Time Analysis: Graph databases excel in real-time analysis of relationship-rich data. They can quickly traverse and query connections, making them suitable for use cases that require on-the-fly analysis, such as fraud detection, recommendation systems, and network operations. Integration and Interoperability: Graph databases can easily integrate and interoperate with other data systems. They can ingest and connect data from various sources, including relational databases, NoSQL databases, APIs, and external services. This capability enables organizations to leverage existing data assets and create unified views of their data. These benefits make graph databases a powerful tool for managing and analyzing interconnected data, unlocking valuable insights, and facilitating innovative applications across industries. Different Graph Databases There are several graph databases available, each with its own features and characteristics. Here are some popular graph databases: Neo4j: Neo4j is one of the most widely used and mature graph databases. It is a fully ACID-compliant, native graph database written in Java. Neo4j offers a flexible data model, powerful querying capabilities with its query language Cypher, and supports high availability and clustering. Amazon Neptune: Amazon Neptune is a fully managed graph database service provided by Amazon Web Services (AWS). It is built for high-performance and scalable graph applications. Neptune supports the property graph model and provides compatibility with Apache TinkerPop and Gremlin query language. Microsoft Azure Cosmos DB: Azure Cosmos DB is a globally distributed, multi-model database service by Microsoft Azure. It supports the Gremlin query language for graph database functionality, allowing you to build highly available and scalable graph applications. JanusGraph: JanusGraph is an open-source, distributed graph database that provides horizontal scalability and fault tolerance. 
It is built on Apache Cassandra and Apache TinkerPop, offering compatibility with Gremlin for querying and traversal operations. OrientDB: OrientDB is a multi-model database that combines graph and document-oriented features. It provides support for ACID transactions, distributed architecture, and flexible schema. OrientDB supports both SQL and Gremlin query languages. ArangoDB: ArangoDB is a multi-model database that supports key-value, document, and graph data models. It offers a native graph database engine with support for property graphs and graph traversals. ArangoDB also supports its query language, AQL (ArangoDB Query Language), for graph traversals and complex graph queries. TigerGraph: TigerGraph is a distributed graph database designed for high-performance graph analytics. It provides a native parallel graph computation engine, supporting massive-scale graph data processing and traversal. TigerGraph offers its own query language called GSQL. These are just a handful of the graph databases that are offered on the market. Every database has a different set of special features, scalability choices, and query languages. Specific needs, scalability requirements, performance considerations, and the ecosystem or infrastructure being used all play a role in the decision regarding the graph database. Conclusion An effective and adaptable method for managing and analyzing complex relationships in data is provided by graph databases. They open up new possibilities for understanding and utilizing relationships in our increasingly interconnected world thanks to their ability to efficiently capture and navigate connections. As industries continue to struggle with ever-increasing data volumes, graph databases present a useful tool for generating insightful conclusions and stimulating innovation.
When testing the FastAPI application with two different async sessions to the database, the following error may occur: In the test, an object is created in the database (the test session). A request is made to the application itself in which this object is changed (the application session). An object is loaded from the database in the test, but there are no required changes in it (the test session). Let’s find out what’s going on. Most often, we use two different sessions in the application and in the test. Moreover, in the test, we usually wrap the session in a fixture that prepares the database for tests, and after the tests, everything is cleaned up. Below is an example of the application. A file with a database connection app/database.py: Python """ Database settings file """ from typing import AsyncGenerator from sqlalchemy.ext.asyncio import AsyncSession, create_async_engine, async_sessionmaker from sqlalchemy.orm import declarative_base DATABASE_URL = "postgresql+asyncpg://user:password@host:5432/dbname" engine = create_async_engine(DATABASE_URL, echo=True, future=True) async_session = async_sessionmaker(bind=engine, class_=AsyncSession, expire_on_commit=False) async def get_session() -> AsyncGenerator: """ Returns async session """ async with async_session() as session: yield session Base = declarative_base() A file with a model description app/models.py: Python """ Model file """ from sqlalchemy import Integer, String from sqlalchemy.orm import Mapped, mapped_column from .database import Base class Lamp(Base): """ Lamp model """ __tablename__ = 'lamps' id: Mapped[int] = mapped_column(Integer, primary_key=True, index=True) status: Mapped[str] = mapped_column(String, default="off") A file with an endpoint description app/main.py: Python """ Main file """ import logging from fastapi import FastAPI, Depends from sqlalchemy import select from sqlalchemy.ext.asyncio import AsyncSession from .database import get_session from .models import Lamp app = FastAPI() @app.post("/lamps/{lamp_id}/on") async def check_lamp( lamp_id: int, session: AsyncSession = Depends(get_session) ) -> dict: """ Lamp on endpoint """ results = await session.execute(select(Lamp).where(Lamp.id == lamp_id)) lamp = results.scalar_one_or_none() if lamp: logging.error("Status before update: %s", lamp.status) lamp.status = "on" session.add(lamp) await session.commit() await session.refresh(lamp) logging.error("Status after update: %s", lamp.status) return {} I have added logging and a few more requests to the example on purpose to make it clear. Here, a session is created using Depends. 
Below is the file with a test example tests/test_lamp.py: Python """ Test lamp """ import logging from typing import AsyncGenerator import pytest import pytest_asyncio from httpx import AsyncClient from sqlalchemy import select from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker from app.database import Base, engine from app.main import app, Lamp @pytest_asyncio.fixture(scope="function", name="test_session") async def test_session_fixture() -> AsyncGenerator: """ Async session fixture """ async_session = async_sessionmaker(engine, class_=AsyncSession, expire_on_commit=False) async with async_session() as session: async with engine.begin() as conn: await conn.run_sync(Base.metadata.create_all) yield session async with engine.begin() as conn: await conn.run_sync(Base.metadata.drop_all) await engine.dispose() @pytest.mark.asyncio async def test_lamp_on(test_session): """ Test lamp switch on """ lamp = Lamp() test_session.add(lamp) await test_session.commit() await test_session.refresh(lamp) logging.error("New client status: %s", lamp.status) assert lamp.status == "off" async with AsyncClient(app=app, base_url="http://testserver") as async_client: response = await async_client.post(f"/lamps/{lamp.id}/on") assert response.status_code == 200 results = await test_session.execute(select(Lamp).where(Lamp.id == lamp.id)) new_lamp = results.scalar_one_or_none() logging.error("Updated status: %s", new_lamp.status) assert new_lamp.status == "on" This is a regular pytest test that gets a database session from a fixture. In this fixture, all tables are created before the session is returned, and after using it, they are deleted. Please note again that in the test, we use a session from the test_session fixture and, in the main code, from the app/database.py file. Despite the fact that we use the same engine, different sessions are generated. This is important. The expected sequence of database requests is as follows, with status = on being returned from the database at the end. In the test, I create an object in the database first. This is a usual INSERT through a session from a test. Let’s call it Session 1. At this moment, only this session is connected to the database. The application session is not connected yet. After creating an object, I perform a refresh. This is a SELECT of the newly created object with an instance update via Session 1. As a result, I make sure that the object is created correctly and the status field is filled with the needed value — off. Then, I perform a POST request to the /lamps/1/on endpoint, which turns on the lamp. To make the example shorter, I don’t use a fixture. As soon as the request starts working, a new session to the database is created. Let’s call it Session 2. With this session, I load the needed object from the database. I output the status to the log. It is off. After that, I update this status and save the update in the database. A request is made to the database: SQL BEGIN (implicit) UPDATE lamps SET status=$1::VARCHAR WHERE lamps.id = $2::INTEGER parameters: ('on', 1) COMMIT Note that the COMMIT command is also present. Despite the fact that the transaction is implicit, its result is instantly available after COMMIT in other sessions. Next, I make a request to get an updated object from the database using refresh. I output the status. And its value is now on. It would seem that everything should work. The endpoint finishes its work, closes Session 2, and transfers control to the test. In the test, I make a usual request from Session 1 to get the modified object.
But in the status field, I see the off value. Below is the scheme of the sequence of actions in the code. Sequence of actions in the code At the same time, according to all logs, the last SELECT request to the database was executed and returned status = on. Its value is definitely equal to on in the database at this moment. This is the value that the asyncpg engine receives in response to the SELECT request. So, what happened? Here is what happened. It turned out that the request made to get a new object did not update the current one but found and used an existing one. In the beginning, I added a lamp object using the ORM. Then I changed it in another session. When the change was made, the current session knew nothing about this change. And the commit made in Session 2 did not trigger the expire_all method in Session 1. To fix this, you can do one of the following: Use a shared session for the test and application. Refresh the instance rather than trying to get it from the database. Forcibly expire the instance. Close the session. Dependency Overrides To use the same session, you can simply override the session in the application with the one I created in the test. It’s easy. To do this, we need to add the following code to the test: Python async def _override_get_db(): yield test_session app.dependency_overrides[get_session] = _override_get_db If you want, you can wrap this part into a fixture to use it in all tests. The resulting algorithm will be as follows: Steps in the code when using dependency overrides Below is the test code with session substitution: Python @pytest.mark.asyncio async def test_lamp_on(test_session): """ Test lamp switch on """ async def _override_get_db(): yield test_session app.dependency_overrides[get_session] = _override_get_db lamp = Lamp() test_session.add(lamp) await test_session.commit() await test_session.refresh(lamp) logging.error("New client status: %s", lamp.status) assert lamp.status == "off" async with AsyncClient(app=app, base_url="http://testserver") as async_client: response = await async_client.post(f"/lamps/{lamp.id}/on") assert response.status_code == 200 results = await test_session.execute(select(Lamp).where(Lamp.id == 1)) new_lamp = results.scalar_one_or_none() logging.error("Updated status: %s", new_lamp.status) assert new_lamp.status == "on" However, if the application uses multiple sessions (which is possible), that may not be the best way. Also, if commit or rollback is not called in the tested function, this will not help. Refresh The second solution is the simplest and most logical. We should not create a new request to get an object. To update, it is enough to call refresh immediately after processing the request to the endpoint. Internally, it calls expire, which means the saved instance is not reused for a new request, and the data is filled in anew. This solution is the most logical and easiest to understand. Python await test_session.refresh(lamp) After it, you do not need to load the new_lamp object again; it is enough to check the same lamp. Below is the code scheme using refresh. Steps in the code when using refresh Below is the test code with the update.
Python @pytest.mark.asyncio async def test_lamp_on(test_session): """ Test lamp switch on """ lamp = Lamp() test_session.add(lamp) await test_session.commit() await test_session.refresh(lamp) logging.error("New client status: %s", lamp.status) assert lamp.status == "off" async with AsyncClient(app=app, base_url="http://testserver") as async_client: response = await async_client.post(f"/lamps/{lamp.id}/on") assert response.status_code == 200 await test_session.refresh(lamp) logging.error("Updated status: %s", lamp.status) assert lamp.status == "on" Expire But if we change a lot of objects, it might be better to call expire_all. Then, all instances will be read from the database, and the consistency will not be broken. Python test_session.expire_all() You can also call expire on a particular instance and even on instance attribute. Python test_session.expire(lamp) After these calls, you will have to read the objects from the database manually. Below is the sequence of steps in the code when using expire. Steps in the code when using expire Below is the test code with expires. Python @pytest.mark.asyncio async def test_lamp_on(test_session): """ Test lamp switch on """ lamp = Lamp() test_session.add(lamp) await test_session.commit() await test_session.refresh(lamp) logging.error("New client status: %s", lamp.status) assert lamp.status == "off" async with AsyncClient(app=app, base_url="http://testserver") as async_client: response = await async_client.post(f"/lamps/{lamp.id}/on") assert response.status_code == 200 test_session.expire_all() # OR: # test_session.expire(lamp) results = await test_session.execute(select(Lamp).where(Lamp.id == 1)) new_lamp = results.scalar_one_or_none() logging.error("Updated status: %s", new_lamp.status) assert new_lamp.status == "on" Close In fact, the last approach with session termination also calls expire_all, but the session can be used further. And when reading the new data, we will get the up-to-date objects. Python await test_session.close() This should be called immediately after the request for the application is completed and before the checks begin. Below are the steps in the code when using close. Steps in the code when using close Below is the test code with session closure. Python @pytest.mark.asyncio async def test_lamp_on(test_session): """ Test lamp switch on """ lamp = Lamp() test_session.add(lamp) await test_session.commit() await test_session.refresh(lamp) logging.error("New client status: %s", lamp.status) assert lamp.status == "off" async with AsyncClient(app=app, base_url="http://testserver") as async_client: response = await async_client.post(f"/lamps/{lamp.id}/on") assert response.status_code == 200 await test_session.close() results = await test_session.execute(select(Lamp).where(Lamp.id == 1)) new_lamp = results.scalar_one_or_none() logging.error("Updated status: %s", new_lamp.status) assert new_lamp.status == "on" Calling rollback() will help as well. It also calls expire_all, but it explicitly rolls back the transaction. If the transaction needs to be executed, commit() also executes expire_all. But in this example, neither rollback nor commit will be relevant since the transaction in the test has already been completed, and the transaction in the application does not affect the session from the test. In fact, this feature only works in SQLAlchemy ORM in async mode in transactions. 
However, the behavior itself seems illogical: I explicitly make a request to the database in the code to get a fresh object, yet the session still returns the cached object rather than the one actually read from the database. This is a bit confusing when debugging the code, but when the session is used correctly, this is how it is supposed to behave. Conclusion When working in async mode with the SQLAlchemy ORM, you have to keep track of transactions and threads across parallel sessions. If all this seems too difficult, then use the SQLAlchemy ORM in synchronous mode. Everything is much simpler there.
In the age of burgeoning data complexity and high-dimensional information, traditional databases often fall short when it comes to efficiently handling and extracting meaning from intricate datasets. Enter vector databases, a technological innovation that has emerged as a solution to the challenges posed by the ever-expanding landscape of data. Understanding Vector Databases Vector databases have gained significant importance in various fields due to their unique ability to efficiently store, index, and search high-dimensional data points, often referred to as vectors. These databases are designed to handle data where each entry is represented as a vector in a multi-dimensional space. The vectors can represent a wide range of information, such as numerical features, embeddings from text or images, and even complex data like molecular structures. Let's represent the vector database using a 2D grid where one axis represents the color of the animal (brown, black, white) and the other axis represents the size (small, medium, large). In this representation: Image A: Brown color, Medium size Image B: Black color, Small size Image C: White color, Large size Image E: Black color, Large size You can imagine each image as a point plotted on this grid based on its color and size attributes. This simplified grid captures the essence of how a vector database could be represented visually, even though the actual vector spaces might have many more dimensions and use sophisticated techniques for search and retrieval. Explain Vector Databases Like I’m Five Imagine you have a bunch of different types of fruit, like apples, oranges, bananas, and grapes. You love the taste of apples and want to find other fruits that taste similar to apples. Instead of sorting the fruits by their colors or sizes, you decide to group them based on how sweet or sour they are. So, you put all the sweet fruits together, like apples, grapes, and ripe bananas. You put the sour fruits in another group, like oranges and unripe bananas. Now, when you want to find fruits that taste like apples, you just look in the group of sweet fruits because they're more likely to have a similar taste. But what if you're looking for something specific, like a fruit that's as sweet as an apple but also has a tangy flavor like an orange? It might be a bit hard to find in your groups, right? That's when you ask someone who knows a lot about different fruits, like a fruit expert. They can suggest a fruit that matches your unique taste request because they know about the flavors of many fruits. In this case, that knowledgeable person is acting like a "vector database." They have a lot of information about different fruits and can help you find one that fits your special taste, even if it's not based on the usual things like colors or shapes. Similarly, a vector database is like this helpful expert for computers. It's designed to remember lots of details about things, like food, in a special way. So, if you're looking for a food that's similar in taste to something you love or a food with a combination of flavors you enjoy, this vector database can quickly find the right options for you. It's like having a flavor expert for computers who knows all about tastes and can suggest great choices based on what you're craving, just like that knowledgeable person with fruit. How Do Vector Databases Store Data? Vector databases store data by using vector embeddings. 
How Do Vector Databases Store Data?

Vector databases store data by using vector embeddings. Vector embeddings are a way of representing objects, such as items, documents, or data points, as vectors in a multi-dimensional space. Each object is assigned a vector that captures various characteristics or features of that object. These vectors are designed so that similar objects have vectors that are closer to each other in the vector space, while dissimilar objects have vectors that are farther apart.

Think of vector embeddings as a special code that describes the important aspects of an object. Imagine you have different animals, and you want to represent them in a way that similar animals have similar codes. For instance, cats and dogs might have codes that are quite close, as they share common features like being four-legged and having fur. On the other hand, animals like fish and birds would have codes that are further apart, reflecting their differences.

In a vector database, these embeddings are used to store and organize objects. When you want to find objects that are similar to a given query, the database looks at the embeddings and calculates the distances between the query's embedding and the embeddings of other objects. This helps the database quickly identify the objects that are most similar to the query. For example, in a music streaming app, songs could be represented as vectors using embeddings that capture musical features like tempo, genre, and instruments used. When you search for songs similar to your favorite track, the app's vector database compares the embeddings to find songs that closely match your preferences.

In short, vector embeddings turn complex objects into numerical vectors that capture their characteristics, and vector databases use these embeddings to efficiently search for and retrieve similar or relevant objects based on their positions in the vector space.

How Do Vector Databases Work?

(Diagram of the query flow; image credits: KDnuggets)

User Query: You input a question or request into the ChatGPT application.
Embedding Creation: The application converts your input into a compact numerical form called a vector embedding, which captures the essence of your query in a mathematical representation.
Database Comparison: The vector embedding is compared with other embeddings stored in the vector database. Similarity measures help identify the most related embeddings based on content.
Output Generation: The database generates a response composed of embeddings that closely match your query's meaning.
User Response: The response, containing relevant information linked to the identified embeddings, is sent back to you.
Follow-up Queries: When you make subsequent queries, the embedding model generates new embeddings, which are used to find similar embeddings in the database, connecting back to the original content.

How Vector Databases Know Which Vectors Are Similar

A vector database determines the similarity between vectors using various mathematical techniques, with one of the most common methods being cosine similarity. When you search for "Best cricket player in the world" on Google and it shows a list of top players, several steps are involved, of which cosine similarity is the main one. The vector representation of the search query is compared to the vector representations of all the player profiles in the database using cosine similarity. The more similar the vectors are, the higher the cosine similarity score.
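As a rough illustration of that comparison step, here is a minimal cosine-similarity sketch using NumPy. The embeddings and names are made up for illustration and are not taken from any real search index.

Python
# Cosine similarity: the cosine of the angle between two vectors.
# A score near 1 means the vectors point in nearly the same direction.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings for the search query and two player profiles
query = np.array([0.9, 0.1, 0.4])
profile_a = np.array([0.85, 0.15, 0.35])  # very similar to the query
profile_b = np.array([0.1, 0.9, 0.2])     # quite different

print(cosine_similarity(query, profile_a))  # close to 1.0
print(cosine_similarity(query, profile_b))  # much lower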
Note: This is just for the sake of an example; search engines like Google use complex algorithms that go beyond simple vector similarity. They consider various factors such as the user's location, search history, authority of the sources, and more to provide the most relevant and personalized search results.

Vector Database Capabilities

The significance of vector databases lies in their capabilities and applications:

Efficient Similarity Search: Vector databases excel at similarity searches, where you retrieve the vectors most similar to a given query vector. This is crucial in applications like recommendation systems (finding similar products or content), image and video retrieval, facial recognition, and information retrieval.
High-Dimensional Data: Traditional relational databases struggle with high-dimensional data because of the "curse of dimensionality," where distances between data points become less meaningful as the number of dimensions increases. Vector databases are designed to handle high-dimensional data more efficiently, making them suitable for applications like natural language processing, computer vision, and genomics.
Machine Learning and AI: Vector databases are often used to store embeddings generated by machine learning models. These embeddings capture the essential features of the data and can be used for tasks such as clustering, classification, and anomaly detection.
Real-Time Applications: Many vector databases are optimized for real-time or near-real-time querying, making them suitable for applications that require quick responses, such as recommendation systems in e-commerce, fraud detection, and monitoring IoT sensor data.
Personalization and User Profiling: Vector databases enable personalized experiences by allowing systems to understand and predict user preferences. This is crucial in platforms like streaming services, social media, and online marketplaces.
Spatial and Geographic Data: Vector databases can handle geographic data, such as points, lines, and polygons, efficiently. This is essential in applications like geographic information systems (GIS), location-based services, and navigation.
Healthcare and Life Sciences: In genomics and molecular biology, vector databases are used to store and analyze genetic sequences, protein structures, and other molecular data. This helps in drug discovery, disease diagnosis, and personalized medicine.
Data Fusion and Integration: Vector databases can integrate data from various sources and types, enabling more comprehensive analysis and insights. This is valuable when data comes from multiple modalities, such as combining text, image, and numerical data.
Multilingual Search: Vector databases can power multilingual search engines by representing text documents as vectors in a common space, enabling cross-lingual similarity searches.
Graph Data: Vector databases can represent and process graph data efficiently, which is crucial in social network analysis, recommendation systems, and fraud detection.

The Crucial Role of Vector Databases in Today's Data Landscape

Vector databases are experiencing high demand due to their essential role in tackling the challenges posed by the explosion of high-dimensional data in modern applications. As industries increasingly adopt technologies like machine learning, artificial intelligence, and data analytics, the need to efficiently store, search, and analyze complex data representations has become paramount.
Vector databases enable businesses to harness the power of similarity search, personalized recommendations, and content retrieval, driving better user experiences and improved decision-making. With applications ranging from e-commerce and content platforms to healthcare and autonomous vehicles, the demand for vector databases stems from their ability to handle diverse data types and deliver accurate results in real time. As data continues to grow in complexity and volume, the scalability, speed, and accuracy offered by vector databases position them as a critical tool for extracting meaningful insights and unlocking new opportunities across many domains.

SingleStore as a Vector Database

SingleStoreDB provides robust vector database capabilities tailored to AI-driven applications, chatbots, image recognition systems, and more. With SingleStoreDB, you no longer need to maintain a separate, dedicated vector database for vector-intensive workloads. Unlike conventional vector databases, SingleStoreDB stores vector data in relational tables alongside other data types. This combination lets you access metadata and additional attributes related to your vector data while using the full querying power of SQL. SingleStoreDB is also built on a scalable architecture, so it can keep pace as your data requirements grow.

Example of Face Matching With SQL in SingleStore

We loaded 16,784,377 rows into this table:

SQL
create table people(
  id bigint not null primary key,
  filename varchar(255),
  vector blob
);

Each row represents one image of a celebrity and contains a unique ID number, the file name where the image is stored, and a 128-element floating-point vector representing the meaning of the face. This vector was obtained using facenet, a pre-trained neural network for creating vector embeddings from a face image. Don't worry; you don't need to understand the AI to use this kind of approach. You just need somebody else's pre-trained neural network, or any tool that can provide summary vectors for an object.

Now, we query this table using:

SQL
select vector into @v from people
where filename = "Emma_Thompson/Emma_Thompson_0001.jpg";

select filename, dot_product(vector, @v) as score
from people
where score > 0.1
order by score desc
limit 5;

The first query gets a query vector @v for the image Emma_Thompson_0001.jpg. The second query finds the top five closest matches. Emma_Thompson_0001.jpg is a perfect match for itself, so its score is close to 1, and, interestingly, the next closest match is Emma_Thompson_0002.jpg.

Moreover, the search speed was remarkable: the second query took only 0.005 seconds on a 16-vCPU machine while scanning all 16 million vectors, a rate of over 3.3 billion vector matches per second.

The significance of vector databases stems from their ability to handle complex, high-dimensional data while offering efficient querying and retrieval mechanisms. As data continues to grow in complexity and volume, vector databases are becoming increasingly vital in a wide range of applications across industries.
Oren Eini, Wizard, Hibernating Rhinos (@ayende)
Abhishek Gupta, Principal Developer Advocate, AWS
Artem Ervits, Solutions Engineer, Cockroach Labs
Sahiti Kappagantula, Product Associate, Oorwin