Data is at the core of software development. Think of it as information stored in anything from text documents and images to entire software programs, and these bits of information need to be processed, read, analyzed, stored, and transported throughout systems. In this Zone, you'll find resources covering the tools and strategies you need to handle data properly.
A Guide to Data-Driven Design and Architecture
The Evolution of Data Pipelines
This is an article from DZone's 2023 Data Pipelines Trend Report. For more: Read the Report

Data quality is an inseparable part of data engineering. Because any data insight can only be as good as its input data, building robust and resilient data systems that consistently deliver high-quality data is the data engineering team's foremost responsibility. Achieving and maintaining adequate data quality is no easy task. It requires data engineers to design data systems with data quality in mind. In the hybrid world of data at rest and data in motion, engineering data quality can look significantly different for batch and event streaming systems. This article covers the key components of data engineering systems that are critical for delivering high-quality data:

- Monitoring data quality – Given any data pipeline, how to measure the correctness of the output data, and how to ensure the output is correct not only today but also in the foreseeable future.
- Data recovery and backfill – In case of application failures or data quality violations, how to perform data recovery to minimize impact on downstream users.
- Preventing data quality regressions – When data sources undergo changes or when adding new features to existing data applications, how to prevent unexpected regressions.

Monitoring Data Quality

As the business evolves, the data also evolves. Measuring data quality is never a one-time task, and it is important to continuously monitor the quality of data in data pipelines to catch any regressions at the earliest stage possible. The very first step of monitoring data quality is defining data quality metrics based on the business use cases.

Defining Data Quality

Defining data quality means setting expectations for the output data and measuring the deviation of the actual data from those expectations in the form of quantitative metrics. When defining data quality metrics, the very first thing data engineers should consider is, "What truth does the data represent?" For example, the output table should contain all advertisement impression events that happened on the retail website. The data quality metrics should be designed to ensure the data system accurately captures that truth. In order to accurately measure the data quality of a data system, data engineers need to track not only the baseline application health and performance metrics (such as job failures, completion timestamp, processing latency, and consumer lag) but also customized metrics based on the business use cases the data system serves. Therefore, data engineers need to have a deep understanding of the downstream use cases and the underlying business problems. As the business model determines the nature of the data, business context allows data engineers to grasp the meanings of the data, traffic patterns, and potential edge cases. While every data system serves a different business use case, some common patterns in data quality metrics can be found in Table 1.

METRICS FOR MEASURING DATA QUALITY IN A DATA PIPELINE (metric type – example expectation)

- Application health – The number of jobs succeeded or running (for streaming) should be N.
- SLA/latency – The job completion time should be by 8 a.m. PST daily. The max event processing latency should be < 2 seconds (for streaming).
- Schema – Column account_id should be INT type and can't be NULL.
- Column values – Column account_id must be positive integers. Column account_type can only have the values: FREE, STANDARD, or MAX.
- Comparison with history – The total number of confirmed orders on any date should be within +20%/-20% of the daily average of the last 30 days.
- Comparison with other datasets – The number of shipped orders should correlate to the number of confirmed orders.

Table 1

Implementing Data Quality Monitors

Once a list of data quality metrics is defined, these metrics should be captured as part of the data system, and metric monitors should be automated as much as possible. In case of any data quality violations, the on-call data engineers should be alerted to investigate further. In the current data world, data engineering teams often own a mixed bag of batched and streaming data applications, and the implementation of data quality metrics can be different for batched vs. streaming systems.

Batched Systems

The Write-Audit-Publish (WAP) pattern is a data engineering best practice widely used to monitor data quality in batched data pipelines. It emphasizes the importance of always evaluating data quality before releasing the data to downstream users.

Figure 1: Write-Audit-Publish pattern in batched data pipeline design

Streaming Systems

Unfortunately, the WAP pattern is not applicable to data streams because event streaming applications have to process data nonstop, and pausing production streaming jobs to troubleshoot data quality issues would be unacceptable. In a Lambda architecture, the output of event streaming systems is also stored in lakehouse storage (e.g., an Apache Iceberg or Apache Hudi table) for batched usage. As a result, it is also common for data engineers to implement WAP-based batched data quality monitors on the lakehouse table.

To monitor data quality in near real-time, one option is to implement data quality checks as real-time queries on the output, such as an Apache Kafka topic or an Apache Druid datasource. For large-scale output, sampling is typically applied to improve the query efficiency of aggregated metrics. Helper frameworks such as Schema Registry can also be useful for ensuring output events have a compatible, as-expected schema. Another option is to capture data quality metrics in an event-by-event manner as part of the application logic and log the results in a time series data store. This option introduces additional side output but allows more visibility into intermediate data stages/operations and easier troubleshooting. For example, assume the application logic drops events that have an invalid account_id, account_type, or order_id. If an upstream system release introduces a large number of events with an invalid account_id, the output-based data quality metrics will show a decline in the total number of output events, but it would be difficult to identify which filter logic or column is the root cause without metrics or logs on intermediate data stages/operations.

Data Recovery and Backfill

Every data pipeline will fail at some point. Some of the common failure causes include:

- Incompatible source data updates (e.g., critical columns were removed from source tables)
- Source or sink data system failures (e.g., sink databases became unavailable)
- Altered truth in data (e.g., data processing logic became outdated after a new product release)
- Human errors (e.g., a new build introduces new edge-case errors left unhandled)

Therefore, all data systems should be able to be backfilled at all times in order to minimize the impact of potential failures on downstream business use cases.
In addition, in event streaming systems, the ability to backfill is also required for bootstrapping large stateful stream processing jobs. The data storage and processing frameworks used in batched and streaming architectures are usually different, and so are the challenges that lie behind supporting backfill.

Batched Systems

The storage solutions for batched systems, such as AWS S3 and GCP Cloud Storage, are relatively inexpensive, and source data retention is usually not a limiting factor in backfill. Batched data are often written and read by event-time partitions, and data processing jobs are scheduled to run at certain intervals and have clear start and completion timestamps. The main technical challenge in backfilling batched data pipelines is data lineage: which jobs updated or read which partitions at what timestamp. Clear data lineage enables data engineers to easily identify downstream jobs impacted by problematic data partitions. Modern lakehouse table formats such as Apache Iceberg provide queryable table-level changelogs and history snapshots, which allow users to revert any table to a specific version in case a recent data update contaminated the table. The less queryable the data lineage metadata is, the more manual work is required for impact estimation and data recovery.

Streaming Systems

The source data used in streaming systems, such as Apache Kafka topics, often has limited retention due to the high cost of low-latency storage. For instance, for web-scale data streams, data retention is often set to several hours to keep costs reasonable. As troubleshooting failures can take data engineers hours if not days, the source data could have already expired before backfill. As a result, data retention is often a challenge in event streaming backfill. Below are the common backfill methodologies for event streaming systems:

METHODS FOR BACKFILLING STREAMING DATA SYSTEMS (method – description)

- Replaying source streams – Reprocess source data from the problematic time period before those events expire in source systems (e.g., Apache Kafka). Tiered storage can help reduce stream retention cost.
- Lambda architecture – Maintain a parallel batched data application (e.g., Apache Spark) for backfill, reading source data from lakehouse storage with long retention.
- Kappa architecture – The event streaming application is capable of streaming data from both data streams (for production) and lakehouse storage (for backfill).
- Unified batch and streaming – Data processing frameworks, such as Apache Beam, support both streaming mode (for production) and batch mode (for backfill).

Table 2

Preventing Data Quality Regressions

Let's say a data pipeline has a comprehensive collection of data quality metrics implemented and a data recovery mechanism to ensure that reasonable historical data can be backfilled at any time. What could go wrong from here? Without prevention mechanisms, the data engineering team can only react passively to data quality issues, finding themselves busy putting out the same fire over and over again. To truly future-proof the data pipeline, data engineers must proactively establish programmatic data contracts to prevent data quality regressions at the root. Data quality issues can come either from upstream systems or from the application logic maintained by data engineers. In both cases, data contracts should be implemented programmatically, such as unit tests and/or integration tests, to stop any contract-breaking changes from going into production.
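To make the idea of a programmatic data contract concrete, here is a minimal, illustrative sketch in Python. It assumes the pipeline's output can be sampled into a pandas DataFrame and that the contract covers the account_id and account_type expectations from Table 1; the function and constant names are hypothetical, and a real implementation would run checks like these in CI or in the audit step of a WAP flow, typically with a test runner such as pytest.

Python

import pandas as pd

VALID_ACCOUNT_TYPES = {"FREE", "STANDARD", "MAX"}

def load_output_sample() -> pd.DataFrame:
    # Placeholder: in a real pipeline this would read a sample of the pipeline's
    # output (e.g., the staging data written by the "write" step of a WAP flow).
    return pd.DataFrame(
        {"account_id": [1, 2, 3], "account_type": ["FREE", "MAX", "STANDARD"]}
    )

def test_account_id_is_positive_and_not_null():
    df = load_output_sample()
    assert df["account_id"].notnull().all()
    assert (df["account_id"] > 0).all()

def test_account_type_within_allowed_values():
    df = load_output_sample()
    assert set(df["account_type"].unique()) <= VALID_ACCOUNT_TYPES

Run against fixture data, checks like these block contract-breaking code changes before deployment; run against production samples, they double as the audit step described earlier.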
As a more concrete end-to-end example, let's say that a data engineering team owns a data pipeline that consumes advertisement impression logs for an online retail store. The expectations for the impression data logging should be implemented as unit and/or regression tests in the client-side logging test suite, since that suite is owned by the client and data engineering teams. The advertisement impression logs are stored in a Kafka topic, and the expectation on the data schema is maintained in a Schema Registry to ensure the events have compatible data schemas for both producers and consumers. As the main logic of the data pipeline is attributing advertisement click events to impression events, the data engineering team developed unit tests with mocked client-side logs and dependent services to validate the core attribution logic, and integration tests to verify that all components of the data system together produce the correct final output.

Conclusion

Data quality should be the first priority of every data pipeline, and the data architecture should be designed with data quality in mind. The first step of building robust and resilient data systems is defining a set of data quality metrics based on the business use cases. Data quality metrics should be captured as part of the data system and monitored continuously, and the data should be able to be backfilled at all times to minimize the potential impact on downstream users in case of data quality issues. The implementation of data quality monitors and backfill methods can differ between batched and event streaming systems. Last but not least, data engineers should establish programmatic data contracts as code to proactively prevent data quality regressions. Only when data engineering systems are future-proofed to deliver high-quality data can data-driven business decisions be made with confidence.
Gossip protocol is a communication scheme used in distributed systems for efficiently disseminating information among nodes. It is inspired by the way people gossip, where information spreads through a series of casual conversations. This article will discuss the gossip protocol in detail, followed by its potential implementation in social media networks, including Instagram. We will also include code snippets to provide a deeper technical understanding.

Gossip Protocol

The gossip protocol is based on an epidemic algorithm that uses randomized communication to propagate information among nodes in a network. The nodes exchange information about their state and the state of their neighbors. This process is repeated at regular intervals, ensuring that the nodes eventually become aware of each other's states. The key features of gossip protocol include:

- Fault-tolerance: The protocol can handle node failures effectively, as it does not rely on a central authority or a single point of failure.
- Scalability: Gossip protocol can efficiently scale to large networks with minimal overhead.
- Convergence: The system converges to a consistent state quickly, even in the presence of failures or network delays.

Gossip Protocol in Social Media Networks: Instagram

Social media networks are distributed systems that need to handle massive amounts of data and user interactions. One of the critical aspects of such networks is efficiently propagating updates and notifications to users. Gossip protocol can be employed to achieve this goal by allowing user nodes to exchange information about their state and the state of their connections. For instance, consider Instagram, a social media platform where users can post photos and follow other users. When a user posts a new photo, it needs to be propagated to all their followers. Using the gossip protocol, the photo can be efficiently disseminated across the network, ensuring that all followers receive the update in a timely manner.

Technical Implementation of Gossip Protocol in Social Media Networks

To illustrate the implementation of gossip protocol in a social media network, let's consider a simplified example using Python. In this example, we will create a basic network of users who can post updates and follow other users, similar to Instagram. First, let's define a User class to represent a user in the network:

Python

class User:
    def __init__(self, user_id):
        self.user_id = user_id
        self.followers = set()
        self.posts = []

    def post_photo(self, photo):
        self.posts.append(photo)

    def follow(self, user):
        self.followers.add(user)

Next, we'll implement the gossip protocol to propagate updates among users.
We will create a GossipNetwork class that manages the user nodes and initiates gossip communication:

Python

import random

class GossipNetwork:
    def __init__(self):
        self.users = {}

    def add_user(self, user_id):
        self.users[user_id] = User(user_id)

    def post_photo(self, user_id, photo):
        self.users[user_id].post_photo(photo)
        self.gossip(user_id, photo)

    def gossip(self, user_id, photo):
        user = self.users[user_id]
        for follower in user.followers:
            # Propagate the photo to the follower
            self.users[follower].posts.append(photo)
            # Pass the photo on to a randomly chosen follower of the follower
            # and continue gossiping from there
            if len(self.users[follower].followers) > 0:
                next_follower = random.choice(list(self.users[follower].followers))
                self.users[next_follower].posts.append(photo)
                self.gossip(next_follower, photo)

The main method to test the behavior:

Python

if __name__ == "__main__":
    # Create a gossip network
    network = GossipNetwork()

    # Add users to the network
    for i in range(1, 6):
        network.add_user(i)

    # Establish follower relationships
    network.users[1].follow(2)
    network.users[2].follow(3)
    network.users[3].follow(4)
    network.users[4].follow(5)

    # Post a photo by user 1
    network.post_photo(1, "photo1")

    # Print the posts of each user
    for i in range(1, 6):
        print(f"User {i}: {network.users[i].posts}")

This code creates a simple network of five users with a chain of follower relationships (1 -> 2 -> 3 -> 4 -> 5). When user 1 posts a photo, it will be propagated through the gossip protocol to all users in the chain. The output will show that all users have received the posted photo:

Plain Text

User 1: ['photo1']
User 2: ['photo1']
User 3: ['photo1']
User 4: ['photo1']
User 5: ['photo1']

In this example, when a user posts a photo, the GossipNetwork.post_photo() method is called. This method initiates gossip communication by propagating the photo to the user's followers and their followers using the GossipNetwork.gossip() method.

Conclusion

The gossip protocol is an efficient and robust method for disseminating information among nodes in a distributed system. Its implementation in social media networks like Instagram can help propagate updates and notifications to users, ensuring timely delivery and fault tolerance. By understanding the inner workings of the gossip protocol in social media networks, developers can better appreciate its role in maintaining a consistent and reliable distributed platform.
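The example above pushes a new photo immediately and recursively at post time. Gossip protocols are usually described as running in periodic rounds, with each node contacting only a small random subset of peers per round, as the introduction notes. Below is a minimal, illustrative round-based sketch that reuses the User class from this example; the FANOUT constant and the function names are arbitrary choices for illustration, not part of any particular framework.

Python

import random

FANOUT = 2  # how many followers each user contacts per round (arbitrary choice)

def gossip_round(users):
    """Run one gossip round: each user pushes its posts to a few random followers."""
    for user in users.values():
        if not user.followers:
            continue
        peers = random.sample(sorted(user.followers), min(FANOUT, len(user.followers)))
        for peer_id in peers:
            peer = users[peer_id]
            for post in user.posts:
                if post not in peer.posts:
                    peer.posts.append(post)

def run_until_converged(users, max_rounds=20):
    """Repeat rounds until no new posts propagate (or the round limit is hit)."""
    for round_number in range(max_rounds):
        before = {uid: len(u.posts) for uid, u in users.items()}
        gossip_round(users)
        if {uid: len(u.posts) for uid, u in users.items()} == before:
            # No new deliveries this round; treat as converged for this simple sketch.
            return round_number
    return max_rounds

Calling run_until_converged(network.users) on the five-user chain above keeps spreading user 1's photo round by round until no user is missing it, which is closer to how gossip behaves in a long-running system.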
In production systems, new features sometimes need a data migration to be implemented. Such a migration can be done with different tools. For simple migrations, SQL can be used. It is fast and easily integrated into Liquibase or other tools to manage database migrations. This solution is for use cases that can not be done in SQL scripts. The Use Case The MovieManager project stores the keys to access TheMovieDB in the database. To improve the project, the keys should now be stored encrypted with Tink. The existing keys need to be encrypted during the data migration, and new keys need to be encrypted during the sign-in process. The movie import service needs to decrypt the keys to use them during the import. The Data Migration Update the Database Table To mark migrated rows in the "user1" table, a "migration" column is added in this Liquibase script: <changeSet id="41" author="angular2guy"> <addColumn tableName="user1"> <column defaultValue="0" type="bigint" name="migration"/> </addColumn> </changeSet> The changeSet adds the "migration" column to the "user1" table and sets the default value "0". Executing the Data Migration The data migration is started with the startMigration(...) method in the CronJobs class: ... private static volatile boolean migrationsDone = false; ... @Scheduled(initialDelay = 2000, fixedRate = 36000000) @SchedulerLock(name = "Migrations_scheduledTask", lockAtLeastFor = "PT2H", lockAtMostFor = "PT3H") public void startMigrations() { LOG.info("Start migrations."); if (!migrationsDone) { this.dataMigrationService.encryptUserKeys().thenApplyAsync(result -> { LOG.info("Users migrated: {}", result); return result; }); } migrationsDone = true; } The method startMigrations() is called with the @Scheduled annotation because that enables the use of @SchedulerLock. The @SchedulerLock annotation sets a database lock to limit the execution to one instance to enable horizontal scalability. The startMigrations() method is called 2 seconds after startup and then every hour with the @Scheduled annotation. The encryptUserKeys() method returns a CompletableFuture that enables the use of thenApplyAsync(...) to log the amount of migrated users nonblocking. The static variable migrationsDone makes sure that each application instance calls the dataMigrationService only once and makes the other calls essentially free. Migrating the Data To query the Users, the JpaUserRepository has the method findOpenMigrations: public interface JpaUserRepository extends CrudRepository<User, Long> { ... @Query("select u from User u where u.migration < :migrationId") List<User> findOpenMigrations(@Param(value = "migrationId") Long migrationId); } The method searches for entities where the migration property has not been increased to the migrationId that marks them as migrated. The DataMigrationService contains the encryptUserKeys() method to do the migration: @Service @Transactional(propagation = Propagation.REQUIRES_NEW) public class DataMigrationService { ... 
@Async public CompletableFuture<Long> encryptUserKeys() { List<User> migratedUsers = this.userRepository.findOpenMigrations(1L) .stream().map(myUser -> { myUser.setUuid(Optional.ofNullable(myUser.getUuid()) .filter(myStr -> !myStr.isBlank()) .orElse(UUID.randomUUID().toString())); myUser.setMoviedbkey(this.userDetailService .encrypt(myUser.getMoviedbkey(), myUser.getUuid())); myUser.setMigration(myUser.getMigration() + 1); return myUser; }).collect(Collectors.toList()); this.userRepository.saveAll(migratedUsers); return CompletableFuture.completedFuture( Integer.valueOf(migratedUsers.size()).longValue()); } } The service has the Propagation.REQUIRES_NEW in the annotation to make sure that each method gets wrapped in its own transaction. The encryptUserKeys() method has the Async annotation to avoid any timeouts on the calling side. The findOpenMigrations(...) method of the repository returns the not migrated entities and uses map for the migration. In the map it is first checked if the user's UUID is set, or if it is created and set. Then the encrypt(...) method of the UserDetailService is used to encrypt the user key, and the migration property is increased to show that the entity was migrated. The migrated entities are put in a list and saved with the repository. Then the result CompletableFuture is created to return the amount of migrations done. If the migrations are already done, findOpenMigrations(...) returns an empty collection and nothing is mapped or saved. The UserDetailServiceBase does the encryption in its encrypt() method: ... @Value("${tink.json.key}") private String tinkJsonKey; private DeterministicAead daead; ... @PostConstruct public void init() throws GeneralSecurityException { DeterministicAeadConfig.register(); KeysetHandle handle = TinkJsonProtoKeysetFormat.parseKeyset( this.tinkJsonKey, InsecureSecretKeyAccess.get()); this.daead = handle.getPrimitive(DeterministicAead.class); } ... public String encrypt(String movieDbKey, String uuid) { byte[] cipherBytes; try { cipherBytes = daead.encryptDeterministically( movieDbKey.getBytes(Charset.defaultCharset()), uuid.getBytes(Charset.defaultCharset())); } catch (GeneralSecurityException e) { throw new RuntimeException(e); } String cipherText = new String(Base64.getEncoder().encode(cipherBytes), Charset.defaultCharset()); return cipherText; } The tinkJsonKey is a secret, and must be injected as an environment variable or Helm chart value into the application for security reasons. The init() method is annotated with @PostConstruct to run as initialization, and it registers the config and creates the KeysetHandle with the tinkJsonKey. Then the primitive is initialized. The encrypt(...) method creates the cipherBytes with encryptDeterministcally(...) and the parameters of the method. The UUID is used to have unique cipherBytes for each user. The result is Base64 encoded and returned as String. Conclusion: Data Migration This migration needs to run as an application and not as a script. The trade-off is that the migration code is now in the application, and after the migration is run it, is dead code. That code should be removed then, but in the real world, the time to do this is limited and after some time it is forgotten. The alternative is to use something like Spring Batch, but doing that will take more effort and time because the JPA entities/repos can not be reused that easily. A TODO to clean up the method in the DataMigrationService should do the trick sooner or later. 
One operations constraint has to be considered: during migration, the database is in an inconsistent state and the user access to the applications should be stopped. Finally Using the Keys The MovieService contains the decrypt(...) method: @Value("${tink.json.key}") private String tinkJsonKey; private DeterministicAead daead; ... @PostConstruct public void init() throws GeneralSecurityException { DeterministicAeadConfig.register(); KeysetHandle handle = TinkJsonProtoKeysetFormat .parseKeyset(this.tinkJsonKey, InsecureSecretKeyAccess.get()); this.daead = handle.getPrimitive(DeterministicAead.class); } ... private String decrypt(String cipherText, String uuid) throws GeneralSecurityException { String result = new String(daead.decryptDeterministically( Base64.getDecoder().decode(cipherText), uuid.getBytes(Charset.defaultCharset()))); return result; } The properties and the init() method are the same as with the encryption. The decrypt(...) method first Base64 decodes the cipherText and then uses the result and the UUID to decrypt the key and return it as a String. That key string is used with the movieDbRestClient methods to import movie data into the database. Conclusion The Tink library makes using encryption easy enough. The tinkJsonKey has to be injected at runtime and should not be in a repo file or the application jar. A tinkJsonKey can be created with the EncryptionTest createKeySet(). The ShedLock library enables horizontal scalability, and Spring provides the toolbox that is used. The solution tries to balance the trade-offs for a horizontally scalable data migration that can not be done in a script.
String reversal is a common operation in programming that involves reversing the order of characters in a given string. While it may seem like a simple task, there are various algorithms and techniques to accomplish string reversal efficiently. Understanding these algorithms will equip you with the knowledge to manipulate and transform text in different programming contexts. This article will explore different string reversal algorithms, discuss their approaches, analyze their time and space complexities, and provide insights into choosing the most suitable algorithm for your specific requirements. The Importance of String Reversal Algorithms String reversal algorithms have numerous applications in programming. They are used for tasks such as text manipulation, palindrome detection, data encryption, and pattern matching. Reversing strings can be vital for solving programming challenges, implementing algorithms, or processing textual data. By exploring different string reversal techniques, you can improve your problem-solving skills and gain insights into algorithmic thinking. Naive Approach The most straightforward approach to reverse a string is to iterate through it from the last character to the first and build a new string character by character. This approach has a time complexity of O(n), where n is the length of the string. Although simple to implement, this method may not be the most efficient for large strings due to the need to create a new string. Two-Pointer Technique The two-pointer technique is a popular and efficient approach for string reversal. It involves initializing two pointers, one pointing to the first character of the string and the other pointing to the last character. The pointers gradually move towards the middle, swapping characters along the way until they meet. This approach has a time complexity of O(n/2), which simplifies to O(n), where n is the length of the string. It is an in-place reversal method that modifies the original string without requiring additional memory. Using Recursion Recursion can also be employed to reverse a string. The recursive algorithm breaks down the problem into smaller subproblems. It recursively calls the reverse function on the substring excluding the first character, and appends the first character at the end. The base case is when the string length becomes 0 or 1, in which case the string itself is returned. The time complexity of the recursive approach is O(n), and it requires additional space on the call stack for each recursive call. Built-in Functions or Libraries Many programming languages offer built-in functions or libraries specifically designed for string manipulation. These functions often include a built-in string reversal method that handles the reversal operation efficiently. Using these functions can be a convenient and optimized way to reverse strings. However, it's essential to be aware of the underlying implementation and any associated time or space complexities. Functional Programming Approach Functional programming languages offer elegant ways to reverse strings using higher-order functions. For example, in languages like Haskell or Lisp, functional constructs like fold or reduce can be utilized to reverse strings in a concise and declarative manner. These approaches showcase the power of functional programming paradigms for string manipulation tasks. 
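To ground the approaches above, here is a small, illustrative Python sketch of the two-pointer technique (done on a list of characters, since Python strings are immutable) and the recursive approach, plus the built-in slicing shortcut many Python programmers reach for. It assumes simple strings; the Unicode caveats covered in the next section still apply.

Python

def reverse_two_pointer(s: str) -> str:
    # Python strings are immutable, so swap characters in a list copy in place.
    chars = list(s)
    left, right = 0, len(chars) - 1
    while left < right:
        chars[left], chars[right] = chars[right], chars[left]  # swap and move inward
        left += 1
        right -= 1
    return "".join(chars)

def reverse_recursive(s: str) -> str:
    # Base case: strings of length 0 or 1 are their own reverse.
    if len(s) <= 1:
        return s
    # Reverse everything after the first character, then append the first character.
    return reverse_recursive(s[1:]) + s[0]

print(reverse_two_pointer("hello"))  # olleh
print(reverse_recursive("hello"))    # olleh
print("hello"[::-1])                 # olleh -- built-in slicing shortcut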
Unicode and Multibyte Character Considerations When dealing with Unicode strings or strings containing multibyte characters, extra care must be taken during string reversal. Since some characters occupy multiple bytes or code points, a simple character-level reversal may lead to incorrect results. Proper encoding and decoding techniques should be applied to ensure accurate reversal while preserving character integrity. Advanced Techniques In certain scenarios, specialized techniques can provide further optimizations. For example, when dealing with very large strings or performance-critical applications, using character arrays or mutable string types can offer improved efficiency compared to immutable string objects. Analyzing the specific requirements and constraints of your application can help identify opportunities for optimization. Conclusion Even though it might seem simple to reverse a string, choosing the right algorithm can have a significant impact on performance and efficiency. The algorithm to use depends on a number of variables, including the length of the string, memory requirements, and desired level of time complexity. The two-pointer technique is a popular and effective method that works in place, whereas the naive approach and recursive method are straightforward to implement. Additionally, many programming languages can offer optimized solutions by utilizing built-in functions or libraries. Making educated choices when it comes to reversing strings in your code requires an understanding of the various string reversal algorithms and their traits. Think about the particular specifications of your application, the volume of the input, and the desired trade-offs between processing time and memory consumption. The effectiveness and performance of your programs can be improved by using the appropriate string reversal algorithm, resulting in a seamless execution and the best possible resource utilization. String reversal algorithms play a significant role in text manipulation and various programming tasks. By exploring different techniques like iterative reversal, built-in functions, recursion, pointer manipulation, and functional programming approaches, you can choose the most suitable algorithm for your specific programming language and context. Understanding these algorithms helps you become a better problem-solver and gives you access to tools that you can use to manipulate and transform textual data efficiently. You can improve your comprehension of text manipulation and your ability to write effective, elegant code by continually exploring and applying string reversal algorithms.
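As a closing illustration of the Unicode caveat discussed above: naive code-point reversal can detach combining marks from their base characters. The sketch below shows the problem and one partial mitigation via canonical composition (NFC); handling every grapheme cluster correctly (emoji with modifiers, for example) generally requires a dedicated grapheme-segmentation library, which is not shown here.

Python

import unicodedata

# "é" written as "e" plus a combining acute accent: two code points, one visible character.
word = "cafe\u0301"

# Naive reversal moves the combining accent away from its base letter, garbling the text.
print(word[::-1])

# Partial fix: normalize to composed form (NFC) first, so "e" + accent becomes a single "é".
composed = unicodedata.normalize("NFC", word)
print(composed[::-1])  # éfac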
Databricks unveiled Liquid Clustering at this year's Data + AI Summit, a new approach aimed at improving both read and write performance through a dynamic data layout.

Recap: Partitioning and Z-Ordering

Both partitioning and Z-ordering rely on data layout to perform data processing optimizations. They are complementary since they operate on different levels and apply to different types of columns.

- Partition on the most queried, low-cardinality columns. Do not partition tables that contain less than 1TB of data. Rule of thumb: every partition should contain at least 1GB of data.
- Z-order on the most queried, high-cardinality columns. Use Z-order indexes alongside partitions to speed up queries on large datasets. Z-order clustering only occurs within a partition and cannot be applied to fields used for partitioning.

Now, let's assume we've come up with the right partition strategy for our data and Z-ordered on the correct columns. First off, insert, delete, and update operations break Z-ordering. Although low shuffle merge tries to preserve the data layout on existing data that is not modified, the updated data layout may still end up not being optimal, so it may be necessary to run the OPTIMIZE ZORDER BY commands again each time. Although Z-ordering aims to be an incremental operation, the actual time it takes is not guaranteed to decrease over multiple runs. But more importantly, query patterns change over time. The same partition that worked well in the past might now be suboptimal. Partition evolution is a real challenge, and to my knowledge, only the Iceberg table format has support for it so far.

Liquid Clustering

Liquid Clustering (abbreviated as LC in this article) automatically adjusts the data layout based on clustering keys. In contrast to a fixed data layout, as in Hive-style partitioning, the flexible ("liquid") layout dynamically adjusts to changing query patterns, addressing the problem of suboptimal partitioning, column cardinality, etc. Clustering columns can be changed without rewriting the data.

For this (very) short example, we're using the farmers markets geographic dataset just to try out the commands. Let's switch to SQL, run a CTAS on the markets table, and cluster by the State and County columns without bothering with either partitioning or Z-ordering. We enabled LC by specifying the clustering columns with CLUSTER BY and triggered the process with the OPTIMIZE command. We can then verify the clustered columns by inspecting the table's detailed metadata. A few important details emerge from that output:

- Through table protocol versioning, Delta Lake tracks minimum reader and writer versions separately for Delta Lake evolution. As a new milestone in that evolution, LC automatically bumps up the reader and writer versions to 3 and 7, respectively. Be aware that protocol version upgrades are irreversible and may break existing Delta Lake table readers, writers, or both.
- By the same token, Delta Lake clients need to support deletion vectors. Those are optimization features (Databricks Runtime 12.1 and above) allowing you to mark deleted rows as such, but without rewriting the whole Parquet file.

There are a couple of other ways to display the same information, and we can read the table filtering by the clustering keys. Up to four columns with statistics collected on them can be specified as clustering keys, and those keys can be changed at any time using the ALTER TABLE command.
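Here is a minimal, illustrative sketch of the commands just described, written as PySpark spark.sql() calls. It assumes a Databricks Runtime 13.2+ environment with an active spark session; the markets table and the State/County columns follow the example in the text, while farmers_markets_raw is a placeholder name for the source data.

Python

# Create the table with Liquid Clustering enabled on the State and County columns.
spark.sql("""
    CREATE TABLE markets
    CLUSTER BY (State, County)
    AS SELECT * FROM farmers_markets_raw
""")

# Trigger (incremental) clustering.
spark.sql("OPTIMIZE markets")

# Inspect the table's detailed metadata to verify the clustering columns.
spark.sql("DESCRIBE DETAIL markets").show(truncate=False)

# Clustering keys can be changed later without rewriting existing data.
spark.sql("ALTER TABLE markets CLUSTER BY (State)")

# Reads that filter on the clustering keys benefit from the clustered layout.
spark.sql("SELECT * FROM markets WHERE State = 'CA' AND County = 'Alameda'").show()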
Remember that, by default, the first 32 columns in Delta tables have statistics collected on them. This number can be controlled through the table property delta.dataSkippingNumIndexedCols. All columns with a position index less than that property value will have statistics collected.

As for writing to a clustered table, there are some important limitations at this time. Only the following operations automatically cluster data on write, provided they do not exceed 512GB of data in size:

- INSERT INTO
- CREATE TABLE AS SELECT (CTAS) statements
- COPY INTO from Parquet
- Write appends, i.e., spark.write.format("delta").mode("append")

Because of these limitations, LC should be scheduled on a regular basis using the OPTIMIZE command to ensure that data is effectively clustered. As the process is incremental, those jobs should not take long. Note that LC is in Public Preview at this time and requires Databricks Runtime 13.2 and above. It aims at effectively replacing both Hive-style partitioning and Z-ordering, which relied on a (static) physical data layout to perform their optimizations. Be sure to give it a try.
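As a starting point for trying it out, here is a small, illustrative PySpark sketch covering the write-path details above: tuning the statistics property, appending data so new writes are clustered automatically, and running a regularly scheduled OPTIMIZE. The spark session, the new_rows_df DataFrame, and the markets table name are placeholders carried over from the earlier example.

Python

# Widen (or narrow) how many leading columns get statistics collected for data skipping.
spark.sql("ALTER TABLE markets SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '8')")

# Appends below the size limit are clustered automatically on write.
new_rows_df.write.format("delta").mode("append").saveAsTable("markets")

# A periodic OPTIMIZE job keeps the table clustered incrementally.
spark.sql("OPTIMIZE markets")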
In this tutorial, we’ll learn how to make a website for collecting digital collectibles (or NFTs) on the blockchain Flow. We'll use the smart contract language Cadence along with React to make it all happen. We'll also learn about Flow, its advantages, and the fun tools we can use. By the end of this article, you’ll have the tools and knowledge you need to create your own apps. Let’s dive right in! Final Output What are we building? We're building an application for digital collectibles, and each collectible using a Non-Fungible Token (NFT). To make all this work, we will use Flow's NonFungibleToken Standard, which is a set of rules that helps us manage these special digital items. It's similar to ERC-721, which is used on a different platform called Ethereum. However, since we are using the Cadence programming language, there are some small differences to be aware of. Our app will allow you to collect NFTs, and each item will be unique from the others. Prerequisites Before you begin, be sure to install the Flow CLI on your system. If you haven't done so, follow these installation instructions. Setting Up If you're ready to kickstart your project, the first thing you need to do is type in the command "flow setup." This command does some magic behind the scenes to set up the foundation of your project. It creates a folder system and sets up a file called "flow.json" to configure your project, making sure everything is organized and ready to go! The project will contain the following folders and files: /contracts: Contains all Cadence contracts. /scripts: Holds all Cadence scripts. /transactions: Stores all Cadence transactions. /tests: Contains all Cadence tests. flow.json: A configuration file for your project, automatically maintained. Follow the steps below to use Flow NFT Standard. Step 1: Make a new folder. First, go to the "flow-collectibles-portal" folder and find the "Cadence" folder. Inside it, create a new folder called "interfaces." Step 2: Create a file. Inside the "interfaces" folder, make a new file and name it "NonFungibleToken.cdc." Step 3: Copy and paste. Now, open the link named NonFungibleToken which contains the NFT standard. Copy all the content from that file and paste it into the new file you just created ("NonFungibleToken.cdc"). That's it! You've successfully set up the standards for your project.Now, let’s write some code! But before we dive into coding, it's important to establish a mental model of how our code will be structured. As developers, it's crucial to have a clear idea. At the top level, our codebase consists of three main components: NFT: Each collectible is represented as an NFT. Collection: A collection refers to a group of NFTs owned by a specific user. Global Functions and Variables: These are functions and variables defined at the global level for the smart contract and are not associated with any particular resource. Collectibles Smart Contract Creating the Collectibles Smart Contract Create a new file named Collectibles.cdc inside flow-collectibles-portal/cadence/contracts. This is where we will write the code for our NFT Collection. Contract Structure JavaScript import NonFungibleToken from "./interfaces/NonFungibleToken.cdc" pub contract Collectibles: NonFungibleToken{ pub var totalSupply: UInt64 // other code will come here init(){ self.totalSupply = 0 } } Let's break down the code line by line: First, we'll need to include something called "NonFungibleToken" from our interface folder. This will help us with our contract. 
Now, let's write the contract itself. We use the word "contract" followed by the name of the contract. (For this example, let’s call it "Collectibles".) We’ll write all the code inside this contract. Next, we want to make sure our contract follows certain rules. To do that, we use a special syntax “NonFungibleToken", which means our contract will follow the NonFungibleToken standard. Then, we’ll create a global variable called "totalSupply." This variable will keep track of how many Collectibles we have. We use the data type "UInt64" for this, which simply means we can only have positive numbers in this variable. No negative numbers allowed! Now, let's give "totalSupply" an initial value of 0, which means we don't have any Collectibles yet. We'll do this inside a function called "init()". That's it! We set up the foundation for our Collectibles contract. Now we can start adding more features and functionalities to make it even more exciting. Before moving forward, please check out the code snippet to understand how we define variables in cadence: NFT Structure Now, we'll create a simple NFT resource that holds all the data related to each NFT. We'll define the NFT resource with the pub resource keywords. Add the following code to your smart contract: JavaScript import NonFungibleToken from "./interfaces/NonFungibleToken.cdc" pub contract Collectibles: NonFungibleToken{ pub var totalSupply: UInt64 pub resource NFT: NonFungibleToken.INFT{ pub let id: UInt64 pub var name: String pub var image: String init(_id:UInt64, _name:String, _image:String){ self.id = _id self.name = _name self.image = _image } } init(){ self.totalSUpply = 0 } } As you have seen before, the contract implements the NonFungibleToken standard interface, represented by pub contract Collectibles: NonFungibleToken. Similarly, resources can also implement various resource interfaces. The NFT resource must also implement the NonFungibleToken.INFT interface, which is a super simple interface that just mandates the existence of a public property called id within the resource.This is a good opportunity to explain some of the variables we will be using in the NFT resource: id: The Token ID of the NFT name: The name of the owner who will mint this NFT. image: The image of the NFT. After defining the variable, make sure you initialize its value in the init() function.Let’s move forward and create another resource called Collection Resource. Collection Structure Imagine a Collection as a special folder on your computer that can hold unique digital items called NFTs. Every person who uses this system has their own Collection, just like how everyone has their own folders on their computer. To better understand, think of it like this: Your computer has a main folder, let's call it "My Account," and inside that, you have a special folder called "My Collection." Inside this "Collection" folder, you can keep different digital items, such as pictures, videos, or music files. Similarly, in this system, when you buy or create NFTs, they get stored in your personal Collection. For our Collectibles contract, each person who buys NFTs gets their own "Collection" folder, and they can fill it with as many NFTs as they like. It's like having a personal space to store and organize your unique digital treasures! 
JavaScript import NonFungibleToken from "./interfaces/NonFungibleToken.cdc" pub contract Collectibles: NonFungibleToken{ pub var totalSupply: UInt64 pub resource NFT: NonFungibleToken.INFT{ pub let id: UInt64 pub var name: String pub var image: String init(_id:UInt64, _name:String, _image:String){ self.id = _id self.name = _name self.image = _image } } // Collection Resource pub resource Collection{ } init(){ self.totalSUpply = 0 } } The Collection resource will have a public variable named ownedNFTs to store the NFT resources owned by this Collection. We'll also create a simple initializer for the Collection resource. JavaScript pub resource Collection { pub var ownedNFTs: @{UInt64: NonFungibleToken.NFT} init(){ self.ownedNFTs <- {} } } Resource Interfaces A resource interface in Flow is similar to interfaces in other programming languages. It sits on top of a resource and ensures that the resource that implements it has the stuff inside of the interface. It can also be used to restrict access to the whole resource and be more restrictive in terms of access modifiers than the resource itself. In the NonFungibleToken standard, there are several resource interfaces like INFT, Provider, Receiver, and CollectionPublic. Each of these interfaces has specific functions and fields that need to be implemented by the resource that uses them. In this contract, we will use these three interfaces coming from NonFungibleToken: Provider, Receiver, and CollectionPublic. These interfaces define functions like deposit, withdraw, borrowNFT, and getIDs. We will explain each of these in greater detail as we go. JavaScript pub resource interface CollectionPublic{ pub fun deposit(token: @NonFungibleToken.NFT) pub fun getIDs(): [UInt64] pub fun borrowNFT(id: UInt64): &NonFungibleToken.NFT } pub resource Collection: CollectionPublic, NonFungibleToken.Provider, NonFungibleToken.Receiver, NonFungibleToken.CollectionPublic{ pub var ownedNFTs: @{UInt64: NonFungibleToken.NFT} init(){ self.ownedNFTs <- {} } } Now, let's create the withdraw() function required by the interface. JavaScript pub fun withdraw(withdrawID: UInt64): @NonFungibleToken.NFT { let token <- self.ownedNFTs.remove(key: withdrawID) ?? panic("missing NFT") emit Withdraw(id: token.id, from: self.owner?.address) return <- token } This function first tries to move the NFT resource out of the dictionary. If it fails to remove it (the given withdrawID was not found, for example), it panics and throws an error. If it does find it, it emits a withdraw event and returns the resource to the caller. The caller can then use this resource and save it within their account storage.Now it’s time for the deposit() function required by NonFungibleToken.Receiver. JavaScript pub fun deposit(token: @NonFungibleToken.NFT) { let id = token.id let oldToken <- self.ownedNFTs[id] <-token destroy oldToken emit Deposit(id: id, to: self.owner?.address) } Now let’s focus on the two functions required by NonFungibleToken.CollectionPublic: borrowNFT() and getID(). JavaScript pub fun borrowNFT(id: UInt64): &NonFungibleToken.NFT { if self.ownedNFTs[id] != nil { return (&self.ownedNFTs[id] as &NonFungibleToken.NFT?)! } panic("NFT not found in collection.") } pub fun getIDs(): [UInt64]{ return self.ownedNFTs.keys } There is one last thing we need to do for the Collection Resource: specify a destructor. Adding a Destructor Since the Collection resource contains other resources (NFT resources), we need to specify a destructor. A destructor runs when the object is destroyed. 
This ensures that resources are not left "homeless" when their parent resource is destroyed. We don't need a destructor for the NFT resource as it contains no other resources. JavaScript destroy (){ destroy self.ownedNFTs } Check the complete collection resource source code: JavaScript pub resource interface CollectionPublic{ pub fun deposit(token: @NonFungibleToken.NFT) pub fun getIDs(): [UInt64] pub fun borrowNFT(id: UInt64): &NonFungibleToken.NFT } pub resource Collection: CollectionPublic, NonFungibleToken.Provider, NonFungibleToken.Receiver, NonFungibleToken.CollectionPublic{ pub var ownedNFTs: @{UInt64: NonFungibleToken.NFT} init(){ self.ownedNFTs <- {} } destroy (){ destroy self.ownedNFTs } pub fun withdraw(withdrawID: UInt64): @NonFungibleToken.NFT { let token <- self.ownedNFTs.remove(key: withdrawID) ?? panic("missing NFT") emit Withdraw(id: token.id, from: self.owner?.address) return <- token } pub fun deposit(token: @NonFungibleToken.NFT) { let id = token.id let oldToken <- self.ownedNFTs[id] <-token destroy oldToken emit Deposit(id: id, to: self.owner?.address) } pub fun borrowNFT(id: UInt64): &NonFungibleToken.NFT { if self.ownedNFTs[id] != nil { return (&self.ownedNFTs[id] as &NonFungibleToken.NFT?)! } panic("NFT not found in collection.") } pub fun getIDs(): [UInt64]{ return self.ownedNFTs.keys } } Now we have finished all the resources! Global Function Now, let's talk about some global functions you can use: createEmptyCollection: This function allows you to create an empty Collection in your account storage. checkCollection: This handy function helps you discover whether or not your account already has a collection. mintNFT: This function is super cool because it allows anyone to create an NFT. JavaScript pub fun createEmptyCollection(): @Collection{ return <- create Collection() } pub fun checkCollection(_addr: Address): Bool{ return getAccount(_addr) .capabilities.get<&{Collectibles.CollectionPublic}> (Collectibles.CollectionPublicPath)! .check() } pub fun mintNFT(name:String, image:String): @NFT{ Collectibles.totalSupply = Collectibles.totalSupply + 1 let nftId = Collectibles.totalSupply var newNFT <- create NFT(_id:nftId, _name:name, _image:image) return <- newNFT } Wrapping Up the Smart Contract Now we’ve finished writing our smart contract. The final code should look like the combined structure NFT resource, and Collection resources, along with the required interfaces and global functions. Transaction and Script What is a transaction? A transaction is a set of instructions that interact with smart contracts on the blockchain and modify its current state. It's like a function call that changes the data on the blockchain. Transactions usually involve some cost, which can vary depending on the blockchain you are on.On the other hand, we can use a script to view data on the blockchain, but it does not change it. Scripts are free and are used when you want to look at the state of the blockchain without altering it.Here is how a transaction is structured in Cadence: Import: The transaction can import any number of types from external accounts using the import syntax. For example, import NonFungibleToken from 0x01. Body: The body is declared using the transaction keyword and its contents are contained in curly brackets. It first contains local variable declarations that are valid throughout the whole of the transaction. Phases: There are two optional main phases: preparation and execution. The preparation and execution phases are blocks of code that execute sequentially. 
Prepare Phase: This phase is used to access data/information inside the signer's account (allowed by the AuthAccount type). Execute Phase: This phase is used to execute actions. Create Collection Transaction JavaScript import Collectibles from "../contracts/Collectibles.cdc" transaction { prepare(signer: AuthAccount) { if signer.borrow<&Collectibles.Collection>(from: Collectibles.CollectionStoragePath) == nil { let collection <- Collectibles.createEmptyCollection() signer.save(<-collection, to: Collectibles.CollectionStoragePath) let cap = signer.capabilities.storage.issue<&{Collectibles.CollectionPublic}>(Collectibles.CollectionStoragePath) signer.capabilities.publish( cap, at: Collectibles.CollectionPublicPath) } } } Let's break down the code line by line: This transaction interacts with Collectibles smart contract. Then it checks if the sender (signer) has a Collection resource stored in their account by borrowing a reference to the Collection resource from the specified storage path Collectibles.CollectionStoragePath. If the reference is nil, it means the signer does not have a collection yet. If the signer does not have a collection, then it creates an empty collection by calling the createEmptyCollection() function. After creating the empty collection, place into the signer's account under the specified storage path Collectibles.CollectionStoragePath. It establishes a link between the signer's account and the newly created collection using link(). Mint NFT Transaction JavaScript import NonFungibleToken from "../contracts/interfaces/NonFungibleToken.cdc" import Collectibles from "../contracts/Collectibles.cdc" transaction(name:String, image:String){ let receiverCollectionRef: &{NonFungibleToken.CollectionPublic} prepare(signer:AuthAccount){ self.receiverCollectionRef = signer.borrow<&Collectibles.Collection>(from: Collectibles.CollectionStoragePath) ?? panic("could not borrow Collection reference") } execute{ let nft <- Collectibles.mintNFT(name:name, image:image) self.receiverCollectionRef.deposit(token: <-nft) } } Let’s break down the code line by line: We first import the NonFungibleToken and Collectibles contract. transaction(name: String, image: String)This line defines a new transaction. It takes two arguments, name and image, both of type String. These arguments are used to pass the name and image of the NFT being minted. let receiverCollectionRef: &{NonFungibleToken.CollectionPublic}This line declares a new variable receiverCollectionRef. It is a reference to a public collection of NFTs of type NonFungibleToken.CollectionPublic. This reference will be used to interact with the collection where we will deposit the newly minted NFT. prepare(signer: AuthAccount)(This line starts the prepare block, which is executed before the transaction.) It takes an argument signer of type AuthAccount. AuthAccount represents the account of the transaction’s signer. Inside the prepare block, it borrows a reference to the Collectibles.Collection from the signer’s storage. It uses the borrow function to access the reference to the collection and store it in the receiverCollectionRef variable. If the reference is not found (if the collection doesn’t exist in the signer’s storage, for example), it will throw the error message “could not borrow Collection reference”. The execute block contains the main execution logic for the transaction. The code inside this block will be executed after the prepare block has successfully completed. 
nft <- Collectibles.mintNFT(_name: name, image: image)Inside the execute block, this line calls the mintNFT function from the Collectibles contract with the provided name and image arguments. This function is expected to create a new NFT with the given name and image. The <- symbol indicates that the NFT is being received as an object that can be moved (a resource). self.receiverCollectionRef.deposit(token: <-nft)This line deposits the newly minted NFT into the specified collection. It uses the deposit function on the receiverCollectionRef to transfer ownership of the NFT from the transaction’s executing account to the collection. The <- symbol here also indicates that the NFT is being moved as a resource during the deposit process. View NFT Script JavaScript import NonFungibleToken from "../contracts/interfaces/NonFungibleToken.cdc" import Collectibles from "../contracts/Collectibles.cdc" pub fun main(user: Address, id: UInt64): &NonFungibleToken.NFT? { let collectionCap= getAccount(user).capabilities .get<&{Collectibles.CollectionPublic}>(/public/NFTCollection) ?? panic("This public capability does not exist.") let collectionRef = collectionCap.borrow()! return collectionRef.borrowNFT(id: id) } Let's break down the code line by line: First we import the NonFungibleToken and Collectibles contract. pub fun main(acctAddress: Address, id: UInt64): &NonFungibleToken.NFT?This line defines the entry point of the script, which is a public function named main. The function takes two parameters: acctAddress: An Address type parameter representing the address of an account on the Flow Blockchain. id: A UInt64 type parameter representing the unique identifier of the NFT within the collection. Then we use getCapability to fetch the Collectibles.Collection capability for the specified acctAddress. A capability is a reference to a resource that allows access to its functions and data. In this case, it is fetching the capability for the Collectibles.Collection resource type. Then, we borrow an NFT from the collectionRef using the borrowNFT function. The borrowNFT function takes the id parameter, which is the unique identifier of the NFT within the collection. The borrow function on a capability allows reading the resource data. Finally, we return the NFT from the function. Testnet Deployment Follow the steps to deploy the collectibles contract to the Flow Testnet. 1. Set up a Flow account. Run the following command in the terminal to generate a Flow account: flow keys generate Be sure to write down your public key and private key. Next, we’ll head over to the Flow Faucet, create a new address based on our keys, and fund our account with some test tokens. Complete the following steps to create your account: Paste in your public key in the specified input field. Keep the Signature and Hash Algorithms set to default. Complete the Captcha. Click on Create Account. After setting up an account, we receive a dialogue with our new Flow address containing 1,000 test Flow tokens. Copy the address so we can use it going forward. 2. Configure the project. Ensure your project is configured correctly by verifying the contract's source code location, account details, and contract name. 
JSON { "emulators": { "default": { "port": 3569, "serviceAccount": "emulator-account" } }, "contracts": { "NonFungibleToken": { "source": "./cadence/contracts/interfaces/NonFungibleToken.cdc", "aliases": { "testnet": "0x631e88ae7f1d7c20" } }, "Collectibles": "./cadence/contracts/Collectibles.cdc" }, "networks": { "testnet": "access.devnet.nodes.onflow.org:9000" }, "accounts": { "emulator-account": { "address": "0xf8d6e0586b0a20c7", "key": "61dace4ff7f2fa75d2ec4a009f9b19d976d3420839e11a3440c8e60391699a73" }, "contract": { "address": "0x490b5c865c43d0fd", "key": { "type": "hex", "index": 0, "signatureAlgorithm": "ECDSA_P256", "hashAlgorithm": "SHA3_256", "privateKey": "private_key" } } }, "deployments": { "testnet": { "contract": [ "Collectibles" ] } } } 3. Copy and paste. Paste your generated private key and account address inside the accounts -> contract section. 4. Execute. Go to the terminal and run the following command: flow project deploy --network testnet 5. Wait for confirmation. After submitting the transaction, you'll receive a transaction ID. Wait for the transaction to be confirmed on the testnet, indicating that the smart contract has been successfully deployed. Check your deployed contract here: Flow Source. Final Thoughts and Congratulations! Congratulations! You have now built a collectibles portal on the Flow blockchain and deployed it to the testnet. What's next? Now you can work on building the frontend, which we will cover in part 2 of this series. Have a really great day!
Exploratory Data Analysis (EDA) is the initial phase of data analysis, where we examine and understand our data. One of the most powerful tools at our disposal during EDA is data visualization. Visualization allows us to represent data visually, helping us gain insights that are difficult to obtain from raw numbers alone. In this article, we'll explore 11 essential Python visualizations for EDA, providing concise explanations and Python code for each, along with the benefits of effective visualization. What Is Data Visualization in EDA? Data visualization in EDA is the process of representing data graphically to reveal patterns, trends, and relationships within the data. It involves creating charts, graphs, and plots to transform complex data into easily understandable visuals. Why Is Data Visualization Effective in EDA? Simplifies Complexity: Data can be complex, with numerous variables and data points. Visualization simplifies this complexity by presenting information in a visual format that's easy to comprehend. Pattern Recognition: Visualizations make it easier to identify patterns and relationships within the data, aiding in hypothesis generation and validation. Enhanced Communication: Visual representations of data are more accessible and engaging, making it simpler to convey findings and insights to stakeholders. Anomaly Detection: Visualizations can quickly highlight outliers or unusual data points, prompting further investigation. Time Efficiency: Visualizations provide a rapid overview of data, saving time compared to manually inspecting raw data. Now, let's explore 11 essential Python visualizations for EDA, each accompanied by a one-line explanation and Python code. 1. Scatter Matrix Plot A scatter matrix plot displays pairwise scatter plots between numerical features, aiding in the identification of relationships. Python import pandas as pd import seaborn as sns data = pd.read_csv('titanic.csv') sns.pairplot(data, hue="Survived") 2. Heatmap Heatmaps visualize the correlation between numerical features, helping to uncover dependencies in the data. Python import seaborn as sns import matplotlib.pyplot as plt correlation_matrix = data.corr() plt.figure(figsize=(10, 8)) sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm") 3. Box Plot Box plots represent the distribution and spread of data, useful for detecting outliers and understanding central tendencies. Python import seaborn as sns import matplotlib.pyplot as plt sns.boxplot(x="Pclass", y="Age", data=data) 4. Violin Plot Violin plots combine box plots with kernel density estimation, offering a detailed view of data distribution. Python import seaborn as sns import matplotlib.pyplot as plt sns.violinplot(x="Pclass", y="Age", data=data) 5. Interactive Scatter Plot (Plotly) Plotly allows the creation of interactive scatter plots, providing additional information on hover. Python import plotly.express as px fig = px.scatter(data, x="Fare", y="Age", color="Survived", hover_name="Name") fig.show() 6. Word Cloud Word clouds visually represent word frequency in text data, aiding text analysis. Python from wordcloud import WordCloud import matplotlib.pyplot as plt # Sample text data text = """ This is a sample text for creating a word cloud. Word clouds are a great way to visualize word frequency in text data. They can reveal the most common words in a document or corpus. Word clouds are often used for text analysis and data exploration. 
""" # Create a WordCloud object wordcloud = WordCloud(width=800, height=400, background_color="white").generate(text) # Display the word cloud plt.figure(figsize=(10, 5)) plt.imshow(wordcloud, interpolation="bilinear") plt.axis("off") plt.show() 7. Stacked Bar Chart (Altair) Altair is great for creating stacked bar charts effectively visualizing data in different categories. Python import matplotlib.pyplot as plt # Sample data categories = ['Category A', 'Category B', 'Category C'] values1 = [10, 15, 8] values2 = [5, 12, 10] # Create the figure and axes objects fig, ax = plt.subplots() # Create stacked bar chart bar1 = ax.bar(categories, values1, label='Value 1') bar2 = ax.bar(categories, values2, bottom=values1, label='Value 2') # Add labels and legend ax.set_xlabel('Categories') ax.set_ylabel('Values') ax.set_title('Stacked Bar Chart') ax.legend() # Show the plot plt.show() 8. Parallel Coordinates Plot Parallel coordinates plots help visualize high-dimensional data by connecting numerical features with lines. Python from pandas.plotting import parallel_coordinates import matplotlib.pyplot as plt parallel_coordinates(data[['Age', 'Fare', 'Pclass', 'Survived']], 'Survived', colormap=plt.get_cmap("Set2")) 9. Sankey Diagrams Sankey diagrams are powerful for visualizing the flow of data, energy, or resources. They are increasingly used in fields such as data science, sustainability, and process analysis to illustrate complex systems and the distribution of resources. Python import plotly.graph_objects as go fig = go.Figure(go.Sankey( node=dict( pad=15, thickness=20, line=dict(color="black", width=0.5), label=["Source", "Node A", "Node B", "Node C", "Destination"], ), link=dict( source=[0, 0, 1, 1, 2, 3], target=[1, 2, 2, 3, 3, 4], value=[4, 3, 2, 2, 2, 4], ), )) fig.update_layout(title_text="Sankey Diagram Example", font_size=10) fig.show() 10. Sunburst Charts Sunburst charts are hierarchical visualizations that show the breakdown of data into nested categories or levels. They are useful for displaying hierarchical data structures, such as organizational hierarchies or nested file directories. Python import plotly.express as px data = dict( id=["A", "B", "C", "D", "E"], labels=["Category A", "Category B", "Category C", "Category D", "Category E"], parent=["", "", "", "C", "C"], values=[10, 20, 15, 5, 10] ) fig = px.sunburst(data, path=['parent', 'labels'], values='values') fig.update_layout(title_text="Sunburst Chart Example") fig.show() 11. Tree Maps With Heatmaps Tree maps visualize hierarchical data by nesting rectangles within larger rectangles, with each rectangle representing a category or element. The addition of heatmaps to tree maps provides a way to encode additional information within each rectangle's color. Python import plotly.express as px data = px.data.tips() fig = px.treemap( data, path=['day', 'time', 'sex'], values='total_bill', color='tip', hover_data=['tip'], color_continuous_scale='Viridis' ) fig.update_layout(title_text="Tree Map with Heatmap Example") fig.show() Conclusion In conclusion, data visualization is a powerful tool for data exploration, analysis, and communication. Through this article, we've explored 11 advanced Python visualization techniques, each serving unique purposes in uncovering insights from data. From scatter matrix plots to interactive time series visualizations, these methods empower data professionals to gain deeper insights, communicate findings effectively, and make informed decisions. 
Data visualization is not only about creating aesthetically pleasing graphics but also about transforming raw data into actionable insights, making it an indispensable part of the data analysis toolkit. Embracing these visualization techniques can greatly enhance your ability to understand and convey complex data, ultimately driving better outcomes in various fields. Do you have any questions related to this article? Leave a comment and ask your question, and I will do my best to answer it. Thanks for reading!
In our previous post, we delved into problems of pathfinding in graphs, which are inherently connected to solving mazes. When I set out to create a maze map for the Wall-E project, I initially expected to find a quick and easy way to accomplish this task. However, I quickly found myself immersed in the vast and fascinating world of mazes and labyrinths. I was unaware of the breadth and depth of this topic before. I discovered that mazes can be classified in seven different ways, each with numerous variations and countless algorithms for generating them. Surprisingly, I couldn't find any algorithmic books that comprehensively covered this topic, and even the Wikipedia page didn't provide a systematic overview. Fortunately, I stumbled upon a fantastic resource that covers various maze types and algorithms, which I highly recommend exploring. I embarked on a journey to learn about the different classifications of mazes, including dimensional and hyperdimensional variations, perfect mazes versus unicursal labyrinths, planar and sparse mazes, and more. How To Create a Maze My primary goal was to generate a 2D map representing a maze. While it would have been enticing to implement various maze-generation algorithms to compare them, I also wanted a more efficient approach. The quickest solution I found involved randomly selecting connected cells. That's precisely what I did with mazerandom. This one-file application creates a grid table of 20 x 20 cells and then randomly connects them using a Depth-First Search (DFS) traversal. In other words, we're simply carving passages in the grid. If you were to do this manually on paper, it would look something like this: To achieve this algorithmically, we apply Depth-First Search to the grid of cells. Let's take a look at how it's done in the Main.cpp. As usual, we represent the grid of cells as an array of arrays, and we use a stack for DFS: C++ vector<vector<int>> maze_cells; // A grid 20x20 stack<Coord> my_stack; // Stack to traverse the grid by DFS my_stack.push(Coord(0, 0)); // Starting from very first cell We visit every cell in the grid and push its neighbors onto the stack for deep traversal: C++ ... while (visitedCells < HORIZONTAL_CELLS * VERTICAL_CELLS) { vector<int> neighbours; // Step 1: Create an array of neighbour cells that were not yet visited (from North, East, South and West). // North is not visited yet? if ((maze_cells[offset_x(0)][offset_y(-1)] & CELL_VISITED) == 0) { neighbours.push_back(0); } // East is not visited yet? if ((maze_cells[offset_x(1)][offset_y(0)] & CELL_VISITED) == 0) { neighbours.push_back(1); } ... // Do the same for West and South... The most complex logic involves marking the node as reachable (i.e., no wall in between) with CELL_PATH_S, CELL_PATH_N, CELL_PATH_W, or CELL_PATH_E: C++ ... // If we have at least one unvisited neighbour if (!neighbours.empty()) { // Choose random neighbor to make it available int next_cell_dir = neighbours[rand() % neighbours.size()]; // Create a path between the neighbour and the current cell switch (next_cell_dir) { case 0: // North // Mark it as visited. Mark connection between North and South in BOTH directions. maze_cells[offset_x(0)][offset_y(-1)] |= CELL_VISITED | CELL_PATH_S; maze_cells[offset_x(0)][offset_y(0)] |= CELL_PATH_N; // my_stack.push(Coord(offset_x(0), offset_y(-1))); break; case 1: // East // Mark it as visited. Mark connection between East and West in BOTH directions. 
maze_cells[offset_x(1)][offset_y(0)] |= CELL_VISITED | CELL_PATH_W; maze_cells[offset_x(0)][offset_y(0)] |= CELL_PATH_E; my_stack.push(Coord(offset_x(1), offset_y(0))); break; ... // Do the same for West and South... } visitedCells++; } else { my_stack.pop(); } ... Finally, it calls the drawMaze method to draw the maze on the screen using the SFML library. It draws a wall between two cells if the current cell isn't marked with CELL_PATH_S, CELL_PATH_N, CELL_PATH_W, or CELL_PATH_E. However, this maze doesn't guarantee a solution. In many cases, it will generate a map with no clear path between two points. While this randomness might be interesting, I wanted something more structured. The only way to ensure a solution for the maze is to use a predetermined structure that connects every part of the maze in some way. Creating a Maze Using Graph Theory Well-known maze generation algorithms rely on graphs. Each cell is a node in the graph, and every node must have at least one connection to other nodes. As mentioned earlier, mazes come in many forms. Some, called "unicursal" mazes, act as labyrinths with only one entrance, which also serves as the exit. Others may have multiple solutions. However, the process of generation often starts with creating a "perfect" maze. A "perfect" maze, also known as a simply-connected maze, lacks loops, closed circuits, and inaccessible areas. From any point within it, there is precisely one path to any other point. The maze has a single, solvable solution. If we use a graph as the internal representation of our maze, constructing a spanning tree ensures that there is a path from the start to the end. In computer science terms, such a maze can be described as a spanning tree over the set of cells or vertices. Multiple spanning trees may exist, but the goal is to ensure at least one solution from the start to the end, as shown in the example below: The image above depicts only one solution, but there are actually multiple paths. No cell is isolated and impossible to reach. So, how do we achieve this? I discovered a well-designed mazegenerator codebase by @razimantv that accomplishes this, generating mazes in SVG file format. Therefore, I forked the repository and based my solution on it. Kudos to @razimantv for the elegant OOP design, which allowed me to customize the results to create visually appealing images using the SFML library or generate a text file with the necessary map description for my Wall-E project. I refactored the code to remove unnecessary components and focus exclusively on rectangular mazes. However, I retained support for various algorithms to build a spanning tree. I also added comments throughout the codebase for easier comprehension, so I don't need to explain it in every detail here. The main pipeline can be found in \mazegenerator\maze\mazebaze.cpp: C++ /** * \param algorithm Algorithm that is used to generate maze spanning tree. */ void MazeBase::GenerateMaze(SpanningtreeAlgorithmBase* algorithm) { // Generates entire maze spanning tree auto spanningTreeEdges = algorithm->SpanningTree(_verticesNumber, _edgesList); // Find a solution of a maze based on Graph DFS. _Solve(spanningTreeEdges); // Build a maze by removing unnecessary edges. 
_RemoveBorders(spanningTreeEdges); } I introduced visualization using the SFML graphics library, thanks to a straightforward _Draw function. While DFS is the default algorithm for creating a spanning tree, there are multiple algorithms available as options. The result is a handy utility that generates rectangular "perfect" mazes and displays them on the screen: As you can see, it contains exactly one entrance and one exit, at the top-left and bottom-right corners. The code still generates an SVG file, which is a nice addition (though it is the core function of the original codebase). Now I can proceed with my experiments in the Wall-E project, and I leave you here, hoping that you're inspired to explore this fascinating world of mazes and embark on your own journey. Stay tuned!
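As a postscript to this walkthrough, here is a minimal Python sketch, separate from the C++ utility described above, that carves a "perfect" maze with an iterative randomized DFS. The carved passages form a spanning tree over the grid cells, so exactly one path exists between any two cells; the grid size and the function name generate_perfect_maze are assumptions made only for this illustration.
Python
import random

def generate_perfect_maze(width, height, seed=None):
    """Carve a perfect maze: the carved passages form a random spanning
    tree over the grid cells, so any two cells are joined by exactly one path."""
    rng = random.Random(seed)
    # passages[cell] holds the set of neighbouring cells reachable from it
    passages = {(x, y): set() for x in range(width) for y in range(height)}
    visited = {(0, 0)}
    stack = [(0, 0)]                      # iterative randomized DFS
    while stack:
        x, y = stack[-1]
        neighbours = [(x + dx, y + dy)
                      for dx, dy in ((0, -1), (1, 0), (0, 1), (-1, 0))
                      if (x + dx, y + dy) in passages and (x + dx, y + dy) not in visited]
        if neighbours:
            nxt = rng.choice(neighbours)  # carve a passage to a random unvisited neighbour
            passages[(x, y)].add(nxt)
            passages[nxt].add((x, y))
            visited.add(nxt)
            stack.append(nxt)
        else:
            stack.pop()                   # dead end: backtrack
    return passages

if __name__ == "__main__":
    maze = generate_perfect_maze(20, 20, seed=42)
    # A spanning tree over N cells always has N - 1 edges:
    edge_count = sum(len(n) for n in maze.values()) // 2
    print(edge_count)  # 399 for a 20 x 20 grid
Swapping the random neighbour choice for a different selection strategy yields the alternative spanning-tree algorithms mentioned above.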
Sorting is a fundamental operation in computer science and is crucial for organizing and processing large sets of data efficiently. There are numerous sorting algorithms available, each with its unique characteristics and trade-offs. Whether you’re a beginner programmer or an experienced developer, understanding sorting algorithms is essential for optimizing your code and solving real-world problems efficiently. Sorting algorithms play a crucial role in computer science and programming, enabling efficient organization and retrieval of data. In this article, we will dive into the world of sorting algorithms, exploring their various types, their strengths, and their best use cases. Understanding these algorithms will empower you to choose the most suitable sorting technique for your specific requirements. What Are Sorting Algorithms? Sorting algorithms are algorithms designed to arrange elements in a specific order, typically ascending or descending. They are fundamental tools in computer science and play a vital role in data organization and retrieval. Sorting algorithms take an unsorted collection of elements and rearrange them according to a predetermined criterion, allowing for easier searching, filtering, and analysis of data. The primary goal of sorting algorithms is to transform a disordered set of elements into a sequence that follows a specific order. The order can be based on various factors, such as numerical value, alphabetical order, or custom-defined criteria. Sorting algorithms operate on different data structures, including arrays, lists, trees, and more. These algorithms come in various types, each with its own set of characteristics, efficiency, and suitability for different scenarios. Some sorting algorithms are simple and easy to implement, while others are more complex but offer improved performance for larger datasets. The choice of sorting algorithm depends on factors such as the size of the dataset, the expected order of the input, stability requirements, memory constraints, and desired time complexity. Sorting algorithms are not limited to a specific programming language or domain. They are widely used in a range of applications, including databases, search algorithms, data analysis, graph algorithms, and more. Understanding sorting algorithms is essential for developers and computer scientists, as it provides the foundation for efficient data manipulation and retrieval. Types of Sorting Algorithms Bubble Sort Bubble Sort is a simple and intuitive algorithm that repeatedly swaps adjacent elements if they are in the wrong order. It continues this process until the entire list is sorted. While easy to understand and implement, Bubble Sort has a time complexity of O(n²) in the worst case, making it inefficient for large datasets. It is primarily useful for educational purposes or when dealing with small datasets. Insertion Sort Insertion Sort works by dividing the list into a sorted and an unsorted part. It iterates through the unsorted part, comparing each element to the elements in the sorted part and inserting it at the correct position. Insertion Sort has a time complexity of O(n²) in the worst case but performs better than Bubble Sort in practice, particularly for partially sorted or small datasets. Selection Sort Selection Sort divides the list into a sorted and an unsorted part, similar to Insertion Sort. However, instead of inserting elements, it repeatedly finds the minimum element from the unsorted part and swaps it with the first element of the unsorted part. 
Selection Sort has a time complexity of O(n²) and is generally less efficient than Insertion Sort or more advanced algorithms. It is mainly used for educational purposes or small datasets. Merge Sort Merge Sort is a divide-and-conquer algorithm that recursively divides the list into smaller halves, sorts them, and then merges them back together. It has a time complexity of O(n log n), making it more efficient than the previous algorithms for large datasets. Merge Sort is known for its stability (preserving the order of equal elements) and is widely used in practice. Quick Sort Quick Sort, another divide-and-conquer algorithm, selects a “pivot” element and partitions the list around it such that all elements less than the pivot come before it, and all elements greater come after it. The algorithm then recursively sorts the two partitions. Quick Sort has an average time complexity of O(n log n), but it can degrade to O(n²) in the worst case. However, its efficient average-case performance and in-place sorting make it a popular choice for sorting large datasets. Heap Sort Heap Sort uses a binary heap data structure to sort the elements. It first builds a heap from the input list, then repeatedly extracts the maximum element (root) and places it at the end of the sorted portion. Heap Sort has a time complexity of O(n log n) and is often used when a guaranteed worst-case performance is required. Radix Sort Radix Sort is a non-comparative algorithm that sorts elements by processing individual digits or bits of the elements. It works by grouping elements based on each digit’s value and repeatedly sorting them until the entire list is sorted. Radix Sort has a time complexity of O(k * n), where k is the number of digits or bits in the input elements. It is particularly efficient for sorting integers or fixed-length strings. Choosing the Right Sorting Algorithm Choosing the right sorting algorithm depends on several factors, including the characteristics of the data set, the desired order, time complexity requirements, stability considerations, and memory constraints. Here are some key considerations to help you make an informed decision: Input Size: Consider the size of your data set. Some sorting algorithms perform better with smaller data sets, while others excel with larger inputs. For small data sets, simple algorithms like Bubble Sort or Insertion Sort may be sufficient. However, for larger data sets, more efficient algorithms like Merge Sort, Quick Sort, or Heap Sort are generally preferred due to their lower time complexity. Input Order: Take into account the initial order of the data set. If the data is already partially sorted or nearly sorted, algorithms like Insertion Sort or Bubble Sort can be advantageous as they have better performance under these conditions. They tend to have a lower time complexity when dealing with partially ordered inputs. Stability: Consider whether the stability of the sorting algorithm is important for your use case. A stable sorting algorithm preserves the relative order of elements with equal keys. If maintaining the original order of equal elements is crucial, algorithms like Merge Sort or Insertion Sort are stable options, while Quick Sort is not inherently stable. Time Complexity: Analyze the time complexity requirements for your application. Different sorting algorithms have varying time complexities. For example, Bubble Sort and Insertion Sort have average and worst-case time complexities of O(n²), making them less efficient for large data sets. 
Merge Sort and Heap Sort have average and worst-case time complexities of O(n log n), offering better performance for larger data sets. Quick Sort has an average time complexity of O(n log n), but its worst-case time complexity can reach O(n²) in certain scenarios. Memory Usage: Consider the memory requirements of the sorting algorithm. In-place algorithms modify the original data structure without requiring significant additional memory. Algorithms like Insertion Sort, Quick Sort, and Heap Sort can be implemented in place, which is beneficial when memory usage is a concern. On the other hand, algorithms like Merge Sort require additional memory proportional to the input size, as they create temporary arrays during the merging process. Specialized Requirements: Depending on the specific characteristics of your data or the desired order, there may be specialized sorting algorithms that offer advantages. For example, Radix Sort is useful for sorting integers or strings based on individual digits or characters. Conclusion In computer science and programming, sorting algorithms are essential for the effective manipulation and analysis of data. Although some of the most popular sorting algorithms were described in this article, it's crucial to remember that many other variants and specialized algorithms are also available. The sorting algorithm to use depends on a number of variables, including the dataset's size, distribution, memory requirements, and desired time complexity. Making informed choices and optimizing your code for particular contexts requires an understanding of the fundamentals and characteristics of the various sorting algorithms. Overall, sorting algorithms are powerful tools that enable efficient organization and retrieval of data. They allow us to transform unordered collections into ordered sequences, facilitating faster and easier data processing in various computational tasks.
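To complement the descriptions above, here is a minimal, illustrative Python sketch of two of the algorithms discussed, insertion sort and merge sort. The function names and the sample data are assumptions made for this example; in production code you would normally rely on the language's built-in sorting routines.
Python
def insertion_sort(items):
    """Insertion sort: O(n^2) worst case, fast on small or nearly sorted inputs."""
    result = list(items)
    for i in range(1, len(result)):
        key = result[i]
        j = i - 1
        # Shift larger elements one slot to the right, then insert the key.
        while j >= 0 and result[j] > key:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = key
    return result

def merge_sort(items):
    """Merge sort: stable, O(n log n), but uses extra memory for the merge step."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left, right = merge_sort(items[:mid]), merge_sort(items[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        # "<=" keeps equal elements in their original order (stability).
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

if __name__ == "__main__":
    data = [5, 2, 9, 1, 5, 6]
    print(insertion_sort(data))  # [1, 2, 5, 5, 6, 9]
    print(merge_sort(data))      # [1, 2, 5, 5, 6, 9]
Insertion sort shines on small or nearly sorted inputs, while merge sort trades extra memory for a stable O(n log n) guarantee, mirroring the trade-offs outlined in the considerations above.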
Staying ahead of the curve in today's quickly expanding digital landscape is more than a goal; it's a requirement. For architects, mastering real-time data integration has consequently become indispensable, and the reason is clear: modern businesses crave instantaneous insights, fluid user experiences, and the agility to adapt strategies on the fly. This is why Change Data Capture (CDC) has become increasingly important in the field of architecture. It allows for the continuous integration of data changes from and to various sources, ensuring that systems are always up to date. To get started, we'll explore the technologies that power CDC: Kafka and Debezium. Learning about their capabilities and the challenges they solve should clarify why CDC represents an effective approach to data integration. After that, we'll explore anti-patterns in data integration, architectural considerations, and trade-offs. Exploring the Foundations of CDC: Kafka and Debezium Apache Kafka is a distributed event streaming technology designed to handle massive amounts of data, making it perfect for real-time analytics and monitoring. Debezium, on the other hand, is an open-source platform for change data capture (CDC). It can capture and stream all database changes in real time, eliminating the need for batch processing and enabling services to react immediately to data changes. This synergy between Kafka and Debezium eliminates batch-processing bottlenecks, enabling solutions to respond instantly to fast-moving data changes. To make this data integration strategy concrete, picture the scenario of an e-commerce platform. In this case, Debezium could be used to capture changes in the inventory database and stream them to Kafka topics. The data can then be consumed by various systems, such as a real-time inventory management application, a recommendation engine, and a dashboard for monitoring stock levels. A Closer Look at Debezium Debezium's greatest value comes from its ability to tap into a database's transaction logs, capture changes as they occur, and stream them as events in real time. To see what to expect, here are examples of the events Debezium would emit for insert, update, and delete operations on a tracked database. Insert: JSON { "before": null, "after": { "id": 1, "name": "John", "age": 25 }, "source": { "table": "users", "schema": "public", "database": "mydatabase" }, "op": "c" } Update: JSON { "before": { "id": 1, "name": "John", "age": 25 }, "after": { "id": 1, "name": "John Doe", "age": 30 }, "source": { "table": "users", "schema": "public", "database": "mydatabase" }, "op": "u" } Delete: JSON { "before": { "id": 1, "name": "John Doe", "age": 30 }, "after": null, "source": { "table": "users", "schema": "public", "database": "mydatabase" }, "op": "d" } The examples above show that a different event is emitted for each create, update, and delete operation. Each holds information about the change to a specific record in the tracked database: the "before" field shows the state of the record before the change, the "after" field shows the state of the record after the change, and the "op" field indicates the type of operation performed. The "source" field provides details about the table, schema, and database where the change occurred.
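To illustrate how a downstream service might consume these change events, here is a minimal Python sketch using the kafka-python client. The topic name, broker address, consumer group, and handler functions are assumptions for this example; the envelope fields (before, after, op) follow the samples above, and depending on your converter settings the change may be nested under a "payload" key.
Python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Assumed topic and broker address; adjust to your Debezium/Kafka setup.
TOPIC = "mydatabase.public.users"
BOOTSTRAP = "localhost:9092"

consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BOOTSTRAP,
    group_id="inventory-sync",
    auto_offset_reset="earliest",
    # Tombstone messages (emitted after deletes) have a null value, so guard before decoding.
    value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
)

# Hypothetical handlers: update an index, refresh a cache, notify another service, etc.
def handle_create(record):
    print("created:", record)

def handle_update(before, after):
    print("updated:", before, "->", after)

def handle_delete(record):
    print("deleted:", record)

for message in consumer:
    if message.value is None:  # skip tombstones
        continue
    # With the JSON converter and schemas enabled, the change sits under "payload".
    event = message.value.get("payload", message.value)
    op = event.get("op")
    if op == "c":
        handle_create(event["after"])
    elif op == "u":
        handle_update(event["before"], event["after"])
    elif op == "d":
        handle_delete(event["before"])
In a real deployment you would also handle retries and route poison messages to a dead-letter topic, which ties into the error-handling capabilities discussed next.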
Debezium's extensive compatibility with multiple database vendors further expands its capabilities, making it a flexible and versatile technology. It works under the covers using Kafka Connect and two core connector types: Source: connectors that read data from a data source, transform it, and emit the records individually as events. Debezium is an example of a source connector. At the time of writing, Debezium offers connectors for MongoDB, MySQL, PostgreSQL, SQL Server, Oracle, Db2, and Cassandra, with connectors for Vitess, Spanner, and JDBC under preview. Sink: connectors that consume data from specified topics and send it to a target data store. Debezium also has an interesting new feature, currently in preview, that helps users create, maintain, and visualize connector configurations using a web UI. Additionally, Debezium has monitoring and error-handling capabilities to ensure that data is consistently and reliably streamed. The way it technically handles errors is by using a combination of messaging retry mechanisms and dead-letter queues (DLQs). When an error occurs during data streaming, Debezium retries the operation a certain number of times before forwarding it to a dead-letter queue. This ensures that no data is lost and provides a troubleshooting path for fixing the underlying issue. Architectural Considerations When deploying CDC, architects must be aware of the following: Data Transformation: Integration solutions, like Kafka Streams, can process and transform events published by Debezium, enabling data enrichment, filtering, or aggregation before publishing events to other systems in the architecture. Messaging System Requirements: You'll need a reliable and scalable messaging system to handle the high volume of data changes being captured and processed. The messaging system should be able to handle both real-time and batch-processing scenarios, as well as support different data formats and protocols. Distributed Storage: A distributed storage mechanism is essential to store and manage the captured change data, ensuring high availability and resilience. Consider Cloud Services: Cloud service providers offer integrated messaging and storage solutions tailored for CDC. This can help when specialized platform and infrastructure expertise is scarce, and it offers a simpler path to scalability and cost-efficiency. Common Errors and Anti-patterns in Data Management for Distributed Services The adoption of microservice architectures can introduce unique challenges regarding handling data in a distributed environment. Common pitfalls include: Shared Databases: Multiple microservices interacting with a single shared database can lead to tightly coupled services. CDC can help ensure each microservice maintains its own view of the data while preserving data encapsulation. Dual Writes: In distributed systems, a service may update data both in its own database and in external data stores, such as Elasticsearch. This is an issue because the writes are no longer covered by a single transaction, which can lead to data inconsistency across the system. Inconsistent Data Models: Data inconsistency may happen when different services use different data models for the same data entity. CDC can act as a gateway, capturing changes in the source data and distributing them to other services using a consistent model. Ignoring Eventual Consistency: Don't expect immediate data consistency across distributed services.
CDC helps address this gap by ensuring that all services eventually receive a consistent view of the data. No Proper Handling of Data Evolution: Without a data evolution plan, changes can become confusing, risky, and time-consuming. CDC solutions, particularly when integrated with platforms such as Kafka, can handle schema evolution elegantly. CDC Adoption Is on the Rise CDC initially gained traction as an alternative to batch data replication for populating data warehouses in Extract, Transform, and Load (ETL) procedures. With the increasing adoption of cloud-native architectures and the need for real-time analytics, the role of this integration pattern has never been more important. It's not only about data replication; it's about real-time data integration, ensuring services can access and provide the correct data at the right time. With the emergence of technologies like AI and machine learning, the need for real-time data has only grown. Architects must examine not only how to acquire and analyze data but also how to do so in a scalable, reliable, and cost-effective way. This is where CDC comes into play. It allows organizations to capture and replicate only the changes made to their data rather than transferring the entire dataset, significantly reducing the time and resources required for data integration and making it a more efficient and practical solution for handling large volumes of data. Wrapping It Up The importance of real-time data integration cannot be overstated in today's fast-paced and data-driven world. As organizations continue to embrace advanced technologies and rely on accurate data, architects bear the important responsibility of designing systems that can handle the increasing demands for data processing and analysis. By prioritizing scalability, reliability, and cost-effectiveness, architects can ensure their organizations have the tools to make informed decisions and stay competitive in their respective industries. Debezium and Kafka are two popular technologies for architecting distributed Java microservices. By incorporating these open-source technologies, architects can create solutions with seamless integration and processing of data across microservices, harnessing the power of real-time data integration and processing.