Modern API Management
When assessing prominent topics across DZone, and the software engineering space more broadly, it felt incomplete to conduct research on the larger impacts of data and the cloud without talking about such a crucial component of modern software architectures: APIs. Communication is key in an era when applications and data capabilities are growing increasingly complex. Therefore, we set our sights on investigating the emerging ways in which data that would otherwise be isolated can better integrate with and work alongside other app components and across systems.

For DZone's 2024 Modern API Management Trend Report, we focused our research specifically on APIs' growing influence across domains, prevalent paradigms and implementation techniques, security strategies, AI, and automation. Alongside observations from our original research, practicing tech professionals from the DZone Community contributed articles addressing key topics in the API space, including automated API generation via no- and low-code tools; communication architecture design among systems, APIs, and microservices; GraphQL vs. REST; and the role of APIs in the modern cloud-native landscape.
Logging is essential for any software system. Using logs, you can troubleshoot a wide range of issues: an application bug, a security defect, system slowness, and more. In this article, we will discuss how to use Python logging effectively with custom attributes.

Python Logging

Before we delve in, let me briefly explain the basic Python logging module with an example.

```python
#!/opt/bb/bin/python3.7
import logging
import sys

root = logging.getLogger()
root.setLevel(logging.DEBUG)

std_out_logger = logging.StreamHandler(sys.stdout)
std_out_logger.setLevel(logging.INFO)
std_out_formatter = logging.Formatter("%(levelname)s - %(asctime)s %(message)s")
std_out_logger.setFormatter(std_out_formatter)
root.addHandler(std_out_logger)

logging.info("I love Dzone!")
```

The above example prints the following when executed:

```
INFO - 2024-03-09 19:49:07,734 I love Dzone!
```

In the example above, we create the root logger and the logging format for log messages. The call to logging.getLogger() returns the logger with the given name if it already exists and creates it otherwise; called with no arguments, it returns the root logger, and records logged to child loggers propagate up the hierarchy to their ancestors' handlers. We define our own StreamHandler to print the log messages to the console. Whenever we log messages, it is essential to log the basic attributes of the LogRecord; here the format includes the level name, the time as a string, and the actual message itself. The handler thus created is added to the root logger. We could use any of the pre-defined LogRecord attribute names in the format string. However, let's say you want to print an additional attribute like a contextId; a custom logging adapter comes to the rescue.

Logging Adapter

```python
class MyLoggingAdapter(logging.LoggerAdapter):
    def __init__(self, logger):
        logging.LoggerAdapter.__init__(self, logger=logger, extra={})

    # Pass the message and kwargs through unchanged; the adapter only exists
    # so that every log call goes through a single, wrappable entry point.
    def process(self, msg, kwargs):
        return msg, kwargs
```

We create our own version of a LoggerAdapter and pass the "extra" parameters as a dictionary for the formatter.

ContextId Filter

```python
import contextvars
import uuid  # needed for uuid.uuid4() below


class ContextIdFilter(logging.Filter):
    context_id = contextvars.ContextVar('context_id', default='')

    def filter(self, record):
        # Add a new UUID to the context if one has not been set yet.
        req_id = str(uuid.uuid4())
        if not self.context_id.get():
            self.context_id.set(req_id)
        record.context_id = self.context_id.get()
        return True
```

We create our own filter that extends logging.Filter, whose filter() method returns True if the specified log record should be logged. We simply add our parameter to the log record and always return True, thus attaching our unique id to every record. In the example above, a unique id is generated for every new context; for an existing context, we return the contextId already stored in the ContextVar.

Custom Logger

```python
import logging
import sys  # needed for sys.stdout

root = logging.getLogger()
root.setLevel(logging.DEBUG)

std_out_logger = logging.StreamHandler(sys.stdout)
std_out_logger.setLevel(logging.INFO)
std_out_formatter = logging.Formatter("%(levelname)s - %(asctime)s ContextId:%(context_id)s %(message)s")
std_out_logger.setFormatter(std_out_formatter)
root.addHandler(std_out_logger)
root.addFilter(ContextIdFilter())

adapter = MyLoggingAdapter(root)
adapter.info("I love Dzone!")
adapter.info("this is my custom logger")
adapter.info("Exiting the application")
```

Now let's put it all together in our logger file. We add the ContextIdFilter to the root logger, and note that we use our own adapter in place of the logging module wherever we need to log a message.
Running the code above prints the following messages:

```
INFO - 2024-04-20 23:54:59,839 ContextId:c10af4e9-6ea4-4cdf-9743-ea24d0febab6 I love Dzone!
INFO - 2024-04-20 23:54:59,842 ContextId:c10af4e9-6ea4-4cdf-9743-ea24d0febab6 this is my custom logger
INFO - 2024-04-20 23:54:59,843 ContextId:c10af4e9-6ea4-4cdf-9743-ea24d0febab6 Exiting the application
```

Note that setting a logger's propagate attribute to False stops events logged to it from being passed to the handlers of higher-level (ancestor) loggers.

Conclusion

Python logging does not provide a simple built-in option to stamp every log record with custom parameters. Instead, we create a thin wrapper (an adapter plus a filter) on top of the Python root logger and print our custom parameters. This is helpful when debugging request-specific issues.
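To show how the ContextVar behaves across requests, here is a minimal usage sketch built on the adapter and filter defined above; handle_request is a hypothetical entry point, not part of the original example.

```python
def handle_request(request_name: str) -> None:
    # Hypothetical per-request entry point: clear any id left over from a
    # previous request so the filter generates a fresh UUID on the next log call.
    ContextIdFilter.context_id.set('')
    adapter.info("start handling %s", request_name)
    adapter.info("finished handling %s", request_name)


handle_request("request-1")  # both lines share one ContextId
handle_request("request-2")  # a new ContextId is generated here
```

In an async or threaded server, each task or thread sees its own ContextVar value, so concurrent requests will not overwrite each other's contextId.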
The monolithic architecture was the standard for a long time, and for a long time, it worked. Unfortunately, monolithic architectures use fewer, larger parts, which makes them more likely to fail in their entirety when a single part fails. Often, these applications ran as a single process, which only exacerbated the issue. Microservices solve these specific issues by having each microservice run as a separate process. If one cog goes down, it doesn't necessarily mean the whole machine stops running. Plus, diagnosing and fixing defects in smaller, highly cohesive services is often easier than in larger monolithic ones.

Microservices design patterns provide tried-and-true fundamental building blocks that can help write code for microservices. By utilizing patterns during the development process, you save time and ensure a higher level of accuracy versus writing code for your microservices app from scratch. In this article, we cover a comprehensive overview of the microservices design patterns you need to know, as well as when to apply them.

Key Benefits of Using Microservices Design Patterns

Microservices design patterns offer several key benefits, including:

- Scalability: Microservices allow applications to be broken down into smaller, independent services, each responsible for a specific function or feature. This modular architecture enables individual services to be scaled independently based on demand, improving overall system scalability and resource utilization.
- Flexibility and agility: Microservices promote flexibility and agility by decoupling different parts of the application. Each service can be developed, deployed, and updated independently, allowing teams to work autonomously and release new features more frequently. This flexibility enables faster time-to-market and easier adaptation to changing business requirements.
- Resilience and fault isolation: Microservices improve system resilience and fault isolation by isolating failures to specific services. If one service experiences an issue or failure, it does not necessarily impact the entire application. This isolation minimizes downtime and improves system reliability, ensuring that the application remains available and responsive.
- Technology diversity: Microservices enable technology diversity by allowing each service to be built using the most suitable technology stack for its specific requirements. This flexibility enables teams to choose the right tools and technologies for each service, optimizing performance, development speed, and maintenance.
- Improved development and deployment processes: Microservices streamline development and deployment processes by breaking down complex applications into smaller, manageable components. This modular architecture simplifies testing, debugging, and maintenance tasks, making it easier for development teams to collaborate and iterate on software updates.
- Scalability and cost efficiency: Microservices enable organizations to scale their applications more efficiently by allocating resources only to the services that require them. This granular approach to resource allocation helps optimize costs and ensures that resources are used effectively, especially in cloud environments where resources are billed based on usage.
- Enhanced fault tolerance: Microservices architecture allows for better fault tolerance, as services can be designed to gracefully degrade or fail independently without impacting the overall system.
This ensures that critical functionalities remain available even in the event of failures or disruptions.
- Easier maintenance and updates: Microservices simplify maintenance and updates by allowing changes to be made to individual services without affecting the entire application. This reduces the risk of unintended side effects and makes it easier to roll back changes if necessary, improving overall system stability and reliability.

Now let's look at the different microservices design patterns.

Database per Service Pattern

The database is one of the most important components of microservices architecture, but it isn't uncommon for developers to overlook the database per service pattern when building their services. Database organization will affect the efficiency and complexity of the application. The most common options that a developer can use when determining the organizational architecture of an application are:

Dedicated Database for Each Service

A database dedicated to one service can't be accessed by other services, which makes each service much easier to scale and to understand from an end-to-end business perspective. Picture a scenario where your databases have different needs or access requirements. The data owned by one service may be largely relational, while a second service might be better served by a NoSQL solution and a third service may require a vector database. In this scenario, using a dedicated database for each service could help you manage them more easily. This structure also reduces coupling, as one service can't tie itself to the tables of another. Services are forced to communicate via published interfaces. The downside is that dedicated databases require a failure protection mechanism for events where communication fails.

Single Database Shared by All Services

A single shared database isn't the standard for microservices architecture but bears mentioning as an alternative nonetheless. Here, the issue is that microservices using a single shared database lose many of the key benefits developers rely on, including scalability, robustness, and independence. Still, sharing a physical database may be appropriate in some situations. When a single database is shared by all services, though, it's very important to enforce logical boundaries within it. For example, each service should own its own schema, and read/write access should be restricted to ensure that services can't poke around where they don't belong.

Saga Pattern

A saga is a series of local transactions. In microservices applications, a saga pattern can help maintain data consistency during distributed transactions. The saga pattern is an alternative to other design patterns that allow for multiple transactions by giving rollback opportunities. A common scenario is an e-commerce application that allows customers to purchase products using credit. Data may be stored in two different databases: one for orders and one for customers. The purchase amount can't exceed the credit limit. To implement the saga pattern, developers can choose between two common approaches.

1. Choreography

Using the choreography approach, a service will perform a transaction and then publish an event. In some instances, other services will respond to those published events and perform tasks according to their coded instructions. These secondary tasks may or may not also publish events, according to presets.
In the example above, you could use a choreography approach so that each local e-commerce transaction publishes an event that triggers a local transaction in the credit service.

2. Orchestration

An orchestration approach will perform transactions and publish events using an object to orchestrate the events, triggering other services to respond by completing their tasks. The orchestrator tells the participants what local transactions to execute. Saga is a complex design pattern that requires a high level of skill to successfully implement. However, the benefit of proper implementation is maintained data consistency across multiple services without tight coupling.

API Gateway Pattern

For large applications with multiple clients, implementing an API gateway pattern is a compelling option. One of the largest benefits is that it insulates the client from needing to know how services have been partitioned. However, different teams will value the API gateway pattern for different reasons. One of these possible reasons is that it grants a single entry point for a group of microservices by working as a reverse proxy between client apps and the services. Another is that clients don't need to know how services are partitioned, and service boundaries can evolve independently since the client knows nothing about them. The client also doesn't need to know how to find or communicate with a multitude of ever-changing services. You can also create a gateway for specific types of clients (for example, backends for frontends), which improves ergonomics and reduces the number of round trips needed to fetch data. Plus, an API gateway pattern can take care of crucial tasks like authentication, SSL termination, and caching, which makes your app more secure and user-friendly.

Before moving on to the next pattern, there's one more benefit to cover: security. The primary way the pattern improves security is by reducing the attack surface area. By providing a single entry point, the API endpoints aren't directly exposed to clients, and authorization and SSL can be efficiently implemented. Developers can use this design pattern to decouple internal microservices from client apps so a partially failed request can still be served. This ensures a whole request won't fail because a single microservice is unresponsive. To do this, the API gateway uses its cache to provide an empty response or return a valid error code.

Circuit Breaker Design Pattern

This pattern is usually applied between services that are communicating synchronously. A developer might decide to utilize the circuit breaker when a service is exhibiting high latency or is completely unresponsive. The utility here is that failure across multiple systems is prevented when a single microservice is unresponsive. Therefore, calls won't pile up and consume system resources, which could cause significant delays within the app or even a string of service failures. Implementing this pattern requires an object that is called to monitor failure conditions. When a failure condition is detected, the circuit breaker will trip. Once this has happened, all calls to the circuit breaker will result in an error and be directed to a different service. Alternatively, calls can result in a default error message being retrieved.
There are three states of the circuit breaker that developers should be aware of:

- Open: A circuit breaker is open when the number of failures has exceeded the threshold. When in this state, the microservice returns errors for calls without executing the desired function.
- Closed: When a circuit breaker is closed, it's in the default state and all calls are responded to normally. This is the ideal state developers want a circuit breaker to remain in — in a perfect world, of course.
- Half-open: When a circuit breaker is checking for underlying problems, it remains in a half-open state. Some calls may be responded to normally, but some may not be, depending on why the circuit breaker switched to this state initially.

Command Query Responsibility Segregation (CQRS)

A developer might use a command query responsibility segregation (CQRS) design pattern if they want a solution to traditional database issues like data contention risk. CQRS can also be used when app performance and security requirements are complex and objects are exposed to both reading and writing transactions. The idea is to separate the operations that change the state of an entity (commands) from those that return results (queries). Multiple views can be provided for query purposes, and the read side of the system can be optimized separately from the write side. This separation reduces complexity by splitting the model into commands and queries so that:

- The write side of the model handles persistence events and acts as a data source for the read side
- The read side of the model generates projections of the data, which are highly denormalized views

Asynchronous Messaging

If a service doesn't need to wait for a response and can continue running its code after a failure, asynchronous messaging can be used. Using this design pattern, microservices can communicate in a way that's fast and responsive. Sometimes this pattern is referred to as event-driven communication. To achieve the fastest, most responsive app, developers can use a message queue to maximize efficiency while minimizing response delays. This pattern can help connect multiple microservices without creating dependencies or tightly coupling them. While there are tradeoffs one makes with async communication (such as eventual consistency), it's still a flexible, scalable approach to designing a microservices architecture.

Event Sourcing

The event sourcing design pattern is used in microservices when a developer wants to capture all changes in an entity's state. Using an event store like Kafka or an alternative helps keep track of event changes and can even function as a message broker. A message broker helps with communication between different microservices, monitoring messages and ensuring communication is reliable and stable. To facilitate this function, the event sourcing pattern stores a series of state-changing events and can reconstruct the current state by replaying the events for an entity. Using event sourcing is a viable option in microservices when transactions are critical to the application. This also works well when changes to the existing data layer codebase need to be avoided.

Strangler-Fig Pattern

Developers mostly use the strangler-fig design pattern to incrementally transform a monolith application into microservices. This is accomplished by replacing old functionality with a new service — and, consequently, this is how the pattern receives its name.
Once the new service is ready to be executed, the old service is “strangled” so the new one can take over. To accomplish this successful transfer from monolith to microservices, a facade interface is used by developers that allows them to expose individual services and functions. The targeted functions are broken free from the monolith so they can be “strangled” and replaced. Utilizing Design Patterns To Make Organization More Manageable Setting up the proper architecture and process tooling will help you create a successful microservice workflow. Use the design patterns described above and learn more about microservices in my blog to create a robust, functional app.
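As a closing illustration, here is a minimal, self-contained Python sketch of the circuit breaker states described above; the class, threshold, and timeout values are illustrative rather than taken from any particular library.

```python
import time


class CircuitBreaker:
    """Minimal illustration of the closed / open / half-open states."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold  # failures allowed before opening
        self.reset_timeout = reset_timeout          # seconds before trying again
        self.failure_count = 0
        self.opened_at = None                       # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of piling calls onto a sick service.
                raise RuntimeError("circuit open - failing fast")
            # Half-open: the timeout has elapsed, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.time()        # trip the breaker (open state)
            raise
        # Success: close the breaker and reset the failure count.
        self.failure_count = 0
        self.opened_at = None
        return result
```

In production you would typically rely on an established resilience library or a service mesh rather than hand-rolling this logic, but the state transitions are the same.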
A typical machine learning (ML) workflow involves processes such as data extraction, data preprocessing, feature engineering, model training and evaluation, and model deployment. As data changes over time, when you deploy models to production, you want your model to learn continually from the stream of data. This means supporting the model’s ability to autonomously learn and adapt in production as new data is added. In practice, data scientists often work with Jupyter Notebooks for development work and find it hard to translate from notebooks to automated pipelines. To achieve the two main functions of an ML service in production, namely retraining (retrain the model on newer labeled data) and inference (use the trained model to get predictions), you might primarily use the following: Amazon SageMaker: A fully managed service that provides developers and data scientists the ability to build, train, and deploy ML models quickly AWS Glue: A fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data In this post, we demonstrate how to orchestrate an ML training pipeline using AWS Glue workflows and train and deploy the models using Amazon SageMaker. For this use case, you use AWS Glue workflows to build an end-to-end ML training pipeline that covers data extraction, data processing, training, and deploying models to Amazon SageMaker endpoints. Use Case For this use case, we use the DBpedia Ontology classification dataset to build a model that performs multi-class classification. We trained the model using the BlazingText algorithm, which is a built-in Amazon SageMaker algorithm that can classify unstructured text data into multiple classes. This post doesn’t go into the details of the model but demonstrates a way to build an ML pipeline that builds and deploys any ML model. Solution Overview The following diagram summarizes the approach for the retraining pipeline. The workflow contains the following elements: AWS Glue crawler: You can use a crawler to populate the Data Catalog with tables. This is the primary method used by most AWS Glue users. A crawler can crawl multiple data stores in a single run. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. ETL jobs that you define in AWS Glue use these Data Catalog tables as sources and targets. AWS Glue triggers: Triggers are Data Catalog objects that you can use to either manually or automatically start one or more crawlers or ETL jobs. You can design a chain of dependent jobs and crawlers by using triggers. AWS Glue job: An AWS Glue job encapsulates a script that connects source data, processes it, and writes it to a target location. AWS Glue workflow: An AWS Glue workflow can chain together AWS Glue jobs, data crawlers, and triggers, and build dependencies between the components. When the workflow is triggered, it follows the chain of operations as described in the preceding image. The workflow begins by downloading the training data from Amazon Simple Storage Service (Amazon S3), followed by running data preprocessing steps and dividing the data into train, test, and validate sets in AWS Glue jobs. The training job runs on a Python shell running in AWS Glue jobs, which starts a training job in Amazon SageMaker based on a set of hyperparameters. When the training job is complete, an endpoint is created, which is hosted on Amazon SageMaker. This job in AWS Glue takes a few minutes to complete because it makes sure that the endpoint is in InService status. 
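The post does not reproduce the training script here, but roughly speaking, the Glue Python shell job described above has to start a SageMaker training job, create an endpoint from the resulting model, and wait for it to reach InService. A simplified sketch of that flow with boto3 might look like the following; the job name, role ARN, bucket, and hyperparameters are placeholders, and the actual script in the repository will differ.

```python
import boto3

sm = boto3.client("sagemaker")
job_name = "dev-blazingtext-training"  # illustrative name
role_arn = "arn:aws:iam::123456789012:role/GlueSageMakerRole"  # placeholder
image = "433757028032.dkr.ecr.us-west-2.amazonaws.com/blazingtext:latest"
bucket = "s3://my-ml-pipeline-bucket"  # placeholder

# 1. Start the training job and block until it finishes.
sm.create_training_job(
    TrainingJobName=job_name,
    AlgorithmSpecification={"TrainingImage": image, "TrainingInputMode": "File"},
    RoleArn=role_arn,
    InputDataConfig=[{
        "ChannelName": "train",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix", "S3Uri": f"{bucket}/train/"}},
    }],
    OutputDataConfig={"S3OutputPath": f"{bucket}/output/"},
    ResourceConfig={"InstanceType": "ml.c5.xlarge", "InstanceCount": 1, "VolumeSizeInGB": 30},
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
    HyperParameters={"mode": "supervised", "epochs": "10"},  # illustrative
)
sm.get_waiter("training_job_completed_or_stopped").wait(TrainingJobName=job_name)

# 2. Create a model, an endpoint config, and an endpoint from the training output.
model_data = sm.describe_training_job(TrainingJobName=job_name)["ModelArtifacts"]["S3ModelArtifacts"]
sm.create_model(ModelName=job_name, ExecutionRoleArn=role_arn,
                PrimaryContainer={"Image": image, "ModelDataUrl": model_data})
sm.create_endpoint_config(EndpointConfigName=job_name, ProductionVariants=[{
    "VariantName": "AllTraffic", "ModelName": job_name,
    "InstanceType": "ml.m5.large", "InitialInstanceCount": 1}])
sm.create_endpoint(EndpointName=job_name, EndpointConfigName=job_name)

# 3. The Glue job only exits once the endpoint is actually InService.
sm.get_waiter("endpoint_in_service").wait(EndpointName=job_name)
```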
At the end of the workflow, a message is sent to an Amazon Simple Queue Service (Amazon SQS) queue, which you can use to integrate with the rest of the application. You can also use the queue to trigger an action to send emails to data scientists that signal the completion of training, add records to management or log tables, and more.

Setting up the Environment

To set up the environment, complete the following steps:

1. Configure the AWS Command Line Interface (AWS CLI) and a profile to use to run the code. For instructions, see Configuring the AWS CLI.
2. Make sure you have the Unix utility wget installed on your machine to download the DBpedia dataset from the internet.
3. Download the following code into your local directory.

Organization of Code

The code to build the pipeline has the following directory structure:

```
--Glue workflow orchestration
  --glue_scripts
    --DataExtractionJob.py
    --DataProcessingJob.py
    --MessagingQueueJob.py
    --TrainingJob.py
  --base_resources.template
  --deploy.sh
  --glue_resources.template
```

The code directory is divided into three parts:

- AWS CloudFormation templates: The directory has two AWS CloudFormation templates: glue_resources.template and base_resources.template. The glue_resources.template template creates the AWS Glue workflow-related resources, and base_resources.template creates the Amazon S3, AWS Identity and Access Management (IAM), and SQS queue resources. The CloudFormation templates create the resources and write their names and ARNs to AWS Systems Manager Parameter Store, which allows easy and secure access to the ARNs further along in the workflow.
- AWS Glue scripts: The folder glue_scripts holds the scripts that correspond to each AWS Glue job. This includes the ETL as well as the model training and deployment scripts. The scripts are copied to the correct S3 bucket when the bash script runs.
- Bash script: A wrapper script, deploy.sh, is the entry point to running the pipeline. It runs the CloudFormation templates and creates resources in the dev, test, and prod environments. You use the environment name, also referred to as stage in the script, as a prefix to the resource names. The bash script performs other tasks, such as downloading the training data and copying the scripts to their respective S3 buckets. However, in a real-world use case, you can extract the training data from databases as a part of the workflow using crawlers.

Implementing the Solution

Complete the following steps:

1. Go to the deploy.sh file and replace algorithm_image name with <ecr_path> based on your Region. The following code example is a path for Region us-west-2:

```shell
algorithm_image="433757028032.dkr.ecr.us-west-2.amazonaws.com/blazingtext:latest"
```

For more information about BlazingText parameters, see Common parameters for built-in algorithms.

2. Enter the following code in your terminal:

```shell
sh deploy.sh -s dev AWS_PROFILE=your_profile_name
```

This step sets up the infrastructure of the pipeline.

3. On the AWS CloudFormation console, check that the templates have the status CREATE_COMPLETE.
4. On the AWS Glue console, manually start the pipeline. In a production scenario, you can trigger this through a UI or automate it by scheduling the workflow to run at the prescribed time. The workflow provides a visual of the chain of operations and the dependencies between the jobs.
5. To begin the workflow, in the Workflow section, select DevMLWorkflow.
6. From the Actions drop-down menu, choose Run.
7. View the progress of your workflow on the History tab and select the latest RUN ID.
The workflow takes approximately 30 minutes to complete. The following screenshot shows the view of the workflow post-completion.

After the workflow is successful, open the Amazon SageMaker console. Under Inference, choose Endpoint. The following screenshot shows that the endpoint deployed by the workflow is ready. Amazon SageMaker also provides details about the model metrics calculated on the validation set in the training job window. You can further enhance model evaluation by invoking the endpoint using a test set and calculating the metrics as necessary for the application.

Cleaning Up

Make sure to delete the Amazon SageMaker hosting services: endpoints, endpoint configurations, and model artifacts. Delete both CloudFormation stacks to roll back all other resources. See the following code:

```python
def delete_resources(self):
    # `sagemaker` is assumed to be a boto3 SageMaker client created elsewhere.
    endpoint_name = self.endpoint
    try:
        sagemaker.delete_endpoint(EndpointName=endpoint_name)
        print("Deleted Test Endpoint ", endpoint_name)
    except Exception as e:
        print('Model endpoint deletion failed')
    try:
        sagemaker.delete_endpoint_config(EndpointConfigName=endpoint_name)
        print("Deleted Test Endpoint Configuration ", endpoint_name)
    except Exception as e:
        print('Endpoint config deletion failed')
    try:
        sagemaker.delete_model(ModelName=endpoint_name)
        print("Deleted Test Endpoint Model ", endpoint_name)
    except Exception as e:
        print('Model deletion failed')
```

This post describes a way to build an automated ML pipeline that not only trains and deploys ML models using a managed service such as Amazon SageMaker, but also performs ETL within a managed service such as AWS Glue. A managed service unburdens you from allocating and managing resources, such as Spark clusters, and makes it easy to move from notebook setups to production pipelines.
If your system is facing an imminent security threat, or worse, you've just suffered a breach, then logs are your go-to. If you're a security engineer working closely with developers and the DevOps team, you already know that you depend on logs for threat investigation and incident response. Logs offer a detailed account of system activities. Analyzing those logs helps you fortify your digital defenses against emerging risks before they escalate into full-blown incidents. At the same time, your logs are your digital footprints, vital for compliance and auditing.

Your logs contain a massive amount of data about your systems (and hence your security), and that leads to some serious questions:

- How do you handle the complexity of standardizing and analyzing such large volumes of data?
- How do you get the most out of your log data so that you can strengthen your security?
- How do you know what to log? How much is too much?

Recently, I've been trying to use tools and services to get a handle on my logs. In this post, I'll look at some best practices for using these tools and how they can help with security and identifying threats. And finally, I'll look at how artificial intelligence may play a role in your log analysis.

How To Identify Security Threats Through Logs

Logs are essential for the early identification of security threats. Here's how:

Identifying and Mitigating Threats

Logs are a gold mine of streaming, real-time analytics and crucial information that your team can use to its advantage. With dashboards, visualizations, metrics, and alerts set up to monitor your logs, you can effectively identify and mitigate threats. In practice, I've used both Sumo Logic and the ELK stack (a combination of Elasticsearch, Kibana, Beats, and Logstash). These tools can help your security practice by allowing you to:

- Establish a baseline of behavior and quickly identify anomalies in service or application behavior. Look for things like unusual access times, spikes in data access, or logins from unexpected areas of the world.
- Monitor access to your systems for unexpected connections. Watch for frequent and unusual access to critical resources.
- Watch for unusual outbound traffic that might signal data exfiltration.
- Watch for specific types of attacks, such as SQL injection or DDoS. For example, I monitor how rate-limiting deals with a burst of requests from the same device or IP using Sumo Logic's Cloud Infrastructure Security.
- Watch for changes to highly critical files. Is someone tampering with config files?
- Create and monitor audit trails of user activity. This forensic information can help you to trace what happened with suspicious, or malicious, activities.
- Closely monitor authentication/authorization logs for frequent failed attempts.
- Cross-reference logs to watch for complex, cross-system attacks, such as supply chain attacks or man-in-the-middle (MiTM) attacks.

[Figure: Using a Sumo Logic dashboard of logs, metrics, and traces to track down security threats]

It's also best practice to set up alerts to see issues early, giving you the lead time needed to deal with any threat. The best tools are also infrastructure agnostic and can be run on any number of hosting environments.
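As one concrete example from the list above, the failed-authentication check can be prototyped in a few lines of Python before it graduates to a proper log analytics tool. The log format, file path, and threshold below are illustrative assumptions.

```python
import re
from collections import Counter

# Illustrative pattern for an SSH-style "Failed password" line; adapt to your own log format.
FAILED_LOGIN = re.compile(r"Failed password for .* from (?P<ip>\d+\.\d+\.\d+\.\d+)")
THRESHOLD = 10  # failed attempts considered suspicious; tune for your environment


def suspicious_ips(log_lines):
    """Count failed logins per source IP and return the ones above the threshold."""
    failures = Counter()
    for line in log_lines:
        match = FAILED_LOGIN.search(line)
        if match:
            failures[match.group("ip")] += 1
    return {ip: count for ip, count in failures.items() if count >= THRESHOLD}


# Example path only; point this at your own authentication log.
with open("/var/log/auth.log") as log_file:
    print(suspicious_ips(log_file))
```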
Insights for Future Security Measures

Logs help you with more than just looking into the past to figure out what happened. They also help you prepare for the future. Insights from log data can help your team craft its security strategies for the future:

- Benchmark your logs against your industry to help identify gaps that may cause issues in the future.
- Hunt through your logs for signs of subtle IOCs (indicators of compromise).
- Identify rules and behaviors that you can use against your logs to respond in real time to any new threats.
- Use predictive modeling to anticipate future attack vectors based on current trends.
- Detect outliers in your datasets to surface suspicious activities.

What to Log... and How Much to Log

So we know we need to use logs to identify threats both present and future. But to be the most effective, what should we log? The short answer is: everything! You want to capture everything you can, all the time. When you're first getting started, it may be tempting to try to triage logs, guessing as to what is important to keep and what isn't. But logging all events as they happen and putting them in the right repository for analysis later is often your best bet. In terms of log data, more is almost always better. But of course, this presents challenges.

Who's Going To Pay for All These Logs?

When you retain all those logs, it can be very expensive. And it's stressful to think about how much money it will cost to store all of this data when you just throw it in an S3 bucket for review later. For example, on AWS, a daily log data ingest of 100 GB/day with the ELK stack could create an annual cost of hundreds of thousands of dollars. This often leads to developers "self-selecting" what they think is, and isn't, important to log.

Your first option is to be smart and proactive in managing your logs. This can work for tools such as the ELK stack, as long as you follow some basic rules:

- Prioritize logs by classification: Figure out which logs are the most important, classify them as such, and then be more verbose with those logs.
- Rotate logs: Figure out how long you typically need logs and then rotate them off servers. You probably only need debug logs for a matter of weeks, but access logs for much longer.
- Log sampling: Only log a sampling of high-volume services. For example, log just a percentage of access requests but log all error messages (a short sketch of this idea follows below).
- Filter logs: Pre-process all logs to remove unnecessary information, condensing their size before storing them.
- Alert-based logging: Configure alerts based on triggers or events that subsequently turn logging on or make your logging more verbose.
- Use tier-based storage: Store more recent logs on faster, more expensive storage. Move older logs to cheaper, slower storage. For example, you can archive old logs to Amazon S3.

These are great steps, but unfortunately, they can involve a lot of work and a lot of guesswork. You often don't know what you need from the logs until after the fact. A second option is to use a tool or service that offers flat-rate pricing; for example, Sumo Logic's $0 ingest. With this type of service, you can stream all of your logs without worrying about overwhelming ingest costs. Instead of a per-GB-ingested type of billing, this plan bills based on the valuable analytics and insights you derive from that data. You can log everything and pay just for what you need to get out of your logs. In other words, you are free to log it all!
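Here is a minimal sketch of that log-sampling rule using Python's standard logging module; the SamplingFilter class and the 10% rate are illustrative choices, not taken from any specific tool.

```python
import logging
import random


class SamplingFilter(logging.Filter):
    """Pass every WARNING-or-higher record, but only a sample of lower-level ones."""

    def __init__(self, sample_rate: float = 0.1):
        super().__init__()
        self.sample_rate = sample_rate

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True  # always keep warnings and errors
        return random.random() < self.sample_rate  # keep roughly 10% of the rest


logger = logging.getLogger("access")
logger.setLevel(logging.INFO)
handler = logging.StreamHandler()
handler.addFilter(SamplingFilter(sample_rate=0.1))
logger.addHandler(handler)
```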
Looking Forward: The Role of AI in Automating Log Analysis

The right tool or service, of course, can help you make sense of all this data. And the best of these tools work pretty well. The obvious new tool to help you make sense of all this data is AI. With data that is formatted predictably, we can apply classification algorithms and other machine learning techniques to find out exactly what we want to know about our application. AI can:

- Automate repetitive tasks like data cleaning and pre-processing
- Perform automated anomaly detection to alert on abnormal behaviors
- Automatically identify issues and anomalies faster and more consistently by learning from historical log data
- Identify complex patterns quickly
- Use large amounts of historical data to more accurately predict future security breaches
- Reduce alert fatigue by reducing false positives and false negatives
- Use natural language processing (NLP) to parse and understand logs
- Quickly integrate and parse logs from multiple, disparate systems for a more holistic view of potential attack vectors

AI probably isn't coming for your job, but it will probably make your job a whole lot easier.

Conclusion

Log data is one of the most valuable and available means to ensure your applications' security and operations. It can help guard against both current and future attacks. And for log data to be of the most use, you should log as much information as you can. The last problem you want during a security crisis is to find out you didn't log the information you need.
As the volume of data increases exponentially and queries become more complex, relationships become a critical component for data analysis. In turn, specialized solutions such as graph databases that explicitly optimize for relationships are needed. Other databases aren’t designed to be able to search and query data based on the intricate relationships found in complex data structures. Graph databases are optimized to handle connected data by modeling the information into a graph, which maps data through nodes and relationships. With this article, readers will traverse a beginner’s guide to graph databases, their terminologies, and comparisons with relational databases. They will also explore graph databases from cloud providers like AWS Neptune to open-source solutions. Additionally, this article can help develop a better understanding of how graph databases are useful for applications such as social network analysis, fraud detection, and many other areas. Readers will also learn how graph databases are used for applications like knowledge graph databases and social media analytics. What Is a Graph Database? A graph database is a purpose-built NoSQL database specializing in data structured in complex network relationships, where entities and their relationships have interconnections. Data is modeled using graph structures, and the essential elements of this structure are nodes, which represent entities, and edges, which represent the relationships between entities. The nodes and edges of a graph can all have attributes. Critical Components of Graph Databases Nodes These are the primary data elements representing entities such as people, businesses, accounts, or any other item you might find in a database. Each node can store a set of key-value pairs as properties. Edges Edges are the lines that connect nodes, defining their relationships. In addition to nodes, edges can also have properties – such as weight, type, or strength – that clarify their relationship. Properties Nodes and edges can each have properties that can be used to store metadata about those objects. These can include names, dates, or any other relevant descriptive attributes to a node or edge. How Graph Databases Store and Process Data In a graph database, nodes and relationships are considered first-class citizens — in contrast to relational databases, nodes are stored in tabular forms, and relationships are computed at query time. This lets graph databases treat the data relationships as having as much value as the data, which enables faster traversal of connected data. With their traversal algorithms, graph databases can explore the relationships between nodes and edges to answer complicated queries like the shortest path, fraud detection, or network analysis. Various graph-specific query languages – Neo4j’s Cypher and Tinkerpop’s Gremlin – enable these operations by focusing on pattern matching and deep-link analytics. Practical Applications and Benefits Graph databases shine in any application where the relationships between the data points are essential, such as web and social networks, recommendation engines, and a whole host of other apps where it’s necessary to know how deep and wide the relationships go. In areas such as fraud detection and network security, it’s essential to adjust and adapt dynamically; this is something graph databases do very well. In conclusion, graph databases offer a solid infrastructure for working with complex, highly connected data. 
They offer many advantages over relational databases when it comes to modeling relationships and the interactions within the data.

Key Components and Terminology

Nodes and Their Properties

Nodes are the basic building blocks of a graph database. They typically represent some object or a specific instance, be it a person, place, or thing. For each node, we have a vertex in the graph structure. A node can also contain several properties. Each of these properties is a key-value pair, where the value expands on or further clarifies the object, and its content depends on the application of the graph database.

Edges: Defining Relationships

Edges, on the other hand, are the links that tie the nodes together. They are directional, so they have a start node and an end node (thus defining the flow between one node and another). These edges also define the nature of the relationship between the two nodes.

Labels: Organizing Nodes

Labels help group nodes that have similarities (Person nodes, Company nodes, etc.) so that graph databases can retrieve sets of nodes more quickly. For example, in a social network analysis, Person and Company nodes might be grouped using labels.

Relationships and Their Characteristics

Relationships connect nodes, but they also have properties, such as strength, status, or duration, that can define how the relationship might differ between nodes.

Graph Query Languages: Cypher and Gremlin

Graph databases require special-purpose query languages to work with their often complicated structure, and these languages differ from one graph database to another. Cypher, used with Neo4j, is a declarative, pattern-based language. Gremlin, used with other graph databases, is more procedural and can traverse more complex graph structures. Both languages are expressive and powerful, capable of queries that would be veritable nightmares written in the languages used with traditional databases.

Tools for Managing and Exploring Graph Data

Neo4j offers a suite of tools designed to enhance the usability of graph databases:

- Neo4j Bloom: Explore graph data visually without using a graph query language.
- Neo4j Browser: A web-based application for executing Cypher queries and visualizing the results.
- Neo4j Data Importer and Neo4j Desktop: Tools for importing data into a Neo4j database and managing Neo4j database instances, respectively.
- Neo4j Ops Manager: Useful for managing multiple Neo4j instances to ensure that large-scale deployments can be managed and optimized.
- Neo4j Graph Data Science: A library that extends Neo4j with capabilities more commonly associated with data science, enabling sophisticated analytical tasks to be performed directly on graph data.

Equipped with these fundamental components and tools, users can wield the power of graph databases to handle complex data and make knowledgeable decisions based on networked knowledge systems.

Comparing Graph Databases With Other Databases

While graph and relational databases are designed to store and help us make sense of data, they fundamentally differ in how they accomplish this. Graph databases are built on the foundation of nodes and edges, making them uniquely fitted for dealing with complex relationships between data points. That foundation's core is structure, representing connected entities through nodes and their relationships through edges.
Relational databases arrange data in ‘rows and columns’ – tables, whereas graph databases are ‘nodes and edges.’ This difference in structure makes such a direct comparison between the two kinds of databases compelling. Graph databases organize data in this way naturally, whereas it’s not as easy to represent relationships between certain types of data points in relational databases. After all, they were invented to deal with transactions (i.e., a series of swaps of ‘rows and columns’ between two sides, such as a payment or refund between a seller and a customer). Data Models and Scalability Graph databases store data in a graph with nodes, edges, and properties. They are instrumental in domains with complex relationships, such as social networks or recommendation engines. As an example of the opposite end of the spectrum, relational databases contain data in tables, which is well-suited for applications requiring high levels of data integrity (i.e., applications such as those involved in financial systems or managing customer relationships). Another benefit, for example, is their horizontal scalability: graph databases grow proportionally to their demands by adding more machines to a network instead of the vertical scalability (adding more oomph to an existing machine) typical for a relational database. Query Performance and Flexibility One reason is that graph databases are generally much faster at executing complex queries with deep relationships because they can traverse nodes and edges—unlike relational databases, which might have to perform lots of joins that could speed up or slow down depending on the size of the data set. In addition, graph databases excel in the ease with which the data model can be changed without severe consequences. As business requirements evolve and users learn more about how their data should interact, a graph database can be more readily adapted without costly redesigns. Though better suited for providing strong transactional guarantees or ACID compliance, relational databases are less adept at model adjustments. Use of Query Languages The different languages of query also reflect the distinct nature of these databases. Whereas graph databases tend to use a language tailored to the way a graph is traversed—such as Gremlin or Cypher—relational databases have long been managed and queried through SQL, a well-established language for structured data. Suitability for Different Data Types Relational databases are well suited for handling large datasets with a regular and relatively simple structure. In contrast, graph databases shine in environments where the structures are highly interconnected, and the relationships are as meaningful as the data. In conclusion, while graph and relational databases have pros and cons, which one to use depends on the application’s requirements. Graph databases are better for analyzing intricate and evolving relationships, which makes them ideal for modern applications that demand a detailed understanding of networked data. Advantages of Graph Databases Graph databases are renowned for their efficiency and flexibility, mainly when dealing with complex, interconnected data sets. Here are some of the key advantages they offer: High Performance and Real-Time Data Handling Performance is a huge advantage for graph databases. It comes from the ease, speed, and efficiency with which it can query linked data. Graph databases often beat relational databases at handling complex, connected data. 
They are well suited to continual, real-time updates and queries, unlike, e.g., Hadoop HDFS. Enhanced Data Integrity and Contextual Awareness Keeping these connections intact across channels and data formats, graph databases maintain rich data relationships and allow that data to be easily linked. This structure surfaces nuance in interactions humans could not otherwise discern, saving time and making the data more consumable. It gives users relevant insights to understand the data better and helps businesses make more informed decisions. Scalability and Flexibility Graph databases have been designed to scale well. They can accommodate the incessant expansion of the underlying data and the constant evolution of the data schema without downtime. They can also scale well in terms of the number of data sources they can link, and again, this linking can temporarily accommodate a continuous evolution of the schema without interrupting service. They are, therefore, particularly well-suited to environments in which rapid adaptation is essential. Advanced Query Capabilities These graphs-based systems can quickly run powerful recursive path queries to retrieve direct (‘one hop’) and indirect (‘two hops’ and ‘twenty hops’) connections, making running complex subgraph pattern-matching queries easy. Moreover, complex group-by-aggregate queries (such as Netflix’s tag aggregation) are also natively supported, allowing arbitrary degree flexibility in aggregating selective dimensions, such as in big-data setups with multiple dimensions, such as time series, demographics, or geographics. AI and Machine Learning Readiness The fact that graph databases naturally represent entities and inter-relations as a structured set of connections makes them especially well-suited for AI and machine-learning foundational infrastructures since they support fast real-time changes and rely on expressive, ergonomic declarative query languages that make deep-link traversal and scalability a simple matter – two features that are critical in the case of next-generation data analytics and inference. These advantages make graph databases a good fit for an organization that needs to manage and efficiently draw meaningful insights from dataset relationships. Everyday Use Cases for Graph Databases Graph databases are being used by more industries because they are particularly well-suited for handling complex connections between data and keeping the whole system fast. Let’s look at some of the most common uses for graph databases. Financial and Insurance Services The financial and insurance services sector increasingly uses graph databases to detect fraud and other risks; how these systems model business events and customer data as a graph allows them to detect fraud and suspicious links between various entities, and the technique of Entity Link Analysis takes this a step further, allowing the detection of potential fraud in the interactions between different kinds of entities. Infrastructure and Network Management Graph databases are well-suited for infrastructure mapping and keeping network inventories up to date. Serving up an interactive map of the network estate and performing network tracing algorithms to walk across the graph is straightforward. Likewise, it makes writing new algorithms to identify problematic dependencies, vulnerable bottlenecks, or higher-order latency issues much easier. 
Recommendation Systems Many companies – including major e-commerce giants like Amazon – use graph databases to power recommendation engines. These keep track of which products and services you’ve purchased and browsed in the past to suggest things you might like, improving the customer experience and engagement. Social Networking Platforms Social networks such as Facebook, Twitter, and LinkedIn all use graph databases to manage and query huge amounts of relational data concerning people, their relationships, and interactions. This makes them very good at quickly navigating across vast social networks, finding influential users, detecting communities, and identifying key players. Knowledge Graphs in Healthcare Healthcare organizations assemble critical knowledge about patient profiles, past ailments, and treatments in knowledge graphs, while graph queries implemented on graph databases identify patient patterns and trends. These can influence how treatments proceed positively and how patients fare. Complex Network Monitoring Graph databases are used to model and monitor complex network infrastructures, including telecommunications networks or end-to-end environments of clouds (data-center infrastructure including physical networking, storage, and virtualization). This application is undoubtedly crucial for the robustness and scalability of those systems and environments that form the essential backbone of the modern information infrastructure. Compliance and Governance Organizations also use graph databases to manage data related to compliance and governance, such as access controls, data retention policies, and audit trails, to ensure they can continue to meet high standards of data security and regulatory compliance. AI and Machine Learning Graph databases are also essential for developing artificial intelligence and machine learning applications. They allow developers to create standardized means of storing and querying data for applications such as natural language processing, computer vision, and advanced recommendation systems, which is essential for making AI applications more intelligent and responsive. Unraveling Financial Crimes Graphs provide a way to trace the structure of shell corporate entities that criminals use to launder money, studying whether the patterns of supplies to shell companies and cash flows from shell companies to other entities are suspicious. Such applications are helpful for law enforcement and regulatory agencies to unravel complex money laundering networks and fight against financial crime. Automotive Industry In the automotive industry, graph queries help analyze the relationships between tens of thousands of car parts, enabling real-time interactive analysis that has the potential to improve manufacturing and maintenance processes. Criminal Network Analysis In law enforcement, graph databases are used to identify criminal networks, address patterns, and identify critical links in criminal organizations to bring operations down efficiently from all sides. Data Lineage Tracking Graph technology can also track data lineage (the details of where an item of data, such as a fact or number, was created, how it was copied, and where it was used). This is important for auditing and verifying that data assets are not corrupted. This diverse array of applications underscores the versatility of graph databases and their utility in representing and managing complex, interconnected data across multiple diverse fields. 
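To make these use cases a bit more concrete, here is a small, illustrative sketch that runs a Cypher pattern of the kind used in fraud detection through the official Neo4j Python driver. The connection details, labels, and property names are invented for the example.

```python
from neo4j import GraphDatabase

# Connection details are placeholders for the example.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Find accounts that share a device with a flagged account: the kind of
# entity-link pattern used in the fraud detection scenarios described above.
query = """
MATCH (flagged:Account {status: 'flagged'})-[:USED]->(d:Device)<-[:USED]-(other:Account)
WHERE other <> flagged
RETURN other.id AS account, d.id AS shared_device
"""

with driver.session() as session:
    for record in session.run(query):
        print(record["account"], "shares device", record["shared_device"])

driver.close()
```

The same traversal expressed in SQL would require several self-joins; in Cypher it reads as a single graph pattern, which is the point the comparison sections above make.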
Challenges and Considerations

Graph databases are built around modeling the structures of a specific domain, in a process resembling knowledge or ontology engineering, and that practical challenge can require specialized "graph data engineers." These requirements raise real scalability concerns and can limit the technology's appeal beyond teams already committed to richly connected data. Inconsistency of data across the system remains a critical issue, since building systems that maintain data consistency while preserving flexibility and expressivity is challenging.

While graph queries don't require as much coding as SQL, paths for traversal across the data still have to be spelled out explicitly. This increases the effort needed to write queries and prevents graph queries from being as easily abstracted and reused as SQL code, impairing their generalization. Furthermore, because there isn't a unified standard for capabilities or query languages, developers invent their own, a further step in API fragmentation. Another significant issue is knowing which machine is the best place to put a given piece of data: with all the subtle relationships between nodes, that placement decision is crucial to performance but hard to make on the fly. Moreover, many existing graph database systems weren't architected for today's high data volumes, so they can end up being performance bottlenecks.

From a project management standpoint, failure to accurately capture and map business requirements to technical requirements often results in confusion and delay. Poor data quality, inadequate access to data sources, or verbose, time-consuming data modeling will magnify the pain of a graph data project. On the end-user side, asking people to learn new languages or skills in order to read graphs can deter adoption, while difficulty sharing those graphs or collaborating on the analysis will eventually lower the range and impact of the insights. Simplicity was a decisive early advantage for interfaces like Windows 95, and the same story applies to graph technologies today: adoption suffers when the analysis process is seen as too time-consuming.

From a technical perspective, managing large graphs, storing and querying complex structures, presents more significant challenges. For example, the data must be distributed across a cluster of multiple machines, adding another level of complexity for developers. Data is typically sharded (split) into smaller parts and stored on various machines, coordinated by an "intelligent" virtual server managing access control and queries across multiple shards.

Choosing the Right Graph Database

When selecting a graph database, it's crucial to consider the complexity of your queries and the interconnectedness of your data. A well-chosen graph database can significantly enhance the performance and scalability of data-driven applications.

Key Factors to Consider

- Native graph storage and processing: Opt for databases designed from the ground up to handle graph data structures.
- Property graphs and graph query languages: Ensure the database supports robust graph query languages and can handle property graphs efficiently.
- Data ingestion and integration capabilities: The ability to seamlessly integrate and ingest data from various sources is vital for dynamic data environments.
Development tools and graph visualization: Tools that facilitate development and allow intuitive graph visualizations improve usability and insights. Graph data science and analytics: Databases with advanced analytics and data science capabilities can provide deeper insights. Support for OLTP, OLAP, and HTAP: Depending on the application, support for transactional (OLTP), analytical (OLAP), and hybrid (HTAP) processing may be necessary. ACID compliance and system durability: Essential for ensuring data integrity and reliability in transaction-heavy environments. Scalability and performance: The database should scale vertically and horizontally to handle growing data loads. Enterprise security and privacy features: Robust security features are crucial to protect sensitive data and ensure privacy. Deployment flexibility: The database should match the organization’s deployment strategy, whether on-premises or cloud. Open-source foundation and community support: A strong community and open-source foundation can provide extensive support and flexibility. Business and technology partnerships: Partnerships can offer additional support and integration options, enhancing the database’s capabilities. Comparing Popular Graph Databases Dgraph: This is the most performant and scalable option for enterprise systems that need to handle massive amounts of fast-flowing data. Memgraph: An open-source, in-memory database with a query language specially designed for real-time data and analytics Neo4j: Offers a comprehensive graph data science library and is well-suited for static data storage and Java-oriented developers Each of these databases has its advantages: Memgraph is the strongest contender in the Python ecosystem (you can choose Python, C++, or Rust for your custom stored procedures), and Neo4j’s managed solution offers the most control over your deployment into the cloud (its AuraDB service provides a lot of power and flexibility). Community and Free Resources Memgraph has a free community edition and a paid enterprise edition, and Neo4j has a community "Labs" edition, a free enterprise trial, and hosting services. These are all great ways for developers to get their feet wet without investing upfront. In conclusion, choosing the proper graph database is contingent upon understanding both the realities of your project and the potential of the database you are selecting. If you bear this in mind, your organization will be able to use graph databases to their full potential to enhance its data infrastructure and insights. Conclusion Having navigated the expansive realm of graph databases, the hope is that you now know not only the basics of these databases, from nodes to edges and from vertex storage to indexing, but also their applications across industries, including finance, government, and healthcare. This guide comprehensively introduces graph databases, catering to both newcomers and seasoned practitioners in the database field. Every reader should now be prepared to take the next steps in understanding how graph databases work, how they compare against traditional and non-relational databases, and where they are utilized in the real world. We have seen that choosing a graph database requires careful consideration of the project’s requirements and features.
The considerations and challenges highlighted above underscore the importance of correct implementation and the potential of graph databases to change how we process and look at data. The complexity and power of graph databases allow us to surface new insights and compute more efficiently, and in this way, new data management and analysis methods may be developed.
With recent achievements in and attention to LLMs and the resultant Artificial Intelligence “Summer,” there has been a renaissance in model training methods aimed at getting to optimal, performant models as quickly as possible. Much of this has been achieved through brute scale — more chips, more data, more training steps. However, many teams have been focused on how we can train these models more efficiently and intelligently to achieve the desired results. Training LLMs typically includes the following phases: Pretraining: This initial phase lays the foundation, taking the model from a set of inert neurons to a basic language generator. While the model ingests vast amounts of data (e.g., the entire internet), the outputs at this stage are often nonsensical, though not entirely gibberish. Supervised Fine-Tuning (SFT): This phase elevates the model from its unintelligible state, enabling it to generate more coherent and useful outputs. SFT involves providing the model with specific examples of desired behavior and teaching it what is considered "helpful, useful, and sensible." Models can be deployed and used in production after this stage. Reinforcement Learning (RL): Taking the model from "working" to "good," RL goes beyond explicit instruction and allows the model to learn the implicit preferences and desires of users through labeled preference data. This enables developers to encourage desired behaviors without needing to explicitly define why those behaviors are preferred. In-context learning: Also known as prompt engineering, this technique allows users to directly influence model behavior at inference time. By employing methods like constraints and N-shot learning, users can fine-tune the model's output to suit specific needs and contexts. Note that this is not an exhaustive list; there are many other methods and phases that may be incorporated into idiosyncratic training pipelines. Introducing Reward and Reinforcement Learning Humans excel at pattern recognition, often learning and adapting without conscious effort. Our intellectual development can be seen as a continuous process of increasingly complex pattern recognition. A child learns not to jump in puddles after experiencing negative consequences, much like an LLM undergoing SFT. Similarly, a teenager observing social interactions learns to adapt their behavior based on positive and negative feedback – the essence of Reinforcement Learning. Reinforcement Learning in Practice: The Key Components Preference data: Reinforcement Learning in LLMs typically requires a prompt/input and multiple (often two) example outputs in order to demonstrate a ‘gradient’. This is intended to show that certain behaviors are preferred relative to others. As an example, in RLHF, human users may be presented with a prompt and two examples and asked to choose which they prefer, or in other methods, they may be presented with an output and asked to improve on it in some way (where the improved version will be captured as the ‘preferred’ option). Reward model: A reward model is trained directly on the preference data. For a set of responses to a given input, each response can be assigned a scalar value representing its ‘rank’ within the set (for binary examples, this can be 0 and 1). The reward model is then trained to predict these scalar values given a novel input and output pair. That is, the RM is able to reproduce or predict a user’s preference. Generator model: This is the final intended artifact.
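As a rough illustration of how a reward model can be fit to such preference data, here is a minimal PyTorch sketch; the tiny linear model, random feature vectors, and hyperparameters are placeholder assumptions, and real reward models are typically fine-tuned LLM backbones operating on tokenized prompt-response pairs:
Python
import torch
import torch.nn.functional as F

# Toy preference data: each row pairs the representation of a preferred response
# with the representation of a rejected response for the same prompt.
dim = 16
preferred = torch.randn(32, dim)
rejected = torch.randn(32, dim)

reward_model = torch.nn.Linear(dim, 1)  # maps a response representation to a scalar reward
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

for step in range(100):
    r_pref = reward_model(preferred)
    r_rej = reward_model(rejected)
    # Bradley-Terry style objective: maximize the probability that the preferred
    # response receives a higher reward than the rejected one.
    loss = -F.logsigmoid(r_pref - r_rej).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Once trained, the reward model assigns higher scalar rewards to responses that resemble those users preferred, which is exactly the signal the RL step then optimizes against.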
In simplified terms, during the Reinforcement Training Process, the Generator model generates an output, which is then scored by the Reward Model, and the resultant reward is fed back to the algorithm which decides how to mutate the Generator Model. For example, the algorithm will update the model to increase the odds of generating a given output when provided a positive reward and do the opposite in a negative reward scenario. In the LLM landscape, RLHF has been a dominant force. By gathering large volumes of human preference data, RLHF has enabled significant advancements in LLM performance. However, this approach is expensive, time-consuming, and susceptible to biases and vulnerabilities. This limitation has spurred the exploration of alternative methods for obtaining reward information at scale, paving the way for the emergence of RLAIF – a revolutionary approach poised to redefine the future of AI development. Understanding RLAIF: A Technical Overview of Scaling LLM Alignment With AI Feedback The core idea behind RLAIF is both simple and profound: if LLMs can generate creative text formats like poems, scripts, and even code, why can't they teach themselves? This concept of self-improvement promises to unlock unprecedented levels of quality and efficiency, surpassing the limitations of RLHF. And this is precisely what researchers have achieved with RLAIF. As with any form of Reinforcement Learning, the key lies in assigning value to outputs and training a Reward Model to predict those values. RLAIF's innovation is the ability to generate these preference labels automatically, at scale, without relying on human input. While all LLMs ultimately stem from human-generated data in some form, RLAIF leverages existing LLMs as "teachers" to guide the training process, eliminating the need for continuous human labeling. Using this method, the authors have been able to achieve comparable or even better results from RLAIF as opposed to RLHF. See below the graph of ‘Harmless Response Rate’ comparing the various approaches: To achieve this, the authors developed a number of methodological innovations. In-context learning and prompt engineering: RLAIF leverages in-context learning and carefully designed prompts to elicit preference information from the teacher LLM. These prompts provide context, examples (for few-shot learning), and the samples to be evaluated. The teacher LLMs output then serves as the reward signal. Chain-of-thought reasoning: To enhance the teacher LLM's reasoning capabilities, RLAIF employs Chain-of-Thought (CoT) prompting. While the reasoning process itself isn't directly used, it leads to more informed and nuanced preference judgments from the teacher LLM. Addressing position bias: To mitigate the influence of response order on the teacher's preference, RLAIF averages preferences obtained from multiple prompts with varying response orders. To understand this a little more directly, imagine the AI you are trying to train as a student, learning and improving through a continuous feedback loop. And then imagine an off-the-shelf AI, that has been through extensive training already, as the teacher. The teacher rewards the student for taking certain actions, coming up with certain responses, and so on, and punishes it otherwise. The way it does this is by ‘testing’ the student, by giving it quizzes where the student must select the optimal response. 
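As a minimal sketch of how preference elicitation and position-bias averaging might look in code (teacher_llm is a hypothetical callable standing in for whatever model API you use, and the prompt wording is purely illustrative):
Python
def elicit_preference(teacher_llm, prompt, response_a, response_b):
    """Ask the teacher which response is better, in both orders, and average the result."""
    template = (
        "A user asked: {prompt}\n"
        "Response 1: {first}\n"
        "Response 2: {second}\n"
        "Which response is more helpful and harmless? Answer '1' or '2'."
    )
    answer_ab = teacher_llm(template.format(prompt=prompt, first=response_a, second=response_b))
    answer_ba = teacher_llm(template.format(prompt=prompt, first=response_b, second=response_a))

    score_a = 0.0
    if answer_ab.strip().startswith("1"):  # A shown first and chosen
        score_a += 1.0
    if answer_ba.strip().startswith("2"):  # A shown second and chosen
        score_a += 1.0
    return score_a / 2.0  # 1.0: A consistently preferred; 0.0: B preferred; 0.5: the teacher is split
These soft labels can then either train a reward model, as sketched earlier, or feed a DPO-style objective directly, as discussed below.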
These tests are generated via ‘contrastive’ prompts, where the teacher produces slightly different responses by slightly varying the prompts. For example, in the context of code generation, one prompt might encourage the LLM to generate efficient code, potentially at the expense of readability, while the other emphasizes code clarity and documentation. The teacher then assigns its own preference as the ‘ground truth’ and asks the student to indicate what it thinks is the preferred output. By comparing the student’s responses under these contrasting prompts, RLAIF assesses which response better aligns with the desired attribute. The student, meanwhile, aims to maximize the accumulated reward. So every time it is punished, it changes something about itself so that it doesn’t repeat the mistake and get punished again. When it is rewarded, it aims to reinforce that behavior so it is more likely to reproduce the same response in the future. In this way, over successive quizzes, the student gets better and better and is punished less and less. While punishments never go to zero, the student does converge to some minimum that represents the optimal performance it is able to achieve. From there, future inferences made by the student are likely to be of much higher quality than if RLAIF were not employed. The evaluation of synthetic (LLM-generated) preference data is crucial for effective alignment. RLAIF utilizes a "self-rewarding" score, which compares the generation probabilities of two responses under contrastive prompts. This score reflects the relative alignment of each response with the desired attribute. Finally, Direct Preference Optimization (DPO), an efficient RL algorithm, leverages these self-rewarding scores to optimize the student model, encouraging it to generate responses that align with human values. DPO directly optimizes an LLM towards preferred responses without needing to explicitly train a separate reward model. RLAIF in Action: Applications and Benefits RLAIF's versatility extends to various tasks, including summarization, dialogue generation, and code generation. Research has shown that RLAIF can achieve comparable or even superior performance to RLHF, while significantly reducing the reliance on human annotations. This translates to substantial cost savings and faster iteration cycles, making RLAIF particularly attractive for rapidly evolving LLM development. Moreover, RLAIF opens doors to a future of "closed-loop" LLM improvement. As the student model becomes better aligned through RLAIF, it can, in turn, be used as a more reliable teacher model for subsequent RLAIF iterations. This creates a positive feedback loop, potentially leading to continual improvement in LLM alignment without additional human intervention. So how can you leverage RLAIF? It’s actually quite simple if you already have an RL pipeline: Prompt set: Start with a set of prompts designed to elicit the desired behaviors. Alternatively, you can utilize an off-the-shelf LLM to generate these prompts. Contrastive prompts: For each prompt, create two slightly varied versions that emphasize different aspects of the target behavior (e.g., helpfulness vs. safety). LLMs can also automate this process. Response generation: Capture the responses from the student LLM for each prompt variation. Preference elicitation: Create meta-prompts to obtain preference information from the teacher LLM for each prompt-response pair.
RL pipeline integration: Utilize the resulting preference data within your existing RL pipeline to guide the student model's learning and optimization. Challenges and Limitations Despite its potential, RLAIF faces challenges that require further research. The accuracy of AI annotations remains a concern, as biases from the teacher LLM can propagate to the student model. Furthermore, biases incorporated into this preference data can eventually become ‘crystallized’ in the teacher LLM, which makes them difficult to remove afterward. Additionally, studies have shown that RLAIF-aligned models can sometimes generate responses with factual inconsistencies or decreased coherence, which calls for techniques to improve the factual grounding and overall quality of the generated text. Addressing these issues also means exploring techniques to enhance the reliability, quality, and objectivity of AI feedback. Furthermore, the theoretical underpinnings of RLAIF require careful examination. While the effectiveness of self-rewarding scores has been demonstrated, further analysis is needed to understand their limitations and refine the underlying assumptions. Emerging Trends and Future Research RLAIF's emergence has sparked exciting research directions. Comparing it with other RL methods like Reinforcement Learning from Execution Feedback (RLEF) can provide valuable insights into their respective strengths and weaknesses. One direction involves investigating fine-grained feedback mechanisms that provide more granular rewards at the individual token level, potentially leading to more precise and nuanced alignment outcomes. Another promising avenue explores the integration of multimodal information, incorporating data from images and videos to enrich the alignment process and foster a more comprehensive understanding within LLMs. Drawing inspiration from human learning, researchers are also exploring the application of curriculum learning principles in RLAIF, gradually increasing the complexity of tasks to enhance the efficiency and effectiveness of the alignment process. Additionally, investigating the potential for a positive feedback loop in RLAIF, leading to continual LLM improvement without human intervention, represents a significant step towards a more autonomous and self-improving AI ecosystem. Furthermore, there may be an opportunity to improve the quality of this approach by grounding feedback in the real world. As an example, if the agent were able to execute code, perform real-world experiments, or integrate with a robotic system to ‘instantiate’ feedback in the real world, it would be able to capture more accurate and reliable preference information without losing the scaling advantages. However, ethical considerations remain paramount. As RLAIF empowers LLMs to shape their own alignment, it's crucial to ensure responsible development and deployment. Establishing robust safeguards against potential misuse and mitigating biases inherited from teacher models are essential for building trust and ensuring the ethical advancement of this technology. As mentioned previously, RLAIF has the potential to propagate and amplify biases present in the source data, which must be carefully examined before scaling this approach. Conclusion: RLAIF as a Stepping Stone To Aligned AI RLAIF presents a powerful and efficient approach to LLM alignment, offering significant advantages over traditional RLHF methods.
Its scalability, cost-effectiveness, and potential for self-improvement hold immense promise for the future of AI development. While acknowledging the current challenges and limitations, ongoing research efforts are actively paving the way for a more reliable, objective, and ethically sound RLAIF framework. As we continue to explore this exciting frontier, RLAIF stands as a stepping stone towards a future where LLMs seamlessly integrate with human values and expectations, unlocking the full potential of AI for the benefit of society.
A retry mechanism is a critical component of many modern software systems. It allows our system to automatically retry failed operations to recover from transient errors or network outages. By automatically retrying failed operations, retry mechanisms can help software systems recover from unexpected failures and continue functioning correctly. Today, we'll take a look at these topics: What Is a Retry Pattern? What is it for, and why do we need to implement it in our system? When to Retry Your Request Only some requests should be retried. It's important to understand what kind of errors from the downstream service can be retried to avoid problems with business logic. Retry Backoff Period When we retry the request to the downstream service, how long should we wait to send the request again after it fails? How to Retry We'll look at ways to retry, from basic to more complex. What Is a Retry Pattern? Retrying is the act of sending the same request again if the request to the downstream service fails. By using a retry pattern, you'll be improving the downstream resiliency aspect of your system. When an error occurs while calling a downstream service, our system will try to call it again instead of returning an error to the upstream service. So, why do we need to do it, exactly? Microservices architecture has been gaining popularity in recent years. While this approach has many benefits, one of the downsides of microservices architecture is introducing network communication between services. Additional network communication leads to the possibility of errors in the network while services are communicating with each other (read about the fallacies of distributed computing). Every call to other services has a chance of getting those errors. In addition, whether you're using a monolith or microservices architecture, there is a big chance that you still need to call other services that are not within your company's internal network. Calling a service in a different network means your request will go through more network layers and have a higher chance of failure. Other than network errors, you can also get system errors like rate-limit errors, service downtime, and processing timeouts. The errors you get may or may not be suitable for retrying. Let's head to the next section to explore this in more detail. When To Retry Your Request Although adding a retry mechanism to your system is generally a good idea, not every request to the downstream service should be retried. As a simple baseline, the things you should consider when you want to retry are as follows: Is It a Transient Error? You'll need to consider whether the type of error you're getting is transient (temporary). For example, you can retry a connection timeout error because it's usually only temporary, but not a bad request error because you need to change the request. Is It a System Error? When you're getting an error message from the downstream service, it can be categorized as either a system error or an application error. A system error is generally okay to retry because your request hasn't been processed by the downstream service yet. On the other hand, an application error usually means that something is wrong with your request, and you should not retry it. For example, if you're getting a bad request error from the downstream service, you'll always get the same error no matter how many times you retry. Idempotency Even when you're getting an error from the downstream service, there is still a chance it has processed your request.
The downstream service could return the error after it has completed the main process because another sub-process caused the error. An idempotent API means that even if the API gets the same request twice, it will only process the first request. We can achieve this by adding an ID that's unique to the request so the downstream service can determine whether it should process the request. Usually, you can differentiate this by the request method: GET, DELETE, and PUT are usually idempotent, and POST is not. However, you need to confirm the API's idempotency with the service owner. The Cost of Retrying When you retry your request to the downstream service, there will be additional resource usage. The additional resource usage can be in the form of additional CPU usage, blocked threads, additional memory usage, additional bandwidth usage, etc. You need to consider this, especially if your service expects large traffic. The Implementation Cost of the Retry Mechanism Many programming languages already have a library that implements a retry mechanism, but you still need to determine which requests to retry. You can also create your own retry mechanism for every system if you want to, but of course, this means that there will be a high implementation cost for the retry mechanism. Note 1: Many libraries have already implemented the retry mechanism gracefully. For example, if you're using the Spring Mongo library in Java Spring Boot and the connection between your apps and MongoDB is severed, it will try to reconnect. Note 2: Some libraries also implement a retry mechanism by default. This is sometimes dangerous because you can be unaware that the library will retry your request. I've also compiled some common errors and whether or not they're suitable for retrying. Let's briefly describe the errors one by one. Connection timeout: Your app failed to connect to the downstream service; hence, the downstream service isn't aware of your request, and you can retry it. Read timeout: The downstream app has processed your request but has not returned any response for a long time. Circuit breaker tripped: This is an error you get if you use a circuit breaker in your service. You can retry this kind of error because your service hasn't sent its request to the downstream service. 400 - Bad Request: This error means the downstream service validated your request and flagged it as invalid. You shouldn't retry this error because it will always return the same error if the request is the same. 401 - Unauthorized: You need to authorize before sending the request. Whether you can retry this error will depend on the authentication method and the error. But generally, you will always get the same error if your request is the same. 429 - Too many requests: Your request is rate-limited by the downstream service. You can retry this error, although you should confirm with the downstream service's owner how long your request will be rate-limited. 500 - Internal Server Error: This means the downstream service had started processing your request but failed in the middle of it. Usually, it's okay to retry this error. 503 - Service Unavailable: The downstream service is unavailable due to downtime. It is okay to retry this kind of error. Retry Backoff Period When your request to the downstream service fails, your system will need to wait for some time before trying again. This period is called the retry backoff period.
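As a quick preview of the three strategies described next (fixed, exponential, and random backoff), here is a minimal Python sketch, purely illustrative and not tied to any particular retry library, of how the delay before each retry attempt might be computed:
Python
import random

def backoff_delay(strategy, attempt, base=5.0, multiplier=2.0):
    """Return the delay in seconds before retry number `attempt` (starting at 1)."""
    if strategy == "fixed":
        return base                                  # always the same wait
    if strategy == "exponential":
        return base * (multiplier ** (attempt - 1))  # 5s, 10s, 20s, ...
    if strategy == "random":
        return random.uniform(0, base)               # spread retries out to avoid bursts
    raise ValueError(f"unknown strategy: {strategy}")

# Example: delays for three consecutive retries with exponential backoff
print([backoff_delay("exponential", n) for n in (1, 2, 3)])  # [5.0, 10.0, 20.0]
In a real system, you would also typically cap the exponential delay at some maximum and add a random jitter on top of it.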
Generally, there are three strategies for the wait time between calls: Fixed Backoff, Exponential Backoff, and Random Backoff. All three of them have their advantages and disadvantages. Which one you use should depend on your API and service use case. Fixed Backoff Fixed backoff means that every time you retry your request, the delay between requests is always the same. For example, if you retry twice with a backoff of 5 seconds, then if the first call fails, the second request will be sent 5 seconds later. If it fails again, the third call will be sent 5 seconds after the failure. A fixed backoff period is suitable for a request that comes directly from the user and needs a quick response. If the request is important and you need it to come back ASAP, then you can set the backoff period to none or close to 0. Exponential Backoff When a downstream service is having a problem, it doesn't always recover quickly. What you don't want to do when the downstream service is trying to recover is to hit it multiple times in a short interval. Exponential backoff works by adding some additional backoff time every time our service attempts to call the downstream service. For example, we can configure our retry mechanism with a 5-second initial backoff and a multiplier of two for every attempt. This means when our first call to the downstream service fails, our service will wait 5 seconds before the next call. If the second call fails again, the service will wait 10 seconds instead of 5 seconds before the next call. Due to its longer intervals, exponential backoff is unsuitable for retrying a user request. But it will be perfect for a background process like a notification, email, or webhook system. Random Backoff Random backoff is a backoff strategy that introduces randomness into its backoff interval calculation. Suppose that your service is getting a burst of traffic. Your service then calls a downstream service for every request, and you get errors from it because the downstream service is overwhelmed by your requests. Your service implements a retry mechanism and will retry the requests in 5 seconds. But there is a problem: when it's time to retry the requests, all of them will be retried at once, and you might get an error from the downstream service again. With the randomness introduced by the random backoff mechanism, you can avoid this. A random backoff strategy will help your service level the load on the downstream service by introducing a random delay for each retry. Let's say you configure the retry mechanism with a 5-second interval and two retries. If the first call fails, the second one could be attempted after 500ms; if it fails again, the third one could be attempted after 3.8 seconds. If many requests to the downstream service fail, they won't all be retried simultaneously. Where To Store the Retry State When doing a retry, you'll need to store the state of the retry somewhere. The state includes how many retries have been made, the request to be retried, and any additional metadata you want to save. Generally, there are three places you can use to store the retry state, which are as follows: Thread The thread is the most common place to store the retry state. If you're using a library with a built-in retry mechanism, it will most likely use the thread to store the state. The simplest way to do this is to sleep the thread.
Let's see an example in Java:
int retryCount = 0;
while (retryCount < 3) {
    try {
        thirdPartyOutboundService.getData();
        break; // success: stop retrying
    } catch (Exception e) {
        retryCount += 1;
        Thread.sleep(3000); // fixed 3-second backoff; the enclosing method must handle or declare InterruptedException
    }
}
The code above sleeps the thread when it catches an exception and then calls the process again, breaking out of the loop once the call succeeds. While this is simple, it has the disadvantage of blocking the thread and making other processes unable to use it. This method is suitable for a fixed backoff strategy with a low interval, like processes that respond directly to the user and need a response as soon as possible. Messaging We could use a popular messaging broker like RabbitMQ (with a delayed queue) to save the retry state. When you get a request from the upstream and fail to process it (whether because of the downstream service or not), you can publish the message to the delayed queue and consume it later (depending on your backoff). Using messaging to save the retry state is suitable for a background process request because the upstream service can't directly get the response of the retry process. The advantage of using this approach is that it's usually easy to implement because the broker/library already supports the retry function. Messaging as a storage system for retry state also works well with distributed systems. One problem that can happen is that your service suddenly has an issue, like downtime, while waiting for the next retry. By saving the retry state in the messaging broker, your service can continue the retry after the issue has been resolved. Database The database is the most customizable solution to store the retry state, using either persistent storage or an in-memory KV store like Redis. When the request to the downstream service fails, you can save the data in the database and use a cron job to check the database every second or minute to retry failed messages. While this is the most customizable solution, the implementation cost will be very high because you'll need to implement your own retry mechanism. You can either create the mechanism in your service, with the downside of sacrificing a bit of performance when a retry is happening, or make an entirely new service for retry purposes. Takeaways This article has explored what a retry pattern is and what aspects to consider when implementing one. You need to know which requests to retry and how to retry them. If you implement the retry mechanism correctly, you'll improve the user experience and reduce the operational burden of the service you're building. But if you do it incorrectly, you risk worsening the user experience and causing business errors. You need to understand when a request can be retried and how to retry it so you can implement the mechanism correctly. There is much more to it. In this article, we've covered the retry pattern. This pattern increases the downstream resiliency aspect of a system, but there is more to downstream resiliency. We can combine the retry pattern with a timeout (which we explored in this article) and a circuit breaker to make our system more resilient to downstream failure.
Traditional machine learning (ML) models and AI techniques often suffer from a critical flaw: they lack uncertainty quantification. These models typically provide point estimates without accounting for the uncertainty surrounding their predictions. This limitation undermines the ability to assess the reliability of the model's output. Moreover, traditional ML models are data-hungry and often require correctly labeled data and, as a result, tend to struggle with problems where data is limited. Furthermore, these models lack a systematic framework for incorporating expert domain knowledge or prior beliefs into the model. Without the ability to leverage domain-specific insights, the model might overlook crucial nuances in the data and tend not to perform up to its potential. ML models are becoming more complex and opaque, while there is a growing demand for more transparency and accountability in decisions derived from data and AI. Probabilistic Programming: A Solution To Addressing These Challenges Probabilistic programming provides a modeling framework that addresses these challenges. At its core lies Bayesian statistics, a departure from the frequentist interpretation of statistics. Bayesian Statistics In frequentist statistics, probability is interpreted as the long-run relative frequency of an event. Data is considered random and a result of sampling from a fixed, defined distribution. Hence, noise in measurement is associated with sampling variation. Frequentists believe that probability exists and is fixed, and that infinitely repeated experiments converge to that fixed value. Frequentist methods do not assign probability distributions to parameters, and their interpretation of uncertainty is rooted in the long-run frequency properties of estimators rather than explicit probabilistic statements about parameter values. In Bayesian statistics, probability is interpreted as a measure of uncertainty in a particular belief. Data is considered fixed, while the unknown parameters of the system are regarded as random variables and are modeled using probability distributions. Bayesian methods capture uncertainty within the parameters themselves and hence offer a more intuitive and flexible approach to uncertainty quantification. Frequentist vs. Bayesian Statistics [1] Probabilistic Machine Learning In frequentist ML, model parameters are treated as fixed and estimated through Maximum Likelihood Estimation (MLE), where the likelihood function quantifies the probability of observing the data given the statistical model. MLE seeks point estimates of parameters maximizing this probability. To implement MLE: Assume a model and the underlying model parameters. Derive the likelihood function based on the assumed model. Optimize the likelihood function to obtain point estimates of parameters. Hence, frequentist models, which include deep learning, rely on optimization, usually gradient-based, as their fundamental tool. In contrast, Bayesian methods model the unknown parameters and their relationships with probability distributions and use Bayes' theorem to compute and update these probabilities as we obtain new data. Bayes Theorem: P(parameters | data) = P(data | parameters) × P(parameters) / P(data), i.e., Posterior = (Likelihood × Prior) / Marginal Likelihood. "Bayes’ rule tells us how to derive a conditional probability from a joint, conditioning tells us how to rationally update our beliefs, and updating beliefs is what learning and inference are all about" [2]. This is a simple but powerful equation.
Prior represents the initial belief about the unknown parameters. Likelihood represents the probability of the data based on the assumed model. Marginal Likelihood is the model evidence, which acts as a normalizing coefficient. The Posterior distribution represents our updated beliefs about the parameters, incorporating both prior knowledge and observed evidence. In Bayesian machine learning, inference is the fundamental tool. The distribution of parameters represented by the posterior distribution is utilized for inference, offering a more comprehensive understanding of uncertainty. Bayesian update in action: The plot below illustrates the posterior distribution for a simple coin toss experiment across various sample sizes and with two distinct prior distributions. This visualization provides insights into how the combination of different sample sizes and prior beliefs influences the resulting posterior distributions. Impact of Sample Size and Prior on Posterior Distribution How to Model the Posterior Distribution The seemingly simple posterior distribution is, in most cases, hard to compute. In particular, the denominator, i.e., the marginal likelihood integral, tends to be intractable, especially when working with a higher-dimensional parameter space. In most cases there is no closed-form solution, and numerical integration methods are also computationally intensive. To address this challenge, we rely on a special class of algorithms called Markov Chain Monte Carlo simulations to model the posterior distribution. The idea here is to sample from the posterior distribution rather than explicitly modeling it, and to use those samples to represent the distribution of the model parameters. Markov Chain Monte Carlo (MCMC) "MCMC methods comprise a class of algorithms for sampling from a probability distribution. By constructing a Markov chain that has the desired distribution as its equilibrium distribution, one can obtain a sample of the desired distribution by recording states from the chain" [3]. A few of the commonly used MCMC samplers are: Metropolis-Hastings Gibbs Sampler Hamiltonian Monte Carlo (HMC) No-U-Turn Sampler (NUTS) Sequential Monte Carlo (SMC) Probabilistic Programming Probabilistic Programming is a programming framework for Bayesian statistics, i.e., it concerns the development of syntax and semantics for languages that denote conditional inference problems and the development of "solvers" for those inference problems. In essence, Probabilistic Programming is to Bayesian modeling what automated differentiation tools are to classical Machine Learning and Deep Learning models [2]. There exists a diverse ecosystem of Probabilistic Programming languages, each with its own syntax, semantics, and capabilities. Some of the most popular languages include: BUGS (Bayesian inference Using Gibbs Sampling) [4]: BUGS is one of the earliest probabilistic programming languages, known for its user-friendly interface and support for a wide range of probabilistic models. It implements Gibbs sampling and other Markov Chain Monte Carlo (MCMC) methods for inference. JAGS (Just Another Gibbs Sampler) [5]: JAGS is a specialized language for Bayesian hierarchical modeling, particularly suited for complex models with nested structures. It utilizes the Gibbs sampling algorithm for posterior inference. Stan: A probabilistic programming language renowned for its expressive modeling syntax and efficient sampling algorithms. Stan is widely used in academia and industry for a variety of Bayesian modeling tasks.
"Stan differs from BUGS and JAGS in two primary ways. First, Stan is based on a new imperative probabilistic programming language that is more flexible and expressive than the declarative graphical modeling languages underlying BUGS or JAGS, in ways such as declaring variables with types and supporting local variables and conditional statements. Second, Stan’s Markov chain Monte Carlo (MCMC) techniques are based on Hamiltonian Monte Carlo (HMC), a more efficient and robust sampler than Gibbs sampling or Metropolis-Hastings for models with complex posteriors" [6]. BayesDB: BayesDB is a probabilistic programming platform designed for large-scale data analysis and probabilistic database querying. It enables users to perform probabilistic inference on relational databases using SQL-like queries [7] PyMC3: PyMC3 is a Python library for Probabilistic Programming that offers an intuitive and flexible interface for building and analyzing probabilistic models. It leverages advanced sampling algorithms such as Hamiltonian Monte Carlo (HMC) and Automatic Differentiation Variational Inference (ADVI) for inference [8]. TensorFlow Probability: "TensorFlow Probability (TFP) is a Python library built on TensorFlow that makes it easy to combine probabilistic models and deep learning on modern hardware (TPU, GPU)" [9]. Pyro: "Pyro is a universal probabilistic programming language (PPL) written in Python and supported by PyTorch on the backend. Pyro enables flexible and expressive deep probabilistic modeling, unifying the best of modern deep learning and Bayesian modeling" [10]. These languages share a common workflow, outlined below: Model definition: The model defines the processes governing data generation, latent parameters, and their interrelationships. This step requires careful consideration of the underlying system and the assumptions made about its behavior. Prior distribution specification: Define the prior distributions for the unknown parameters within the model. These priors encode the practitioner's beliefs, domain, or prior knowledge about the parameters before observing any data. Likelihood specification: Describe the likelihood function, representing the probability distribution of observed data conditioned on the unknown parameters. The likelihood function quantifies the agreement between the model predictions and the observed data. Posterior distribution inference: Use a sampling algorithm to approximate the posterior distribution of the model parameters given the observed data. This typically involves running Markov Chain Monte Carlo (MCMC) or Variational Inference (VI) algorithms to generate samples from the posterior distribution. Case Study: Forecasting Stock Index Volatility In this case study, we will employ Bayesian modeling techniques to forecast the volatility of a stock index. Volatility here measures the degree of variation in a stock's price over time and is a crucial metric for assessing the risk associated with a particular stock. Data: For this analysis, we will utilize historical data from the S&P 500 stock index. The S&P 500 is a widely used benchmark index that tracks the performance of 500 large-cap stocks in the United States. By examining the percentage change in the index's price over time, we can gain insights into its volatility. 
S&P 500 — Share Price and Percentage Change From the plot above, we can see that the time series of price changes between consecutive days has a constant mean but changing variance over time, i.e., the time series exhibits heteroscedasticity. Modeling Heteroscedasticity: "In statistics, a sequence of random variables is homoscedastic if all its random variables have the same finite variance; this is also known as homogeneity of variance. The complementary notion is called heteroscedasticity, also known as heterogeneity of variance" [11]. Auto-regressive Conditional Heteroskedasticity (ARCH) models are specifically designed to address heteroscedasticity in time series data. Bayesian vs. Frequentist Implementation of ARCH Model The key benefits of Bayesian modeling include the ability to incorporate prior information and quantify uncertainty in model parameters and predictions. These are particularly useful in settings with limited data and when prior knowledge is available. In conclusion, Bayesian modeling and probabilistic programming offer powerful tools for addressing the limitations of traditional machine-learning approaches. By embracing uncertainty quantification, incorporating prior knowledge, and providing transparent inference mechanisms, these techniques empower data scientists to make more informed decisions in complex real-world scenarios. References [1] Fornacon-Wood, I., Mistry, H., Johnson-Hart, C., Faivre-Finn, C., O'Connor, J.P. and Price, G.J., 2022. Understanding the differences between Bayesian and frequentist statistics. International Journal of Radiation Oncology, Biology, Physics, 112(5), pp.1076-1082. [2] Van de Meent, J.W., Paige, B., Yang, H. and Wood, F., 2018. An Introduction to Probabilistic Programming. arXiv preprint arXiv:1809.10756. [3] Markov chain Monte Carlo. [4] Spiegelhalter, D., Thomas, A., Best, N. and Gilks, W., 1996. BUGS 0.5: Bayesian inference using Gibbs sampling manual (version ii). MRC Biostatistics Unit, Institute of Public Health, Cambridge, UK, pp.1-59. [5] Hornik, K., Leisch, F., Zeileis, A. and Plummer, M., 2003. JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Proceedings of DSC (Vol. 2, No. 1). [6] Carpenter, B., Gelman, A., Hoffman, M.D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M.A., Guo, J., Li, P. and Riddell, A., 2017. Stan: A probabilistic programming language. Journal of Statistical Software, 76. [7] BayesDB. [8] PyMC. [9] TensorFlow Probability. [10] Pyro AI. [11] Homoscedasticity and heteroscedasticity. [12] Introduction to ARCH Models. [13] pymc.GARCH11.
Salesforce Analytics Query Language (SAQL) is a Salesforce proprietary query language designed for analyzing Salesforce native objects and CRM Analytics datasets. SAQL enables developers to query, transform, and project data to facilitate business insights by customizing the CRM dashboards. SAQL is very similar to SQL (Structured Query Language); however, it is designed to explore data within Salesforce and has its own unique syntax, which is somewhat like Pig Latin (pig-ql). You can also use SAQL to implement complex logic while preparing datasets using dataflows and recipes. Key Features Key features of SAQL include the following: It enables users to specify filter conditions and to group and summarize input data streams, creating aggregated values to derive actionable insights and analyze trends. SAQL supports conditional statements such as IF-THEN-ELSE and CASE. This feature can be used to execute complex conditions for data filtering and transformation. SAQL DATE and TIME-related functions make it much easier to work with date and time attributes, allowing users to execute time-based analysis, like comparing data over various time intervals. It supports a variety of data transformation functions to cleanse, format, and typecast data, altering the structure of the data to suit your requirements. SAQL enables you to create complex calculated fields from existing data fields by applying mathematical, logical, or string functions. SAQL provides seamless integration with Salesforce objects and CRM Analytics datasets. SAQL queries can be used to design visuals like charts, graphs, and dashboards within the Salesforce CRM Analytics platform. The rest of this article will focus on explaining the fundamentals of writing SAQL queries and delve into a few use cases where you can use SAQL to analyze Salesforce data. Basics of SAQL Typical SAQL queries work like any other ETL tool: queries load the datasets, perform operations/transformations, and create an output data stream to be used in visualization. SAQL statements can span multiple lines and are concluded with a semicolon. Every line of the query works on a named stream, which can serve as input for any subsequent statements in the same query. The following SAQL query can be used to create a data stream to analyze the opportunities booked in the previous year by month.
SQL
1. q = load "OpportunityLineItems";
2. q = filter q by 'StageName' == "6 - Closed Won" and date('CloseDate_Year', 'CloseDate_Month', 'CloseDate_Day') in ["1 year ago".."1 year ago"];
3. q = group q by ('CloseDate_Year', 'CloseDate_Month');
4. q = foreach q generate q.'CloseDate_Year' as 'CloseDate_Year', q.'CloseDate_Month' as 'CloseDate_Month', sum(q.'ExpectedTotal__c') as 'Bookings';
5. q = order q by ('CloseDate_Year' asc, 'CloseDate_Month' asc);
6. q = limit q 2000;
Line 1: This statement loads the CRM Analytics dataset named “OpportunityLineItems” into an input stream q.
Line 2: The input stream q is filtered to look for the opportunities closed won in the previous year. This is similar to the WHERE clause in SQL.
Line 3: This statement groups the records by the close date year and month so that we can visualize the data by month. This is similar to the GROUP BY clause in SQL.
Line 4: This statement selects the attributes we want to project from the input stream. Here, the expected total is summed up for each group.
Line 5: This statement orders the records by the close date year and month so that we can create a line chart to visualize the bookings by month.
Line 6: The last statement restricts the stream to a limited number of rows. This is mainly used for debugging purposes.
Joining Multiple Data Streams The SAQL cogroup function joins input data streams like Salesforce objects or CRM Analytics datasets. The data sources being joined should have a related column to facilitate the join. cogroup also supports the execution of both INNER and OUTER joins. For example, if you had two datasets, with one containing sales data and another containing customer data, you could use cogroup to join them based on a common field like customer ID. The resultant data stream contains fields from both tables. Use Case The following code block can be used to create a data stream of NewPipeline and Bookings for customers. The pipeline built and the bookings come from two different streams, which we can join by Account Name.
SQL
q = load "Pipeline_Metric";
q = filter q by 'Source' in ["NewPipeline"];
q = group q by 'AccountName';
q = foreach q generate q.'AccountName' as 'AccountName', sum(ExpectedTotal__c) as 'NewPipeline';
q1 = load "Bookings_Metric";
q1 = filter q1 by 'Source' in ["Bookings"];
q1 = group q1 by 'AccountName';
q1 = foreach q1 generate q1.'AccountName' as 'AccountName', sum(q1.ExpectedTotal__c) as 'Bookings';
q2 = cogroup q by 'AccountName', q1 by 'AccountName';
result = foreach q2 generate q.'AccountName' as 'AccountName', sum(q.'NewPipeline') as 'NewPipeline',sum(q1.'Bookings') as 'Bookings';
You can also use a left outer cogroup to join the right data table with the left. This will result in all the records from the left data stream and all the matching records from the right stream. Use the coalesce function to replace all the null values from the right stream with another value. In the example above, if you want to report all the accounts with or without bookings, you can use the query below.
SQL
q = load "Pipeline_Metric";
q = filter q by 'Source' in ["NewPipeline"];
q = group q by 'AccountName';
q = foreach q generate q.'AccountName' as 'AccountName', sum(ExpectedTotal__c) as 'NewPipeline';
q1 = load "Bookings_Metric";
q1 = filter q1 by 'Source' in ["Bookings"];
q1 = group q1 by 'AccountName';
q1 = foreach q1 generate q1.'AccountName' as 'AccountName', sum(q1.ExpectedTotal__c) as 'Bookings';
q2 = cogroup q by 'AccountName' left, q1 by 'AccountName';
result = foreach q2 generate q.'AccountName' as 'AccountName', sum(q.'NewPipeline') as 'NewPipeline', coalesce(sum(q1.'Bookings'), 0) as 'Bookings';
Top N Analysis Using Windowing SAQL enables Top N analysis across value groups using windowing functions within the input data stream. These functions are used to derive moving averages, cumulative totals, and rankings within groups. You can specify the set of records over which you want to execute these calculations using the “over” keyword. SAQL allows you to specify an offset to identify the number of records before and after the selected row, or you can optionally choose to work on all the records within a partition. This set of records is called a window. Once the set of records is identified for a window, you can apply an aggregation function to all the records within the defined window. Optionally, you can create partitions to group the records based on a set of fields and perform aggregate calculations for each partition independently.
Use Case The following SAQL code can be used to prepare data showing the percentage contribution of each customer's new pipeline to the total pipeline for the region, along with the ranking of these customers within the region.
SQL
q = load "Pipeline_Metric";
q = filter q by 'Source' in ["NewPipeline"];
q = group q by ('Region','AccountName');
q = foreach q generate q.'Region' as 'Region',q.'AccountName' as 'AccountName', ((sum('ExpectedTotal__c')/sum(sum('ExpectedTotal__c')) over ([..] partition by 'Region')) * 100) as 'PCT_PipelineContribution', rank() over ([..] partition by ('Region') order by sum('ExpectedTotal__c') desc ) as 'Rank';
q = filter q by 'Rank' <=5;
Data Aggregation: Grand Totals and Subtotals With SAQL SAQL offers rollup and grouping functions to aggregate the data streams based on pre-defined groups. While the rollup construct is used with the group by statement, grouping is used as part of foreach statements while projecting the input data stream. The rollup function aggregates the input data stream at various levels of a hierarchy, allowing you to create calculated fields on summarized datasets at higher levels of aggregation. For example, if you have datasets by day, rollup can be used to aggregate the results by week, month, or year. The grouping function is used to group data based on specific dimensions or fields in order to segment the data into meaningful subsets for analysis. For example, you might group sales data by product category or region to analyze performance within each group. Use Case Use the code below to prepare data for the total number of accounts and the accounts engaged by region and theater. Also, add the grand total to look at the global numbers and subtotals for both regions and theaters.
SQL
q = load "ABXLeadandOpportunities_Metric";
q = filter q by 'Source' == "ABX Opportunities" and 'CampaignType' == "Growth Sprints" and 'Territory_Level_01__c' is not null;
q = foreach q generate 'Territory_Level_01__c' as 'Territory_Level_01__c','Territory_Level_02__c' as 'Territory_Level_02__c','Territory_Level_03__c' as 'Territory_Level_03__c', q.'AccountName' as 'AccountName',q.'OId' as 'OId','MarketingActionedOppty' as 'MarketingActionedOppty','AccountActionedAcct' as 'AccountActionedAcct','ADRActionedOppty' as 'ADRActionedOppty','AccountActionedADRAcct' as 'AccountActionedADRAcct';
q = group q by rollup ('Territory_Level_01__c', 'Territory_Level_02__c');
q = foreach q generate case when grouping('Territory_Level_01__c') == 1 then "TOTAL" else 'Territory_Level_01__c' end as 'Level1', case when grouping('Territory_Level_02__c') == 1 then "LEVEL1 TOTAL" else 'Territory_Level_02__c' end as 'Level2', unique('AccountName') as 'Total Accounts',unique('AccountActionedAcct') as 'Engaged',((unique('AccountActionedAcct') / unique('AccountName'))) as '% of Engaged';
q = limit q 2000;
Filling the Missing Date Fields You can use the fill() function to create records for missing date, week, month, quarter, and year values in your dataset. This comes in very handy when you want to show the result as 0 for these missing days/weeks/months instead of not showing them at all. Use Case The following SAQL code allows you to track the number of tasks for the sales agents by day. If an agent is on PTO, you still want to show 0 tasks for those days.
SQL
q = load "Tasks_Metric";
q = filter q by 'Source' == "Tasks";
q = filter q by date('MetricDate_Year', 'MetricDate_Month', 'MetricDate_Day') in [dateRange([2024,4,23], [2024,4,30])];
q = group q by ('MetricDate_Year', 'MetricDate_Month', 'MetricDate_Day');
q = foreach q generate q.'MetricDate_Year' as 'MetricDate_Year', q.'MetricDate_Month' as 'MetricDate_Month', q.'MetricDate_Day' as 'MetricDate_Day', unique(q.'Id') as 'Tasks';
q = order q by ('MetricDate_Year' asc, 'MetricDate_Month' asc, 'MetricDate_Day' asc);
q = limit q 2000;
The code above will be missing two days where there were no tasks created. You can use the code below to fill in the missing days.
SQL
q = load "Tasks_Metric";
q = filter q by 'Source' == "Tasks";
q = filter q by date('MetricDate_Year', 'MetricDate_Month', 'MetricDate_Day') in [dateRange([2024,4,23], [2024,4,30])];
q = group q by ('MetricDate_Year', 'MetricDate_Month', 'MetricDate_Day');
q = foreach q generate q.'MetricDate_Year' as 'MetricDate_Year', q.'MetricDate_Month' as 'MetricDate_Month', q.'MetricDate_Day' as 'MetricDate_Day', unique(q.'Id') as 'Tasks';
q = fill q by (dateCols=(MetricDate_Year, MetricDate_Month, MetricDate_Day, "Y-M-D"));
q = order q by ('MetricDate_Year' asc, 'MetricDate_Month' asc, 'MetricDate_Day' asc);
q = limit q 2000;
You can also specify a start date and end date to populate the missing records between those dates. Conclusion In the end, SAQL has proven itself as a powerful tool for the Salesforce developer community, empowering developers to extract actionable business insights from CRM datasets using capabilities like filtering, aggregation, windowing, time-based analysis, blending, custom calculations, Salesforce integration, and performance optimization. In this article, we have explored various capabilities of this technology and focused on targeted use cases. As a next step, I would recommend continuing your learning by exploring the Salesforce documentation, building your data models using dataflows, and using SAQL capabilities to harness the true potential of Salesforce as a CRM.
Deployment strategies provide a systematic approach to releasing software changes, minimizing risk, and maintaining consistency across projects and teams. Without a well-defined strategy and a systematic approach, deployments can lead to downtime, data loss, or system failures, resulting in frustrated users and lost revenue. Before we explore the different deployment strategies in more detail, here is a short overview of each strategy covered in this article:

- All-at-once deployment: Updates all target environments at once, making it the fastest but riskiest approach.
- In-place deployment: Stops the current application and replaces it with the new version, directly affecting availability.
- Blue/Green deployment: A zero-downtime approach that runs two identical environments and switches traffic from the old one to the new one.
- Canary deployment: Introduces new changes incrementally to a small subset of users before a full rollout.
- Shadow deployment: Mirrors real traffic to a shadow environment where the new deployment is tested without affecting the live environment.

All-At-Once Deployment

The all-at-once deployment strategy, also known as the "Big Bang" deployment strategy, involves simultaneously releasing your application's new version to all servers or environments. This method is straightforward and can be implemented quickly, as it does not require complex orchestration or additional infrastructure. The primary benefit of this approach is its simplicity and the ability to immediately transition all users to the new version of the application.

However, the all-at-once method carries significant risks. Since all instances are updated together, any issue with the new release immediately impacts all users. There is no opportunity to mitigate risk by gradually rolling out the change or by testing it with a subset of the user base first. Additionally, if something goes wrong, the rollback process can be just as disruptive as the initial deployment. Despite these risks, all-at-once deployment is used quite often and can be suitable for small applications or environments where downtime is more acceptable and the impact of potential issues is minimal. It is also useful when applications are inherently simple or have been thoroughly tested to ensure compatibility and stability before release.

In-Place (Recreate) Deployment

The in-place (recreate) deployment strategy is another approach that is used quite often. It is the simplest one and does not require additional infrastructure: to deploy a new version, we stop the application and start it again with the new changes. The disadvantage of this approach is that the service being updated experiences downtime that affects its users. Also, if there are problems with the new changes, we may need to roll them back, which leads to further downtime. To avoid downtime during deployment and to be able to roll back changes without it, the industry uses deployment strategies created for exactly this purpose.

Blue/Green Deployment

The first zero-downtime deployment strategy we are going to talk about is the Blue/Green deployment strategy. Its main goal is to minimize downtime and risk while deploying new software versions. This is done by running two identical environments of our service. One environment contains the original application (the Blue environment) that serves users' requests, and the other environment (the Green environment) is where the new software changes are deployed. This allows us to verify and test new changes with near-zero downtime for users and the service, and to safely roll back in case of any problems, except for some cases that we will discuss a bit later.

Typically, the process is the following: after verifying and testing the new changes in the Green environment, we reroute traffic from the Blue environment to the identical Green environment with the new changes, as sketched below. Sounds easy, doesn't it? ... it depends. The problem is that we can easily reroute traffic between environments only when our services are stateless. If they interact with any data sources, things get more complicated, and here's why: our identical Green and Blue environments share a common data source (or sources). While sharing data sources such as NoSQL databases or object stores (AWS S3, for example) between the identical environments is relatively easy to accomplish, the same is not true for relational databases, which require additional effort (NoSQL may also require some) to support Blue/Green deployments. Since approaches for handling schema updates without downtime are out of the scope of this article, you can check out the article "Upgrading database schema without downtime" to learn more (and if you have any interesting resources on updating schemas without downtime, please share them with us in the comments).

A general recommendation: if your services are not stateless and use data sources with schemas, a Blue/Green deployment strategy is not always advisable, because the additional risk and failure points it introduces can outweigh its benefits. But if you have decided that you need a Blue/Green deployment strategy and your infrastructure runs on Amazon Web Services, you might find this AWS document on implementing Blue/Green deployments and the required infrastructure useful.
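To make the traffic switch concrete, here is a minimal, conceptual Python sketch of the Blue/Green cutover idea. The environment URLs and the BlueGreenRouter class are hypothetical illustrations, not part of any specific tool; in real setups the switch usually happens at the load balancer, router, or DNS level. The principle is the same either way: live traffic always follows a single "active" pointer, so cutover and rollback are one atomic change.

Python
# A minimal, conceptual sketch of the Blue/Green cutover idea (not production code).
# The environment URLs and the BlueGreenRouter class are hypothetical; in practice the
# switch is usually done at the load balancer, router, or DNS level, but the idea is
# the same: live traffic follows a single "active" pointer that can be flipped atomically.

class BlueGreenRouter:
    def __init__(self, blue_url: str, green_url: str) -> None:
        self.environments = {"blue": blue_url, "green": green_url}
        self.active = "blue"  # all live traffic currently goes to the Blue environment

    def upstream(self) -> str:
        """Return the URL of the environment that should serve live traffic."""
        return self.environments[self.active]

    def cut_over(self) -> None:
        """Switch live traffic to Green once the new version has been verified there."""
        self.active = "green"

    def roll_back(self) -> None:
        """Point traffic back to Blue if problems appear after the switch."""
        self.active = "blue"

if __name__ == "__main__":
    router = BlueGreenRouter("http://blue.internal:8080", "http://green.internal:8080")
    print(router.upstream())  # Blue serves users while Green is being tested
    router.cut_over()
    print(router.upstream())  # Green now serves users
    router.roll_back()
    print(router.upstream())  # instant rollback to Blue

Note that this sketch only covers the stateless routing part; the data source concerns described above still apply regardless of where the switch actually happens.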
Canary Deployment

The idea of the Canary deployment strategy is to reduce the risk of deploying new software versions to production by rolling out new changes to users gradually. As in the Blue/Green deployment strategy, we deploy the new software version to an identical environment; but instead of rerouting all traffic from one environment to the other, we route only a portion of users to the environment with the new version, for example, by using a load balancer, as sketched below. The size of that portion, and the criteria used to select those users, can be specific to every company and project. Some roll out new changes only to their internal staff first, some select users randomly, and some use algorithms to match users based on specific criteria. Pick whatever best suits your needs.
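Here is a minimal, conceptual Python sketch of percentage-based canary routing. The upstream URLs, the 10% traffic share, and the helper functions are hypothetical examples used only to illustrate the idea; in practice this split is usually configured on the load balancer (for example, via weighted target groups) rather than in application code.

Python
# A minimal, conceptual sketch of percentage-based canary routing (not production code).
# The upstream URLs, the 10% share, and these helper functions are hypothetical examples;
# real setups usually configure this on the load balancer (e.g., weighted target groups).

import hashlib
import random

STABLE_URL = "http://stable.internal:8080"   # current production version
CANARY_URL = "http://canary.internal:8080"   # new version under evaluation
CANARY_PERCENT = 10                          # share of traffic sent to the canary

def route() -> str:
    """Pick an upstream per request: most requests stay on the stable version."""
    return CANARY_URL if random.randint(1, 100) <= CANARY_PERCENT else STABLE_URL

def route_sticky(user_id: str) -> str:
    """Sticky variant: the same user always lands on the same version during the rollout."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return CANARY_URL if bucket < CANARY_PERCENT else STABLE_URL

if __name__ == "__main__":
    sample = [route_sticky(f"user-{i}") for i in range(1000)]
    print("canary share:", sample.count(CANARY_URL) / len(sample))

Once the error rates and performance of the canary look healthy, the canary share is typically raised step by step until the new version serves all traffic.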
Shadow Deployment

The shadow deployment strategy is the next strategy that I personally find interesting. It also uses the concept of identical environments, just as the Blue/Green and Canary deployment strategies do. The main difference is that instead of rerouting all traffic, or only a portion of real users, we duplicate the entire traffic stream to the second environment where the new changes are deployed. This way, we can test and verify our changes without negatively affecting our users, mitigating the risk of broken software updates or performance bottlenecks. The responses produced by the shadow environment are typically discarded or used only for comparison and monitoring, so real users are never served by the untested version.

Conclusion

In this article, we walked through five different deployment strategies, each with its own set of advantages and challenges. The all-at-once and in-place deployment strategies stand out for their speed and the minimal effort required to deploy new versions of software. While these two strategies will be your go-to deployment strategies in most cases, it is still useful to understand the more complex and resource-intensive strategies. Ultimately, implementing any deployment strategy requires careful consideration of the potential impact on both the system and its users. The choice of deployment strategy should align with your project's needs, risk tolerance, and operational capabilities.