Database optimization for high-performance Java applications. Part 2
Perform a single request instead of several requests
Here are some techniques to consolidate database interactions into fewer, more efficient requests:
Using views
A database view is essentially a stored query accessible as a virtual table in a relational database. It does not store data itself but fetches data from other tables when accessed. Views can simplify complex queries by encapsulating joins, filters and calculations, allowing you to retrieve combined data with a single query instead of multiple separate queries. This can significantly reduce the complexity of client-side data handling and decrease the network load.
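As a brief illustration, here is a minimal JDBC sketch, assuming hypothetical Orders and Customers tables and PostgreSQL-style SQL; the table names, columns and connection URL are placeholders, and in practice the view would usually be created once by a migration script rather than by application code:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class ViewExample {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/shop", "app", "secret");
             Statement stmt = conn.createStatement()) {

            // Define the view once; it encapsulates the join and the filter.
            stmt.execute(
                "CREATE OR REPLACE VIEW ActiveOrderSummary AS " +
                "SELECT o.OrderID, c.Name, o.Total " +
                "FROM Orders o JOIN Customers c ON o.CustomerID = c.CustomerID " +
                "WHERE o.Status = 'ACTIVE'");

            // Clients now issue one simple query instead of repeating the join everywhere.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT OrderID, Name, Total FROM ActiveOrderSummary")) {
                while (rs.next()) {
                    System.out.printf("%d %s %s%n",
                        rs.getLong("OrderID"), rs.getString("Name"), rs.getBigDecimal("Total"));
                }
            }
        }
    }
}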
Using UNION
The UNION SQL operator combines the result sets of two or more SELECT statements. Each SELECT statement within the UNION must return the same number of columns, with compatible data types in corresponding positions. By using UNION, you can efficiently retrieve disparate pieces of information from different tables in a single query, reducing the need for multiple queries and consolidation on the client side. Note that UNION removes duplicate rows; when duplicates are acceptable, UNION ALL is usually faster because it skips the deduplication step.
For example, if you need to fetch a unified list of all active employees and contractors for a company-wide notification system, you might use a UNION to combine results from separate tables for employees and contractors, assuming they share a compatible structure.
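A minimal sketch of that scenario, assuming hypothetical Employees and Contractors tables that both expose Name, Email and Active columns, and a placeholder sendNotification() helper standing in for the real notification system:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class NotificationRecipients {

    // One round trip returns both populations; UNION also removes duplicate rows
    // (use UNION ALL to keep them and skip the deduplication step).
    private static final String SQL =
        "SELECT Name, Email FROM Employees WHERE Active = TRUE " +
        "UNION " +
        "SELECT Name, Email FROM Contractors WHERE Active = TRUE";

    public static void notifyAllRecipients(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(SQL)) {
            while (rs.next()) {
                sendNotification(rs.getString("Name"), rs.getString("Email"));
            }
        }
    }

    private static void sendNotification(String name, String email) {
        System.out.printf("Notifying %s <%s>%n", name, email); // placeholder
    }
}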
Using batch processing
Batch processing is particularly useful when you need to execute multiple insert, update or delete operations. Instead of performing each operation as an individual transaction, which can be slow due to transaction overhead and network latency, batch processing sends a group of operations in one go. This reduces the number of database connections and transactions, speeding up overall processing time.
In JDBC, for instance, you can use batch processing by adding multiple update statements to a single batch using the addBatch() method of PreparedStatement and then executing them together with executeBatch(). This is significantly faster than executing each statement individually, as the sketch below shows.
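A minimal sketch, assuming a hypothetical Users table and a simple User record; grouping the batch into one transaction also avoids per-statement commit overhead:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchInsertExample {

    public record User(String name, String email) {}

    // Inserts all rows in one batch (and one transaction)
    // instead of paying a round trip per row.
    public static void insertUsers(Connection conn, List<User> users) throws SQLException {
        String sql = "INSERT INTO Users (Name, Email) VALUES (?, ?)";
        conn.setAutoCommit(false); // group the whole batch into a single transaction
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (User u : users) {
                ps.setString(1, u.name());
                ps.setString(2, u.email());
                ps.addBatch();     // queue the statement locally
            }
            ps.executeBatch();     // send all queued statements in one go
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}

Note that some drivers need extra configuration to get the full benefit; MySQL Connector/J, for example, only rewrites batched inserts into multi-row statements when rewriteBatchedStatements=true is set on the connection URL.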
Request granularity
Request granularity refers to the size and complexity of the data being retrieved in a single database query. Optimizing request granularity means striking the right balance between fetching too little and too much data in one go; a short sketch contrasting the two styles follows the lists below.
- Fine-grained requests: These involve retrieving exactly what is needed without any excess, ideal for situations where bandwidth or data processing is a concern. However, issuing many fine-grained requests can lead to high cumulative network overhead, since each small query pays the cost of a round trip.
- Coarse-grained requests: These fetch large chunks of data at once, which can be efficient when network latency is a bottleneck. The downside is the potential for over-fetching data, which can strain memory and processing resources on the client side.
Optimal request granularity should be determined based on:
- Network latency: Higher latency often justifies more coarse-grained requests.
- Data usage patterns: How the data will be used and displayed can dictate the amount of data to be retrieved in a single request.
- Application performance requirements: The balance between responsiveness and resource usage.
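To make the trade-off concrete, here is a JDBC sketch assuming a hypothetical Users table: the first method issues one small query per ID, so latency grows with the size of the list, while the second fetches everything in a single round trip with an IN clause, at the cost of building a larger query and result set:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GranularityExample {

    // Fine-grained: one round trip per ID; each query is tiny, but latency multiplies.
    public static Map<Long, String> namesOneByOne(Connection conn, List<Long> ids)
            throws SQLException {
        Map<Long, String> names = new HashMap<>();
        try (PreparedStatement ps =
                 conn.prepareStatement("SELECT Name FROM Users WHERE UserID = ?")) {
            for (long id : ids) {
                ps.setLong(1, id);
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) {
                        names.put(id, rs.getString(1));
                    }
                }
            }
        }
        return names;
    }

    // Coarse-grained: one round trip for the whole list via an IN clause.
    // Assumes ids is non-empty; an empty IN () list is invalid SQL.
    public static Map<Long, String> namesInOneQuery(Connection conn, List<Long> ids)
            throws SQLException {
        String placeholders = String.join(",", Collections.nCopies(ids.size(), "?"));
        String sql = "SELECT UserID, Name FROM Users WHERE UserID IN (" + placeholders + ")";
        Map<Long, String> names = new HashMap<>();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (int i = 0; i < ids.size(); i++) {
                ps.setLong(i + 1, ids.get(i));
            }
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    names.put(rs.getLong(1), rs.getString(2));
                }
            }
        }
        return names;
    }
}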
Using JDBC instead of JPA
While the Java Persistence API (JPA) simplifies data access, it can also introduce performance-sapping overhead. In cases where raw speed is the top priority, using the lower-level JDBC API directly can yield better results.
Of course, the productivity gains of JPA are invaluable in many enterprise applications. The key is to profile your workload and surgically apply JDBC optimizations where they matter most.
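For instance, a hot read path might bypass the JPA entity lifecycle entirely and run a plain JDBC projection, skipping entity hydration, dirty checking and the persistence context. A minimal sketch, assuming a hypothetical Users table:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Optional;
import javax.sql.DataSource;

public class HotPathQuery {

    // Reads a single column directly; no entity is instantiated or tracked.
    public static Optional<String> findEmailByName(DataSource ds, String name)
            throws SQLException {
        try (Connection conn = ds.getConnection();
             PreparedStatement ps =
                 conn.prepareStatement("SELECT Email FROM Users WHERE Name = ?")) {
            ps.setString(1, name);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? Optional.of(rs.getString(1)) : Optional.empty();
            }
        }
    }
}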
Database denormalization
Relational databases usually store data in normalized form, meaning data is not duplicated. While normalization facilitates easier maintenance and data consistency, it is not always the optimal choice for performance, particularly in read-heavy environments where complex queries are common.
Normalization often leads to the creation of additional tables to ensure that each one represents a single entity or concept. This design principle can lead to an increased number of joins needed in queries, potentially degrading performance as the database grows in size and complexity.
Consider a simple database holding information about users and their addresses. In a normalized design, you might have two tables:
- Users: UserID, Name, Email
- Addresses: AddressID, UserID, Street, City, Country, Zip
To retrieve a list of users along with their addresses, a JOIN operation between the Users and Addresses tables is necessary:
SELECT Users.Name, Addresses.Street, Addresses.City
FROM Users JOIN Addresses
ON Users.UserID = Addresses.UserID;
This JOIN operation adds cost to every read and grows more expensive as the tables do; denormalization allows us to optimize it away. Denormalization deliberately introduces redundancy into a database to reduce the complexity of critical queries.
To improve the performance of the user-address query, you can denormalize the data by combining the Users and Addresses tables into a single table:
- UsersAndAddresses: UserID, Name, Email, Street, City, Country, Zip
Now, the same information can be retrieved with a simpler and faster query since no join operation is required:
SELECT Name, Email, Street, City, Country, Zip
FROM UsersAndAddresses;
This design allows for quicker data retrieval by eliminating the need for JOIN operations. It is particularly beneficial in scenarios where read performance is critical, and data modifications are infrequent or can be managed with careful update mechanisms.
However, using denormalization can lead to various issues:
- Data redundancy: Denormalization increases storage requirements due to duplicated data. This can also lead to potential inconsistencies if not all instances of the data are updated simultaneously.
- Update anomalies: When data that exists in multiple places needs to be updated, the update must be made in several places, increasing the complexity of data management.
- Balance between read and write performance: While denormalization can significantly improve read performance, it often does so at the expense of write performance. Decisions on denormalization should consider the typical workload of the application (read-heavy vs. write-heavy).
To avoid these issues, we can use two intermediate solutions:
- Materialized views
- CQRS pattern
Let’s discuss these in more detail.
Materialized views
A materialized view creates a physical snapshot of a result set, which is usually generated from a complex query. This snapshot is stored in the database and can be refreshed on demand or at regular intervals. Materialized views are particularly useful for improving the performance of complex joins and aggregations that are queried frequently but do not change often.
Given the normalized tables for Users and Addresses, a materialized view can be created to store the result of the join operation. This provides fast access to combined user and address information without the need to denormalize the tables into a single UsersAndAddresses table.
CREATE MATERIALIZED VIEW UserAddressView AS
SELECT Users.UserID, Users.Name, Users.Email, Addresses.Street, Addresses.City, Addresses.Country, Addresses.Zip
FROM Users
JOIN Addresses ON Users.UserID = Addresses.UserID;
Usage:
SELECT Name, Email, Street, City, Country, Zip
FROM UserAddressView;
This query retrieves user and address information without the overhead of a join operation at query time, as the join has already been performed when the materialized view was last refreshed.
Materialized views can be updated using several strategies, each with specific use cases:
- On-demand refresh: This strategy allows the materialized view to be manually refreshed whenever needed, typically in response to known events such as the completion of a bulk data load. It is ideal for applications where data changes infrequently or where slightly stale data between refreshes is acceptable. In PostgreSQL, for example:
REFRESH MATERIALIZED VIEW UserAddressView;
- Scheduled refresh: In this approach, the materialized view is refreshed at scheduled intervals, such as nightly or weekly. This is suitable for reports or dashboards where the latest data is needed only at certain times and the overhead of frequent refreshes cannot be justified (a sketch of this approach follows the list).
- Trigger-based refresh: Some systems support refreshing materialized views through triggers, where updates to the underlying data automatically initiate a refresh of the view. This method attempts to keep the materialized view as up to date as possible without manual intervention, but it can be resource intensive.
- Incremental refresh: Advanced implementations allow for incremental refreshes where only the changed data is used to update the materialized view, rather than recomputing the entire view. This can significantly reduce the load on the database, making refreshes faster and more efficient.
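As a sketch of the scheduled strategy driven from the application side (assuming PostgreSQL-style REFRESH syntax and the UserAddressView defined earlier), a plain ScheduledExecutorService is enough; in production a database-side job scheduler such as pg_cron, or a framework scheduler, would often be preferred:

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import javax.sql.DataSource;

public class ViewRefreshScheduler {

    // Refreshes the materialized view once a day; if a refresh fails,
    // the view simply keeps serving the previous (stale) snapshot.
    public static ScheduledExecutorService start(DataSource ds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try (Connection conn = ds.getConnection();
                 Statement stmt = conn.createStatement()) {
                stmt.execute("REFRESH MATERIALIZED VIEW UserAddressView");
            } catch (SQLException e) {
                e.printStackTrace(); // log and alert in a real application
            }
        }, 0, 24, TimeUnit.HOURS);
        return scheduler;
    }
}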
Materialized views are supported by various RDBMS, each offering different features and capabilities. Implementations include Oracle, PostgreSQL, SQL Server (under the name indexed views) and others.
However, using materialized views introduces some pitfalls:
- Staleness of data: Unless refreshed frequently, the data in materialized views can become outdated, which might not be acceptable for dynamic applications that require real-time data.
- Resource-intensive refreshes: Refreshing materialized views, especially full refreshes, can be resource-intensive and affect the performance of the database, particularly if the underlying data is large or the views are complex.
- Storage overhead: Materialized views require additional storage since they physically store the query results. This can be substantial, depending on the size of the data and the complexity of the queries.
- Complexity in management: Managing materialized views involves additional complexity, such as deciding when and how often to refresh them, which can increase the administrative burden.
Let’s discuss another way to keep the database normalized and performant at the same time: the CQRS pattern.