
A Good Database BUG

People generally do not think of cockroaches positively, but I have nothing but good feelings about CockroachDB. At its core, CockroachDB is resilient and reliable.

Cockroach Labs, a software company known for its cloud-native SQL databases, has found a home in Bengaluru, India. With a rapidly growing team of over 55 engineers specializing in database and cloud engineering, the company’s journey in India is as much about emotional ties as it is about strategic growth.

The choice of Bengaluru is strategic: the city offers unparalleled time zone advantages and access to a rich talent pool. With a population of 1.4 billion and a rapidly digitizing economy, India is an ideal proving ground for CockroachDB’s resilience and scalability.

The company plans to expand its Bengaluru office into a first-class R&D hub. Teams are working on innovations like vector data integration for AI, enabling operational databases to evolve into systems capable of real-time intelligence.

Building Blocks of CockroachDB

The founders lacked a transactional distributed database and had to rely on DynamoDB, which led to inefficiencies in their early startup years. This frustration led to the birth of Cockroach Labs in 2014, with a vision to create an open-source, cloud-native distributed database.


I am a HUGE advocate of open-source databases, so this journey is intriguing. Not sitting with inefficiencies but finding a way to grow beyond them is a significant step for any startup.

True to its name, CockroachDB has built a reputation for resilience. It can run seamlessly across cloud providers, private data centers, and hybrid setups, making it a standout choice. Cockroach Labs focuses on eliminating vendor lock-in and ensuring businesses can operate uninterrupted, even during cloud or data center outages. I can’t say enough how important it is not to be locked into one cloud provider. This is a serious flex for an open-source database NOT to be “vendor dependent.” Staying in the driver’s seat and not becoming a passenger or going along for a ride with a service provider is ideal. Retaining the power of “choice” as a customer is priceless. This adaptability has made Cockroach Labs the operational backbone for global giants like Netflix and ambitious startups like Fi.

Sharing some notes from my exploration:

Getting Started

Install CockroachDB on Ubuntu (using Bash Shell):

1. Update Your System: First, update your system packages to the latest version:
   
   sudo apt update -y
   sudo apt upgrade -y
   

2. Install Dependencies: Install the required dependencies:
   
   sudo apt install -y apt-transport-https ca-certificates curl software-properties-common
  

3. Download CockroachDB: Download the latest version of CockroachDB:
   
   wget -qO- https://binaries.cockroachdb.com/cockroach-latest.linux-amd64.tgz | tar xvz
   

4. Move CockroachDB Binary: Move the binary to a directory in your PATH:

      sudo cp -i cockroach-latest.linux-amd64/cockroach /usr/local/bin/
   

5. Verify Installation: Verify the installation by checking the CockroachDB version:
   
   cockroach version
   

6. Initialize CockroachDB Cluster: Create a directory for CockroachDB data and initialize the cluster:
   
   sudo mkdir -p /var/lib/cockroach
   sudo chown $(whoami) /var/lib/cockroach
   cockroach start-single-node --insecure --store=/var/lib/cockroach --listen-addr=localhost:26257 --http-addr=localhost:8080
   

7. Connect to CockroachDB SQL Shell: Connect to the CockroachDB SQL shell:
   
   cockroach sql --insecure --host=localhost:26257
   

8. Run CockroachDB as a Background Service: Create a systemd service file to run CockroachDB as a background service:
   
   sudo nano /etc/systemd/system/cockroach.service
   
   Add the following configuration:
   [Unit]
   Description=CockroachDB
   Documentation=https://www.cockroachlabs.com/docs/

   [Service]
   Type=notify
   ExecStart=/usr/local/bin/cockroach start-single-node --insecure --store=/var/lib/cockroach --listen-addr=localhost:26257 --http-addr=localhost:8080
   TimeoutStartSec=0
   Restart=always
   RestartSec=10

   [Install]
   WantedBy=multi-user.target
   

9. Enable and Start the Service: Reload the systemd manager configuration, start the CockroachDB service, and enable it to run on system startup:
   
   sudo systemctl daemon-reload
   sudo systemctl start cockroach
   sudo systemctl enable cockroach
   sudo systemctl status cockroach
   

CockroachDB is now installed and running on your Ubuntu system. 
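With the service running, a quick smoke test from the SQL shell confirms the node is reachable and accepting writes. These are standard CockroachDB statements, run inside `cockroach sql`:

```sql
-- Confirm the server version and list the default databases.
SELECT version();
SHOW DATABASES;

-- Create and drop a scratch database as a simple read/write check.
CREATE DATABASE smoke_test;
DROP DATABASE smoke_test;
```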

Cockroach Labs continues to invest heavily in AI-specific features, including support for vector similarity searches and operationalizing AI workflows.

Here's an example of how you can use CockroachDB with AI, specifically leveraging vector search for similarity searches:

1. Install CockroachDB: Follow the steps I provided earlier to install CockroachDB on your system.

2. Create a Database and Table: Connect to CockroachDB and create a database and table to store your data (recent CockroachDB releases include a pgvector-compatible VECTOR type):
 
   cockroach sql --insecure --host=localhost:26257
   CREATE DATABASE ai_example;
   USE ai_example;
   CREATE TABLE vectors (id INT PRIMARY KEY, vector VECTOR(3) NOT NULL);
 

3. Insert Data: Insert some sample data into the table (steps 3–5 run at the SQL prompt):

   INSERT INTO vectors (id, vector) VALUES (1, '[1.0, 2.0, 3.0]'), (2, '[4.0, 5.0, 6.0]');


4. Enable the Vector Type: In CockroachDB the vector type is built in, so no extension needs to be installed. On PostgreSQL, the same statements work after enabling the `pgvector` extension (note the extension itself is named `vector`):

   CREATE EXTENSION IF NOT EXISTS vector;


5. Perform Similarity Search: Use the pgvector-style distance operator `<->` (Euclidean distance; the smallest distance is the closest match):

   SELECT id, vector, vector <-> '[2.0, 3.0, 4.0]' AS distance
   FROM vectors
   ORDER BY distance ASC;
 

Create a table to store vectors, and perform similarity searches using `pgvector`-style queries. This kind of search works by comparing high-dimensional vectors, making it useful for tasks like finding similar items in recommendation systems, a common AI workload.

Yes. CockroachDB is compatible with PostgreSQL, which means you can use many PostgreSQL tools, libraries, and client applications. That compatibility is also a helpful bridge when learning this database, which is a plus.
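As a small illustration of that compatibility, the following statements use familiar PostgreSQL features (`gen_random_uuid()`, `JSONB`, and a `RETURNING` clause) and run unchanged on CockroachDB. The table and column names here are purely illustrative:

```sql
CREATE TABLE users (
    id      UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    profile JSONB
);

-- RETURNING hands back the generated key, as in PostgreSQL.
INSERT INTO users (profile)
VALUES ('{"name": "Asha", "plan": "free"}')
RETURNING id;

-- JSONB operators work as they do in PostgreSQL.
SELECT profile->>'name' AS name FROM users;
```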

I am looking forward to testing these new developments from Cockroach Labs. There is a wealth of information in their repository (linked below), as well as a number of repos from the open-source database community. Their investment in AI is key to the company’s sustainable growth.

https://github.com/cockroachlabs

https://www.cockroachlabs.com

Learn more about pgvector in this repo

Dynamic Views – The Storytellers of SQL Server

The successful management of data is an essential part of an administrator’s role. Although third-party tools can be helpful, SQL Server has built-in functionality that can provide valuable information to most technical teams. As data changes over time, the performance of the data environment will also change. Administrators and Engineers rely on system-generated data within SQL Server and its management tools to resolve issues and monitor SQL Server’s performance. This discussion examines those built-in resources and how they are used to manage the performance of SQL Server. Monitoring is one way to see changes over time, and visual graphs or charts can help spot trends or changes within the data.

Let’s briefly consider a few of the built-in features. SQL Server Management Studio has an Activity Monitor.  The Activity Monitor within SQL Server provides insights on the activities and processes running within SQL Server.  Activity Monitor Overview Pane displays the following data:

  • Processor Time
    • This counter shows how busy the CPUs are and whether they are operating at an acceptable level. 
  • Waiting Tasks
    • The number of tasks that are waiting for processor, I/O, or memory resources to be released so they can be processed.
  • Database I/O
    •  The data transfer rate in MB/s from memory to disk, disk to memory, or disk to disk.
  • Batch Requests/sec
    • The number of T-SQL command batches received by the server per second.

Expanded View (activity is queried only when the panel is expanded)

  • Processes
    • Displays information about active running processes on the SQL databases; including login ID, database, command, and application name.
  • Resource Waits
    • Displays information about waits for resources such as memory, disk, and network.
  • Data File I/O
    • Displays information about the database files (MDF, LDF, and NDF) in the SQL Server instance, including their paths and names.
  • Recent Expensive Queries
    • The pane shows expensive queries executed in the last 30 seconds
  • Active Expensive Queries
    • Displays a list of queries that utilize a large number of resources – memory, disk, and network.

Windows provides a Performance Monitor (PerfMon) to add customized counters related to SQL Server

The following types of counters are available for SQL Server:

    • Avg. Disk sec/Read
      • The average time, in seconds, of a read of data from a disk.
    • Avg. Disk sec/Write
      • The average time in seconds needed to write data to a disk.
    • Page Life Expectancy
      • The number of seconds a data file page is expected to remain in memory in the buffer cache. It is not based on averages.
    • Buffer cache hit ratio 
      • The percentage of pages found in the SQL buffer pool without reading from disk. A low value on its own may not indicate a problem. 
    • Compilations/sec
      • Measures the number of times SQL Server compiles an execution plan per second. 
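Many of the same counters can also be read from inside SQL Server through the sys.dm_os_performance_counters view, which is handy when PerfMon itself is not available. A sketch (exact counter names can vary slightly between versions):

```sql
SELECT [object_name], counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN (N'Page life expectancy',
                       N'Buffer cache hit ratio',
                       N'Batch Requests/sec',
                       N'SQL Compilations/sec');
```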

SQL Profiler is the GUI for SQL Trace. It can trace the steps of, and analyze, queries and processes executed against SQL Server. The results are saved in a file that can be replayed for a closer review, which helps reproduce an error for troubleshooting.

All of these monitoring tools are useful but limited by the amount of data each can provide. In most cases, a closer look into the system data is needed. User-created data and objects pose the most risk to any system. The original schema, concept, or design of the data can change over time. Improper or limited scaling, poorly written queries, data ingestion, and access patterns can all impact the database’s performance. The discovery of an issue means the technologist must become a detective, working backward from the known results or impact to the unknown cause.

SQL Server is designed not only to store data and related objects but also to collect insights about its own running state. This information records the activities performed and executed on the SQL Server since its last reboot. Unless stored within a table, it is cleared whenever SQL Server is restarted. These insights are exposed through Dynamic Views.

Dynamic management views return server state information that can be used to monitor the health of a server instance, diagnose problems, and tune performance.[i]  Database-based dynamic management views are the storytellers of the database. Each of these views can be used to tell the story of the data by diagnosing performance issues in SQL Server.
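For a first taste of that storytelling, the server-level view sys.dm_os_wait_stats summarizes what the instance has been waiting on since the last restart:

```sql
-- Top waits since the last restart (times in milliseconds).
SELECT TOP 10 wait_type,
       waiting_tasks_count,
       wait_time_ms,
       signal_wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;
```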

Understanding Dynamic Views

Dynamic views can play an essential role in troubleshooting bothersome performance issues, and Engineers and Administrators are often called upon to use them for exactly that. The results from dynamic views can be customized to contain only the data that you need. VIEW DATABASE STATE permission is needed in SQL Server to see the database-level dynamic views.

There are two types of dynamic management views and functions:

  • Server-based dynamic management views and functions. These require VIEW SERVER STATE permission on the server.
  • Database-based dynamic management views and functions. These require VIEW DATABASE STATE permission on the database.

The key to utilizing these views is selecting the appropriate fields. Most views contain numerous columns; generally, only a few are needed to troubleshoot most problems.
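Granting those permissions is a one-line statement in each scope. The login and user names below are placeholders:

```sql
-- Server scope: required for server-based DMVs.
GRANT VIEW SERVER STATE TO [monitor_login];

-- Database scope: required for database-based DMVs.
GRANT VIEW DATABASE STATE TO [monitor_user];
```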

The Problem Query

The complexity of a query can impact performance.  A complex query is beyond a basic query with a SELECT statement against one table. A complex query can involve inner and outer joins against multiple tables. It can combine sub-queries, nested commands, AND, OR, and WHERE clauses.

The purpose of a query is to retrieve specific data from a single or joined table.  The data you want to see can be spread across several tables, and queries allow you to view it on a single page. The performance becomes a challenge when the data starts to scale. The schema can change; tables are no longer used, field names change, or more attributes are added.  As the data scales the indexes once used may no longer provide the same value. An upgrade or user-based changes can trigger these problems.

Queries are not intentionally designed poorly. Over time, the usefulness of some queries wears out.  Data often scales beyond the scope of what the query was initially intended. Query plans define the order and method of execution; each can be impacted if the underlying data or data objects change. Poorly documented changes within a database environment contribute to the pool of challenged queries.

Blocking and Locking in SQL

In SQL Server, data is accessed by applications, services, and users. Each process running in SQL Server has its own unique ID called a SPID (Server Process ID). These process IDs represent individual sessions in SQL Server. Each time an application connects to SQL Server, a new connection (or SPID) is created. The database engine issues locks to prevent simultaneous changes to the same data. When another request comes in, it is blocked until the data is released by the previous session. For example, a SELECT * statement without a WHERE clause has the potential to cause lock contention. A production environment can become problematic if multiple sessions are locked and blocked; one way to view this activity is to use the dynamic views.
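A minimal way to see blocking in action is to hold a lock open in one session and touch the same row from a second session. The table here is hypothetical; @@SPID returns the current session’s SPID:

```sql
-- Session 1: open a transaction and leave it uncommitted.
SELECT @@SPID AS my_spid;      -- note this session's ID
BEGIN TRANSACTION;
UPDATE dbo.Orders SET Quantity = 5 WHERE OrderID = 1;
-- (no COMMIT yet: the row stays exclusively locked)

-- Session 2: this statement blocks until session 1 commits or rolls back.
SELECT * FROM dbo.Orders WHERE OrderID = 1;

-- Session 1: release the lock.
COMMIT TRANSACTION;
```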

In older releases of SQL Server, one of the more common stored procedures used to determine blocking is sp_who.  Now, Administrators are being referred to dynamic views as a key source of information.

There are at least four dynamic views that can be used to dig into this performance issue.  The dynamic view sys.dm_tran_locks shows information about the lock requests currently active in SQL Server.

The view sys.dm_os_waiting_tasks shows the queue of tasks that are waiting for SQL resources. The view sys.dm_exec_requests details information about each request. The view sys.dm_exec_sql_text returns the identified text SQL.
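The first two views can be queried directly while a block is in progress. A sketch selecting a few of the most useful columns:

```sql
-- Active lock requests: who holds what, and who is waiting.
SELECT request_session_id, resource_type, request_mode, request_status
FROM sys.dm_tran_locks
WHERE resource_type <> 'DATABASE';

-- Waiting tasks, with the session that is blocking each one.
SELECT session_id, wait_type, wait_duration_ms, blocking_session_id
FROM sys.dm_os_waiting_tasks
WHERE blocking_session_id IS NOT NULL;
```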

The example shown below presents the results of a query[ii] using sys.dm_exec_requests and sys.dm_exec_sql_text.  In this example, the Microsoft sample database WideWorldImporters has two SELECT statements running against the same table.  Until one statement completes its execution, the other is blocked.

SELECT session_id,
       db_name(database_id) AS 'Database_Name',
       sql_text.text AS 'SQL_Request_Blocked',
       blocking_session_id
FROM sys.dm_exec_requests
CROSS APPLY sys.dm_exec_sql_text(sql_handle) AS sql_text
WHERE blocking_session_id != 0;

In this example, a script[ii] provided by Microsoft combines dynamic views to display the text of SQL queries being executed in batches and provides statistical information about each of the query plans.

SELECT s2.dbid,
       s1.sql_handle,
       (SELECT TOP 1 SUBSTRING(s2.text, statement_start_offset / 2 + 1,
              ((CASE WHEN statement_end_offset = -1
                     THEN (LEN(CONVERT(nvarchar(max), s2.text)) * 2)
                     ELSE statement_end_offset END) - statement_start_offset) / 2 + 1)) AS sql_statement,
       execution_count,
       plan_generation_num,
       last_execution_time,
       total_worker_time,
       last_worker_time,
       min_worker_time,
       max_worker_time,
       total_physical_reads,
       last_physical_reads,
       min_physical_reads,
       max_physical_reads,
       total_logical_writes,
       last_logical_writes,
       min_logical_writes,
       max_logical_writes
FROM sys.dm_exec_query_stats AS s1
CROSS APPLY sys.dm_exec_sql_text(s1.sql_handle) AS s2;

The worker-time columns represent the timings on query plans and their execution; all timings are based on microseconds.  Individual statements of SQL can be taken from this query to view what is being executed.  By viewing the SQL text, the code can be analyzed carefully to determine whether it is impacting the system or the business process. It is also helpful to execute it outside of the application to debug it or trace the activity.  The query results can help determine the root cause of performance issues when the data is the culprit or the victim.

A deeper look into these stats with the following query will return the top five most expensive queries for review:

Expensive Queries

SELECT TOP 5 total_worker_time / execution_count AS [Avg CPU Time],
       SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
           ((CASE qs.statement_end_offset
                 WHEN -1 THEN DATALENGTH(st.text)
                 ELSE qs.statement_end_offset
             END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY total_worker_time / execution_count DESC;


Wrapping Up 

To maintain a highly available system, the database needs to be responsive and reliable. Performance and reliability are key components of business processes. These goals are attainable by taking the time to become more aware of the built-in functions of SQL Server. These tools and views are designed to provide insight into some of the most challenging areas of SQL Server. Consider the latest release of SQL Server. Microsoft is making strides in handling all the nuances of data from within. Database Administrators and Data Engineers can certainly add dynamic views to their arsenal of tools used to manage the performance of the SQL environment.