
A Good Database BUG

People generally do not think of cockroaches positively, but I have nothing but good feelings about CockroachDB. At its core, CockroachDB is resilient and reliable.

Cockroach Labs, a software company known for its cloud-native SQL databases, has found a home in Bengaluru, India. With a rapidly growing team of over 55 engineers specializing in database and cloud engineering, the company’s journey in India is as much about emotional ties as it is about strategic growth.

The choice of Bengaluru is strategic: it offers unparalleled time-zone advantages and access to a rich talent pool. With a population of 1.4 billion and a rapidly digitizing economy, India is an ideal proving ground for CockroachDB’s resilience and scalability.

The company plans to expand its Bengaluru office into a first-class R&D hub. Teams are working on innovations like vector data integration for AI, enabling operational databases to evolve into systems capable of real-time intelligence.

Building Blocks of CockroachDB

In their early startup years, the founders lacked a transactional distributed database and had to rely on DynamoDB, which led to inefficiencies. That frustration led to the birth of Cockroach Labs in 2014, with a vision to create an open-source, cloud-native distributed database.


I am a HUGE advocate of open-source databases, so this journey is intriguing. Not sitting with inefficiencies but finding a way to grow beyond them is a significant step for any startup.

True to its name, CockroachDB has built a reputation for resilience. It runs seamlessly across cloud providers, private data centers, and hybrid setups, making it a standout choice. Cockroach Labs focuses on eliminating vendor lock-in and ensuring businesses can operate uninterrupted, even during cloud or data center outages. I can’t say enough how important it is not to be locked into one cloud provider. It is a serious flex for an open-source database NOT to be “vendor dependent.” Staying in the driver’s seat rather than going along for the ride with a service provider is ideal; retaining the power of choice as a customer is priceless. This adaptability has made Cockroach Labs the operational backbone for global giants like Netflix and ambitious startups like Fi.

Sharing some notes from my exploration:

Getting Started

Install CockroachDB on Ubuntu (using Bash Shell):

1. Update Your System: First, update your system packages to the latest version:
   
   sudo apt update -y
   sudo apt upgrade -y
   

2. Install the required dependencies:
   
   sudo apt install -y apt-transport-https ca-certificates curl software-properties-common
  

3. Download the latest version of CockroachDB:

   curl https://binaries.cockroachdb.com/cockroach-v24.3.1.linux-amd64.tgz | tar -xvz

   or, for the latest build:

   curl https://binaries.cockroachdb.com/cockroach-latest.linux-amd64.tgz | tar -xvz
   

4. Move the binary to a directory in your PATH (adjust the directory name if you downloaded a different version):

   sudo cp -i cockroach-v24.3.1.linux-amd64/cockroach /usr/local/bin/
   

5. Verify the installation by checking the CockroachDB version:
   
   cockroach version
   

6. Start a Single-Node Cluster: Create a directory for CockroachDB data and start a single-node cluster:
   
   sudo mkdir -p /var/lib/cockroach
   sudo chown $(whoami) /var/lib/cockroach
   cockroach start-single-node --insecure --store=/var/lib/cockroach --listen-addr=localhost:26257 --http-addr=localhost:8080
   

7. Connect to CockroachDB SQL Shell: Connect to the CockroachDB SQL shell:
   
   cockroach sql --insecure --host=localhost:26257
   

8. Run CockroachDB as a Background Service: Create a systemd service file to run CockroachDB as a background service:
   
   sudo nano /etc/systemd/system/cockroach.service
   
   Add the following configuration:

   [Unit]
   Description=CockroachDB
   Documentation=https://www.cockroachlabs.com/docs/

   [Service]
   Type=notify
   ExecStart=/usr/local/bin/cockroach start-single-node --insecure --store=/var/lib/cockroach --listen-addr=localhost:26257 --http-addr=localhost:8080
   TimeoutStartSec=0
   Restart=always
   RestartSec=10

   [Install]
   WantedBy=multi-user.target
   

9. Enable and Start the Service: Reload the systemd manager configuration, start the CockroachDB service, and enable it to run on system startup:
   
   sudo systemctl daemon-reload
   sudo systemctl start cockroach
   sudo systemctl enable cockroach
   sudo systemctl status cockroach
   
CockroachDB is now installed and running on your Ubuntu system. 
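
Before moving on, a quick smoke test never hurts. A minimal check, assuming the insecure single-node setup above:

   # confirm the node is up and serving
   cockroach node status --insecure --host=localhost:26257

   # run a throwaway query without opening an interactive shell
   cockroach sql --insecure --host=localhost:26257 -e "SELECT 1;"

The Admin UI should also be reachable at http://localhost:8080, since we passed --http-addr above.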

Cockroach Labs continues to invest heavily in AI-specific features, including support for vector similarity searches and operationalizing AI workflows.

Here's an example of how you can use CockroachDB with AI, specifically leveraging vector search for similarity searches:

1. Install CockroachDB: Follow the steps I provided earlier to install CockroachDB on your system.

2. Connect to CockroachDB and create a database and table to store your data (CockroachDB v24.2+ ships a pgvector-compatible VECTOR type):

   cockroach sql --insecure --host=localhost:26257
   CREATE DATABASE ai_example;
   USE ai_example;
   CREATE TABLE vectors (id INT PRIMARY KEY, vector VECTOR(3) NOT NULL);
 

3. Insert some sample data into the table (run steps 3-5 at the SQL prompt). pgvector-style vector literals are written as quoted, bracketed lists:

   INSERT INTO vectors (id, vector) VALUES (1, '[1.0, 2.0, 3.0]'), (2, '[4.0, 5.0, 6.0]');


4. Note that, unlike PostgreSQL, CockroachDB needs no `CREATE EXTENSION pgvector;` step: the pgvector-compatible type and distance operators are built into the database.

5. Perform a similarity search with the pgvector-style cosine-distance operator `<=>` (smaller means more similar, so sort ascending):

   SELECT id, vector, vector <=> '[2.0, 3.0, 4.0]' AS cosine_distance
   FROM vectors
   ORDER BY cosine_distance
   LIMIT 5;

   With the sample rows above, id 2 sorts first: its cosine distance to the query vector (about 0.0054) is smaller than that of id 1 (about 0.0074).
 

Create a table to store vectors, then perform similarity searches using pgvector-compatible syntax.
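
If you would rather run the whole example non-interactively, `cockroach sql -e` accepts statements on the command line. A minimal sketch, assuming the insecure single-node cluster from the install steps and a CockroachDB version with the VECTOR type (v24.2 or later):

   # create, load, and query the example schema in one shot
   cockroach sql --insecure --host=localhost:26257 -e "
   CREATE DATABASE IF NOT EXISTS ai_example;
   USE ai_example;
   CREATE TABLE IF NOT EXISTS vectors (id INT PRIMARY KEY, vector VECTOR(3) NOT NULL);
   UPSERT INTO vectors (id, vector) VALUES (1, '[1.0, 2.0, 3.0]'), (2, '[4.0, 5.0, 6.0]');
   SELECT id, vector <=> '[2.0, 3.0, 4.0]' AS cosine_distance FROM vectors ORDER BY cosine_distance;"

UPSERT is CockroachDB’s insert-or-update statement, so the script can be re-run safely.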

 "pgvector" enables similarity searches by comparing high-dimensional vectors, making it useful for tasks like finding similar items in recommendation systems, which is an AI tool. 

CockroachDB is compatible with PostgreSQL at the wire-protocol level, which means you can use many PostgreSQL tools, libraries, and client applications with it. That familiarity is a helpful bridge when learning this database, which is also a plus.
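
That wire compatibility is easy to verify: a stock psql client can connect to the demo cluster directly. A minimal sketch, assuming the insecure single-node setup above (root is the default superuser in insecure mode):

   psql "postgresql://root@localhost:26257/ai_example?sslmode=disable" -c "SELECT count(*) FROM vectors;"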

pgvector" enables similarity searches by comparing high-dimensional vectors, making it useful for tasks like finding similar items in recommendation systems, which is an AI tool.

Yes. CockroachDB is compatible with PostgreSQL, which means you can use many PostgreSQL tools, libraries, and client applications. This can be a bridge in learning about this database, which is also a plus.

I am looking forward to testing these new developments from Cockroach Labs. There is a wealth of information in their repository (linked below), as well as a number of repos from the open-source database community. Their investment in AI is key to the company’s sustainable growth.

https://github.com/cockroachlabs

https://www.cockroachlabs.com

Learn more about pgvector in this repo

Search in AI?

I may be stating the obvious, but search is an essential component of the AI ecosystem. Let’s see how these two work together.

First, let’s consider why we need to search:

Information Retrieval:

Search is crucial for AI systems to retrieve relevant information from large volumes of unstructured data. Whether analyzing text documents, social media feeds, or sensor data, AI models must quickly locate and extract the most pertinent information to perform tasks such as sentiment analysis, recommendation systems, or decision-making processes.

Knowledge Discovery:

Search enables AI systems to discover patterns, relationships, and insights within vast datasets. By applying advanced search algorithms and techniques, AI can uncover hidden knowledge, identify trends, and extract valuable information from diverse sources. This knowledge discovery process enables businesses and organizations to make informed decisions, gain a competitive edge, and drive innovation.

Natural Language Understanding:

Search is a fundamental component of natural language understanding in AI. It enables systems to interpret user queries, comprehend context, and generate relevant responses. Whether voice assistants, chatbots, or question-answering systems, search algorithms are pivotal in understanding human language and providing accurate and context-aware responses.

The Infrastructure of Search in AI:

  • Data Ingestion and Indexing: The search infrastructure begins with ingesting data from various sources, including databases, documents, and real-time streams. The data is then transformed, preprocessed, and indexed to enable efficient search operations. Indexing involves creating a searchable representation of the data, typically using data structures like inverted indexes or trie-based structures, which optimize search performance.
  • Search Algorithms and Ranking: AI systems leverage various search algorithms to retrieve relevant information from the indexed data. These algorithms, such as term frequency-inverse document frequency (TF-IDF), cosine similarity, or BM25, rank the search results based on relevance to the query. Advanced techniques like machine learning-based ranking models can further enhance the precision and relevance of search results.
  • Query Processing: When a user submits a query, the search infrastructure processes it to understand its intent and retrieve the most relevant results. Natural language processing techniques, such as tokenization, stemming, and part-of-speech tagging, may enhance query understanding and improve search accuracy. Query processing also involves analyzing user context and preferences to personalize search results when applicable.
  • Distributed Computing: To handle the scale and complexity of modern AI systems, search infrastructure often employs distributed computing techniques. Distributed search engines, such as Apache Solr or Elasticsearch, use a distributed cluster of machines to store and process data (see the Elasticsearch sketch after this list). This distributed architecture enables high availability, fault tolerance, and efficient parallel processing, allowing AI systems to scale seamlessly and handle large volumes of data and user queries.
  • Continuous Learning and Feedback: AI-powered search systems continuously learn and adapt based on user feedback and analytics. User interactions, click-through rates, and relevance feedback help refine search algorithms and improve result ranking over time. This iterative learning process makes search systems increasingly more accurate and personalized, delivering better user experiences and enhancing the overall AI ecosystem.
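
To make these moving parts concrete, here is a minimal sketch against a local Elasticsearch node. It assumes Elasticsearch is running at localhost:9200, and the index name articles is made up for illustration; Elasticsearch ranks match queries with BM25 by default, which ties the indexing and ranking points above together:

   # index a document (Elasticsearch tokenizes it and updates its inverted index)
   curl -X PUT "localhost:9200/articles/_doc/1" \
     -H 'Content-Type: application/json' \
     -d '{"title": "Resilient Databases", "body": "CockroachDB keeps serving queries through node failures"}'

   # full-text search; hits come back ranked by BM25 relevance score
   curl -X GET "localhost:9200/articles/_search" \
     -H 'Content-Type: application/json' \
     -d '{"query": {"match": {"body": "node failures"}}}'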


Search is a fundamental component of AI, enabling information retrieval, knowledge discovery, and natural language understanding. The infrastructure supporting search in AI involves data ingestion, indexing, search algorithms, query processing, distributed computing, and continuous learning. By harnessing the power of search, AI systems can effectively navigate vast datasets, uncover valuable insights, and deliver relevant information to users. Embracing the search infrastructure is essential for unlocking the full potential of AI.

Azure OpenAI and Cognitive Search are a match made in the cloud.

Key-Value-Based Data Storage

Submitting to speak at technical events can be tedious as the number of people competing for a few spots grows. More than once, I have found myself with a presentation that didn’t get selected. Going back through that body of work, I found material I wanted to share. Although this is not a conference stage, I want to share my experience working with the Redis database. The presentation is a few years old, so I revisited it to see what has changed. I also find it inspiring to review this technology and see what it can do. Enjoy.

Open-source databases have gained significant popularity due to their flexibility, scalability, and cost-effectiveness. When storing key-value-based data, an open-source database like Redis offers several advantages. Let’s explore the benefits of using Redis and delve into a technical demonstration of how data is stored in Redis.

Items that could be used as a presentation deck:

  1. High Performance: Redis is known for its exceptional performance, making it ideal for applications that require low latency and high throughput. It stores data in memory, allowing for swift read and write operations. Additionally, Redis supports various data structures, such as strings, hashes, lists, sets, and sorted sets, providing the flexibility to choose the appropriate structure based on the application’s requirements (a short redis-cli session after this list shows several of these in action).
  2. Scalability: Redis is designed to be highly scalable vertically and horizontally. Vertical scaling involves increasing the resources of a single Redis instance, such as memory, CPU, or storage, to handle larger datasets. Horizontal scaling involves setting up Redis clusters, where data is distributed across multiple nodes, providing increased capacity and fault tolerance. This scalability allows Redis to handle growing workloads and accommodate expanding datasets.
  3. Persistence Options: While Redis primarily stores data in memory for optimal performance, it also provides persistence options to ensure data durability. Redis supports snapshotting, which periodically saves a snapshot of the in-memory data to disk. Additionally, it offers an append-only file (AOF) persistence mechanism that logs all write operations, allowing for data recovery in case of failures or restarts.
  4. Advanced Data Manipulation: Redis provides a rich set of commands and operations to manipulate and analyze data. It supports atomic operations, enabling multiple commands to be executed as a single, indivisible operation. Redis also includes powerful features like pub/sub messaging, transactions, and Lua scripting, allowing for advanced data processing and complex workflows.
  5. Community and Ecosystem: Redis benefits from a large and active open-source community, contributing to its continuous development and improvement. The Redis community provides support, documentation, and a wide range of libraries and tools that integrate with Redis, expanding its capabilities and making it easier to work with.
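
To ground the list above, here is a short interactive redis-cli session touching a few of those data structures, with replies shown roughly as redis-cli prints them (key names are made up for illustration):

   $ redis-cli
   127.0.0.1:6379> SET greeting "hello"
   OK
   127.0.0.1:6379> HSET user:1 name "Ada" email "ada@example.com"
   (integer) 2
   127.0.0.1:6379> ZADD leaderboard 100 "alice" 95 "bob"
   (integer) 2
   127.0.0.1:6379> ZRANGE leaderboard 0 -1 WITHSCORES
   1) "bob"
   2) "95"
   3) "alice"
   4) "100"

Note how the sorted set comes back ordered by score (bob at 95 before alice at 100), which is what makes it a natural fit for leaderboards and rankings.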

Technical Demonstration: Storing Data in Redis

Prerequisite:

Install Redis on WSL2 for Windows

Let’s consider an example where we want to store user information using Redis. We’ll use Redis commands to store and retrieve user data.

  1. Setting a User Record:
    To set a user record, we can use the SET command, specifying the user’s ID as the key and a JSON representation of the user’s data as the value. For example:
SET user:1234 "{\"name\": \"John Doe\", \"email\": \"john@example.com\", \"age\": 30}"
  2. Retrieving User Information:
    To retrieve the user information, we can use the GET command, providing the user’s ID as the key. For example:
GET user:1234

This command will return the JSON representation of the user data: "{\"name\": \"John Doe\", \"email\": \"john@example.com\", \"age\": 30}"

  3. Updating User Information:
    To update a user’s information, we can use the SET command again with the same user ID. Redis will overwrite the existing value with the new one.
  4. Deleting User Information:
    To delete a user record, we can use the DEL command, specifying the user’s ID as the key. For example:
DEL user:1234

This command will remove the user record from Redis.
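
A side note on the design choice: storing the record as one JSON string means every update rewrites the whole value. When field-level access matters, a Redis hash is the usual alternative. A quick sketch with the same (hypothetical) user ID:

   HSET user:1234 name "John Doe" email "john@example.com" age 30
   HGET user:1234 email
   HSET user:1234 age 31
   HGETALL user:1234

Here the second HSET updates a single field (age) without touching the rest of the record.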

Using an open-source database like Redis for key-value-based data storage provides numerous benefits, including high performance, scalability, persistence options, advanced data manipulation capabilities, and a vibrant community. Redis offers an efficient and flexible solution.

General Installation Guides for Redis