Cong Li

GBASE数据库 | GBase 8a MPP Cluster Cold-Warm-Hot Data Analysis Based on Audit Logs

In the era of big data, managing data efficiently has become critical for businesses to optimize storage costs and improve system performance. For GBase database (GBase数据库), audit logs provide valuable insights into data usage patterns, enabling a structured approach to analyzing data across cold, warm, and hot categories. This article delves into a practical method for leveraging audit logs to classify tables based on their access history, offering actionable insights for lifecycle management and resource optimization.

Abstract

Data lifecycle management is a common issue in projects. Typically, the solution involves:

  1. Determining the frequency of table usage based on their creation and last accessed times.
  2. Classifying tables into cold, warm, and hot categories to facilitate data cleanup or migration.

However, for tables with large amounts of data, this information cannot be quickly or directly retrieved from the cluster metadata. A programmatic approach is therefore needed to extract and analyze multidimensional metadata, separating frequently used tables from rarely used ones so they can be classified and analyzed.

Analysis Approach

In the database, the table_list field in the audit_log table records the tables involved in each SQL statement. Each row corresponds to an SQL query and lists the tables, separated by commas. To extract details about the table names and access times, single-row to multi-row transformations are required, which resemble the reverse operation of the GROUP_CONCAT function.
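
For intuition, here is a minimal sketch (run against a literal value rather than the real audit_log table) of the extraction building block used later: SUBSTRING_INDEX can pull the n-th element out of a comma-separated string.

   -- Illustrative only: pull the 2nd element out of a comma-separated string.
   -- The inner call keeps everything up to the 2nd comma-separated element;
   -- the outer call with -1 then keeps only the last element of that prefix.
   SELECT SUBSTRING_INDEX(SUBSTRING_INDEX('db1.t1,db2.t2,db3.t3', ',', 2), ',', -1) AS second_table;
   -- Returns: db2.t2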

Audit Log Example:


Desired Format After Transformation:


Expected Outcome:


Analysis of Challenges

Requirement: Transform multi-table strings in rows into multiple rows of data.

  1. Initial Attempt:

    The previous solution classified the table_list field into single-table and multi-table data. Multi-table data was exported, processed with the sed command to replace commas with newlines, and then reloaded into the database. However, this approach involved operations outside the database, making it less efficient compared to the current in-database solution.

  2. Inspiration:

    The idea stemmed from Python's split() function for string splitting. It inspired the use of the SUBSTRING_INDEX function for necessary string segmentation.

  3. Technical Reference:

    The logic of creating a Cartesian product with a cross join was borrowed to transform a single row into multiple rows, which fundamentally addresses the problem (a minimal sketch of the trick follows this list).

  4. Innovative Approach:

    This solution pioneered using audit log analysis for understanding data usage trends, offering a practical reference for lifecycle management of customer data.
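
As a minimal sketch of that Cartesian-product trick (using an inline numbers derived table instead of the parameter table built in the solution below), a single comma-separated value fans out into one row per table name:

   -- Illustrative only: fan one comma-separated string out into one row per element.
   SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(t.tbl, ',', n.num), ',', -1) AS tbl_name
   FROM (SELECT 'db1.t1,db2.t2,db3.t3' AS tbl) AS t
   CROSS JOIN (SELECT 1 AS num UNION ALL SELECT 2 UNION ALL SELECT 3) AS n
   WHERE n.num <= LENGTH(t.tbl) - LENGTH(REPLACE(t.tbl, ',', '')) + 1;
   -- Returns three rows: db1.t1 / db2.t2 / db3.t3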

Solution

1) Create a Parameter Table:

A table, test.splist_bryan, is created to serve as a pseudo-parameter table for the Cartesian product in the cross join.

   CREATE TABLE test.splist_bryan (  
     `id` int(19) NOT NULL,  
     `num` int(19) DEFAULT NULL  
   );

2) Initialize the Parameter Table:

A stored procedure inserts 100 rows as parameters.

(Assumes no SQL references more than 100 tables. This limit can be adjusted if needed.)

   DELIMITER //  
   CREATE PROCEDURE test.splist_init()  
   BEGIN  
     DECLARE num INT;  
     SET num = 1;  
     WHILE num <= 100 DO
       INSERT INTO test.splist_bryan (id, num) VALUES (num, num);  
       SET num = num + 1;  
     END WHILE;  
   END //  
   DELIMITER ;  

   CALL test.splist_init();
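
As a quick sanity check (illustrative), the parameter table should now hold ids 1 through 100:

   SELECT COUNT(*) AS cnt, MIN(num) AS min_num, MAX(num) AS max_num
   FROM test.splist_bryan;
   -- Expected: cnt = 100, min_num = 1, max_num = 100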

3) Create test1 Table:

Summarize the audit_log data: group by the cleaned table_list value and keep the most recent start_time for each group.

(Here, audit_log is a combined table on the Express engine. Because the INSERT moves data across engines, the _gbase_query_path parameter must be enabled first.)

   CREATE TABLE test.test1 (start_time TIMESTAMP, tbl VARCHAR(5000));

   SET GLOBAL _gbase_query_path = 1;

   INSERT INTO test.test1 
   SELECT MAX(start_time) AS start_time,
          REPLACE(TRIM(REGEXP_REPLACE(table_list, ';|`|WRITE:|READ:|OTHER:', '')), '  ', ',') AS tbl
   FROM gbase.audit_log
   WHERE LENGTH(TRIM(REGEXP_REPLACE(table_list, ';|`|WRITE:|READ:|OTHER:', ''))) <> 0
     AND NOT REGEXP_LIKE(db, 'gbase|information_schema|performance_schema|gctmpdb|gclusterdb', 'i')
   GROUP BY REPLACE(TRIM(REGEXP_REPLACE(table_list, ';|`|WRITE:|READ:|OTHER:', '')), '  ', ',');
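
To see what the nested cleanup expression does, here is a sketch on a literal value. The raw table_list format shown below is an assumption for illustration and may vary by version; stripping the labels, backticks, and semicolons leaves the table names separated by double spaces, which the outer REPLACE converts to commas:

   -- Illustrative only; the raw table_list value below is assumed.
   SELECT REPLACE(TRIM(REGEXP_REPLACE('READ: `db1`.`t1`; READ: `db2`.`t2`;',
                                      ';|`|WRITE:|READ:|OTHER:', '')),
                  '  ', ',') AS tbl;
   -- Returns: db1.t1,db2.t2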

Field Descriptions for audit_log Table:

Field        Description
thread_id    Thread ID, same as the ID in processlist.
taskid       Globally unique task ID.
start_time   SQL execution start time.
end_time     SQL execution end time.
user_host    Logged-in username and host.
query_time   Execution duration.
rows         Number of rows returned in the result set.
db           Current database in use.
table_list   Tables involved, in the format <db>.<tb>, ....
sql_type     SQL type (DDL, DML, DQL, etc.).

4) Transform Rows into Multiple Rows (test2):

Use a cross join for the Cartesian product and SUBSTRING_INDEX for segmentation. Filter out system tables.

   CREATE TABLE test.test2 AS
   SELECT * FROM (
     SELECT MAX(start_time) AS mydate, col AS tblist
     FROM (
       SELECT t.start_time,
              SUBSTRING_INDEX(SUBSTRING_INDEX(t.tbl, ',', b.num), ',', -1) AS col
       FROM test.test1 AS t
       CROSS JOIN test.splist_bryan AS b
       ON b.num <= LENGTH(t.tbl) - LENGTH(REPLACE(t.tbl, ',', '')) + 1
     ) AS m
     GROUP BY 2
   ) AS p
   WHERE NOT REGEXP_LIKE(tblist, 'gbase|information_schema|performance_schema|gctmpdb|gclusterdb', 'i');

5) Analysis of the Results

The results from the test2 table display the most recent usage date for each table. Each table occupies one row without duplicate entries, allowing for heat analysis based on usage dates.
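
For example, the least recently accessed tables can be pulled straight from test2 (illustrative):

   -- Illustrative: tables with the oldest last-access time surface first.
   SELECT tblist, mydate FROM test.test2 ORDER BY mydate ASC LIMIT 10;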

Cold, Warm, and Hot Data Analysis

1) Overview of Cold, Warm, and Hot Data

Using the telecommunications industry as an example:

  • Data from within 1 year is considered hot.
  • Data between 1 and 3 years old is warm.
  • Data older than 3 years is classified as cold.

The definitions of cold, warm, and hot data can be customized based on the actual data distribution.

Example:

   CREATE TABLE test.test3 AS
   SELECT mydate,
          tblist,
          SUBSTRING(tblist, 1, INSTR(tblist, '.') - 1) AS dbname,
          SUBSTRING(tblist, INSTR(tblist, '.') + 1) AS tbname,
          CASE WHEN diff <= 365 THEN 'hot'
               WHEN diff > 365 AND diff < 3 * 365 THEN 'warm'
               ELSE 'cold'
          END AS class
   FROM (SELECT DATE(mydate) AS mydate,
                DATEDIFF(SYSDATE(), DATE(mydate)) AS diff,
                tblist
         FROM test.test2) AS p;

Details for Each Table Type (Partial Example):


Total Counts for Each Type:

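Both views can be reproduced with queries along these lines (a sketch; the column order and row limit are illustrative):

   -- Per-table detail (partial):
   SELECT mydate, dbname, tbname, class FROM test.test3 ORDER BY class, mydate LIMIT 20;

   -- Total count per class:
   SELECT class, COUNT(*) AS tbl_cnt FROM test.test3 GROUP BY class;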

2) Database-Level Cold, Warm, and Hot Data Analysis

Sometimes it is necessary to identify which databases contain the most cold data for prioritizing cleaning or migration efforts. The following query provides an analysis:

   SELECT class, dbname, COUNT(tblist) FROM test.test3 GROUP BY class, dbname;


Example Query:

   SELECT dbname,
          SUM(DECODE(class, 'cold', 1, 0)) AS cold_cnt,
          SUM(DECODE(class, 'warm', 1, 0)) AS warm_cnt,
          SUM(DECODE(class, 'hot', 1, 0)) AS hot_cnt
   FROM test.test3
   GROUP BY dbname;


From the results, the databases gbaserpt, gbaseods, gbasedwd, and gbasemsm have the highest amounts of cold data. These should be prioritized for data migration or cleanup.
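
As a follow-up (illustrative; the database names come from the example results above), the concrete cold tables in those databases can be enumerated to drive the migration plan:

   SELECT dbname, tbname, mydate
   FROM test.test3
   WHERE class = 'cold'
     AND dbname IN ('gbaserpt', 'gbaseods', 'gbasedwd', 'gbasemsm')
   ORDER BY dbname, mydate;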


By effectively analyzing audit logs in GBase database (GBase数据库), organizations can uncover meaningful patterns in data usage, aiding in the classification of cold, warm, and hot data. This classification empowers data administrators to make informed decisions on storage optimization, prioritize cleanup or migration efforts, and enhance the overall efficiency of their systems. With the methodologies outlined in this article, businesses can establish a solid foundation for data lifecycle management, ensuring sustainable and cost-effective data strategies.
