DEV Community

Franck Pachot
Franck Pachot

Posted on • Edited on

Approximate Count for hash sharded YugabyteDB tables

In the previous post we have seen parallel scan for a count(*): all rows are read and counted on each tablet, in parallel, returning a partial count to be aggregated to the exact count.

What if we don't need an exact count? Can we get an approximate count by reading a sample of rows? Yes, if we use hash sharding and the rows are evenly distributed. YugabyteDB applies a hash function on the hash columns of the primary key, to a number between 0 and 65536, and rows are sharded by ranges of hash value. If I read only the hash values between 0 and 65536/100 I will read only 1% of the rows and hit only one or a few tablet.

Example

On a 9 nodes cluster I've created the following table:

drop table if exists demo;
create extension if not exists pgcrypto;
create table demo (id uuid default gen_random_uuid() primary key, value float);
insert into demo(value) select random() from generate_series(1,1000000)\; 
\watch interval=0 c=14
Enter fullscreen mode Exit fullscreen mode

I've run the insert until I have 150 million rows (I used the \watch enhancement in PG16):

yugabyte=> \timing on
Timing is on.

yugabyte=> select count(*) from demo;

   count
-----------
 150000000
(1 row)

Time: 18980.257 ms (00:18.980)

Enter fullscreen mode Exit fullscreen mode

You can see that the count(*) takes 18 seconds to count 150 million rows.

Approximate count

I can approximate the count by reading 0.1% of the rows:

yugabyte=> select 1000*count(*) from demo where yb_hash_code(id)<65536/1000;
 ?column?
-----------
 148268000
(1 row)

Time: 854.053 ms
Enter fullscreen mode Exit fullscreen mode

That's not bad. I get the approximate count in less that 1 second, with 1% error.

Even reading one hash code out of 65536 gives an acceptable approximate count in 33 milliseconds:

yugabyte=> select 65536*count(*) from demo where yb_hash_code(id)=1;
 ?column?
-----------
 151388160
(1 row)

Time: 33.269 ms
Enter fullscreen mode Exit fullscreen mode

yb_approximate_count()

This is easy to do manually but requires knowing the columns that are part of hashing in the primary key. I can get them automatically from pg_index.indoption with a few joins on the catalog tables. I've created a function to do that automatically, and run the partial count:

drop function if exists yb_approximate_count;
create or replace function yb_approximate_count(
 table_relid oid -- the oid of a hash partitioned table
 , pct float default null -- percentage of rows (actually of hash codes)
) returns bigint as
$$
 declare 
 count_sql text;
 count_result bigint;
begin
with
/*+ Set(random_page_cost 1e42) */ -- https://github.com/yugabyte/yugabyte-db/issues/15224#issuecomment-1341239529
 ind_att_option as (
 select indrelid, indexrelid, schemaname, tablename, indexname, indisprimary, key_position, indoption, attname, n_distinct from
 (
 select indrelid, indexrelid, indisprimary, indkeys.indkey, indclass[indkeys.k-1], k as key_position,
  -- indoption defined in https://github.com/yugabyte/yugabyte-db/blob/2.18/src/postgres/src/include/catalog/pg_index.h#L78
  case indoption[indkeys.k-1]::int when 0 then 'ASC' when 1 then 'DESC' when 3 then 'ASC NULLS FIRST' when 4 then 'HASH' end as indoption
  from pg_index
  , unnest(indkey) with ordinality  as indkeys(indkey,k)
 ) pg_index
 natural left join (select oid as indexrelid, relname as indexname  from pg_class) idx
 natural left join (select oid as indrelid, relnamespace, relname as tablename  from pg_class) tab
 natural left join (select oid as relnamespace, nspname as schemaname  from pg_namespace) nsp
 natural left join (select attrelid as indrelid, attnum as indkey, attname as attname from pg_attribute) att
 natural left join (select schemaname, tablename, attname, n_distinct from pg_stats) stat
),
ind_hash as (
 select  schemaname, tablename,string_agg(format('%I',attname), ',' order by key_position) hash_cols, indisprimary
 , case when indisprimary then indrelid else indexrelid end relid
 from ind_att_option
 where indoption='HASH'
 group by schemaname, tablename, indrelid, indexrelid, indisprimary
)
select
  format(
  case
  when pct=100       -- read all (pushdown parallel scan)
  then 'select       count(*) as "%2$I.%3$I" from %2$I.%3$I'
  when pct is null   -- read only one hash code (1/65536th)
  then 'select 65536*count(*) as "%2$I.%3$I" from %2$I.%3$I where yb_hash_code(%4$s)=1'
  else               -- read a percentage of hash code range
       'select  %1$s*count(*) as "%2$I.%3$I" from %2$I.%3$I where yb_hash_code(%4$s)< 65536/%1$s'
  end , 100/pct , schemaname, tablename, hash_cols
  ) into count_sql
from ind_hash
--, lateral (select num_tablets from yb_table_properties(relid)) as yb_table_properties
where indisprimary and relid=table_relid
;
if count_sql is null then
 raise warning '%','ERROR: Can''t find HASH columns for the table''s primary key';
 return null;
end if;
raise debug 'DEBUG: Running: %',count_sql;
execute count_sql into count_result;
return count_result;
end;
$$ language plpgsql;
Enter fullscreen mode Exit fullscreen mode

The approximate count on 0.1% rows takes less than 1 second, and counting from one yb_hash_code() only takes 150 milliseconds:

yugabyte=> select yb_approximate_count('demo'::regclass,0.1);
 yb_approximate_count
----------------------
            148268000
(1 row)

Time: 994.010 ms

yugabyte=> select yb_approximate_count('demo'::regclass);

 yb_approximate_count
----------------------
            151388160
(1 row)

Time: 152.628 ms

yugabyte=> select yb_approximate_count('demo'::regclass,1);
 yb_approximate_count
----------------------
            149836900
(1 row)

Time: 8566.000 ms (00:08.566)
Enter fullscreen mode Exit fullscreen mode

Note that the query with the where clause doesn't pushdown the count, and then returns all rows to the postgres backend. Then, it has a faster response time than the parallel pushed down count only for a very small sample, like 0.1%. The function without a percentage reads only from one yb_hash_code(), so 1/65536, and may be sufficient for an approximation.

This function works only on hash sharding with even distribution. With range sharding, counting on a range cannot be extrapolated to get a relevant approximate count.

Top comments (0)