The Insatiable Postgres Replication Slot
While preparing a demo on processing change events from Postgres with Apache Flink, I ran into an interesting issue: the Postgres database backing the demo, running on Amazon RDS, had run out of disk space. What had happened? The machine had a 200 GB disk, which had been used up completely in less than two weeks.
Postgres retains all WAL segments after the last log sequence number (LSN) confirmed for a given replication slot. And indeed, I had set up a replication slot (via the Decodable CDC source connector for Postgres, which is based on Debezium), then paused the connector, leaving the slot inactive. The thing was, though, that there was barely any traffic on that database at all! So what made the WAL grow by ~18 GB per day?
What follows is a quick summary of my findings, primarily meant as a reference for my future self, but chances are that others in the same situation will find it useful, too.
The Observation
Let's start with the observations themselves. I don't have the original data and log files at hand any more, but the problem is easy enough to reproduce. First, set up a new Postgres database on Amazon RDS (I used a free-tier instance running version 14.5). Then, open a session against that database and simulate the paused replication connector by creating a logical replication slot like so:
SELECT * FROM pg_create_logical_replication_slot(
  'regression_slot',
  'test_decoding',
  false,
  true
);

Now go and have a coffee (or two, or three) and take a look at the database metrics in the RDS web console a few hours later. The "Free Storage Space" metric paints a rather concerning picture:
In other words, this database, with 20 GB of free storage, is going to run out of disk space within about two days. Next, let's look at the "Transaction Log Disk Usage" metric, which shows the problem in a much more tangible way:
The database's transaction log grows by 64 MB every few minutes. The "Write IOPS" metric completes the picture: every five minutes, something causes a burst of write IOPS on this otherwise completely idle database:
Let's make sure that the replication slot really is the culprit here. Looking at the difference between the slot's restart LSN (the oldest LSN that must be retained for the slot) and the database's current LSN shows how many bytes of WAL are being retained for that slot while it is inactive:
SELECT
  slot_name,
  pg_size_pretty(
    pg_wal_lsn_diff(
      pg_current_wal_lsn(), restart_lsn)) AS retained_wal,
  active,
  restart_lsn
FROM pg_replication_slots;

+-----------------+--------------+--------+-------------+
| slot_name       | retained_wal | active | restart_lsn |
|-----------------+--------------+--------+-------------|
| regression_slot | 2166 MB      | False  | 0/4A05AF0   |
+-----------------+--------------+--------+-------------+

That pretty much matches the WAL growth seen in the database metrics. Now, of course, the main question is: what actually causes that WAL growth? Which process contributes those 64 MB every five minutes? To find out, let's take a look at the server's processes using pg_stat_activity:
SELECT
  pid,
  usename,
  datname,
  client_addr,
  application_name,
  backend_start,
  state,
  state_change
FROM pg_stat_activity
WHERE usename IS NOT NULL;

+-------+----------+----------+-----------------+------------------------+-------------------------------+--------+-------------------------------+
| pid   | usename  | datname  | client_addr     | application_name       | backend_start                 | state  | state_change                  |
|-------+----------+----------+-----------------+------------------------+-------------------------------+--------+-------------------------------|
| 370   | rdsadmin | <null>   | <null>          |                        | 2022-11-30 11:11:03.424359+00 | <null> | <null>                        |
| 468   | rdsadmin | rdsadmin | 127.0.0.1       | PostgreSQL JDBC Driver | 2022-11-30 11:12:02.517528+00 | idle   | 2022-11-30 14:15:05.601626+00 |
| 14760 | postgres | test     | xxx.xxx.yyy.zzz | pgcli                  | 2022-11-30 14:04:58.765899+00 | active | 2022-11-30 14:15:06.820204+00 |
+-------+----------+----------+-----------------+------------------------+-------------------------------+--------+-------------------------------+

Besides our own session (user postgres), there are two sessions of the rdsadmin user. Just from watching them for a while, though, I couldn't observe them doing any actual data changes.
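As a side note — this is an addition of mine, not part of the original investigation — you can also peek at the change events the slot would deliver, without consuming them and thus without advancing the slot:

-- Peek at the pending changes for the slot created above;
-- unlike pg_logical_slot_get_changes(), this does not advance the slot.
SELECT * FROM pg_logical_slot_peek_changes('regression_slot', NULL, NULL);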
The Solution
At this point, I had enough information for some reasonably targeted googling, and before long I came across a blog post about logical replication on RDS Postgres which described literally the same issue. According to it, RDS periodically writes heartbeats to its internal rdsadmin database:
RDS writes heartbeats to its internal "rdsadmin" database every 5 minutes.
That explains part of the story: the seemingly idle RDS Postgres database does see some traffic after all. But how could these heartbeats cause that much WAL growth? Surely each of these heartbeat events isn't 64 MB in size?
Another post on the same blog provided a useful hint: as of Postgres 11, the WAL segment size, i.e. the size of the individual WAL files, is configurable, and on RDS the default is changed from 16 MB to 64 MB. That rings a bell!
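To double-check that value on a given instance — a quick sanity check I'm adding here, not part of the original write-up — the segment size is exposed as a read-only server setting:

-- Reports 64MB on an RDS Postgres instance with the default configuration,
-- 16MB on a vanilla Postgres build.
SHOW wal_segment_size;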
Digging a bit further from there also surfaced the last missing piece of the puzzle, the archive_timeout parameter (which defaults to 5 minutes on RDS). The excellent postgresqlco.nf website describes it as follows:
When this parameter is greater than zero, the server will switch to a new segment file whenever this amount of time has elapsed since the last segment file switch, and there has been any database activity [...]. Note that archived files that are closed early due to a forced switch are still the same length as completely full files.
And that finally explains why an inactive replication slot causes this much WAL growth on an otherwise idle database: every five minutes there is some database activity, caused by the heartbeats which RDS writes to the rdsadmin database. Together with archive_timeout, this makes the server switch to a new 64 MB WAL segment every five minutes. As long as the replication slot is inactive and doesn't confirm any LSNs, all of these WAL segments are retained — even though they are mostly empty — until eventually the database server runs out of disk space.
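To connect the dots with the initial observation, here's a quick back-of-the-envelope calculation (my own sanity check, not from the original analysis): one 64 MB segment every five minutes amounts to 288 segments per day, which matches the ~18 GB of daily WAL growth mentioned at the beginning:

-- 64 MB per forced segment switch, 24 * 60 / 5 = 288 switches per day:
SELECT pg_size_pretty((64 * 1024 * 1024)::bigint * (24 * 60 / 5));
-- => 18 GB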
Take Away
So what's the lesson here? Never leave an inactive replication slot around for long! Make sure to set up alerting, for instance via a query which notifies you whenever a slot retains more than, say, 100 MB of WAL. And of course, keep monitoring the free disk space of your database.
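As a rough sketch of such an alerting query — the 100 MB threshold and the clean-up step are my own illustration, building on the query shown further above:

-- List all slots currently retaining more than 100 MB of WAL:
SELECT
  slot_name,
  active,
  pg_size_pretty(
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 100 * 1024 * 1024;

-- If a slot turns out to be obsolete, drop it so the retained WAL can be released:
SELECT pg_drop_replication_slot('regression_slot');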
Note that also an active replication slot can cause unexpectedly high WAL retention. For instance, if large volumes of changes are written to one database of a Postgres instance, while a replication slot is set up for another, rarely changing database on the same instance, the connector for that slot will hardly ever receive any changes and thus cannot confirm any LSNs — while the WAL, which is shared by all databases of the instance, keeps growing.
A common solution for this kind of scenario is to create some artificial traffic — heartbeats — in the affected database; the Debezium Postgres connector supports this, for example. In fact, you don't even need to create a dedicated heartbeat table for that any longer: thanks to support for pg_logical_emit_message(), a message can be emitted directly into the WAL:
SELECT pg_logical_emit_message(false, 'heartbeat', now()::varchar);

This works as long as you are using a logical decoding plug-in which supports logical replication messages, for instance pgoutput, and it keeps the replication slot from falling behind.
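One way of emitting such a heartbeat periodically — purely as an illustration on my part, assuming the pg_cron extension (which is available on RDS) has been installed via CREATE EXTENSION — would be a scheduled job like this:

-- Emit a logical decoding message every five minutes, giving the connector something
-- to process and acknowledge even if the database itself sees no traffic:
SELECT cron.schedule(
  '*/5 * * * *',
  $$SELECT pg_logical_emit_message(false, 'heartbeat', now()::varchar)$$
);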