Sometime in the last week, the iSCSI volume of one of our PostgreSQL servers went up to 98% used and Nagios vomited on the standby mobile.
This particular postgres database holds customers' preferences for our webmail.
Unfortunately the webmail is a Java web app (Tomcat), custom written by some company, and the source code is a spaghetti mess. The code also has a gazillion bugs, so we decided to migrate to an open-source PHP-based webmail. Hopefully in the near future we will officially migrate to the new webmail platform and all problems known to humanity will cease to exist.
Until then, we have to maintain the current webmail platform and figure out how a ~500MB database became a nearly ~50GB nightmare!
My knowledge of databases is not basic, but to be fair I lack experience. As a veteran standby engineer I know that I first need to apply a quick & dirty patch and investigate afterwards. Also, I am not afraid to ask for help! And so I did.
First thing to do: increase the volume on the storage machine. I've already said that we are using an iSCSI partition, so that action sounds pointless. In fact, no it isn't!!! The storage machine reserves a percentage of each volume for snapshots, and those snapshots were "eating" space from the actual volume, so the increase gave us a little room to breathe. You are probably thinking that we should also resize the partition, but this is a live production machine and we don't want downtime on the service (umount/resize/mount).
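For the record, on a NetApp-style filer this boils down to something like the commands below; the filer and volume names are made up for illustration, and your vendor's CLI will differ:

# grow the underlying volume a little, or give back some of the snapshot reserve
ssh filer01 vol size vol_pgdata +5g
ssh filer01 snap reserve vol_pgdata 10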
From 98% to 93% with only one command.
The second, and most popular, thing to do was VACUUM. A colleague took that step and vacuumed each table separately, so as not to "lock" everything or provoke the daemon into a crash or even worse. That brought us down to 88% used and gave us time to think before acting again.
For all you people who don't know postgres: when rows are updated or deleted, postgres doesn't actually remove the old data from storage, it only marks it as dead inside the database. So you need to enable autovacuum or run VACUUM by hand from time to time.
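For anyone wanting to do the same, the per-table vacuum is nothing fancier than the following; the database and table names are just examples:

# vacuum one table at a time instead of the whole database
psql -d webmail -c 'VACUUM ANALYZE user_prefs;'

# or loop over all the tables in the public schema
psql -d webmail -At -c "SELECT tablename FROM pg_tables WHERE schemaname = 'public';" |
while read t; do
    psql -d webmail -c "VACUUM ANALYZE $t;"
done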
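On 8.1 autovacuum is off by default and needs the row-level stats collector, so enabling it means something like this in postgresql.conf (plus a reload):

# postgresql.conf - prerequisites for autovacuum on 8.1
stats_start_collector = on
stats_row_level = on
autovacuum = on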
Of course, before anything else (even the vacuum), we took a pg_dump to another partition.
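The dump itself was along the lines of a plain custom-format dump pointed at a partition that still had room; the database name and path here are examples:

pg_dump -Fc webmail > /mnt/backup/webmail.dump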
But pg_dump was taking hours and hours to complete.
After further investigation, we found a table that pg_dump was struggling with.
We fired up a new database and tried to restore this table there.
I couldn't. There was a duplicate-key error and the restore process kept failing.
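One easy way to spot such a table is to ask postgres itself for the relation sizes; if memory serves, 8.1 already ships the size functions (database name is an example):

psql -d webmail <<'EOF'
-- biggest tables first, indexes and toast included
SELECT relname, pg_size_pretty(pg_total_relation_size(oid))
FROM pg_class
WHERE relkind = 'r'
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 10;
EOF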
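The single-table move itself is the straightforward part, something like this (again, names are examples):

# dump only the suspect table and feed it to a scratch database
pg_dump -t user_prefs webmail | psql -d scratchdb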
I tried to figure out the duplicate entries. 20 entries! The table has only four columns and ~50,000 rows, and only 20 of them were duplicates. The actual data is about 20MB in size. I went through the entries and removed the duplicates by hand. After that I reindexed the specific table, and an hour later over 20GB were free. Down to 44% from 98% by deleting 20 entries.
At that point I was thinking that postgres was mocking me. How the hell had a 20MB table grown to over 20GB?
Now pg_dump takes 6.5 minutes, but it still takes a long time to dump this specific table.
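If you want to hunt for duplicates yourself, a plain GROUP BY plus a REINDEX is enough; the table and column names below are hypothetical stand-ins for the real ones:

psql -d webmail <<'EOF'
-- show the rows that violate the supposed uniqueness
SELECT username, pref_key, count(*)
FROM user_prefs
GROUP BY username, pref_key
HAVING count(*) > 1;

-- after deleting the extra rows by hand, rebuild the table's indexes
REINDEX TABLE user_prefs;
EOF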
Tomorrow is a new day to experiment with PostgreSQL.
[edit1]: Just to be fair, the postgres version is 8.1.
[edit2]: The VACUUM process just finished. Another 20GB free!!! So for 20 duplicate entries, a total of 40GB of disk freed! We are now at 9% used, down from 98%.
PS: We have already discussed a lot of plans in our department (upgrading the postgres version, restoring the dump to a new machine, etc.), but we believe we shouldn't focus on any of them (yet), as we haven't found the trigger that blew the database up from 500MB to 50GB. Once we find that, all plans are back on the table.