I have a compressed file of:
250.000.000 lines
Compressed the file size is: 671M
Uncompressed, it's: 6,5G
Need to extract a plethora of things and verify some others.
I dont want to use bash but something more elegant, like python or lua.
Looking through “The-Internet”, I’ve created some examples for the single purpose of educating my self.
So here are my results.
BE AWARE they are far-far-far away from perfect in code or execution.
Sorted by (less) time of execution:
pigz
# time pigz -p4 -cd 2016-08-04-06.ldif.gz &> /dev/null
real 0m9.980s
user 0m16.570s
sys 0m0.980s
gzip
gzip 1.8
# time /bin/gzip -cd 2016-08-04-06.ldif.gz &> /dev/null
real 0m23.951s
user 0m23.790s
sys 0m0.150s
zcat
zcat (gzip) 1.8
# time zcat 2016-08-04-06.ldif.gz &> /dev/null
real 0m24.202s
user 0m24.100s
sys 0m0.090s
Perl
Perl v5.24.0
code:
#!/usr/bin/perl
open (FILE, '/bin/gzip -cd 2016-08-04-06.ldif.gz |');
while (my $line = ) {
print $line;
}
close FILE;
time:
# time ./dump.pl &> /dev/null
real 0m49.942s
user 1m14.260s
sys 0m2.350s
PHP
PHP 7.0.9 (cli)
code:
#!/usr/bin/php
< ? php
$fp = gzopen("2016-08-04-06.ldif.gz", "r");
while (($buffer = fgets($fp, 4096)) !== false) {
echo $buffer;
}
gzclose($fp);
? >
time:
# time php -f dump.php &> /dev/null
real 1m19.407s
user 1m4.840s
sys 0m14.340s
PHP - Iteration #2
PHP 7.0.9 (cli)
Impressed with php results, I took the perl-approach on code:
< ? php
$fp = popen("/bin/gzip -cd 2016-08-04-06.ldif.gz", "r");
while (($buffer = fgets($fp, 4096)) !== false) {
echo $buffer;
}
pclose($fp);
? >
time:
# time php -f dump2.php &> /dev/null
real 1m6.845s
user 1m15.590s
sys 0m19.940s
not bad !
Lua
Lua 5.3.3
code:
#!/usr/bin/lua
local gzip = require 'gzip'
local filename = "2016-08-04-06.ldif.gz"
for l in gzip.lines(filename) do
print(l)
end
time:
# time ./dump.lua &> /dev/null
real 3m50.899s
user 3m35.080s
sys 0m15.780s
Lua - Iteration #2
Lua 5.3.3
I was depressed to see that php is faster than lua!!
Depressed I say !
So here is my next iteration on lua:
code:
#!/usr/bin/lua
local file = assert(io.popen('/bin/gzip -cd 2016-08-04-06.ldif.gz', 'r'))
while true do
line = file:read()
if line == nil then break end
print (line)
end
file:close()
time:
# time ./dump2.lua &> /dev/null
real 2m45.908s
user 2m54.470s
sys 0m21.360s
One minute faster than before, but still too slow !!
Lua - Zlib
Lua 5.3.3
My next iteration with lua is using zlib :
code:
#!/usr/bin/lua
local zlib = require 'zlib'
local filename = "2016-08-04-06.ldif.gz"
local block = 64
local d = zlib.inflate()
local file = assert(io.open(filename, "rb"))
while true do
bytes = file:read(block)
if not bytes then break end
print (d(bytes))
end
file:close()
time:
# time ./dump.lua &> /dev/null
real 0m41.546s
user 0m40.460s
sys 0m1.080s
Now, that's what I am talking about !!!
Playing with window_size (block) can make your code faster or slower.
Python v3
Python 3.5.2
code:
#!/usr/bin/python
import gzip
filename='2016-08-04-06.ldif.gz'
with gzip.open(filename, 'r') as f:
for line in f:
print(line,)
time:
# time ./dump.py &> /dev/null
real 13m14.460s
user 13m13.440s
sys 0m0.670s
Not enough tissues on the whole damn world!
Python v3 - Iteration #2
Python 3.5.2
but wait ... a moment ... The default mode for gzip.open is 'rb'.
(read binary)
let's try this once more with rt(read-text) mode:
code:
#!/usr/bin/python
import gzip
filename='2016-08-04-06.ldif.gz'
with gzip.open(filename, 'rt') as f:
for line in f:
print(line, end="")
time:
# time ./dump.py &> /dev/null
real 5m33.098s
user 5m32.610s
sys 0m0.410s
With only one super tiny change and run time in half!!!
But still tooo slow.
Python v3 - Iteration #3
Python 3.5.2
Let's try a third iteration with popen this time.
code:
#!/usr/bin/python
import os
cmd = "/bin/gzip -cd 2016-08-04-06.ldif.gz"
f = os.popen(cmd)
for line in f:
print(line, end="")
f.close()
time:
# time ./dump2.py &> /dev/null
real 6m45.646s
user 7m13.280s
sys 0m6.470s
Python v3 - zlib Iteration #1
Python 3.5.2
Let's try a zlib iteration this time.
code:
#!/usr/bin/python
import zlib
d = zlib.decompressobj(zlib.MAX_WBITS | 16)
filename='2016-08-04-06.ldif.gz'
with open(filename, 'rb') as f:
for line in f:
print(d.decompress(line))
time:
# time ./dump.zlib.py &> /dev/null
real 1m4.389s
user 1m3.440s
sys 0m0.410s
finally some proper values with python !!!
Specs
All the running tests occurred to this machine:
4 x Intel(R) Core(TM) i3-3220 CPU @ 3.30GHz
8G RAM
Conclusions
Ok, I Know !
The shell-pipe approach of using gzip for opening the compressed file, is not fair to all the above code snippets.
But ... who cares ?
I need something that run fast as hell and does smart things on those data.
Get in touch
As I am not a developer, I know that you people know how to do these things even better!
So I would love to hear any suggestions or even criticism on the above examples.
I will update/report everything that will pass the "I think I know what this code do" rule and ... be gently with me ;)
PLZ use my email address: evaggelos [ _at_ ] balaskas [ _dot_ ] gr
to send me any suggestions
Thanks !