jarifibrahim commented :
@dougdonohoe How are you inserting the data? Stream writer?
jarifibrahim commented :
@dougdonohoe How are you inserting the data? Stream writer?
dougdonohoe commented :
Iām using badger.WriteBatch
. This smells like a race condition bug. Error occurs when calling Flush()
.
dougdonohoe commented :
In other news, I ran my test program to create 22,000,000 entries (with random lookups done for each one). In 1.6, it ran in 25m22s. In 2.0.2 with patch, it ran in 20m40s. So that is a positive sign.
OTOH, I was trying to compare the re-world performance of my Kafka extraction program, but it died due to the above filename bug. At the time of the crash, it had been running for 1h45m. At the approximately same time when last run using 1.6, it took 1h30m. So it appears to be 16% slower in 2.x. Note that this code is predominately write-focused.
dougdonohoe commented :
I donāt know much about the code, but I noticed this:
1.6 value.go:1013
newid := atomic.AddUint32(&vlog.maxFid, 1)
2.0 value.go:1405
newid := vlog.maxFid + 1
Seems suspicious to have removed the atomic add.
dougdonohoe commented :
I spent a little more time trying to understand the Badger code base. Itās a lot to come up to speed on. It is obvious the maxFid
logic changed substantially from 1.6 to 2.0 so my point above may not be relevant.
However, one thing I donāt understand in 2.0 is why the *.vlog
file timestamps arenāt monotonically increasing.
In 1.6, ls
by timestamp shows timestamps increasing along with file id.
$ ls -ltrh *.vlog
-rw-rw-r-- 1 donohoe donohoe 250M Feb 17 15:08 000000.vlog
-rw-rw-r-- 1 donohoe donohoe 268M Feb 17 15:13 000001.vlog
-rw-rw-r-- 1 donohoe donohoe 247M Feb 17 15:17 000002.vlog
-rw-rw-r-- 1 donohoe donohoe 232M Feb 17 15:21 000003.vlog
-rw-rw-r-- 1 donohoe donohoe 209M Feb 17 15:23 000004.vlog
-rw-rw-r-- 1 donohoe donohoe 288M Feb 17 15:28 000005.vlog
-rw-rw-r-- 1 donohoe donohoe 310M Feb 17 15:33 000006.vlog
-rw-rw-r-- 1 donohoe donohoe 273M Feb 17 15:38 000007.vlog
-rw-rw-r-- 1 donohoe donohoe 280M Feb 17 15:43 000008.vlog
-rw-rw-r-- 1 donohoe donohoe 279M Feb 17 15:51 000009.vlog
-rw-rw-r-- 1 donohoe donohoe 299M Feb 17 16:04 000010.vlog
-rw-rw-r-- 1 donohoe donohoe 274M Feb 17 16:12 000011.vlog
-rw-rw-r-- 1 donohoe donohoe 220M Feb 17 16:14 000012.vlog
-rw-rw-r-- 1 donohoe donohoe 218M Feb 17 16:17 000013.vlog
-rw-rw-r-- 1 donohoe donohoe 412M Feb 17 16:24 000014.vlog
-rw-rw-r-- 1 donohoe donohoe 259M Feb 17 16:28 000015.vlog
-rw-rw-r-- 1 donohoe donohoe 253M Feb 17 16:30 000016.vlog
-rw-rw-r-- 1 donohoe donohoe 288M Feb 17 16:33 000017.vlog
-rw-rw-r-- 1 donohoe donohoe 246M Feb 17 16:36 000018.vlog
-rw-rw-r-- 1 donohoe donohoe 188M Feb 17 16:37 000019.vlog
-rw-rw-r-- 1 donohoe donohoe 187M Feb 17 16:38 000020.vlog
-rw-rw-r-- 1 donohoe donohoe 225M Feb 17 16:39 000021.vlog
-rw-rw-r-- 1 donohoe donohoe 245M Feb 17 16:41 000022.vlog
-rw-rw-r-- 1 donohoe donohoe 223M Feb 17 16:44 000023.vlog
-rw-rw-r-- 1 donohoe donohoe 220M Feb 17 16:47 000024.vlog
-rw-rw-r-- 1 donohoe donohoe 228M Feb 17 16:50 000025.vlog
-rw-rw-r-- 1 donohoe donohoe 228M Feb 17 16:53 000026.vlog
-rw-rw-r-- 1 donohoe donohoe 243M Feb 17 16:56 000027.vlog
-rw-rw-r-- 1 donohoe donohoe 243M Feb 17 16:59 000028.vlog
-rw-rw-r-- 1 donohoe donohoe 243M Feb 17 17:02 000029.vlog
-rw-rw-r-- 1 donohoe donohoe 244M Feb 17 17:05 000030.vlog
-rw-rw-r-- 1 donohoe donohoe 238M Feb 17 17:09 000031.vlog
-rw-rw-r-- 1 donohoe donohoe 244M Feb 17 17:12 000032.vlog
-rw-rw-r-- 1 donohoe donohoe 267M Feb 17 17:15 000033.vlog
-rw-rw-r-- 1 donohoe donohoe 240M Feb 17 17:18 000034.vlog
-rw-rw-r-- 1 donohoe donohoe 249M Feb 17 17:23 000035.vlog
-rw-rw-r-- 1 donohoe donohoe 275M Feb 17 17:27 000036.vlog
-rw-rw-r-- 1 donohoe donohoe 282M Feb 17 17:32 000037.vlog
-rw-rw-r-- 1 donohoe donohoe 303M Feb 17 17:35 000038.vlog
-rw-rw-r-- 1 donohoe donohoe 311M Feb 17 17:38 000039.vlog
-rw-rw-r-- 1 donohoe donohoe 276M Feb 17 17:43 000040.vlog
-rw-rw-r-- 1 donohoe donohoe 253M Feb 17 17:47 000041.vlog
-rw-rw-r-- 1 donohoe donohoe 226M Feb 17 17:50 000042.vlog
-rw-rw-r-- 1 donohoe donohoe 235M Feb 17 17:53 000043.vlog
-rw-rw-r-- 1 donohoe donohoe 229M Feb 17 17:56 000044.vlog
-rw-rw-r-- 1 donohoe donohoe 230M Feb 17 17:59 000045.vlog
-rw-rw-r-- 1 donohoe donohoe 231M Feb 17 18:03 000046.vlog
-rw-rw-r-- 1 donohoe donohoe 231M Feb 17 18:06 000047.vlog
-rw-rw-r-- 1 donohoe donohoe 230M Feb 17 18:09 000048.vlog
-rw-rw-r-- 1 donohoe donohoe 274M Feb 17 18:14 000049.vlog
-rw-rw-r-- 1 donohoe donohoe 231M Feb 17 18:17 000050.vlog
-rw-rw-r-- 1 donohoe donohoe 230M Feb 17 18:21 000051.vlog
-rw-rw-r-- 1 donohoe donohoe 231M Feb 17 18:24 000052.vlog
-rw-rw-r-- 1 donohoe donohoe 230M Feb 17 18:27 000053.vlog
-rw-rw-r-- 1 donohoe donohoe 273M Feb 17 18:32 000054.vlog
-rw-rw-r-- 1 donohoe donohoe 290M Feb 17 18:37 000055.vlog
-rw-rw-r-- 1 donohoe donohoe 292M Feb 17 18:44 000056.vlog
-rw-rw-r-- 1 donohoe donohoe 164M Feb 18 03:09 000057.vlog
However in 2.0, it looks like .vlog
files are accessed again after they have been written. In my use case, Iām only setting one value for a particular key, so no need to remove/touch previous .vlog
files. Yet:
$ ls -ltrh *.vlog
-rw------- 1 donohoe donohoe 181M Mar 12 16:47 000018.vlog
-rw------- 1 donohoe donohoe 177M Mar 12 16:48 000019.vlog
-rw------- 1 donohoe donohoe 192M Mar 12 16:49 000020.vlog
-rw------- 1 donohoe donohoe 238M Mar 12 16:51 000000.vlog
-rw------- 1 donohoe donohoe 219M Mar 12 16:51 000021.vlog
-rw------- 1 donohoe donohoe 229M Mar 12 16:55 000022.vlog
-rw------- 1 donohoe donohoe 256M Mar 12 16:58 000001.vlog
-rw------- 1 donohoe donohoe 207M Mar 12 16:59 000023.vlog
-rw------- 1 donohoe donohoe 222M Mar 12 17:02 000024.vlog
-rw------- 1 donohoe donohoe 235M Mar 12 17:03 000002.vlog
-rw------- 1 donohoe donohoe 211M Mar 12 17:05 000025.vlog
-rw------- 1 donohoe donohoe 221M Mar 12 17:08 000003.vlog
-rw------- 1 donohoe donohoe 224M Mar 12 17:09 000026.vlog
-rw------- 1 donohoe donohoe 198M Mar 12 17:11 000004.vlog
-rw------- 1 donohoe donohoe 229M Mar 12 17:13 000027.vlog
-rw------- 1 donohoe donohoe 276M Mar 12 17:16 000005.vlog
-rw------- 1 donohoe donohoe 232M Mar 12 17:16 000028.vlog
-rw------- 1 donohoe donohoe 230M Mar 12 17:21 000029.vlog
-rw------- 1 donohoe donohoe 299M Mar 12 17:23 000006.vlog
-rw------- 1 donohoe donohoe 230M Mar 12 17:25 000030.vlog
-rw------- 1 donohoe donohoe 226M Mar 12 17:28 000031.vlog
-rw------- 1 donohoe donohoe 262M Mar 12 17:29 000007.vlog
-rw------- 1 donohoe donohoe 259M Mar 12 17:32 000032.vlog
-rw------- 1 donohoe donohoe 268M Mar 12 17:35 000008.vlog
-rw------- 1 donohoe donohoe 228M Mar 12 17:36 000033.vlog
-rw------- 1 donohoe donohoe 235M Mar 12 17:41 000034.vlog
-rw------- 1 donohoe donohoe 267M Mar 12 17:44 000009.vlog
-rw------- 1 donohoe donohoe 249M Mar 12 17:46 000035.vlog
-rw------- 1 donohoe donohoe 269M Mar 12 17:52 000036.vlog
-rw------- 1 donohoe donohoe 274M Mar 12 17:57 000037.vlog
-rw------- 1 donohoe donohoe 287M Mar 12 17:58 000010.vlog
-rw------- 1 donohoe donohoe 307M Mar 12 18:00 000038.vlog
-rw------- 1 donohoe donohoe 279M Mar 12 18:06 000039.vlog
-rw------- 1 donohoe donohoe 263M Mar 12 18:07 000011.vlog
-rw------- 1 donohoe donohoe 208M Mar 12 18:09 000012.vlog
-rw------- 1 donohoe donohoe 243M Mar 12 18:11 000040.vlog
-rw------- 1 donohoe donohoe 207M Mar 12 18:11 000013.vlog
-rw------- 1 donohoe donohoe 256M Mar 12 18:17 000041.vlog
-rw------- 1 donohoe donohoe 401M Mar 12 18:19 000014.vlog
-rw------- 1 donohoe donohoe 204M Mar 12 18:20 000042.vlog
-rw------- 1 donohoe donohoe 218M Mar 12 18:23 000043.vlog
-rw------- 1 donohoe donohoe 248M Mar 12 18:24 000015.vlog
-rw------- 1 donohoe donohoe 242M Mar 12 18:27 000016.vlog
-rw------- 1 donohoe donohoe 218M Mar 12 18:27 000044.vlog
-rw------- 1 donohoe donohoe 277M Mar 12 18:30 000017.vlog
-rw------- 1 donohoe donohoe 219M Mar 12 18:31 000045.vlog
-rw------- 1 donohoe donohoe 219M Mar 12 18:34 000046.vlog
-rw------- 1 donohoe donohoe 219M Mar 12 18:37 000047.vlog
-rw------- 1 donohoe donohoe 226M Mar 12 18:41 000048.vlog
-rw------- 1 donohoe donohoe 256M Mar 12 18:46 000049.vlog
-rw------- 1 donohoe donohoe 218M Mar 12 18:49 000050.vlog
-rw------- 1 donohoe donohoe 219M Mar 12 18:53 000051.vlog
-rw------- 1 donohoe donohoe 219M Mar 12 18:56 000052.vlog
-rw------- 1 donohoe donohoe 237M Mar 12 19:00 000053.vlog
-rw------- 1 donohoe donohoe 273M Mar 12 19:05 000054.vlog
-rw------- 1 donohoe donohoe 280M Mar 12 19:11 000055.vlog
-rw------- 1 donohoe donohoe 275M Mar 12 19:18 000056.vlog
-rw------- 1 donohoe donohoe 110M Mar 12 19:21 000057.vlog
Also notice file 0000018.vlog
is the least recently modified ⦠and it is the one where the bug happened. This seems like a clue.
Iād like to dig in more, but I have to do some day job stuff
dougdonohoe commented :
Note that both occurrence of this bug happened well after startup. To summarize previous info:
WithSyncWrites(false)
badger.WriteBatch
with a batch size of 1000Iāve been looking through the code a bit more, and it must be failing in the toDisk()
anonymous function. In write
:
newid := vlog.maxFid + 1
y.AssertTruef(newid > 0, "newid has overflown uint32: %v", newid)
newlf, err := vlog.createVlogFile(newid)
AFAIK, aside from open, the only place that vlog.maxFid
is changed is when vlog.createVlogFile
is called. The only place that is called is in write
. So the only way youād get the error Iām seeing if if a multiple toDisk()
are running concurrently, right?
dougdonohoe commented :
FYI: Iāve moved on to other things and have put on hold our upgrade to 2.0. This intermittent issue where the vlog
files already exist is a huge blocker for us. Iāve seen it happen twice, but it takes a non-deterministic amount of time for it to occur. In 1.6, I never had any issues running long running imports before (some over 24 hours long).
jarifibrahim commented :
Hey @dougdonohoe, do you some specific time/pattern between the missing
vlog file issues? I tried doing the same thing using the badger benchmark write -m 1000
command which inserts 1000 million keys in badger and didnāt see any missing file/file already exists issue.
If thereās any script/test program that you can share which can help me reproduce the issue it would be great. I havenāt been able to reproduce the failure and thatās why we havenāt fixed it yet.
As for the vlog.maxFid
code here
It is a sequential call and thatās why it should be okay to not have any locks there. Earlier we were using atomics but they werenāt really needed.
So the only way youād get the error Iām seeing if if a multiple toDisk() are running concurrently, right?
Writes to the vlog file should always be serial. We cannot write concurrently to the vlog file which is the write-ahead log file.
Issue 1187 looks like the code that changed this functionality. Linking here just for future reference.
The file already exists
issue mentioned in Performance regression 1.6 to 2.0.2 - #20 by diggy was fixed by Remove vlog file if bootstrap, syncDir or mmap fails by jarifibrahim Ā· Pull Request #1434 Ā· dgraph-io/badger Ā· GitHub (itās not yet released).
I re-ran my entire application in v2.0.3 and things look great - it is even a bit faster that v1.6.1 in places. You can close this ticket.