| |
| ▲ | kbaker 4 days ago | parent | next [-] | | > If the directory containing the rollback journal is not fsynced after the journal file is deleted, then the journal file might rematerialize after a power failure, causing sqlite to roll back a committed transaction. And fsyncing the directory doesn't seem to happen unless you set synchronous to EXTRA, per the docs cited in the blog post. I think this is the part that is confusing. The fsyncing of the directory is supposed to be done by the filesystem/OS itself, not the application. From man fsync, As well as flushing the file data, fsync() also flushes the metadata information associated with the file (see inode(7)).
So from sqlite's perspective on DELETE it is either: before the fsync call, and not committed, or after the fsync call, and committed (or partially written somehow and needing rollback.)Unfortunately it seems like this has traditionally been broken on many systems, requiring workarounds, like SYNCHRONOUS = EXTRA. | | |
| ▲ | agwa 4 days ago | parent [-] | | No, the metadata is information like the modification time and permissions, not the directory entry. The next paragraph in the man page explains this: > Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed. https://man7.org/linux/man-pages/man2/fsync.2.html Edit to add: I don't think there's a single Unix-like OS on which fsync would also fsync the directory, since a file can appear in an arbitrary number of directories, and the kernel doesn't know all the directories in which an open file appears. This is a moot point anyways, because in DELETE mode, the operation that needs to be durably persisted is the unlinking of the journal file - what would you fsync for that besides the directory itself? | | |
| ▲ | kbaker 4 days ago | parent [-] | | OK, interesting, I think I see... So you are asking about if SQLite opens and finds a not-committed rollback journal that looks valid, then it rolls it back? I was more curious so I looked at the code here: https://sqlite.org/src/file?name=src/pager.c&ci=trunk and found something similar to what you are asking in this comment before `sqlite3PagerCommitPhaseTwo`: ** When this function is called, the database file has been completely
** updated to reflect the changes made by the current transaction and
** synced to disk. The journal file still exists in the file-system
** though, and if a failure occurs at this point it will eventually
** be used as a hot-journal and the current transaction rolled back.
So, it does this: ** This function finalizes the journal file, either by deleting,
** truncating or partially zeroing it, so that it cannot be used
** for hot-journal rollback. Once this is done the transaction is
** irrevocably committed.
Assuming fsync works on both the main database and the hot journal, then I don't see a way that it is not durable? Because, it has to write and sync the full hot journal, then write to the main database, then zero out the hot journal, sync that, and only then does it atomically return from the commit? (assuming FULL and DELETE) | | |
| ▲ | agwa 3 days ago | parent [-] | | > OK, interesting, I think I see... So you are asking about if SQLite opens and finds a not-committed rollback journal that looks valid, then it rolls it back? Right. > Assuming fsync works on both the main database and the hot journal, then I don't see a way that it is not durable? Because, it has to write and sync the full hot journal, then write to the main database, then zero out the hot journal, sync that, and only then does it atomically return from the commit? (assuming FULL and DELETE) DELETE mode doesn't truncate or zero the journal; it merely deletes it from the directory. You need to switch to TRUNCATE or PERSIST for that behavior: https://sqlite.org/pragma.html#pragma_journal_mode I confirmed all of this by attaching gdb to SQLite and setting a breakpoint on unlink. At the time of unlink the journal is still "hot" and usable for rollback: https://news.ycombinator.com/item?id=45069533 | | |
| ▲ | kbaker 3 days ago | parent [-] | | Yeah, I see the comments down below in pager.c which explain it a bit better. I guess I thought its behavior was more like PERSIST by default. ** journalMode==DELETE
** The journal file is closed and deleted using sqlite3OsDelete().
**
** If the pager is running in exclusive mode, this method of finalizing
** the journal file is never used. Instead, if the journalMode is
** DELETE and the pager is in exclusive mode, the method described under
** journalMode==PERSIST is used instead.
So I guess this is one of the tradeoffs SQLite makes between extreme durability with (PERSIST, TRUNCATE, or using EXTRA) vs speed.I know we have used SQLite quite a bit without getting into this exact scenario - or at least the transaction that would have been rolled back wasn't important enough in our case, I guess when the device powers down milliseconds later to trigger this case we never noticed (or needed this transaction), only that the DB remained consistent. And it seems like if DELETE is the default and doesn't get noticed enough in practice that it needs to be changed to a different safer default (like to PERSIST or something,) I guess it is better to have the speed gains from reducing that extra write + fsync. But, now I guess you know enough to submit better documentation upstream for a sliding scale of durability, I definitely agree that the docs could be better about how to get ultimate durability, and how to tune the knobs for `journal_mode` and `synchronous`. |
|
|
|
| |
| ▲ | NewsaHackO 4 days ago | parent | prev [-] | | >if the rollback journal is present when sqlite opens a database, it applies the rollback journal to undo the changes that the transaction made. See https://www.sqlite.org/atomiccommit.html#1_deleting_the_roll... I don't follow. How would fsyncing the rollback journal affect the durability of the actual database? Do you actually think that the database would reapply an already committed journal whose ID in the header already indicates that the transaction was committed, when the database is already consistent? I really think you should re-review the definition of durability of a database, especially before saying the creator of SQLite is incorrect about its implementation. | | |
| ▲ | int_19h 4 days ago | parent | next [-] | | It's specifically about fsyncing journal deletion. The problem isn't that it would reapply it if it was already used to rollback. Rather, the problem is that if you commit, and that commit has succeeded (and so your app believes that it has written the data and might e.g. perform some other actions on it), the deletion of the now-unneeded journal might not be flushed to disk in event of power loss or similar. So when you start the app again, SQLite sees said rollback journal, and - since it would be considered "hot" - applies it, effectively reverting the transaction that was supposedly already committed. FWIW I don't think it's wrong per se. The article links to a HN comment in which Richard Hipp explains why this is the default behavior, and it does make sense: https://news.ycombinator.com/item?id=45014296. At the same time, clearly, the definition of "durable" here could use some clarification. | | |
| ▲ | agwa 4 days ago | parent | next [-] | | Yes, that's exactly right. Note that the comment by Richard Hipp is justifying why WAL mode is not durable by default. It's a completely reasonable explanation, and would be for DELETE mode too, yet his comment claims that DELETE mode is durable by default, which I can't reconcile with the docs. | |
| ▲ | NewsaHackO 4 days ago | parent | prev [-] | | >So when you start the app again, SQLite sees said rollback journal, and - since it would be considered "hot" - applies it, effectively reverting the transaction that was supposedly already committed. Guys. The journal would not be a hot journal though, as the hot journal selection only applies if the database is in a inconsistent state. Otherwise, the database knows from the ID of the journal not to reapply an already applied rollback journal. The process you are talking about ONLY happens when the journal database has been corrupted state, and it has to try and file a file to help recover the database. | | |
| ▲ | agwa 4 days ago | parent | next [-] | | OK, I just tested it: In terminal 1, I created a database and added a table to it: $ sqlite3 testdb
sqlite> create table test (col int);
In terminal 2, I attached gdb to sqlite3 and set a breakpoint on unlink: $ gdb sqlite3 `pidof sqlite3`
(gdb) b unlink
(gdb) c
Back in terminal 1, I inserted data into the table: sqlite> insert into test values(123);
In terminal 3, I saved a copy of testdb-journal: $ cp testdb-journal testdb-journal.save
Then in terminal 2, I resumed executing sqlite3: (gdb) c
In terminal 1, the INSERT completed without error.Back in terminal 3, I sent SIGKILL to sqlite3, simulating a power failure: $ killall -9 sqlite3
I then restored testdb-journal, simulating what could happen after a power failure when the parent directory is not fsynced: $ mv testdb-journal.save testdb-journal
I then opened testdb again and ran `SELECT * FROM test` and it returned zero rows.This proves int_19h and I are right - if the journal file comes back, SQLite will apply it and roll back a committed transaction. I then confirmed with strace that, as the documentation says, the directory is only fsynced after unlink when synchronous=EXTRA. It doesn't happen with synchronous=FULL. So you need synchronous=EXTRA to get durability in DELETE mode. | |
| ▲ | agwa 4 days ago | parent | prev | next [-] | | The docs list 5 conditions that all must be satisfied for the journal to be considered hot: https://www.sqlite.org/atomiccommit.html#_hot_rollback_journ... I believe they would all be satisfied. I don't see any mention of checking IDs. Not saying you're wrong - I think the docs could very well be wrong - but could you provide a citation for that behavior? | | |
| ▲ | NewsaHackO 4 days ago | parent [-] | | Please read the entire document instead of just picking out sections; you will then be able to see where your misconceptions are occurring. You have attempted to make this same point three times, so I will say it for a third time; that section is about CORRUPTED databases, not database that are consistent after fsync. | | |
| ▲ | aktiur 4 days ago | parent | next [-] | | The text in this document also directly contradicts what you're saying. Put another way: the presence of a hot journal is how SQLite determines the database might be corrupted. https://sqlite.org/lockingv3.html#hot_journals | |
| ▲ | agwa 4 days ago | parent | prev [-] | | I read the whole document. It doesn't mention IDs anywhere. If you're not going to provide citations for your claims, yet criticize me for "picking out sections" when I provide citations, then continuing this conversation won't be productive. |
|
| |
| ▲ | int_19h 4 days ago | parent | prev [-] | | > Otherwise, the database knows from the ID of the journal not to reapply an already applied rollback journal. But it's not "already applied", that's the whole point. The transaction was committed, not rolled back, so the changes in transaction were persisted to disk and the journal was just thrown away. If it magically reappears again, how is SQLite supposed to know that it needs to be discarded again rather than applied to revert the change? |
|
| |
| ▲ | agwa 4 days ago | parent | prev [-] | | I'm pretty sure I understand what durability means; the definition is not hard - https://en.wikipedia.org/wiki/Durability_(database_systems) It's possible I've misunderstood how DELETE mode works. But here's the thing - I shouldn't have to understand how DELETE mode works to know what SQLite setting I need to use to get durability. Unfortunately, the SQLite docs don't clearly say what guarantees each setting provides - instead they talk about about what SQLite does when you choose the setting, leaving the reader to try to figure out if those actions provide durability. And the docs really make it seem like you need synchronous=EXTRA in DELETE mode to get durability, for the reasons explained above. This is a docs problem; I'm not saying SQLite is buggy. | | |
| ▲ | NewsaHackO 4 days ago | parent [-] | | This may just be a expectations difference then. I would fully expect a developer to read the docs and know how a settings works to know what guarantees it has. |
|
|
|