Now that I have three days till my next exam, why not continue with my interests?




With this in mind, I dove deeper into concurrency control for databases; I did not quite get the deal about it the first time.


My primary sources of information were CMU's DBS course on YouTube and Jeremy Cole's MVCC blog, paired with some Wikipedia and textbooks.




What is Concurrency Control ...


It is the "thing" you do to make sure the "I" (Isolation) in ACID holds in your DBS.


ACID: the properties (Atomicity, Consistency, Isolation, Durability) that make sure your DBS works correctly.

So.


ACID. Ok. What next?


See, transactions are what they are: a bunch of statements that should run (or fail) as a single unit, never partially.
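To make "single unit" concrete, here is a minimal sketch using Python's built-in sqlite3 module (the table, names, and amounts are all made up for illustration): when something fails midway through a transaction, every statement in it is rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 70 WHERE name = 'alice'")
        # Simulate a failure before the credit side runs (a crash, a constraint
        # violation, anything):
        raise RuntimeError("something went wrong halfway")
        conn.execute("UPDATE accounts SET balance = balance + 70 WHERE name = 'bob'")
except RuntimeError:
    pass

# Neither update took effect: the partial debit was rolled back.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 50}
```

Without the rollback, alice would have lost 70 with bob receiving nothing, which is exactly the partial execution that transactions exist to prevent.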


SPOILER: The theory is a lot to cover, and none of it dives into multi-processor systems. It assumes a strawman architecture of sorts, one that explores single-processor systems: only one thread of execution.


So even with multiple simultaneous transactions, the OS would perform a context switch and then execute instructions from the other transaction.


I am yet to dive into the intricacies of multi-threaded systems, where instructions can execute in parallel.

Now, you can have multiple connections to your DB, and each connection could simultaneously run a transaction. The thing is, their operations could be interleaved in time, and some possible interleaved arrangements (called schedules) could produce an outcome that violates the integrity of the database (the common example is a bank transfer where some money seems to have vanished into thin air at the end).
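Here is a toy simulation of one such bad schedule (plain Python, no real database; the accounts and amounts are invented). Two transfers interleave, one credit gets clobbered, and money vanishes exactly as in the bank example:

```python
# T1 transfers 30 and T2 transfers 20, both from account A to account B.
# The steps below are one possible interleaving of their reads and writes.
balances = {"A": 100, "B": 0}

balances["A"] -= 30        # T1 debits A -> 70
balances["A"] -= 20        # T2 debits A -> 50
t1_b = balances["B"]       # T1 reads B = 0
t2_b = balances["B"]       # T2 reads B = 0 (stale once T1 writes!)
balances["B"] = t1_b + 30  # T1 writes B = 30
balances["B"] = t2_b + 20  # T2 overwrites B = 20: T1's credit is lost

# Any serial order gives A = 50, B = 50 (total 100). This schedule gives
# A = 50, B = 20: 30 units vanished into thin air.
print(balances, "total:", balances["A"] + balances["B"])
```

Run the two transactions back to back instead of interleaved and the total stays at 100; the damage comes purely from the ordering of the steps.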


Ok, so what?


Since this problem shows up for some schedules, the cause has been identified as one of (or a combination of) three conflicts: an unrepeatable read (r-w), a dirty read (w-r), or an overwrite (w-w). You can find plenty of resources on these online (YouTube, books...).
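These conflict types are mechanical enough to detect in code. A small sketch (the tuple encoding of a schedule is my own convention; the textbook rule is that two steps conflict if they come from different transactions, touch the same item, and at least one is a write):

```python
# Each step is (transaction, operation, item). This example schedule is made
# up to exhibit all three conflict flavors.
schedule = [
    ("T1", "R", "x"),
    ("T2", "W", "x"),  # vs T1's earlier read (r-w): unrepeatable read
    ("T1", "R", "x"),  # vs T2's earlier write (w-r): dirty read
    ("T1", "W", "x"),
    ("T2", "W", "x"),  # vs T1's earlier write (w-w): overwrite
]

def conflicts(schedule):
    """List pairs of steps from different transactions on the same item,
    where at least one step is a write."""
    found = []
    for i, (t1, op1, x1) in enumerate(schedule):
        for t2, op2, x2 in schedule[i + 1:]:
            if t1 != t2 and x1 == x2 and "W" in (op1, op2):
                kind = op1.lower() + "-" + op2.lower()
                found.append((kind, (t1, op1, x1), (t2, op2, x2)))
    return found

kinds = {kind for kind, _, _ in conflicts(schedule)}
print(kinds)  # all three flavors show up: r-w, w-r, and w-w
```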


Now you identify the schedules that do not face such problems. You give them fancy terms (conflict serializable, view serializable...) and figure out ways to achieve them: pessimistic (locks, two-phase locking (strict/strong)) or optimistic (timestamp ordering, snapshot isolation, MVCC), and work with their implementations. Of course, each has its own pros and cons. PostgreSQL uses MVCC with snapshot isolation, MySQL's InnoDB pairs MVCC with locking, and MongoDB's storage engine uses an optimistic scheme. You really just have to understand why a technique exists, and then you can play around with a combination (or make modifications to the proposed theory).
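For a concrete taste of the theory, here is the standard conflict-serializability test sketched in Python (steps encoded as (transaction, operation, item) tuples, my own convention): build a precedence graph with an edge Ti → Tj whenever a step of Ti conflicts with a later step of Tj, then check it for cycles. An acyclic graph means the schedule is conflict serializable.

```python
from collections import defaultdict

def precedence_graph(schedule):
    """Edge ti -> tj for each conflicting pair where ti's step comes first."""
    edges = defaultdict(set)
    for i, (ti, op1, x1) in enumerate(schedule):
        for tj, op2, x2 in schedule[i + 1:]:
            if ti != tj and x1 == x2 and "W" in (op1, op2):
                edges[ti].add(tj)
    return edges

def has_cycle(edges):
    """Depth-first search with white/gray/black coloring."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)
    def visit(n):
        color[n] = GRAY
        for m in edges[n]:
            if color[m] == GRAY or (color[m] == WHITE and visit(m)):
                return True
        color[n] = BLACK
        return False
    return any(color[n] == WHITE and visit(n) for n in list(edges))

# T2's write on x lands between T1's two steps on x: edges T1->T2 and T2->T1.
bad = [("T1", "R", "x"), ("T2", "W", "x"), ("T1", "W", "x")]
good = [("T1", "R", "x"), ("T1", "W", "x"), ("T2", "W", "x")]
print(has_cycle(precedence_graph(bad)))   # True  -> not conflict serializable
print(has_cycle(precedence_graph(good)))  # False -> conflict serializable
```

Real systems do not test schedules after the fact like this; the protocols above (2PL, timestamp ordering, MVCC) exist precisely to guarantee that only serializable-enough schedules can happen in the first place.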


Ok, great. Now I can have "correct" schedules, that is, schedules that will maintain the integrity of my database even if multiple transactions run in parallel.


Wait. I have heard of something called deadlocks; aren't they a problem? Correct: locks introduce deadlocks. Snapshot isolation introduces undo logs, and with them the overhead of memory and runtime. Timestamp ordering introduces starvation. Now you need to devise ways to solve these issues as well. Like I said, pros and cons to each. But unless you plan to implement a DBS, you can leave this for now. Understanding the tradeoffs can surely help you choose a more suitable database for your use case (take the example of Uber switching to MySQL (I have no clue why they did it specifically)), but I leave the decision to make that deeper dive up to you. Of course, I would love to discuss what you think and what I know about the topic.
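For a taste of how lock-based systems cope, here is a sketch of deadlock detection with a waits-for graph, simplified so each transaction waits on at most one other (real lock managers track sets of waiters): an edge Ti → Tj means Ti is waiting for a lock Tj holds, a cycle means deadlock, and the system aborts a victim to break it.

```python
def find_deadlock(waits_for):
    """Return a cycle of transactions if one exists, else None.

    waits_for maps each blocked transaction to the one it waits on.
    """
    def walk(start):
        path, seen = [start], {start}
        node = start
        while node in waits_for:
            node = waits_for[node]
            if node in seen:
                return path[path.index(node):]  # the cycle portion of the path
            path.append(node)
            seen.add(node)
        return None  # chain ended at a transaction that isn't waiting

    for txn in waits_for:
        cycle = walk(txn)
        if cycle:
            return cycle
    return None

# T1 waits on T2 (say, for a lock on x) and T2 waits on T1 (for y): deadlock.
print(find_deadlock({"T1": "T2", "T2": "T1"}))  # ['T1', 'T2']
print(find_deadlock({"T1": "T2"}))              # None
```

The other standard answers are deadlock prevention (wound-wait / wait-die ordering by timestamp) and simple lock timeouts; detection-plus-victim is just the easiest to show in a few lines.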




The next logical step is one of two: understanding the "D" in ACID, or the "I" for multi-processor systems.


Then distributed databases. Then LSM trees. Or whatever you like. I am even thinking of exploring Redis's native capabilities better. TOO MUCH TO SEE. JUST IN THE REALM OF DATABASES.