Database Management Systems
Database Management Systems: A Complete Beginner’s Guide
Table of Contents
- What is a Database?
- What is a DBMS?
- Types of Databases
- Database Architecture: The Three-Schema Model
- Data Models
- Entity-Relationship (ER) Model
- Relational Model: Tables, Keys, and Constraints
- Structured Query Language (SQL)
- Normalization
- Transactions and ACID Properties
- Concurrency Control
- Indexing and Query Optimization
- Views
- Stored Procedures and Triggers
- NoSQL vs Relational Databases
- Backup, Recovery, and Security
- Summary and What to Learn Next
1. What is a Database?
A database is an organized collection of structured data stored electronically in a computer system. Think of it as a digital filing cabinet — instead of paper folders scattered across a room, all your information lives in one place, organized so that finding or updating it takes seconds rather than hours.
Before databases existed, organizations used flat files — plain text or binary files on disk — to store data. This worked for small amounts of data but quickly became unmanageable. If a company stored customer records in one file and order records in another, there was no easy way to link them. Updating a customer’s address meant finding every file that mentioned that customer and changing each one manually. Worse, two programs might read the same file at the same time and overwrite each other’s changes.
Databases were invented to solve these problems. They provide:
- Persistent storage: Data survives when the program or computer is turned off.
- Organized structure: Data is arranged so it can be queried efficiently.
- Shared access: Multiple users and applications can read and write data simultaneously without corrupting it.
- Data integrity: Rules enforce that only valid data enters the system.
- Reduced redundancy: The same piece of information is stored once, not copied everywhere.
2. What is a DBMS?
A Database Management System (DBMS) is the software layer that sits between the user (or application) and the raw data stored on disk. Users never manipulate files directly. Instead, they issue commands to the DBMS, which handles the complexities of reading, writing, locking, recovering, and organizing the data.
Well-known DBMS products include MySQL, PostgreSQL, Oracle Database, Microsoft SQL Server, SQLite, and MongoDB.
Why Use a DBMS Instead of Files?
| Problem with Files | How a DBMS Solves It |
|---|---|
| Data redundancy | Normalization stores each fact once |
| Inconsistency | Constraints prevent conflicting data |
| No concurrent access control | Transaction management and locking |
| No query language | SQL lets you ask complex questions in one line |
| No security | Users, roles, and permissions |
| No recovery after crash | Logging and recovery mechanisms |
Functions of a DBMS
A DBMS performs several core functions:
Data Definition: The DBMS lets you define the structure of your data — which tables exist, what columns they have, and what data types and constraints apply.
Data Manipulation: You can insert, update, delete, and retrieve data.
Data Security: The DBMS controls who can see or change which data.
Data Integrity: Constraints ensure data follows rules (e.g., a student ID cannot be null, a price cannot be negative).
Concurrency Control: When multiple users access the database simultaneously, the DBMS ensures they do not interfere with each other.
Backup and Recovery: If the system crashes, the DBMS can restore the database to a consistent state.
3. Types of Databases
Databases come in several flavors, each suited to different problems.
Relational Databases (RDBMS)
The relational model organizes data into tables (called relations). Each table has rows (records) and columns (attributes). Tables relate to each other through keys. This is the dominant model in the industry and the main focus of this guide.
Examples: MySQL, PostgreSQL, Oracle, SQL Server, SQLite.
Hierarchical Databases
Data is organized in a tree structure with parent-child relationships. Every child node has exactly one parent. These are mostly legacy systems from the 1960s–1970s, still found in banking and telecommunications mainframes.
Network Databases
An extension of the hierarchical model where a child can have multiple parents, forming a graph structure. Also largely legacy.
Object-Oriented Databases
Data is stored as objects, similar to how object-oriented programming languages represent data. Useful for applications like CAD systems where complex, nested data structures are common.
NoSQL Databases
A broad category of databases that do not use the traditional relational table model. They trade some relational guarantees for horizontal scalability and flexibility. Subtypes include:
- Document stores (MongoDB, CouchDB): Data stored as JSON-like documents.
- Key-Value stores (Redis, DynamoDB): Simple pairs of a key and its value, very fast.
- Column-family stores (Cassandra, HBase): Data stored in column groups, good for analytics.
- Graph databases (Neo4j): Data stored as nodes and edges, ideal for social networks or recommendation systems.
NoSQL is covered in more depth in section 15.
4. Database Architecture: The Three-Schema Model
ANSI proposed the Three-Schema Architecture to achieve data independence — the ability to change one layer of the system without affecting the others. The three levels are:
External Schema (View Level)
This is what individual users or applications see. Different users can have different views of the same underlying data. A bank teller might see customer names and account balances, while a manager sees summary statistics. Neither sees the raw physical storage.
Conceptual Schema (Logical Level)
This describes the entire logical structure of the database — all the tables, columns, relationships, and constraints — without worrying about how data is physically stored on disk. Database administrators and designers work at this level.
Internal Schema (Physical Level)
This describes how data is actually stored on disk: file formats, index structures, storage paths, block sizes, and access methods. This level is managed by the DBMS itself and storage administrators.
Why This Matters
The three-schema model gives you two important properties:
Physical Data Independence: You can change how data is stored on disk (move to a faster SSD, reorganize file layout) without changing the logical schema or rewriting applications.
Logical Data Independence: You can change the logical schema (add a column, restructure a table) without rewriting every application that uses the database, as long as the external views remain the same.
5. Data Models
A data model is a conceptual framework that describes how data is structured, related, and constrained. Before designing a database, you choose a data model.
Relational Data Model
The relational model, proposed by Edgar F. Codd in 1970, represents data as a set of relations (tables). It is based on mathematical set theory. The key ideas are:
- Data is organized into tables with rows and columns.
- Each row is unique (no duplicate rows).
- The order of rows and columns does not matter.
- Relationships between tables are expressed through shared column values (keys).
This model has dominated the industry for over 50 years because it is simple, mathematically rigorous, and maps well to the SQL query language.
Entity-Relationship Model
The ER model is a high-level conceptual model used during the design phase, before you create actual tables. It uses entities (things), attributes (properties of things), and relationships (connections between things) to describe the problem domain. It is then translated into a relational schema.
Hierarchical and Network Models
These older models represent data as trees or graphs respectively. They are covered here only for historical context; you will rarely work with them in a new project.
6. Entity-Relationship (ER) Model
The ER model is the blueprint of your database. You draw an ER diagram before writing a single line of SQL. It helps you understand the problem before you implement a solution.
Entities
An entity is a real-world object or concept about which you want to store data. Examples include Student, Course, Employee, Product, and Order.
Entities are represented as rectangles in an ER diagram.
Attributes
An attribute is a property of an entity. A Student entity might have attributes: StudentID, Name, DateOfBirth, and Email.
Types of attributes:
- Simple attribute: Cannot be divided further. Example: Age.
- Composite attribute: Can be broken down into smaller parts. Example: Name can be split into FirstName and LastName.
- Single-valued attribute: Holds one value. Example: DateOfBirth.
- Multi-valued attribute: Can hold multiple values. Example: PhoneNumbers (a student may have several).
- Derived attribute: Calculated from another attribute. Example: Age can be derived from DateOfBirth.
Relationships
A relationship describes how two or more entities are associated. Examples: A Student enrolls in a Course. An Employee works for a Department.
Relationships have a cardinality that describes how many instances of one entity relate to instances of another:
One-to-One (1:1): Each instance of entity A relates to at most one instance of entity B, and vice versa. Example: A person has one passport; a passport belongs to one person.
One-to-Many (1:N): One instance of entity A relates to many instances of entity B, but each instance of B relates to only one instance of A. Example: A department has many employees, but each employee belongs to one department.
Many-to-Many (M:N): Many instances of A relate to many instances of B. Example: A student enrolls in many courses; a course has many students enrolled.
Weak Entities
A weak entity cannot be uniquely identified by its own attributes alone. It depends on a strong (owner) entity for its identity. Example: An OrderItem cannot exist without an Order. The OrderItem uses the OrderID from the Order plus its own LineNumber to form its identity.
ER Diagram to Relational Schema
After drawing the ER diagram, you convert it to relational tables:
- Each strong entity becomes a table.
- Each weak entity becomes a table that includes the primary key of its owner entity.
- Each many-to-many relationship becomes a separate junction table containing the primary keys of both related entities.
- One-to-many relationships are represented by placing a foreign key in the “many” side table.
7. Relational Model: Tables, Keys, and Constraints
Now that you understand the conceptual model, it is time to understand how data is actually structured in a relational database.
Terminology
| Formal Term | Common Term | Meaning |
|---|---|---|
| Relation | Table | A set of rows sharing the same structure |
| Tuple | Row / Record | One instance of data |
| Attribute | Column / Field | A named property of the relation |
| Domain | Data Type | The set of allowed values for an attribute |
| Degree | Number of Columns | How many attributes a relation has |
| Cardinality | Number of Rows | How many tuples a relation has |
Keys
Keys are the most important concept in the relational model. They uniquely identify rows and establish relationships between tables.
Super Key
A super key is any set of attributes that uniquely identifies a row in a table. A super key can contain extra attributes that are not strictly necessary for uniqueness.
For a Student table with columns (StudentID, Name, Email):
- {StudentID} is a super key.
- {Email} is a super key (if emails are unique).
- {StudentID, Name} is a super key (StudentID alone already uniquely identifies a row, Name is extra).
- {StudentID, Name, Email} is a super key.
Candidate Key
A candidate key is a minimal super key — a super key with no unnecessary attributes. If you remove any attribute from a candidate key, it no longer uniquely identifies rows.
From the example above:
- {StudentID} is a candidate key.
- {Email} is a candidate key.
- {StudentID, Name} is NOT a candidate key because {StudentID} alone works.
A table can have multiple candidate keys. Each one is a valid option for the primary key.
Primary Key
The primary key is the candidate key chosen by the database designer to be the official identifier for a table. Rules:
- A primary key cannot be NULL.
- A primary key must be unique.
- A table has exactly one primary key (though it can span multiple columns — a composite primary key).
Alternate Key
Every candidate key that was not chosen as the primary key is called an alternate key. They are still unique identifiers, just not the chosen one.
Foreign Key
A foreign key is a column (or set of columns) in one table that references the primary key of another table. It establishes and enforces a link between the two tables.
1
2
3
4
5
6
7
8
9
10
11
CREATE TABLE Department (
DeptID INT PRIMARY KEY,
DeptName VARCHAR(100) NOT NULL
);
CREATE TABLE Employee (
EmpID INT PRIMARY KEY,
Name VARCHAR(100) NOT NULL,
DeptID INT,
FOREIGN KEY (DeptID) REFERENCES Department(DeptID)
);
Here, Employee.DeptID is a foreign key referencing Department.DeptID. You cannot insert an employee with a DeptID that does not exist in the Department table. This is called referential integrity.
Composite Key
A composite key is a primary key made of two or more columns. Neither column alone is unique, but together they are.
1
2
3
4
5
6
CREATE TABLE Enrollment (
StudentID INT,
CourseID INT,
EnrollDate DATE,
PRIMARY KEY (StudentID, CourseID)
);
A student can enroll in many courses, and a course can have many students, but each (StudentID, CourseID) combination must be unique.
Constraints
Constraints are rules the DBMS enforces to maintain data integrity.
NOT NULL: The column must have a value; it cannot be left empty.
1
Name VARCHAR(100) NOT NULL
UNIQUE: All values in the column must be distinct.
1
Email VARCHAR(255) UNIQUE
PRIMARY KEY: Combines NOT NULL and UNIQUE. Uniquely identifies each row.
FOREIGN KEY: Ensures a value exists in the referenced table.
CHECK: Validates that a value meets a condition.
1
Salary DECIMAL(10,2) CHECK (Salary >= 0)
DEFAULT: Provides a default value if none is specified.
1
Status VARCHAR(20) DEFAULT 'Active'
8. Structured Query Language (SQL)
SQL is the standard language for communicating with relational databases. It is declarative — you describe what you want, not how to get it. The DBMS figures out the most efficient way to execute your query.
SQL is divided into several sublanguages:
- DDL (Data Definition Language): CREATE, ALTER, DROP — defines structure.
- DML (Data Manipulation Language): INSERT, UPDATE, DELETE, SELECT — manipulates data.
- DCL (Data Control Language): GRANT, REVOKE — controls permissions.
- TCL (Transaction Control Language): COMMIT, ROLLBACK, SAVEPOINT — manages transactions.
DDL: Creating and Modifying Structure
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
-- Create a new table
CREATE TABLE Student (
StudentID INT PRIMARY KEY,
FirstName VARCHAR(50) NOT NULL,
LastName VARCHAR(50) NOT NULL,
Email VARCHAR(100) UNIQUE,
DateOfBirth DATE,
GPA DECIMAL(3,2) CHECK (GPA >= 0.0 AND GPA <= 4.0),
DeptID INT,
FOREIGN KEY (DeptID) REFERENCES Department(DeptID)
);
-- Add a new column to an existing table
ALTER TABLE Student ADD COLUMN Phone VARCHAR(15);
-- Remove a column
ALTER TABLE Student DROP COLUMN Phone;
-- Delete the entire table and all its data
DROP TABLE Student;
DML: Inserting Data
1
2
3
4
5
6
7
8
9
-- Insert a single row
INSERT INTO Student (StudentID, FirstName, LastName, Email, GPA, DeptID)
VALUES (1001, 'Nafiul', 'Islam', 'nafiul@example.com', 3.85, 10);
-- Insert multiple rows at once
INSERT INTO Student (StudentID, FirstName, LastName, Email, GPA, DeptID)
VALUES
(1002, 'naba', 'Rahman', 'naba@example.com', 3.90, 10),
(1003, 'Karim', 'Hossain', 'karim@example.com', 3.50, 20);
DML: Querying Data (SELECT)
The SELECT statement is the most powerful and commonly used SQL command.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
-- Select all columns from a table
SELECT * FROM Student;
-- Select specific columns
SELECT FirstName, LastName, GPA FROM Student;
-- Filter rows with WHERE
SELECT FirstName, LastName, GPA
FROM Student
WHERE GPA >= 3.75;
-- Sort results
SELECT FirstName, LastName, GPA
FROM Student
ORDER BY GPA DESC;
-- Limit the number of results
SELECT FirstName, LastName
FROM Student
ORDER BY GPA DESC
LIMIT 5;
Filtering with WHERE
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
-- Equality
WHERE DeptID = 10
-- Range
WHERE GPA BETWEEN 3.0 AND 4.0
-- Pattern matching (% matches any sequence, _ matches one character)
WHERE Email LIKE '%@gmail.com'
-- List of values
WHERE DeptID IN (10, 20, 30)
-- NULL check (never use = NULL, always IS NULL)
WHERE DateOfBirth IS NULL
-- Combining conditions
WHERE DeptID = 10 AND GPA > 3.5
WHERE DeptID = 10 OR DeptID = 20
WHERE NOT (GPA < 2.0)
Aggregate Functions
Aggregate functions operate on a set of rows and return a single value.
1
2
3
4
5
6
7
8
-- Count of all rows
SELECT COUNT(*) FROM Student;
-- Count of non-null values in a column
SELECT COUNT(Email) FROM Student;
-- Average, sum, minimum, maximum
SELECT AVG(GPA), SUM(GPA), MIN(GPA), MAX(GPA) FROM Student;
GROUP BY and HAVING
GROUP BY groups rows that have the same value in specified columns, allowing aggregates to be computed per group.
1
2
3
4
-- Average GPA per department
SELECT DeptID, AVG(GPA) AS AvgGPA, COUNT(*) AS StudentCount
FROM Student
GROUP BY DeptID;
HAVING filters groups after aggregation. WHERE filters rows before aggregation.
1
2
3
4
5
-- Only show departments where average GPA is above 3.5
SELECT DeptID, AVG(GPA) AS AvgGPA
FROM Student
GROUP BY DeptID
HAVING AVG(GPA) > 3.5;
DML: Updating and Deleting Data
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
-- Update specific rows
UPDATE Student
SET GPA = 3.95
WHERE StudentID = 1001;
-- Update multiple columns
UPDATE Student
SET GPA = 4.0, Email = 'nafiul_new@example.com'
WHERE StudentID = 1001;
-- Delete specific rows
DELETE FROM Student WHERE StudentID = 1003;
-- Delete all rows (table remains, data is gone)
DELETE FROM Student;
JOIN: Combining Tables
A JOIN combines rows from two or more tables based on a related column.
Assume these tables:
1
2
3
4
5
6
7
8
9
10
CREATE TABLE Department (
DeptID INT PRIMARY KEY,
DeptName VARCHAR(100)
);
CREATE TABLE Student (
StudentID INT PRIMARY KEY,
Name VARCHAR(100),
DeptID INT REFERENCES Department(DeptID)
);
INNER JOIN: Returns only rows where there is a match in both tables.
1
2
3
SELECT s.Name, d.DeptName
FROM Student s
INNER JOIN Department d ON s.DeptID = d.DeptID;
LEFT JOIN (LEFT OUTER JOIN): Returns all rows from the left table, and the matching rows from the right table. If there is no match, NULLs fill the right side.
1
2
3
4
-- Shows all students, even those not assigned to a department
SELECT s.Name, d.DeptName
FROM Student s
LEFT JOIN Department d ON s.DeptID = d.DeptID;
RIGHT JOIN: Returns all rows from the right table and matching rows from the left.
1
2
3
4
-- Shows all departments, even those with no students
SELECT s.Name, d.DeptName
FROM Student s
RIGHT JOIN Department d ON s.DeptID = d.DeptID;
FULL OUTER JOIN: Returns all rows from both tables, with NULLs where there is no match.
1
2
3
SELECT s.Name, d.DeptName
FROM Student s
FULL OUTER JOIN Department d ON s.DeptID = d.DeptID;
SELF JOIN: A table joined with itself. Useful for hierarchical data.
1
2
3
4
-- Find employees and their managers (both in the same Employee table)
SELECT e.Name AS Employee, m.Name AS Manager
FROM Employee e
LEFT JOIN Employee m ON e.ManagerID = m.EmpID;
CROSS JOIN: Returns the Cartesian product — every row of the first table paired with every row of the second. If table A has 3 rows and B has 4 rows, the result has 12 rows.
1
2
3
SELECT s.Name, c.CourseName
FROM Student s
CROSS JOIN Course c;
Subqueries
A subquery is a query nested inside another query.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
-- Find students with GPA above the average GPA
SELECT Name, GPA
FROM Student
WHERE GPA > (SELECT AVG(GPA) FROM Student);
-- Find students enrolled in the 'Database Systems' course
SELECT Name
FROM Student
WHERE StudentID IN (
SELECT StudentID
FROM Enrollment
WHERE CourseID = (
SELECT CourseID FROM Course WHERE CourseName = 'Database Systems'
)
);
Set Operations
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
-- UNION: combines results, removes duplicates
SELECT Name FROM Student
UNION
SELECT Name FROM Alumni;
-- UNION ALL: combines results, keeps duplicates
SELECT Name FROM Student
UNION ALL
SELECT Name FROM Alumni;
-- INTERSECT: only rows that appear in both
SELECT DeptID FROM Student
INTERSECT
SELECT DeptID FROM Faculty;
-- EXCEPT (or MINUS in Oracle): rows in first query but not second
SELECT DeptID FROM Department
EXCEPT
SELECT DeptID FROM Student;
9. Normalization
Normalization is the process of organizing a relational database to reduce data redundancy and improve data integrity. It involves decomposing tables according to a set of formal rules called normal forms.
Why Normalization Matters
Consider this poorly designed table:
| OrderID | CustomerName | CustomerEmail | ProductName | ProductPrice | Quantity |
|---|---|---|---|---|---|
| 1 | Nafiul | n@x.com | Laptop | 75000 | 1 |
| 2 | Nafiul | n@x.com | Mouse | 500 | 2 |
| 3 | naba | a@x.com | Laptop | 75000 | 1 |
Problems:
- Insertion anomaly: You cannot add a product to the catalog unless someone orders it.
- Update anomaly: If the Laptop price changes, you must update every row that mentions it. Miss one, and the data is inconsistent.
- Deletion anomaly: If naba cancels her order, you lose all information about the Laptop product.
Normalization fixes these problems by splitting this table into smaller, more focused tables.
First Normal Form (1NF)
A table is in 1NF if:
- Every column contains atomic (indivisible) values.
- Every column contains values of a single type.
- Each row is unique (there is a primary key).
Violation example: A PhoneNumbers column that stores “01711111111, 01822222222” in one cell violates 1NF because the value is not atomic.
Fix: Create a separate table for phone numbers with a foreign key back to the customer.
Second Normal Form (2NF)
A table is in 2NF if:
- It is in 1NF.
- Every non-key attribute is fully functionally dependent on the entire primary key (not just part of it).
This only matters when the primary key is composite.
Violation example: In an Enrollment table with primary key (StudentID, CourseID):
| StudentID | CourseID | StudentName | CourseName | Grade | |———–|———-|————-|————|——-|
- StudentName depends only on StudentID, not on the full key.
- CourseName depends only on CourseID, not on the full key.
- Only Grade depends on the full (StudentID, CourseID) key.
Fix: Split into three tables: Student(StudentID, StudentName), Course(CourseID, CourseName), Enrollment(StudentID, CourseID, Grade).
Third Normal Form (3NF)
A table is in 3NF if:
- It is in 2NF.
- No non-key attribute is transitively dependent on the primary key through another non-key attribute.
Violation example: In an Employee table:
| EmpID | EmpName | DeptID | DeptName |
- EmpID -> DeptID (EmpID determines DeptID).
- DeptID -> DeptName (DeptID determines DeptName).
- Therefore, EmpID -> DeptName transitively. DeptName depends on EmpID through DeptID.
Fix: Split into Employee(EmpID, EmpName, DeptID) and Department(DeptID, DeptName).
Boyce-Codd Normal Form (BCNF)
BCNF is a stricter version of 3NF. A table is in BCNF if, for every functional dependency X -> Y, X must be a super key of the table.
BCNF handles certain anomalies that 3NF does not. In practice, if a table is in 3NF and has only one candidate key, it is also in BCNF.
Fourth Normal Form (4NF)
A table is in 4NF if it is in BCNF and has no multi-valued dependencies. A multi-valued dependency exists when one attribute determines multiple independent multi-valued attributes.
Example: If an Employee can have multiple skills and multiple languages, and skills and languages are independent of each other:
| EmpID | Skill | Language |
|---|---|---|
| 1 | SQL | English |
| 1 | SQL | Bengali |
| 1 | Python | English |
| 1 | Python | Bengali |
Fix: Split into Employee_Skills(EmpID, Skill) and Employee_Languages(EmpID, Language).
Functional Dependencies
A functional dependency A -> B means that knowing the value of A uniquely determines the value of B. This concept underlies all normal forms.
- If StudentID -> Name, then knowing the StudentID tells you exactly which name belongs to that student.
- Functional dependencies are discovered from the real-world semantics of data, not just from looking at a current snapshot of the table.
Denormalization
Sometimes, fully normalized databases are deliberately denormalized to improve read performance. If a query needs to join five tables every time it runs, it may be faster to store some redundant data and avoid the joins. This is a conscious trade-off: you accept some redundancy and potential inconsistency risk in exchange for speed. Data warehouses and analytics systems often use denormalized schemas (star schema, snowflake schema) for this reason.
10. Transactions and ACID Properties
What is a Transaction?
A transaction is a unit of work that is treated as a single, indivisible operation. A transaction consists of one or more database operations (reads and writes) that must all succeed or all fail together.
The classic example is a bank transfer: to move money from Account A to Account B, you need to:
- Deduct the amount from Account A.
- Add the amount to Account B.
If step 1 succeeds but step 2 fails (due to a system crash, for example), the money disappears. This is unacceptable. A transaction ensures that either both steps happen, or neither does.
ACID Properties
ACID stands for Atomicity, Consistency, Isolation, and Durability. These four properties guarantee that database transactions are processed reliably.
Atomicity
Atomicity means “all or nothing.” A transaction is treated as a single atomic unit. Either all of its operations are committed to the database, or none of them are.
If a transaction fails midway, the DBMS rolls back all changes made so far, returning the database to the state it was in before the transaction started.
1
2
3
4
5
6
7
8
9
BEGIN TRANSACTION;
UPDATE Account SET Balance = Balance - 5000 WHERE AccountID = 'A101';
UPDATE Account SET Balance = Balance + 5000 WHERE AccountID = 'B202';
COMMIT; -- Both updates are permanently saved
-- If an error occurs:
ROLLBACK; -- Both updates are undone
Consistency
Consistency means a transaction brings the database from one valid state to another valid state. All integrity constraints, rules, and triggers are satisfied before and after every transaction.
If a transaction would violate a constraint (e.g., setting a balance below zero when there is a CHECK constraint), the DBMS rejects it and rolls back.
Isolation
Isolation means that concurrent transactions execute as if they were running one at a time. The intermediate state of a transaction is not visible to other transactions.
Without isolation, you could have dirty reads (reading uncommitted data from another transaction), non-repeatable reads (reading the same row twice within a transaction and getting different results), or phantom reads (a query returns different rows the second time because another transaction added or deleted rows).
SQL defines four isolation levels (from weakest to strongest):
| Isolation Level | Dirty Read | Non-Repeatable Read | Phantom Read |
|---|---|---|---|
| Read Uncommitted | Possible | Possible | Possible |
| Read Committed | Prevented | Possible | Possible |
| Repeatable Read | Prevented | Prevented | Possible |
| Serializable | Prevented | Prevented | Prevented |
Higher isolation levels prevent more anomalies but reduce concurrency (more locking, lower throughput).
Durability
Durability means that once a transaction is committed, its changes are permanent — even if the system crashes immediately afterward. The DBMS achieves this through write-ahead logging (WAL): changes are written to a log on stable storage before being applied to the actual data files.
Transaction States
A transaction goes through several states during its lifecycle:
- Active: The transaction is executing.
- Partially committed: The final operation has been executed but not yet written to disk.
- Committed: The transaction has been successfully committed; changes are permanent.
- Failed: An error has occurred and the transaction cannot proceed normally.
- Aborted: The transaction has been rolled back and the database is restored to its previous state.
Savepoints
A savepoint is a marker within a transaction that lets you roll back to a specific point rather than the very beginning.
1
2
3
4
5
6
7
8
9
10
11
12
BEGIN TRANSACTION;
INSERT INTO Orders (OrderID, CustomerID) VALUES (500, 1001);
SAVEPOINT after_order;
INSERT INTO OrderItems (OrderID, ProductID, Qty) VALUES (500, 201, 3);
-- Suppose this second insert fails
ROLLBACK TO SAVEPOINT after_order; -- Only the OrderItems insert is undone
COMMIT; -- The Orders insert is still committed
11. Concurrency Control
When multiple users access the database simultaneously, the DBMS must manage their interactions to prevent conflicts.
The Problem: Concurrency Anomalies
Lost Update: Two transactions read the same value, both modify it, and the second write overwrites the first.
Transaction T1 and T2 both read a stock quantity of 100. T1 subtracts 10 and writes 90. T2 subtracts 20 and writes 80. The final quantity is 80, but it should be 70 (100 - 10 - 20). T1’s update was lost.
Dirty Read: Transaction T1 reads data written by T2, but T2 is later rolled back. T1 made decisions based on data that never officially existed.
Non-Repeatable Read: T1 reads a row, T2 updates that row and commits, T1 reads the same row again and gets a different value.
Phantom Read: T1 runs a query and gets a set of rows. T2 inserts new rows matching T1’s query and commits. T1 runs the same query again and gets additional rows that did not exist before.
Locking
The most common concurrency control mechanism is locking.
Shared Lock (S-Lock): Multiple transactions can hold a shared lock on the same item simultaneously. A shared lock allows reading but not writing.
Exclusive Lock (X-Lock): Only one transaction can hold an exclusive lock on an item. An exclusive lock is required for writing.
A transaction must acquire a shared lock before reading and an exclusive lock before writing. If another transaction holds an incompatible lock, the requesting transaction must wait.
Two-Phase Locking (2PL)
Two-phase locking is a protocol that guarantees serializability (the gold standard of isolation):
Phase 1 — Growing Phase: A transaction acquires all the locks it needs. It may not release any locks during this phase.
Phase 2 — Shrinking Phase: A transaction releases locks. It may not acquire any new locks during this phase.
The point at which a transaction acquires its last lock is called the lock point.
Deadlock
A deadlock occurs when two or more transactions are waiting for each other to release locks, and none can proceed.
- T1 holds a lock on resource A and waits for B.
- T2 holds a lock on resource B and waits for A.
Both are stuck forever. The DBMS detects deadlocks using a wait-for graph and resolves them by aborting one of the transactions (the victim) and rolling it back.
Optimistic Concurrency Control
Instead of locking before accessing data, optimistic control lets transactions run freely and only checks for conflicts at commit time. If a conflict is found, the transaction is rolled back and retried. This works well when conflicts are rare.
Timestamp-Based Concurrency Control
Each transaction is assigned a unique timestamp when it begins. The DBMS uses these timestamps to decide the order in which conflicting operations should execute, ensuring serializability without locks.
12. Indexing and Query Optimization
As a database grows to millions of rows, a simple sequential scan through every row becomes too slow. Indexes solve this problem.
What is an Index?
An index is a separate data structure that the DBMS maintains alongside the table. It stores a copy of selected column values in a way that allows fast lookup, much like the index at the back of a book.
Without an index, a query like WHERE Email = 'nafiul@example.com' must scan every row. With an index on Email, the DBMS jumps directly to the matching row.
1
2
3
4
5
6
7
8
9
10
11
-- Create an index
CREATE INDEX idx_student_email ON Student(Email);
-- Create a composite index
CREATE INDEX idx_emp_dept_salary ON Employee(DeptID, Salary);
-- Create a unique index (also enforces uniqueness)
CREATE UNIQUE INDEX idx_student_email_unique ON Student(Email);
-- Drop an index
DROP INDEX idx_student_email;
Types of Indexes
B-Tree Index: The default index type in most DBMS. Organized as a balanced tree, it supports equality lookups, range queries (>, <, BETWEEN), and ORDER BY efficiently. Suitable for columns with many distinct values.
Hash Index: Uses a hash function to map values to bucket locations. Extremely fast for exact equality lookups (=), but useless for range queries. MySQL’s MEMORY engine uses hash indexes.
Bitmap Index: Stores a bitmap for each distinct value in a column. Efficient for columns with low cardinality (few distinct values), such as Gender or Status. Common in data warehouses.
Full-Text Index: Optimized for searching within text, supporting features like word stemming and relevance ranking. Used for search functionality.
Clustered Index: The table data itself is physically sorted and stored in the order of the index. There can be only one clustered index per table (because the data can only be physically ordered one way). In SQL Server, the primary key is the clustered index by default.
Non-Clustered Index: A separate structure that points to the actual data rows. A table can have many non-clustered indexes.
The Cost of Indexes
Indexes speed up reads but slow down writes. Every INSERT, UPDATE, and DELETE must update not only the table but also all its indexes. Creating too many indexes on a heavily written table hurts write performance. Indexes also consume additional disk space.
Query Optimization
The DBMS’s query optimizer analyzes your SQL query and determines the most efficient execution plan. It considers available indexes, table statistics (row counts, value distributions), and the cost of different join strategies.
You can examine the execution plan using EXPLAIN:
1
2
3
4
EXPLAIN SELECT s.Name, d.DeptName
FROM Student s
INNER JOIN Department d ON s.DeptID = d.DeptID
WHERE s.GPA > 3.5;
The output shows whether the DBMS is using indexes, how many rows it estimates to scan, and what join algorithm it is using. This is invaluable for diagnosing slow queries.
Tips for Writing Efficient Queries
- Avoid
SELECT *; select only the columns you need. - Use indexes on columns that appear in WHERE, JOIN ON, and ORDER BY clauses.
- Avoid applying functions to indexed columns in WHERE clauses (
WHERE YEAR(DateOfBirth) = 2000cannot use an index on DateOfBirth; use a range instead). - Use JOINs instead of correlated subqueries where possible.
- Filter early: apply WHERE conditions before joining large result sets.
13. Views
A view is a virtual table defined by a stored SQL query. It does not store data itself; it dynamically executes the underlying query whenever accessed.
1
2
3
4
5
6
7
8
9
10
11
-- Create a view
CREATE VIEW HighAchieverStudents AS
SELECT StudentID, FirstName, LastName, GPA, DeptID
FROM Student
WHERE GPA >= 3.75;
-- Query a view exactly like a table
SELECT * FROM HighAchieverStudents WHERE DeptID = 10;
-- Drop a view
DROP VIEW HighAchieverStudents;
Why Use Views?
Simplicity: Complex multi-table joins can be wrapped in a view so that users query a simple virtual table instead of writing the join every time.
Security: You can grant users access to a view without giving them access to the underlying tables. A view might show only non-sensitive columns (hiding salary, for example).
Logical Independence: If the underlying table structure changes, you can update the view definition rather than every query that uses it.
Encapsulation: Business logic embedded in SQL (e.g., which students are “high achievers”) is defined in one place.
Updatable Views
In some cases, you can run INSERT, UPDATE, and DELETE on a view, and the DBMS translates these operations to the underlying tables. However, many views are not updatable (e.g., views with aggregates, DISTINCT, GROUP BY, or joins across tables).
Materialized Views
A materialized view physically stores the result of the query. Unlike a regular view, it does not re-execute the query every time; it returns the cached result instead. This is much faster for expensive aggregation queries run frequently.
The trade-off is that the materialized view may be stale (its data does not reflect the latest changes) until it is refreshed.
14. Stored Procedures and Triggers
Stored Procedures
A stored procedure is a named block of SQL code saved in the database and executed on demand. Instead of sending multiple SQL statements over the network, an application can call one stored procedure that runs entirely on the database server.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
-- Example (MySQL syntax)
DELIMITER //
CREATE PROCEDURE GetStudentsByDept(IN p_DeptID INT)
BEGIN
SELECT StudentID, FirstName, LastName, GPA
FROM Student
WHERE DeptID = p_DeptID
ORDER BY GPA DESC;
END //
DELIMITER ;
-- Call the procedure
CALL GetStudentsByDept(10);
Benefits of stored procedures:
- Reduced network traffic: one call instead of many round trips.
- Code reuse: write the logic once, use it everywhere.
- Security: grant EXECUTE permission on a procedure without giving direct table access.
- Performance: the DBMS can pre-compile and cache the execution plan.
Triggers
A trigger is a special stored procedure that the DBMS automatically executes in response to certain events on a table — INSERT, UPDATE, or DELETE. Triggers fire automatically without being explicitly called.
1
2
3
4
5
6
7
-- Example: Automatically set UpdatedAt timestamp on row update
CREATE TRIGGER trg_student_updated
BEFORE UPDATE ON Student
FOR EACH ROW
BEGIN
SET NEW.UpdatedAt = NOW();
END;
1
2
3
4
5
6
7
8
-- Example: Log deletions to an audit table
CREATE TRIGGER trg_student_delete_log
AFTER DELETE ON Student
FOR EACH ROW
BEGIN
INSERT INTO StudentAuditLog (StudentID, DeletedAt, DeletedBy)
VALUES (OLD.StudentID, NOW(), CURRENT_USER());
END;
Timing options:
- BEFORE trigger: fires before the operation. Used to validate or modify the data being inserted/updated.
- AFTER trigger: fires after the operation. Used for logging, auditing, or cascading changes to other tables.
Caution with triggers: Triggers can make it difficult to understand what is happening in the database because they execute invisibly. Overusing them leads to complex, hard-to-debug behavior. Use them for audit logging and data integrity enforcement, not for complex business logic.
15. NoSQL vs Relational Databases
Why NoSQL Emerged
Relational databases were designed in an era of vertical scaling (adding more power to one server) and structured data. The rise of the internet brought:
- Massive scale (billions of users, petabytes of data).
- Unstructured or semi-structured data (JSON, images, social network graphs).
- High write throughput requirements.
- Need for horizontal scaling (spreading data across thousands of commodity servers).
The strict relational model with ACID guarantees made it hard to scale horizontally. NoSQL databases relaxed some of these guarantees in exchange for scalability and flexibility.
CAP Theorem
The CAP theorem states that a distributed data store can guarantee at most two of three properties simultaneously:
Consistency: Every read receives the most recent write or an error.
Availability: Every request receives a response (not necessarily the most recent data).
Partition Tolerance: The system continues to function despite network partitions (network failures that split nodes).
Since network partitions are unavoidable in distributed systems, you must choose between consistency and availability.
- Relational databases traditionally prioritize Consistency and Availability (CA), running on a single node where partitions do not occur.
- Most NoSQL databases prioritize Availability and Partition Tolerance (AP), accepting that data may be temporarily inconsistent across nodes (eventual consistency).
NoSQL Database Types Compared
Document Stores (MongoDB):
- Data stored as JSON/BSON documents.
- Good for: content management, user profiles, catalogs.
- Weakness: No joins (data must be denormalized or fetched in multiple queries).
Key-Value Stores (Redis):
- Extremely simple: store and retrieve values by key.
- Extremely fast, often in-memory.
- Good for: caching, session management, leaderboards.
- Weakness: No complex querying.
Column-Family Stores (Cassandra):
- Data stored in rows but organized by column families.
- Optimized for writing and reading large amounts of data across many nodes.
- Good for: time-series data, event logs, IoT data.
- Weakness: Queries must be designed around known access patterns.
Graph Databases (Neo4j):
- Data stored as nodes (entities) and edges (relationships).
- Querying relationships is native and extremely fast.
- Good for: social networks, fraud detection, recommendation engines.
- Weakness: Not suited for tabular data.
When to Use What
| Scenario | Recommended |
|---|---|
| Financial transactions, banking | Relational (ACID required) |
| User profiles with flexible schema | Document store |
| High-speed caching | Key-value store |
| Social network friend-of-friend queries | Graph database |
| Time-series IoT sensor data | Column-family store |
| E-commerce with complex queries | Relational |
| Real-time analytics at massive scale | Column-family or hybrid |
16. Backup, Recovery, and Security
Backup Strategies
A backup is a copy of data that can be used to restore the database after data loss.
Full Backup: A complete copy of the entire database. Simple to restore but takes the most time and storage.
Incremental Backup: Backs up only the data that has changed since the last backup (full or incremental). Faster and smaller, but restoration requires the full backup plus all subsequent incrementals.
Differential Backup: Backs up all data changed since the last full backup. Restoration requires only the last full backup and the last differential.
Transaction Log Backup: Backs up the transaction log. Allows point-in-time recovery — you can restore the database to any specific moment, not just the time of a backup.
Recovery Mechanisms
When a system crashes, the DBMS must recover to a consistent state.
Write-Ahead Logging (WAL): Before any change is written to the actual data file, the change is first written to a log file. If the system crashes, the log is used to redo committed changes or undo uncommitted changes.
Redo: After a crash, the DBMS replays the log to re-apply all changes from committed transactions that might not have been flushed to disk yet.
Undo: The DBMS reverses the changes made by transactions that were active (not committed) at the time of the crash.
Checkpointing: Periodically, the DBMS writes a checkpoint to the log, indicating that all changes up to this point have been safely flushed to disk. Recovery only needs to start from the last checkpoint, not the beginning of the log.
Database Security
Security in a DBMS has multiple layers.
Authentication: Verifying the identity of a user before granting access. Databases support password-based authentication, OS-level authentication, and more.
Authorization: Controlling what authenticated users can do. SQL’s DCL commands manage this:
1
2
3
4
5
6
7
8
9
10
11
-- Grant a user permission to select from a table
GRANT SELECT ON Student TO 'analyst_user';
-- Grant multiple permissions
GRANT SELECT, INSERT ON Orders TO 'sales_user';
-- Grant all permissions
GRANT ALL PRIVILEGES ON DATABASE university TO 'admin_user';
-- Remove a permission
REVOKE INSERT ON Orders FROM 'sales_user';
Roles: Instead of granting permissions to individual users, you can grant permissions to a role and assign users to roles. This simplifies management.
1
2
3
CREATE ROLE data_analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO data_analyst;
GRANT data_analyst TO 'nafiul';
Encryption: Sensitive data (passwords, financial data, personal identifiers) should be encrypted at rest (on disk) and in transit (over the network). Never store passwords as plain text — use hashing algorithms like bcrypt.
SQL Injection: This is one of the most dangerous security vulnerabilities in database-driven applications. It occurs when user input is directly concatenated into a SQL query without sanitization.
Vulnerable code example (in a hypothetical application):
1
query = "SELECT * FROM Users WHERE Username = '" + userInput + "';"
If a user inputs ' OR '1'='1, the query becomes:
1
SELECT * FROM Users WHERE Username = '' OR '1'='1';
This returns all users, bypassing authentication.
Prevention: Use parameterized queries (prepared statements), which separate the SQL code from the data:
1
2
3
-- In a parameterized query, the DBMS treats userInput as pure data,
-- never as executable SQL code.
SELECT * FROM Users WHERE Username = ?;
Auditing: Database audit logs record who accessed or modified what data and when. This is essential for regulatory compliance and detecting suspicious activity.
17. Summary and What to Learn Next
You have now covered the foundational concepts of Database Management Systems. Here is a quick recap:
A database is an organized collection of data. A DBMS is the software that manages it, providing structured access, integrity, concurrency control, and recovery.
The relational model organizes data into tables with rows and columns, related by keys — primary keys for unique row identification, foreign keys for referential integrity, and candidate keys as alternative identifiers.
SQL is the language of relational databases, covering data definition (CREATE, ALTER, DROP), data manipulation (INSERT, UPDATE, DELETE, SELECT), joins for combining tables, and aggregation for summarizing data.
Normalization eliminates redundancy by decomposing tables according to functional dependencies, progressing through 1NF, 2NF, 3NF, and BCNF.
Transactions group operations into atomic units, guaranteed to be safe by the ACID properties: Atomicity, Consistency, Isolation, and Durability.
Concurrency control — through locking and protocols like two-phase locking — ensures that simultaneous transactions do not corrupt each other’s work.
Indexes accelerate query performance by providing fast lookup paths. The query optimizer uses statistics and index availability to generate efficient execution plans.
Views provide virtual tables for simplicity, security, and encapsulation. Stored procedures and triggers move logic into the database layer for reuse and automation.
NoSQL databases offer scalability and flexibility for non-relational data, governed by the CAP theorem trade-off between consistency and availability.
Backup, recovery, and security ensure that data is protected, recoverable after failure, and accessible only to authorized users.
Topics to Explore Next
Once you are comfortable with these fundamentals, the logical next steps are:
Advanced SQL: Window functions (ROW_NUMBER, RANK, LAG, LEAD), Common Table Expressions (CTEs), recursive queries, and JSON support in modern SQL databases.
Query Performance Tuning: Deep dive into execution plans, index strategies, partitioning, and caching layers.
Database Design: Practicing ER design for real-world scenarios, understanding star schema and snowflake schema for data warehouses.
Transaction Internals: Study how WAL, MVCC (Multi-Version Concurrency Control), and lock managers work inside PostgreSQL or MySQL.
Distributed Databases: Sharding, replication, consensus algorithms (Raft, Paxos), and distributed transaction protocols.
Specific DBMS: Hands-on practice with PostgreSQL (recommended for learning — feature-rich and open-source) or MySQL.
ORM and Application Integration: How application frameworks like Django, SQLAlchemy, or Hibernate abstract database access.
The best way to solidify these concepts is to build something. Create a schema for a project you care about — a library system, an e-commerce store, a university database — write the SQL from scratch, normalize it, query it, and experiment with transactions and indexes. Reading about databases and using them are two very different experiences, and the latter is where real understanding develops.
This guide covers the core theoretical and practical foundations of DBMS. With these concepts understood, you have a solid base from which to explore any database technology or deepen your expertise in any specific area.