Role at a Glance

Tech Lead for a distributed team of 2–10 devs at Cartrack Singapore — a SaaS fleet telematics platform under Karooooo Ltd (NASDAQ: KARO) with 2.5M+ active subscribers across 20+ countries. ~80% leadership and coordination, ~20% hands-on coding. You own technical scope, timelines, code reviews, and cross-functional alignment with Product, Design, and QA.

GPS Live TrackingDriver BehaviourFuel AnalyticsGeofencingRoute OptimisationAsset TrackingDelivery ManagementMulti-tenant SaaS
Must-Have
  • Team management (5+ yrs, 2–10 devs)
  • Language & stack agnostic (15+ yrs)
  • SQL — writing, optimising, DB design
  • Linux — scripting, system management
  • Git — branching, merging, workflows
Nice-to-Have
  • PHP / C# backend
  • TypeScript + React frontend
  • PostgreSQL query optimisation
  • Docker / Kubernetes
  • CI/CD pipelines (GitLab)
Interview Readiness by Topic
Team leadership & deliveryHighest weight
SQL & database designHigh weight
Linux & scriptingMedium-high
Git workflowsMedium
Fleet/telematics domain knowledgeDifferentiator
Stack nice-to-haves (PHP, React, Docker)Bonus

Leadership

Mentoring, delivery management, code review, removing blockers.

8 Q&As

SQL & Databases

Query writing, EXPLAIN ANALYZE, index types, schema design, multi-tenancy, migrations.

10 Q&As + code

Linux

System diagnostics, scripting, log management, service troubleshooting.

5 Q&As + code

Git & Workflows

Branching, rebase vs merge, hotfixes, pull requests lifecycle, code review process, hooks & automation.

12 Q&As + code

System Design

Scalability, load balancing, DB sharding, CAP theorem, microservices vs monolith, interview framework.

10 concepts + framework

App Tech Fundamentals

Caching strategies, CDN, message queues, REST vs GraphQL, rate limiting, circuit breakers, observability.

14 concepts + Q&As

Fleet / Telematics

GPS ingestion pipeline, AWS architecture deep-dive, geofencing at scale, driver scoring, multi-tenancy.

5 system design Qs · AWS diagram

Scenarios

Incident response, scope negotiation, legacy codebases, questions to ask.

7 scenarios

Docker / K8s

Images, layers, K8s architecture, Deployments vs StatefulSets, probes.

6 Q&As + code

CI/CD & GitLab

Pipeline design, GitLab CI yaml, secrets, runners, environments.

4 Q&As + full YAML

Agile & Scrum

Roles, all five ceremonies, DoD vs acceptance criteria, Kanban vs Scrum.

6 Q&As

If You Don't Know

4-step framework, ready-to-use phrases, what never to say.

5 response templates
Mentoring & managing developers
Q1

Walk me through how you structured 1-on-1s and growth plans for junior developers.

Strong answer structure: Describe a regular cadence (weekly or bi-weekly), what you cover (blockers, growth goals, project feedback), and how you adapted it per individual.
Example: "I ran weekly 30-minute 1-on-1s. First half was a safe space for whatever was on their mind. Second half was structured — I maintained a shared doc with each developer tracking their current goal, blockers, and one skill we were actively building. For a junior dev who struggled writing testable code, we'd spend 10 minutes reviewing a PR together and I'd explain the reasoning. Within three months, their review-to-merge ratio improved significantly and they started flagging design issues before I did."
Tip: Show you adapted your style per person — not one-size-fits-all. Mention you tracked progress over time, not just had the meetings.
Q2

Describe a situation where a developer was underperforming. How did you approach it?

Key elements: Early detection, private direct feedback, a clear improvement plan with measurable goals, follow-through.
"I noticed a mid-level developer consistently delivering features with high defect rates and missing estimates. I didn't wait for a formal review — I scheduled a private conversation, framed it around patterns I observed (not personal criticism), and asked if something was blocking them. Turns out they'd been guessing at requirements rather than asking questions. We agreed on a 30-day plan: they would confirm requirements in writing before coding, and I'd review their understanding at task start. After two sprints, defect rate dropped by 60%."
Never say you "let HR handle it." Show you took ownership. Also show empathy — ask what's blocking them before concluding it's capability.
Q3

How do you handle a team member who strongly disagrees with a technical decision you've made?

"I welcome pushback — it's a signal the person cares and is thinking. I'd ask them to walk me through their concern in detail and engage with the actual argument. If they raised a valid point I'd missed, I'd change the decision and acknowledge it publicly. If I still disagreed after hearing them out, I'd explain my reasoning clearly, acknowledge their concern was legitimate, and ask them to commit to the decision for one iteration so we can evaluate the results together. People accept a decision they disagreed with much more readily when they felt genuinely heard."
Tip: Don't position yourself as always right. Interviewers want intellectual humility paired with decisiveness.
Q4

How do you balance giving autonomy to experienced devs while keeping visibility?

"For senior developers, I'd agree upfront on what I needed to see: major design decisions before implementation, blockers immediately, a brief status note at end of week. Everything else was theirs. In code review I'd focus more on architecture and knowledge-sharing rather than style nitpicks. The key is making it easy for them to pull you in rather than you pushing into their work."
Technical leadership & delivery
Q5

How do you define technical scope at the start of a project?

Framework: Requirements → constraints → design → risks → resource estimate → sign-off
"I'd start with a requirements session with Product — not just what we're building but why, who the user is, and what success looks like. Then I'd run a technical design meeting with the team to map the system — data flows, APIs, DB schema, external dependencies. I'd document this in a lightweight technical spec: problem, proposed solution, alternatives considered, timeline, dependencies, and what's explicitly out of scope. Out of scope is as important as in scope."
Tip: Mention you involve the team in scoping — not just you dictating. This shows you build shared ownership of commitments.
Q6

With less than 20% time for hands-on coding, how do you stay technically credible?

"Credibility comes from depth in code reviews, quality of technical questions in design discussions, and occasionally unblocking someone with a targeted PR rather than owning large features. I stay current by reviewing architecture decisions closely, pairing with developers on hard problems, and deliberately picking up small but complex tasks — things that touch the core of the system. I also read broadly: technical blogs, postmortems, and adjacent team architecture decisions."
Q7

What does a constructive code review look like for you?

"I look at correctness first, then readability, then edge cases, then performance where it matters. I label my comments: 'blocking' means it must change, 'suggestion' means I'd do it differently but it's not a blocker, 'question' means I want to understand their choice. I avoid vague feedback like 'this is confusing' — I say what specifically confuses me and why. For junior developers I'll often include a code snippet showing the alternative. I also try to call out what's done well — reviews shouldn't just be a list of problems."
Q8

How do you coordinate between engineering, QA, and product when requirements change mid-sprint?

"First, I'd assess impact: is this a clarification or a scope change? If it's genuine scope creep, I'd document the delta and the cost — additional days, testing time, risk — and bring it to the product owner for a decision: do we cut something else, push the timeline, or reduce quality? I'd never silently absorb scope changes. For QA, I'd keep a shared test plan updated and flag any new acceptance criteria as soon as requirements shift. Communication is the fix — not heroics."
Core query writing
JN

JOINs — types, algorithms, and common traps

INNER, LEFT, FULL OUTER — and the execution algorithms Postgres chooses

When you write a JOIN, Postgres selects an algorithm based on table sizes and indexes. Nested Loop: for each outer row, scan the inner — fast when inner is small or indexed. Hash Join: build a hash table from the smaller table, probe it per outer row — good for large unindexed tables. Merge Join: both sides must be sorted on the join key — very fast when indexes exist on both sides.
-- INNER JOIN: only rows with a match in BOTH tables
SELECT o.id, c.name, o.total
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id;

-- LEFT JOIN: all rows from left, NULLs where no match on right
-- Use to find customers who have NEVER placed an order
SELECT c.name, o.id AS order_id
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
WHERE o.id IS NULL;  -- anti-join pattern

-- FULL OUTER JOIN: all rows from both, NULLs where no match
-- Rarely used — good for reconciliation / diff reports

-- CROSS JOIN: cartesian product — every row × every row
-- Generates all possible driver-vehicle combinations for assignment
SELECT d.name, v.plate
FROM drivers d CROSS JOIN vehicles v;
Interviewer questions
QWhat's the difference between WHERE and HAVING? Why can't you use aggregates in WHERE?
WHERE filters individual rows before grouping. HAVING filters groups after GROUP BY. Aggregates like COUNT and SUM don't exist at the WHERE stage — rows haven't been grouped yet. SQL logical execution order: FROM → JOIN → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT. Use HAVING when filtering on aggregate results (e.g. "fleets with more than 50 active vehicles").
QWhat is a self-join and when would you use one?
A self-join joins a table to itself. It's used when rows in a table have a relationship to other rows in the same table — like hierarchies (manager/employee), adjacency lists, or sequential comparisons.

-- Find each employee and their manager (same employees table)
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;

-- Find consecutive GPS events for the same vehicle (gap detection)
SELECT a.vehicle_id, a.recorded_at AS t1, b.recorded_at AS t2,
       b.recorded_at - a.recorded_at AS gap
FROM gps_events a
JOIN gps_events b ON a.vehicle_id = b.vehicle_id
                  AND b.recorded_at = (
  SELECT MIN(recorded_at) FROM gps_events
  WHERE vehicle_id = a.vehicle_id AND recorded_at > a.recorded_at
);
CTE

CTEs, subqueries, and window functions

WITH clauses, RANK, ROW_NUMBER, LAG, LEAD, running totals

CTEs (WITH clauses) create named subqueries reusable in the main query — they make complex logic readable and allow recursive queries. Window functions perform calculations across rows related to the current row without collapsing them like GROUP BY does. The OVER() clause defines the window: PARTITION BY groups rows, ORDER BY sets order within the window.
-- Running total of fuel cost per vehicle (window SUM)
SELECT vehicle_id, recorded_at, fuel_cost,
       SUM(fuel_cost) OVER (
         PARTITION BY vehicle_id
         ORDER BY recorded_at
         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running_total
FROM fuel_events;

-- Compare each GPS ping speed to the previous ping (LAG)
SELECT vehicle_id, recorded_at, speed,
       LAG(speed) OVER (PARTITION BY vehicle_id ORDER BY recorded_at) AS prev_speed,
       speed - LAG(speed) OVER (PARTITION BY vehicle_id ORDER BY recorded_at) AS delta
FROM gps_events;

-- RANK vs ROW_NUMBER vs DENSE_RANK
-- ROW_NUMBER: unique 1,2,3,4 (no ties)
-- RANK:       ties get same rank, next rank skips: 1,1,3
-- DENSE_RANK: ties same rank, next rank doesn't skip: 1,1,2

-- Recursive CTE: all vehicles in a fleet hierarchy
WITH RECURSIVE fleet_tree AS (
  SELECT id, name, parent_fleet_id, 1 AS level
  FROM fleets WHERE parent_fleet_id IS NULL
  UNION ALL
  SELECT f.id, f.name, f.parent_fleet_id, ft.level + 1
  FROM fleets f
  JOIN fleet_tree ft ON f.parent_fleet_id = ft.id
)
SELECT * FROM fleet_tree ORDER BY level, name;
Interviewer questions
QWhat are window functions and how are they different from GROUP BY?
GROUP BY collapses rows into one summary row per group — you lose the individual rows. Window functions compute a value for each row using a window of related rows, without collapsing anything. AVG(salary) GROUP BY dept_id gives one row per department. AVG(salary) OVER (PARTITION BY dept_id) keeps every employee row and adds the dept average as a column — letting you compare each employee to their department average in a single query. Window functions are perfect for rankings, running totals, moving averages, and sequential comparisons with LAG/LEAD.
QWhen would you use a CTE vs a subquery vs a temporary table?
Subquery: Use for simple, one-time inline logic. No reuse needed. Postgres may or may not materialise it — the planner decides.

CTE: Use when the same result is referenced multiple times, or when you want to break complex logic into readable named steps. In PostgreSQL 12+, CTEs are inlined by default (treated like subqueries). Add MATERIALIZED keyword to force execution and reuse: WITH data AS MATERIALIZED (...).

Temporary table: Use for very large intermediate results that will be queried multiple times in the same session. Written to disk, can be indexed, persists for the session duration. Good for multi-step ETL within a stored procedure.
AGG

Aggregations, grouping sets, and pivot patterns

-- Standard aggregates
SELECT fleet_id,
       COUNT(*) AS total_vehicles,
       AVG(speed) AS avg_speed,
       MAX(speed) AS top_speed,
       PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY speed) AS p95_speed
FROM gps_events
GROUP BY fleet_id;

-- GROUPING SETS: multiple GROUP BY levels in one query
SELECT fleet_id, vehicle_id, COUNT(*) AS events
FROM gps_events
GROUP BY GROUPING SETS (
  (fleet_id, vehicle_id),  -- per vehicle subtotal
  (fleet_id),              -- per fleet subtotal
  ()                       -- grand total
);

-- FILTER clause on aggregates (cleaner than CASE WHEN)
SELECT
  COUNT(*) FILTER (WHERE speed > speed_limit) AS speeding_events,
  COUNT(*) FILTER (WHERE speed = 0)          AS idle_events,
  COUNT(*) AS total_events
FROM gps_events
WHERE fleet_id = 42;
Query optimisation
EXP

EXPLAIN ANALYZE — reading execution plans

How to diagnose slow queries systematically

EXPLAIN shows the plan Postgres intends to use. EXPLAIN ANALYZE executes the query and shows actual timings vs estimates. EXPLAIN (ANALYZE, BUFFERS) also shows cache hit vs disk read counts — the most complete picture.
-- Full diagnosis mode
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT v.plate, COUNT(*) AS events
FROM gps_events g
JOIN vehicles v ON g.vehicle_id = v.id
WHERE g.recorded_at > NOW() - INTERVAL '7 days'
GROUP BY v.plate;

-- What to look for in the output:
-- Seq Scan on gps_events → missing index on recorded_at
-- rows=1000 (estimated) vs rows=50000 (actual) → stale stats, run ANALYZE
-- Buffers: shared hit=0, read=50000 → data not cached, cold I/O is the bottleneck
-- Hash Batches: 4 → hash doesn't fit in work_mem, spilling to disk

-- Fix stale statistics
ANALYZE gps_events;

-- Increase work_mem for sort/hash operations (session-level)
SET work_mem = '256MB';
Interviewer questions
QWalk me through diagnosing and fixing a slow query in Postgres.
My systematic approach:

1. Identify the query: Use pg_stat_statements in production to find queries by total time or mean time — not just the one that felt slow right now.

2. Run EXPLAIN (ANALYZE, BUFFERS): Look for Seq Scans on large tables (missing index), row estimate mismatches (stale stats), Hash Batches > 1 (work_mem too low), and high total cost nodes.

3. Check for index-defeating patterns: Functions on indexed columns (DATE_TRUNC(col) — rewrite as range), implicit type casts (WHERE int_col = '42'), leading wildcard LIKE (LIKE '%foo' can't use B-Tree).

4. Fix and verify: Create the index, rewrite the query, or add a partial/covering index. Re-run EXPLAIN ANALYZE to confirm improvement.

5. Monitor: Set up pg_stat_statements dashboard to catch regressions before users notice.
QWhat are common query patterns that silently break index usage?
-- 1. Function on indexed column (index bypassed)
WHERE DATE_TRUNC('month', recorded_at) = '2024-01-01'  -- BAD
WHERE recorded_at >= '2024-01-01' AND recorded_at < '2024-02-01'  -- GOOD

-- 2. Implicit type cast
WHERE vehicle_id = '42'   -- vehicle_id is INT, '42' is TEXT → cast prevents index
WHERE vehicle_id = 42     -- GOOD

-- 3. Leading wildcard
WHERE plate LIKE '%ABC'   -- can't use B-Tree, full scan
WHERE plate LIKE 'ABC%'   -- can use B-Tree index

-- 4. IS NULL / IS NOT NULL (often OK but check cardinality)
-- 5. OR across different columns (use UNION instead)
WHERE fleet_id = 1 OR driver_id = 99  -- may not use indexes well
-- Rewrite as:
SELECT * FROM vehicles WHERE fleet_id = 1
UNION
SELECT * FROM vehicles WHERE driver_id = 99
QFind the top 10 vehicles by average speed in the last 30 days from a 50M-row GPS events table.
SELECT vehicle_id, ROUND(AVG(speed), 2) AS avg_speed
FROM   gps_events
WHERE  recorded_at >= NOW() - INTERVAL '30 days'
  AND  speed > 0
GROUP  BY vehicle_id
ORDER  BY avg_speed DESC
LIMIT  10;
Making it fast at scale: Partition gps_events by month — the 30-day filter hits only 1–2 partitions instead of 50M rows. Add composite index (recorded_at, vehicle_id, speed) as a covering index — no heap lookup needed. For a dashboard widget, consider a materialised view refreshed every 5 minutes rather than computing live.
IDX

Index types and strategies

B-Tree, GIN, GiST, BRIN, partial, covering, expression indexes

B-Tree (default)

Balanced tree. Handles =, <, >, BETWEEN, LIKE 'prefix%', ORDER BY. Right choice for 90% of cases. Created automatically for PRIMARY KEY and UNIQUE.

GIN

Generalised Inverted Index. Best for JSONB, arrays, full-text search (tsvector). Stores each element and maps to rows containing it. Slow to build, fast to query.

GiST

Generalised Search Tree. For geometric types and PostGIS spatial data. Critical for geofencing — point-in-polygon queries use GiST on a geometry column.

BRIN

Block Range Index. Tiny (stores min/max per block range). Only useful for very large tables that are physically sorted (append-only time-series like GPS events partitions).

-- Composite index: column ORDER matters
-- (fleet_id, recorded_at) serves: WHERE fleet_id=X AND recorded_at > Y
-- Does NOT efficiently serve: WHERE recorded_at > Y (no fleet_id prefix)
CREATE INDEX idx_gps_fleet_time ON gps_events(fleet_id, recorded_at DESC);

-- Covering index: INCLUDE avoids a second heap lookup (index-only scan)
CREATE INDEX idx_gps_cover
  ON gps_events(vehicle_id, recorded_at)
  INCLUDE (speed, lat, lng);  -- dashboard queries get everything from index

-- Partial index: smaller, faster, specific
CREATE INDEX idx_active_alerts ON alerts(vehicle_id, triggered_at)
  WHERE resolved_at IS NULL;  -- only unresolved alerts indexed

-- Expression index: index the computation, not the raw column
CREATE INDEX idx_plate_lower ON vehicles(LOWER(plate));
-- Now WHERE LOWER(plate) = 'sba1234a' uses the index

-- Non-blocking index creation (never locks table)
CREATE INDEX CONCURRENTLY idx_gps_speed ON gps_events(speed)
  WHERE speed > 0;
Interviewer questions
QWhen would you avoid adding an index? What are the downsides?
Every index has a cost: it consumes disk space and — more importantly — every INSERT, UPDATE, and DELETE must update all indexes on that table. In a high-write system like GPS event ingestion, too many indexes dramatically slows writes.

Avoid adding an index when: the table is write-heavy and the column is rarely queried; the column has very low cardinality (e.g. a boolean — the planner may prefer a seq scan anyway); the table is small enough that a seq scan is fast; or the index would duplicate an existing one. Always measure — create the index in a test environment, run EXPLAIN ANALYZE on both the write-heavy and read-heavy queries, and check if the net benefit is positive.
Database design
NRM

Normalization — 1NF through 3NF + when to denormalize

1
1NF — Atomic values, no repeating groups
Every column holds a single indivisible value. No arrays, no comma-separated lists, no "phone1, phone2, phone3" columns. Each row uniquely identifiable by a primary key. Violation example: storing "geofence_ids: 1,5,12" in a vehicle row.
2
2NF — No partial dependencies (composite PKs only)
Every non-key column must depend on the entire primary key. Only matters when PK is composite. If a table has PK (driver_id, vehicle_id) and also has driver_name — driver_name depends only on driver_id, not the full composite key. Move driver_name to a drivers table.
3
3NF — No transitive dependencies
Non-key columns must depend only on the PK, not on each other. If vehicles has (vehicle_id, fleet_id, customer_name) and customer_name depends on fleet_id (not vehicle_id), that's a transitive dependency. Extract customer_name to a fleets/customers table.

When to denormalize intentionally

Read-heavy reporting where JOINs across 5+ tables are a bottleneck. Duplicate the column (e.g. store fleet_name on trips table) or use materialised views. Accept the trade-off: writes are more complex, but reads are 10x faster.

BCNF — Boyce-Codd (stricter 3NF)

Every determinant must be a candidate key. Rarely needed in practice but good to know. Example: instructor determines room, but instructor is not the PK — violates BCNF. Most production schemas stop at 3NF.

REL

Relationships — foreign keys, cardinality, junction tables

-- One-to-Many: one fleet → many vehicles
vehicles (id, fleet_id REFERENCES fleets(id) ON DELETE CASCADE)

-- Many-to-Many: drivers ↔ vehicles over time (junction table)
driver_vehicle_assignments (
  id           BIGSERIAL PRIMARY KEY,
  driver_id    INT NOT NULL REFERENCES drivers(id),
  vehicle_id   INT NOT NULL REFERENCES vehicles(id),
  assigned_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  unassigned_at TIMESTAMPTZ,           -- NULL = currently assigned
  CONSTRAINT no_overlap EXCLUDE USING gist (
    vehicle_id WITH =,
    tstzrange(assigned_at, unassigned_at) WITH &&  -- no overlapping time ranges
  )
);

-- ON DELETE options
-- CASCADE:  deleting fleet deletes all vehicles
-- RESTRICT: error if fleet has vehicles (default-ish)
-- SET NULL: vehicle.fleet_id becomes NULL
-- NO ACTION: like RESTRICT but deferred until end of transaction
Interviewer questions
QDesign the schema for the Cartrack fleet platform — vehicles, drivers, trips, GPS events, and alerts.
-- Multi-tenancy boundary
customers   (id, name, plan, created_at)
fleets      (id, customer_id, name)              -- one customer → many fleets

-- Core operational entities
vehicles    (id, fleet_id, plate, make, model, year, vin, status)
drivers     (id, fleet_id, name, licence_no, licence_expiry, status)

-- Many-to-many assignment with time tracking
driver_vehicle_assignments (
  id, driver_id, vehicle_id, assigned_at, unassigned_at
)

-- Trips reconstructed from GPS stream
trips (id, vehicle_id, driver_id, start_time, end_time,
       distance_km, max_speed, avg_speed, fuel_litres,
       start_lat, start_lng, end_lat, end_lng)

-- High-volume: PARTITION BY RANGE(recorded_at) monthly
gps_events (id, vehicle_id, trip_id, recorded_at,
            lat, lng, speed, heading, odometer, ignition)

-- Flexible alert payload with JSONB
alerts (id, vehicle_id, driver_id, alert_type, severity,
        triggered_at, resolved_at,
        payload JSONB)   -- e.g. {"geofence_id": 5, "zone": "depot"}

Key design decisions to explain: Fleets as the tenancy boundary (not customers directly on vehicles). GPS events partitioned monthly — mandatory at 50M+ rows. JSONB on alerts for flexible payload without schema migrations every time a new alert type is added. Trips computed/reconstructed, not streamed live into a trips table.
MIG

Zero-downtime database migrations

The expand-contract pattern for production schema changes

1
Expand — add the new structure
Add the new column (nullable first), table, or index. Deploy code that writes to both old and new simultaneously. CREATE INDEX CONCURRENTLY — never locks the table.
2
Backfill — populate existing data
Run a background job in small batches (1,000–10,000 rows at a time), not one giant UPDATE that holds a lock for minutes.
3
Migrate reads — switch to new structure
Deploy code that reads from the new column/table. Old code still writes to both.
4
Contract — remove the old structure
Once all reads are on the new structure, drop the old column/table and remove dual-write code. Only now add NOT NULL constraint if needed.
Never: Rename a column directly (add new, migrate, drop old). Never run a table-wide UPDATE or ALTER TABLE on a large production table — both hold locks. Never add a NOT NULL column without a DEFAULT in one step (locks the table while backfilling).
Transactions, locking & concurrency
ISO

Isolation levels — Read Committed, Repeatable Read, Serializable

Isolation levels control what data a transaction can see while other transactions are running concurrently. Higher isolation = fewer anomalies but more lock contention and lower throughput.

Read Committed (PostgreSQL default)

Each statement sees only committed data. Two statements in the same transaction can see different data if a commit happens between them. Most practical for OLTP. Allows non-repeatable reads.

Repeatable Read

All statements in the transaction see the same snapshot (taken at transaction start). Prevents non-repeatable reads. Still allows phantom reads in most DBs — not in PostgreSQL's implementation.

Serializable

Transactions execute as if they were run one at a time, even though they run concurrently. Prevents all anomalies. PostgreSQL uses SSI (Serializable Snapshot Isolation) — no read locks, but detects conflicts and may abort a transaction. Use for financial transfers, inventory management.

Common anomalies

Dirty read: read uncommitted data (prevented at Read Committed+).
Non-repeatable read: same row returns different values in same transaction.
Phantom read: new rows appear between two reads of the same range.
Lost update: two transactions overwrite each other's changes.

-- Set isolation level for a transaction
BEGIN ISOLATION LEVEL REPEATABLE READ;
  SELECT balance FROM accounts WHERE id = 1;  -- snapshot taken here
  SELECT balance FROM accounts WHERE id = 1;  -- same result even if updated
COMMIT;

-- Pessimistic locking: lock rows you intend to update
BEGIN;
  SELECT * FROM accounts WHERE id = 1 FOR UPDATE;  -- locks the row
  UPDATE accounts SET balance = balance - 100 WHERE id = 1;
COMMIT;

-- Optimistic locking: use version column, check before update
UPDATE vehicles
SET status = 'inactive', version = version + 1
WHERE id = 42 AND version = 7;  -- fails if someone else updated first
-- Check affected rows = 0 → conflict, retry
Interviewer questions
QWhat is a deadlock and how do you prevent one?
A deadlock occurs when two transactions each hold a lock the other needs, and both wait forever. Example: Transaction A locks vehicle 1 then tries to lock vehicle 2; Transaction B locks vehicle 2 then tries to lock vehicle 1. Neither can proceed.

PostgreSQL detects deadlocks automatically (checks every deadlock_timeout, default 1s) and aborts one of the transactions with an error.

Prevention: Always acquire locks in a consistent order across all transactions — if code always updates vehicles before drivers, you can't deadlock between those two. Keep transactions short so locks are held briefly. Use NOWAIT to fail immediately rather than wait if a lock is held, and retry at the application level.
QWhat is the difference between pessimistic and optimistic locking?
Pessimistic locking: Lock the row before reading it with SELECT ... FOR UPDATE. Assumes conflicts are likely, so block other transactions immediately. Good for high-contention scenarios like a seat reservation system where two users might book simultaneously.

Optimistic locking: Read without locking, perform changes, then check at write time whether someone else changed the data first (usually via a version column or timestamp). If conflict detected, retry. Good for low-contention scenarios where conflicts are rare — most Cartrack vehicle data updates (status changes, score updates) can use optimistic locking since concurrent updates to the same vehicle are uncommon.

Optimistic locking scales much better since it holds no DB locks during the read phase, but requires application-level retry logic.
MVC

MVCC, VACUUM, and table bloat

Why reads don't block writes in PostgreSQL — and the cost

PostgreSQL uses MVCC (Multi-Version Concurrency Control). When a row is updated, Postgres writes a new version and marks the old one as expired — it doesn't overwrite. Readers see the version that was current when their transaction started, so reads never block writes. The cost: old versions accumulate as "dead tuples" and bloat the table. VACUUM reclaims this space.
-- Check for table bloat (high dead tuples = vacuum not keeping up)
SELECT relname,
       n_live_tup, n_dead_tup,
       ROUND(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 1) AS dead_pct,
       last_autovacuum,
       last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

-- Manual vacuum (non-blocking, runs in background)
VACUUM gps_events;

-- VACUUM FULL: rewrites table, reclaims disk space — but locks table!
-- Use pg_repack extension instead for zero-downtime compaction

-- Force autovacuum to run sooner on a specific table
ALTER TABLE gps_events SET (autovacuum_vacuum_scale_factor = 0.01);
For GPS event tables: With 500k inserts/second, autovacuum must be aggressively tuned — the default settings won't keep up. Set autovacuum_vacuum_cost_delay = 2ms and smaller scale factors on the partition tables. Monitor pg_stat_user_tables.n_dead_tup in your dashboards.
PostgreSQL-specific features & PostGIS
PTN

Table partitioning — range, list, hash

Partitioning splits one logical table into multiple physical sub-tables. PostgreSQL routes queries to relevant partitions automatically (partition pruning). Each partition can have its own indexes, tablespace, and retention policy.
CREATE TABLE gps_events (
  id          BIGSERIAL,
  vehicle_id  INT NOT NULL,
  recorded_at TIMESTAMPTZ NOT NULL,
  lat NUMERIC(9,6), lng NUMERIC(9,6), speed NUMERIC
) PARTITION BY RANGE(recorded_at);

CREATE TABLE gps_2024_01 PARTITION OF gps_events
  FOR VALUES FROM('2024-01-01') TO('2024-02-01');

-- Drop old partition instantly — no DELETE, no VACUUM needed
DROP TABLE gps_2023_01;  -- instant, removes millions of rows in milliseconds

-- LIST partitioning: by category
PARTITION BY LIST(region);  -- 'APAC', 'EMEA', 'Americas'

-- HASH partitioning: distribute write load evenly
PARTITION BY HASH(customer_id);  -- 8 equal-sized buckets
JSN

JSONB — when to use it and how to query efficiently

JSONB stores JSON as a parsed binary structure — faster to query than JSON (stored as raw text). Use JSONB for flexible, schema-varying data like alert payloads, user preferences, or event metadata.
-- Arrow operators for extraction
SELECT payload->'geofence_id'         -- returns JSON type
SELECT payload->>'geofence_id'        -- returns TEXT (better for comparison)
SELECT payload#>>ARRAY['location','city'] -- nested path

-- Containment operator @> (uses GIN index)
SELECT * FROM alerts
WHERE payload @> '{"alert_type":"speeding","severity":"high"}';

-- GIN index makes @> very fast even on millions of rows
CREATE INDEX idx_alerts_payload ON alerts USING GIN(payload);

-- Update a specific key without replacing the whole document
UPDATE alerts
SET payload = payload || '{"acknowledged": true}'
WHERE id = 99;

-- Remove a key
UPDATE alerts SET payload = payload - 'temp_debug_field';
Trade-offs: JSONB queries are less readable than column queries, FK constraints can't reference inside JSONB, and type enforcement is gone. Use for flexible metadata (alert payloads, event context). Use regular columns for fields queried frequently or used in JOINs.
ADV

Stored procedures, triggers, and generated columns

-- Generated column: computed from other columns, stored on disk
ALTER TABLE gps_events
ADD COLUMN speed_kmh NUMERIC GENERATED ALWAYS AS (speed * 3.6) STORED;
-- Can be indexed, queried like a normal column

-- Trigger: auto-update updated_at on any row change
CREATE OR REPLACE FUNCTION set_updated_at()
RETURNS TRIGGER AS $$
BEGIN
  NEW.updated_at = NOW();
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER vehicles_updated_at
  BEFORE UPDATE ON vehicles
  FOR EACH ROW EXECUTE FUNCTION set_updated_at();

-- Row Level Security: enforce tenant isolation at DB level
ALTER TABLE vehicles ENABLE ROW LEVEL SECURITY;
CREATE POLICY fleet_isolation ON vehicles
  USING (fleet_id IN (
    SELECT id FROM fleets
    WHERE customer_id = current_setting('app.current_customer_id')::INT
  ));
RLS for multi-tenancy: Even if application code has a bug and forgets to filter by customer_id, RLS at the database layer ensures a customer can never see another customer's data. Set SET app.current_customer_id = '42' at the start of each connection from the application.
GEO

PostGIS — spatial data extension for PostgreSQL

What it is, the data types it adds, and why it's essential for fleet geofencing

PostGIS is an open-source extension that adds geographic objects, spatial functions, and spatial indexing to PostgreSQL. Instead of storing lat/lng as two NUMERIC columns and doing trigonometry in application code, PostGIS stores geometry or geography objects natively and lets the database engine perform spatial calculations — point-in-polygon, distance, intersection — with hardware-optimised algorithms and GiST index support. For Cartrack, PostGIS is the technology behind geofencing, route analysis, and proximity alerts.
Installation & setup
-- Enable PostGIS on the database (run once as superuser)
CREATE EXTENSION postgis;
CREATE EXTENSION postgis_topology;   -- optional: topological relationships
SELECT PostGIS_Version();             -- verify: '3.4.0 ...'
PostGIS data types — what they store

GEOMETRY

Planar (flat-earth) coordinate system. Uses Cartesian mathematics — fast but distorts over large distances. Coordinates are in arbitrary units (metres, degrees, etc.) depending on the SRID. Best for: local calculations within a city or country where curvature is negligible.

GEOGRAPHY

Spheroidal (curved-earth) coordinate system. Uses ellipsoidal mathematics (WGS 84). Coordinates must be longitude/latitude in degrees. Much more accurate for global-scale distance calculations. Slower than GEOMETRY because of the math. Best for: Cartrack's global fleet — distances between vehicles across countries.

Geometry sub-types — the shapes PostGIS can represent

POINT

A single coordinate: POINT(103.8198 1.3521) — that's (longitude, latitude). Used to represent a vehicle's current GPS position, a depot location, a driver's home address, or a point-of-interest.

LINESTRING

An ordered sequence of points connected by straight line segments: LINESTRING(103.8 1.35, 103.85 1.37, 103.9 1.38). Used to represent a vehicle route, a road segment, or a trip path.

POLYGON

A closed ring of points defining an area: POLYGON((103.8 1.3, 103.85 1.3, 103.85 1.35, 103.8 1.35, 103.8 1.3)). The first and last point must be the same (closing the ring). Used for geofences, delivery zones, restricted areas, city boundaries.

MULTIPOLYGON

A collection of polygons as a single geometry — e.g. Singapore's territory including outlying islands as one feature. Used for complex geofences with holes (exclusion zones inside a zone) or discontiguous regions.

GEOMETRYCOLLECTION

A heterogeneous collection of any geometry types. Rarely used in practice but useful for storing mixed geometry types — e.g. a delivery zone defined as a polygon plus specific point-of-interest markers.

MULTILINESTRING

Multiple disconnected line segments as one object. Used for road networks where a road splits and rejoins, or for representing multiple trip legs as a single geographic object.

SRID — Spatial Reference ID (coordinate system)
Every geometry must declare its SRID — which coordinate reference system (CRS) the coordinates are in. Getting this wrong causes silent calculation errors.

SRID 4326 — WGS 84

The GPS standard. Coordinates are (longitude, latitude) in degrees. Every GPS device outputs WGS 84. This is what you store for all vehicle positions and global geofences.

SRID 3857 — Web Mercator

Used by Google Maps, OpenStreetMap, Mapbox tile layers. Coordinates in metres. Convert to this for map tile rendering, but store data as 4326.

Creating spatial tables and indexes
-- Vehicle positions table (using GEOGRAPHY for accurate global distances)
CREATE TABLE vehicle_positions (
  id          BIGSERIAL PRIMARY KEY,
  vehicle_id  INT NOT NULL REFERENCES vehicles(id),
  recorded_at TIMESTAMPTZ NOT NULL,
  position    GEOGRAPHY(POINT, 4326) NOT NULL,   -- lon/lat in WGS 84
  speed_kmh   NUMERIC(5,1),
  heading     SMALLINT                            -- 0-359 degrees
);

-- Geofences table (using GEOMETRY for faster polygon operations)
CREATE TABLE geofences (
  id          SERIAL PRIMARY KEY,
  fleet_id    INT NOT NULL REFERENCES fleets(id),
  name        TEXT NOT NULL,
  zone_type   TEXT,                              -- 'depot', 'restricted', 'delivery'
  boundary    GEOMETRY(POLYGON, 4326) NOT NULL,  -- or MULTIPOLYGON for complex zones
  created_at  TIMESTAMPTZ DEFAULT NOW()
);

-- GiST spatial index — makes spatial queries fast
CREATE INDEX idx_positions_geog  ON vehicle_positions USING GiST(position);
CREATE INDEX idx_geofences_boundary ON geofences USING GiST(boundary);
CREATE INDEX idx_positions_fleet ON vehicle_positions(vehicle_id, recorded_at DESC);
Essential spatial functions
-- Insert a GPS ping (vehicle position from device)
INSERT INTO vehicle_positions (vehicle_id, recorded_at, position, speed_kmh)
VALUES (42, NOW(), ST_MakePoint(103.8198, 1.3521)::geography, 65.5);
-- ST_MakePoint(longitude, latitude) — note: lon FIRST, lat SECOND

-- Distance between two points (returns metres when using GEOGRAPHY)
SELECT ST_Distance(
  ST_MakePoint(103.8198, 1.3521)::geography,  -- vehicle position
  ST_MakePoint(103.8500, 1.2800)::geography   -- depot location
) AS distance_metres;
-- Returns: 8432.7 (metres)

-- Point-in-polygon: is a vehicle inside a geofence?
SELECT g.name, g.zone_type
FROM geofences g
WHERE ST_Within(
  ST_MakePoint(103.8198, 1.3521)::geometry,  -- vehicle position
  g.boundary                                   -- geofence polygon
)
AND g.fleet_id = 7;

-- Alternative: ST_Contains(polygon, point) — same result, different argument order
WHERE ST_Contains(g.boundary, ST_MakePoint(103.82, 1.35)::geometry)

-- All vehicles within 500 metres of a depot
SELECT v.plate, vp.speed_kmh,
       ST_Distance(vp.position, depot.location) AS distance_m
FROM vehicle_positions vp
JOIN vehicles v ON v.id = vp.vehicle_id
JOIN depots depot ON depot.id = 5
WHERE ST_DWithin(vp.position, depot.location, 500)  -- uses GiST index efficiently
AND vp.recorded_at > NOW() - INTERVAL '30 seconds';

-- Create a circular geofence (200m radius around a depot)
INSERT INTO geofences (fleet_id, name, zone_type, boundary)
VALUES (7, 'Main Depot', 'depot',
  ST_Buffer(
    ST_MakePoint(103.8198, 1.3521)::geography,
    200        -- 200 metres radius
  )::geometry
);

-- Trip path as LINESTRING (reconstructed from GPS events)
SELECT ST_MakeLine(position::geometry ORDER BY recorded_at) AS trip_path,
       ST_Length(ST_MakeLine(position::geometry ORDER BY recorded_at)::geography) AS distance_m
FROM vehicle_positions
WHERE vehicle_id = 42
AND recorded_at BETWEEN '2024-01-15 08:00' AND '2024-01-15 17:00';

-- Bounding box of all geofences for a customer (for map viewport)
SELECT ST_AsGeoJSON(ST_Extent(boundary)) AS bounding_box
FROM geofences
WHERE fleet_id IN (SELECT id FROM fleets WHERE customer_id = 42);
Format conversion — working with APIs and maps
-- Convert to GeoJSON (for API responses to frontend map)
SELECT ST_AsGeoJSON(boundary)::json AS geojson
FROM geofences WHERE id = 1;
-- Returns: {"type":"Polygon","coordinates":[[[103.8,1.3],[103.85,1.3],...]]}

-- Convert from GeoJSON (from frontend map drawing tool)
SELECT ST_GeomFromGeoJSON('{"type":"Polygon","coordinates":[...]}');

-- Convert to WKT (Well-Known Text) for debugging
SELECT ST_AsText(boundary) FROM geofences WHERE id = 1;
-- Returns: POLYGON((103.8 1.3, 103.85 1.3, ...))

-- Convert from WKT
SELECT ST_GeomFromText('POLYGON((103.8 1.3, 103.85 1.3, 103.85 1.35, 103.8 1.35, 103.8 1.3))', 4326);
Interviewer questions
QWhy use PostGIS instead of storing lat/lng as two NUMERIC columns?
Storing lat/lng as two NUMERIC columns and doing spatial math in the application layer has three major problems:

1. Accuracy: Calculating the distance between two GPS coordinates on the Earth's curved surface requires the Haversine formula. Doing this in application code is error-prone and many implementations use flat-earth approximations that become increasingly inaccurate over long distances. PostGIS's ST_Distance on GEOGRAPHY types uses ellipsoidal math (WGS 84) that's accurate globally.

2. Indexability: Two separate NUMERIC columns can't be spatially indexed together. You can't efficiently answer "find all geofences that contain this point" without a spatial index. PostGIS's GiST index on a GEOMETRY/GEOGRAPHY column turns a full table scan into a O(log n) lookup — essential at millions of GPS pings per day.

3. Richness: PostGIS provides 200+ spatial functions out of the box — union of polygons, line intersection, convex hull, nearest-neighbour queries, trip distance from a path — all done in the database, optimised in C, not implemented repeatedly across different services.
QWhat's the difference between GEOMETRY and GEOGRAPHY in PostGIS?
GEOMETRY uses a flat-plane (Cartesian) coordinate system. It assumes the Earth is flat. Fast to compute because it's just 2D math. Accurate for small local areas (within a city) where curvature is negligible. Coordinates can be in any unit (metres, feet, degrees) depending on the SRID.

GEOGRAPHY uses a spheroidal (curved-earth) coordinate system — specifically WGS 84, the same as GPS. Coordinates must be longitude/latitude in degrees. Distance calculations are accurate globally — you get the correct great-circle distance between Singapore and London, not a distorted flat-plane approximation. ~10–15% slower than GEOMETRY due to the ellipsoidal math.

Cartrack recommendation: Store GPS vehicle positions and geofences as GEOGRAPHY(POINT, 4326) and GEOGRAPHY(POLYGON, 4326) respectively. The global accuracy is worth the minor performance cost, and ST_DWithin on GEOGRAPHY still uses the GiST index efficiently. Use GEOMETRY if you need to call PostGIS functions that only work with GEOMETRY — cast with ::geometry.
QHow would you implement geofence entry/exit detection efficiently at scale?
The naive approach (don't use): On every GPS ping, run ST_Within(vehicle_point, geofence_boundary) against the entire geofences table. At 500k pings/sec and thousands of geofences per fleet, this is catastrophic.

The right approach — three layers:

Layer 1 — GiST spatial index on the DB: The geofences table has a GiST index on boundary. A point-in-polygon check with ST_Within + WHERE fleet_id = X only scans the relevant customer's geofences and uses the spatial index to prune non-intersecting polygons. This handles the database query efficiently.

Layer 2 — In-memory state in Redis: For the real-time Lambda processor, cache each customer's active geofences in Redis (serialised as GeoJSON). On each GPS ping, load the geofence list from Redis (microseconds vs database milliseconds), do the point-in-polygon check in application code (using a spatial library like `turf.js` or Python's `shapely`), and compare against the vehicle's previous state (also in Redis).

Layer 3 — State transition events only: Only when a vehicle transitions (was outside → now inside, or vice versa) do we write to the database and trigger an alert. The majority of pings — vehicle still inside or still outside — generate zero DB writes. This reduces the geofence write load by 99%+.
QHow do you reconstruct a trip path and calculate its distance using PostGIS?
Trip reconstruction involves collecting ordered GPS points and connecting them into a LINESTRING:
-- Reconstruct trip as a path and compute total distance
WITH trip_points AS (
  SELECT position, recorded_at
  FROM vehicle_positions
  WHERE vehicle_id = 42
    AND recorded_at BETWEEN '2024-01-15 08:00' AND '2024-01-15 17:00'
  ORDER BY recorded_at
)
SELECT
  ST_MakeLine(position::geometry ORDER BY recorded_at) AS path,
  ST_Length(
    ST_MakeLine(position::geometry ORDER BY recorded_at)::geography
  ) / 1000.0 AS distance_km,
  COUNT(*) AS ping_count
FROM trip_points;
Key functions: ST_MakeLine aggregates points into a LINESTRING in ORDER BY sequence. Cast to GEOGRAPHY before calling ST_Length to get accurate metres. Divide by 1000 for kilometres. This is exactly how Cartrack computes trip distance in the driver behaviour report.
System management & process management
PS

Process management — diagnosis commands you must know cold

## CPU diagnosis
top                          # live process list; press P to sort by CPU
htop                         # visual, colour-coded; F6 to sort, F9 to kill
ps aux --sort=-%cpu | head   # snapshot sorted by CPU usage
pidstat -u 1 5               # per-process CPU every 1s for 5 samples
mpstat -P ALL 1              # per-core CPU stats (find unbalanced load)

## Memory diagnosis
free -h                      # total/used/free/buffers/cache
vmstat 1 5                   # virtual memory stats: swpd, si, so = swap activity
cat /proc/meminfo            # detailed kernel memory breakdown
ps aux --sort=-%mem | head   # top memory consumers

## Disk I/O diagnosis
iostat -xz 1                 # per-disk util%, await (ms), r/s, w/s
iotop -o                     # per-process I/O (like top for disk)
lsof +D /var/lib/postgresql  # which processes have DB files open
df -h                        # disk space by filesystem
du -sh /var/log/*            # find log directories eating disk
ncdu /                       # interactive disk usage explorer

## Process inspection
strace -p <PID> -c           # summarise syscalls (what is it doing?)
lsof -p <PID>                # all files/sockets open by a process
cat /proc/<PID>/status       # detailed process info including threads
ps -eLf | grep php           # list all threads of a process
ps aux | grep Z              # find zombie processes
Interviewer questions
QA production server is showing high CPU usage. Walk me through your diagnosis step-by-step.
Step 1 — Identify the offending process: top (sort by CPU with P key) or ps aux --sort=-%cpu | head -10. Note the PID and command name.

Step 2 — Is it sustained or a spike? pidstat -u 1 10 over 10 seconds tells me if it's a runaway process or a burst.

Step 3 — What is the process doing? strace -p <PID> -c for a 10-second syscall summary. If it's in a tight loop of read/write syscalls, it's I/O-bound. If mostly futex, it's waiting on a lock.

Step 4 — Check for context: Is it a scheduled cron job? (crontab -l, check /var/log/cron). A recent deployment? (check app logs, git log). A DB query running wild? (check pg_stat_activity).

Step 5 — Mitigate: If it's an OOM-about-to-happen situation, kill -9 <PID> to stop the bleeding. If it's a production service, graceful restart: systemctl restart service-name. Then investigate root cause from logs.
QWhat's the difference between a process and a thread in Linux? How does the kernel manage them?
In Linux, both processes and threads are represented internally as tasks using task_struct. The difference is in what they share: threads within the same process share the same memory address space, file descriptors, and signal handlers — they are "lightweight" tasks. Separate processes have their own isolated memory space.

A PHP-FPM worker is a separate process — isolated memory. A Go server using goroutines runs many goroutines on a pool of OS threads — one process, many concurrent lightweight threads.

You can inspect threads with ps -eLf (each thread shows separately) or cat /proc/<PID>/status | grep Threads.
SYD

systemd — service management and unit files

## Service management
systemctl status my-service          # current state + last 10 log lines
systemctl start|stop|restart my-service
systemctl enable my-service          # start on boot
systemctl reload my-service          # reload config without full restart (if supported)
systemctl list-units --type=service  # all services and their states
systemctl list-dependencies my-service

## Log investigation
journalctl -u my-service -b          # logs since last boot
journalctl -u my-service -f          # follow live
journalctl -u my-service --since "2024-01-15 14:00" --until "14:30"
journalctl -p err -b                 # only ERROR and above from this boot

## Writing a systemd unit file
# /etc/systemd/system/gps-processor.service
[Unit]
Description=GPS Event Processor
After=network.target postgresql.service
Requires=postgresql.service

[Service]
Type=simple
User=appuser
WorkingDirectory=/opt/gps-processor
ExecStart=/opt/gps-processor/bin/processor
Restart=always
RestartSec=5
EnvironmentFile=/etc/gps-processor/env
StandardOutput=journal
StandardError=journal

# Memory/CPU limits
MemoryLimit=512M
CPUQuota=50%

[Install]
WantedBy=multi-user.target
After creating/modifying a unit file: systemctl daemon-reload then systemctl restart service-name. The Restart=always directive means systemd will restart the service if it crashes — essential for production services.
FS

File system management — permissions, ownership, disk usage

## Permissions
chmod 750 /opt/gps-processor/bin/processor
# 7=rwx (owner), 5=r-x (group), 0=--- (other)
chmod u+x,g-w,o-rwx script.sh    # symbolic mode
chown appuser:appgroup /opt/app   # change owner:group
chown -R appuser /opt/app         # recursive

## Special permissions
chmod +s /usr/bin/sudo            # SUID: run as file owner (not caller)
chmod g+s /shared/data            # SGID on dir: new files inherit group
chmod +t /tmp                     # Sticky bit: only owner can delete their files

## Find files by permission/ownership
find /opt -type f -perm /u+s      # find SUID files (security audit)
find / -user root -perm -4000     # world-writable SUID root files (risk!)

## Disk usage investigation
df -h                             # filesystem usage
df -i                             # inode usage (can be full even if space available)
du -sh /var/log/* | sort -rh | head -20  # largest log directories
find /var/log -name "*.log" -size +100M  # large log files
lsof | grep deleted               # files deleted but still held open (disk not freed!)
Interviewer questions
QA server shows disk is 100% full but df shows large free space. What happened and how do you fix it?
This is the deleted files still open problem. When a file is deleted (rm), Linux removes the directory entry but the data isn't freed until all processes that have the file open close it. A log file can grow to 50GB, get deleted, but if the logging process still has it open, the disk space isn't freed.

Diagnose: lsof | grep deleted — lists all deleted-but-still-open files with their sizes and the PID holding them open.

Fix: Option A: restart the process holding the file open (systemctl restart service-name). Option B: truncate the file without closing it: : > /proc/<PID>/fd/<fd-number>. Option B is better if you can't afford to restart the service.

Also check df -i — sometimes the issue is inode exhaustion, not block usage. Millions of small files in /tmp or a mail spool can exhaust inodes even with free disk space.
Networking & connectivity
NET

Network diagnosis commands

ss, netstat, tcpdump, curl, traceroute — and what each tells you

## Active connections and listening ports
ss -tlnp                     # TCP listening ports + process name (fast, modern)
ss -s                        # connection count summary by state
ss -tnp state established    # all established TCP connections
netstat -tlnp                # older equivalent (netstat is deprecated but still common)

## Is the port open / reachable?
nc -zv 10.0.1.5 5432        # TCP port check (can I reach Postgres?)
curl -v https://api.cartrack.sg/health  # full HTTP trace including TLS handshake
telnet 10.0.1.5 5432         # older alternative

## DNS resolution
dig api.cartrack.sg          # full DNS query with timing
dig +short api.cartrack.sg   # just the IP
nslookup api.cartrack.sg     # simpler alternative
cat /etc/resolv.conf         # which DNS servers the server uses

## Route tracing
traceroute api.cartrack.sg   # hop-by-hop latency to destination
mtr --report api.cartrack.sg # continuous traceroute with packet loss %

## Packet capture (powerful but requires root)
tcpdump -i eth0 port 1883    # capture MQTT traffic
tcpdump -i eth0 host 10.0.1.5 -w /tmp/capture.pcap  # save to file
tcpdump -n -r /tmp/capture.pcap  # read saved capture

## Connection count by state (useful for diagnosing connection exhaustion)
ss -tn | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn
# Shows: ESTABLISHED, TIME_WAIT, CLOSE_WAIT counts
# Too many TIME_WAIT = normal for HTTP (short-lived connections)
# Too many CLOSE_WAIT = app not closing connections properly (code bug)
Interviewer questions
QAn application can't connect to the database. Walk me through your diagnosis.
I work from the outside in — eliminate each layer:

1. Is the DB process even running? On the DB server: systemctl status postgresql. ps aux | grep postgres.

2. Is it listening on the right port? ss -tlnp | grep 5432. If it's listening on 127.0.0.1 only, remote connections can't reach it — check postgresql.conf listen_addresses.

3. Can the app server reach the DB server? nc -zv <db-host> 5432 from the app server. If this fails, it's network — firewall rules, security groups (AWS), or routing.

4. Is the connection being rejected by Postgres itself? Check pg_hba.conf — the client's IP must match an allowed rule. Check the DB logs (journalctl -u postgresql) for "authentication failed" or "no pg_hba.conf entry".

5. Check connection pool exhaustion: SELECT count(*), state FROM pg_stat_activity GROUP BY state — if max_connections is reached, new connections are rejected with "too many connections". Solution: add RDS Proxy / PgBouncer or increase max_connections (with care).
FW

Firewall and iptables / nftables basics

## UFW (Ubuntu Uncomplicated Firewall — abstraction over iptables)
ufw status verbose            # current rules
ufw allow 22/tcp              # allow SSH
ufw allow from 10.0.0.0/8 to any port 5432  # only internal IPs to Postgres
ufw deny 5432                 # block Postgres from public
ufw enable

## iptables (lower-level, still common)
iptables -L -n -v             # list all rules with packet counts
iptables -A INPUT -p tcp --dport 1883 -s 10.0.0.0/8 -j ACCEPT  # allow MQTT from private network
iptables -A INPUT -p tcp --dport 1883 -j DROP   # drop everything else to MQTT port

## Check if a specific rule is blocking (count packets hitting it)
iptables -L INPUT -v -n | grep 1883
Shell scripting
BSH

Bash fundamentals — variables, conditionals, loops, functions

#!/bin/bash
set -euo pipefail  # e=exit on error, u=error on unset var, o pipefail=pipe fails matter
trap 'echo "ERROR: Script failed at line $LINENO"' ERR

## Variables and quoting
NAME="Cartrack"
echo "Hello $NAME"           # double quotes: variable expanded
echo 'Hello $NAME'           # single quotes: literal, no expansion
FILES=$(ls /var/log/*.log)   # command substitution
COUNT=${#FILES}               # string length

## Conditionals
if [[ -f "/etc/config.env" ]]; then
  source /etc/config.env
elif [[ -d "/etc/config/" ]]; then
  source /etc/config/main.env
else
  echo "No config found" >&2; exit 1
fi

[[ $COUNT -gt 0 ]] && echo "Has files" || echo "Empty"

## String tests: -z (empty), -n (not empty), == (equal), != (not equal)
## File tests: -f (file), -d (dir), -r (readable), -x (executable)
## Number tests: -eq -ne -lt -gt -le -ge

## Loops
for file in /var/log/gps/*.log; do
  echo "Processing $file"
done

while IFS= read -r line; do
  echo "Line: $line"
done < /etc/servers.txt

## Functions
log() {
  local level="$1"; shift
  echo "$(date '+%Y-%m-%d %H:%M:%S') [$level] $*" >> /var/log/deploy.log
}
log INFO "Deployment started"
log ERROR "Config file missing"

## Arrays
SERVERS=("web1" "web2" "web3")
for s in "${SERVERS[@]}"; do
  ssh "deploy@$s" 'systemctl restart app'
done
AWK

Text processing — grep, awk, sed, sort, xargs

## grep — pattern matching
grep -i "error" /var/log/app.log           # case-insensitive
grep -v "debug" /var/log/app.log           # invert (exclude debug lines)
grep -oP 'vehicle_id=\K[^\s]+'            # extract just the match (-o) with Perl regex
grep -c "CRITICAL" /var/log/app.log        # count matching lines
grep -A3 -B1 "ERROR" app.log              # 3 lines after, 1 before match (context)

## awk — structured text processing
awk '{print $1, $3}' /var/log/access.log  # print columns 1 and 3
awk -F',' '$5 > 100 {print $1}' data.csv  # filter CSV: col 5 > 100, print col 1
awk '{sum+=$NF} END {print sum}' file     # sum the last column
awk 'NR%1000==0 {print NR, $0}' bigfile  # sample every 1000th line from huge file

## sed — stream editor for substitution
sed 's/ERROR/CRITICAL/g' app.log          # replace ERROR with CRITICAL
sed -n '100,200p' bigfile                 # print only lines 100-200
sed '/^#/d' config.ini                    # delete comment lines
sed -i 's/old_host/new_host/g' config     # edit file in-place

## sort, uniq — counting and deduplication
sort -t',' -k3 -n data.csv                # sort by column 3 numerically
sort | uniq -c | sort -rn | head -20      # frequency analysis (top 20 most common)
LC_ALL=C sort -u bigfile                  # fast ASCII sort + dedup

## xargs — batch parallel execution
find /data -name "*.gz" | xargs -P4 -I{} gzip -d {}  # decompress 4 at a time
cat server_list.txt | xargs -I{} ssh {} 'uptime'      # run uptime on all servers

## Real-world: find top 10 IPs hammering the API
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -10

## Parse structured logs (JSON)
cat app.log | python3 -c "
import sys, json
for line in sys.stdin:
    try:
        d = json.loads(line)
        if d.get('level') == 'ERROR':
            print(d['timestamp'], d['message'])
    except: pass
"
CRN

Cron, scheduling, and automation

## crontab format: min hour day month weekday command
crontab -e         # edit current user's crontab
crontab -l         # list current user's crontab
crontab -u postgres -l  # list another user's crontab

## Cron examples
0  2 * * *    /opt/scripts/backup_db.sh          # daily at 2am
*/5 * * * *   /opt/scripts/health_check.sh       # every 5 minutes
0  0 1 * *    /opt/scripts/monthly_report.sh     # first of every month
0  8-18 * * 1-5  /opt/scripts/fleet_sync.sh     # every hour, business hours, weekdays

## Best practices for cron scripts
# 1. Use absolute paths — cron has a minimal PATH
# 2. Redirect output to a log file
0 2 * * * /opt/scripts/backup.sh >> /var/log/backup.log 2>&1

# 3. Add a lock file to prevent overlapping runs
0 */2 * * * flock -n /tmp/etl.lock /opt/scripts/etl.sh || echo "Already running"

# 4. Set MAILTO="" to suppress email output (noisy by default)
MAILTO=""

## /etc/cron.d/ — system-level cron with user specification
# /etc/cron.d/gps-archive:
0 3 * * * postgres /opt/scripts/archive_gps_partitions.sh
Cartrack use cases for cron: Nightly GPS partition archival (S3 sync), driver score batch computation, expired alert cleanup, fleet report email dispatch, certificate renewal checks.
SH2

Practical scripting — file watcher, health checker, deployment helper

#!/bin/bash
## Health check script with alerting
set -euo pipefail

ENDPOINT="https://api.cartrack.sg/health"
SLACK_WEBHOOK="${SLACK_WEBHOOK_URL}"
THRESHOLD_MS=2000

check_health() {
  local start end elapsed http_code
  start=$(date +%s%N)
  http_code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 "$ENDPOINT")
  end=$(date +%s%N)
  elapsed=$(( (end - start) / 1000000 ))  # ms

  if [[ "$http_code" != "200" ]]; then
    alert "API returned HTTP $http_code"
  elif [[ $elapsed -gt $THRESHOLD_MS ]]; then
    alert "API slow: ${elapsed}ms (threshold: ${THRESHOLD_MS}ms)"
  fi
}

alert() {
  local msg="$1"
  echo "$(date '+%Y-%m-%d %H:%M:%S') ALERT: $msg" >> /var/log/health-check.log
  curl -s -X POST "$SLACK_WEBHOOK" \
    -H 'Content-type: application/json' \
    --data "{\"text\":\"⚠️ $msg\"}" > /dev/null
}

check_health
Security & hardening
SSH

SSH, key management, and secure remote access

## Generate a key pair (Ed25519 is modern and fast)
ssh-keygen -t ed25519 -C "deploy@cartrack" -f ~/.ssh/cartrack_deploy

## Copy public key to server
ssh-copy-id -i ~/.ssh/cartrack_deploy.pub user@server

## ~/.ssh/config for managing multiple servers
Host cartrack-prod
  HostName 10.0.1.100
  User deploy
  IdentityFile ~/.ssh/cartrack_deploy
  ServerAliveInterval 60

## Then just: ssh cartrack-prod

## /etc/ssh/sshd_config hardening
PermitRootLogin no               # never allow root SSH
PasswordAuthentication no        # key-only authentication
PubkeyAuthentication yes
AllowUsers deploy appuser        # whitelist specific users
Port 2222                        # non-standard port reduces script-kiddie noise

## SSH tunnelling (port forwarding)
ssh -L 5432:db-server:5432 jump-host  # forward local 5432 through jump-host to DB
ssh -R 8080:localhost:3000 server     # expose local port 3000 on remote server's 8080

## scp and rsync for file transfer
scp -i ~/.ssh/key file.tar.gz deploy@server:/opt/releases/
rsync -avz --delete /opt/app/ deploy@server:/opt/app/  # sync, delete removed files
Interviewer questions
QHow would you harden a new Linux server before deploying a production application?
Minimum production hardening checklist:

1. User management: Disable root SSH (PermitRootLogin no). Create a dedicated deploy user with sudo for specific commands only. Remove or lock default accounts.

2. SSH hardening: Key-only authentication (PasswordAuthentication no). Non-default port. Restrict AllowUsers. Enable fail2ban to block repeated failed login attempts.

3. Firewall: Default deny all inbound. Allow only needed ports (22/2222 from known IPs, 443/80 from world, 5432 only from app server IP).

4. Updates: Apply security patches (apt upgrade). Enable unattended-upgrades for security patches only.

5. File permissions: Ensure application runs as a non-root user. chmod 600 for secrets and key files. Audit SUID binaries with find / -perm /4000.

6. Monitoring: Install auditd for syscall logging. Configure logwatch or send logs to central SIEM.
ENV

Environment variables and secrets management

## Setting and using environment variables
export DB_PASSWORD="secret"          # set for current session + subprocesses
DB_HOST=localhost ./app              # set only for one command
printenv                             # show all env vars
env | grep APP_                      # filter to app-specific vars

## /etc/environment — system-wide, all users, loaded at login
## /etc/profile.d/*.sh — loaded for login shells
## ~/.bashrc — user-specific interactive shells

## Loading from a .env file in a script
set -a                               # auto-export all variables
source /etc/app/.env
set +a

## Never commit .env files — use .gitignore
## In production: use systemd EnvironmentFile or AWS Secrets Manager

## Masking secrets in logs
echo "DB_PASSWORD: ${DB_PASSWORD:0:3}***"  # show only first 3 chars

## Check if a required env var is set (fail fast if not)
: "${DB_HOST:?DB_HOST must be set}"   # exits with error message if unset
Production secret management: Never store secrets in environment variables that get logged (they appear in ps aux output). Use a secrets manager: AWS Secrets Manager, HashiCorp Vault, or at minimum a EnvironmentFile in the systemd unit with chmod 600 so only root and the service user can read it.
Core concepts & commands
HOW

How Git stores data

Commits, trees, blobs — snapshots not diffs

Git doesn't store diffs — it stores snapshots. Each commit is a pointer to the full state of the project at that moment. Internally: blob (file content), tree (directory listing), commit (snapshot + author + parent), tag. Everything is content-addressed by SHA-1. This is why Git history is immutable — changing any commit changes its hash and all subsequent hashes.
Interviewer questions
QWhat's the difference between git reset, git revert, and git restore?
git revert: Creates a new commit that undoes a previous commit. Safe for shared branches — doesn't rewrite history. Use this when you need to undo something already pushed.

git reset: Moves the branch pointer backward, rewriting history. --soft keeps changes staged, --mixed keeps them unstaged, --hard discards changes. Only use locally before pushing — breaks teammates' branches if used on shared branches.

git restore: For working directory and staging area only — doesn't touch commits. git restore file.txt discards uncommitted changes. git restore --staged file.txt unstages a file.
Git Flow — scheduled releases & versioned software
GF

Git Flow — full model

5 branch types, their lifetimes, and strict merge rules

Created by Vincent Driessen. Designed for software with scheduled, versioned releases — mobile apps, desktop software, libraries with major/minor/patch versions. Has two permanent long-lived branches and three types of short-lived support branches.
BRANCH TYPE BRANCHES FROM MERGES INTO LIFETIME main — — permanent develop main — permanent feature/* develop develop days–weeks release/* develop main + develop days hotfix/* main main + develop hours–days
main ●────────────────────────────────────────────────────● v2.0.0 ╲ ╱ hotfix ●──────────────────────────────────────────● v1.2.1 ╲ ╱ release/1.2 ●──────────────────────────────────────● stabilise only ╱ ╲ develop ●──────────────────────────────────────────────● ╱ ╲ ╱ ╲ ╱ feature ●────────── ●──────── feature/login, feature/gps
# Start a feature
git checkout develop
git checkout -b feature/geofence-alerts
# ... work ...
git checkout develop
git merge --no-ff feature/geofence-alerts  # --no-ff preserves branch history
git branch -d feature/geofence-alerts

# Cut a release (only bug fixes from here, no new features)
git checkout develop
git checkout -b release/2.0.0
# bump version number, fix bugs found in testing
git checkout main && git merge --no-ff release/2.0.0
git tag -a v2.0.0 -m "Release 2.0.0"
git checkout develop && git merge --no-ff release/2.0.0  # back-merge fixes

# Hotfix in production
git checkout main
git checkout -b hotfix/critical-gps-null
# fix the bug
git checkout main && git merge --no-ff hotfix/critical-gps-null
git tag -a v2.0.1
git checkout develop && git merge --no-ff hotfix/critical-gps-null  # crucial!

When to choose Git Flow

Mobile/desktop apps with App Store releases. Libraries with semantic versioning. Products supporting multiple major versions simultaneously (v1.x and v2.x in production). Teams with formal QA and UAT cycles before each release. Compliance-heavy products needing audit trails per release.

Git Flow pitfalls

Merge conflicts accumulate — the longer feature branches live, the bigger the diff. The develop branch can become a "last stable mess." Hotfixes must be merged into two branches — easy to forget the back-merge to develop. Not suitable for multiple daily deployments — too much ceremony per release.

Interviewer questions
QIn Git Flow, why must hotfixes be merged into both main AND develop?
Because develop is the integration branch for future releases. If you fix a critical bug in production (main) but don't back-merge into develop, the next release will regress — the bug you just fixed will reappear. Every hotfix branch must merge into main (for the immediate fix) and develop (to carry the fix forward). This is one of the most commonly forgotten steps in Git Flow and a frequent source of regressions in teams that use it without tooling enforcement.
Trunk-Based Development — continuous integration at its core
TBD

Trunk-Based Development (TBD)

One main branch, short-lived feature branches, always deployable

Everyone integrates into a single shared trunk (main) at least once a day via short-lived feature branches (1–3 days max). There are no long-lived development or release branches. The core principle: main is always in a releasable state. Incomplete features are hidden behind feature flags, not kept on branches.
main ●──────●──────●──────●──────●──────●──────● always deployable ╱ ╲ ╱ ╲ ╱ ╲ feat ●───────────● ●───────────● ●───────────● (max 1-3 days each — merge before they diverge)
# The workflow is intentionally simple
git switch -c feat/speed-alert-refactor   # branch from main
# make small, focused commits — push frequently
git push origin feat/speed-alert-refactor
# Open MR → CI passes → 1 approval → squash merge → delete branch

# Feature flag in application code (incomplete work is merged but hidden)
if (featureFlags.isEnabled('new-geofence-engine', customerId)) {
  // new unfinished code — only visible to internal testers
} else {
  // old code — what all customers see
}

When to choose TBD

SaaS products deployed multiple times per day. Small to medium teams where everyone knows the codebase. Products with good automated test coverage (tests must pass before merge). Teams using feature flags (LaunchDarkly, Unleash, or DB-backed flags). Cartrack's engineering model — continuous delivery to 2.5M subscribers.

TBD prerequisites

Solid CI pipeline that runs in <10 minutes. Feature flags infrastructure for dark deploys. Culture of small, reviewable commits. Test coverage sufficient to catch regressions quickly. Monitoring and rollback capability for production issues.

My recommendation for Cartrack: Trunk-Based Development. Cartrack ships to a live SaaS platform — not a mobile app awaiting App Store review. Short-lived branches, mandatory CI, and feature flags for incomplete work aligns perfectly with continuous delivery to 2.5M+ subscribers.
GitHub Flow — simplified model for web services
GHF

GitHub Flow

One long-lived branch: main. Everything else is a PR with preview deployment.

Simpler than TBD, slightly more structured. There is only one permanent branch: main. Every change — feature, fix, or experiment — happens on a short-lived branch opened as a PR. A preview/staging environment is deployed from the PR branch. Merging to main triggers an automatic production deploy.
1
Branch off main
git switch -c feature/driver-coaching-alerts
2
Commit and push frequently
Small descriptive commits. Push early — CI runs on every push, giving fast feedback.
3
Open a Pull Request
Describe what changed and why. CI runs automatically. Link to the ticket. Assign reviewers.
4
Deploy branch to preview environment
Reviewers can test the actual running feature — not just read code. Essential for UI changes.
5
Review, approve, and merge
All CI checks pass. Blocking review comments resolved. Merge to main — triggers production deploy automatically.

GitHub Flow vs TBD

GitHub Flow is essentially TBD with an explicit preview deployment step as part of the PR workflow. TBD is more prescriptive about keeping branches under 2 days. Both require main to always be deployable. GitHub Flow is often the more pragmatic starting point for teams transitioning from Git Flow.

GitHub Flow limitations

No built-in mechanism for supporting multiple production versions. No explicit release stabilisation stage. Merging to main means immediate production — requires high confidence in CI and monitoring. Not suitable if you need a "release candidate" phase.

Environment branching — per-environment long-lived branches
ENV

Environment branching strategy

A branch per deployment environment — common in enterprise/regulated teams

Each deployment environment has its own long-lived branch. Code is promoted by merging upward: feature → development → staging → main (production). CI/CD pipelines are triggered per branch and deploy to the corresponding environment automatically.
main ●──────────────────────────────────────● → PRODUCTION ╱ ╱ staging ●──────────────────────────────────────● → STAGING (UAT/pre-prod) ╱ ╱ develop ●──────────────────────────────────────● → DEV/TEST environment ╱ ╲ ╱ ╲ ╱ feature ●────────── ●──── developer branches

When environment branching is used

Enterprises with separate QA, UAT, and production environments. Regulated industries (finance, healthcare) where manual approval gates are required between environments. Teams where different environment configs are managed in the branch (not recommended). Legacy organisations migrating from SVN-style trunk/branch workflows.

Problems with environment branching

Branches diverge — staging and main can be weeks apart in diff. Merge conflicts compound across environments. "Works in staging, broken in prod" becomes common. It's a deployment problem disguised as a branching problem — the real solution is environment parity and feature flags, not permanent environment branches.

The honest take: Environment branching is an anti-pattern that many enterprise teams use because they don't have confidence in their CI/CD pipeline. If you have good feature flags, automated tests, and environment parity, you don't need environment branches. Mention you understand why teams use it, but advocate for TBD + feature flags as the modern replacement.
RLS

Release branching — long-term version support

Used when multiple major versions must be maintained in production simultaneously

When a product has enterprise customers locked to older major versions (v1.x while v2.x is out), dedicated long-lived release branches allow backporting critical security patches without forcing version upgrades. Common in open-source projects, SDKs, and enterprise software.
main ●─────────────────────────────────────── v3.x (latest) ╱ release/2.x ●─────────────────────────────────────── v2.x (security patches only) ╱ release/1.x ●─────────────────────────────────────── v1.x (EOL soon)
# Backport a security fix to v2.x release branch
git checkout release/2.x
git cherry-pick abc1234   # cherry-pick the specific fix commit from main
git tag -a v2.8.1 -m "Security: fix GPS auth bypass"
# DO NOT merge release/2.x back into main — it contains old code
Cartrack relevance: If Cartrack's device firmware SDK has enterprise customers who can't upgrade to v2.x yet, release branching for the SDK library makes sense. For the web SaaS platform itself, it doesn't — all customers are on the same version.
Strategy comparison & merging mechanics
CMP

Side-by-side strategy comparison

Choose based on release cadence, team size, and product type

STRATEGY          RELEASE CADENCE   BRANCH COUNT   BEST FOR
Trunk-Based       Multiple/day      Low (1-2)      SaaS, continuous delivery
GitHub Flow       Daily             Low (1+)       Web services, small teams
Git Flow          Scheduled         High (5 types) Mobile apps, OSS libraries
Environment Br.   Scheduled         High (3+)      Enterprise with manual gates
Release Branches  Multiple versions  Medium        SDKs, enterprise software
Interviewer questions
QWhat factors do you consider when choosing a branching strategy for a new team?
I ask four questions:

1. How often do you need to release? Multiple times a day → TBD. Once a week or less → Git Flow or GitHub Flow. Multiple versions in production → release branches.

2. How stable is the codebase and test suite? If CI is unreliable or test coverage is low, long-lived branches provide a safety buffer. If CI is solid and coverage is high, TBD works safely.

3. What's the team size and codebase familiarity? Larger teams benefit from more explicit rules (Git Flow). Small, experienced teams can operate efficiently with TBD's simplicity.

4. Are there external constraints? App Store review cycles force less frequent releases → Git Flow. Compliance requiring UAT before production → environment branching or release branches.

My default recommendation is always to start with GitHub Flow (simple, mainstream) and move to TBD as the team matures its CI and feature flag discipline. Reserve Git Flow for true scheduled-release products.
MRG

Merge strategies — merge commit vs squash vs rebase

The three ways to integrate a branch and what history each produces

When integrating a completed feature branch into main, you have three options. The choice affects what git log looks like and how easy rollbacks are.
BEFORE MERGE: main ●───────●───────● ╲ feat ●───●───● MERGE COMMIT (--no-ff): — preserves branch + adds merge commit main ●───────●───────●───────────────M M = merge commit with 2 parents ╲ ╱ feat ●───●───●──── SQUASH MERGE: — collapses all feature commits into one main ●───────●───────●───────S S = single squashed commit REBASE MERGE: — replays each commit onto main (linear) main ●───────●───────●───●'──●'──●' each commit re-applied, new hashes

Merge commit

Full history preserved. Two parents on the merge commit. git log --graph shows the branch shape.

Best for: Git Flow's develop branch, long-lived branch integration. Rollback: revert the merge commit.

Squash merge

All feature commits become one commit on main. Clean, readable history. Individual commits lost.

Best for: TBD and GitHub Flow — keeps main history clean. Rollback: revert the single squash commit. GitLab's recommended default.

Rebase merge

Each commit replayed onto main, linear history. No merge commit. Rewrites feature commit hashes.

Best for: Small teams with clean commit hygiene. Rollback: revert individual commits. Never rebase public/shared branches.

# GitLab MR merge strategies (Settings → General → Merge requests)
# "Merge commit" — creates merge commit (default)
# "Merge commit with semi-linear history" — rebase first, then merge commit
# "Fast-forward merge" — rebase only, no merge commit, requires linear history

# Team recommendation: Squash and merge
# Every MR = one commit on main = easy to revert a feature = clean log
Interviewer questions
QYour team uses squash merges. A developer complains that their commit history is lost. How do you respond?
This is a valid concern and worth addressing thoughtfully. Squash merging loses the granular commit history — if someone wrote 15 commits with useful reasoning in the commit messages, that context disappears.

My response: the detailed commit history of a feature branch is available as long as the branch exists. We keep remote branches until the sprint retrospective — anyone who wants to review the commit-by-commit reasoning can do so before branch deletion. What lands on main is deliberately coarse-grained — one commit per feature — because main is a record of what shipped, not how it was built.

For the cases where granular commit history genuinely matters (complex algorithms, security-sensitive code), I'd encourage the developer to include that context in the MR description, which GitLab preserves permanently against the merge commit. If the team strongly disagrees with squash, semi-linear merge (rebase + merge commit) is a middle ground.
QHow do you handle long-running feature branches that keep getting stale?
This is usually a symptom of a feature being too large, not a Git problem. My approach:

1. Decompose the feature: If a branch is more than 3 days old, it's too big. Break the epic into independently mergeable increments. Use feature flags to merge partial implementations safely — the UI button only appears when the flag is on.

2. Rebase frequently: If the branch must stay open, have the developer rebase onto main every day or two. This keeps the diff small and conflicts manageable rather than letting them compound for two weeks.

3. Pair programming: Long-running branches are often a sign someone is stuck or uncertain. Schedule a pairing session to unblock, then get it merged.

4. Set a branch age limit policy: Flag any branch older than 5 days in the weekly team standup. Not as a blame exercise — as an early warning that something needs attention.
Pull requests & merge request lifecycle
PR

The full PR / MR lifecycle

From branch creation to merge — every step and who owns it

1
Branch from main (or develop)
Use a descriptive name: feat/gps-deduplication, fix/null-speed-on-idle, chore/upgrade-php83. Short-lived — target 1–3 days max before merging.
2
Small, focused commits
Each commit should represent one logical change. Use Conventional Commits format: feat:, fix:, refactor:, docs:, test:. Never commit "WIP" to a shared branch.
3
Self-review before opening MR
Read your own diff. Remove debug logs, check edge cases, verify tests pass locally. The author is the first reviewer — don't push review burden onto teammates for avoidable issues.
4
Open the MR with a clear description
Fill in the MR template: what changed and why, how to test it, screenshots for UI, ticket/issue link, any known limitations. Assign reviewers explicitly — don't rely on people noticing it.
5
CI runs automatically
Linting, unit tests, integration tests, security scanning. Never ask for review on a failing pipeline — fix CI first. Draft/WIP status prevents premature review requests.
6
Code review
Reviewers respond to every comment. Author addresses or rebuts each one. Discussions should resolve before merge. Aim for review turnaround within 24 hours — blocked PRs kill sprint velocity.
7
Approval & merge
At least one (ideally two) approvals required. Squash commits or standard merge depending on team policy. Delete the source branch after merge. Link the closed ticket/issue.
Interviewer questions
QWhat makes a good pull request? What do you look for when reviewing one?
A good PR: Small scope (one logical change), a clear description of what changed and why, passing CI before requesting review, linked ticket, and evidence of self-review (no debug logs, no obvious mistakes).

When reviewing: I look for correctness first — does it do what it claims? Then edge cases and error handling. Then readability and naming. I label my comments: blocking (must fix), suggestion (optional improvement), question (I want to understand the choice). I give specific, actionable feedback rather than vague comments like "this is confusing." I also call out things done well — review shouldn't just be a list of problems.

I try to approve within 24 hours. A PR waiting for review is a team bottleneck — it blocks the author and inflates WIP.
QHow do you prevent code reviews from becoming a bottleneck in a fast-moving team?
A few things help significantly:

Keep PRs small. A 50-line PR gets reviewed in 10 minutes. A 500-line PR gets skimmed and rubber-stamped. I enforce a culture of frequent small PRs over infrequent large ones.

Use draft/WIP PRs for early feedback. Opening a draft MR early means you can get architecture feedback before writing all the code — much cheaper to change direction at that point.

Set review SLAs. For my team, the expectation was: review within 24 hours on working days. If you can't, re-assign.

Automate the automatable. Linting, formatting, test coverage thresholds — all enforced by CI, not reviewers. Reviewers focus on logic and design, not style.

Pair programming as a PR alternative. For complex or risky changes, I'd do a live pairing session instead of async review. Faster, more effective.
QHow do you handle a PR where a senior developer disagrees with your implementation?
I start by assuming they have context I don't — I ask them to explain their reasoning. Often their concern is valid and I learn something.

I respond to every comment — even "good catch, fixed" or "I see your point but kept my approach because X — happy to discuss." I never silently change code without acknowledging the comment.

If I genuinely disagree after understanding their reasoning, I state my position calmly with evidence — benchmarks, docs, a concrete example. I don't dig in defensively. The goal is to ship the best code, not to win.

If we're stuck, I escalate to a third reviewer or suggest we timebox a discussion. "Disagree and commit" is a valid outcome — sometimes you implement their approach while noting your preference for the record. What I don't do is leave a comment thread unresolved and merge anyway.
MRT

GitLab MR template + branch protection setup

The configuration that enforces quality gates automatically

# .gitlab/merge_request_templates/default.md
## What does this MR do?
<!-- One sentence summary -->

## Why?
<!-- Link to issue/ticket. What problem does this solve? -->

## How to test
<!-- Steps to verify the change works -->

## Checklist
- [ ] Tests added or updated
- [ ] CI pipeline passing
- [ ] No debug/commented-out code
- [ ] Documentation updated if needed
- [ ] Screenshots attached (for UI changes)
GitLab branch protection rules (Settings → Repository → Protected Branches):

Protect main branch

Allowed to merge: Maintainers only. Allowed to push: No one (force-push disabled). Require MR approvals: 1 minimum. Require CI to pass before merge.

Useful MR settings

Delete source branch on merge (auto-cleanup). Squash commits by default (clean history). Require resolved discussions before merge. Enable "Suggested reviewers" based on file ownership (CODEOWNERS).

# CODEOWNERS file — auto-assign reviewers by path
# GitLab reads this from the root of the repo
/database/migrations/   @db-lead
/api/auth/              @security-team
/frontend/              @frontend-team
*                       @tech-lead
HKS

Git hooks — enforcing quality at commit time

pre-commit, commit-msg, pre-push hooks and what to automate

Git hooks are scripts that run automatically at specific points in the Git workflow. They live in .git/hooks/ but are not committed — use a tool like Husky (Node.js) or pre-commit (Python/any) to share hooks across the team via the repo.
# package.json — Husky setup for Node/PHP projects
{
  "husky": {
    "hooks": {
      "pre-commit": "lint-staged",          // lint only staged files
      "commit-msg": "commitlint --edit $1", // enforce conventional commits
      "pre-push": "npm test"               // run tests before push
    }
  },
  "lint-staged": {
    "*.php": ["phpcs", "phpstan analyse"],
    "*.ts": ["eslint --fix", "prettier --write"]
  }
}

# commitlint.config.js — enforce Conventional Commits
module.exports = {
  extends: ['@commitlint/config-conventional'],
  rules: {
    'type-enum': [2, 'always', [
      'feat', 'fix', 'chore', 'refactor', 'docs', 'test', 'perf'
    ]]
  }
};
Hooks vs CI: Hooks provide fast local feedback. CI is the authoritative gate — it still runs everything server-side. Never rely on hooks alone since they can be bypassed with --no-verify.
Standards, processes & advanced Git
STD

Conventional commits + automated changelogs

# Format: type(scope): description
feat(gps): add deduplication for duplicate device packets
fix(auth): resolve null pointer on expired JWT refresh
chore(deps): upgrade php-cs-fixer to 3.x
refactor(trips): extract TripReconstructionService from GPSProcessor
docs(api): add OpenAPI spec for /v1/vehicles endpoint
test(geofence): add edge case for boundary crossing at exactly midnight
perf(dashboard): cache driver score aggregations for 5 minutes

# Breaking changes — triggers major version bump
feat(api)!: remove deprecated /v1/fleet/all endpoint

# Generate changelog automatically with standard-version or release-please
npx standard-version   # bumps version, generates CHANGELOG.md, tags commit
Why this matters for a team lead: Conventional Commits enables fully automated semantic versioning and changelogs. fix: = patch, feat: = minor, feat!: = major. Your release notes write themselves, and junior developers have a clear framework for commit messages without needing a style guide document.
BST

Advanced Git: bisect, reflog, stash, worktrees

## git bisect — binary search for a regression
git bisect start
git bisect bad                  # current = broken
git bisect good v2.3.0          # this tag was working
git bisect run ./run_tests.sh   # automate (exit 0 = good)
git bisect reset                # clean up
# On 1,000 commits: finds the bad one in ~10 steps

## git reflog — your safety net for "lost" commits
git reflog                      # see every HEAD movement
git checkout HEAD@{3}           # go back 3 HEAD positions
git branch recovery HEAD@{3}   # recover "deleted" branch

## git stash — save work-in-progress without committing
git stash push -m "half-done geofence fix"
git stash list
git stash pop                   # restore latest stash
git stash apply stash@{2}       # restore specific stash, keep it

## git worktree — check out two branches simultaneously
git worktree add ../hotfix-branch hotfix/critical-fix
# Work in ../hotfix-branch while staying on your feature branch
git worktree remove ../hotfix-branch
Pro tip on reflog: git reflog is the most underused Git command. Even after a git reset --hard or accidental branch deletion, the commits are still in the object store for ~90 days. Reflog lets you recover them before they're garbage-collected.
TAG

Tagging, releases & semantic versioning

Semantic versioning (SemVer): MAJOR.MINOR.PATCH — e.g. 2.4.1. MAJOR = breaking API change. MINOR = new backwards-compatible feature. PATCH = backwards-compatible bug fix. Pre-release: 2.5.0-rc.1.
# Lightweight tag (just a pointer to a commit)
git tag v2.4.1

# Annotated tag (preferred — includes message, author, date)
git tag -a v2.4.1 -m "Release 2.4.1: fix GPS null handling"
git push origin v2.4.1          # tags don't push automatically
git push origin --tags          # push all local tags

# List and filter tags
git tag -l "v2.*"

# GitLab: tag-triggered pipeline deploys to production
# rules: - if: $CI_COMMIT_TAG
Team practice: Never deploy to production without a tag. Tags give you a precise point to roll back to, and GitLab pipelines can be configured to only deploy-to-prod on a tagged commit — adding a natural gate to production releases.
Scalability fundamentals
SCL

Vertical vs horizontal scaling

Scale up vs scale out — and when each one breaks

Vertical scaling (Scale up)

Add more CPU/RAM/disk to a single server. Simple — no application changes needed. Has a hard ceiling (biggest available machine). Creates a single point of failure. Downtime during upgrades.

Good for: Databases (simpler than distributed), early-stage products, stateful services that are hard to distribute.

Horizontal scaling (Scale out)

Add more servers, distribute load across them. Theoretically unlimited. Requires stateless services (or external state management). More complex operationally — needs load balancing, service discovery.

Good for: Stateless API servers, microservices, read replicas. Cartrack's ingestion layer must scale horizontally.

The real question isn't "which one" — it's "what's the bottleneck?" Most systems need both: horizontal scaling for stateless compute layers, with vertical scaling or read replicas for the database until sharding becomes necessary.
LB

Load balancing — algorithms and strategies

Round robin, least connections, consistent hashing, health checks

A load balancer distributes incoming requests across multiple servers. It also performs health checks and removes unhealthy instances from the pool automatically.

Round robin

Requests go to servers in rotation: 1, 2, 3, 1, 2, 3... Simple. Ignores server capacity or current load. Works well when servers are identical and requests are similar in weight.

Least connections

Each new request goes to the server with the fewest active connections. Better for variable-duration requests (e.g. a mix of quick API calls and long GPS history exports).

IP hash / Consistent hashing

Same client always reaches the same server. Useful for session affinity (sticky sessions). Consistent hashing minimises redistribution when servers are added/removed — critical for cache layers.

Layer 4 vs Layer 7

L4 (TCP): Fast, routes by IP/port, no request inspection. L7 (HTTP): Slower but content-aware — can route based on URL path, headers, or cookies. Use L7 for API gateways and microservice routing.

Interviewer questions
QHow do you handle session state when scaling horizontally across multiple servers?
Three approaches:

1. Sticky sessions (session affinity): The load balancer sends the same user to the same server. Simple, but you lose the flexibility of true horizontal scaling and a server crash takes down sessions with it.

2. Centralised session store: Store sessions in Redis or a database instead of in-memory. All servers share the same session state. This is the standard approach — stateless servers, stateful store. Sessions survive server restarts.

3. Token-based auth (JWT): No server-side session at all. The client holds a signed token containing claims. The server validates the signature without storing any state. Best for APIs — completely stateless, infinitely scalable. The tradeoff is that you can't instantly invalidate a JWT (it's valid until expiry), though a token blocklist in Redis solves this if needed.
CAP

CAP theorem — Consistency, Availability, Partition tolerance

Why you can only pick two, and what that means in practice

The CAP theorem states that a distributed system can only guarantee two of three properties simultaneously when a network partition occurs:

Consistency (C)

Every read returns the most recent write or an error. All nodes see the same data at the same time. Like a single source of truth.

Availability (A)

Every request gets a (non-error) response — but it might not contain the most recent data. The system stays online even if some nodes are unavailable.

Partition tolerance (P)

The system continues operating even when network messages are dropped or delayed between nodes. In a distributed system, P is non-negotiable — networks fail.

In practice: Since partition tolerance is mandatory in any distributed system, the real choice is between CP and AP when a partition happens:

CP systems (consistent + partition tolerant)

Return an error rather than stale data during a partition. Examples: HBase, Zookeeper, traditional RDBMS with synchronous replication. Use when: correctness is critical — financial transactions, inventory management, GPS event deduplication.

AP systems (available + partition tolerant)

Return potentially stale data during a partition — eventually consistent. Examples: DynamoDB, Cassandra, DNS. Use when: uptime matters more than perfect accuracy — user profile reads, product catalogue, fleet dashboard overview metrics.

Cartrack context: The GPS event ingestion pipeline should lean AP — it's better to accept and store a GPS ping even during a partition than to reject it. The driver scoring system can be CP — it's computed in batch where consistency matters more than speed.
SHD

Database sharding & replication

When a single DB isn't enough — and the complexity trade-offs

Replication copies data from one primary DB to one or more read replicas. Writes go to primary, reads can go to any replica. This scales read throughput without sharding complexity. Use this first — sharding is a last resort.
Sharding splits data horizontally across multiple independent DB instances. Each shard holds a subset of the data. Once sharded, cross-shard queries and transactions become very hard.

Sharding strategies

Hash-based: shard = hash(customer_id) % N. Evenly distributed. Hard to add shards (rehashing).

Range-based: customers A–M on shard 1, N–Z on shard 2. Simple but can create hot shards.

Directory-based: A lookup table maps each key to a shard. Flexible but the directory becomes a bottleneck.

Problems sharding creates

Cross-shard JOINs are not possible (must be done in application code). Distributed transactions are extremely complex. Re-sharding when you outgrow the current split is painful. Operational complexity multiplies: N databases to backup, monitor, and maintain.

Rule of thumb: Try these in order before sharding: (1) query optimisation + indexes, (2) read replicas, (3) caching layer, (4) vertical scaling, (5) partitioning within PostgreSQL. Sharding is step 6, not step 1.
Common architecture patterns
MS

Microservices vs monolith — when each makes sense

Monolith

Pros: Simple to develop, test, deploy, and debug. No network latency between components. Easy to do cross-cutting concerns (transactions, logging).

Cons: Hard to scale individual parts. Large codebase becomes unwieldy. Deployments are all-or-nothing. Tech stack is locked in.

Best for: Early-stage products, small teams (<10 devs), domains that are still being understood.

Microservices

Pros: Independent deployment and scaling per service. Teams own their service end-to-end. Technology diversity per service.

Cons: Distributed systems complexity (network failures, eventual consistency). Operational overhead multiplies. Harder to debug across service boundaries.

Best for: Large teams with clear domain boundaries, different scaling requirements per component, high-maturity engineering orgs.

The honest answer for Cartrack: "I'd start with a well-structured modular monolith and extract services where there's a clear scaling or team ownership reason — not because microservices are fashionable. Premature microservices are one of the most expensive architectural mistakes a team can make."
Interviewer questions
QHow would you decide when to break a monolith into microservices?
I'd look for specific signals, not just size: independent scaling needs (the GPS ingestion layer needs to scale 100x more than the reporting layer — good candidate), team ownership friction (two teams are constantly stepping on each other's code in the same module), different deployment cadences (one part needs to deploy 10x per day, another monthly), or technology requirements (stream processing works better in a dedicated service than embedded in a PHP monolith).

I'd extract services along domain boundaries — the classic Domain-Driven Design bounded contexts. The GPS event pipeline, driver scoring, alerting, and the customer-facing dashboard are natural candidates for separation in Cartrack's case.

What I'd avoid: extracting microservices purely because the codebase feels large, or because other companies do it. The operational cost is real — you need service discovery, distributed tracing, and much better monitoring before microservices become a net positive.
EVT

Event-driven architecture

Events, commands, queries — and when async beats synchronous

In event-driven architecture, components communicate by producing and consuming events (facts about what happened) rather than calling each other directly. This decouples producers from consumers — the producer doesn't know or care who is listening.
Event (fact): VehicleSpeedExceededLimit
Published by the GPS stream processor. Consumed independently by: the alerting service (sends push notification), the driver scoring service (deducts points), the compliance reporting service (logs for audit). Each consumer is independently deployable and scalable.
Command (intent): SendDriverAlert
Directed at a specific service with an expected action. Unlike events, commands expect exactly one handler. Used when you want to trigger a specific side effect.
Query (read): GetDriverScoreForLastMonth
Returns data without side effects. In CQRS, reads and writes use separate models — the read model is optimised for query performance, the write model for consistency.
Key benefit for Cartrack: When a GPS event arrives, 5 different systems might need to react (alerts, scoring, geofencing, trip reconstruction, live map). Event-driven architecture lets all of them react independently without the GPS ingestion service needing to know about any of them.
System design interview framework
FWK

How to answer a system design question in an interview

A structured approach that shows seniority

1
Clarify requirements (5 min)
Functional: what must the system do? Non-functional: scale (how many users, events/sec, data volume?), latency SLO, consistency requirements, availability target. Don't assume — ask. "Are we designing for 1,000 vehicles or 2,000,000?"
2
Estimate scale (3 min)
Back-of-envelope: 2.5M vehicles × 1 ping/5 sec = 500,000 events/sec. At 200 bytes/event = 100 MB/s ingestion. 86,400 seconds/day = 43 billion events/day. These numbers drive every subsequent decision.
3
High-level design (10 min)
Draw the major components: clients, load balancers, services, message queues, databases, caches. Explain data flow. Don't optimise yet — get the broad strokes right first.
4
Deep dive into key components (15 min)
Pick the hardest parts. For Cartrack: the ingestion layer, the geofence check engine, the database schema, and the real-time dashboard delivery. Discuss trade-offs at each decision point.
5
Identify bottlenecks & mitigations (5 min)
Where does this design break at 10x scale? What single points of failure exist? What would you add next: caching, sharding, async processing, read replicas?
6
Discuss trade-offs explicitly
Never present your design as perfect. "This approach optimises for write throughput at the cost of read query flexibility — for Cartrack's use case that trade-off is acceptable because..." Senior engineers are evaluated on their reasoning, not just their answer.
The biggest mistake in system design interviews: jumping straight to implementation details without establishing scale requirements. A design for 100 vehicles is completely different from a design for 2.5 million.
Caching strategies
CHE

Caching patterns — Cache-Aside, Write-Through, Write-Behind, Read-Through

Cache-Aside (Lazy loading)

App checks cache first. On miss, loads from DB, writes to cache, returns data. Most common pattern.

Pros: Only caches what's actually requested. Cache failure doesn't break reads.
Cons: First request always slow (cache miss). Cache can go stale.
Use for: Read-heavy data, fleet dashboard vehicle lists.

Write-Through

On every write, app writes to both cache and DB synchronously. Cache is always up to date.

Pros: No stale reads.
Cons: Every write is slower (two writes). Cache fills with data that may never be read.
Use for: Write-then-immediately-read patterns, user profiles.

Write-Behind (Write-Back)

App writes to cache only. Cache asynchronously flushes to DB later.

Pros: Very fast writes.
Cons: Risk of data loss if cache crashes before flush. Complex to implement.
Use for: High-frequency counters, analytics aggregation.

Read-Through

Cache sits in front of the DB. On miss, the cache (not the app) fetches from DB and populates itself.

Pros: App logic is simpler — just talk to the cache.
Cons: Cache vendor must support it (Redis doesn't natively).
Use for: ORM-level caching with libraries like Hibernate.

Interviewer questions
QHow do you handle cache invalidation? What's the hardest part about caching?
Phil Karlton famously said there are only two hard things in computer science: cache invalidation and naming things. He was right.

Strategies for invalidation:

TTL (time-based expiry): The simplest approach — cache entries expire after N seconds. Acceptable stale window for most fleet dashboard data (vehicle list can be 30 seconds stale). Terrible for anything requiring real-time accuracy.

Event-driven invalidation: When the underlying data changes, publish an event that explicitly clears or updates the cache entry. More complex but precise. In Cartrack, when a driver's details are updated in the DB, that event triggers a cache delete for that driver's cached profile.

Write-through invalidation: Every write updates the cache simultaneously. Zero staleness, but adds write latency.

The real challenge: Cache invalidation at the boundary between services. If service A caches data owned by service B, how does A know when B's data changes? This is where event-driven architecture (publish invalidation events to a message queue) beats direct cache management.
QWhat is a cache stampede (thundering herd) and how do you prevent it?
A cache stampede happens when a popular cache entry expires and many concurrent requests all miss the cache simultaneously — they all hit the database at the same time, potentially overloading it.

Prevention strategies:

1. Mutex/lock: When a cache miss occurs, only one request fetches from the DB and populates the cache. Other requests wait. Use a Redis distributed lock (SETNX) to implement this.

2. Probabilistic early expiry (PER): Before the TTL expires, start randomly refreshing the cache for a small percentage of requests. By the time the TTL hits, the cache is already warm.

3. Background refresh: Cache entries are refreshed asynchronously before they expire. The cache always serves (possibly slightly stale) data while a background job updates it.

4. Staggered TTLs: Add random jitter to TTLs — instead of all entries expiring at time T, they expire at T + random(0, 30s). Prevents synchronised expiry across many keys.
RDS

Redis — data structures, use cases & patterns

Redis is an in-memory data structure store used as a cache, message broker, and session store. It's single-threaded (no lock contention) and persists to disk optionally.
# String — simple key-value, TTL-based cache
SET driver:score:123 "87.5" EX 300      # expires in 5 min
GET driver:score:123

# Hash — store object fields (avoid multiple roundtrips)
HSET vehicle:456 lat 1.3521 lng 103.8198 speed 65
HGETALL vehicle:456

# Sorted Set — leaderboards, rate limiting by score
ZADD driver_scores 87.5 "driver:123"
ZRANGE driver_scores 0 9 REV WITHSCORES  # top 10 drivers

# Distributed lock (prevent cache stampede)
SET lock:refresh:driver_scores "1" NX PX 5000  # NX = only if not exists, PX = ms TTL

# Rate limiting counter
INCR rate:api:customer:789
EXPIRE rate:api:customer:789 60           # reset every minute

# Pub/Sub — lightweight event fanout
PUBLISH geofence_alerts '{"vehicle":456,"zone":"depot","event":"exit"}'
For Cartrack: Redis is ideal for: (1) caching latest vehicle positions for the live map, (2) geofence state per vehicle (was it inside/outside at last check), (3) driver score cache, (4) API rate limiting per customer, (5) distributed locks during GPS event processing.
CDN & networking
CDN

CDN — Content Delivery Network

How it works, what to cache, edge computing

A CDN is a globally distributed network of edge servers that cache content close to end users. Instead of every request travelling to your origin server in Singapore, users in Europe hit a CDN node in Frankfurt — dramatically reducing latency.

What to put behind a CDN

Static assets: JS bundles, CSS, images, fonts (cache forever with content-hash filenames). Public API responses that change infrequently (fleet config, map tile data). HTML for SSR pages (short TTL). Large file downloads (vehicle footage exports).

What NOT to put behind a CDN

Real-time data: live GPS positions, WebSocket connections. Authenticated user-specific data (unless using private CDN with signed URLs). Anything requiring immediate invalidation (e.g. emergency geofence changes).

Cache-Control headers tell the CDN how long to cache:
# Static assets (content-hash in filename — cache forever)
Cache-Control: public, max-age=31536000, immutable

# API response — cache for 60s, allow stale for 5s while revalidating
Cache-Control: public, max-age=60, stale-while-revalidate=5

# Private user data — never cache at CDN level
Cache-Control: private, no-store
Interviewer questions
QWhat's the difference between a CDN and a reverse proxy like Nginx? When would you use each?
A reverse proxy (Nginx, HAProxy) sits in front of your servers, typically in the same data centre. It handles load balancing, SSL termination, request routing, and local caching. It's infrastructure you own and operate.

A CDN is a globally distributed network of edge nodes operated by a third party (Cloudflare, AWS CloudFront, Fastly). It caches content at locations worldwide, dramatically reducing latency for geographically distributed users. You configure it, they operate it.

For Cartrack serving fleet managers across 20+ countries: I'd use Nginx/HAProxy as the reverse proxy in our Singapore data centre for internal routing and load balancing, and put Cloudflare or CloudFront in front of it for global edge caching of static assets and public API responses. Both are complementary — the CDN catches what it can globally, the reverse proxy handles what reaches origin.
DNS

DNS, SSL/TLS & HTTPS

What happens between a browser URL and your server

1
DNS resolution
Browser checks local cache → OS cache → recursive resolver → authoritative DNS. Returns IP address for api.cartrack.sg. TTL controls how long this is cached.
2
TCP connection (3-way handshake)
Client SYN → Server SYN-ACK → Client ACK. Establishes a reliable connection before any data is sent. ~1 round-trip time (RTT).
3
TLS handshake (HTTPS)
Client hello (supported ciphers) → Server hello (chosen cipher + certificate) → Certificate validation → Key exchange → Symmetric session key established. ~1-2 extra RTTs. TLS 1.3 reduces this to 1 RTT.
4
HTTP request + response
Now the actual request is sent. HTTP/2 multiplexes multiple requests on a single connection, eliminating head-of-line blocking. HTTP/3 (QUIC) builds on UDP for even better performance on lossy mobile networks.
Performance implications for fleet apps: Fleet managers often use mobile networks in the field. TLS 1.3 + HTTP/2 + a CDN edge near them is the biggest latency win. WebSockets for live tracking avoids the per-request TLS overhead entirely.
APIs & messaging
API

REST vs GraphQL vs WebSocket vs gRPC

When to use each communication style

REST

Request-response over HTTP. Resource-oriented URLs. Cacheable, stateless, widely understood. Best for most CRUD APIs — fleet management config, driver profiles, historical reports. Simple to consume from any client.

GraphQL

Client specifies exactly what data it needs in a single query. Reduces over-fetching and under-fetching. Best for complex UIs where different views need different data shapes. Harder to cache (POST by default). Good for the Cartrack dashboard frontend.

WebSocket

Persistent bidirectional connection. Server can push data to client without polling. Best for real-time data: live GPS positions on the map, instant alert notifications, live driver score updates. Low overhead per message once connected.

gRPC

Binary protocol over HTTP/2. Strongly typed (Protocol Buffers). Very fast and efficient. Best for internal service-to-service communication. Not browser-friendly without a proxy layer. Used between microservices where performance is critical.

Interviewer questions
QHow would you design the live tracking feature in the Cartrack dashboard — 2.5M vehicles, positions updating every few seconds?
The problem with polling: If every fleet manager's browser polls GET /vehicles/positions every 3 seconds, and there are 10,000 concurrent users, that's 3,333 requests/second — just for position updates. Wasteful and doesn't scale.

WebSocket approach: When a user opens the fleet map, the client establishes a WebSocket connection to a dedicated presence server. The backend subscribes to GPS events from Kafka for the specific fleet's vehicles and pushes position updates over the WebSocket as they arrive.

Scale management: Users only see vehicles in the current map viewport — subscribe only to vehicle updates relevant to the visible area (spatial filtering). For customers with large fleets, cluster vehicles at high zoom levels rather than rendering all 500 markers individually.

Fallback: Server-Sent Events (SSE) as a fallback for environments where WebSocket is blocked — SSE is unidirectional (server to client) but simpler to implement and works over HTTP/2.
MQ

Message queues — Kafka vs RabbitMQ vs SQS

Pub/sub vs point-to-point, ordering, durability, and replay

Why use a message queue? Decouple producers from consumers — the GPS ingestion service doesn't need to wait for the scoring service to process before accepting the next ping. Handle traffic spikes — messages queue up during peaks and drain during off-hours. Enable replay — reprocess historical events with a new algorithm without touching the source.

Kafka

Type: Distributed log / pub-sub
Retention: Days to weeks (configurable)
Replay: Yes — consumers track their own offset
Ordering: Guaranteed within a partition
Use for: High-throughput event streaming, GPS ingestion, audit logs, event sourcing

RabbitMQ

Type: Message broker (AMQP)
Retention: Until consumed
Replay: No (message deleted after ack)
Ordering: Within a queue
Use for: Task queues, RPC patterns, complex routing (topic exchanges), email/SMS dispatch

AWS SQS

Type: Managed queue
Retention: Up to 14 days
Replay: No
Ordering: Standard (best-effort) or FIFO
Use for: Cloud-native workloads, decoupled Lambda functions, simple job queues without infrastructure to manage

For Cartrack: Kafka is the right choice for the GPS event pipeline — high throughput, ordered within a vehicle partition (partition key = vehicle_id), and replay capability to reprocess events if the scoring algorithm changes. RabbitMQ or SQS for lower-volume async tasks like sending email reports or scheduling maintenance reminders.
Reliability patterns
CB

Circuit breaker, retry, and bulkhead patterns

How to build services that fail gracefully

The problem: In a distributed system, any downstream service can become slow or unavailable. Without protection, slow dependencies cascade into full system failure — one unhealthy service brings down everything that calls it.
Retry with exponential backoff
On transient failure, retry after a delay. Double the delay on each retry (exponential). Add jitter (random delay) to avoid synchronised retry storms from many clients. Cap at a maximum number of retries — don't retry indefinitely. Only for idempotent operations. Don't retry a payment charge without idempotency keys.
Circuit breaker
Counts consecutive failures to a dependency. After a threshold (e.g. 5 failures in 10 seconds), the circuit "opens" — all further calls immediately fail without hitting the dependency. After a timeout, the circuit enters "half-open" state, allows one test request through, and either closes (recovers) or stays open. Prevents resource exhaustion while a dependency is down.
Bulkhead
Isolate resources for different consumers. Like watertight compartments in a ship — one flooding section doesn't sink the ship. Example: separate thread pools or connection pools for critical vs non-critical services. If the report-generation service consumes all DB connections, the GPS ingestion path is unaffected.
Timeout
Never call an external service without a timeout. A hanging connection that never returns is worse than a fast failure — it holds resources indefinitely. Set aggressive timeouts (100ms for cache hits, 2s for API calls, 30s max for heavy queries) and handle the timeout as an explicit error path.
Interviewer questions
QWhat is a circuit breaker and when would you implement one?
A circuit breaker is a pattern that prevents a service from repeatedly calling a failing dependency. It has three states: Closed (normal operation, requests flow through), Open (dependency is down, all calls fail immediately without attempting), and Half-open (testing if the dependency has recovered).

I'd implement one when: (1) the dependency is external (third-party mapping API, SMS gateway), (2) the dependency is a non-critical service whose failure shouldn't cascade, or (3) the dependency is known to occasionally spike in latency under load.

In Cartrack's context: the SMS alert service, external map tile provider, and vehicle registration API are all good candidates for circuit breakers. The GPS ingestion pipeline should not circuit-break on DB failures — it should queue and retry, since losing GPS data is unacceptable. Libraries like Resilience4j (Java), Polly (.NET), or a service mesh like Istio implement this without application-level code.
OBS

Observability — logs, metrics, traces

The three pillars and how they work together

Observability is the ability to understand what a system is doing internally from its external outputs. The three pillars are complementary — you need all three to debug production issues effectively.

Logs

What: Timestamped records of discrete events.
Tools: ELK stack (Elasticsearch, Logstash, Kibana), Grafana Loki, Datadog Logs.
Best practice: Structured JSON logs (not plain text). Include request_id, user_id, vehicle_id for correlation. Log at appropriate level — don't log DEBUG in production.

Metrics

What: Numeric measurements over time — counters, gauges, histograms.
Tools: Prometheus + Grafana, Datadog, CloudWatch.
Key metrics: Request rate (RPS), error rate (%), latency (p50/p95/p99), queue depth, cache hit ratio, DB connection pool usage.

Distributed Traces

What: End-to-end journey of a single request across multiple services.
Tools: Jaeger, Zipkin, Datadog APM, OpenTelemetry.
Use for: "This fleet manager's dashboard takes 3s to load — which service is slow?" Traces show the exact span of time each service component contributes.

For a tech lead: Define SLOs (Service Level Objectives) for your critical paths. For Cartrack: GPS event ingestion latency p99 < 500ms, dashboard load time p95 < 2s, alert delivery < 30s after trigger event. Alerts should fire when SLO burn rate exceeds threshold — not on every error spike.
RL

Rate limiting — algorithms and implementation

Rate limiting protects APIs from abuse and ensures fair usage across customers. Critical for a multi-tenant SaaS platform where one customer's excessive API calls shouldn't degrade performance for others.

Token bucket

A bucket holds N tokens. Each request consumes one token. Tokens refill at a constant rate. Allows short bursts up to the bucket size. Most flexible algorithm — handles bursty traffic naturally. Redis implementation: INCR + TTL per window.

Fixed window counter

Count requests in a fixed time window (e.g. 1000 req/min). Simple but has a boundary problem: 1000 requests in the last second of minute 1 + 1000 in the first second of minute 2 = 2000 in any 2-second window — double the limit.

Sliding window log

Record timestamp of each request. Count requests within the last N seconds. Precise but memory-intensive at high request rates — stores every request timestamp.

Sliding window counter

Hybrid of fixed window + rolling calculation. Memory efficient and more accurate than fixed window. Most production rate limiters (Nginx, Cloudflare) use this approach.

# Redis sliding window rate limit
function isRateLimited(customerId, limit=1000, windowSec=60):
  key = "rate:{customerId}:{floor(now/windowSec)}"
  count = INCR(key)
  if count == 1: EXPIRE(key, windowSec * 2)  # set TTL on first hit
  return count > limit

# Return these headers so clients can self-throttle
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 847
X-RateLimit-Reset: 1704067260
Generic pipeline design
Q1

How would you design the data pipeline for ingesting real-time GPS telemetry from millions of IoT devices?

Cloud-agnostic conceptual answer

1
Device → Ingestion Layer
IoT devices send GPS pings (lat/lng, speed, heading, odometer) every few seconds via MQTT or lightweight TCP. Stateless, horizontally scalable ingestion service receives them.
2
Message Queue (Kafka)
Events go into Kafka. Decouples ingestion from processing, handles traffic spikes, and provides replay capability for reprocessing. Partition key = vehicle_id ensures ordered delivery per vehicle.
3
Stream Processing
Kafka consumers or Flink jobs process events in real-time: geofence checks, speed violation detection, trip boundary detection, deduplication.
4
Storage
Raw events → time-series DB or PostgreSQL with monthly partitions. Computed aggregates (trip summaries, driver scores) → relational tables for fast dashboard queries.
5
API Layer
Fleet managers query pre-aggregated data. Real-time tracking uses WebSocket pushing latest vehicle positions.
Key NFRs to mention: Out-of-order event handling (GPS packets arrive late due to connectivity), idempotency on the processing side, and horizontal scaling on ingestion and consumer layers.
AWS architecture — GPS telemetry pipeline
AWS

Design the GPS telemetry ingestion pipeline on AWS

Full service mapping: IoT Core → Kinesis → Lambda/ECS → RDS/S3 → API Gateway

IoT DEVICES 2.5M vehicles MQTT / HTTPS MQTT AWS IoT Core Device registry X.509 cert auth Rules Engine Kinesis Data Streams ~500 shards key=vehicle_id 7-day replay storage path (async) AWS Lambda Geofence checks Speed alerts → SNS Live pos → Redis real-time path Kinesis Firehose Buffer 60s / 128MB Writes Parquet → S3 Glue ETL → Redshift S3 + Redshift Raw archive (Parquet) Analytics / reports AWS Glue ETL ElastiCache (Redis) Positions · Geofence Aurora PostgreSQL Trips · Scores · Alerts up to 15 read replicas API Gateway + ECS Fargate CloudFront + S3 (SPA) SNS / SQS Push · SMS · DLQ RDS Proxy conn pooling SECURITY IAM least-privilege roles Secrets Manager (DB creds) VPC private subnets KMS encryption at rest OBSERVABILITY CloudWatch metrics & logs X-Ray distributed tracing CW Synthetics canaries pg_stat_statements (RDS) SCALE @ 2.5M VEHICLES 500k events/sec ingestion ~13 TB/day raw → S3 ~500 Kinesis shards ~500 MB Redis hot data ── real-time path ╌╌ async / batch path
Layer-by-layer breakdown:
1
AWS IoT Core — device connectivity
The MQTT broker for all vehicle devices. Each device authenticates with an X.509 certificate — no shared secrets. IoT Core handles up to millions of concurrent device connections. The IoT Rules Engine evaluates SQL-like rules on incoming messages and routes them directly to Kinesis, SQS, Lambda, or S3 without needing an intermediary service. Device shadow provides last-known state even when a vehicle is offline.
2
Amazon Kinesis Data Streams — ordered event buffer
Partitioned stream (partition key = vehicle_id) guarantees ordered delivery for each vehicle's events. Supports up to 1MB/s write and 2MB/s read per shard — provision shards based on fleet size. 7-day data retention enables replaying historical events if a downstream consumer has a bug. Consumers use Enhanced Fan-Out for independent parallel reads without competing.
3
AWS Lambda — real-time stream processing
Lambda consumes from Kinesis with batch sizes of 100–1000 events. Each invocation handles: deduplication (check event_id in ElastiCache, skip if seen), geofence evaluation (load customer geofences from ElastiCache, run point-in-polygon), speed/behaviour alerts (publish violations to SNS for fan-out to push notifications, SMS via SNS, and SQS for async logging), live position cache (write latest lat/lng/speed to ElastiCache hash keyed by vehicle_id). Lambda scales automatically — no servers to provision.
4
Kinesis Data Firehose + S3 — raw event archive
Parallel consumer on the same Kinesis stream. Buffers events (60 seconds or 128MB) and writes Parquet files to S3. This is the cold storage layer — raw immutable history of every GPS ping. From S3, AWS Glue ETL jobs run nightly to compute trip summaries, fuel consumption, and driver scores, then write results to Amazon Redshift for analytics and reporting. S3 Intelligent-Tiering automatically moves old data to cheaper storage tiers.
5
Amazon Aurora PostgreSQL — structured operational data
Stores trips, driver scores, alerts, vehicles, drivers, and fleets. Aurora provides up to 15 read replicas — dashboard queries hit replicas, writes go to the primary. Auto-scaling storage. Global Database option replicates to other AWS regions for global customers (Cartrack operates in 20+ countries). Use RDS Proxy to pool connections from Lambda functions — Lambda cold starts can exhaust DB connection limits without a proxy.
6
ElastiCache (Redis) — hot data layer
Stores: latest vehicle positions (TTL 10s — stale if device goes offline), geofence definitions per customer (invalidated on update), driver score cache (TTL 5 min), API rate limits (sliding window counters), deduplication set (recent event_ids, TTL 60s). Redis Cluster mode for horizontal scaling. Multi-AZ with automatic failover.
7
ECS Fargate + API Gateway — API & real-time dashboard
REST API runs on ECS Fargate containers behind an Application Load Balancer — no servers to manage, scales on CPU/memory. API Gateway WebSocket API maintains persistent connections for the live map. When a vehicle position updates in ElastiCache, a Lambda publishes to the WebSocket API which pushes it to connected fleet managers. API Gateway handles auth (JWT via Cognito), throttling, and usage plans per customer tier.
8
CloudFront + S3 — frontend delivery
React SPA served from S3 via CloudFront edge nodes — zero server management for frontend. Content-hashed assets cached permanently at the edge. CloudFront's WAF integration blocks malicious traffic before it reaches the origin. Lambda@Edge can inject security headers and handle auth at the edge for sub-millisecond latency.
Supporting AWS services

Observability

CloudWatch: Metrics, logs, alarms for Lambda errors, Kinesis iterator age (lag), ECS CPU/memory, RDS query latency.
X-Ray: Distributed tracing across Lambda → RDS → ElastiCache — identifies where latency lives in a request.
CloudWatch Synthetics: Canary monitors the dashboard endpoint every minute from multiple regions.

Security & compliance

IAM roles: Each Lambda/ECS service has a least-privilege role — no credentials in code.
AWS Secrets Manager: DB passwords, API keys rotated automatically.
VPC: RDS and ElastiCache in private subnets, no public internet access.
KMS: Encrypt S3 data at rest, RDS storage, Kinesis stream.

Alerting & notifications

SNS: Fan-out for speed alerts, geofence events, maintenance reminders. One SNS topic can publish to email, SMS, SQS, Lambda, and HTTP endpoints simultaneously.
SQS: Dead-letter queues for failed alert deliveries — retry without data loss.
SES: Scheduled fleet reports emailed to managers.

Batch & analytics

AWS Glue: Nightly ETL from S3 (raw GPS Parquet) → compute trips, driver scores, fuel consumption.
Amazon Redshift: Data warehouse for business intelligence — fleet utilisation trends, compliance reports, customer-level analytics.
Athena: Ad-hoc SQL on raw S3 data without loading into Redshift — useful for debugging specific date ranges.

Scale estimates at Cartrack size
# 2.5M active vehicles × 1 ping every 5 seconds
Ingestion rate  = 500,000 events/sec
Event size      = ~300 bytes (lat, lng, speed, heading, odometer, timestamp, vehicle_id)
Ingestion volume= 150 MB/sec = ~13 TB/day → S3 cold storage

Kinesis shards  = 500k events/s ÷ 1000 events/s/shard = ~500 shards
Lambda          = Auto-scales — no capacity planning needed
ElastiCache     = 2.5M vehicle position hashes × 200 bytes = ~500 MB hot data
Aurora          = 2.5M vehicle rows, ~50M trips/year — read replicas handle dashboard load
S3              = ~4.7 PB/year raw GPS archive (compressed Parquet ~3–4x smaller)
Interviewer questions
QWhy use AWS IoT Core instead of a self-managed MQTT broker like Mosquitto?
A self-managed Mosquitto broker becomes a single point of failure and a scaling nightmare at 2.5M concurrent connections. You'd need to cluster it, manage certificates, handle reconnects, and provision capacity.

AWS IoT Core is fully managed — it scales to billions of devices with no infrastructure to operate. It integrates natively with the rest of AWS (direct Rules Engine routing to Kinesis, S3, Lambda without any glue code). Device certificates are managed through AWS Certificate Manager. It also supports device shadows — a JSON document representing the last-known state of a device, which is critical for fleet management since vehicles frequently go offline in tunnels or remote areas.

The cost trade-off is real — IoT Core charges per message, which at 500k events/sec is significant. But the operational savings and reliability justify it at Cartrack's scale.
QWhy Kinesis instead of SQS or MSK (Managed Kafka) for the event stream?
vs SQS: SQS is point-to-point — once a message is consumed it's gone. We need multiple independent consumers on the same stream (real-time Lambda processor AND Firehose archiver AND a monitoring consumer). Kinesis supports multiple independent consumers each tracking their own position. SQS doesn't support replay; Kinesis retains data for 7 days.

vs MSK (Managed Kafka): MSK gives you the full Kafka feature set (more partitioning flexibility, exactly-once semantics, richer ecosystem). Kinesis is simpler to operate — no broker management, auto-scaling, tighter AWS integration. For a team not deeply experienced with Kafka operations, Kinesis is safer. MSK becomes the better choice when you need more than 200 shards, need Kafka's connector ecosystem, or already have Kafka expertise in-house.

My recommendation for Cartrack: Start with Kinesis for operational simplicity. Migrate to MSK if partition limits become a constraint or if the team builds Kafka expertise.
QHow do you handle Lambda cold starts in the real-time processing path?
Lambda cold starts (50ms–1s) are a concern when latency matters. Strategies:

Provisioned Concurrency: Keep N Lambda instances pre-warmed and ready. For our Kinesis consumer, this guarantees zero cold start on the critical processing path. Cost: you pay for idle warm instances.

Optimize the deployment package: Smaller packages (avoid bundling unused libraries) cold-start faster. Use Lambda layers for shared dependencies. Prefer compiled runtimes (Go, Rust) over interpreted (Python, Node) for faster init.

RDS Proxy: Without RDS Proxy, each Lambda cold start opens a new DB connection. At burst scale, this exhausts the Aurora connection pool. RDS Proxy pools connections and serves Lambda from a warm pool — critical for any Lambda that touches the DB.

For this architecture: The Kinesis consumer Lambda hits ElastiCache (not RDS) on the hot path — Redis connections are lightweight. Cold starts only matter for infrequent operations. Provisioned Concurrency on the geofence alert Lambda ensures alert delivery latency stays under 5 seconds.
QHow would you ensure no GPS events are lost if Lambda fails mid-processing?
Kinesis provides at-least-once delivery. Lambda's Kinesis event source mapping tracks the shard iterator — if a batch fails, Lambda retries the entire batch until it succeeds or reaches the maximum retry count.

Failure handling setup:
• Set bisectBatchOnFunctionError: true — on failure, Lambda splits the batch in half and retries each half. This isolates the one bad record rather than blocking the entire shard.
• Configure a dead-letter queue (SQS) for events that fail all retries — they land in DLQ for manual inspection rather than being silently dropped.
• Set maximumRetryAttempts and maximumRecordAgeInSeconds so genuinely unprocessable old events don't block the shard forever.

The idempotency layer (ElastiCache event_id set) ensures that if a successfully-processed event is retried, it's a no-op — the "already seen" check prevents double-counting speed violations or duplicate geofence alerts.
COST

AWS cost optimisation for a high-volume IoT pipeline

The biggest cost levers at Cartrack's scale

Kinesis cost levers

Enhanced Fan-Out: Costs $0.015/shard-hour vs $0.014 for standard. Only use it for consumers that need low latency (<200ms) — Firehose can use standard polling.

Right-size shards: Each shard costs ~$21/month. At 500 shards that's $10,500/month just for Kinesis. Consider batching device pings client-side (5 pings per message instead of 1) to reduce shard count 5x at the cost of slightly higher latency.

S3 storage cost levers

Parquet + Snappy compression: Raw JSON GPS events → Parquet reduces size by 4–8x. 13 TB/day becomes ~2–3 TB/day stored.

S3 Intelligent-Tiering: Data not accessed for 30+ days automatically moves to cheaper tiers. GPS data older than 90 days is rarely queried — auto-tier saves ~60% on storage cost for archives.

Lifecycle policies: Delete raw events older than 3 years, keep aggregated data indefinitely.

Lambda cost levers

ARM (Graviton2) architecture: 20% cheaper than x86 for Lambda, often faster too. Easy switch — no code changes needed.

Optimise memory allocation: Lambda pricing is memory × duration. Sometimes doubling memory halves duration — net cost-neutral but faster. Use AWS Lambda Power Tuning tool to find the optimal memory setting for each function.

Database cost levers

Aurora Serverless v2: For workloads with variable demand (fleet managers in business hours only), Aurora Serverless scales down to minimum ACUs during off-hours. Can save 40–60% vs provisioned for non-24/7 workloads.

Read replicas in same AZ: Cross-AZ data transfer costs money. Place read replicas used by dashboards in the same AZ as the ECS services that query them.

IaC

Infrastructure as Code — provisioning this stack with AWS CDK or Terraform

How a tech lead would define, version, and deploy the entire architecture

Why IaC matters for a tech lead: The architecture diagram is only useful if the infrastructure it describes can be reproduced reliably. IaC means the entire stack — Kinesis shards, Lambda functions, Aurora cluster, ElastiCache, IAM roles — is defined in code, version-controlled in Git, and deployed through the same CI/CD pipeline as application code.

AWS CDK (TypeScript)

Best for: AWS-only stacks. Uses real programming language (TypeScript/Python) — loops, conditions, abstractions. Generates CloudFormation underneath.

Pros: First-class AWS support, type safety catches mistakes before deploy, L2/L3 constructs abstract boilerplate (e.g. new KinesisStream(this, 'GPS', { shardCount: 500 })).

Cons: AWS-only, learning curve if new to CDK constructs.

Terraform (HCL)

Best for: Multi-cloud or existing Terraform expertise. Declarative HCL syntax, massive ecosystem of providers.

Pros: Cloud-agnostic, mature state management (terraform.tfstate), excellent for teams already using it. terraform plan shows exactly what will change before applying.

Cons: HCL is less expressive than a real language for complex logic. State file must be kept in S3 + DynamoDB lock for teams.

// AWS CDK TypeScript — GPS pipeline stack (simplified)
export class GPSPipelineStack extends Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    // Kinesis stream — partition key will be vehicle_id at publish time
    const gpsStream = new kinesis.Stream(this, 'GPSStream', {
      shardCount: 500,
      retentionPeriod: Duration.days(7),
      encryption: kinesis.StreamEncryption.MANAGED,
    });

    // Lambda processor — auto-scales with Kinesis shard count
    const processor = new lambda.Function(this, 'GPSProcessor', {
      runtime: lambda.Runtime.NODEJS_20_X,
      architecture: lambda.Architecture.ARM_64, // 20% cheaper
      memorySize: 512,
      timeout: Duration.seconds(60),
      environment: {
        REDIS_HOST: redisCluster.attrPrimaryEndPointAddress,
        DB_PROXY_ENDPOINT: rdsProxy.endpoint,
      },
    });

    // Wire Kinesis → Lambda with bisect-on-error and DLQ
    processor.addEventSource(new KinesisEventSource(gpsStream, {
      batchSize: 100,
      startingPosition: lambda.StartingPosition.TRIM_HORIZON,
      bisectBatchOnError: true,
      onFailure: new SqsDlq(dlqQueue),
      maxRetryAttempts: 3,
    }));

    // Firehose to S3 with Parquet conversion via Glue schema
    new firehose.DeliveryStream(this, 'GPSArchive', {
      sourceStream: gpsStream,
      destinations: [new destinations.S3Bucket(archiveBucket, {
        dataOutputPrefix: 'gps/year=!{timestamp:yyyy}/month=!{timestamp:MM}/',
        bufferingInterval: Duration.seconds(60),
      })],
    });
  }
}
For the interview: Mentioning CDK or Terraform signals that you think about infrastructure as a first-class engineering concern — not something ops "handles later." Tech leads who can define and own the IaC are significantly more valuable than those who only architect on whiteboards.
Geofencing, deduplication & driver scoring
Q2

How would you design geofencing alerts at low latency for millions of vehicles?

The challenge: For each incoming GPS event, check if the vehicle crossed any of potentially thousands of geofence boundaries.
Approach:
Spatial indexing: Store geofences in PostgreSQL with the PostGIS extension. Use GiST indexes on the geometry column. A point-in-polygon check with proper indexing is fast even for millions of geofences.
In-memory cache: Load each customer's active geofences into Redis keyed by geohash/grid cell. On incoming GPS event, look up only the geofences relevant to that grid cell.
State tracking: Per vehicle, track "was inside geofence X at last ping?" — store this state in Redis to detect transitions (enter/exit) rather than re-evaluating absolute position on every event.
Event-driven alerts: Transition events publish to an alert queue (SNS on AWS), which then triggers notifications (push, SMS, email) asynchronously.
AWS mapping: Geofence definitions → ElastiCache (Redis). Vehicle state → ElastiCache hash per vehicle_id. Transition event → SNS topic → Lambda (push notification via FCM/APNs) + SQS (async DB write).
Q3

GPS data can arrive out of order or as duplicates. How do you handle this?

Out-of-order events: Each GPS event carries a device timestamp (not server arrival time). Always store and query by device timestamp. Use a short watermark window (e.g. 30 seconds) in the stream processor — hold events briefly to allow late arrivals to catch up. For trip reconstruction, sort events by device timestamp in the computation step regardless of ingestion order.
Duplicate packets: Each event should carry a unique event ID generated on the device (sequence number + device ID). Use an idempotency key on insert — PostgreSQL's ON CONFLICT DO NOTHING or upsert on the event ID. A short-lived Redis set of recently seen event IDs (TTL 60s) can block duplicates before they even hit the database.
Q4

How would you build a driver behaviour scoring system?

Raw events to capture: Speeding (vs road speed limit), harsh braking (deceleration threshold), harsh acceleration, sharp cornering (lateral g-force), excessive idling, seatbelt (from OBD-II).
Scoring model: Start each driver at 100. Each event deducts points based on severity and frequency. Weight events differently: speeding 30+ km/h over limit deducts more than mild speeding. Calculate a rolling score (last 30 days) for operational dashboards and a cumulative score for insurance/compliance.
Storage strategy: Raw events in partitioned driving_events table. Nightly AWS Glue job computes the 30-day score and writes to a driver_scores summary table. Dashboards query the summary table — never aggregate raw events on every page load.
Real-time consideration: Provide a live preview score in the driver mobile app computed in the Lambda stream processor, separate from the authoritative Glue-computed score.
High-pressure situations
INC

A critical bug causes incorrect GPS locations for 10,000 vehicles. You're the tech lead on call.

1
Acknowledge & communicate (0–5 min)
Confirm the incident, set up a war room (Slack channel), immediately notify stakeholders — support, customer success, engineering leadership. Don't investigate in silence.
2
Contain (5–15 min)
Can we roll back the recent deployment? Can we disable the affected feature for impacted customers? Stop the bleeding before root-causing.
3
Investigate (parallel)
Check recent deployments, config changes, and DB changes. Pull logs for the affected time window. Identify scope — is it all vehicles or a subset?
4
Fix & deploy
Make the smallest safe fix. Don't refactor under pressure. Get a second pair of eyes before deploying.
5
Post-mortem (24–48 hrs later)
Blameless post-mortem. Document timeline, root cause, contributing factors, and action items to prevent recurrence. Share with engineering and stakeholders.
Key phrase to include: "Communication is as important as the fix. Customers and stakeholders can handle bad news — they can't handle silence."
SCP

Product wants to ship a feature in 2 weeks that the team estimates at 6 weeks. How do you handle it?

"I'd never just say 'no, it takes 6 weeks' without engaging in the problem. First I'd ask: what's driving the 2-week deadline — is it a customer commitment, competitive pressure, a conference? Understanding the real constraint often reveals options.

Then I'd come back with options rather than a single answer: (a) Full feature in 6 weeks at current scope. (b) An MVP in 2 weeks that delivers the core value — here's specifically what we'd cut. (c) Full feature in 4 weeks if we add a developer or drop another priority. (d) Feature in 2 weeks but with known technical debt we address in the next sprint.

I'd make the tradeoffs explicit, let product make the call, and document whatever decision is made so there's shared accountability."
Never say you'll just work weekends to make it happen. That signals you'll absorb scope without pushback — a red flag for a lead role.
LGC

You inherit a legacy codebase with no documentation and high technical debt. What do you do?

Phase 1 — Understand before judging (weeks 1–3): Read the code and run it. Understand what it does before deciding it's "bad." Talk to anyone who knows the history. Map the critical paths — what are the highest-risk areas?
Phase 2 — Stabilise: Add tests around the highest-risk and most-changed areas first. You can't safely refactor without tests. Document as you learn — write the README, architecture notes, and runbook you wish existed.
Phase 3 — Incremental improvement: Don't do a big-bang rewrite. Strangle the old code with new patterns over multiple sprints. Budget 20–30% of each sprint for debt reduction. Frame it to stakeholders as reducing the cost of future features, not cleanup for its own sake.
Questions to ask Cartrack

What does the current tech stack look like, and where is the biggest technical debt?

Why ask this: Shows you're thinking about what you're walking into. Their answer signals how honest and self-aware the engineering organisation is. Also confirms whether the stack matches the JD (PHP/C#, React, PostgreSQL, Docker).

What does success look like for this role in the first 90 days?

Why ask this: Gets them to articulate concrete expectations rather than vague desires. Also helps you evaluate if their definition of success is realistic. Listen for: are they expecting you to immediately own delivery, or is there an onboarding period?

How do you balance shipping new features vs maintaining platform reliability?

Why ask this: For a fleet telematics SaaS, uptime is a contractual obligation — a vehicle tracking system going down has immediate business impact. Their answer reveals engineering culture: do they have SLOs? Error budgets? Incident response maturity?
Backend — PHP / C#
API

How would you structure a REST API for a fleet tracking service?

# Resource-oriented URLs
GET    /v1/fleets/{fleet_id}/vehicles
GET    /v1/vehicles/{vehicle_id}/trips?from=2024-01-01&to=2024-01-31
GET    /v1/vehicles/{vehicle_id}/position    # latest GPS fix
POST   /v1/alerts
PATCH  /v1/drivers/{driver_id}
Key design points: OAuth2 / JWT with fleet-scoped tokens. Version in URL (/v1/). Cursor-based pagination for GPS event endpoints — offset pagination breaks on large tables. Rate limiting per customer, not per IP.
ASY

(C#) Explain async/await and where you'd apply it in a high-throughput telematics API.

// Blocking (bad — thread parked while DB responds)
public Vehicle GetVehicle(int id) {
    return _db.Vehicles.Find(id);
}

// Non-blocking (good — thread returns to pool during DB wait)
public async Task<Vehicle> GetVehicleAsync(int id) {
    return await _db.Vehicles.FindAsync(id);
}
async/await allows the thread to be returned to the thread pool while waiting for I/O. For a telematics API where an endpoint fetches 30 days of GPS events from PostgreSQL, making it async means a single server can handle thousands of concurrent fleet dashboards refreshing in parallel without running out of threads.
Frontend — TypeScript + React
MAP

A live fleet map re-renders too frequently as vehicle positions update. How do you optimise this?

// Bad — new array reference every render triggers re-render
const markers = vehicles.map(v => ({ lat: v.lat, lng: v.lng }));

// Good — memoised, only recalculates when vehicles changes
const markers = useMemo(
  () => vehicles.map(v => ({ lat: v.lat, lng: v.lng })),
  [vehicles]
);
Additional optimisations: Wrap the map component in React.memo(). Throttle WebSocket position updates — GPS devices ping every few seconds but the map doesn't need to re-render at the full telemetry rate. Consider mutating marker positions directly on the map instance (Mapbox GL JS / Leaflet) rather than updating React state.
Docker
IMG

Images, containers & layers — why layer order matters

A Docker image is a read-only layered filesystem snapshot. Each Dockerfile instruction creates a new layer. Docker caches each layer — the moment any layer changes, all subsequent layers are invalidated. This is why you copy dependency files and install them before copying source code.
# WRONG: cache busted every time source code changes
COPY . .
RUN npm ci     # reinstalls on every code change!

# CORRECT: deps layer cached unless package.json changes
COPY package*.json ./
RUN npm ci --production   # cached if deps unchanged
COPY . .                  # source invalidates cache only here
Interviewer questions
QWhat's the difference between CMD and ENTRYPOINT in a Dockerfile?
ENTRYPOINT defines the executable that always runs — can't be overridden without --entrypoint. CMD provides default arguments that can be replaced at runtime.

Common pattern: ENTRYPOINT ["node"] and CMD ["server.js"]. Running normally starts node server.js. Running with docker run myimage debug.js starts node debug.js — CMD was overridden. If you only use CMD, someone can override the entire command including the executable.
QContainerise a PHP backend service wired to a PostgreSQL database.
FROM php:8.3-fpm AS base
RUN docker-php-ext-install pdo pdo_pgsql

FROM base AS deps
COPY composer.json composer.lock ./
RUN composer install --no-dev --optimize-autoloader

FROM base AS final
COPY --from=deps /app/vendor ./vendor
COPY . .
EXPOSE 9000

---
# docker-compose.yml
services:
  api:
    build: .
    environment:
      DB_HOST: postgres
      DB_PASS: ${DB_PASS}   # from .env — never hardcode
    depends_on:
      postgres:
        condition: service_healthy
  postgres:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U app"]
      interval: 5s
Key points: Multi-stage builds reduce image size. Use depends_on with healthcheck — not just depends_on (which doesn't wait for readiness). Never embed secrets in the image.
Kubernetes
K8S

Cluster architecture — control plane, worker nodes, and rolling deployments

Control plane

API Server: Front door. All kubectl commands go here.
etcd: Stores all cluster state.
Scheduler: Picks which node a new pod runs on.
Controller Manager: Watches for divergence and fixes it.

Worker nodes

kubelet: Node agent. Manages pods on the node.
kube-proxy: Manages network rules for Services.
Container runtime: Actually runs containers (containerd).

Interviewer questions
QWhat's the difference between a Deployment and a StatefulSet?
A Deployment manages stateless pods — the pods are interchangeable replicas. If pod-1 dies, a replacement is scheduled and it doesn't matter that it has a different name or IP.

A StatefulSet is for stateful applications. Pods get predictable names (app-0, app-1, app-2), a stable network DNS hostname, and their own PersistentVolumeClaim that survives pod restarts. Critical for databases where a pod needs to know "I am replica number 1."

I'd use a Deployment for web API servers and workers. I'd use a StatefulSet for PostgreSQL, Redis, Kafka — anything that maintains its own state that must persist.
QExplain liveness vs readiness probes. What happens if each one fails?
Liveness probe: "Is this container still alive?" If it fails, Kubernetes kills and restarts the container. Use this to detect deadlocked or stuck processes.

Readiness probe: "Is this container ready to accept traffic?" If it fails, Kubernetes removes the pod from the Service's endpoints — traffic stops going to it, but the container is NOT restarted. Use this for startup warmup or temporarily overloaded pods.

Common mistake: Setting liveness probe thresholds too aggressively — Kubernetes will keep restarting a pod that just needs time to warm up, causing a crash loop. Use readiness for traffic control, liveness as a last resort for truly stuck processes.
QHow does Kubernetes handle rolling deployments and how do you minimise downtime?
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0   # never take a pod down before a new one is ready
    maxSurge: 1         # allow one extra pod above desired count during update
To minimise downtime: Configure readinessProbe properly — K8s only routes traffic to a pod once the probe passes. Handle SIGTERM gracefully — finish in-flight requests before exiting. Set terminationGracePeriodSeconds to match your longest expected request time. Use PodDisruptionBudgets to guarantee a minimum number of healthy pods.
Core concepts
CI

CI vs CD (Delivery) vs CD (Deployment)

Three stages, three different levels of automation

Continuous Integration

Developers merge code to main frequently. Each merge triggers automated build + test run. Goal: detect integration problems early.

Continuous Delivery

Every passing change can be released to production at any time — but requires a human to press the deploy button. Software is always releasable.

Continuous Deployment

Every passing change is automatically deployed to production without human intervention. Requires very high confidence in tests and monitoring.

Interviewer questions
QContinuous Delivery vs Continuous Deployment — which would you recommend for Cartrack?
Continuous Delivery means every build is ready to deploy, but a human makes the final decision. Continuous Deployment removes that gate — every passing build goes to production automatically.

For Cartrack, I'd start with Continuous Delivery. Fleet telematics is a mission-critical system — unexpected production changes can directly impact 125,000+ commercial customers tracking vehicles in real time. The manual gate also gives product managers visibility into what's going out. I'd graduate to Continuous Deployment as confidence in the test suite and monitoring grows, potentially starting with lower-risk services first.
GitLab CI in practice
YML

Full GitLab CI pipeline for a fleet tracking backend

stages: [lint, test, build, deploy]

lint-code:
  stage: lint
  image: php:8.3
  script:
    - composer run phpcs       # PSR-12 coding standards
    - composer run phpstan     # static analysis
  cache:
    paths: [vendor/]
    key: $CI_COMMIT_REF_SLUG

unit-tests:
  stage: test
  services: [postgres:16]    # real DB for tests
  variables: {POSTGRES_DB: test_db}
  script:
    - composer run phpunit --coverage-text
  coverage: '/Lines:\s+(\d+\.\d+)%/'

build-image:
  stage: build
  image: docker:24
  services: [docker:dind]
  script:
    - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA

deploy-staging:
  stage: deploy
  script: [kubectl set image deployment/api api=$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA]
  environment: staging
  rules:
    - if: $CI_COMMIT_BRANCH == "main"

deploy-production:
  stage: deploy
  when: manual         # human approval gate
  environment: production
  script: [kubectl set image deployment/api api=$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA]
  rules:
    - if: $CI_COMMIT_TAG    # only on tagged releases
Interviewer questions
QHow do you handle secrets in a CI/CD pipeline? What would you never do?
Secrets go in GitLab CI/CD Variables (Settings → CI/CD → Variables), marked as masked so they don't appear in job logs. They're injected as environment variables at runtime — never hardcoded in the YAML file.

For production deployments I'd integrate with a secrets manager like HashiCorp Vault or AWS Secrets Manager for rotation and audit trails. Variables scoped to specific environments prevent staging pipelines from accidentally accessing production credentials.

What I'd never do: Put a secret in a CI variable that isn't masked, echo a secret to the job log, store credentials in a Docker image layer (they can be extracted from the image history), or commit a .env file with secrets (even .gitignored — they end up in git history).
Scrum roles

Product Owner

  • Owns and prioritises the product backlog
  • Defines acceptance criteria for stories
  • Single voice of the customer
  • Decides what gets built and in what order
  • Approves completed work against the DoD

Scrum Master

  • Facilitates all Scrum ceremonies
  • Removes impediments and blockers
  • Protects team from scope creep
  • Coaches the team on Agile practices
  • No direct authority over developers

Development team

  • Self-organising and cross-functional
  • Estimates effort and owns sprint commitments
  • Pulls work from the sprint backlog
  • Collectively responsible for quality
  • Defines and upholds the DoD
Sprint ceremonies
Planning · 2–4 hrs

Sprint planning

Team selects items from the top of the prioritised backlog and commits to delivering them. PO clarifies requirements. Team estimates in story points, breaks stories into tasks. Output: a clear, committed sprint goal.

Daily · 15 min

Daily standup

Each member answers: what did I complete yesterday, what will I do today, any blockers? Goal is team coordination and early blocker detection — not a status report to management.

End of sprint · 1–2 hrs

Sprint review (demo)

Team demonstrates completed work to stakeholders. Only "done" work — that meets the DoD — is presented. Incomplete work is returned to the backlog. This is about the product.

After review · 1–1.5 hrs

Sprint retrospective

Team reflects on its own working process: what went well, what didn't, what specific actions to take next sprint. This is about the process — separate from the review. Output: actionable improvements with owners assigned.

Mid-sprint · ongoing

Backlog refinement

PO and team review upcoming items: clarify requirements, write acceptance criteria, split large epics, estimate effort. Goal: ensure the top of the backlog is always sprint-ready so planning meetings don't stall.

Interviewer questions
QWhat is the difference between the sprint review and the sprint retrospective?
The sprint review is about the product — the team demonstrates what was built to stakeholders, gets feedback, and adjusts the backlog. External people attend. The sprint retrospective is about the process — it's an internal team meeting to discuss how the team worked and identify specific improvements. Stakeholders don't attend retros.

Simple: review = "what did we build?", retro = "how did we work?" Mixing them up in an interview is a red flag — it suggests you haven't worked in Scrum.
QWhat's the Definition of Done and how does it differ from acceptance criteria?
The Definition of Done is the team's universal quality checklist — it applies to every story in every sprint. A typical DoD: code reviewed by at least one peer, all unit tests passing, CI pipeline green, deployed to staging, documentation updated. Until all of these are true, the story is not done regardless of whether the feature works.

Acceptance criteria are story-specific — they define what "correct" means for that particular feature. For a login story, AC might be: "given a valid email and password, the user is redirected to the dashboard; given an invalid password, an error is shown."

A story must meet both: AC confirms it does the right thing, DoD confirms it was done to the team's quality standard.
The 4-step framework
1
Acknowledge honestly
Say clearly and briefly that you haven't worked with that specific thing. Don't pad it. "I haven't used that directly" is better than "hmm, well, it depends, let me think..."
2
Bridge to what you know
Connect the unknown topic to something you do understand. "That's similar to X which I've used extensively. My understanding is that [concept] applies here because..."
3
Reason out loud
Even without the answer, show your thinking. "Based on what I know about [related area], I'd guess [logical inference]... but I'd want to verify that." Interviewers love people who reason systematically under uncertainty.
4
Show eagerness to learn
End forward: "That's something I haven't used yet but I'd pick it up quickly — could you tell me more about how your team uses it?" Turns a knowledge gap into genuine curiosity.
Ready-to-use response templates
S1

"Have you used [X technology]?" — when you haven't

Say this: "I haven't used [X] directly, but I'm familiar with the concept — it's similar to [Y] which I've worked with extensively. From what I understand, [X] solves [problem] by [approach]. I'd be comfortable picking it up quickly given my background in [related area]. Is it something your team uses heavily?"
The key is to immediately show related knowledge rather than just saying "no." You're demonstrating you can learn it — not pretending you already know it.
S2

"Explain how [deep concept] works" — you know it partially

Say this: "I'm not fully confident on all the internals of that. What I know is [related foundation]. Based on that, I'd reason that [logical inference] — but I'd want to verify that assumption rather than state it as fact. Am I in the right direction?"
Asking "am I in the right direction?" turns the question into a dialogue. The interviewer may correct you — which shows you're open to learning, not defensive.
S3

Complete blank — no starting point at all

Say this: "That's not something I've encountered yet, so I'd rather be upfront than guess. In practice, I'd check the official docs / spin up a quick test / ask a teammate who's used it. Could you give me a brief overview? I'm genuinely curious how it works."
Saying how you'd find the answer is almost as valuable as knowing the answer. It shows you're self-sufficient, resourceful, and not afraid to admit a gap.
S4

You know it but forgot the exact syntax

Say this: "I know exactly what it does and when to use it — [explain the concept and use case]. I can't recall the exact syntax right now, but in practice I'd check the docs. I find the important thing is understanding why a tool exists, not memorising its flags."
S5

The interviewer digs deeper than you can go

Say this: "I've reached the edge of what I know confidently here. I know [what you said], but beyond that I'd be speculating rather than speaking from experience. I'm curious — what does happen at that level? I'd like to understand it properly."
Knowing your own knowledge boundary is a sign of maturity. Interviewers often probe to find this boundary deliberately — they're testing self-awareness, not just depth.
Never say these things

What not to do

  • "I know this, let me think..." — then fill time with unrelated waffle
  • "It's basically the same as [completely wrong analogy]"
  • Giving a long confident-sounding answer that's actually incorrect
  • Saying "that's a great question!" — sounds hollow every time
  • Apologising excessively: "I'm so sorry, I should know this..."
  • Pretending a vague answer was complete

What to do instead

  • Be brief: "I haven't used that yet" — one sentence, then pivot
  • Show adjacent knowledge immediately after admitting the gap
  • Ask the interviewer to explain — turns a gap into a conversation
  • State your reasoning process even if you don't know the conclusion
  • Stay calm — a pause to think looks confident, not nervous
  • Be genuinely curious, not performatively so