Cartrack Tech Lead
Interview Prep
Everything in one place — must-have topics with Q&As, fleet domain knowledge, behavioural scenarios, and techniques for when you don't know the answer.
Tech Lead for a distributed team of 2–10 devs at Cartrack Singapore — a SaaS fleet telematics platform under Karooooo Ltd (NASDAQ: KARO) with 2.5M+ active subscribers across 20+ countries. ~80% leadership and coordination, ~20% hands-on coding. You own technical scope, timelines, code reviews, and cross-functional alignment with Product, Design, and QA.
- Team management (5+ yrs, 2–10 devs)
- Language & stack agnostic (15+ yrs)
- SQL — writing, optimising, DB design
- Linux — scripting, system management
- Git — branching, merging, workflows
- PHP / C# backend
- TypeScript + React frontend
- PostgreSQL query optimisation
- Docker / Kubernetes
- CI/CD pipelines (GitLab)
Leadership
Mentoring, delivery management, code review, removing blockers.
SQL & Databases
Query writing, EXPLAIN ANALYZE, index types, schema design, multi-tenancy, migrations.
Linux
System diagnostics, scripting, log management, service troubleshooting.
Git & Workflows
Branching, rebase vs merge, hotfixes, pull requests lifecycle, code review process, hooks & automation.
System Design
Scalability, load balancing, DB sharding, CAP theorem, microservices vs monolith, interview framework.
App Tech Fundamentals
Caching strategies, CDN, message queues, REST vs GraphQL, rate limiting, circuit breakers, observability.
Fleet / Telematics
GPS ingestion pipeline, AWS architecture deep-dive, geofencing at scale, driver scoring, multi-tenancy.
Scenarios
Incident response, scope negotiation, legacy codebases, questions to ask.
Docker / K8s
Images, layers, K8s architecture, Deployments vs StatefulSets, probes.
CI/CD & GitLab
Pipeline design, GitLab CI yaml, secrets, runners, environments.
Agile & Scrum
Roles, all five ceremonies, DoD vs acceptance criteria, Kanban vs Scrum.
If You Don't Know
4-step framework, ready-to-use phrases, what never to say.
Team Management & Leadership
The role is ~80% leadership. Expect deep probing here. Have 3–4 concrete STAR stories ready covering mentoring, delivery, conflict, and code quality.
Walk me through how you structured 1-on-1s and growth plans for junior developers.
Describe a situation where a developer was underperforming. How did you approach it?
How do you handle a team member who strongly disagrees with a technical decision you've made?
How do you balance giving autonomy to experienced devs while keeping visibility?
How do you define technical scope at the start of a project?
With less than 20% time for hands-on coding, how do you stay technically credible?
What does a constructive code review look like for you?
How do you coordinate between engineering, QA, and product when requirements change mid-sprint?
SQL & Database Design
Expect a practical component — live query writing or whiteboard schema design. Cartrack handles massive time-series GPS event data. Know your query fundamentals, optimisation techniques, and DB design patterns cold.
JOINs — types, algorithms, and common traps
INNER, LEFT, FULL OUTER — and the execution algorithms Postgres chooses
-- INNER JOIN: only rows with a match in BOTH tables SELECT o.id, c.name, o.total FROM orders o INNER JOIN customers c ON o.customer_id = c.id; -- LEFT JOIN: all rows from left, NULLs where no match on right -- Use to find customers who have NEVER placed an order SELECT c.name, o.id AS order_id FROM customers c LEFT JOIN orders o ON c.id = o.customer_id WHERE o.id IS NULL; -- anti-join pattern -- FULL OUTER JOIN: all rows from both, NULLs where no match -- Rarely used — good for reconciliation / diff reports -- CROSS JOIN: cartesian product — every row × every row -- Generates all possible driver-vehicle combinations for assignment SELECT d.name, v.plate FROM drivers d CROSS JOIN vehicles v;
-- Find each employee and their manager (same employees table)
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;
-- Find consecutive GPS events for the same vehicle (gap detection)
SELECT a.vehicle_id, a.recorded_at AS t1, b.recorded_at AS t2,
b.recorded_at - a.recorded_at AS gap
FROM gps_events a
JOIN gps_events b ON a.vehicle_id = b.vehicle_id
AND b.recorded_at = (
SELECT MIN(recorded_at) FROM gps_events
WHERE vehicle_id = a.vehicle_id AND recorded_at > a.recorded_at
);CTEs, subqueries, and window functions
WITH clauses, RANK, ROW_NUMBER, LAG, LEAD, running totals
OVER() clause defines the window: PARTITION BY groups rows, ORDER BY sets order within the window.-- Running total of fuel cost per vehicle (window SUM) SELECT vehicle_id, recorded_at, fuel_cost, SUM(fuel_cost) OVER ( PARTITION BY vehicle_id ORDER BY recorded_at ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS running_total FROM fuel_events; -- Compare each GPS ping speed to the previous ping (LAG) SELECT vehicle_id, recorded_at, speed, LAG(speed) OVER (PARTITION BY vehicle_id ORDER BY recorded_at) AS prev_speed, speed - LAG(speed) OVER (PARTITION BY vehicle_id ORDER BY recorded_at) AS delta FROM gps_events; -- RANK vs ROW_NUMBER vs DENSE_RANK -- ROW_NUMBER: unique 1,2,3,4 (no ties) -- RANK: ties get same rank, next rank skips: 1,1,3 -- DENSE_RANK: ties same rank, next rank doesn't skip: 1,1,2 -- Recursive CTE: all vehicles in a fleet hierarchy WITH RECURSIVE fleet_tree AS ( SELECT id, name, parent_fleet_id, 1 AS level FROM fleets WHERE parent_fleet_id IS NULL UNION ALL SELECT f.id, f.name, f.parent_fleet_id, ft.level + 1 FROM fleets f JOIN fleet_tree ft ON f.parent_fleet_id = ft.id ) SELECT * FROM fleet_tree ORDER BY level, name;
AVG(salary) GROUP BY dept_id gives one row per department. AVG(salary) OVER (PARTITION BY dept_id) keeps every employee row and adds the dept average as a column — letting you compare each employee to their department average in a single query. Window functions are perfect for rankings, running totals, moving averages, and sequential comparisons with LAG/LEAD.CTE: Use when the same result is referenced multiple times, or when you want to break complex logic into readable named steps. In PostgreSQL 12+, CTEs are inlined by default (treated like subqueries). Add
MATERIALIZED keyword to force execution and reuse: WITH data AS MATERIALIZED (...).Temporary table: Use for very large intermediate results that will be queried multiple times in the same session. Written to disk, can be indexed, persists for the session duration. Good for multi-step ETL within a stored procedure.
Aggregations, grouping sets, and pivot patterns
-- Standard aggregates SELECT fleet_id, COUNT(*) AS total_vehicles, AVG(speed) AS avg_speed, MAX(speed) AS top_speed, PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY speed) AS p95_speed FROM gps_events GROUP BY fleet_id; -- GROUPING SETS: multiple GROUP BY levels in one query SELECT fleet_id, vehicle_id, COUNT(*) AS events FROM gps_events GROUP BY GROUPING SETS ( (fleet_id, vehicle_id), -- per vehicle subtotal (fleet_id), -- per fleet subtotal () -- grand total ); -- FILTER clause on aggregates (cleaner than CASE WHEN) SELECT COUNT(*) FILTER (WHERE speed > speed_limit) AS speeding_events, COUNT(*) FILTER (WHERE speed = 0) AS idle_events, COUNT(*) AS total_events FROM gps_events WHERE fleet_id = 42;
EXPLAIN ANALYZE — reading execution plans
How to diagnose slow queries systematically
-- Full diagnosis mode EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT v.plate, COUNT(*) AS events FROM gps_events g JOIN vehicles v ON g.vehicle_id = v.id WHERE g.recorded_at > NOW() - INTERVAL '7 days' GROUP BY v.plate; -- What to look for in the output: -- Seq Scan on gps_events → missing index on recorded_at -- rows=1000 (estimated) vs rows=50000 (actual) → stale stats, run ANALYZE -- Buffers: shared hit=0, read=50000 → data not cached, cold I/O is the bottleneck -- Hash Batches: 4 → hash doesn't fit in work_mem, spilling to disk -- Fix stale statistics ANALYZE gps_events; -- Increase work_mem for sort/hash operations (session-level) SET work_mem = '256MB';
1. Identify the query: Use
pg_stat_statements in production to find queries by total time or mean time — not just the one that felt slow right now.2. Run EXPLAIN (ANALYZE, BUFFERS): Look for Seq Scans on large tables (missing index), row estimate mismatches (stale stats), Hash Batches > 1 (work_mem too low), and high total cost nodes.
3. Check for index-defeating patterns: Functions on indexed columns (
DATE_TRUNC(col) — rewrite as range), implicit type casts (WHERE int_col = '42'), leading wildcard LIKE (LIKE '%foo' can't use B-Tree).4. Fix and verify: Create the index, rewrite the query, or add a partial/covering index. Re-run EXPLAIN ANALYZE to confirm improvement.
5. Monitor: Set up pg_stat_statements dashboard to catch regressions before users notice.
-- 1. Function on indexed column (index bypassed)
WHERE DATE_TRUNC('month', recorded_at) = '2024-01-01' -- BAD
WHERE recorded_at >= '2024-01-01' AND recorded_at < '2024-02-01' -- GOOD
-- 2. Implicit type cast
WHERE vehicle_id = '42' -- vehicle_id is INT, '42' is TEXT → cast prevents index
WHERE vehicle_id = 42 -- GOOD
-- 3. Leading wildcard
WHERE plate LIKE '%ABC' -- can't use B-Tree, full scan
WHERE plate LIKE 'ABC%' -- can use B-Tree index
-- 4. IS NULL / IS NOT NULL (often OK but check cardinality)
-- 5. OR across different columns (use UNION instead)
WHERE fleet_id = 1 OR driver_id = 99 -- may not use indexes well
-- Rewrite as:
SELECT * FROM vehicles WHERE fleet_id = 1
UNION
SELECT * FROM vehicles WHERE driver_id = 99SELECT vehicle_id, ROUND(AVG(speed), 2) AS avg_speed FROM gps_events WHERE recorded_at >= NOW() - INTERVAL '30 days' AND speed > 0 GROUP BY vehicle_id ORDER BY avg_speed DESC LIMIT 10;Making it fast at scale: Partition
gps_events by month — the 30-day filter hits only 1–2 partitions instead of 50M rows. Add composite index (recorded_at, vehicle_id, speed) as a covering index — no heap lookup needed. For a dashboard widget, consider a materialised view refreshed every 5 minutes rather than computing live.Index types and strategies
B-Tree, GIN, GiST, BRIN, partial, covering, expression indexes
B-Tree (default)
Balanced tree. Handles =, <, >, BETWEEN, LIKE 'prefix%', ORDER BY. Right choice for 90% of cases. Created automatically for PRIMARY KEY and UNIQUE.
GIN
Generalised Inverted Index. Best for JSONB, arrays, full-text search (tsvector). Stores each element and maps to rows containing it. Slow to build, fast to query.
GiST
Generalised Search Tree. For geometric types and PostGIS spatial data. Critical for geofencing — point-in-polygon queries use GiST on a geometry column.
BRIN
Block Range Index. Tiny (stores min/max per block range). Only useful for very large tables that are physically sorted (append-only time-series like GPS events partitions).
-- Composite index: column ORDER matters -- (fleet_id, recorded_at) serves: WHERE fleet_id=X AND recorded_at > Y -- Does NOT efficiently serve: WHERE recorded_at > Y (no fleet_id prefix) CREATE INDEX idx_gps_fleet_time ON gps_events(fleet_id, recorded_at DESC); -- Covering index: INCLUDE avoids a second heap lookup (index-only scan) CREATE INDEX idx_gps_cover ON gps_events(vehicle_id, recorded_at) INCLUDE (speed, lat, lng); -- dashboard queries get everything from index -- Partial index: smaller, faster, specific CREATE INDEX idx_active_alerts ON alerts(vehicle_id, triggered_at) WHERE resolved_at IS NULL; -- only unresolved alerts indexed -- Expression index: index the computation, not the raw column CREATE INDEX idx_plate_lower ON vehicles(LOWER(plate)); -- Now WHERE LOWER(plate) = 'sba1234a' uses the index -- Non-blocking index creation (never locks table) CREATE INDEX CONCURRENTLY idx_gps_speed ON gps_events(speed) WHERE speed > 0;
Avoid adding an index when: the table is write-heavy and the column is rarely queried; the column has very low cardinality (e.g. a boolean — the planner may prefer a seq scan anyway); the table is small enough that a seq scan is fast; or the index would duplicate an existing one. Always measure — create the index in a test environment, run EXPLAIN ANALYZE on both the write-heavy and read-heavy queries, and check if the net benefit is positive.
Normalization — 1NF through 3NF + when to denormalize
When to denormalize intentionally
Read-heavy reporting where JOINs across 5+ tables are a bottleneck. Duplicate the column (e.g. store fleet_name on trips table) or use materialised views. Accept the trade-off: writes are more complex, but reads are 10x faster.
BCNF — Boyce-Codd (stricter 3NF)
Every determinant must be a candidate key. Rarely needed in practice but good to know. Example: instructor determines room, but instructor is not the PK — violates BCNF. Most production schemas stop at 3NF.
Relationships — foreign keys, cardinality, junction tables
-- One-to-Many: one fleet → many vehicles vehicles (id, fleet_id REFERENCES fleets(id) ON DELETE CASCADE) -- Many-to-Many: drivers ↔ vehicles over time (junction table) driver_vehicle_assignments ( id BIGSERIAL PRIMARY KEY, driver_id INT NOT NULL REFERENCES drivers(id), vehicle_id INT NOT NULL REFERENCES vehicles(id), assigned_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), unassigned_at TIMESTAMPTZ, -- NULL = currently assigned CONSTRAINT no_overlap EXCLUDE USING gist ( vehicle_id WITH =, tstzrange(assigned_at, unassigned_at) WITH && -- no overlapping time ranges ) ); -- ON DELETE options -- CASCADE: deleting fleet deletes all vehicles -- RESTRICT: error if fleet has vehicles (default-ish) -- SET NULL: vehicle.fleet_id becomes NULL -- NO ACTION: like RESTRICT but deferred until end of transaction
-- Multi-tenancy boundary
customers (id, name, plan, created_at)
fleets (id, customer_id, name) -- one customer → many fleets
-- Core operational entities
vehicles (id, fleet_id, plate, make, model, year, vin, status)
drivers (id, fleet_id, name, licence_no, licence_expiry, status)
-- Many-to-many assignment with time tracking
driver_vehicle_assignments (
id, driver_id, vehicle_id, assigned_at, unassigned_at
)
-- Trips reconstructed from GPS stream
trips (id, vehicle_id, driver_id, start_time, end_time,
distance_km, max_speed, avg_speed, fuel_litres,
start_lat, start_lng, end_lat, end_lng)
-- High-volume: PARTITION BY RANGE(recorded_at) monthly
gps_events (id, vehicle_id, trip_id, recorded_at,
lat, lng, speed, heading, odometer, ignition)
-- Flexible alert payload with JSONB
alerts (id, vehicle_id, driver_id, alert_type, severity,
triggered_at, resolved_at,
payload JSONB) -- e.g. {"geofence_id": 5, "zone": "depot"}
Key design decisions to explain: Fleets as the tenancy boundary (not customers directly on vehicles). GPS events partitioned monthly — mandatory at 50M+ rows. JSONB on alerts for flexible payload without schema migrations every time a new alert type is added. Trips computed/reconstructed, not streamed live into a trips table.
Zero-downtime database migrations
The expand-contract pattern for production schema changes
CREATE INDEX CONCURRENTLY — never locks the table.Isolation levels — Read Committed, Repeatable Read, Serializable
Read Committed (PostgreSQL default)
Each statement sees only committed data. Two statements in the same transaction can see different data if a commit happens between them. Most practical for OLTP. Allows non-repeatable reads.
Repeatable Read
All statements in the transaction see the same snapshot (taken at transaction start). Prevents non-repeatable reads. Still allows phantom reads in most DBs — not in PostgreSQL's implementation.
Serializable
Transactions execute as if they were run one at a time, even though they run concurrently. Prevents all anomalies. PostgreSQL uses SSI (Serializable Snapshot Isolation) — no read locks, but detects conflicts and may abort a transaction. Use for financial transfers, inventory management.
Common anomalies
Dirty read: read uncommitted data (prevented at Read Committed+).
Non-repeatable read: same row returns different values in same transaction.
Phantom read: new rows appear between two reads of the same range.
Lost update: two transactions overwrite each other's changes.
-- Set isolation level for a transaction BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT balance FROM accounts WHERE id = 1; -- snapshot taken here SELECT balance FROM accounts WHERE id = 1; -- same result even if updated COMMIT; -- Pessimistic locking: lock rows you intend to update BEGIN; SELECT * FROM accounts WHERE id = 1 FOR UPDATE; -- locks the row UPDATE accounts SET balance = balance - 100 WHERE id = 1; COMMIT; -- Optimistic locking: use version column, check before update UPDATE vehicles SET status = 'inactive', version = version + 1 WHERE id = 42 AND version = 7; -- fails if someone else updated first -- Check affected rows = 0 → conflict, retry
PostgreSQL detects deadlocks automatically (checks every deadlock_timeout, default 1s) and aborts one of the transactions with an error.
Prevention: Always acquire locks in a consistent order across all transactions — if code always updates vehicles before drivers, you can't deadlock between those two. Keep transactions short so locks are held briefly. Use
NOWAIT to fail immediately rather than wait if a lock is held, and retry at the application level.SELECT ... FOR UPDATE. Assumes conflicts are likely, so block other transactions immediately. Good for high-contention scenarios like a seat reservation system where two users might book simultaneously.Optimistic locking: Read without locking, perform changes, then check at write time whether someone else changed the data first (usually via a version column or timestamp). If conflict detected, retry. Good for low-contention scenarios where conflicts are rare — most Cartrack vehicle data updates (status changes, score updates) can use optimistic locking since concurrent updates to the same vehicle are uncommon.
Optimistic locking scales much better since it holds no DB locks during the read phase, but requires application-level retry logic.
MVCC, VACUUM, and table bloat
Why reads don't block writes in PostgreSQL — and the cost
-- Check for table bloat (high dead tuples = vacuum not keeping up) SELECT relname, n_live_tup, n_dead_tup, ROUND(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 1) AS dead_pct, last_autovacuum, last_autoanalyze FROM pg_stat_user_tables ORDER BY n_dead_tup DESC; -- Manual vacuum (non-blocking, runs in background) VACUUM gps_events; -- VACUUM FULL: rewrites table, reclaims disk space — but locks table! -- Use pg_repack extension instead for zero-downtime compaction -- Force autovacuum to run sooner on a specific table ALTER TABLE gps_events SET (autovacuum_vacuum_scale_factor = 0.01);
autovacuum_vacuum_cost_delay = 2ms and smaller scale factors on the partition tables. Monitor pg_stat_user_tables.n_dead_tup in your dashboards.Table partitioning — range, list, hash
CREATE TABLE gps_events ( id BIGSERIAL, vehicle_id INT NOT NULL, recorded_at TIMESTAMPTZ NOT NULL, lat NUMERIC(9,6), lng NUMERIC(9,6), speed NUMERIC ) PARTITION BY RANGE(recorded_at); CREATE TABLE gps_2024_01 PARTITION OF gps_events FOR VALUES FROM('2024-01-01') TO('2024-02-01'); -- Drop old partition instantly — no DELETE, no VACUUM needed DROP TABLE gps_2023_01; -- instant, removes millions of rows in milliseconds -- LIST partitioning: by category PARTITION BY LIST(region); -- 'APAC', 'EMEA', 'Americas' -- HASH partitioning: distribute write load evenly PARTITION BY HASH(customer_id); -- 8 equal-sized buckets
JSONB — when to use it and how to query efficiently
-- Arrow operators for extraction SELECT payload->'geofence_id' -- returns JSON type SELECT payload->>'geofence_id' -- returns TEXT (better for comparison) SELECT payload#>>ARRAY['location','city'] -- nested path -- Containment operator @> (uses GIN index) SELECT * FROM alerts WHERE payload @> '{"alert_type":"speeding","severity":"high"}'; -- GIN index makes @> very fast even on millions of rows CREATE INDEX idx_alerts_payload ON alerts USING GIN(payload); -- Update a specific key without replacing the whole document UPDATE alerts SET payload = payload || '{"acknowledged": true}' WHERE id = 99; -- Remove a key UPDATE alerts SET payload = payload - 'temp_debug_field';
Stored procedures, triggers, and generated columns
-- Generated column: computed from other columns, stored on disk ALTER TABLE gps_events ADD COLUMN speed_kmh NUMERIC GENERATED ALWAYS AS (speed * 3.6) STORED; -- Can be indexed, queried like a normal column -- Trigger: auto-update updated_at on any row change CREATE OR REPLACE FUNCTION set_updated_at() RETURNS TRIGGER AS $$ BEGIN NEW.updated_at = NOW(); RETURN NEW; END; $$ LANGUAGE plpgsql; CREATE TRIGGER vehicles_updated_at BEFORE UPDATE ON vehicles FOR EACH ROW EXECUTE FUNCTION set_updated_at(); -- Row Level Security: enforce tenant isolation at DB level ALTER TABLE vehicles ENABLE ROW LEVEL SECURITY; CREATE POLICY fleet_isolation ON vehicles USING (fleet_id IN ( SELECT id FROM fleets WHERE customer_id = current_setting('app.current_customer_id')::INT ));
SET app.current_customer_id = '42' at the start of each connection from the application.PostGIS — spatial data extension for PostgreSQL
What it is, the data types it adds, and why it's essential for fleet geofencing
-- Enable PostGIS on the database (run once as superuser) CREATE EXTENSION postgis; CREATE EXTENSION postgis_topology; -- optional: topological relationships SELECT PostGIS_Version(); -- verify: '3.4.0 ...'
GEOMETRY
Planar (flat-earth) coordinate system. Uses Cartesian mathematics — fast but distorts over large distances. Coordinates are in arbitrary units (metres, degrees, etc.) depending on the SRID. Best for: local calculations within a city or country where curvature is negligible.
GEOGRAPHY
Spheroidal (curved-earth) coordinate system. Uses ellipsoidal mathematics (WGS 84). Coordinates must be longitude/latitude in degrees. Much more accurate for global-scale distance calculations. Slower than GEOMETRY because of the math. Best for: Cartrack's global fleet — distances between vehicles across countries.
POINT
A single coordinate: POINT(103.8198 1.3521) — that's (longitude, latitude). Used to represent a vehicle's current GPS position, a depot location, a driver's home address, or a point-of-interest.
LINESTRING
An ordered sequence of points connected by straight line segments: LINESTRING(103.8 1.35, 103.85 1.37, 103.9 1.38). Used to represent a vehicle route, a road segment, or a trip path.
POLYGON
A closed ring of points defining an area: POLYGON((103.8 1.3, 103.85 1.3, 103.85 1.35, 103.8 1.35, 103.8 1.3)). The first and last point must be the same (closing the ring). Used for geofences, delivery zones, restricted areas, city boundaries.
MULTIPOLYGON
A collection of polygons as a single geometry — e.g. Singapore's territory including outlying islands as one feature. Used for complex geofences with holes (exclusion zones inside a zone) or discontiguous regions.
GEOMETRYCOLLECTION
A heterogeneous collection of any geometry types. Rarely used in practice but useful for storing mixed geometry types — e.g. a delivery zone defined as a polygon plus specific point-of-interest markers.
MULTILINESTRING
Multiple disconnected line segments as one object. Used for road networks where a road splits and rejoins, or for representing multiple trip legs as a single geographic object.
SRID 4326 — WGS 84
The GPS standard. Coordinates are (longitude, latitude) in degrees. Every GPS device outputs WGS 84. This is what you store for all vehicle positions and global geofences.
SRID 3857 — Web Mercator
Used by Google Maps, OpenStreetMap, Mapbox tile layers. Coordinates in metres. Convert to this for map tile rendering, but store data as 4326.
-- Vehicle positions table (using GEOGRAPHY for accurate global distances) CREATE TABLE vehicle_positions ( id BIGSERIAL PRIMARY KEY, vehicle_id INT NOT NULL REFERENCES vehicles(id), recorded_at TIMESTAMPTZ NOT NULL, position GEOGRAPHY(POINT, 4326) NOT NULL, -- lon/lat in WGS 84 speed_kmh NUMERIC(5,1), heading SMALLINT -- 0-359 degrees ); -- Geofences table (using GEOMETRY for faster polygon operations) CREATE TABLE geofences ( id SERIAL PRIMARY KEY, fleet_id INT NOT NULL REFERENCES fleets(id), name TEXT NOT NULL, zone_type TEXT, -- 'depot', 'restricted', 'delivery' boundary GEOMETRY(POLYGON, 4326) NOT NULL, -- or MULTIPOLYGON for complex zones created_at TIMESTAMPTZ DEFAULT NOW() ); -- GiST spatial index — makes spatial queries fast CREATE INDEX idx_positions_geog ON vehicle_positions USING GiST(position); CREATE INDEX idx_geofences_boundary ON geofences USING GiST(boundary); CREATE INDEX idx_positions_fleet ON vehicle_positions(vehicle_id, recorded_at DESC);
-- Insert a GPS ping (vehicle position from device) INSERT INTO vehicle_positions (vehicle_id, recorded_at, position, speed_kmh) VALUES (42, NOW(), ST_MakePoint(103.8198, 1.3521)::geography, 65.5); -- ST_MakePoint(longitude, latitude) — note: lon FIRST, lat SECOND -- Distance between two points (returns metres when using GEOGRAPHY) SELECT ST_Distance( ST_MakePoint(103.8198, 1.3521)::geography, -- vehicle position ST_MakePoint(103.8500, 1.2800)::geography -- depot location ) AS distance_metres; -- Returns: 8432.7 (metres) -- Point-in-polygon: is a vehicle inside a geofence? SELECT g.name, g.zone_type FROM geofences g WHERE ST_Within( ST_MakePoint(103.8198, 1.3521)::geometry, -- vehicle position g.boundary -- geofence polygon ) AND g.fleet_id = 7; -- Alternative: ST_Contains(polygon, point) — same result, different argument order WHERE ST_Contains(g.boundary, ST_MakePoint(103.82, 1.35)::geometry) -- All vehicles within 500 metres of a depot SELECT v.plate, vp.speed_kmh, ST_Distance(vp.position, depot.location) AS distance_m FROM vehicle_positions vp JOIN vehicles v ON v.id = vp.vehicle_id JOIN depots depot ON depot.id = 5 WHERE ST_DWithin(vp.position, depot.location, 500) -- uses GiST index efficiently AND vp.recorded_at > NOW() - INTERVAL '30 seconds'; -- Create a circular geofence (200m radius around a depot) INSERT INTO geofences (fleet_id, name, zone_type, boundary) VALUES (7, 'Main Depot', 'depot', ST_Buffer( ST_MakePoint(103.8198, 1.3521)::geography, 200 -- 200 metres radius )::geometry ); -- Trip path as LINESTRING (reconstructed from GPS events) SELECT ST_MakeLine(position::geometry ORDER BY recorded_at) AS trip_path, ST_Length(ST_MakeLine(position::geometry ORDER BY recorded_at)::geography) AS distance_m FROM vehicle_positions WHERE vehicle_id = 42 AND recorded_at BETWEEN '2024-01-15 08:00' AND '2024-01-15 17:00'; -- Bounding box of all geofences for a customer (for map viewport) SELECT ST_AsGeoJSON(ST_Extent(boundary)) AS bounding_box FROM geofences WHERE fleet_id IN (SELECT id FROM fleets WHERE customer_id = 42);
-- Convert to GeoJSON (for API responses to frontend map) SELECT ST_AsGeoJSON(boundary)::json AS geojson FROM geofences WHERE id = 1; -- Returns: {"type":"Polygon","coordinates":[[[103.8,1.3],[103.85,1.3],...]]} -- Convert from GeoJSON (from frontend map drawing tool) SELECT ST_GeomFromGeoJSON('{"type":"Polygon","coordinates":[...]}'); -- Convert to WKT (Well-Known Text) for debugging SELECT ST_AsText(boundary) FROM geofences WHERE id = 1; -- Returns: POLYGON((103.8 1.3, 103.85 1.3, ...)) -- Convert from WKT SELECT ST_GeomFromText('POLYGON((103.8 1.3, 103.85 1.3, 103.85 1.35, 103.8 1.35, 103.8 1.3))', 4326);
1. Accuracy: Calculating the distance between two GPS coordinates on the Earth's curved surface requires the Haversine formula. Doing this in application code is error-prone and many implementations use flat-earth approximations that become increasingly inaccurate over long distances. PostGIS's
ST_Distance on GEOGRAPHY types uses ellipsoidal math (WGS 84) that's accurate globally.2. Indexability: Two separate NUMERIC columns can't be spatially indexed together. You can't efficiently answer "find all geofences that contain this point" without a spatial index. PostGIS's GiST index on a GEOMETRY/GEOGRAPHY column turns a full table scan into a O(log n) lookup — essential at millions of GPS pings per day.
3. Richness: PostGIS provides 200+ spatial functions out of the box — union of polygons, line intersection, convex hull, nearest-neighbour queries, trip distance from a path — all done in the database, optimised in C, not implemented repeatedly across different services.
GEOGRAPHY uses a spheroidal (curved-earth) coordinate system — specifically WGS 84, the same as GPS. Coordinates must be longitude/latitude in degrees. Distance calculations are accurate globally — you get the correct great-circle distance between Singapore and London, not a distorted flat-plane approximation. ~10–15% slower than GEOMETRY due to the ellipsoidal math.
Cartrack recommendation: Store GPS vehicle positions and geofences as GEOGRAPHY(POINT, 4326) and GEOGRAPHY(POLYGON, 4326) respectively. The global accuracy is worth the minor performance cost, and
ST_DWithin on GEOGRAPHY still uses the GiST index efficiently. Use GEOMETRY if you need to call PostGIS functions that only work with GEOMETRY — cast with ::geometry.ST_Within(vehicle_point, geofence_boundary) against the entire geofences table. At 500k pings/sec and thousands of geofences per fleet, this is catastrophic.The right approach — three layers:
Layer 1 — GiST spatial index on the DB: The geofences table has a GiST index on
boundary. A point-in-polygon check with ST_Within + WHERE fleet_id = X only scans the relevant customer's geofences and uses the spatial index to prune non-intersecting polygons. This handles the database query efficiently.Layer 2 — In-memory state in Redis: For the real-time Lambda processor, cache each customer's active geofences in Redis (serialised as GeoJSON). On each GPS ping, load the geofence list from Redis (microseconds vs database milliseconds), do the point-in-polygon check in application code (using a spatial library like `turf.js` or Python's `shapely`), and compare against the vehicle's previous state (also in Redis).
Layer 3 — State transition events only: Only when a vehicle transitions (was outside → now inside, or vice versa) do we write to the database and trigger an alert. The majority of pings — vehicle still inside or still outside — generate zero DB writes. This reduces the geofence write load by 99%+.
-- Reconstruct trip as a path and compute total distance
WITH trip_points AS (
SELECT position, recorded_at
FROM vehicle_positions
WHERE vehicle_id = 42
AND recorded_at BETWEEN '2024-01-15 08:00' AND '2024-01-15 17:00'
ORDER BY recorded_at
)
SELECT
ST_MakeLine(position::geometry ORDER BY recorded_at) AS path,
ST_Length(
ST_MakeLine(position::geometry ORDER BY recorded_at)::geography
) / 1000.0 AS distance_km,
COUNT(*) AS ping_count
FROM trip_points;
Key functions: ST_MakeLine aggregates points into a LINESTRING in ORDER BY sequence. Cast to GEOGRAPHY before calling ST_Length to get accurate metres. Divide by 1000 for kilometres. This is exactly how Cartrack computes trip distance in the driver behaviour report.Linux & System Management
Cartrack runs on Linux servers ingesting IoT data continuously. Expect practical scenario questions — CPU spikes, disk issues, network diagnosis, and scripting challenges.
Process management — diagnosis commands you must know cold
## CPU diagnosis top # live process list; press P to sort by CPU htop # visual, colour-coded; F6 to sort, F9 to kill ps aux --sort=-%cpu | head # snapshot sorted by CPU usage pidstat -u 1 5 # per-process CPU every 1s for 5 samples mpstat -P ALL 1 # per-core CPU stats (find unbalanced load) ## Memory diagnosis free -h # total/used/free/buffers/cache vmstat 1 5 # virtual memory stats: swpd, si, so = swap activity cat /proc/meminfo # detailed kernel memory breakdown ps aux --sort=-%mem | head # top memory consumers ## Disk I/O diagnosis iostat -xz 1 # per-disk util%, await (ms), r/s, w/s iotop -o # per-process I/O (like top for disk) lsof +D /var/lib/postgresql # which processes have DB files open df -h # disk space by filesystem du -sh /var/log/* # find log directories eating disk ncdu / # interactive disk usage explorer ## Process inspection strace -p <PID> -c # summarise syscalls (what is it doing?) lsof -p <PID> # all files/sockets open by a process cat /proc/<PID>/status # detailed process info including threads ps -eLf | grep php # list all threads of a process ps aux | grep Z # find zombie processes
top (sort by CPU with P key) or ps aux --sort=-%cpu | head -10. Note the PID and command name.Step 2 — Is it sustained or a spike?
pidstat -u 1 10 over 10 seconds tells me if it's a runaway process or a burst.Step 3 — What is the process doing?
strace -p <PID> -c for a 10-second syscall summary. If it's in a tight loop of read/write syscalls, it's I/O-bound. If mostly futex, it's waiting on a lock.Step 4 — Check for context: Is it a scheduled cron job? (
crontab -l, check /var/log/cron). A recent deployment? (check app logs, git log). A DB query running wild? (check pg_stat_activity).Step 5 — Mitigate: If it's an OOM-about-to-happen situation,
kill -9 <PID> to stop the bleeding. If it's a production service, graceful restart: systemctl restart service-name. Then investigate root cause from logs.task_struct. The difference is in what they share: threads within the same process share the same memory address space, file descriptors, and signal handlers — they are "lightweight" tasks. Separate processes have their own isolated memory space.A PHP-FPM worker is a separate process — isolated memory. A Go server using goroutines runs many goroutines on a pool of OS threads — one process, many concurrent lightweight threads.
You can inspect threads with
ps -eLf (each thread shows separately) or cat /proc/<PID>/status | grep Threads.systemd — service management and unit files
## Service management systemctl status my-service # current state + last 10 log lines systemctl start|stop|restart my-service systemctl enable my-service # start on boot systemctl reload my-service # reload config without full restart (if supported) systemctl list-units --type=service # all services and their states systemctl list-dependencies my-service ## Log investigation journalctl -u my-service -b # logs since last boot journalctl -u my-service -f # follow live journalctl -u my-service --since "2024-01-15 14:00" --until "14:30" journalctl -p err -b # only ERROR and above from this boot ## Writing a systemd unit file # /etc/systemd/system/gps-processor.service [Unit] Description=GPS Event Processor After=network.target postgresql.service Requires=postgresql.service [Service] Type=simple User=appuser WorkingDirectory=/opt/gps-processor ExecStart=/opt/gps-processor/bin/processor Restart=always RestartSec=5 EnvironmentFile=/etc/gps-processor/env StandardOutput=journal StandardError=journal # Memory/CPU limits MemoryLimit=512M CPUQuota=50% [Install] WantedBy=multi-user.target
systemctl daemon-reload then systemctl restart service-name. The Restart=always directive means systemd will restart the service if it crashes — essential for production services.File system management — permissions, ownership, disk usage
## Permissions chmod 750 /opt/gps-processor/bin/processor # 7=rwx (owner), 5=r-x (group), 0=--- (other) chmod u+x,g-w,o-rwx script.sh # symbolic mode chown appuser:appgroup /opt/app # change owner:group chown -R appuser /opt/app # recursive ## Special permissions chmod +s /usr/bin/sudo # SUID: run as file owner (not caller) chmod g+s /shared/data # SGID on dir: new files inherit group chmod +t /tmp # Sticky bit: only owner can delete their files ## Find files by permission/ownership find /opt -type f -perm /u+s # find SUID files (security audit) find / -user root -perm -4000 # world-writable SUID root files (risk!) ## Disk usage investigation df -h # filesystem usage df -i # inode usage (can be full even if space available) du -sh /var/log/* | sort -rh | head -20 # largest log directories find /var/log -name "*.log" -size +100M # large log files lsof | grep deleted # files deleted but still held open (disk not freed!)
rm), Linux removes the directory entry but the data isn't freed until all processes that have the file open close it. A log file can grow to 50GB, get deleted, but if the logging process still has it open, the disk space isn't freed.Diagnose:
lsof | grep deleted — lists all deleted-but-still-open files with their sizes and the PID holding them open.Fix: Option A: restart the process holding the file open (
systemctl restart service-name). Option B: truncate the file without closing it: : > /proc/<PID>/fd/<fd-number>. Option B is better if you can't afford to restart the service.Also check
df -i — sometimes the issue is inode exhaustion, not block usage. Millions of small files in /tmp or a mail spool can exhaust inodes even with free disk space.Network diagnosis commands
ss, netstat, tcpdump, curl, traceroute — and what each tells you
## Active connections and listening ports ss -tlnp # TCP listening ports + process name (fast, modern) ss -s # connection count summary by state ss -tnp state established # all established TCP connections netstat -tlnp # older equivalent (netstat is deprecated but still common) ## Is the port open / reachable? nc -zv 10.0.1.5 5432 # TCP port check (can I reach Postgres?) curl -v https://api.cartrack.sg/health # full HTTP trace including TLS handshake telnet 10.0.1.5 5432 # older alternative ## DNS resolution dig api.cartrack.sg # full DNS query with timing dig +short api.cartrack.sg # just the IP nslookup api.cartrack.sg # simpler alternative cat /etc/resolv.conf # which DNS servers the server uses ## Route tracing traceroute api.cartrack.sg # hop-by-hop latency to destination mtr --report api.cartrack.sg # continuous traceroute with packet loss % ## Packet capture (powerful but requires root) tcpdump -i eth0 port 1883 # capture MQTT traffic tcpdump -i eth0 host 10.0.1.5 -w /tmp/capture.pcap # save to file tcpdump -n -r /tmp/capture.pcap # read saved capture ## Connection count by state (useful for diagnosing connection exhaustion) ss -tn | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn # Shows: ESTABLISHED, TIME_WAIT, CLOSE_WAIT counts # Too many TIME_WAIT = normal for HTTP (short-lived connections) # Too many CLOSE_WAIT = app not closing connections properly (code bug)
1. Is the DB process even running? On the DB server:
systemctl status postgresql. ps aux | grep postgres.2. Is it listening on the right port?
ss -tlnp | grep 5432. If it's listening on 127.0.0.1 only, remote connections can't reach it — check postgresql.conf listen_addresses.3. Can the app server reach the DB server?
nc -zv <db-host> 5432 from the app server. If this fails, it's network — firewall rules, security groups (AWS), or routing.4. Is the connection being rejected by Postgres itself? Check
pg_hba.conf — the client's IP must match an allowed rule. Check the DB logs (journalctl -u postgresql) for "authentication failed" or "no pg_hba.conf entry".5. Check connection pool exhaustion:
SELECT count(*), state FROM pg_stat_activity GROUP BY state — if max_connections is reached, new connections are rejected with "too many connections". Solution: add RDS Proxy / PgBouncer or increase max_connections (with care).Firewall and iptables / nftables basics
## UFW (Ubuntu Uncomplicated Firewall — abstraction over iptables) ufw status verbose # current rules ufw allow 22/tcp # allow SSH ufw allow from 10.0.0.0/8 to any port 5432 # only internal IPs to Postgres ufw deny 5432 # block Postgres from public ufw enable ## iptables (lower-level, still common) iptables -L -n -v # list all rules with packet counts iptables -A INPUT -p tcp --dport 1883 -s 10.0.0.0/8 -j ACCEPT # allow MQTT from private network iptables -A INPUT -p tcp --dport 1883 -j DROP # drop everything else to MQTT port ## Check if a specific rule is blocking (count packets hitting it) iptables -L INPUT -v -n | grep 1883
Bash fundamentals — variables, conditionals, loops, functions
#!/bin/bash set -euo pipefail # e=exit on error, u=error on unset var, o pipefail=pipe fails matter trap 'echo "ERROR: Script failed at line $LINENO"' ERR ## Variables and quoting NAME="Cartrack" echo "Hello $NAME" # double quotes: variable expanded echo 'Hello $NAME' # single quotes: literal, no expansion FILES=$(ls /var/log/*.log) # command substitution COUNT=${#FILES} # string length ## Conditionals if [[ -f "/etc/config.env" ]]; then source /etc/config.env elif [[ -d "/etc/config/" ]]; then source /etc/config/main.env else echo "No config found" >&2; exit 1 fi [[ $COUNT -gt 0 ]] && echo "Has files" || echo "Empty" ## String tests: -z (empty), -n (not empty), == (equal), != (not equal) ## File tests: -f (file), -d (dir), -r (readable), -x (executable) ## Number tests: -eq -ne -lt -gt -le -ge ## Loops for file in /var/log/gps/*.log; do echo "Processing $file" done while IFS= read -r line; do echo "Line: $line" done < /etc/servers.txt ## Functions log() { local level="$1"; shift echo "$(date '+%Y-%m-%d %H:%M:%S') [$level] $*" >> /var/log/deploy.log } log INFO "Deployment started" log ERROR "Config file missing" ## Arrays SERVERS=("web1" "web2" "web3") for s in "${SERVERS[@]}"; do ssh "deploy@$s" 'systemctl restart app' done
Text processing — grep, awk, sed, sort, xargs
## grep — pattern matching grep -i "error" /var/log/app.log # case-insensitive grep -v "debug" /var/log/app.log # invert (exclude debug lines) grep -oP 'vehicle_id=\K[^\s]+' # extract just the match (-o) with Perl regex grep -c "CRITICAL" /var/log/app.log # count matching lines grep -A3 -B1 "ERROR" app.log # 3 lines after, 1 before match (context) ## awk — structured text processing awk '{print $1, $3}' /var/log/access.log # print columns 1 and 3 awk -F',' '$5 > 100 {print $1}' data.csv # filter CSV: col 5 > 100, print col 1 awk '{sum+=$NF} END {print sum}' file # sum the last column awk 'NR%1000==0 {print NR, $0}' bigfile # sample every 1000th line from huge file ## sed — stream editor for substitution sed 's/ERROR/CRITICAL/g' app.log # replace ERROR with CRITICAL sed -n '100,200p' bigfile # print only lines 100-200 sed '/^#/d' config.ini # delete comment lines sed -i 's/old_host/new_host/g' config # edit file in-place ## sort, uniq — counting and deduplication sort -t',' -k3 -n data.csv # sort by column 3 numerically sort | uniq -c | sort -rn | head -20 # frequency analysis (top 20 most common) LC_ALL=C sort -u bigfile # fast ASCII sort + dedup ## xargs — batch parallel execution find /data -name "*.gz" | xargs -P4 -I{} gzip -d {} # decompress 4 at a time cat server_list.txt | xargs -I{} ssh {} 'uptime' # run uptime on all servers ## Real-world: find top 10 IPs hammering the API awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -10 ## Parse structured logs (JSON) cat app.log | python3 -c " import sys, json for line in sys.stdin: try: d = json.loads(line) if d.get('level') == 'ERROR': print(d['timestamp'], d['message']) except: pass "
Cron, scheduling, and automation
## crontab format: min hour day month weekday command crontab -e # edit current user's crontab crontab -l # list current user's crontab crontab -u postgres -l # list another user's crontab ## Cron examples 0 2 * * * /opt/scripts/backup_db.sh # daily at 2am */5 * * * * /opt/scripts/health_check.sh # every 5 minutes 0 0 1 * * /opt/scripts/monthly_report.sh # first of every month 0 8-18 * * 1-5 /opt/scripts/fleet_sync.sh # every hour, business hours, weekdays ## Best practices for cron scripts # 1. Use absolute paths — cron has a minimal PATH # 2. Redirect output to a log file 0 2 * * * /opt/scripts/backup.sh >> /var/log/backup.log 2>&1 # 3. Add a lock file to prevent overlapping runs 0 */2 * * * flock -n /tmp/etl.lock /opt/scripts/etl.sh || echo "Already running" # 4. Set MAILTO="" to suppress email output (noisy by default) MAILTO="" ## /etc/cron.d/ — system-level cron with user specification # /etc/cron.d/gps-archive: 0 3 * * * postgres /opt/scripts/archive_gps_partitions.sh
Practical scripting — file watcher, health checker, deployment helper
#!/bin/bash ## Health check script with alerting set -euo pipefail ENDPOINT="https://api.cartrack.sg/health" SLACK_WEBHOOK="${SLACK_WEBHOOK_URL}" THRESHOLD_MS=2000 check_health() { local start end elapsed http_code start=$(date +%s%N) http_code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 "$ENDPOINT") end=$(date +%s%N) elapsed=$(( (end - start) / 1000000 )) # ms if [[ "$http_code" != "200" ]]; then alert "API returned HTTP $http_code" elif [[ $elapsed -gt $THRESHOLD_MS ]]; then alert "API slow: ${elapsed}ms (threshold: ${THRESHOLD_MS}ms)" fi } alert() { local msg="$1" echo "$(date '+%Y-%m-%d %H:%M:%S') ALERT: $msg" >> /var/log/health-check.log curl -s -X POST "$SLACK_WEBHOOK" \ -H 'Content-type: application/json' \ --data "{\"text\":\"⚠️ $msg\"}" > /dev/null } check_health
SSH, key management, and secure remote access
## Generate a key pair (Ed25519 is modern and fast) ssh-keygen -t ed25519 -C "deploy@cartrack" -f ~/.ssh/cartrack_deploy ## Copy public key to server ssh-copy-id -i ~/.ssh/cartrack_deploy.pub user@server ## ~/.ssh/config for managing multiple servers Host cartrack-prod HostName 10.0.1.100 User deploy IdentityFile ~/.ssh/cartrack_deploy ServerAliveInterval 60 ## Then just: ssh cartrack-prod ## /etc/ssh/sshd_config hardening PermitRootLogin no # never allow root SSH PasswordAuthentication no # key-only authentication PubkeyAuthentication yes AllowUsers deploy appuser # whitelist specific users Port 2222 # non-standard port reduces script-kiddie noise ## SSH tunnelling (port forwarding) ssh -L 5432:db-server:5432 jump-host # forward local 5432 through jump-host to DB ssh -R 8080:localhost:3000 server # expose local port 3000 on remote server's 8080 ## scp and rsync for file transfer scp -i ~/.ssh/key file.tar.gz deploy@server:/opt/releases/ rsync -avz --delete /opt/app/ deploy@server:/opt/app/ # sync, delete removed files
1. User management: Disable root SSH (
PermitRootLogin no). Create a dedicated deploy user with sudo for specific commands only. Remove or lock default accounts.2. SSH hardening: Key-only authentication (
PasswordAuthentication no). Non-default port. Restrict AllowUsers. Enable fail2ban to block repeated failed login attempts.3. Firewall: Default deny all inbound. Allow only needed ports (22/2222 from known IPs, 443/80 from world, 5432 only from app server IP).
4. Updates: Apply security patches (
apt upgrade). Enable unattended-upgrades for security patches only.5. File permissions: Ensure application runs as a non-root user.
chmod 600 for secrets and key files. Audit SUID binaries with find / -perm /4000.6. Monitoring: Install auditd for syscall logging. Configure logwatch or send logs to central SIEM.
Environment variables and secrets management
## Setting and using environment variables export DB_PASSWORD="secret" # set for current session + subprocesses DB_HOST=localhost ./app # set only for one command printenv # show all env vars env | grep APP_ # filter to app-specific vars ## /etc/environment — system-wide, all users, loaded at login ## /etc/profile.d/*.sh — loaded for login shells ## ~/.bashrc — user-specific interactive shells ## Loading from a .env file in a script set -a # auto-export all variables source /etc/app/.env set +a ## Never commit .env files — use .gitignore ## In production: use systemd EnvironmentFile or AWS Secrets Manager ## Masking secrets in logs echo "DB_PASSWORD: ${DB_PASSWORD:0:3}***" # show only first 3 chars ## Check if a required env var is set (fail fast if not) : "${DB_HOST:?DB_HOST must be set}" # exits with error message if unset
ps aux output). Use a secrets manager: AWS Secrets Manager, HashiCorp Vault, or at minimum a EnvironmentFile in the systemd unit with chmod 600 so only root and the service user can read it.Git & Collaborative Workflows
Cartrack uses GitLab. Know branching strategies, merge request workflows, and how to enforce standards across a distributed team.
How Git stores data
Commits, trees, blobs — snapshots not diffs
git reset: Moves the branch pointer backward, rewriting history.
--soft keeps changes staged, --mixed keeps them unstaged, --hard discards changes. Only use locally before pushing — breaks teammates' branches if used on shared branches.git restore: For working directory and staging area only — doesn't touch commits.
git restore file.txt discards uncommitted changes. git restore --staged file.txt unstages a file.Git Flow — full model
5 branch types, their lifetimes, and strict merge rules
# Start a feature git checkout develop git checkout -b feature/geofence-alerts # ... work ... git checkout develop git merge --no-ff feature/geofence-alerts # --no-ff preserves branch history git branch -d feature/geofence-alerts # Cut a release (only bug fixes from here, no new features) git checkout develop git checkout -b release/2.0.0 # bump version number, fix bugs found in testing git checkout main && git merge --no-ff release/2.0.0 git tag -a v2.0.0 -m "Release 2.0.0" git checkout develop && git merge --no-ff release/2.0.0 # back-merge fixes # Hotfix in production git checkout main git checkout -b hotfix/critical-gps-null # fix the bug git checkout main && git merge --no-ff hotfix/critical-gps-null git tag -a v2.0.1 git checkout develop && git merge --no-ff hotfix/critical-gps-null # crucial!
When to choose Git Flow
Mobile/desktop apps with App Store releases. Libraries with semantic versioning. Products supporting multiple major versions simultaneously (v1.x and v2.x in production). Teams with formal QA and UAT cycles before each release. Compliance-heavy products needing audit trails per release.
Git Flow pitfalls
Merge conflicts accumulate — the longer feature branches live, the bigger the diff. The develop branch can become a "last stable mess." Hotfixes must be merged into two branches — easy to forget the back-merge to develop. Not suitable for multiple daily deployments — too much ceremony per release.
Trunk-Based Development (TBD)
One main branch, short-lived feature branches, always deployable
# The workflow is intentionally simple git switch -c feat/speed-alert-refactor # branch from main # make small, focused commits — push frequently git push origin feat/speed-alert-refactor # Open MR → CI passes → 1 approval → squash merge → delete branch # Feature flag in application code (incomplete work is merged but hidden) if (featureFlags.isEnabled('new-geofence-engine', customerId)) { // new unfinished code — only visible to internal testers } else { // old code — what all customers see }
When to choose TBD
SaaS products deployed multiple times per day. Small to medium teams where everyone knows the codebase. Products with good automated test coverage (tests must pass before merge). Teams using feature flags (LaunchDarkly, Unleash, or DB-backed flags). Cartrack's engineering model — continuous delivery to 2.5M subscribers.
TBD prerequisites
Solid CI pipeline that runs in <10 minutes. Feature flags infrastructure for dark deploys. Culture of small, reviewable commits. Test coverage sufficient to catch regressions quickly. Monitoring and rollback capability for production issues.
GitHub Flow
One long-lived branch: main. Everything else is a PR with preview deployment.
git switch -c feature/driver-coaching-alertsGitHub Flow vs TBD
GitHub Flow is essentially TBD with an explicit preview deployment step as part of the PR workflow. TBD is more prescriptive about keeping branches under 2 days. Both require main to always be deployable. GitHub Flow is often the more pragmatic starting point for teams transitioning from Git Flow.
GitHub Flow limitations
No built-in mechanism for supporting multiple production versions. No explicit release stabilisation stage. Merging to main means immediate production — requires high confidence in CI and monitoring. Not suitable if you need a "release candidate" phase.
Environment branching strategy
A branch per deployment environment — common in enterprise/regulated teams
When environment branching is used
Enterprises with separate QA, UAT, and production environments. Regulated industries (finance, healthcare) where manual approval gates are required between environments. Teams where different environment configs are managed in the branch (not recommended). Legacy organisations migrating from SVN-style trunk/branch workflows.
Problems with environment branching
Branches diverge — staging and main can be weeks apart in diff. Merge conflicts compound across environments. "Works in staging, broken in prod" becomes common. It's a deployment problem disguised as a branching problem — the real solution is environment parity and feature flags, not permanent environment branches.
Release branching — long-term version support
Used when multiple major versions must be maintained in production simultaneously
# Backport a security fix to v2.x release branch git checkout release/2.x git cherry-pick abc1234 # cherry-pick the specific fix commit from main git tag -a v2.8.1 -m "Security: fix GPS auth bypass" # DO NOT merge release/2.x back into main — it contains old code
Side-by-side strategy comparison
Choose based on release cadence, team size, and product type
STRATEGY RELEASE CADENCE BRANCH COUNT BEST FOR Trunk-Based Multiple/day Low (1-2) SaaS, continuous delivery GitHub Flow Daily Low (1+) Web services, small teams Git Flow Scheduled High (5 types) Mobile apps, OSS libraries Environment Br. Scheduled High (3+) Enterprise with manual gates Release Branches Multiple versions Medium SDKs, enterprise software
1. How often do you need to release? Multiple times a day → TBD. Once a week or less → Git Flow or GitHub Flow. Multiple versions in production → release branches.
2. How stable is the codebase and test suite? If CI is unreliable or test coverage is low, long-lived branches provide a safety buffer. If CI is solid and coverage is high, TBD works safely.
3. What's the team size and codebase familiarity? Larger teams benefit from more explicit rules (Git Flow). Small, experienced teams can operate efficiently with TBD's simplicity.
4. Are there external constraints? App Store review cycles force less frequent releases → Git Flow. Compliance requiring UAT before production → environment branching or release branches.
My default recommendation is always to start with GitHub Flow (simple, mainstream) and move to TBD as the team matures its CI and feature flag discipline. Reserve Git Flow for true scheduled-release products.
Merge strategies — merge commit vs squash vs rebase
The three ways to integrate a branch and what history each produces
git log looks like and how easy rollbacks are.Merge commit
Full history preserved. Two parents on the merge commit. git log --graph shows the branch shape.
Best for: Git Flow's develop branch, long-lived branch integration. Rollback: revert the merge commit.
Squash merge
All feature commits become one commit on main. Clean, readable history. Individual commits lost.
Best for: TBD and GitHub Flow — keeps main history clean. Rollback: revert the single squash commit. GitLab's recommended default.
Rebase merge
Each commit replayed onto main, linear history. No merge commit. Rewrites feature commit hashes.
Best for: Small teams with clean commit hygiene. Rollback: revert individual commits. Never rebase public/shared branches.
# GitLab MR merge strategies (Settings → General → Merge requests) # "Merge commit" — creates merge commit (default) # "Merge commit with semi-linear history" — rebase first, then merge commit # "Fast-forward merge" — rebase only, no merge commit, requires linear history # Team recommendation: Squash and merge # Every MR = one commit on main = easy to revert a feature = clean log
My response: the detailed commit history of a feature branch is available as long as the branch exists. We keep remote branches until the sprint retrospective — anyone who wants to review the commit-by-commit reasoning can do so before branch deletion. What lands on main is deliberately coarse-grained — one commit per feature — because main is a record of what shipped, not how it was built.
For the cases where granular commit history genuinely matters (complex algorithms, security-sensitive code), I'd encourage the developer to include that context in the MR description, which GitLab preserves permanently against the merge commit. If the team strongly disagrees with squash, semi-linear merge (rebase + merge commit) is a middle ground.
1. Decompose the feature: If a branch is more than 3 days old, it's too big. Break the epic into independently mergeable increments. Use feature flags to merge partial implementations safely — the UI button only appears when the flag is on.
2. Rebase frequently: If the branch must stay open, have the developer rebase onto main every day or two. This keeps the diff small and conflicts manageable rather than letting them compound for two weeks.
3. Pair programming: Long-running branches are often a sign someone is stuck or uncertain. Schedule a pairing session to unblock, then get it merged.
4. Set a branch age limit policy: Flag any branch older than 5 days in the weekly team standup. Not as a blame exercise — as an early warning that something needs attention.
The full PR / MR lifecycle
From branch creation to merge — every step and who owns it
feat/gps-deduplication, fix/null-speed-on-idle, chore/upgrade-php83. Short-lived — target 1–3 days max before merging.feat:, fix:, refactor:, docs:, test:. Never commit "WIP" to a shared branch.When reviewing: I look for correctness first — does it do what it claims? Then edge cases and error handling. Then readability and naming. I label my comments: blocking (must fix), suggestion (optional improvement), question (I want to understand the choice). I give specific, actionable feedback rather than vague comments like "this is confusing." I also call out things done well — review shouldn't just be a list of problems.
I try to approve within 24 hours. A PR waiting for review is a team bottleneck — it blocks the author and inflates WIP.
Keep PRs small. A 50-line PR gets reviewed in 10 minutes. A 500-line PR gets skimmed and rubber-stamped. I enforce a culture of frequent small PRs over infrequent large ones.
Use draft/WIP PRs for early feedback. Opening a draft MR early means you can get architecture feedback before writing all the code — much cheaper to change direction at that point.
Set review SLAs. For my team, the expectation was: review within 24 hours on working days. If you can't, re-assign.
Automate the automatable. Linting, formatting, test coverage thresholds — all enforced by CI, not reviewers. Reviewers focus on logic and design, not style.
Pair programming as a PR alternative. For complex or risky changes, I'd do a live pairing session instead of async review. Faster, more effective.
I respond to every comment — even "good catch, fixed" or "I see your point but kept my approach because X — happy to discuss." I never silently change code without acknowledging the comment.
If I genuinely disagree after understanding their reasoning, I state my position calmly with evidence — benchmarks, docs, a concrete example. I don't dig in defensively. The goal is to ship the best code, not to win.
If we're stuck, I escalate to a third reviewer or suggest we timebox a discussion. "Disagree and commit" is a valid outcome — sometimes you implement their approach while noting your preference for the record. What I don't do is leave a comment thread unresolved and merge anyway.
GitLab MR template + branch protection setup
The configuration that enforces quality gates automatically
# .gitlab/merge_request_templates/default.md
## What does this MR do?
<!-- One sentence summary -->
## Why?
<!-- Link to issue/ticket. What problem does this solve? -->
## How to test
<!-- Steps to verify the change works -->
## Checklist
- [ ] Tests added or updated
- [ ] CI pipeline passing
- [ ] No debug/commented-out code
- [ ] Documentation updated if needed
- [ ] Screenshots attached (for UI changes)
Protect main branch
Allowed to merge: Maintainers only. Allowed to push: No one (force-push disabled). Require MR approvals: 1 minimum. Require CI to pass before merge.
Useful MR settings
Delete source branch on merge (auto-cleanup). Squash commits by default (clean history). Require resolved discussions before merge. Enable "Suggested reviewers" based on file ownership (CODEOWNERS).
# CODEOWNERS file — auto-assign reviewers by path # GitLab reads this from the root of the repo /database/migrations/ @db-lead /api/auth/ @security-team /frontend/ @frontend-team * @tech-lead
Git hooks — enforcing quality at commit time
pre-commit, commit-msg, pre-push hooks and what to automate
.git/hooks/ but are not committed — use a tool like Husky (Node.js) or pre-commit (Python/any) to share hooks across the team via the repo.# package.json — Husky setup for Node/PHP projects { "husky": { "hooks": { "pre-commit": "lint-staged", // lint only staged files "commit-msg": "commitlint --edit $1", // enforce conventional commits "pre-push": "npm test" // run tests before push } }, "lint-staged": { "*.php": ["phpcs", "phpstan analyse"], "*.ts": ["eslint --fix", "prettier --write"] } } # commitlint.config.js — enforce Conventional Commits module.exports = { extends: ['@commitlint/config-conventional'], rules: { 'type-enum': [2, 'always', [ 'feat', 'fix', 'chore', 'refactor', 'docs', 'test', 'perf' ]] } };
--no-verify.Conventional commits + automated changelogs
# Format: type(scope): description feat(gps): add deduplication for duplicate device packets fix(auth): resolve null pointer on expired JWT refresh chore(deps): upgrade php-cs-fixer to 3.x refactor(trips): extract TripReconstructionService from GPSProcessor docs(api): add OpenAPI spec for /v1/vehicles endpoint test(geofence): add edge case for boundary crossing at exactly midnight perf(dashboard): cache driver score aggregations for 5 minutes # Breaking changes — triggers major version bump feat(api)!: remove deprecated /v1/fleet/all endpoint # Generate changelog automatically with standard-version or release-please npx standard-version # bumps version, generates CHANGELOG.md, tags commit
fix: = patch, feat: = minor, feat!: = major. Your release notes write themselves, and junior developers have a clear framework for commit messages without needing a style guide document.Advanced Git: bisect, reflog, stash, worktrees
## git bisect — binary search for a regression git bisect start git bisect bad # current = broken git bisect good v2.3.0 # this tag was working git bisect run ./run_tests.sh # automate (exit 0 = good) git bisect reset # clean up # On 1,000 commits: finds the bad one in ~10 steps ## git reflog — your safety net for "lost" commits git reflog # see every HEAD movement git checkout HEAD@{3} # go back 3 HEAD positions git branch recovery HEAD@{3} # recover "deleted" branch ## git stash — save work-in-progress without committing git stash push -m "half-done geofence fix" git stash list git stash pop # restore latest stash git stash apply stash@{2} # restore specific stash, keep it ## git worktree — check out two branches simultaneously git worktree add ../hotfix-branch hotfix/critical-fix # Work in ../hotfix-branch while staying on your feature branch git worktree remove ../hotfix-branch
git reflog is the most underused Git command. Even after a git reset --hard or accidental branch deletion, the commits are still in the object store for ~90 days. Reflog lets you recover them before they're garbage-collected.Tagging, releases & semantic versioning
MAJOR.MINOR.PATCH — e.g. 2.4.1. MAJOR = breaking API change. MINOR = new backwards-compatible feature. PATCH = backwards-compatible bug fix. Pre-release: 2.5.0-rc.1.# Lightweight tag (just a pointer to a commit) git tag v2.4.1 # Annotated tag (preferred — includes message, author, date) git tag -a v2.4.1 -m "Release 2.4.1: fix GPS null handling" git push origin v2.4.1 # tags don't push automatically git push origin --tags # push all local tags # List and filter tags git tag -l "v2.*" # GitLab: tag-triggered pipeline deploys to production # rules: - if: $CI_COMMIT_TAG
System Design & Architecture
As a tech lead you'll be expected to drive architecture decisions. These concepts underpin every system design discussion — from a whiteboard interview to a real technical spec.
Vertical vs horizontal scaling
Scale up vs scale out — and when each one breaks
Vertical scaling (Scale up)
Add more CPU/RAM/disk to a single server. Simple — no application changes needed. Has a hard ceiling (biggest available machine). Creates a single point of failure. Downtime during upgrades.
Good for: Databases (simpler than distributed), early-stage products, stateful services that are hard to distribute.
Horizontal scaling (Scale out)
Add more servers, distribute load across them. Theoretically unlimited. Requires stateless services (or external state management). More complex operationally — needs load balancing, service discovery.
Good for: Stateless API servers, microservices, read replicas. Cartrack's ingestion layer must scale horizontally.
Load balancing — algorithms and strategies
Round robin, least connections, consistent hashing, health checks
Round robin
Requests go to servers in rotation: 1, 2, 3, 1, 2, 3... Simple. Ignores server capacity or current load. Works well when servers are identical and requests are similar in weight.
Least connections
Each new request goes to the server with the fewest active connections. Better for variable-duration requests (e.g. a mix of quick API calls and long GPS history exports).
IP hash / Consistent hashing
Same client always reaches the same server. Useful for session affinity (sticky sessions). Consistent hashing minimises redistribution when servers are added/removed — critical for cache layers.
Layer 4 vs Layer 7
L4 (TCP): Fast, routes by IP/port, no request inspection. L7 (HTTP): Slower but content-aware — can route based on URL path, headers, or cookies. Use L7 for API gateways and microservice routing.
1. Sticky sessions (session affinity): The load balancer sends the same user to the same server. Simple, but you lose the flexibility of true horizontal scaling and a server crash takes down sessions with it.
2. Centralised session store: Store sessions in Redis or a database instead of in-memory. All servers share the same session state. This is the standard approach — stateless servers, stateful store. Sessions survive server restarts.
3. Token-based auth (JWT): No server-side session at all. The client holds a signed token containing claims. The server validates the signature without storing any state. Best for APIs — completely stateless, infinitely scalable. The tradeoff is that you can't instantly invalidate a JWT (it's valid until expiry), though a token blocklist in Redis solves this if needed.
CAP theorem — Consistency, Availability, Partition tolerance
Why you can only pick two, and what that means in practice
Consistency (C)
Every read returns the most recent write or an error. All nodes see the same data at the same time. Like a single source of truth.
Availability (A)
Every request gets a (non-error) response — but it might not contain the most recent data. The system stays online even if some nodes are unavailable.
Partition tolerance (P)
The system continues operating even when network messages are dropped or delayed between nodes. In a distributed system, P is non-negotiable — networks fail.
CP systems (consistent + partition tolerant)
Return an error rather than stale data during a partition. Examples: HBase, Zookeeper, traditional RDBMS with synchronous replication. Use when: correctness is critical — financial transactions, inventory management, GPS event deduplication.
AP systems (available + partition tolerant)
Return potentially stale data during a partition — eventually consistent. Examples: DynamoDB, Cassandra, DNS. Use when: uptime matters more than perfect accuracy — user profile reads, product catalogue, fleet dashboard overview metrics.
Database sharding & replication
When a single DB isn't enough — and the complexity trade-offs
Sharding strategies
Hash-based: shard = hash(customer_id) % N. Evenly distributed. Hard to add shards (rehashing).
Range-based: customers A–M on shard 1, N–Z on shard 2. Simple but can create hot shards.
Directory-based: A lookup table maps each key to a shard. Flexible but the directory becomes a bottleneck.
Problems sharding creates
Cross-shard JOINs are not possible (must be done in application code). Distributed transactions are extremely complex. Re-sharding when you outgrow the current split is painful. Operational complexity multiplies: N databases to backup, monitor, and maintain.
Microservices vs monolith — when each makes sense
Monolith
Pros: Simple to develop, test, deploy, and debug. No network latency between components. Easy to do cross-cutting concerns (transactions, logging).
Cons: Hard to scale individual parts. Large codebase becomes unwieldy. Deployments are all-or-nothing. Tech stack is locked in.
Best for: Early-stage products, small teams (<10 devs), domains that are still being understood.
Microservices
Pros: Independent deployment and scaling per service. Teams own their service end-to-end. Technology diversity per service.
Cons: Distributed systems complexity (network failures, eventual consistency). Operational overhead multiplies. Harder to debug across service boundaries.
Best for: Large teams with clear domain boundaries, different scaling requirements per component, high-maturity engineering orgs.
I'd extract services along domain boundaries — the classic Domain-Driven Design bounded contexts. The GPS event pipeline, driver scoring, alerting, and the customer-facing dashboard are natural candidates for separation in Cartrack's case.
What I'd avoid: extracting microservices purely because the codebase feels large, or because other companies do it. The operational cost is real — you need service discovery, distributed tracing, and much better monitoring before microservices become a net positive.
Event-driven architecture
Events, commands, queries — and when async beats synchronous
VehicleSpeedExceededLimitSendDriverAlertGetDriverScoreForLastMonthHow to answer a system design question in an interview
A structured approach that shows seniority
Application Technology Fundamentals
Core concepts every senior engineer and tech lead is expected to know — caching, CDN, messaging, APIs, reliability patterns, and observability.
Caching patterns — Cache-Aside, Write-Through, Write-Behind, Read-Through
Cache-Aside (Lazy loading)
App checks cache first. On miss, loads from DB, writes to cache, returns data. Most common pattern.
Pros: Only caches what's actually requested. Cache failure doesn't break reads.
Cons: First request always slow (cache miss). Cache can go stale.
Use for: Read-heavy data, fleet dashboard vehicle lists.
Write-Through
On every write, app writes to both cache and DB synchronously. Cache is always up to date.
Pros: No stale reads.
Cons: Every write is slower (two writes). Cache fills with data that may never be read.
Use for: Write-then-immediately-read patterns, user profiles.
Write-Behind (Write-Back)
App writes to cache only. Cache asynchronously flushes to DB later.
Pros: Very fast writes.
Cons: Risk of data loss if cache crashes before flush. Complex to implement.
Use for: High-frequency counters, analytics aggregation.
Read-Through
Cache sits in front of the DB. On miss, the cache (not the app) fetches from DB and populates itself.
Pros: App logic is simpler — just talk to the cache.
Cons: Cache vendor must support it (Redis doesn't natively).
Use for: ORM-level caching with libraries like Hibernate.
Strategies for invalidation:
TTL (time-based expiry): The simplest approach — cache entries expire after N seconds. Acceptable stale window for most fleet dashboard data (vehicle list can be 30 seconds stale). Terrible for anything requiring real-time accuracy.
Event-driven invalidation: When the underlying data changes, publish an event that explicitly clears or updates the cache entry. More complex but precise. In Cartrack, when a driver's details are updated in the DB, that event triggers a cache delete for that driver's cached profile.
Write-through invalidation: Every write updates the cache simultaneously. Zero staleness, but adds write latency.
The real challenge: Cache invalidation at the boundary between services. If service A caches data owned by service B, how does A know when B's data changes? This is where event-driven architecture (publish invalidation events to a message queue) beats direct cache management.
Prevention strategies:
1. Mutex/lock: When a cache miss occurs, only one request fetches from the DB and populates the cache. Other requests wait. Use a Redis distributed lock (SETNX) to implement this.
2. Probabilistic early expiry (PER): Before the TTL expires, start randomly refreshing the cache for a small percentage of requests. By the time the TTL hits, the cache is already warm.
3. Background refresh: Cache entries are refreshed asynchronously before they expire. The cache always serves (possibly slightly stale) data while a background job updates it.
4. Staggered TTLs: Add random jitter to TTLs — instead of all entries expiring at time T, they expire at T + random(0, 30s). Prevents synchronised expiry across many keys.
Redis — data structures, use cases & patterns
# String — simple key-value, TTL-based cache SET driver:score:123 "87.5" EX 300 # expires in 5 min GET driver:score:123 # Hash — store object fields (avoid multiple roundtrips) HSET vehicle:456 lat 1.3521 lng 103.8198 speed 65 HGETALL vehicle:456 # Sorted Set — leaderboards, rate limiting by score ZADD driver_scores 87.5 "driver:123" ZRANGE driver_scores 0 9 REV WITHSCORES # top 10 drivers # Distributed lock (prevent cache stampede) SET lock:refresh:driver_scores "1" NX PX 5000 # NX = only if not exists, PX = ms TTL # Rate limiting counter INCR rate:api:customer:789 EXPIRE rate:api:customer:789 60 # reset every minute # Pub/Sub — lightweight event fanout PUBLISH geofence_alerts '{"vehicle":456,"zone":"depot","event":"exit"}'
CDN — Content Delivery Network
How it works, what to cache, edge computing
What to put behind a CDN
Static assets: JS bundles, CSS, images, fonts (cache forever with content-hash filenames). Public API responses that change infrequently (fleet config, map tile data). HTML for SSR pages (short TTL). Large file downloads (vehicle footage exports).
What NOT to put behind a CDN
Real-time data: live GPS positions, WebSocket connections. Authenticated user-specific data (unless using private CDN with signed URLs). Anything requiring immediate invalidation (e.g. emergency geofence changes).
# Static assets (content-hash in filename — cache forever) Cache-Control: public, max-age=31536000, immutable # API response — cache for 60s, allow stale for 5s while revalidating Cache-Control: public, max-age=60, stale-while-revalidate=5 # Private user data — never cache at CDN level Cache-Control: private, no-store
A CDN is a globally distributed network of edge nodes operated by a third party (Cloudflare, AWS CloudFront, Fastly). It caches content at locations worldwide, dramatically reducing latency for geographically distributed users. You configure it, they operate it.
For Cartrack serving fleet managers across 20+ countries: I'd use Nginx/HAProxy as the reverse proxy in our Singapore data centre for internal routing and load balancing, and put Cloudflare or CloudFront in front of it for global edge caching of static assets and public API responses. Both are complementary — the CDN catches what it can globally, the reverse proxy handles what reaches origin.
DNS, SSL/TLS & HTTPS
What happens between a browser URL and your server
api.cartrack.sg. TTL controls how long this is cached.REST vs GraphQL vs WebSocket vs gRPC
When to use each communication style
REST
Request-response over HTTP. Resource-oriented URLs. Cacheable, stateless, widely understood. Best for most CRUD APIs — fleet management config, driver profiles, historical reports. Simple to consume from any client.
GraphQL
Client specifies exactly what data it needs in a single query. Reduces over-fetching and under-fetching. Best for complex UIs where different views need different data shapes. Harder to cache (POST by default). Good for the Cartrack dashboard frontend.
WebSocket
Persistent bidirectional connection. Server can push data to client without polling. Best for real-time data: live GPS positions on the map, instant alert notifications, live driver score updates. Low overhead per message once connected.
gRPC
Binary protocol over HTTP/2. Strongly typed (Protocol Buffers). Very fast and efficient. Best for internal service-to-service communication. Not browser-friendly without a proxy layer. Used between microservices where performance is critical.
GET /vehicles/positions every 3 seconds, and there are 10,000 concurrent users, that's 3,333 requests/second — just for position updates. Wasteful and doesn't scale.WebSocket approach: When a user opens the fleet map, the client establishes a WebSocket connection to a dedicated presence server. The backend subscribes to GPS events from Kafka for the specific fleet's vehicles and pushes position updates over the WebSocket as they arrive.
Scale management: Users only see vehicles in the current map viewport — subscribe only to vehicle updates relevant to the visible area (spatial filtering). For customers with large fleets, cluster vehicles at high zoom levels rather than rendering all 500 markers individually.
Fallback: Server-Sent Events (SSE) as a fallback for environments where WebSocket is blocked — SSE is unidirectional (server to client) but simpler to implement and works over HTTP/2.
Message queues — Kafka vs RabbitMQ vs SQS
Pub/sub vs point-to-point, ordering, durability, and replay
Kafka
Type: Distributed log / pub-sub
Retention: Days to weeks (configurable)
Replay: Yes — consumers track their own offset
Ordering: Guaranteed within a partition
Use for: High-throughput event streaming, GPS ingestion, audit logs, event sourcing
RabbitMQ
Type: Message broker (AMQP)
Retention: Until consumed
Replay: No (message deleted after ack)
Ordering: Within a queue
Use for: Task queues, RPC patterns, complex routing (topic exchanges), email/SMS dispatch
AWS SQS
Type: Managed queue
Retention: Up to 14 days
Replay: No
Ordering: Standard (best-effort) or FIFO
Use for: Cloud-native workloads, decoupled Lambda functions, simple job queues without infrastructure to manage
Circuit breaker, retry, and bulkhead patterns
How to build services that fail gracefully
I'd implement one when: (1) the dependency is external (third-party mapping API, SMS gateway), (2) the dependency is a non-critical service whose failure shouldn't cascade, or (3) the dependency is known to occasionally spike in latency under load.
In Cartrack's context: the SMS alert service, external map tile provider, and vehicle registration API are all good candidates for circuit breakers. The GPS ingestion pipeline should not circuit-break on DB failures — it should queue and retry, since losing GPS data is unacceptable. Libraries like Resilience4j (Java), Polly (.NET), or a service mesh like Istio implement this without application-level code.
Observability — logs, metrics, traces
The three pillars and how they work together
Logs
What: Timestamped records of discrete events.
Tools: ELK stack (Elasticsearch, Logstash, Kibana), Grafana Loki, Datadog Logs.
Best practice: Structured JSON logs (not plain text). Include request_id, user_id, vehicle_id for correlation. Log at appropriate level — don't log DEBUG in production.
Metrics
What: Numeric measurements over time — counters, gauges, histograms.
Tools: Prometheus + Grafana, Datadog, CloudWatch.
Key metrics: Request rate (RPS), error rate (%), latency (p50/p95/p99), queue depth, cache hit ratio, DB connection pool usage.
Distributed Traces
What: End-to-end journey of a single request across multiple services.
Tools: Jaeger, Zipkin, Datadog APM, OpenTelemetry.
Use for: "This fleet manager's dashboard takes 3s to load — which service is slow?" Traces show the exact span of time each service component contributes.
Rate limiting — algorithms and implementation
Token bucket
A bucket holds N tokens. Each request consumes one token. Tokens refill at a constant rate. Allows short bursts up to the bucket size. Most flexible algorithm — handles bursty traffic naturally. Redis implementation: INCR + TTL per window.
Fixed window counter
Count requests in a fixed time window (e.g. 1000 req/min). Simple but has a boundary problem: 1000 requests in the last second of minute 1 + 1000 in the first second of minute 2 = 2000 in any 2-second window — double the limit.
Sliding window log
Record timestamp of each request. Count requests within the last N seconds. Precise but memory-intensive at high request rates — stores every request timestamp.
Sliding window counter
Hybrid of fixed window + rolling calculation. Memory efficient and more accurate than fixed window. Most production rate limiters (Nginx, Cloudflare) use this approach.
# Redis sliding window rate limit function isRateLimited(customerId, limit=1000, windowSec=60): key = "rate:{customerId}:{floor(now/windowSec)}" count = INCR(key) if count == 1: EXPIRE(key, windowSec * 2) # set TTL on first hit return count > limit # Return these headers so clients can self-throttle X-RateLimit-Limit: 1000 X-RateLimit-Remaining: 847 X-RateLimit-Reset: 1704067260
Fleet Telematics & Cartrack Product
Most candidates won't know the domain. Showing you've thought about the product and its technical challenges — including the cloud architecture behind it — signals initiative and seniority.
How would you design the data pipeline for ingesting real-time GPS telemetry from millions of IoT devices?
Cloud-agnostic conceptual answer
Design the GPS telemetry ingestion pipeline on AWS
Full service mapping: IoT Core → Kinesis → Lambda/ECS → RDS/S3 → API Gateway
Observability
CloudWatch: Metrics, logs, alarms for Lambda errors, Kinesis iterator age (lag), ECS CPU/memory, RDS query latency.
X-Ray: Distributed tracing across Lambda → RDS → ElastiCache — identifies where latency lives in a request.
CloudWatch Synthetics: Canary monitors the dashboard endpoint every minute from multiple regions.
Security & compliance
IAM roles: Each Lambda/ECS service has a least-privilege role — no credentials in code.
AWS Secrets Manager: DB passwords, API keys rotated automatically.
VPC: RDS and ElastiCache in private subnets, no public internet access.
KMS: Encrypt S3 data at rest, RDS storage, Kinesis stream.
Alerting & notifications
SNS: Fan-out for speed alerts, geofence events, maintenance reminders. One SNS topic can publish to email, SMS, SQS, Lambda, and HTTP endpoints simultaneously.
SQS: Dead-letter queues for failed alert deliveries — retry without data loss.
SES: Scheduled fleet reports emailed to managers.
Batch & analytics
AWS Glue: Nightly ETL from S3 (raw GPS Parquet) → compute trips, driver scores, fuel consumption.
Amazon Redshift: Data warehouse for business intelligence — fleet utilisation trends, compliance reports, customer-level analytics.
Athena: Ad-hoc SQL on raw S3 data without loading into Redshift — useful for debugging specific date ranges.
# 2.5M active vehicles × 1 ping every 5 seconds
Ingestion rate = 500,000 events/sec
Event size = ~300 bytes (lat, lng, speed, heading, odometer, timestamp, vehicle_id)
Ingestion volume= 150 MB/sec = ~13 TB/day → S3 cold storage
Kinesis shards = 500k events/s ÷ 1000 events/s/shard = ~500 shards
Lambda = Auto-scales — no capacity planning needed
ElastiCache = 2.5M vehicle position hashes × 200 bytes = ~500 MB hot data
Aurora = 2.5M vehicle rows, ~50M trips/year — read replicas handle dashboard load
S3 = ~4.7 PB/year raw GPS archive (compressed Parquet ~3–4x smaller)
AWS IoT Core is fully managed — it scales to billions of devices with no infrastructure to operate. It integrates natively with the rest of AWS (direct Rules Engine routing to Kinesis, S3, Lambda without any glue code). Device certificates are managed through AWS Certificate Manager. It also supports device shadows — a JSON document representing the last-known state of a device, which is critical for fleet management since vehicles frequently go offline in tunnels or remote areas.
The cost trade-off is real — IoT Core charges per message, which at 500k events/sec is significant. But the operational savings and reliability justify it at Cartrack's scale.
vs MSK (Managed Kafka): MSK gives you the full Kafka feature set (more partitioning flexibility, exactly-once semantics, richer ecosystem). Kinesis is simpler to operate — no broker management, auto-scaling, tighter AWS integration. For a team not deeply experienced with Kafka operations, Kinesis is safer. MSK becomes the better choice when you need more than 200 shards, need Kafka's connector ecosystem, or already have Kafka expertise in-house.
My recommendation for Cartrack: Start with Kinesis for operational simplicity. Migrate to MSK if partition limits become a constraint or if the team builds Kafka expertise.
Provisioned Concurrency: Keep N Lambda instances pre-warmed and ready. For our Kinesis consumer, this guarantees zero cold start on the critical processing path. Cost: you pay for idle warm instances.
Optimize the deployment package: Smaller packages (avoid bundling unused libraries) cold-start faster. Use Lambda layers for shared dependencies. Prefer compiled runtimes (Go, Rust) over interpreted (Python, Node) for faster init.
RDS Proxy: Without RDS Proxy, each Lambda cold start opens a new DB connection. At burst scale, this exhausts the Aurora connection pool. RDS Proxy pools connections and serves Lambda from a warm pool — critical for any Lambda that touches the DB.
For this architecture: The Kinesis consumer Lambda hits ElastiCache (not RDS) on the hot path — Redis connections are lightweight. Cold starts only matter for infrequent operations. Provisioned Concurrency on the geofence alert Lambda ensures alert delivery latency stays under 5 seconds.
Failure handling setup:
• Set
bisectBatchOnFunctionError: true — on failure, Lambda splits the batch in half and retries each half. This isolates the one bad record rather than blocking the entire shard.• Configure a dead-letter queue (SQS) for events that fail all retries — they land in DLQ for manual inspection rather than being silently dropped.
• Set
maximumRetryAttempts and maximumRecordAgeInSeconds so genuinely unprocessable old events don't block the shard forever.The idempotency layer (ElastiCache event_id set) ensures that if a successfully-processed event is retried, it's a no-op — the "already seen" check prevents double-counting speed violations or duplicate geofence alerts.
AWS cost optimisation for a high-volume IoT pipeline
The biggest cost levers at Cartrack's scale
Kinesis cost levers
Enhanced Fan-Out: Costs $0.015/shard-hour vs $0.014 for standard. Only use it for consumers that need low latency (<200ms) — Firehose can use standard polling.
Right-size shards: Each shard costs ~$21/month. At 500 shards that's $10,500/month just for Kinesis. Consider batching device pings client-side (5 pings per message instead of 1) to reduce shard count 5x at the cost of slightly higher latency.
S3 storage cost levers
Parquet + Snappy compression: Raw JSON GPS events → Parquet reduces size by 4–8x. 13 TB/day becomes ~2–3 TB/day stored.
S3 Intelligent-Tiering: Data not accessed for 30+ days automatically moves to cheaper tiers. GPS data older than 90 days is rarely queried — auto-tier saves ~60% on storage cost for archives.
Lifecycle policies: Delete raw events older than 3 years, keep aggregated data indefinitely.
Lambda cost levers
ARM (Graviton2) architecture: 20% cheaper than x86 for Lambda, often faster too. Easy switch — no code changes needed.
Optimise memory allocation: Lambda pricing is memory × duration. Sometimes doubling memory halves duration — net cost-neutral but faster. Use AWS Lambda Power Tuning tool to find the optimal memory setting for each function.
Database cost levers
Aurora Serverless v2: For workloads with variable demand (fleet managers in business hours only), Aurora Serverless scales down to minimum ACUs during off-hours. Can save 40–60% vs provisioned for non-24/7 workloads.
Read replicas in same AZ: Cross-AZ data transfer costs money. Place read replicas used by dashboards in the same AZ as the ECS services that query them.
Infrastructure as Code — provisioning this stack with AWS CDK or Terraform
How a tech lead would define, version, and deploy the entire architecture
AWS CDK (TypeScript)
Best for: AWS-only stacks. Uses real programming language (TypeScript/Python) — loops, conditions, abstractions. Generates CloudFormation underneath.
Pros: First-class AWS support, type safety catches mistakes before deploy, L2/L3 constructs abstract boilerplate (e.g. new KinesisStream(this, 'GPS', { shardCount: 500 })).
Cons: AWS-only, learning curve if new to CDK constructs.
Terraform (HCL)
Best for: Multi-cloud or existing Terraform expertise. Declarative HCL syntax, massive ecosystem of providers.
Pros: Cloud-agnostic, mature state management (terraform.tfstate), excellent for teams already using it. terraform plan shows exactly what will change before applying.
Cons: HCL is less expressive than a real language for complex logic. State file must be kept in S3 + DynamoDB lock for teams.
// AWS CDK TypeScript — GPS pipeline stack (simplified) export class GPSPipelineStack extends Stack { constructor(scope: Construct, id: string) { super(scope, id); // Kinesis stream — partition key will be vehicle_id at publish time const gpsStream = new kinesis.Stream(this, 'GPSStream', { shardCount: 500, retentionPeriod: Duration.days(7), encryption: kinesis.StreamEncryption.MANAGED, }); // Lambda processor — auto-scales with Kinesis shard count const processor = new lambda.Function(this, 'GPSProcessor', { runtime: lambda.Runtime.NODEJS_20_X, architecture: lambda.Architecture.ARM_64, // 20% cheaper memorySize: 512, timeout: Duration.seconds(60), environment: { REDIS_HOST: redisCluster.attrPrimaryEndPointAddress, DB_PROXY_ENDPOINT: rdsProxy.endpoint, }, }); // Wire Kinesis → Lambda with bisect-on-error and DLQ processor.addEventSource(new KinesisEventSource(gpsStream, { batchSize: 100, startingPosition: lambda.StartingPosition.TRIM_HORIZON, bisectBatchOnError: true, onFailure: new SqsDlq(dlqQueue), maxRetryAttempts: 3, })); // Firehose to S3 with Parquet conversion via Glue schema new firehose.DeliveryStream(this, 'GPSArchive', { sourceStream: gpsStream, destinations: [new destinations.S3Bucket(archiveBucket, { dataOutputPrefix: 'gps/year=!{timestamp:yyyy}/month=!{timestamp:MM}/', bufferingInterval: Duration.seconds(60), })], }); } }
How would you design geofencing alerts at low latency for millions of vehicles?
• Spatial indexing: Store geofences in PostgreSQL with the PostGIS extension. Use GiST indexes on the geometry column. A point-in-polygon check with proper indexing is fast even for millions of geofences.
• In-memory cache: Load each customer's active geofences into Redis keyed by geohash/grid cell. On incoming GPS event, look up only the geofences relevant to that grid cell.
• State tracking: Per vehicle, track "was inside geofence X at last ping?" — store this state in Redis to detect transitions (enter/exit) rather than re-evaluating absolute position on every event.
• Event-driven alerts: Transition events publish to an alert queue (SNS on AWS), which then triggers notifications (push, SMS, email) asynchronously.
GPS data can arrive out of order or as duplicates. How do you handle this?
ON CONFLICT DO NOTHING or upsert on the event ID. A short-lived Redis set of recently seen event IDs (TTL 60s) can block duplicates before they even hit the database.How would you build a driver behaviour scoring system?
driving_events table. Nightly AWS Glue job computes the 30-day score and writes to a driver_scores summary table. Dashboards query the summary table — never aggregate raw events on every page load.Situational Questions
Use STAR: Situation, Task, Action, Result. Have real examples ready — don't invent scenarios.
A critical bug causes incorrect GPS locations for 10,000 vehicles. You're the tech lead on call.
Product wants to ship a feature in 2 weeks that the team estimates at 6 weeks. How do you handle it?
Then I'd come back with options rather than a single answer: (a) Full feature in 6 weeks at current scope. (b) An MVP in 2 weeks that delivers the core value — here's specifically what we'd cut. (c) Full feature in 4 weeks if we add a developer or drop another priority. (d) Feature in 2 weeks but with known technical debt we address in the next sprint.
I'd make the tradeoffs explicit, let product make the call, and document whatever decision is made so there's shared accountability."
You inherit a legacy codebase with no documentation and high technical debt. What do you do?
What does the current tech stack look like, and where is the biggest technical debt?
What does success look like for this role in the first 90 days?
How do you balance shipping new features vs maintaining platform reliability?
PHP / C# & React / TypeScript
Frame your experience as stack-agnostic. Depth in one backend language and a modern frontend framework is enough — show breadth across 15 years to carry the language-agnostic requirement.
How would you structure a REST API for a fleet tracking service?
# Resource-oriented URLs GET /v1/fleets/{fleet_id}/vehicles GET /v1/vehicles/{vehicle_id}/trips?from=2024-01-01&to=2024-01-31 GET /v1/vehicles/{vehicle_id}/position # latest GPS fix POST /v1/alerts PATCH /v1/drivers/{driver_id}
/v1/). Cursor-based pagination for GPS event endpoints — offset pagination breaks on large tables. Rate limiting per customer, not per IP.(C#) Explain async/await and where you'd apply it in a high-throughput telematics API.
// Blocking (bad — thread parked while DB responds) public Vehicle GetVehicle(int id) { return _db.Vehicles.Find(id); } // Non-blocking (good — thread returns to pool during DB wait) public async Task<Vehicle> GetVehicleAsync(int id) { return await _db.Vehicles.FindAsync(id); }
async/await allows the thread to be returned to the thread pool while waiting for I/O. For a telematics API where an endpoint fetches 30 days of GPS events from PostgreSQL, making it async means a single server can handle thousands of concurrent fleet dashboards refreshing in parallel without running out of threads.A live fleet map re-renders too frequently as vehicle positions update. How do you optimise this?
// Bad — new array reference every render triggers re-render const markers = vehicles.map(v => ({ lat: v.lat, lng: v.lng })); // Good — memoised, only recalculates when vehicles changes const markers = useMemo( () => vehicles.map(v => ({ lat: v.lat, lng: v.lng })), [vehicles] );
React.memo(). Throttle WebSocket position updates — GPS devices ping every few seconds but the map doesn't need to re-render at the full telemetry rate. Consider mutating marker positions directly on the map instance (Mapbox GL JS / Leaflet) rather than updating React state.Docker & Kubernetes
How containers work, how to build images properly, and how Kubernetes orchestrates them at scale.
Images, containers & layers — why layer order matters
# WRONG: cache busted every time source code changes COPY . . RUN npm ci # reinstalls on every code change! # CORRECT: deps layer cached unless package.json changes COPY package*.json ./ RUN npm ci --production # cached if deps unchanged COPY . . # source invalidates cache only here
--entrypoint. CMD provides default arguments that can be replaced at runtime.Common pattern:
ENTRYPOINT ["node"] and CMD ["server.js"]. Running normally starts node server.js. Running with docker run myimage debug.js starts node debug.js — CMD was overridden. If you only use CMD, someone can override the entire command including the executable.FROM php:8.3-fpm AS base
RUN docker-php-ext-install pdo pdo_pgsql
FROM base AS deps
COPY composer.json composer.lock ./
RUN composer install --no-dev --optimize-autoloader
FROM base AS final
COPY --from=deps /app/vendor ./vendor
COPY . .
EXPOSE 9000
---
# docker-compose.yml
services:
api:
build: .
environment:
DB_HOST: postgres
DB_PASS: ${DB_PASS} # from .env — never hardcode
depends_on:
postgres:
condition: service_healthy
postgres:
image: postgres:16
healthcheck:
test: ["CMD-SHELL", "pg_isready -U app"]
interval: 5s
Key points: Multi-stage builds reduce image size. Use depends_on with healthcheck — not just depends_on (which doesn't wait for readiness). Never embed secrets in the image.Cluster architecture — control plane, worker nodes, and rolling deployments
Control plane
API Server: Front door. All kubectl commands go here.
etcd: Stores all cluster state.
Scheduler: Picks which node a new pod runs on.
Controller Manager: Watches for divergence and fixes it.
Worker nodes
kubelet: Node agent. Manages pods on the node.
kube-proxy: Manages network rules for Services.
Container runtime: Actually runs containers (containerd).
A StatefulSet is for stateful applications. Pods get predictable names (app-0, app-1, app-2), a stable network DNS hostname, and their own PersistentVolumeClaim that survives pod restarts. Critical for databases where a pod needs to know "I am replica number 1."
I'd use a Deployment for web API servers and workers. I'd use a StatefulSet for PostgreSQL, Redis, Kafka — anything that maintains its own state that must persist.
Readiness probe: "Is this container ready to accept traffic?" If it fails, Kubernetes removes the pod from the Service's endpoints — traffic stops going to it, but the container is NOT restarted. Use this for startup warmup or temporarily overloaded pods.
Common mistake: Setting liveness probe thresholds too aggressively — Kubernetes will keep restarting a pod that just needs time to warm up, causing a crash loop. Use readiness for traffic control, liveness as a last resort for truly stuck processes.
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 0 # never take a pod down before a new one is ready
maxSurge: 1 # allow one extra pod above desired count during update
To minimise downtime: Configure readinessProbe properly — K8s only routes traffic to a pod once the probe passes. Handle SIGTERM gracefully — finish in-flight requests before exiting. Set terminationGracePeriodSeconds to match your longest expected request time. Use PodDisruptionBudgets to guarantee a minimum number of healthy pods.CI/CD & GitLab Pipelines
How automated pipelines work, how GitLab CI is configured, and how to design pipelines that are fast, secure, and reliable.
CI vs CD (Delivery) vs CD (Deployment)
Three stages, three different levels of automation
Continuous Integration
Developers merge code to main frequently. Each merge triggers automated build + test run. Goal: detect integration problems early.
Continuous Delivery
Every passing change can be released to production at any time — but requires a human to press the deploy button. Software is always releasable.
Continuous Deployment
Every passing change is automatically deployed to production without human intervention. Requires very high confidence in tests and monitoring.
For Cartrack, I'd start with Continuous Delivery. Fleet telematics is a mission-critical system — unexpected production changes can directly impact 125,000+ commercial customers tracking vehicles in real time. The manual gate also gives product managers visibility into what's going out. I'd graduate to Continuous Deployment as confidence in the test suite and monitoring grows, potentially starting with lower-risk services first.
Full GitLab CI pipeline for a fleet tracking backend
stages: [lint, test, build, deploy] lint-code: stage: lint image: php:8.3 script: - composer run phpcs # PSR-12 coding standards - composer run phpstan # static analysis cache: paths: [vendor/] key: $CI_COMMIT_REF_SLUG unit-tests: stage: test services: [postgres:16] # real DB for tests variables: {POSTGRES_DB: test_db} script: - composer run phpunit --coverage-text coverage: '/Lines:\s+(\d+\.\d+)%/' build-image: stage: build image: docker:24 services: [docker:dind] script: - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA . - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA deploy-staging: stage: deploy script: [kubectl set image deployment/api api=$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA] environment: staging rules: - if: $CI_COMMIT_BRANCH == "main" deploy-production: stage: deploy when: manual # human approval gate environment: production script: [kubectl set image deployment/api api=$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA] rules: - if: $CI_COMMIT_TAG # only on tagged releases
For production deployments I'd integrate with a secrets manager like HashiCorp Vault or AWS Secrets Manager for rotation and audit trails. Variables scoped to specific environments prevent staging pipelines from accidentally accessing production credentials.
What I'd never do: Put a secret in a CI variable that isn't masked, echo a secret to the job log, store credentials in a Docker image layer (they can be extracted from the image history), or commit a .env file with secrets (even .gitignored — they end up in git history).
Agile & Scrum
How Scrum works in practice — roles, all five ceremonies, and the mechanics of sprint-based development. Often tested as a leadership question.
Product Owner
- Owns and prioritises the product backlog
- Defines acceptance criteria for stories
- Single voice of the customer
- Decides what gets built and in what order
- Approves completed work against the DoD
Scrum Master
- Facilitates all Scrum ceremonies
- Removes impediments and blockers
- Protects team from scope creep
- Coaches the team on Agile practices
- No direct authority over developers
Development team
- Self-organising and cross-functional
- Estimates effort and owns sprint commitments
- Pulls work from the sprint backlog
- Collectively responsible for quality
- Defines and upholds the DoD
Sprint planning
Team selects items from the top of the prioritised backlog and commits to delivering them. PO clarifies requirements. Team estimates in story points, breaks stories into tasks. Output: a clear, committed sprint goal.
Daily standup
Each member answers: what did I complete yesterday, what will I do today, any blockers? Goal is team coordination and early blocker detection — not a status report to management.
Sprint review (demo)
Team demonstrates completed work to stakeholders. Only "done" work — that meets the DoD — is presented. Incomplete work is returned to the backlog. This is about the product.
Sprint retrospective
Team reflects on its own working process: what went well, what didn't, what specific actions to take next sprint. This is about the process — separate from the review. Output: actionable improvements with owners assigned.
Backlog refinement
PO and team review upcoming items: clarify requirements, write acceptance criteria, split large epics, estimate effort. Goal: ensure the top of the backlog is always sprint-ready so planning meetings don't stall.
Simple: review = "what did we build?", retro = "how did we work?" Mixing them up in an interview is a red flag — it suggests you haven't worked in Scrum.
Acceptance criteria are story-specific — they define what "correct" means for that particular feature. For a login story, AC might be: "given a valid email and password, the user is redirected to the dashboard; given an invalid password, an error is shown."
A story must meet both: AC confirms it does the right thing, DoD confirms it was done to the team's quality standard.
When You Don't Know the Answer
Interviewers are testing honesty, thinking process, and whether you'd be safe to work with. Saying "I don't know" confidently and constructively signals high self-awareness. Bluffing signals poor judgment — both are disqualifying.
"Have you used [X technology]?" — when you haven't
"Explain how [deep concept] works" — you know it partially
Complete blank — no starting point at all
You know it but forgot the exact syntax
The interviewer digs deeper than you can go
What not to do
- "I know this, let me think..." — then fill time with unrelated waffle
- "It's basically the same as [completely wrong analogy]"
- Giving a long confident-sounding answer that's actually incorrect
- Saying "that's a great question!" — sounds hollow every time
- Apologising excessively: "I'm so sorry, I should know this..."
- Pretending a vague answer was complete
What to do instead
- Be brief: "I haven't used that yet" — one sentence, then pivot
- Show adjacent knowledge immediately after admitting the gap
- Ask the interviewer to explain — turns a gap into a conversation
- State your reasoning process even if you don't know the conclusion
- Stay calm — a pause to think looks confident, not nervous
- Be genuinely curious, not performatively so