Backend Performance Tuning

  1. Put a Squid proxy in front of your mod_perl / Tomcat servers if possible.
  2. Consider using a CDN (e.g., Akamai) if possible.
  3. Use persistent technologies such as mod_perl, Tomcat, shared memory, and opcode caching/optimization.
  4. Reduce load on the backend database by using memcached.
  5. Optimize SQL queries and database machines.
  6. Use indexes.
  7. Use a faster SAN (e.g., Hitachi).
  8. Reduce system calls.
  9. Consider using threads or event-driven programming.
  10. Consider using shared memory so that mod_perl/Apache does not have to handle sending data.
  11. Consider using a 64-bit operating system, adding more memory, and using hardware RAID.
  12. Reduce the amount of data transferred over the network interface (compress it on the fly, or store it on the file system already compressed).
  13. Reduce the amount of data written to disk where possible.
  14. Traditional 3-tier architecture with hardware load balancer in front of databases
  15. Clusters based on types/features: ad, app, photo, monitoring, gallery, profile, userinfo, IM status cache, message, testimonial, friends, graph servers, gallery search, object cache.
  16. No persistent database connections.
  17. Don't go after the biggest problem first. Tackle the low-hanging fruit first, or tackle both in parallel if your team has sufficient manpower.
  18. Optimize without downtime.
  19. Move sorting query types into the application and limit the amount of data returned to the browser.
  20. Reduce ranges.
  21. Range on primary keys.
  22. Benchmark, make a change, benchmark, make a change (cycle of improvement).
  23. Always have a plan to rollback.
  24. Incremental rollout (rollout to a small set of users before rolling it out to all users).
  25. Horizontal scalability. Rather than buying big, centralized servers, buy a lot of thin, cheap servers. If one fails, fail over to another server. Spin up a new server and join it to the cluster. Use the AWS auto-scaling approach.
  26. Work with a team. Assess the current problems. Define the issues. Determine available hardware.
  27. Have a road map.
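The memcached idea in tip 4 is usually implemented as the cache-aside pattern: check the cache, fall back to the database on a miss, then populate the cache. A minimal sketch follows; a plain dict stands in for the memcached client (a real setup would use a client library such as pymemcache with equivalent get/set calls), and `db_lookup` is a hypothetical stand-in for the actual database query.

```python
import time

class CacheAside:
    """Cache-aside: check the cache first, fall back to the database
    on a miss, then populate the cache for later reads. A plain dict
    stands in for a memcached client here."""

    def __init__(self, db_lookup, ttl_seconds=60):
        self._cache = {}            # key -> (value, expires_at)
        self._db_lookup = db_lookup
        self._ttl = ttl_seconds
        self.db_hits = 0            # how often we had to touch the database

    def get(self, key):
        entry = self._cache.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry[0]                       # cache hit: no DB load
        value = self._db_lookup(key)              # cache miss: query the DB
        self.db_hits += 1
        self._cache[key] = (value, time.monotonic() + self._ttl)
        return value

# Usage: the expensive lookup runs once per key within the TTL.
users = CacheAside(db_lookup=lambda uid: {"id": uid, "name": f"user-{uid}"})
users.get(42)
users.get(42)
print(users.db_hits)  # -> 1
```

A real deployment also needs cache invalidation on writes, which is where most of the complexity lives.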

The database server should never be on the same box as the web server, even when the box does not get a lot of hits. This makes it easier to isolate resource contention problems and to do performance tuning.

Tuning Apache
Tuning PHP
Tuning MySQL

Things to keep in mind

Operating systems impose certain limits on your programs.

  1. On Linux, depending on the filesystem, a directory can only hold on the order of 32,000 entries. That is a lot of files, but in practice, if you have more than about 1,000 files per directory, you should expect performance to degrade. If you have a lot of files to store on the same filesystem, split them up and store them in nested directories whose names are 2 or 3 characters long.
  2. The same limitation also caps how many tables your MySQL database can have. For each MyISAM table, MySQL uses three files (.frm, .MYD, .MYI), so a single database can have at most about 32,000 / 3 ≈ 10,000 tables.
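The nested-directory scheme from point 1 can be sketched as follows; `shard_path` and its parameters are illustrative names, and MD5 is just one convenient way to spread names evenly across the shards.

```python
import hashlib
import os

def shard_path(root, filename, depth=2, width=2):
    """Map a filename to a nested directory path so that no single
    directory accumulates thousands of entries. The first `depth`
    groups of `width` hex characters from an MD5 of the name become
    the intermediate directory names, e.g. root/ab/cd/photo.jpg."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(depth)]
    return os.path.join(root, *parts, filename)

# The same name always maps to the same shard, so lookups need no index.
print(shard_path("/var/photos", "photo.jpg"))
```

With two levels of two hex characters, files spread across 256 × 256 = 65,536 directories, keeping each one comfortably under the practical per-directory limit.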

pmap - reports the memory map for a process. Check out this blog for some explanation.


How to cluster mail servers?
Solaris 2.x - Tuning your TCP/IP Stack and More
TCP-Friendly Unicast Rate-Based Flow Control
A HOWTO on Optimizing PHP
Linux Virtual Server

Google Architecture 2008
GoogleTalk Architecture 2007
Flickr 2007
YouTube Architecture 2008

Friendster Architecture 2005
MySpace Architecture 2009
Amazon Architecture
Feedburner Architecture
Boost application performance using asynchronous I/O
The Linux Page Cache and pdflush:Theory of Operation and Tuning for Write-Heavy Loads
Linux Disk Performance/File IO per process
Disk I/O Analysis - Determining I/O Bottlenecks
Troubleshooting Memory Usage
I/O statistics per process
How to Check Memory Usage in Linux based Server
Is there a way to find out what application using most of bandwidth in Linux?
Linux find the memory used by a program / process using pmap command
Choosing the right Linux File System Layout using a Top-Bottom Process
Introduction to nginx.conf scripting
MessagePack: A cross-language efficient binary-based serialization library—“It’s like JSON, but very fast and small”. Claims to outperform protocol buffers for at least some benchmarks.
ZeroMQ: Modern & Fast Networking Stack

Friendster: 1 billion queries per day.
MySpace - 2009: Windows ASP.NET, IIS, SQL Server. 300 million users. 100 gigabits/second out to the Internet, 10 Gb/sec of which is HTML content. 4500+ web servers, 1200+ cache servers running 64-bit Windows 2003 with 16GB of object cache in RAM. 500+ database servers running 64-bit Windows and SQL Server 2005.

  1. Keep it all in memory: I/O will kill your latency, so make sure all of your data is in memory. This generally means managing your own in-memory data structures and maintaining a persistent log, so you can rebuild the state after a machine or process restart. Some options for a persistent log include Bitcask, Krati, LevelDB, and BDB-JE. Alternatively, you might be able to get away with running a local, persisted in-memory database like Redis or MongoDB (with memory ≫ data). Note that you can lose some data on a crash due to their background syncing to disk.
  2. Keep data processing colocated: Network hops are faster than disk seeks but even still they will add a lot of overhead. Ideally, your data should fit entirely in memory on one host. With AWS providing almost 1/4 TB of RAM in the cloud and physical servers offering multiple TBs this is generally possible. If you need to run on more than one host you should ensure that your data and requests are properly partitioned so that all the data necessary to service a given request is available locally.
  3. Keep context switches to a minimum: Context switches are a sign that you are doing more compute work than you have resources for. You will want to limit your number of threads to the number of cores on your system and to pin each thread to its own core.
  4. Keep your reads sequential: All forms of storage, whether rotational, flash-based, or RAM, perform significantly better when used sequentially. When issuing sequential reads to memory you trigger the use of prefetching at the RAM level as well as at the CPU cache level. If done properly, the next piece of data you need will always be in L1 cache right before you need it. The easiest way to help this process along is to make heavy use of arrays of primitive data types or structs. Following pointers, either through linked lists or through arrays of objects, should be avoided at all costs.
  5. Batch your writes: This sounds counterintuitive but you can gain significant improvements in performance by batching writes. However, there is a misconception that this means the system should wait an arbitrary amount of time before doing a write. Instead, one thread should spin in a tight loop doing I/O. Each write will batch all the data that arrived since the last write was issued. This makes for a very fast and adaptive system.
  6. Use your cache wisely: With all of these optimizations in place, memory access quickly becomes a bottleneck. Pinning threads to their own cores helps reduce CPU cache pollution and sequential I/O also helps preload the cache. Beyond that, you should keep memory sizes down using primitive data types so more data fits in cache. Additionally, you can look into cache-oblivious algorithms which work by recursively breaking down the data until it fits in cache and then doing any necessary processing.
  7. Non-blocking as much as possible: Make friends with non-blocking and wait-free data structures and algorithms. Every time you take a contended lock you have to drop down to the OS to mediate it, which is a huge overhead. Often, if you know what you are doing, you can get around locks by understanding the memory model of the JVM, C++11, or Go.
  8. Async as much as possible: Any processing and particularly any I/O that is not absolutely necessary for building the response should be done outside the critical path.
  9. Parallelize as much as possible: Any processing and particularly any I/O that can happen in parallel should be done in parallel. For instance if your high availability strategy includes logging transactions to disk and sending transactions to a secondary server those actions can happen in parallel.
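Point 1 ("keep it all in memory" plus a persistent log) can be sketched like this. The class and file names are illustrative; real systems would also fsync the log and compact it periodically, which this sketch omits.

```python
import json
import os
import tempfile

class MemStore:
    """All reads are served from an in-memory dict; every write is
    appended to a log on disk so the state can be rebuilt after a
    machine or process restart by replaying the log."""

    def __init__(self, log_path):
        self._log_path = log_path
        self._data = {}
        if os.path.exists(log_path):          # replay the log on startup
            with open(log_path) as f:
                for line in f:
                    record = json.loads(line)
                    self._data[record["k"]] = record["v"]
        self._log = open(log_path, "a")

    def set(self, key, value):
        self._log.write(json.dumps({"k": key, "v": value}) + "\n")
        self._log.flush()                     # no fsync here: a crash can lose the tail
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)            # pure memory read, no I/O

path = os.path.join(tempfile.mkdtemp(), "store.log")
s = MemStore(path)
s.set("user:1", "alice")
restarted = MemStore(path)                    # simulate a process restart
print(restarted.get("user:1"))  # -> alice
```

Reads never touch the disk; only writes pay for I/O, and recovery is a linear replay of the log.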
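The adaptive write batching in point 5 can be sketched as one writer thread in a loop: each pass writes everything that queued up while the previous write was in flight, so batch size grows under load and shrinks when traffic is light. The `sink` callable is a hypothetical stand-in for a file or socket write.

```python
import queue
import threading

class BatchWriter:
    """A single writer thread loops on I/O; each write drains all the
    items that arrived since the last write was issued, so batch size
    adapts to load without any artificial delay."""

    def __init__(self, sink):
        self._q = queue.Queue()
        self._sink = sink
        self._closed = object()               # sentinel marking shutdown
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, item):
        self._q.put(item)

    def close(self):
        self._q.put(self._closed)
        self._thread.join()

    def _run(self):
        while True:
            batch = [self._q.get()]           # block until at least one item
            while True:                       # then drain whatever else arrived
                try:
                    batch.append(self._q.get_nowait())
                except queue.Empty:
                    break
            if batch[-1] is self._closed:     # sentinel is always last (FIFO)
                batch.pop()
                if batch:
                    self._sink(batch)
                return
            self._sink(batch)

# Usage: all 100 items reach the sink, grouped into adaptive batches.
batches = []
w = BatchWriter(batches.append)
for i in range(100):
    w.submit(i)
w.close()
print(sum(len(b) for b in batches))  # -> 100
```

No item waits for a timer: the only latency a write sees is the I/O call ahead of it, which is why this pattern is both fast and adaptive.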


Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License