Here are some techniques that have served me well in writing efficient programs. I learned to program on small, slow machines; the standard Computer Science machine in grad school, on which much of the famous original research was done in the 1970s, was a PDP-10 with about 1 MB of memory, running at under 1 MIPS. Paradoxically, efficiency is still important when running large problems on current machines, because external storage is perhaps {$10^5$} times slower than main memory. Also, with larger {$N$}, you are farther out on the asymptotic curve: when {$N=10^9$}, as in some of my implementations, {$\lg N$} is a large factor, namely about 30.

My implementations include:

1. Finding Connected Components on a {$1000\times1000\times1000$} array of bits in {$E^3$}.
2. Preprocessing almost {$2\cdot10^8$} points in {$E^3$} to support efficient nearest neighbor queries.
3. Computing the volume and surface area of the Union of {$2\cdot10^7$} cubes.

Here are the tips.

1. Do not dynamically allocate millions of small objects, such as points, one at a time; in my experience that takes superlinear time. Either allocate one array of them or, better:
2. Allocate storage at compile time. Advantages:
    1. Fewer pointers are needed, e.g., for multi-dimensional arrays.
    2. The storage overhead for heap objects is avoided.
    3. The previous two items allow more data to be stored in core.
    4. Addresses are known at compile time, enabling possible optimizations.
3. If the size of, say, a grid, is variable, then allocate a maximum size at compile time, and recompile the program if that is too small.
4. Then you can use Boost arrays, which are especially good for multi-dimensional arrays.
There are other good things in Boost; study it.
5. Also study the several books on C++ programming techniques.
6. Rather than using a pointer to a small object, consider storing the object locally in place of the pointer.
E.g., consider a balanced binary search tree of 4-byte objects. With {$N$} objects, there are {$N-1$} internal nodes and {$2N-2$} internal pointers, at 4 bytes each. Note that, as {$N\rightarrow\infty$}, the pointers take twice as much space as the objects.
Alternatively, you could pack the objects sequentially, without any pointers, saving 2/3 of the total space. If the tree is not balanced, you also have to store its topology, which takes 2 bits per internal node. In either case, however, insertion and deletion now take time {$N$} instead of {$\log N$}.
You can replace the bottom layer of pointers by storing the objects inside the bottom layer of internal nodes, in place of the pointers to the objects. That eliminates half of the pointers and 1/3 of the total space. Access time, including insertion and deletion, is still {$\log N$}. You need 1 bit per field to distinguish a pointer from an object.
A similar process is possible for the bottom {$k$} levels of the tree, to trade off time and space. The result is a pointer-based tree of smaller compact trees.
7. With millions of objects, minimize the size of each object, e.g., short int instead of int, and float instead of double, when the range and precision allow.
8. Instead of using a pointer to one of many objects of the same type, if they are all stored in one global array, use the index into that array. This may not be faster, but it is easier to debug, since an index is a small number and therefore easier to understand than a pointer.
9. Here's how I optimize (in time and space) an array of sets of objects, such as a grid with a different number of points in each cell.
    1. Read the points in, and keep a count of the number in each cell. Do not store the points yet.
    2. Allocate a ragged array just big enough.
    3. Read the points a second time and populate the ragged array.

    The obvious alternative would be a linked list for the points in each cell. In contrast:
    1. My method uses no pointers, and so is smaller. For 4-byte objects, the space reduction is a factor of 2.
    2. The points in each cell can be randomly accessed in constant time, since you don't have to trace along a linked list.
    3. However, objects cannot be added later; the data structure is constant after creation.