You almost certainly don't need to write actual assembly. If you want to use SIMD or other architecture-specific features, there are intrinsics available: functions that generally map directly to specific hardware instructions but let you keep using C++ syntax (and keep the other benefits of compiler code generation, such as register allocation and instruction scheduling). For example, Intel publishes a reference for its x86 intrinsics (MMX, SSE 1-4, and a few other extensions).
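To give a feel for what intrinsics look like, here is a minimal sketch that adds two float arrays four elements at a time using SSE. The function name `add_floats` and the `HAVE_SSE` macro are made up for illustration; the intrinsics themselves (`_mm_loadu_ps`, `_mm_add_ps`, `_mm_storeu_ps`) come from `<xmmintrin.h>`. On a target without SSE the sketch simply falls back to the scalar loop.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: detect SSE via common compiler macros (GCC/Clang define __SSE__,
// MSVC defines _M_X64 or _M_IX86_FP). Anything else takes the scalar path.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>  // __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

// Hypothetical helper: out[i] = a[i] + b[i] for every i in [0, n).
void add_floats(const float* a, const float* b, float* out, std::size_t n) {
    assert(a && b && out);
    std::size_t i = 0;
#if HAVE_SSE
    // Process four floats per iteration; the "u" variants allow unaligned pointers.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    // Scalar loop handles the leftover elements (or everything without SSE).
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Note that there is no register naming or hand scheduling here; the compiler still handles that. For a loop this simple, an optimizing compiler may well auto-vectorize the plain scalar version on its own, so it is worth checking before reaching for intrinsics.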
Moreover, optimizing algorithms and data layout (cache locality) is almost certainly going to be more effective than optimizing at the instruction level. If you want to improve your engine's performance, those are the places to start. Be sure to profile your engine to actually measure what's taking the most time, so you can make intelligent optimization decisions based on real numbers rather than assumptions!
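As a rough illustration of the data-layout point, here is a sketch comparing an "array of structs" layout with a "struct of arrays" layout for a loop that only reads one field. All the names (`ParticleAoS`, `ParticlesSoA`, `sum_x_aos`, `sum_x_soa`, `time_ms`) are hypothetical, and the `std::chrono` timer is only a crude stand-in for a real profiler.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Array-of-structs: each particle's fields sit together in memory, so a loop
// that only reads x still pulls the other fields through the cache.
struct ParticleAoS {
    float x, y, z;
    float vx, vy, vz;
    float mass;
    int id;
};

// Struct-of-arrays: each field is stored contiguously on its own.
struct ParticlesSoA {
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;
    std::vector<float> mass;
    std::vector<int> id;
};

float sum_x_aos(const std::vector<ParticleAoS>& particles) {
    float sum = 0.0f;
    for (const ParticleAoS& p : particles) sum += p.x;
    return sum;
}

float sum_x_soa(const ParticlesSoA& particles) {
    float sum = 0.0f;
    for (float x : particles.x) sum += x;
    return sum;
}

// Crude wall-clock timer: runs f once and returns elapsed milliseconds.
template <typename F>
double time_ms(F&& f) {
    auto start = std::chrono::steady_clock::now();
    f();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

You could wrap each sum in `time_ms` with a large particle count and compare, but whether the SoA version actually wins depends on your access patterns and hardware. That is exactly why measuring on your own engine beats guessing.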