Here's an in-depth analysis of the NVIDIA G80 architecture, which still represents their newer chips quite well.
Basically, it has shader clusters that operate on vectors 16 elements wide. Each cluster has 16 scalar multiply-add units, so those operations take just one clock cycle. Note that something like a dp4 instruction takes 4 cycles, computed for 16 different pixels or vertices at once. Each cluster also has 4 special function units (SFUs). These serve two roles: interpolating input values (colors and texture coordinates), and computing transcendental operations (anything that isn't a mul or add). Since there are only 4 SFUs, it takes 4 clock cycles to process a 16-element vector.
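To make that arithmetic concrete, here's a toy sketch (using the per-cluster unit counts quoted above; this is a simplification of the real hardware scheduler) of the issue cost of one instruction for a 16-wide batch:

```python
import math

LANES = 16      # width of the vector a cluster processes
MAD_UNITS = 16  # scalar multiply-add units per cluster
SFU_UNITS = 4   # special function units per cluster

def issue_cycles(scalar_ops_per_lane, units):
    """Cycles to push one instruction through `units` for a 16-wide batch."""
    return math.ceil(LANES * scalar_ops_per_lane / units)

print(issue_cycles(1, MAD_UNITS))  # scalar mad: 1 cycle
print(issue_cycles(4, MAD_UNITS))  # dp4 = 4 mads per lane: 4 cycles
print(issue_cycles(1, SFU_UNITS))  # transcendental (e.g. rsq): 4 cycles
```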
The actual cost really depends on your shader's ratio of multiply-add instructions to other instructions. For instance, an m4x4 instruction is four dp4s, which keeps the MAD units busy for 16 cycles; that gives you room to do 4 special operations in parallel, for free! The same principle applies to texture lookups:
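The "free" part follows from the MAD units and SFUs issuing independently, so the cost is the maximum of the two queues rather than the sum. A toy overlap calculation, under the same assumed unit counts as above:

```python
# Hypothetical cost model: MAD units and SFUs work independently,
# so a shader's cost is the longer of the two queues, not their sum.
MAD_CYCLES_DP4 = 4   # one dp4 over a 16-wide batch on 16 MAD units
SFU_CYCLES_OP = 4    # one transcendental over a 16-wide batch on 4 SFUs

def overlapped_cost(num_dp4, num_special):
    mad_time = num_dp4 * MAD_CYCLES_DP4
    sfu_time = num_special * SFU_CYCLES_OP
    return max(mad_time, sfu_time)

# m4x4 = 4 dp4s; up to 4 special ops hide behind it for free:
print(overlapped_cost(4, 0))  # 16 cycles
print(overlapped_cost(4, 4))  # still 16 cycles
print(overlapped_cost(4, 5))  # 20 cycles: now SFU-bound
```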
Each cluster is further equipped with four texture address units and eight texture filter units. These run at a lower clock frequency, but as long as you keep the number of texture lookups low, they can execute in parallel with the arithmetic operations.
As Reedbeta already mentioned, texture performance also depends on cache behaviour. The G80's register file is also relatively cramped, which can influence performance in complex ways. And to top it off, other chips behave somewhat differently too.
But anyway, try not to think in terms of clock cycles; think in terms of the right instruction mixture to avoid bottlenecks. A balanced shader has several texture lookups, several transcendental operations, and many more multiplies and additions. Oh, and don't panic if one shader doesn't have the right mixture; the GPU will try to run several shaders concurrently and maximize the use of each unit independently.
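The "find the bottleneck" advice can be sketched as a small balance check. The unit counts and the assumption that the units issue independently are carried over from the model above; real scheduling (clock ratios, cache misses, latency hiding across shaders) is far messier:

```python
def bottleneck(num_mad, num_special, num_tex,
               mad_units=16, sfu_units=4, tex_units=4, lanes=16):
    """Rough per-unit busy time for a 16-wide batch; the slowest unit wins."""
    busy = {
        "mad": num_mad * lanes / mad_units,
        "sfu": num_special * lanes / sfu_units,
        "tex": num_tex * lanes / tex_units,  # ignores clock ratio and cache
    }
    return max(busy, key=busy.get)

# A mad-heavy shader with a few lookups and transcendentals stays ALU-bound:
print(bottleneck(num_mad=32, num_special=4, num_tex=4))  # 'mad'
# Pile on transcendentals and the SFUs become the limit instead:
print(bottleneck(num_mad=8, num_special=8, num_tex=2))   # 'sfu'
```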