I'm not an expert, but my guess is because it makes it really easy to materializ...

j2kun · on April 15, 2014

Efficiency is the expected answer. I'm just wondering if there's a more theoretical reason, such as "every function that can be computed by a non-layered acyclic network can be computed by a complete layered network using only a small number of extra nodes/layers."

jbooth · on April 15, 2014

I think that it can. With some weights of 0 and some weights of 1, you can trivially map 'jumps' that skip from a node in one layer to a node a couple of layers away, by means of intermediate nodes that take no other inputs, right? Is the sigmoid of 1 still 1? Once you have those, it's just a matter of how many layers you need for any acyclic structure, I think.

Although if you wanted to come up with difficult scenarios, it's not hard to think of structures that would make some of those middle layers really tall, or add a lot of middle layers.
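
For anyone who wants to check the sigmoid question numerically, here is a minimal sketch. It assumes the plain logistic sigmoid with zero bias by default; the helper names `sigmoid`, `relay`, and `dense_node` are made up for illustration and don't come from any library. It suggests that a weight-1 relay node does not pass a value through unchanged, and that a weight of 0 cancels an input's contribution even when the edge is present.

```python
import math

def sigmoid(x):
    """Plain logistic function (assumed activation)."""
    return 1.0 / (1.0 + math.exp(-x))

def relay(value, hops, weight=1.0, bias=0.0):
    """Pass `value` through `hops` single-input sigmoid nodes in sequence."""
    for _ in range(hops):
        value = sigmoid(weight * value + bias)
    return value

def dense_node(inputs, weights, bias=0.0):
    """One node of a fully connected layer: sigmoid of the weighted input sum."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

# sigmoid(1) is roughly 0.731, so a relay node changes the value it carries.
print(sigmoid(1.0))
print(relay(1.0, 3))

# With weight 0 on the second input, its value no longer affects the output.
print(dense_node([0.2, 0.9], [1.0, 0.0]) == dense_node([0.2, -5.0], [1.0, 0.0]))
```

So the relayed value gets squashed at every hop. Whether the downstream weights can compensate for that is a separate question that this sketch doesn't try to settle.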

j2kun · on April 16, 2014

As I mentioned in another branch of this thread, selectively choosing edges between nodes isn't an option, because in the standard model adjacent layers are completely connected: every node in one layer has an edge to every node in the next.