You're right that if the activation function is linear, like the identity function, then the number of layers doesn't matter: a stack of linear layers collapses into a single linear map. With the step function, however, two layers are enough.
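To make the linear case concrete, here is a minimal sketch that checks the collapse numerically. The layer sizes, the random weights, and the names `two_layer_linear` and `one_layer` are made up for illustration; they are not part of the sample problem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two layers with the identity activation (hypothetical sizes: 2 -> 4 -> 1).
W1, b1 = rng.normal(size=(4, 2)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=(1, 1))

def two_layer_linear(x):
    # No non-linearity between the layers.
    return W2 @ (W1 @ x + b1) + b2

# The same map written as one layer: W = W2 W1, b = W2 b1 + b2.
W = W2 @ W1
b = W2 @ b1 + b2

def one_layer(x):
    return W @ x + b

x = rng.normal(size=(2, 5))  # five random input points, one per column
collapsed_ok = np.allclose(two_layer_linear(x), one_layer(x))
print(collapsed_ok)
```

However many linear layers you stack, the same algebra folds them into one weight matrix and one bias, so the extra depth adds no expressive power.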

We can manually derive a network that can classify the sample data using the step function:

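    import numpy as np

    # Input: [x1, x2]; each weight row is [w1, w2, bias]
    # First layer: 4 nodes, one per side of the square
    W1 = np.array([
        [ 1,  0, -0.2],  # x1 >= 0.2
        [-1,  0,  0.8],  # x1 <= 0.8
        [ 0,  1, -0.2],  # x2 >= 0.2
        [ 0, -1,  0.8],  # x2 <= 0.8
    ])
    # Second layer: 1 node that fires only if all 4 checks pass
    W2 = np.array([[1, 1, 1, 1, -4]])

    # The step activation function
    def step(x):
        return x >= 0

    # Forward pass; a constant 1 is appended to each layer's input for the bias
    def f(x1, x2):
        one = np.ones((1, 1))
        v = np.array([x1, x2]).reshape((2, 1))
        v = step(W1 @ np.r_[v, one])
        v = step(W2 @ np.r_[v, one])
        return v[0, 0]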
  
    >>> np.array([
    ...     [f(0, 1  ), f(0.5, 1  ), f(1, 1  )],
    ...     [f(0, 0.5), f(0.5, 0.5), f(1, 0.5)],
    ...     [f(0, 0  ), f(0.5, 0  ), f(1, 0  )]
    ... ])
    array([[False, False, False],
           [False,  True, False],
           [False, False, False]])

The four nodes in the first layer define four lines, one along each side of the square 0.2 < x1 < 0.8, 0.2 < x2 < 0.8, and the step function effectively checks which side of each line the point lies on. The second layer just counts the number of "successful" line checks and yields True only if all four pass. If the square is too rough a shape, we can add more lines to the first layer to approximate any convex shape.
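As a rough sketch of the "add more lines" idea, the same two-layer construction can be built from n tangent lines around a circle, giving a regular polygon. The circle (center (0.5, 0.5), radius 0.3) and the helper names `polygon_network` and `classify` are assumptions chosen for illustration, not something taken from the sample data.

```python
import numpy as np

def step(x):
    return x >= 0

def polygon_network(n_sides, center=(0.5, 0.5), radius=0.3):
    """Build weights for n tangent lines around a circle (hypothetical target shape)."""
    angles = 2 * np.pi * np.arange(n_sides) / n_sides
    normals = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # outward normals, (n, 2)
    # Node k passes when radius - normal_k . (x - center) >= 0.
    biases = radius + normals @ np.asarray(center)
    W1 = np.column_stack([-normals, biases])                      # (n, 3)
    # Output node passes only when all n line checks pass.
    W2 = np.append(np.ones(n_sides), -n_sides).reshape(1, -1)     # (1, n + 1)
    return W1, W2

def classify(W1, W2, x1, x2):
    one = np.ones((1, 1))
    v = np.array([[x1], [x2]], dtype=float)
    v = step(W1 @ np.vstack([v, one]))
    v = step(W2 @ np.vstack([v, one]))
    return bool(v[0, 0])

W1_oct, W2_oct = polygon_network(8)
print(classify(W1_oct, W2_oct, 0.5, 0.5))   # center of the circle
print(classify(W1_oct, W2_oct, 0.5, 0.9))   # well outside the octagon
```

Increasing `n_sides` tightens the polygon around the circle; only the width of the first layer changes, and the second layer is still a single "all checks passed" node.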

If the region is concave, we can split it into convex parts and add nodes to the second layer, one for each convex part. A third layer could then check whether any of the convex-part neurons activates. In theory two layers with a non-linear activation function are enough to approximate this function, but the resulting structure would be harder to interpret.
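Here is a minimal sketch of that three-layer idea for an L-shaped region, split into a horizontal bar and a vertical bar. The L-shape, its coordinates, and the names `forward` and `in_L` are invented for this illustration.

```python
import numpy as np

def step(x):
    return x >= 0

# Layer 1: six line checks, rows are [w1, w2, bias].
W1 = np.array([
    [ 1,  0, -0.2],  # 0: x1 >= 0.2
    [-1,  0,  0.8],  # 1: x1 <= 0.8
    [ 0,  1, -0.2],  # 2: x2 >= 0.2
    [ 0, -1,  0.4],  # 3: x2 <= 0.4
    [-1,  0,  0.4],  # 4: x1 <= 0.4
    [ 0, -1,  0.8],  # 5: x2 <= 0.8
])
# Layer 2: one AND node per convex part (each needs its 4 checks to pass).
W2 = np.array([
    [1, 1, 1, 1, 0, 0, -4],  # horizontal bar: checks 0, 1, 2, 3
    [1, 0, 1, 0, 1, 1, -4],  # vertical bar:   checks 0, 2, 4, 5
])
# Layer 3: OR node, fires if at least one convex part fired.
W3 = np.array([[1, 1, -1]])

def forward(x1, x2, layers):
    one = np.ones((1, 1))
    v = np.array([[x1], [x2]], dtype=float)
    for W in layers:
        v = step(W @ np.vstack([v, one]))  # append 1 for the bias
    return bool(v[0, 0])

def in_L(x1, x2):
    return forward(x1, x2, [W1, W2, W3])

print(in_L(0.6, 0.3), in_L(0.3, 0.6), in_L(0.6, 0.6))
```

The point (0.6, 0.6) sits in the notch of the L, so neither AND node fires and the OR node stays off, while points in either bar activate the output.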

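But how do you find the right parameters without backpropagation? The reason we don't use the step function in practice is that its derivative is zero everywhere (and undefined at the jump), so gradient-based training gets no signal for how to adjust the weights.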
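To see the zero-gradient problem numerically, the sketch below compares a finite-difference gradient through the step function with one through a sigmoid. The single weight `w`, the input `x`, and the helper `numerical_grad` are arbitrary choices for illustration.

```python
import numpy as np

def step(z):
    return np.where(z >= 0, 1.0, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def numerical_grad(activation, w, x, eps=1e-6):
    # Central difference of activation(w * x) with respect to w.
    return (activation((w + eps) * x) - activation((w - eps) * x)) / (2 * eps)

w, x = 0.7, 1.5  # arbitrary values, away from the jump at w * x = 0
grad_step = float(numerical_grad(step, w, x))
grad_sigmoid = float(numerical_grad(sigmoid, w, x))

print(grad_step)     # no change in output for a small change in w
print(grad_sigmoid)  # a usable, non-zero slope
```

With the step function a small nudge to `w` leaves the output unchanged, so the gradient tells the optimizer nothing; a smooth activation such as the sigmoid gives a non-zero slope to follow.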