Re: Sydney train commuters to get free transport day after rail network outage causes chaos | Sydney | The Guardian
  Matthew Geier

On 10/3/23 09:32, TP wrote:
> Typically, with a failure of automated tech, there's a human being or
> three in the chain that led to the failure. So, yes, technology is
> great, until humans start interfering in it, from setting up through
> to operation.


It was also designed and built by failure-prone humans.

But Wednesday's problem in Sydney was entirely avoidable. The shutdown
happened only because someone had defined the train radio system as
'vital' (it's an important part of the incident management system).

A 'vital' system had crashed, so they stopped everything without any
real understanding of where that system sat in the safety chain.
Traffic control was working, all the interlockings were working.
Everything we consider a 'traditional' safety system was operating properly.

There is also an expectation that such systems will be 'redundant' and
'fail over' to backups when there are problems. And when the failover
doesn't happen, people sit there dazed and confused, because they were
told it was 'fail safe' and they never practiced the scenario of an
actual system failure.

My personal experience with corporate IT systems and 'fail-over
redundancy' is that the redundancy components introduce additional
complexity and fail more often than the underlying systems they are
supposed to protect.

But true redundant system design and implementation is fiendishly
difficult (and expensive). In particular, you really need to do all the
development twice, with different teams, with each system component able
to swap in for the corresponding component developed by the other team.
(And all of this exhaustively tested at every stage of the development
process.) If you just take the primary system and build a duplicate
using the same hardware and software, more often than not a bug that
takes out the primary will also take out the secondary once it takes
over and reaches the same scenario that killed the primary.

I've seen that happen and cripple my work organization when the central
'redundant data store' went down. The primary controller hit a situation
it couldn't handle and crashed - the secondary took over, hit the SAME
situation and also crashed. The secondary controller was just a copy of
the primary, so it had exactly the same bugs.
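
As a toy illustration of that common-mode failure, here's a quick Python
sketch. The controller class, the 'size' field and the poison record are
all invented for the example, not from the actual system:

    class StoreController:
        """Hypothetical controller for the central data store."""
        def __init__(self, name):
            self.name = name

        def handle(self, record):
            # Shared bug: neither copy handles a record without a 'size' field.
            return record["size"] * 2


    def run_with_failover(primary, secondary, records):
        active = primary
        for rec in records:
            try:
                active.handle(rec)
            except Exception as exc:
                print(f"{active.name} crashed on {rec!r}: {exc!r}")
                if active is primary:
                    active = secondary      # fail over to the 'redundant' copy
                    try:
                        active.handle(rec)  # replay the record that caused the crash
                    except Exception:
                        print(f"{active.name} crashed on the SAME record - store is down")
                        return
                else:
                    return


    run_with_failover(StoreController("primary"),
                      StoreController("secondary"),
                      [{"size": 10}, {"flags": "no size field"}])

The secondary runs exactly the same code path as the primary, so replaying
the record that killed the primary kills it too - avoiding that is exactly
what the duplicated-development expense above buys you.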