Model Train-related Notes Blog -- these are personal notes and musings on the subject of model train control, automation, electronics, or whatever I find interesting. I also have more posts in a blog dedicated to the maintenance of the Randall Museum Model Railroad.
2023-08-12 - Conductor 2: Fixing an Unreliable Mainline Run
Category Rtac
A few months ago, I changed the automation to use the new Conductor 2 scripting engine as the default. It worked very nicely for a while, but in the last few weeks the behavior degraded: the mainline automation was ending up in error recovery almost every day. The passenger train would be running fine, yet the automation would still end up in error.
There’s an interesting side effect here: One of the points of Conductor 2 is to have error recovery, being able to recover when things go wrong. Thus from an outsider’s point of view, the automation was “working” as in “doing something”, although from my point of view it was operating in a degraded running mode.
Eventually I had enough and reverted the automation computer to the Conductor 1 script while I investigated the issue to find the root cause.
The first issue was figuring out what was not working: why did the automated route end up in error recovery mode? I realized I simply did not have enough information in the logs to understand that after the fact.
That led me to enhance my logging: log error cases, log the root cause of an error, and log how long each sensor is triggered. I updated the logging, let it run for a while, and then captured one typical error:
14:56:20.416 D 8312 : Horn
14:56:21.117 D 8312 : -12
14:56:34.925 S S/NS771 B321 : ON
14:56:34.926 B S/NS773 B330 : TRAILING after 22.56 seconds
14:56:34.926 B S/NS771 B321 : OCCUPIED
14:56:35.696 S S/NS773 B330 : OFF
14:58:15.785 S S/NS771 B321 : OFF
14:58:35.012 R Sequence Mainline #3 Passenger (8312) : ERROR Sequence Mainline #3 Passenger (8312) current block {B321 [NS771]} suddenly became non-active after 120 seconds.
14:58:35.012 R Sequence Mainline #3 Passenger (8312) : ERROR
14:58:35.012 R Sequence Mainline #3 Passenger (8312) : ERROR Sequence Mainline #3 Passenger (8312) current block {B321 [NS771]} still occupied after 120 seconds.
In this log, 8312 is the DCC address of the UP train we’re controlling. We start by activating the horn, run in reverse at speed 12, and we have a transition from block B330 to block B321. The actual route is B330 → B321 → B503a (the station). The log indicates we’ve been running on B330 for 22 seconds. Then something odd happens: the script engine complains B321 became “non-active” (aka non-occupied) after 2 minutes, yet it also complains the block is still occupied after 2 minutes. We can’t have both. One of these cannot be right!
Some background is needed here: Conductor 1 was “dumb and simple”. It just automated the running state, with little to no validation (essentially “running blind”). In Conductor 2, I keep track of where an engine is supposed to be and compare that with actual sensors. I also have timing validation in Conductor 2: I define that a train must stay “at least N seconds” on a block, and “at most N seconds” on a block. My current default values are 10 seconds minimum, and 120 seconds maximum. The Randall blocks are fairly uneven in length, but generally we run 20-60 seconds on each so 2 minutes per block seemed like a good default. If a train takes longer to cross a block, there’s something odd going on.
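To make the timing validation concrete, here is a minimal Python sketch of the "at least N seconds / at most N seconds" check. This is not the Conductor 2 code: the `BlockTiming` class and its `check` method are hypothetical names for illustration, and treating a time exactly equal to a limit as acceptable is my assumption. Only the default values (10 and 120 seconds) come from the text above.

```python
from dataclasses import dataclass


@dataclass
class BlockTiming:
    """Hypothetical per-block limits; defaults mirror the values quoted above."""
    min_seconds: float = 10.0
    max_seconds: float = 120.0

    def check(self, seconds_on_block: float) -> str:
        """Classify the time a train spent on one block."""
        if seconds_on_block < self.min_seconds:
            return "too_short"   # left the block suspiciously fast
        if seconds_on_block > self.max_seconds:
            return "too_long"    # something odd is going on
        return "ok"              # assumption: the limits themselves are accepted


timing = BlockTiming()
print(timing.check(22.56))   # the B330 run from the log above
print(timing.check(125.0))   # a run past the 2-minute default
```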
So right there, B321 goes off 1 minute 40 seconds after it turns on. It’s close to the limit yet reasonably well under the 2 minute deadline.
So what’s going on?
Remember how in past posts I complained about how “flaky” the B321 block detector was? I simply could not adjust the NCE BD20 to compensate properly, so I ended up swapping it for an Iowa Scaled BD1, which did the trick. But I also built “flaky sensor” support into the Conductor 2 engine: as long as we’re under our timeout value (2 minutes here), we accept that a sensor goes off and on, and ignore it. We only need one “on” activation to mark the block occupied; then we assume the block is still occupied until the next block becomes active or the timeout expires, whichever comes first. So that’s what is going on here: although the B321 sensor goes off after 1 minute 40 seconds, we assume the train is still on that block because the next one has not activated yet. And if after 2 minutes it has not activated, we end up in error mode. That’s what the last error log, “B321 still occupied after 120 seconds”, means. At the same time, the flakiness handling code accepted that B321 was off and was OK with it for up to 2 minutes; once that time expired, it reported that B321 should not be off since we still think the train is on that block, and that’s what “B321 suddenly became non-active after 120 seconds” means.
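The flaky-sensor rule can be sketched in a few lines of Python. This is only an illustration of the logic described above, not the actual engine: `TrackedBlock`, `on_sensor` and `status` are made-up names, and times are plain seconds counted from an arbitrary start.

```python
class TrackedBlock:
    """Hypothetical sketch: occupancy latched on the first ON activation."""

    def __init__(self, name: str, timeout: float = 120.0):
        self.name = name
        self.timeout = timeout
        self.entered_at = None  # time of the first ON activation, if any

    def on_sensor(self, now: float, active: bool) -> None:
        # Only the first ON matters; later OFF/ON flickers are ignored.
        if active and self.entered_at is None:
            self.entered_at = now

    def status(self, now: float, next_block_active: bool) -> str:
        if self.entered_at is None:
            return "empty"
        if next_block_active:
            return "moved_on"   # the next block took over in time
        if now - self.entered_at >= self.timeout:
            return "error"      # still no next block after the timeout
        return "occupied"       # assumed occupied, whatever the sensor says


b321 = TrackedBlock("B321")
b321.on_sensor(0.0, True)        # B321 ON
b321.on_sensor(100.86, False)    # B321 OFF about 1 min 40 s later: ignored
print(b321.status(110.0, next_block_active=False))
print(b321.status(120.087, next_block_active=False))
```

With the timings from the log, the block is still considered occupied at 110 seconds and flips to error once 120 seconds have elapsed without the next block activating.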
So really the problem is that we have not moved on to the next block within the 2 minutes deadline.
To understand why, let’s look at normal run, one with no errors, as reported when running now with Conductor 1:
09:41:20.736 S S/b321 : ON
09:43:00.681 S S/b321 : OFF
09:43:20.664 S S/b503a : ON
Interestingly, it takes just under 20 seconds to reach B503a, the next block in the sequence after B321. And more importantly, the time between B321 ON and B503a ON is 119.928 seconds, less than 0.1 s under the 120-second limit. In that case, Conductor 2 would accept the transition as being under the timeout. If the train were 0.1 s slower, it would go into error mode.
So that’s why the Conductor 2 automation worked sometimes: we were so close to the limit. Any slowdown of the engine by 0.1 seconds on that long run would push it over the limit.
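For the record, here is the arithmetic on the three timestamps of the Conductor 1 log, as a small Python sketch. The `seconds_between` helper is just something written for this example; it assumes both timestamps fall on the same day.

```python
from datetime import datetime


def seconds_between(start: str, end: str) -> float:
    """Elapsed seconds between two HH:MM:SS.mmm log timestamps (same day)."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


b321_on = "09:41:20.736"
b321_off = "09:43:00.681"
b503a_on = "09:43:20.664"

on_b321 = seconds_between(b321_on, b321_off)   # time with B321 active
gap = seconds_between(b321_off, b503a_on)      # time with no sensor active
total = seconds_between(b321_on, b503a_on)     # what the 120 s timeout sees

print(f"B321 active: {on_b321:.3f} s")   # 99.945 s
print(f"gap:         {gap:.3f} s")       # 19.983 s
print(f"total:       {total:.3f} s")     # 119.928 s
print(f"margin:      {120 - total:.3f} s")  # 0.072 s
```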
What does “it takes 20 seconds to reach B503a from B321” mean?
Well, it turns out that B503a is not really the next block after B321:
- Script as programmed right now: B330 → B321 → B503a
- Reality of the track at Randall: B330 → B321 → B504 → B503a
There’s a block in between, B504. I typically ignore it in the script because it does not have a block detection sensor “yet”. It’s been on my todo list since 2017 to add it. Real Soon Now™. After all we only spend 20 seconds on that block, do we really care? Well…
Thus we have various solutions available:
- The easy “cheat” one: extend the max-time-on-block timeout to be longer than 120 seconds. Even just 130 seconds would work (with about 10 seconds of margin). This is the lowest effort.
- Create a virtual block for B504 between B321 and B503a. That would match the current reality. This is a modest effort, and the most logical one. ⇐ That’s what I did.
- Actually wire a block sensor for B504. That’s more involved, and can be a logical second step to move away from a virtual block. ⇐ Still on my todo list. Real Soon Now™.
So what’s a “virtual block” anyway? As the name implies, it’s a block in the automation script, except it’s not backed by a physical block detection sensor. It’s “virtual”. The scripting engine tracks its occupancy based on the track schematics and the previous and next block occupancy:
- Given a track schema of B330 → B321 (real sensor) → B504 (virtual) → B503a (real sensor)
- If a train is on B321 and that block goes off, we assume the train moved to the next block, namely B504, and we mark it occupied.
- Once B503a becomes active, we know the train left B504, and we can mark it empty.
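The rules above can be sketched as a tiny Python tracker. Again, this is an illustration and not the Conductor 2 implementation: `VirtualBlockTracker` and its methods are hypothetical names, and the sketch follows a single train along a fixed route.

```python
class VirtualBlockTracker:
    """Hypothetical sketch: follow one train along a route with virtual blocks."""

    def __init__(self, route, virtual):
        self.route = list(route)      # ordered block names
        self.virtual = set(virtual)   # blocks without a physical sensor
        self.current = None           # block the train is assumed to be on

    def on_sensor(self, block: str, active: bool) -> None:
        if block in self.virtual:
            raise ValueError(f"{block} is virtual and has no sensor")
        if active:
            # A real sensor turning ON moves the train there, which also
            # empties any virtual block the train was assumed to be on.
            self.current = block
        elif block == self.current:
            # The current real block went OFF: if the next block is virtual,
            # assume the train moved onto it.
            i = self.route.index(block)
            if i + 1 < len(self.route) and self.route[i + 1] in self.virtual:
                self.current = self.route[i + 1]


tracker = VirtualBlockTracker(["B330", "B321", "B504", "B503a"], virtual=["B504"])
tracker.on_sensor("B330", True)
tracker.on_sensor("B321", True)
tracker.on_sensor("B330", False)
tracker.on_sensor("B321", False)
print(tracker.current)   # B504, with no sensor involved
tracker.on_sensor("B503a", True)
print(tracker.current)   # B503a, so B504 is empty again
```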
However there’s a trick here… This assumes the block sensors are reliable. If B321 were a flaky block sensor as it was before, then the scripting engine would trigger the “move to block B504” prematurely as soon as B321 becomes off due to flakiness. That’s the reason I never bothered with creating a virtual block here when I rewrote the script for Conductor 2 -- it would not have worked well.
And to be clear, trying to wire a block sensor for B504 may not work. Because the layout started out as a DC layout, power for some blocks is often routed via the turnouts. That’s especially the case for lead in/out interchange tracks, which are powered via relays from whichever blocks they are connected to. E.g. throwing a turnout to enter a yard from the mainline will often cause the entire mainline block + the turnout + the interchange track to be powered together as a whole, thus totally breaking any block detection sensing. That’s a good behavior in DC but a poor one in DCC with block detection.