Progress: it seems I had to endian-swap the length fields in the last GCM block vs the STM32F7. Now it's getting all the way up to the point of the client seeing an SSH_MSG_CHANNEL_SUCCESS that I sent after successful password authentication, but the contents of the packet seem garbled so it aborts.
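For reference (and future me): the final GHASH block in GCM is the AAD and ciphertext lengths as 64-bit big-endian *bit* counts. A minimal sketch of building it, with a hypothetical helper name:

```c
#include <stdint.h>

// The final GHASH block in AES-GCM is len(AAD) || len(C), each a
// 64-bit *big-endian* count of BITS (not bytes). Illustrative helper,
// not my actual firmware code.
static void gcm_length_block(uint8_t out[16],
                             uint64_t aad_bytes, uint64_t ct_bytes)
{
    uint64_t aad_bits = aad_bytes * 8;
    uint64_t ct_bits  = ct_bytes * 8;
    for (int i = 0; i < 8; i++)
    {
        out[i]     = (uint8_t)(aad_bits >> (56 - 8*i)); // big endian
        out[8 + i] = (uint8_t)(ct_bits  >> (56 - 8*i));
    }
}
```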

It lives!

Somehow I was sending replies to SSH_MSG_CHANNEL_REQUEST packets by writing *into the incoming packet buffer*, not the reply buffer.

And by dumb luck I guess whatever uninitialized garbage was in the reply buffer happened to resemble a valid SSH_MSG_CHANNEL_SUCCESS message before, but not now? Lol.

Anyway, this is a great success. Kid is up from her nap so that's it for a while, tonight after bedtime I'll hook up the Curve25519 accelerator on the FPGA to speed session creation a bit, then work on a bunch of CLI commands to dump PHY information and such.

Bumped the optimization level on my firmware up from -O0 to -O2 because creating an SSH session was too slow.

But the FPGA curve25519 accelerator is still over 48x faster than the software implementation. Pretty happy with that :)

Ephemeral ECDH key generation and shared secret calculation now use the FPGA accelerator and SSH session creation now feels about as fast as it does when logging into a regular PC.

I can probably extend the same accelerator block (with some minimal tweaks) to also support the public key side of signing, but for now crypto_sign() is still being done entirely in software and only the two crypto_scalarmult() calls in the SSH session creation are accelerated.
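For context, the two accelerated calls look roughly like this in the NaCl-style API named above (tweetnacl signatures; EcdhSketch is an illustrative name, not my actual handshake code):

```c
#include <stdint.h>

// NaCl-style curve25519 API. The FPGA accelerator just replaces the
// body of these with an offload; signatures follow tweetnacl.
extern int crypto_scalarmult_base(uint8_t* q, const uint8_t* n);
extern int crypto_scalarmult(uint8_t* q, const uint8_t* n, const uint8_t* p);

// Sketch of the ephemeral ECDH side of curve25519 key exchange:
// one scalarmult to derive our public key, one for the shared secret.
void EcdhSketch(const uint8_t priv[32], const uint8_t peer_pub[32],
                uint8_t pub[32], uint8_t shared[32])
{
    crypto_scalarmult_base(pub, priv);          // our ephemeral public key
    crypto_scalarmult(shared, priv, peer_pub);  // shared secret
}
```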

Still a massive improvement in responsiveness, it cut about 400ms of latency off session creation.

Added some code to poll PHY MDIO register state (not using irq pins yet) and the SGMII PHYs seem less happy. One refuses to report link up over MDIO (even if the LEDs are on and the link partner says it's up), the other reports link up but flaps.

Just when I thought that was working...

So I guess the first question is if I'm actually addressing the correct PHY and if it's in fact reporting link flaps. And how the MDIO link state compares to the SGMII autonegotiation state.
After some soldering, I think I'm ready to start debugging!

So, here's the basic setup.

Blue and green wires go to the MDIO bus, which is slow enough (2.5 Mbps with very low drive strength) that I'm not worried about reflections off a few inches of flying wire. Standard 10x passive probes clip to the other end of each.

The two black probes are Teledyne LeCroy QuickLink solder-in probe tips. One is going to a D420-A and the other to a D1330; both have way more bandwidth than I need to see SGMII clearly.

I'd use my own AKL-PT5 probes for this measurement (well within their capabilities) except that I'd need to AC couple the measurement and somehow I only have one SMA DC block on the shelf right now. That will be rectified by the end of the week.

Initial observations: The SGMII RX waveform looks decent enough and passes the eye mask required for the FPGA to decode it. I've seen better, but I'm in no hurry to rework the board because of this.

Swing and drive strength on the TX seem a bit excessive, I should probably turn it down. The eye is wide open but the PHY could probably hear the FPGA from the next room over!

Valid MDIO traffic is present. This particular waveform has two packets at the start, then four, then two more.

The MCU reads four registers per polling cycle: basic control and status of PHY 0, then of PHY 1. After polling each PHY, it checks for a link state change and logs a message to the UART.
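The polling cycle is roughly this shape (mdio_read() and the logging details are illustrative, not my exact firmware):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

// Assumed MDIO read helper (phyaddr, register index)
extern uint16_t mdio_read(uint8_t phyaddr, uint8_t regid);

static bool g_lastLinkUp[2] = { false, false };

// One polling cycle: basic control (reg 0) and basic status (reg 1)
// of each PHY, then log a message on any link state change
void PollPhys(void)
{
    for (uint8_t phy = 0; phy < 2; phy++)
    {
        uint16_t bmcr = mdio_read(phy, 0); // basic control
        uint16_t bmsr = mdio_read(phy, 1); // basic status
        (void)bmcr;

        bool linkUp = (bmsr >> 2) & 1;     // BMSR bit 2: link status
        if (linkUp != g_lastLinkUp[phy])
        {
            printf("PHY %u: link %s\n", phy, linkUp ? "up" : "down");
            g_lastLinkUp[phy] = linkUp;
        }
    }
}
```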

The long delay after the last packet suggests this is being detected as a link state change.

The SGMII link appears to be up the whole time, and at no point does it fall back into the negotiation state. So the link is probably *not* actually flapping; this smells like a bug on the microcontroller side.

We expect g13 (PHY address 1) to be down, and g12 (PHY address 0) to be up at 1 Gbps.

Looking at the actual MDIO bus traffic in this capture, we have:

* PHY 1 ctl: 1G/full
* PHY 1 stat: Down
* PHY 0 ctl: 1G/full
* PHY 0 stat: Up
* PHY 1 ctl: 1G/full
* PHY 1 stat: Down
* PHY 0 ctl: 1G/full
* PHY 0 stat: Up

Nothing seems obviously wrong here.

OK, this is starting to smell like an FPGA issue.

We know the actual MDIO traffic on the wire is fine, but sometimes we're reading 0x7949 for the basic control register on port 12.

Interestingly, this is the same value we just read from the basic status register on port 13.

And then at 22.420, we read the basic control register for port 13 as 0x116d. 0x6d is the value we just read from the basic status register on port 12.

So I think there's some kind of bug in the FPGA MDIO-to-QSPI bridge where sometimes it will return a previous value instead of what was actually read.

New FPGA bitstream with some additional debug logic, as well as changing the FPGA output buffer from DIFF_HSTL_I_DCI_18 to LVDS.

TX data eye measured at the PHY side is still reasonably open, but way lower amplitude than before. I'll double check the spec later but this should be plenty open enough.

And here's the bug, caught red handed.

We start with the MDIO transceiver being busy with a read of address 0x00. The read data register is still 0x7949, the previous value, because the read is still in progress.

At T=7862 the MCU begins a 4-word burst read of REG_DP_MDIO (0x004c). This is a 32-bit little endian register with the read value in the low 16 bits, a bunch of write-only configuration, and a busy flag in the MSB.

By T=7887 when we read SPI address 0x004f (where the busy flag is) the read has just finished.

So the MCU thinks it's successfully read the whole register.

The fix is pretty simple: latch the busy flag when address 0x4c is read (the entire 32-bit register has to be read in one go, byte access is not supported). The MCU will then read {busy, 0x7949} just like it did on the previous poll, then read the correct value on the subsequent polling cycle.
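The MCU side of the fixed handshake then looks something like this (the QSPI read helper and function names are illustrative; the register layout matches the description above):

```c
#include <stdbool.h>
#include <stdint.h>

// Busy flag lives in the MSB (bit 31) of REG_DP_MDIO (0x004c),
// read value in the low 16 bits
#define MDIO_BUSY (1u << 31)

// Assumed helper: 32-bit burst read over the QSPI bridge
extern uint32_t qspi_read32(uint32_t regaddr);

// Returns true and fills *value only if the latched busy flag says the
// low 16 bits are a completed read; otherwise the data is stale and we
// try again next polling cycle.
bool TryReadMdioResult(uint16_t* value)
{
    uint32_t reg = qspi_read32(0x004c);
    if (reg & MDIO_BUSY)
        return false;
    *value = reg & 0xffff;
    return true;
}
```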

Yay, no more flapping!

Tomorrow's problem: while g12 links up fine at gigabit speed, last time I tried g13 would struggle a bit then come up at 100 Mbps (verified by link partner).

That's probably a hardware issue of some sort since g12 and g13 are supposed to be identically configured. The RJ45 pinout is mirrored because of the tab-up vs tab-down jacks, but that should (famous last words) be fine because the DP83867 has a register to enable ABCD -> DCBA mirroring, which I think I've set correctly.

I made a quick pass over the schematic and nothing seemed otherwise different, it was largely copy-pasted other than the PHYADDR strap pins.

Yet another command that I wish "real" switches had.

There will of course be fancy commands that include nice detailed decodes of port state. But sometimes there's no substitute for getting close to the metal.

Well, that was a slightly larger yak than I originally expected but it's been thoroughly shaved.

SSH clients on the switch can now see log messages. For now this is enabled by default, although long term I might have this controlled by a per-unit configuration setting or off by default with a Cisco-style "terminal monitor" command to start seeing log messages.

During development I want ALL the logs so I'll leave it like this for now.

Next step will be to implement some of the commands I copied over (commented out) from the Ethernet tap board, and make any tweaks needed to support the additional PHY chipsets on the board.

In particular, I want to be able to send test patterns out both DP83867s to check for soldering issues before I debug the 100mbit-only link issue further.

Ok, I should sleep...

But on the plus side, I have the code to send test patterns working (including the three special test patterns that the DP83867 specifies in addition to the IEEE-defined ones).

Won't be able to actually debug the g13 100mbit issue until tomorrow after work but I should have all the groundwork laid now.

Oh I'm sorry you wanted *less* cable spaghetti? I swear you said you wanted *more*. I even bought a new roll of ESD tape to wrangle it all.

Got the baseT test fixture cabled up so I can troubleshoot g13's link issues after work, but didn't have time to collect any data yet.

If you haven't seen it before, this is a handy dandy little gizmo consisting of two RJ45 jacks connected back to back by dual directional couplers.

This gives me 16 SMA outputs with 10 dB attenuated views of each of the 8 wires in the twisted pair cable, seen from both directions. I'm using an 8 channel PicoScope 6824E to look at all 8 lines coming out of the DP83867, ignoring the inbound data from the other side.

Hmmm.

Set up a test pattern on g13 and expected to see it coming out all pairs of the link, but only seeing it on pair A.

Thought this pointed to a soldering issue, except I'm seeing it on g12 as well (which links up just fine).

So I guess I need to read the datasheet and see if there's a test pattern mux register or something I'm missing...

Yep, there is. MMD 0x1f register 0x25, TMCH_CTRL, defaults to only sending the test pattern out pair A.
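For anyone else fighting this chip: extended registers on the DP83867 go through the standard clause-22 indirect mechanism, REGCR (0x0D) and ADDAR (0x0E). A sketch, with mdio_write() as an assumed helper:

```c
#include <stdint.h>

// Assumed MDIO write helper (phyaddr, register index, value)
extern void mdio_write(uint8_t phyaddr, uint8_t regid, uint16_t value);

// Clause-22 indirect MMD access as the DP83867 implements it:
// REGCR (0x0D) selects the device and function, ADDAR (0x0E) carries
// the register address or data.
void WriteExtendedRegister(uint8_t phyaddr, uint8_t devad,
                           uint16_t regaddr, uint16_t value)
{
    mdio_write(phyaddr, 0x0d, devad);           // REGCR: function = address
    mdio_write(phyaddr, 0x0e, regaddr);         // ADDAR: register address
    mdio_write(phyaddr, 0x0d, 0x4000 | devad);  // REGCR: function = data
    mdio_write(phyaddr, 0x0e, value);           // ADDAR: register data
}
```

So the TMCH_CTRL fix is a single WriteExtendedRegister(addr, 0x1f, 0x25, ...) call; check the datasheet for the actual field values before copying any of this.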

With that fixed, on g12 I'm seeing the test pattern on all pairs. So we know our register config is correct there.

And now here's what we see on g13. One of these is not like the other.

Probably a solder defect but I'll need to pull the board to investigate. Decabling this will take a while...

I wish it was a solder defect. The truth is worse.

Not sure how this got through design review...

Looking at the layout, bodging this is going to be fuuuun.

g10, g8, and g4 have pair D routed on layer 6 of 8. Getting to them (assuming I come from the back of the PCB to avoid desoldering the connector) will mean drilling down 250 μm - annoying but not too bad.

g13, g6, g2, and g0, all have pair D routed on layer 3 of 8. Getting to *this* from the back side will mean drilling down almost 1.3 mm. That will be decidedly less fun.

The good news is that I have almost 1mm of width and as much length as I need to play with. There's basically nothing on other layers that I'm likely to hit.

And worst case, this isn't a fatal issue for a prototype. Having half the ports only run in 100baseTX mode, or even not work at all, would surely be annoying. But it wouldn't prevent me from using the board as a development platform for the full scale 24 port switch, which was the real goal.

But I'd like to make it fully functional if I can.

Not happening tonight, though. I've got too much else on my plate with time constraints.

Actually I might try some fixturing work and a preliminary cut while waiting for stuff to run on another project.

My microscope ring light was too fat to clear so I bodged up an LED headlamp with some tape.

First test cut. Through layers 8 (back) and 7 (ground plane). There's an LED trace on layer 6 we might get close to, but if it's damaged not a huge deal, plenty of other places to reconnect if required. Layers 5 and 4 are power planes we need to not short, then 3 is where the actual bodge will happen.
Down to layer 5.

First connector (on the DP83867s) bodged. Not attempting the rest (on the VSC8512) until I've brought it up.

Ended up milling all the way down and cutting the track then reconnecting on the surface. There's a small stub off a via which isn't great but it'll probably be fine on a prototype.

I'll save the other six for later. If the PHY doesn't work, no point spending time reworking the RJ45s.

Looks like that fixed it at least.

Initial signs of life out of the QSGMII PHY!

It's responding to MDIO with the correct address, but twice (?) and at 8 addresses (this is a 12 port PHY). Suspecting a timing issue related to the level shifters on the MDIO bus, but not sure yet. Dropping the MDIO clock frequency by 10x from 2.5 MHz to 250 kHz didn't fix it.

The actual PHY side seems OK, it links up with my laptop on every port I've tried (aside from the known pair D issue on the upper row of ports).

Also whoops I misspoke. The Ethernet test fixture is 16 dB couplers not 10. The directional coupler I use for TDR stuff is 10 dB and I mixed them up.

Too much RF hardware :p

Reading the programming guide in the VSC8512 datasheet.

Why??? IEEE has a perfectly well defined way to access up to 2^16 extended registers. You don't need to roll your own way to do it.

Loaded an FPGA bitstream that instantiates the QSGMII transceivers on the FPGA.

Power consumption climbed to 12.7W and the FPGA die temperature is up to 48.5C.

The 1V0 rail for the GTXes is sagging to 975.5 mV under load, since it's just pi filtered off of the main FPGA 1V0 rail without an independent remote sense. This is within spec... barely. But definitely something I will want to work on in the future. The full LATENTRED switch (with eight transceivers) will definitely need a dedicated SERDES power rail with independent regulation.

The FPGA 1V0 rail is doing just fine, 1.0015V at the test point and 0.996V measured by the on die ADC.

The thermal pad and heatsink pressure seem fine. Heatsink surface temperature is only 5C below die temperature so not much of a gradient there.

FPGA logic reports none of the QSGMII links are up.

Not entirely surprising since I've never actually tested the QSGMII block in hardware, but still a bit annoying.

I think that's it for today. Tomorrow I'll decable the whole setup (again), and probably try to bodge one or more of the VSC8512 RJ45s while I have it off the bench.

Then get test leads on the VSC8512 MDIO bus (to see if anything funky is happening with timing there, I still can only talk to 8 of the 12 PHYs... might be a register misconfiguration too though), and probably land a high BW probe on one or more of the QSGMII lanes to see what's happening with that.

Quick handheld probe measurement off the QSGMII TX line from the FPGA.

Definitely some logic bugs, we're supposed to have K28.1 in lane 0 and all I'm seeing is K28.5.

The eye (measured at the PHY side of the coupling capacitor) is pretty wide open, but I will definitely want to tweak driver settings given the closure in the right half. Need to check this against the QSGMII eye mask but I don't have the specs for that in ngscopeclient yet (also a job for tomorrow).

Seems like drive on my QSGMII TX is just a little bit over the top. Left eye has the transmitter mask, right has the receiver.

This is a mid-channel measurement (at the AC coupling cap) so we need to be better than the RX mask but don't need to pass the TX.

Back to the lab for the evening and continuing switch bringup.

Double checking pins on the VSC8512 and so far not seeing any issues.

I did notice that the thermal diode is tied off to ground, which is in retrospect a mistake. I should have provided a means to monitor it externally. Now I have no way to tell if the PHY is overheating other than by pointing a FLIR camera at the heatsink and adding a couple of degrees to the reading.

Signal integrity tweaking on the QSGMII.

Took initial measurements with an AKL-PT5 and a D1330, then cross checked the PT5 measurements against a D1605.

After some tweaking, the QSGMII TX waveform isn't overshooting.

But when I soldered an AKL-PT5 on, I saw a huge dip around T=25ps that I don't remember seeing in the handheld probe view (maybe it didn't have enough BW to show it?)

I repeated the same measurement with a D1605 (shown here) just in case it was an artifact of the PT5. Other than a bit less noise, the eye looked identical.

Need to check and see if the remaining QSGMII lanes have similar issues or if this is the only one, or what. It technically passes the QSGMII eye mask so it *should* work but I wouldn't want to field it looking like this!

RX drive strength is a bit higher than spec, but the FPGA will happily eat it so I'm not concerned.

Looking at the QSGMII link state, it seems that the FPGA is sending autonegotiation codeword 0x4001 (SGMII mode, no remote fault etc, no next page).

The PHY is sending K28.5 D16.2 which is IDLE 2, so I think this means it's waiting for the FPGA to go "ok, link is up"?

Reading register 19E3 from the PHY (link partner clause 37 ability) shows 0x4001, the same thing the FPGA is sending. This means that the PHY is seeing my autonegotiation traffic and decoding it correctly.

Register 17E3 is 0x0409: no SGMII alignment error or remote fault, no full duplex advertised by MAC (seems wrong), no half duplex advertised by MAC, link partner AN capable, link not connected, AN not complete, signal present.

But... bit 5 of the AN advertisement (which means full duplex capable) is *reserved, must be zero* in SGMII mode. So I'm not sure if this is a problem or not.
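For reference, here's how I'm thinking about these codewords (bit positions per my reading of the SGMII spec, PHY-to-MAC direction; the MAC-to-PHY word is normally just 0x4001 as seen here — double check before relying on this):

```c
#include <stdbool.h>
#include <stdint.h>

// Rough decode of an SGMII autonegotiation codeword
// (PHY-to-MAC direction carries the interesting fields)
typedef struct
{
    bool    linkUp;     // bit 15
    bool    ack;        // bit 14
    bool    fullDuplex; // bit 12
    uint8_t speed;      // bits 11:10 (0 = 10M, 1 = 100M, 2 = 1000M)
} SgmiiAnegWord;

SgmiiAnegWord DecodeSgmiiAneg(uint16_t w)
{
    SgmiiAnegWord r;
    r.linkUp     = (w >> 15) & 1;
    r.ack        = (w >> 14) & 1;
    r.fullDuplex = (w >> 12) & 1;
    r.speed      = (w >> 10) & 3;
    return r;
}
```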

Well here's a problem. My SGMII MAC isn't properly dropping ordered sets when the RX FIFO fills up.

Fixed a bunch of bugs in the SGMII block, the QSGMII-SGMII bridge, and even in ngscopeclient.

And the TX eye still isn't very pretty, I need to investigate that more.

But the QSGMII links are now alive! Let's see if I can actually pass traffic...

And it looks like the PHY is able to receive traffic! Haven't tested if it decodes properly in the FPGA etc, but the PHY is sending well formed QSGMII, the FPGA sees the link as up, and the decode in libscopehal is making sense of it.

Not sending anything yet. A lot more work needed on the switch logic in the FPGA to make *that* happen.

Continuing switch bringup work.

All ports (except the four VSC8512 interfaces which aren't responding over MDIO) have link state/speed working and queryable via the MCU.

Something is wonky with the basic status register, it's saying the link is half duplex even though it's negotiated to full duplex (in fact, only advertising full duplex). Not sure if this is a bug or what. Might have something to do with the 8051 microcode patch I haven't yet applied?

Spent a while today debugging on live hardware and finally reproduced the issue in simulation.

Packets longer than 32 128-bit words max out the prefetch FIFO, and I never continue fetching traffic after that point. There's a big giant TODO comment I never implemented. Oops.

Found and fixed a few more bugs (including one that hadn't bit me yet, but would have become bad under heavier network traffic). Timing is getting a bit tricky, this one path (basically arbitration to decide which input FIFO to pop into the shared bus) is going to have to get reworked before I scale up to 24 ports.

Did a bunch of timing fixes and added some more pipeline stages. Latency is higher than I'd like now and I'll definitely want to work on reducing it, but it should do for a starting point.

Also did some per-link power estimates: about 13.3W in the current test configuration (management port, SFP+ uplink, and two VSC8512 edge ports active at 1 Gbps, no packet traffic).

This climbs to about 13.8W (+0.5W, so 0.25W per interface) if looping back two DP83867 interfaces, and 14W (+0.7W, so 0.35W per interface) looping back two VSC8512 interfaces.

With all links up, I thus project that the total board power consumption would climb to about 17.3W. This would likely increase a bit further with heavy traffic due to increased toggles on the SRAM bus etc.

Not too bad for a ~16 port switch (counting management and uplink ports). I've also put zero effort into optimizing the FPGA design for power to date, so there's probably things I can do to improve there.

Off the top of my head:

* If an entire group of four baseT links is down or disabled, I can shut down the QSGMII SERDES
* If there's no traffic on the read side of the SRAM bus, I can disable the input terminations
* If there's no traffic on the write side of the SRAM bus, I might be able to tristate the bus except for control signals
* It might be possible to consolidate/optimize PLL configuration to use fewer PLLs
* There's definitely work to be done to use fewer long range high fanout clocks on the FPGA
* Improve gating of unused signals on wide buses etc to avoid propagation of toggles that don't do useful work

Always a fun day when you have to write code like this...

Hopefully this will give me a trigger condition that will let me figure out why my switch fabric is deadlocking trying to forward a packet without actually doing anything to it.

Welp. Somehow I'm trying to start forwarding from port #15.

Except I only have 15 ports (14 plus the uplink) and port numbers are zero based.

Looks like I was incrementing the round robin counter but forgot to add the "mod portcount" bit.

And apparently whatever logic Vivado synthesizes for accessing the 16th element of a 15-element vector resulted in the arbiter thinking it had data to send, entering the busy state, but then never getting a done signal.
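The fix itself is one line; in software terms the broken and fixed arbiter increments look like this (PORT_COUNT and the function name are illustrative):

```c
#include <stdint.h>

// 14 downstream ports plus the uplink, numbered 0-14
#define PORT_COUNT 15

// Round robin arbiter: advance to the next port to consider.
// The broken version was just "current + 1", which yields 15,
// one past the end of the 15-entry port vector.
uint8_t NextPort(uint8_t current)
{
    return (current + 1) % PORT_COUNT; // wraps 14 -> 0
}
```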

And after a few more fixes, it's working!

Here an ARP frame shows up on port 0 (g0), is received via QSGMII, transferred to the core clock domain, processed through the SRAM FIFO (all offscreen).

Then at T=32 it's looked up in the MAC address table. At T=35 the table returns "not found", which makes sense since the destination is a layer 2 broadcast.

At T=39 a forwarding decision is made: the frame should be broadcast to all of VLAN 99 except for g0, where the frame came from. In this example config that's ports 5 (g5) and 14 (xg0).

Then at T=41 after some pipeline latency, data begins flowing.

It ends up in /dev/null for now because there are no exit queues between the frame_* control signals and the TX-side MAC IPs. But that's the only missing piece to make this a fully functional, if very basic, switch!

FPGA resource usage is growing, but things are still looking good in terms of being able to finish the job - and hopefully fit a full 24 port design in the same FPGA.

Current total fabric usage including the logic analyzer IP is 34% LUT, 23% FF, 39% BRAM, 6% DSP, 100% SERDES (duh), 65% IO, 53% global clocks, 25% MMCM/PLL.

One big unknown is how to scale the architecture up to 24 ports, since the current shared bus architecture is running close to its max performance with 14 ports and assumes a single memory channel. Refactoring this to work with a dual channel RAM controller will be interesting.

One "easy" option is to have essentially two independent sub-switches and a high bandwidth interconnect between them. But that might mean duplicating resources like the MAC address table.

Added exit queues and it's getting fuller. 38% LUT, 25% FF, 48% BRAM, 6% DSP, 100% SERDES, 65% IO, 53% BUFG, 25% MMCM / PLL.

Still missing VLAN tag insertion for outbound trunk ports (and some other logic to propagate VLAN tag information to support that) but in theory it should be capable of switching between access ports now. About to try in hardware, wish me luck!

And no go. My pings aren't being seen and I'm seeing no transmit activity on the QSGMII link.

But at least I have some idea of where to add on-chip debug probes to troubleshoot further.

Ok, turns out there is transmit activity but it's gibberish. Skipping data bytes or something.

Upon closer inspection it seems I had incorrect TX clock configuration (feeding TXUSRCLK with 156.25 MHz instead of 125) due to some confusing GTX configuration. Hopefully this will fix it...

It's alive!! First light on the switch passing packets!

When I ping flooded through it, it locked up and stopped forwarding traffic until I reloaded the FPGA. Probably related to one of the dozens of FIFO-full error handling code paths I haven't tested or fully implemented.

Still lots more work to do: VLAN tag insertion on outbound trunk interfaces, 10/100 support in the SGMII MAC, performance counters, tons of error handling, lots of CLI commands, investigating SI on the QSGMII TX diffpair, figuring out why g8-g11 aren't responding on MDIO, power integrity validation...

Found a few more thermometers on the board. Turns out in addition to the externally pinned out thermal diode on the VSC8512 (which I didn't hook up to anything) there is an (undocumented, but used in some example code I dug up) internal digital temperature sensor.

There's also one on the STM32.

Fixed a bunch of bugs and reduced latency of the QDR-II+ controller. End to end latency from read request to full burst data in hand - including PCB trace delays and clock domain crossing but not the additional pipeline stage for ECC - is now down to nine clocks at 187.5 MHz (48 ns). Probably more room to improve further on that but it's already way better than the 11-17 cycles I was seeing before with a less efficient CDC structure.

It no longer falls over instantly when ping flooded, however sustained floods (especially with preload) still make it start corrupting packets. So I've fixed the easiest-to-trigger bug and there's still more.

Debating how much time I want to spend chasing bugs in the current fabric architecture since I know it won't scale to 24 ports and barely makes timing as-is. Might just blow away everything between the input FIFOs and the MAC table and redo it clean slate.

@azonenberg is that just a coincidence that the SFP+ is the same temperature to 2 decimal places?
@azonenberg absolutely amazing progress, congratulations!
@azonenberg congratulations! I am following your progress with considerable interest, one of my favourite things to do here on mastodon.
@azonenberg I guess there was a warning in Vivado's log. (Among hundreds of totally useless ones.)
@azonenberg my favourite nomenclature for this pattern is `++fuse == blown` :D
@azonenberg Also check if you've got the EEE features turned on in the TI PHYs. That saves an eighth to quarter watt each depending on supply choices.

@AMS Both the DP83867 and VSC8512 (and I think the KSZ9031 on the management port too?) have EEE support, but I don't think I've poked the registers to mess with the setup. Unsure if it's on by default.

At this stage I'm just happy that I'm on track to hit ~30W for a 24 port switch vs the 80W typ / 160W of my ancient Ciscos! If I can get even lower, that's great.

@azonenberg out of curiosity, why do you feed the channel into the clock recovery PLL but the data through a threshold filter? Why not threshold the clock? Does it help to (maintain) lock even with weaker signals?

@anotherandrew The CDR PLL filter does internal thresholding on analog inputs with sub sample interpolation (currently linear but may switch to cubic eventually) to find zero crossings with high accuracy.

While it can work with a digital input if necessary, jitter performance degrades due to the lack of interpolation (essentially the phase detector block has its input rounded to the nearest integer sample index). With 5 Gbps data and a 40 Gsps sample rate, you only have eight samples per UI so that sub sample precision makes a significant difference in stability of the recovered clock.
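In software terms the interpolation is just this (a toy model, not the actual libscopehal code):

```c
// Given two consecutive samples straddling the threshold, estimate the
// fractional position of the crossing between them instead of rounding
// to the nearest integer sample index. Assumes a != b.
float InterpolateZeroCrossing(float a, float b, float threshold)
{
    // Fraction of the way from sample a to sample b where the
    // waveform crosses the threshold (linear interpolation)
    return (threshold - a) / (b - a);
}
```

At 8 samples per UI, that fractional index is the difference between ~1/8 UI quantization of each edge and a much finer phase estimate for the CDR's phase detector.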

The protocol decode block just needs a digital waveform you can sample on the edges of the recovered clock.