And now here's what we see on g13. One of these is not like the other.
Probably a solder defect but I'll need to pull the board to investigate. Decabling this will take a while...
I wish it was a solder defect. The truth is worse.
Not sure how this got through design review...
Looking at the layout, bodging this is going to be fuuuun.
g10, g8, and g4 have pair D routed on layer 6 of 8. Getting to them (assuming I come from the back of the PCB to avoid desoldering the connector) will mean drilling down 250 μm - annoying but not too bad.
g13, g6, g2, and g0 all have pair D routed on layer 3 of 8. Getting to *this* from the back side will mean drilling down almost 1.3 mm. That will be decidedly less fun.
And worst case, this isn't a fatal issue for a prototype. Having half the ports only run in 100baseTX mode, or even not work at all, would surely be annoying. But it wouldn't prevent me from using the board as a development platform for the full scale 24 port switch, which was the real goal.
But I'd like to make it fully functional if I can.
Not happening tonight, though. I've got too much else on my plate and not enough time.
Actually I might try some fixturing work and a preliminary cut while waiting for stuff to run on another project.
My microscope ring light was too fat to clear so I bodged up an LED headlamp with some tape.
First connector (on the DP83867s) bodged. Not attempting the rest (on the VSC8512) until I've brought it up.
Ended up milling all the way down and cutting the track, then reconnecting on the surface. There's a small stub off a via which isn't great, but it'll probably be fine on a prototype.
I'll save the other six for later. If the phy doesn't work, no point spending time reworking the RJ45s.
Initial signs of life out of the QSGMII PHY!
It's responding to MDIO with the correct address, but twice (?) and at 8 addresses (this is a 12 port PHY). Suspecting a timing issue related to the level shifters on the MDIO bus, but not sure yet. Dropping the MDIO clock frequency by 10x from 2.5 MHz to 250 kHz didn't fix it.
The actual PHY side seems OK, it links up with my laptop on every port I've tried (aside from the known pair D issue on the upper row of ports).
Also whoops I misspoke. The Ethernet test fixture is 16 dB couplers not 10. The directional coupler I use for TDR stuff is 10 dB and I mixed them up.
Too much RF hardware :p
Reading the programming guide in the VSC8512 datasheet.
Why??? IEEE has a perfectly well defined way to access up to 2^16 extended registers. You don't need to roll your own way to do it.
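For reference, the standard mechanism (IEEE 802.3 Clause 22 Annex 22D) reaches extended MMD registers indirectly through registers 13 (MMD access control) and 14 (MMD address/data). A minimal sketch of that access sequence, with `mdio_read`/`mdio_write` as hypothetical bus helpers (not part of any real driver here):

```python
# Sketch of the standard IEEE 802.3 Clause 22 indirect access to MMD
# extended registers, via register 13 (MMD access control) and
# register 14 (MMD address/data). mdio_read/mdio_write are hypothetical.

def mmd_read(mdio_read, mdio_write, phy_addr, mmd, reg):
    """Read one extended register `reg` from MMD `mmd` on a Clause 22 PHY."""
    mdio_write(phy_addr, 13, mmd & 0x1F)                 # function = address
    mdio_write(phy_addr, 14, reg & 0xFFFF)               # latch register addr
    mdio_write(phy_addr, 13, (1 << 14) | (mmd & 0x1F))   # function = data
    return mdio_read(phy_addr, 14)                       # read the data out
```

Three MDIO writes and a read gets you any of 2^16 registers in any MMD, no vendor-specific paging scheme required.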
Loaded an FPGA bitstream that instantiates the QSGMII transceivers on the FPGA.
Power consumption climbed to 12.7W and the FPGA die temperature is up to 48.5C.
The 1V0 rail for the GTXes is sagging to 975.5 mV under load, since it's just pi filtered off of the main FPGA 1V0 rail without an independent remote sense. This is within spec... barely. But definitely something I will want to work on in the future. The full LATENTRED switch (with eight transceivers) will definitely need a dedicated SERDES power rail with independent regulation.
The FPGA 1V0 rail is doing just fine, 1.0015V at the test point and 0.996V measured by the on die ADC.
FPGA logic reports none of the QSGMII links are up.
Not entirely surprising since I've never actually tested the QSGMII block in hardware, but still a bit annoying.
I think that's it for today. Tomorrow I'll decable the whole setup (again), and probably try to bodge one or more of the VSC8512 RJ45s while I have it off the bench.
Then get test leads on the VSC8512 MDIO bus (to see if anything funky is happening with timing there, I still can only talk to 8 of the 12 PHYs... might be a register misconfiguration too though), and probably land a high BW probe on one or more of the QSGMII lanes to see what's happening with that.
Quick handheld probe measurement off the QSGMII TX line from the FPGA.
Definitely some logic bugs, we're supposed to have K28.1 in lane 0 and all I'm seeing is K28.5.
The eye (measured at the PHY side of the coupling capacitor) is pretty wide open, but I will definitely want to tweak driver settings given the closure in the right half. Need to check this against the QSGMII eye mask but I don't have the specs for that in ngscopeclient yet (also a job for tomorrow).
Seems like drive on my QSGMII TX is just a little bit over the top. Left eye has the transmitter mask, right has the receiver.
This is a mid-channel measurement (at the AC coupling cap) so we need to be better than the RX mask but don't need to pass the TX.
Back to the lab for the evening and continuing switch bringup.
Double checking pins on the VSC8512 and so far not seeing any issues.
I did notice that the thermal diode is tied off to ground, which is in retrospect a mistake. I should have provided a means to monitor it externally. Now I have no way to tell if the PHY is overheating other than by pointing a FLIR camera at the heatsink and adding a couple of degrees to the reading.
Signal integrity tweaking on the QSGMII.
Took initial measurements with an AKL-PT5 and a D1330, then cross checked the PT5 measurements against a D1605.
After some tweaking, the QSGMII TX waveform isn't overshooting.
But when I soldered an AKL-PT5 on, I saw a huge dip around T=25ps that I don't remember seeing in the handheld probe view (maybe it didn't have enough BW to show it?)
I repeated the same measurement with a D1605 (shown here) just in case it was an artifact of the PT5. Other than a bit less noise, the eye looked identical.
Need to check and see if the remaining QSGMII lanes have similar issues or if this is the only one, or what. It technically passes the QSGMII eye mask so it *should* work but I wouldn't want to field it looking like this!
RX drive strength is a bit higher than spec, but the FPGA will happily eat it so I'm not concerned.
Looking at the QSGMII link state, it seems that the FPGA is sending autonegotiation codeword 0x4001 (SGMII mode, no remote fault etc, no next page).
The PHY is sending K28.5 D16.2 which is IDLE 2, so I think this means it's waiting for the FPGA to go "ok, link is up"?
Reading register 19E3 from the PHY (link partner clause 37 ability) shows 0x4001, the same thing the FPGA is sending. This means that the PHY is seeing my autonegotiation traffic and decoding it correctly.
Register 17E3 is 0x0409: no SGMII alignment error or remote fault, no full duplex advertised by MAC (seems wrong), no half duplex advertised by MAC, link partner AN capable, link not connected, AN not complete, signal present.
But... bit 5 of the AN advertisement (which means full duplex capable) is *reserved, must be zero* in SGMII mode. So I'm not sure if this is a problem or not.
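To keep the bit positions straight, here's a minimal decode of an SGMII-style config word like the 0x4001 above. Bit assignments are per the SGMII spec as I understand it (bit 0 = SGMII mode, bit 14 = acknowledge; the speed/duplex fields only carry meaning in the PHY-to-MAC direction), so treat this as an illustrative sketch rather than a verified decoder:

```python
# Hedged sketch: decoding an SGMII autonegotiation config word such as the
# 0x4001 the FPGA is sending. Bit positions assumed from the SGMII spec;
# speed/duplex are only meaningful PHY->MAC.

def decode_sgmii_config(word):
    return {
        "sgmii_mode":  bool(word & (1 << 0)),
        "speed":       ("10", "100", "1000", "rsvd")[(word >> 10) & 3],
        "full_duplex": bool(word & (1 << 12)),
        "acknowledge": bool(word & (1 << 14)),
        "link_up":     bool(word & (1 << 15)),
    }

# 0x4001 = bit 14 (acknowledge) + bit 0 (SGMII mode), all else clear
print(decode_sgmii_config(0x4001))
```

Note bit 5, the clause 37 full-duplex ability bit, doesn't appear at all: SGMII reserves it.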
Fixed a bunch of bugs in the SGMII block, the QSGMII-SGMII bridge, and even in ngscopeclient.
And the TX eye still isn't very pretty, I need to investigate that more.
But the QSGMII links are now alive! Let's see if I can actually pass traffic...
And it looks like the PHY is able to receive traffic! Haven't tested if it decodes properly in the FPGA etc, but the PHY is sending well formed QSGMII, the FPGA sees the link as up, and the decode in libscopehal is making sense of it.
Not sending anything yet. A lot more work needed on the switch logic in the FPGA to make *that* happen.
Continuing switch bringup work.
All ports (except the four VSC8512 interfaces which aren't responding over MDIO) have link state/speed working and queryable via the MCU.
Something is wonky with the basic status register, it's saying the link is half duplex even though it's negotiated to full duplex (in fact, only advertising full duplex). Not sure if this is a bug or what. Might have something to do with the 8051 microcode patch I haven't yet applied?
Spent a while today debugging on live hardware and finally reproduced the issue in simulation.
Packets more than 32 128-bit words long will fill the prefetch FIFO, but I never continue fetching data after that point. There's a big giant TODO comment where that logic should go. Oops.
Did a bunch of timing fixes and added some more pipeline stages. Latency is higher than I'd like now and I'll definitely want to work on reducing it, but it should do for a starting point.
Also did some per-link power estimates: about 13.3W in the current test configuration (management port, SFP+ uplink, and two VSC8512 edge ports active at 1 Gbps, no packet traffic).
This climbs to about 13.8W (+0.5W, so 0.25W per interface) if looping back two DP83867 interfaces, and 14W (+0.7W, so 0.35W per interface) looping back two VSC8512 interfaces.
With all links up, I thus project that the total board power consumption would climb to about 17.3W. This would likely increase a bit further with heavy traffic due to increased toggles on the SRAM bus etc.
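Back-of-envelope version of that projection. The split of remaining links (2 more DP83867 ports, 10 more VSC8512 ports) is my inference from the totals, not something stated directly; the per-interface deltas are from the loopback measurements above:

```python
# Back-of-envelope check of the all-links-up power projection.
# Assumed split of remaining links: 2 DP83867 + 10 VSC8512 (inferred).

base_w        = 13.3   # mgmt + SFP+ uplink + 2 VSC8512 ports at 1 Gbps
per_dp83867_w = 0.25   # (13.8 - 13.3) W / 2 interfaces
per_vsc8512_w = 0.35   # (14.0 - 13.3) W / 2 interfaces
extra_dp83867 = 2
extra_vsc8512 = 10

total = base_w + extra_dp83867 * per_dp83867_w + extra_vsc8512 * per_vsc8512_w
print(f"{total:.1f} W")   # ~17.3 W with every link up
```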
Not too bad for a ~16 port switch (counting management and uplink ports). I've also put zero effort into optimizing the FPGA design for power to date, so there's probably things I can do to improve there.
Off the top of my head:
* If an entire group of four baseT links is down or disabled, I can shut down the QSGMII SERDES
* If there's no traffic on the read side of the SRAM bus, I can disable the input terminations
* If there's no traffic on the write side of the SRAM bus, I might be able to tristate the bus except for control signals
* It might be possible to consolidate/optimize PLL configuration to use less PLLs
* There's definitely work to be done to use less long range high fanout clocks on the FPGA
* Improve gating of unused signals on wide buses etc to avoid propagation of toggles that don't do useful work
Always a fun day when you have to write code like this...
Hopefully this will give me a trigger condition that will let me figure out why my switch fabric is deadlocking trying to forward a packet without actually doing anything to it.
Welp. Somehow I'm trying to start forwarding from port #15.
Except I only have 15 ports (14 plus the uplink) and port numbers are zero based.
Looks like I was incrementing the round robin counter but forgot to add the "mod portcount" bit.
And apparently whatever logic Vivado synthesizes for accessing the 16th element of a 15-element vector resulted in the arbiter thinking it had data to send, entering the busy state, but then never getting a done signal.
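The bug boils down to a one-operator fix. A minimal sketch (names and structure illustrative, not the actual RTL):

```python
# Minimal sketch of the round-robin arbiter bug: incrementing the port
# counter without wrapping it at the port count. PORT_COUNT and the helper
# names are illustrative, not the actual SystemVerilog.

PORT_COUNT = 15  # 14 ports plus the uplink, numbered 0-14

def next_port_buggy(cur):
    return cur + 1                  # walks off the end to port 15

def next_port_fixed(cur):
    return (cur + 1) % PORT_COUNT   # wraps 14 -> 0

print(next_port_buggy(14))    # 15: indexes a 16th element that doesn't exist
print(next_port_fixed(14))    # 0
```

In hardware the out-of-range index doesn't throw an exception like this Python would hint at; the synthesized mux just returns whatever it returns, which is how the arbiter ended up busy with no done signal ever coming.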
And after a few more fixes, it's working!
Here an ARP frame shows up on port 0 (g0), is received via QSGMII, transferred to the core clock domain, processed through the SRAM FIFO (all offscreen).
Then at T=32 it's looked up in the MAC address table. At T=35 the table returns "not found", which makes sense since the destination is a layer 2 broadcast.
At T=39 a forwarding decision is made: the frame should be broadcast to all of VLAN 99 except for g0, where the frame came from. In this example config that's ports 5 (g5) and 14 (xg0).
Then at T=41 after some pipeline latency, data begins flowing.
It ends up in /dev/null for now because there's no exit queues between the frame_* control signals and the TX-side MAC IPs. But that's the only missing piece to make this a fully functional, if very basic, switch!
FPGA resource usage is growing, but things are still looking good in terms of being able to finish the job - and hopefully fit a full 24 port design in the same FPGA.
Current total fabric usage including the logic analyzer IP is 34% LUT, 23% FF, 39% BRAM, 6% DSP, 100% SERDES (duh), 65% IO, 53% global clocks, 25% MMCM/PLL.
One big unknown is how to scale the architecture up to 24 ports, since the current shared bus architecture is running close to its max performance with 14 ports and assumes a single memory channel. Refactoring this to work with a dual channel RAM controller will be interesting.
One "easy" option is to have essentially two independent sub-switches and a high bandwidth interconnect between them. But that might mean duplicating resources like the MAC address table.
Added exit queues and it's getting fuller. 38% LUT, 25% FF, 48% BRAM, 6% DSP, 100% SERDES, 65% IO, 53% BUFG, 25% MMCM / PLL.
Still missing VLAN tag insertion for outbound trunk ports (and some other logic to propagate VLAN tag information to support that) but in theory it should be capable of switching between access ports now. About to try in hardware, wish me luck!
And no go. My pings aren't being seen and I'm seeing no transmit activity on the QSGMII link.
But at least I have some idea of where to add on-chip debug probes to troubleshoot further.
Ok, turns out there is transmit activity but it's gibberish. Skipping data bytes or something.
Upon closer inspection it seems I had incorrect TX clock configuration (feeding TXUSRCLK with 156.25 MHz instead of 125) due to some confusing GTX configuration. Hopefully this will fix it...
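My guess at the arithmetic behind the mixup: QSGMII runs at 5 Gbps, and the required TXUSRCLK depends on the internal datapath width chosen in the transceiver wizard, so the two "plausible" frequencies are easy to confuse:

```python
# Likely arithmetic behind the TXUSRCLK mixup (my reconstruction):
# QSGMII is 5 Gbps, and the user clock is line rate / datapath width.

line_rate = 5_000_000_000  # QSGMII line rate, bits/sec

for width in (32, 40):
    print(f"{width}-bit datapath -> TXUSRCLK = {line_rate / width / 1e6} MHz")
# 32-bit -> 156.25 MHz, 40-bit -> 125.0 MHz
```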
It's alive!! First light on the switch passing packets!
When I ping flooded through it, it locked up and stopped forwarding traffic until I reloaded the FPGA. Probably related to one of the dozens of FIFO-full error handling code paths I haven't tested or fully implemented.
Still lots more work to do: VLAN tag insertion on outbound trunk interfaces, 10/100 support in the SGMII MAC, performance counters, tons of error handling, lots of CLI commands, investigating SI on the QSGMII TX diffpair, figuring out why g8-g11 aren't responding on MDIO, power integrity validation...
Found a few more thermometers on the board. Turns out in addition to the externally pinned out thermal diode on the VSC8512 (which I didn't hook up to anything) there is an (undocumented, but used in some example code I dug up) internal digital temperature sensor.
There's also one on the STM32.
Fixed a bunch of bugs and reduced latency of the QDR-II+ controller. End to end latency from read request to full burst data in hand - including PCB trace delays and clock domain crossing but not the additional pipeline stage for ECC - is now down to nine clocks at 187.5 MHz (48 ns). Probably more room to improve further on that but it's already way better than the 11-17 cycles I was seeing before with a less efficient CDC structure.
It no longer falls over instantly when ping flooded, however sustained floods (especially with preload) still make it start corrupting packets. So I've fixed the easiest-to-trigger bug and there's still more.
Debating how much time I want to spend chasing bugs in the current fabric architecture since I know it won't scale to 24 ports and barely makes timing as-is. Might just blow away everything between the input FIFOs and the MAC table and redo it clean slate.
Welp, seems I have a new bug: I'm reading a frame out of the input FIFO that's shifted by one word.
The first word of the packet (src/dest MAC address, ethertype, and first 4 bytes of payload) is gone (sent as part of the previous packet), then there's another word that I assume is the start of the subsequent packet at the end.
Seems to be triggered by heavy traffic like ping floods, but haven't caught it happening on the write side yet.
So far not sure if fifo pointers are getting desynced or if I'm writing bad data out of the CDC.
Nope, the SRAM FIFO is fine. Garbage in, garbage out.
So the problem is happening earlier on, in the CDC or maybe as I'm filling buffers to be written to SRAM?
Yeeep, it's something in the CDC FIFO (or the logic interfacing with it).
When the packet that actually goes sideways starts, there's already six words of data in the CDC buffer. But all of the other state - most notably packet metadata with length, vlan ID, etc - is missing, so that data gets ignored and isn't popped until more data shows up, at which point you get a hodgepodge of both packets.
Still don't know which clock domain the actual bug is in so this will be fun...
Oops it's 3:30 AM and I have to be awake for work tomorrow... But I think I found the bug.
If I'm right it's one of those "how did this ever work" moments. Very confused as to how ping flooding makes it fail, it seems like it should *always* fail with packets of a certain length mod 16.
Nope, that wasn't it. But it put me on the trail of the actual bug.
Not one but *two* packets before SHTF, something goes wrong. There's nothing in the metadata fifo, there's nothing visible on the read side of the data fifo, but the *write* side of the data fifo shows 506 free words, out of a capacity of 512.
Meaning something pushed six words into it, then (for at least the few hundred clocks I have data captured for), never asserted the "commit" flag.
This CDC FIFO has a commit/rollback mechanism intended to be used for store-and-forward packet processing; the write side maintains a private write pointer that is only pushed to the read side when you hit "commit". Until then, the available space is decreased but the read side still shows empty.
The intent is to commit on end of packet with valid FCS and roll back on end of packet with invalid FCS, or if the FIFO fills prior to the end of a packet. Having stale data in the buffer that never gets committed/rolled back SHOULD be impossible...
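A toy software model of those commit/rollback semantics, matching the stuck state from the capture (all names illustrative, not the RTL):

```python
# Toy model of the CDC FIFO's commit/rollback write side: pushes land behind
# a private write pointer and only become visible to the read side on
# commit(); rollback() reclaims them. Names are illustrative.

class CommitFifo:
    def __init__(self, depth):
        self.depth = depth
        self.data = []       # committed words, visible to the read side
        self.pending = []    # pushed but not yet committed

    def push(self, word):
        assert len(self.data) + len(self.pending) < self.depth
        self.pending.append(word)

    def commit(self):        # e.g. end of packet with good FCS
        self.data.extend(self.pending)
        self.pending.clear()

    def rollback(self):      # e.g. bad FCS, or FIFO filled mid-packet
        self.pending.clear()

    def free(self):          # write side counts pending words as used
        return self.depth - len(self.data) - len(self.pending)

    def empty(self):         # read side only sees committed words
        return not self.data

fifo = CommitFifo(512)
for w in range(6):
    fifo.push(w)
# The stuck state from the capture: 506 free, yet the read side shows empty
print(fifo.free(), fifo.empty())   # 506 True
```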
And here's the root cause: https://github.com/azonenberg/latentpacket/commit/15a9c4359809ae00801205d9f1fa73a02463f06d
The VLAN tag removal logic on the input side, between the MAC and the CDC FIFO, was failing to forward the "drop" flag. So any time a packet had a FCS failure, the metadata would be discarded and the packet content would be prepended to the next valid packet.
This solves the "ping -f" hang; I just did a test of 100K pings with only 25 drops and it was still working fine after that.
This now raises two new questions:
1) Why did I still lose 25 packets? Judging by the previous bug, at least some are getting FCS errors. Is this signal integrity on the QSGMII link, a logic bug in the MAC, or something else?
2) When I ping flood with preload, i.e. ping -f -l 50, the switch still hard locks up pretty quickly. So I have a second, likely unrelated bug caused by a lot of packets in quick succession.
Looks like the incoming data is occasionally (25 of 100K packets in my last test) getting corrupted somewhere between the upstream switch MAC and my 32 bit MAC data bus.
In between:
* Switch PHY
* On rack patch cable
* Plant cable
* Bench patch cable
* Magjack and PCB
* VSC8512
* QSGMII link to 7 series GTX
* My QSGMII to SGMII demux
* My SGMII PCS
* My GMII MAC
Suspecting something in the serdes/QSGMII region, but not sure yet.
Closing in on this bug.
The data coming off the PHY is fine, verified by sniffing and protocol decoding the QSGMII link.
The data entering the decode side of the PCS (after elastic buffer shifting from SERDES clock domain to MAC clock domain) is wrong.
First guess: something in that buffer is borked and it's filling up, rather than dropping idles between packets when it gets too full like it's supposed to. If the remote side of the link has a clock a few ppm faster than the FPGA, the FPGA will have to occasionally drop idles to rate match. If that logic is broken we'll just see random bytes of data not show up when they should.
Hmmmm. It helps if your elastic buffer drops extra idle ordered sets when it's almost *full*.
Not when almost *empty*. 🤦♂️
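The rate-matching rule, sketched out (thresholds and names are illustrative, not the actual elastic buffer):

```python
# Sketch of the elastic buffer rate-matching rule that was inverted: when
# the far end's clock is a few ppm fast, the buffer slowly fills, so idle
# ordered sets must be dropped when occupancy is HIGH, not low.

DEPTH = 32
DROP_THRESHOLD = DEPTH - 4   # "almost full"; exact value is illustrative

def should_drop_idle(occupancy, is_idle, buggy=False):
    if not is_idle:
        return False            # never drop packet data to rate match
    if buggy:
        return occupancy <= 4   # dropped when almost EMPTY: buffer overflows
    return occupancy >= DROP_THRESHOLD

print(should_drop_idle(30, True))          # True: correct, sheds idles
print(should_drop_idle(30, True, True))    # False: buggy version overflows
```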
OK, this one is interesting.
The switch is forwarding packets that are completely correct except for the first 16 bytes, which at first glance appear to be gibberish.
The 16 byte size is a clue, since most of the fabric and the external packet buffer SRAM are using a 128-bit datapath, while the MAC/PCS blocks are narrower (8-32 bits at various spots).
So the problem here is likely a lot closer to the core than the previous bug.
When your 16K entry FIFO has 16388 free spots in it, that's awesome!
It's a TARDIS or something, bigger on the inside than the outside. ... right?
Switch fabric reliability is improving! I'm now needing heavier and heavier loads and triggering less frequent bugs.
The one I'm chasing now involves a port getting stuck in the PREFETCH state, indicating it's asked for data from external RAM but it got less data than it expected.
I'm actually getting up to a pretty decent link utilization with this ping flood. Far from saturating the pipe, but looks like maybe 20-30% ish?
Pretty sure I have a root cause on this one already. Just took a few P&R runs to get probes on the right signals.
I cleared the prefetch-in-progress combinatorially on the last cycle of a prefetch to enable gap-free transitioning to a second prefetch on a different port.
But when I started a prefetch I'd also start a read request to the RAM that cycle. So if this happened the second prefetch would steal the bus cycle from the first.
The fix is simply to not do that, and wait until next cycle to fetch the next word. As a bonus, this eliminates a critical path I was worried about.
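A toy timeline of the collision as I understand it (my reconstruction; cycle numbers and the 4-beat burst are illustrative):

```python
# Toy timeline of the arbiter bug: busy deasserts combinationally on a
# prefetch's last beat, so a buggy arbiter starts the next prefetch - and
# issues its RAM read - on that same cycle, stealing the bus from the
# first prefetch's final word.

FIRST_BURST = [0, 1, 2, 3]   # cycles the first prefetch needs the bus

def second_prefetch_read_cycle(buggy):
    last = FIRST_BURST[-1]
    return last if buggy else last + 1   # fix: wait one cycle

print(second_prefetch_read_cycle(True) in FIRST_BURST)   # True: stolen cycle
print(second_prefetch_read_cycle(False))                 # 4: no collision
```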
Yep, that was the bug. Seemed to fix the other packet corruption problem I had been chasing as well.
So at this point there are no known bugs in the fabric and it's time to work on building other stuff.
I still need a bazillion performance counters to evaluate how things are going as I push the fabric to heavier loads, plus a lot of debug features for things like printing PHY status registers in human readable form.
Adding performance counters and a bunch of other debug features is gradually increasing FPGA resource usage to a concerning level.
Fitting the rest of LATENTPINK is not going to be a problem, but there won't be a whole lot of free space.
I could probably... probably... shoehorn a full 24+1/24+2 port LATENTRED design into the 7k160t if I really squeezed. But I'd have to start cutting features and I'd have no room for e.g. potential layer 3 processing or ACLs in the future.
The question then becomes, what do I replace it with? I want "comfortably more" than the 100K LUTs of the 7k160t, enough high performance IO for two channels of QDR-II+, and at least eight transceivers.
The XC7K325T is out, I want to stay with free Vivado for F/OSS friendliness reasons (and to avoid increasing the already significant project budget by another $3K), so there's no path forward using 7 series.
Assuming I stay Xilinx, that means UltraScale or UltraScale+.
And if I limit myself to parts supported by free Vivado, that leaves five options: XCAU25P, XCKU025, XCKU035, XCKU3P, XCKU5P.
The AU25P is by far the least expensive (XCAU25P-1FFVB676E is $427 at Digikey) and I have two in inventory already. It's got 40% more LUT capacity than the 7k160t, but slightly *less* block RAM, and a lot less IO: 208 HP and 96 HD. I'd need 196 HP for the RAM, leaving 12 left: enough for clock and Vref and that's about it.
Which leaves me HD pins for interfacing with the MCU, maybe driving some indicator LEDs, and boot flash. But for a 24+2 port design I only need 6 GTs for QSGMII and 2 for 10G, so I'd have four extras.
Which is good because RGMII would really be pushing limits for HD I/O, and free GTs would let me use a SGMII PHY instead.
So as long as I can get by with 300 BRAMs (I'm using 157 in LATENTPINK including the management engine and MAC table which don't scale with interface count, so should be doable?) I think I've got a good shot.
The XCKU025 is a lot pricier (XCKU025-1FFVA1156C is $1288 at Digikey). 45% bigger than the 7k160t, so almost the same size as the au25p, but has 360 BRAMs - a nice increase over the AU25P.
It also has 208 HP IOs, but has 104 HR IOs instead of slow UltraScale+ HD IOs (which should have no trouble doing RGMII for the management port).
Fabric performance might actually be a little slower than the AU+ since it's 20nm rather than 16nm, but both should be comfortably faster than the 28nm 7k160t.
Also, the AU25P is the biggest AU+ device so there's no upgrade path if I outgrow it, while the KU025 FFVA1156 package is pin compatible with the KU035.
Interestingly, though, the KU025 is *not* offered in any of the lower pin count packages like FBVA676. So if I went with the Kintex UltraScale route I'd need a PCB with enough layer capacity to fan out an 1156 ball package.
The XCKU3P is even more expensive (XCKU3P-1FFVB676E is $1491 at Digikey), and 60% larger than the 7k160t, also with 360 BRAMs (same capacity as the KU025), but it also has 48 UltraRAMs so the total usable on-die memory capacity is more than doubled.
Most interestingly, the FFVB676 package is pin compatible with the XCAU25P if I'm reading the docs correctly (but has only 72 HD IOs vs 96, so if I wanted the PCB to be compatible I'd need to avoid the last 24 sites).
But this leaves open the possibility that I could design LATENTRED with the intention of using the AU25P, with potential to scale up to the KU3P or even KU5P if I ran out of fabric resources without having to respin the PCB.
Well that was weird. Something I did apparently resulted in Vivado unplacing all of my I/O pins?? Never had that happen before.
I have all of the old pinout constraints in Git so it's not a huge deal, but wasted a P&R run finding it out after bitstream generation failed.
@jpm Iperf will happen once I'm ready to stress it to the max.
Ping is easier for debug since the packets are serialized and I get nice feedback as to which ones didn't make it, which I can cross-check against scope/LA captures to figure out where things went bad.
@jpm This is also a single port pair test (upstream -> g2, laptop -> g0) with no other ports participating.
For a more proper stress test I'll make a bunch of vlans and add daisy-chain cables so a frame might come in g0, out g2, in g4, out g6, in g8, out g10, in g12, out g14. This will create a lot more load on the fabric without me having to hook up a dozen separate machines running separate iperf servers etc.
But I still can't run the fabric beyond 50% load until I finish reworking all of the odd-numbered ports (in the upper row) to fix the pin swaps. I did one as a test to confirm that this was the only problem, but still have to do the other six.
@azonenberg At least try far-end loopback and see if the cabling, jack, and MDI end is good because that's one MDIO command and some ping and iperf on the other end.
"far-end loopback testing feature is enabled by setting register bit 23.3 to 1"
@mwick83 The CDC is a FIFO too.
It was gonna be the fifo one way or another, just a question of which one :)
The overall switch fabric architecture is roughly one small input CDC FIFO per port, one big FIFO in external RAM, one small exit CDC FIFO per port.
Right now I'm pointing fingers somewhere between the input MAC and the data written to the external RAM FIFO but haven't tracked down exactly where yet.