Mastodawn

Ah yes, because the world was desperately incomplete without a way to hash a 25-byte string in merely 68 clock cycles. 😴🔧 Meanwhile, the rest of us are still waiting for the riveting sequel where we parallelize the #parallelization of parallelizing. 🚀💼
https://www.controlpaths.com/2025/06/29/parallelizing_sha256-calculation-fpga/ #hashing #innovation #tech #humor #developer #life #HackerNews #ngated

Parallelizing SHA256 calculation on FPGA

A few weeks ago, I wrote an article where I developed a hash calculator on an FPGA. Specifically, I implemented a SHA-256 calculator. This module computes the hash of a string (up to 25 bytes) in 68 clock cycles. The design leverages the parallelism of FPGAs to compute the W matrix and the recursive rounds concurrently. However, it produces only one hash every 68 clock cycles, leaving most of the FPGA underutilized during that time.

In this article we are going to improve the performance of that system by adding a set of hash calculators, so that several hashes can be computed at the same time. The next diagram shows the structure of the project.

I needed to change the hash calculator module to optimize it. If you remember the SHA-256 algorithm, it needs a set of pre-computed values, the K matrix. In this project, that matrix is not inside the SHA core; instead, it sits at the top level, where all the hash cores have access to it. This way only one K matrix has to be stored. In addition, the initialization of the W matrix values is performed in parallel, eliminating the AXI Stream interface. These two changes reduce the logic used by the core and improve its performance. This new SHA core is named sha256_core_pif (pif means parallel interface).
    module sha256_core_pif (
      input wire aclk,
      input wire aresetn,

      /* input data channel */
      input wire [31:0] string_w0,
      input wire [31:0] string_w1,
      input wire [31:0] string_w2,
      input wire [31:0] string_w3,
      input wire [31:0] string_w4,
      input wire [31:0] string_w5,
      input wire [31:0] string_w6,
      input wire [31:0] string_w7,
      input wire [31:0] string_w8,
      input wire [31:0] string_w9,
      input wire [31:0] string_w10,
      input wire [31:0] string_w11,
      input wire [31:0] string_w12,
      input wire [31:0] string_w13,
      input wire string_dv,
      output wire string_ready,
      input wire [7:0] string_size,

      output reg [6:0] round,
      input wire [31:0] k_round,

      /* output data channel */
      output reg sha256_dv,
      output reg [255:0] sha256_data
    );

Then, a module called SHA256_manager was added to coordinate all the cores and feed them with the appropriate input values.

The application I implemented is a simple hash cracker or password cracker. It receives a SHA-256 hash and attempts to recover the original string that generated it. This cannot be solved analytically; instead, the SHA256_manager iteratively hashes candidate strings, starting from the first printable character. It then increments the character until it reaches the last one, at which point it appends a new character and restarts the process.

There are 95 printable ASCII characters. This means the system must compute 95 hashes for strings of length 1, 95^2 = 9 025 for two-character strings, and 95^3 = 857 375 for three-character strings. In general, the number of required hashes is 95^n for strings of length n.

Every sha256_core_pif returns the hash it calculated, and the SHA256_manager compares each of them with the received hash. If one of them matches, the string sent to the first sha256_core_pif is sent to the host computer, together with the number of the sha256_core_pif that computed the correct hash. This way, the host computer can obtain the correct string.

The project uses the LiteFury board connected to a Raspberry Pi 5 over PCIe.
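The search-space figures above can be checked with a few lines of Python. This is only a minimal sketch and not the SHA256_manager logic: it assumes the 95 printable characters are ASCII 0x20 (space) through 0x7E (~), and the names `PRINTABLE`, `candidates`, and `search_space` are hypothetical. The order in which the hardware steps through candidates may differ from the order used here.

```python
from itertools import product

# Assumption: "printable ASCII" means 0x20 (space) through 0x7E (~), 95 characters.
PRINTABLE = [chr(c) for c in range(0x20, 0x7F)]


def candidates(length):
    """Yield every printable-ASCII string of exactly the given length."""
    for combo in product(PRINTABLE, repeat=length):
        yield "".join(combo)


def search_space(length):
    """Number of candidates of exactly this length, i.e. 95 ** length."""
    return len(PRINTABLE) ** length


if __name__ == "__main__":
    # Print the number of hashes needed for lengths 1 to 3.
    for n in (1, 2, 3):
        print(n, search_space(n))
```

Because the count grows as 95^n, each extra character multiplies the work by 95, which is why spreading candidates over several cores pays off quickly.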
In the next diagram you can find the Vivado block design. To meet the timing requirements, I needed to reduce the AXI clock speed to 62.5 MHz. Using this configuration, I was able to integrate 12 sha256_core_pif modules.

Regarding the utilization of the FPGA, you will see that it is far from full; the limiting factor was meeting the timing requirements. With 12 accelerators and a clock speed of 62.5 MHz, all the requirements were met.

On the host side, I created a Python driver to manage the LiteFury. I used the xDMA drivers from Xilinx with the modification we made in this article. Now, the Python driver just needs to open the /dev/xdma0_user peripheral and write the registers according to the register map of the AXI peripheral.

    import mmap
    import os
    import struct

    # Methods of the driver class (the class header and the
    # AXI_PERIPH_OFFSET constant are not shown in this excerpt).
    def __init__(self, uio_path="/dev/xdma0_user", map_size=0x20000):
        # Open the xDMA user device and map its register space
        self.fd = os.open(uio_path, os.O_RDWR | os.O_SYNC)
        self.map_size = map_size
        self.m = mmap.mmap(self.fd, self.map_size, mmap.MAP_SHARED,
                           mmap.PROT_READ | mmap.PROT_WRITE, offset=0)

    def close(self):
        self.m.close()
        os.close(self.fd)

    def write(self, addr, value):
        # Write one 32-bit register
        self.m.seek(addr + self.AXI_PERIPH_OFFSET)
        self.m.write(struct.pack("<I", value))  # Little endian

    def read(self, addr):
        # Read one 32-bit register
        self.m.seek(addr + self.AXI_PERIPH_OFFSET)
        return struct.unpack("<I", self.m.read(4))[0]

As I mentioned before, to obtain the final string, we need to read the registers that hold the resulting string and add the number of the winning module.
    def get_password(self, winner):
        pw = b''
        for addr in self.REG_R:
            word = self.read(addr)
            pw += word.to_bytes(4, 'big')

        # Add the value of the winner as integer to the resulting string
        pw_int = int.from_bytes(pw, 'big') + winner

        # Convert the result to bytes
        pw_bytes = pw_int.to_bytes(len(pw), 'big')

        # Invert the order of the result
        pw_bytes = pw_bytes[::-1]

        # ASCII decoding
        return pw_bytes.rstrip(b'\x00').decode('ascii', errors='ignore')

To test the project, I created another Python script that calculates the SHA-256 of a string (this can also be done using the OpenSSL library). The calculated hash is then sent to the accelerator, which returns the initial string.

    ~/pass_cracker/python $ python3 sha256_comp.py eoi
    SHA-256 of 'eoi': 7c02b8671bb4824e1cea44af7b628e88b81495699d5e9cb0e2533af99320a81b
    ~/pass_cracker/python $ sudo python3 pass_cracker.py 7c02b8671bb4824e1cea44af7b628e88b81495699d5e9cb0e2533af99320a81b
    Password: eoi

Projects like this can be quite impressive to engineers unfamiliar with FPGAs. The ability to accelerate SHA-256 computation by performing different tasks in parallel, and even by using multiple hash calculators simultaneously, often sparks curiosity and interest in FPGA technology. The role of FPGAs in fields like cryptography and cybersecurity is expected to grow significantly in the coming years, as increasingly faster and more flexible systems are required.

All the files of this project are shared in the controlpaths GitHub.

Are you involved in a cryptography project and want to know if an FPGA could help? Contact me.
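The test flow above (hash a known string, then recover it from the hash) can be mirrored in software as a slow reference to compare against the accelerator's answer. The sketch below is not the author's sha256_comp.py or pass_cracker.py: it uses only Python's standard `hashlib`, assumes printable ASCII means 0x20 through 0x7E, and the names `sha256_hex` and `crack` are hypothetical.

```python
import hashlib
from itertools import product

# Assumption: candidates are drawn from ASCII 0x20 (space) through 0x7E (~).
PRINTABLE = [chr(c) for c in range(0x20, 0x7F)]


def sha256_hex(text):
    """Return the SHA-256 digest of an ASCII string as lowercase hex."""
    return hashlib.sha256(text.encode("ascii")).hexdigest()


def crack(target_hex, max_len=3):
    """Brute-force the string whose SHA-256 equals target_hex.

    Tries every printable string of length 1..max_len and returns the
    first match, or None if no candidate in that range matches.
    """
    target_hex = target_hex.lower()
    for length in range(1, max_len + 1):
        for combo in product(PRINTABLE, repeat=length):
            candidate = "".join(combo)
            if sha256_hex(candidate) == target_hex:
                return candidate
    return None


if __name__ == "__main__":
    digest = sha256_hex("ab")
    print(digest)
    print(crack(digest, max_len=2))
```

A single-threaded loop like this is fine for two or three characters; it exists only to cross-check results, since parallel hardware is the point of the article.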

controlpaths.com

Rowan the Selfsame Jun 21

Link: https://mediatum.ub.tum.de/?id=601795 (It took digging to find this from the Wikipedia article [1] and the unsecured HTTP homepage for "BMDFM".)

```bibtex
@phdthesis{dissertation,
author = {Pochayevets, Oleksandr},
title = {BMDFM: A Hybrid Dataflow Runtime Parallelization Environment for Shared Memory Multiprocessors},
year = {2006},
school = {Technische Universität München},
pages = {170},
language = {en},
abstract = {To complement existing compiler-optimization methods we propose a programming model and a runtime system called BMDFM (Binary Modular DataFlow Machine), a novel hybrid parallel environment for SMP (Shared Memory Symmetric Multiprocessors), that creates a data-dependence graph and exploits parallelism of user application programs at run time. This thesis describes the design and provides a detailed analysis of BMDFM, which uses a dataflow runtime engine instead of a plain fork-join runtime library, thus providing transparent dataflow semantics on the top virtual machine level. Our hybrid approach eliminates disadvantages of the parallelization at compile-time, the directive based paradigm and the dataflow computational model. BMDFM is portable and is already implemented on a set of available SMP platforms. The transparent dataflow paradigm does not require parallelization and synchronization directives. The BMDFM runtime system shields the end-users from these details.},
keywords = {Parallel computing;Shared memory multiprocessors;Dataflow;Automatic Parallelization},
note = {},
url = {https://mediatum.ub.tum.de/601795},
}
```

[1]: https://en.wikipedia.org/wiki/Binary_Modular_Dataflow_Machine

#SMP #Parallelization #Multithreading #DependenceGraph #RunTime #DataFlow #VirtualMachine #VM #ParallelComputing #SharedMemoryMultiprocessors #AutomaticParallelization #CrossPlatform #Virtualization #Configware #Transputer

mediaTUM - Media and Publication Server

N-gated Hacker News May 28

Ah, yes, because nothing says "cutting-edge tech" like juggling Git worktrees and #Tmux while your AI coding agent goes "brrr" 🙄. Truly groundbreaking stuff: discovering #parallelization in 2024 like it's a rare species. 🚀🔧
https://www.skeptrune.com/posts/git-worktrees-agents-and-tmux/ #cuttingEdgeTech #GitWorktrees #AICodingAgent #HackerNews #ngated

LLM Codegen go Brrr – Git Worktrees + Tmux | Category | Trieve

If you're underwhelmed with AI coding agents or simply want to get more out of them, give parallelization a try. After seeing the results firsthand over the past month, I'm ready to call myself an evangelist. The throughput improvements are incredible, and I don't feel like I'm losing control of the codebase.

Nick Khami's Blog

Hacker News May 28

LLM Codegen go Brrr – Parallelization with Git Worktrees and Tmux

https://www.skeptrune.com/posts/git-worktrees-agents-and-tmux/

#HackerNews #LLM #Codegen #Tmux #Git #Worktrees #Parallelization

LLM Codegen go Brrr – Git Worktrees + Tmux | Category | Trieve

Nick Khami's Blog

N-gated Hacker News May 20

🐲 Oh, look! Someone spent their free time rendering 27,000 #dragons and 10,000 lights on a GPU—because that's the absolute pinnacle of #productivity, right? 🤖 Just what the world needed: another article about high-performance #parallelization strategies that only 0.000001% of the population will pretend to understand! 😂
https://logdahl.net/p/gpu-driven #GPU #techhumor #HackerNews #ngated

27'000 dragons and 10'000 lights: GPU-Driven Clustered Forward Renderer | logdahl.net

Olle Lögdahl's web ventures

Paul Grizzaffi Jun 14, 2024

I'm excited to be heading back to an in-person conference! QA or the Highway is a week from today. There's still time to join us. Come hear me talk about #parallelization, #TestData, and #risk with our #SoftwareTesting #Automation and #DevOps. If you attend, please come say hi!

https://www.qaorthehwy.com/paul-grizzaffi/bad-tests-running-wild-concurrency-test-data-and-minimal-human-interaction-in-test-automation-dev-ops

Paul Grizzaffi - QA or the Highway

Pawel Kozielecki Jun 10, 2024

👋 Have you considered making your #iOS #app #modular 🛠? This approach has recently become very popular, offering various benefits like efficient work #parallelization 🏎️, #interchangeability 🧩, superior #separation of #concerns 🧠, and stricter adherence to #clean #code principles 👮‍♂️. Although modularity is hardly rocket science, it can be more complex to set up. Are there apps that shouldn't use a modular design? Let's find out! Strap in and let’s take a look 🚀 https://swiftandmemes.com/how-to-build-a-robust-and-scalable-modular-ios-app/

How to build a robust and scalable modular iOS app? ‣ Swift and Memes

What does the setup of modular iOS apps look like in practice? What types of modules are there? How can different app features be made to communicate?

Swift and Memes ‣ iOS, Swift, Good Practices - explained with memes!


Christian Meesters Mar 8, 2024

We had a registration from Chicago, IL, USA. When we followed up, the student noticed that there are a few miles of water between Chicago and Mainz, Germany. He will probably be better off looking for a similar course in the US.

BTW, there are still some places available. #MPI #OpenMP #Parallelization for #C, #cpp, #Fortran and #Python for #HPC software.

Some Bits: Nelson's Linkblog Mar 3, 2024

One billion rows in Go: Nice explication of optimization and profiling techniques
https://benhoyt.com/writings/go-1brc/
#parallelization #optimization #programming #hashtables #golang #go #+

The One Billion Row Challenge in Go: from 1m45s to 4s in nine solutions

How I solved the One Billion Row Challenge (1BRC) in Go nine times, from a simple unoptimised version that takes 1 minute 45 seconds, to an optimised and parallelised version that takes 4 seconds.

kghose Jan 27, 2024

Ok, I'm a total `go` fan now. A hopeless groupie, after having seen how easy it is to write parallel code.

#golang #programming #concurrency #parallelization

https://kaushikghose.wordpress.com/2024/01/26/golang-7-parallelization/

golang (7): parallelization

Ok, I’m a total go fan now. Hopelessly enamored. My application is simple, it’s an embarrassingly parallel operation writing to independent parts of an array, but I’ve never had su…

Pages from the fire