This was extremely frustrating because parallelizing step A would bring runtime down from 10+ seconds to under 1 second (yes, it's a beefy machine), but it was definitely not worth it to do that if step C would then be some 10 times (if not more) slower, with low CPU usage to boot.

I have not investigated in detail the reasons for this massive slowdown. I suspect step C in general was slow because of (parallel and) dynamic memory allocation to build the per-G-element lists plus the need for each element to run over the whole L list to find the relevant elements.

I still don't know why it would become slower when the L was generated in parallel, but I actually found a solution that makes step C much simpler: sorting L by the second index in the tuples (the j) makes sure that the list can be trivially split (each section assigned to the proper j) simply by cutting it up at the boundaries when the j changes.
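A minimal sketch of the sort-and-split idea, written in Python rather than Julia for illustration. The function name `group_by_j` and the sample data are hypothetical and not taken from the original code; the only assumption is that L holds tuples (i, j, r).

```python
from itertools import groupby
from operator import itemgetter


def group_by_j(L):
    """Sort the tuples (i, j, r) by j, then cut the sorted list wherever j changes.

    Returns a dict mapping each j to the list of (i, r) pairs related to it.
    """
    # sorted() is stable, so tuples with the same j keep their original order.
    L_sorted = sorted(L, key=itemgetter(1))
    # groupby() yields one run per distinct j because equal j values are now adjacent.
    return {
        j: [(i, r) for i, _, r in run]
        for j, run in groupby(L_sorted, key=itemgetter(1))
    }


# Hypothetical sample: three relations involving two elements of G.
L = [(1, 2, 0.5), (2, 1, 0.1), (3, 2, 0.7)]
print(group_by_j(L))
```

Each tuple is visited once after the sort, so the cost is dominated by the O(n log n) sort instead of one full scan of L per element of G.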

Sorting itself is extremely fast even on a list with millions of elements, and with this change step C takes less than a second *without parallelization*. It's not even worth parallelizing it anymore (in contrast to step A that does significantly benefit from it).

2/2

#Julia #parallelization

I had a weird experience with #Julia and #parallelization.

I had to collect all the pairs of items in two sets (S and G) that satisfied a particular relation function R. For Reasons™, it's easier to find these pairs by looking at each element of S and finding which elements of G are in relation with it, but this information is better used in reverse (i.e. on use we want, for any given element of G, to find which elements of S are in relation with it).

So my idea was to build a set of tuples (i, j, r) where i is the index of an element of S, j the index of an element of G, and r the result of applying R to these two elements (which gives a value I need “on use” of this information I'm collecting).

To build this set of tuples, I went with (step A) a function mapping each index i to the list of tuples where the relation was possible, and then (step B) merge the lists together into a single list L.

Then (step C) for each index j of G, the approach was to find_all elements in L in which the second index was j, and thus have for each element of G the elements of S it was in relation with, including the relation function value, just as I wanted.
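The three steps can be sketched as follows, in Python rather than Julia and purely for illustration. The relation used here (absolute difference, accepted when at most 1.0) and all function names are hypothetical stand-ins, since the real R and the real code are not shown in the thread.

```python
def relation(s, g):
    """Hypothetical stand-in for R: returns a value, or None if s and g are unrelated."""
    d = abs(s - g)
    return d if d <= 1.0 else None


def step_a(i, S, G):
    """For one index i of S, list the tuples (i, j, r) where the relation holds."""
    return [(i, j, r) for j, g in enumerate(G)
            if (r := relation(S[i], g)) is not None]


def step_b(per_i_lists):
    """Merge the per-i lists into a single list L."""
    return [t for lst in per_i_lists for t in lst]


def step_c(L, n_g):
    """For each index j of G, scan the whole of L for tuples whose second index is j."""
    return [[(i, r) for (i, j2, r) in L if j2 == j] for j in range(n_g)]


S = [0.0, 2.0]
G = [0.5, 3.0]
L = step_b(step_a(i, S, G) for i in range(len(S)))
print(step_c(L, len(G)))
```

Note that `step_c` walks the full list L once per element of G, so its cost grows with the product of the two sizes.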

My original implementation was SLOW. Step A would take some 20 seconds, which I was able to bring down to 10 with optimizations, but step C would take between 30 and 130 seconds *using parallelization*.

The worst of it? I could trivially parallelize step A, but in this case step C would take FOREVER (stopped waiting after 15 minutes).

1/n

田中義弘 | taziku CEO / AI × Creative (@taziku_co)

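Replit has introduced its Parallel Agents feature. Up to 10 AI agents implement the same app simultaneously in separate environments, parallelizing the search space, and the results are then combined through an agentic merge. It looks like an improvement to the AI coding workflow that could greatly increase development efficiency.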

https://x.com/taziku_co/status/2054052734408835121

#replit #aiagents #coding #parallelization #developertools

田中義弘 | taziku CEO / AI × Creative (@taziku_co) on X

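This is not development, it is parallelization of the search space. With @Replit Parallel Agents, up to 10 AI agents build the same app at the same time in separate environments. After the parallel implementation, the last step is an agentic merge. I would like to try it myself to see how much the accuracy improves.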

X (formerly Twitter)

Yihua Wei [Advisor: Peng Jiang] will defend his doctoral thesis entitled "Runtime and Compiler Optimizations for Subgraph Matching Algorithms on GPUs" on Monday 4/6 at 4pm.

Deets at https://bit.ly/wei_4_6

#FinalExam #PhDLife #UIowaGrad26 #parallelization #optimization

Ah, another #GitHub wonder 🥱: #Forkrun claims to be the turbocharged, NUMA-aware, bash-native parallelizer we've all been waiting for 🎉... because dealing with threading complexities wasn't hard enough already, right? 🤦‍♂️ Just what we needed—another inscrutable tool promising to revolutionize workflows, while managing to bewilder mere mortals. 🚀
https://github.com/jkool702/forkrun #NUMA #parallelization #workflow #tools #tech #news #HackerNews #ngated
FLOSS Weekly Episode 862: Have Your CAKE And Eat It Too

This week Jonathan chats with Toke Hoiland-Jorgensen about CAKE_MQ, the newest Kernel innovation to combat Bufferbloat! What was the realization that made CAKE parallelization? When can we expect i…

Hackaday

#fura-utils
added #parallelization to the #opus conversion #bash script, and created a #flac conversion one,
you can find them as `fura-2opus` and `fura-2flac`, have fun mass converting! 😉
https://github.com/FraYoshi/fura-utils/commit/403e13e9fc7684f2552e7e0c2b059970f11b4e2c

oh, don't forget to install `parallel`, it is now a requirement for this script to work!

There was something interesting going on on one of my systems:

If a certain function in #Python was called as a separate process with the #multiprocessing library, then the sort_values function of #Pandas would just hang (and therefore the process would never produce the output I was waiting for). Called from the main process was OK.

The solution was to change the sorting algorithm by the `kind="stable"` parameter. Weird.

#ArchLinux #Linux #programming #parallelization

Ah yes, because the world was desperately incomplete without a way to hash a 25-byte string in merely 68 clock cycles. 😴🔧 Meanwhile, the rest of us are still waiting for the riveting sequel where we parallelize the #parallelization of parallelizing. 🚀💼
https://www.controlpaths.com/2025/06/29/parallelizing_sha256-calculation-fpga/ #hashing #innovation #tech #humor #developer #life #HackerNews #ngated
Parallelizing SHA256 calculation on FPGA

A few weeks ago, I wrote an article where I developed a hash calculator on an FPGA. Specifically, I implemented an SHA-256 calculator. This module computes the hash of a string (up to 25 bytes) in 68 clock cycles. The design leverages the parallelism of FPGAs to compute the W matrix and the recursive rounds concurrently. However, it produces only one hash every 68 clock cycles, leaving most of the FPGA underutilized during that time. In this article we are going to improve the performance of that system by adding a set of hash calculators so that it can compute several hashes at the same time. The next diagram shows the structure of the project.

I needed to change the hash calculator module to optimize it. If you remember the SHA-256 algorithm, it needs a set of pre-computed values, the K matrix. In this project, that matrix is not inside the SHA core; instead it sits at the top level, where all the hash cores have access to it. This way only one K matrix has to be stored. In addition, the initialization of the W matrix values is performed in parallel, eliminating the AXI Stream interface. These two changes reduce the logic used by the core and improve its performance. This new SHA core is named sha256_core_pif (pif means parallel interface).
```verilog
module sha256_core_pif (
  input wire aclk,
  input wire aresetn,

  /* input data channel */
  input wire [31:0] string_w0,
  input wire [31:0] string_w1,
  input wire [31:0] string_w2,
  input wire [31:0] string_w3,
  input wire [31:0] string_w4,
  input wire [31:0] string_w5,
  input wire [31:0] string_w6,
  input wire [31:0] string_w7,
  input wire [31:0] string_w8,
  input wire [31:0] string_w9,
  input wire [31:0] string_w10,
  input wire [31:0] string_w11,
  input wire [31:0] string_w12,
  input wire [31:0] string_w13,
  input wire string_dv,
  output wire string_ready,
  input wire [7:0] string_size,

  output reg [6:0] round,
  input wire [31:0] k_round,

  /* output data channel */
  output reg sha256_dv,
  output reg [255:0] sha256_data
);
```

Then, a module called SHA256_manager was added to coordinate all the cores and feed them with the appropriate input values.

The application I implemented is a simple hash cracker or password cracker. It receives a SHA-256 hash and attempts to recover the original string that generated it. This cannot be solved analytically; instead, the SHA256_manager iteratively hashes candidate strings, starting from the first printable character. It then increments the character until it reaches the last one, at which point it appends a new character and restarts the process.

There are 95 printable ASCII characters. This means the system must compute 95 hashes for strings of length 1, 95^2 = 9 025 for two-character strings, and 95^3 = 857 375 for three-character strings. In general, the number of required hashes is 95^n for strings of length n.

Each sha256_core_pif returns the hash it calculated, and the SHA256_manager compares all of them with the received hash. If one of them matches, the candidate string sent to the first sha256_core_pif is sent to the host computer, together with the number of the sha256_core_pif that computed the correct hash. This way, the host computer can obtain the correct string.

The project uses the Litefury board connected to a Raspberry Pi 5 over PCIe.
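As a software analogue of the candidate enumeration described above, here is a minimal Python sketch using the standard `hashlib` module. The names `search_space` and `crack` are hypothetical, and the order in which candidates are generated is an assumption that need not match the order used by the SHA256_manager.

```python
import hashlib
from itertools import product

# The 95 printable ASCII characters (codes 32 to 126 inclusive).
PRINTABLE = [chr(c) for c in range(32, 127)]


def search_space(n):
    """Number of candidate strings of exactly length n."""
    return len(PRINTABLE) ** n


def crack(target_hex, max_len=3):
    """Try every printable string up to max_len characters, shortest first.

    Returns the matching string, or None if no candidate produces the hash.
    """
    for n in range(1, max_len + 1):
        for chars in product(PRINTABLE, repeat=n):
            candidate = "".join(chars)
            digest = hashlib.sha256(candidate.encode("ascii")).hexdigest()
            if digest == target_hex:
                return candidate
    return None


target = hashlib.sha256(b"hi").hexdigest()
print(search_space(2), crack(target, max_len=2))
```

This runs one hash at a time on the CPU; the FPGA design instead evaluates several candidates at once, one per core.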
The next diagram shows the Vivado block design. To meet the timing requirements, I needed to reduce the AXI clock speed to 62.5 MHz. Using this configuration, I was able to integrate 12 sha256_core_pif modules. Regarding the utilization of the FPGA, you will see that it is not close to full; the problem was meeting the timing requirements. Using 12 accelerators and a clock speed of 62.5 MHz, all the requirements were met.

On the host side, I created a Python driver to manage the LiteFury. I used the xDMA drivers from Xilinx with the modification we made in this article. Now, the Python driver just needs to open the /dev/xdma0_user peripheral and write the registers according to the register map of the AXI peripheral.

```python
import mmap
import os
import struct

# Methods of the driver class; the class statement and the
# AXI_PERIPH_OFFSET constant are not part of this excerpt.

def __init__(self, uio_path="/dev/xdma0_user", map_size=0x20000):
    self.fd = os.open(uio_path, os.O_RDWR | os.O_SYNC)
    self.map_size = map_size
    self.m = mmap.mmap(self.fd, self.map_size, mmap.MAP_SHARED,
                       mmap.PROT_READ | mmap.PROT_WRITE, offset=0)

def close(self):
    self.m.close()
    os.close(self.fd)

def write(self, addr, value):
    self.m.seek(addr + self.AXI_PERIPH_OFFSET)
    self.m.write(struct.pack("<I", value))  # Little endian

def read(self, addr):
    self.m.seek(addr + self.AXI_PERIPH_OFFSET)
    return struct.unpack("<I", self.m.read(4))[0]
```

As I mentioned before, to obtain the final string, we need to read the resulting string addresses and add the number of the winner module.
```python
def get_password(self, winner):
    pw = b''
    for addr in self.REG_R:
        word = self.read(addr)
        pw += word.to_bytes(4, 'big')

    # Add the value of the winner as integer to the resulting string
    pw_int = int.from_bytes(pw, 'big') + winner

    # Convert the result to bytes
    pw_bytes = pw_int.to_bytes(len(pw), 'big')

    # Invert the order of the result
    pw_bytes = pw_bytes[::-1]

    # ASCII decoding
    return pw_bytes.rstrip(b'\x00').decode('ascii', errors='ignore')
```

To test the project, I created another Python script that calculates the SHA-256 of a string (this can also be done using the OpenSSL library). Then, the calculated hash is sent to the accelerator, which returns the initial string.

```
~/pass_cracker/python $ python3 sha256_comp.py eoi
SHA-256 of 'eoi': 7c02b8671bb4824e1cea44af7b628e88b81495699d5e9cb0e2533af99320a81b
~/pass_cracker/python $ sudo python3 pass_cracker.py 7c02b8671bb4824e1cea44af7b628e88b81495699d5e9cb0e2533af99320a81b
Password: eoi
```

Projects like this can be quite impressive to engineers unfamiliar with FPGAs. The ability to accelerate SHA-256 computation by performing different tasks in parallel, and even using multiple hash calculators simultaneously, often sparks curiosity and interest in FPGA technology. The role of FPGAs in fields like cryptography and cybersecurity is expected to grow significantly in the coming years, as increasingly faster and more flexible systems are required.

All the files of this project are shared in the controlpaths GitHub.

Are you involved in a cryptography project and want to know if an FPGA could help? Contact me.

controlpaths.com

Link: https://mediatum.ub.tum.de/?id=601795 (It took digging to find this from the Wikipedia article [1] and the unsecured HTTP homepage for "BMDFM".)

```bibtex
@phdthesis{dissertation,
author = {Pochayevets, Oleksandr},
title = {BMDFM: A Hybrid Dataflow Runtime Parallelization Environment for Shared Memory Multiprocessors},
year = {2006},
school = {Technische Universität München},
pages = {170},
language = {en},
abstract = {To complement existing compiler-optimization methods we propose a programming model and a runtime system called BMDFM (Binary Modular DataFlow Machine), a novel hybrid parallel environment for SMP (Shared Memory Symmetric Multiprocessors), that creates a data-dependence graph and exploits parallelism of user application programs at run time. This thesis describes the design and provides a detailed analysis of BMDFM, which uses a dataflow runtime engine instead of a plain fork-join runtime library, thus providing transparent dataflow semantics on the top virtual machine level. Our hybrid approach eliminates disadvantages of the parallelization at compile-time, the directive based paradigm and the dataflow computational model. BMDFM is portable and is already implemented on a set of available SMP platforms. The transparent dataflow paradigm does not require parallelization and synchronization directives. The BMDFM runtime system shields the end-users from these details.},
keywords = {Parallel computing;Shared memory multiprocessors;Dataflow;Automatic Parallelization},
note = {},
url = {https://mediatum.ub.tum.de/601795},
}
```

[1]: https://en.wikipedia.org/wiki/Binary_Modular_Dataflow_Machine

#SMP #Parallelization #Multithreading #DependenceGraph #RunTime #DataFlow #VirtualMachine #VM #ParallelComputing #SharedMemoryMultiprocessors #AutomaticParallelization #CrossPlatform #Virtualization #Configware #Transputer

mediaTUM - Media and Publication Server