"In het verleden behaalde resultaten bieden geen garanties voor de toekomst"


About this blog

These are the ramblings of Matthijs Kooijman, concerning the software he hacks on, hobbies he has and occasionally his personal life.

Most content on this site is licensed under the WTFPL, version 2 (details).

Questions? Praise? Blame? Feel free to contact me.

My old blog (pre-2006) is also still available.


The actual resize

Armed with this knowledge, here's how the full resize works:

lvconvert --splitcache tika/data
lvresize -L 300G tika/data @hdd
lvconvert --type cache --cachevol tika/data-cache tika/data --cachemode writethrough

The last command might need some additional parameters depending on how you set up the cache in the first place. You can view current cache parameters with e.g. lvs -a -o +cache_mode,cache_settings,cache_policy.
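Put together as a script, the sequence might look like the sketch below. The names resize_cached_lv, run and DRY_RUN are made up for illustration and are not part of LVM; the LVM commands themselves are the ones shown above. With DRY_RUN set, the function only prints the commands, so they can be checked before touching real volumes.

```shell
#!/bin/sh
# Sketch only: resize_cached_lv, run and DRY_RUN are illustrative names,
# not part of LVM.

# Print the command when DRY_RUN is set, run it otherwise.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: resize_cached_lv VG LV NEW_SIZE PV_SPEC
resize_cached_lv() {
    vg="$1"; lv="$2"; size="$3"; pv="$4"
    # Detach the cache (keeping the cache volume) so the origin LV can be resized.
    run lvconvert --splitcache "$vg/$lv" || return 1
    # Resize the now-uncached LV, allocating only from the given PVs.
    run lvresize -L "$size" "$vg/$lv" "$pv" || return 1
    # Re-attach the cache volume; adjust the cache options to match your setup.
    run lvconvert --type cache --cachevol "$vg/$lv-cache" "$vg/$lv" --cachemode writethrough
}
```

For example, `DRY_RUN=1 resize_cached_lv tika data 300G @hdd` prints the three commands from above without running them. Stopping at the first failing step avoids resizing an LV whose cache is still attached.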

Cache pools

Note that all of this assumes using a cache volume and not a cache pool. I was originally using a cache pool setup, but it seems that a cache pool (which splits cache data and cache metadata into different volumes) is mostly useful if you want to split data and metadata over different PVs, which is not at all useful for me. So I switched to the cache volume approach, which needs fewer commands and volumes to set up.

I killed my cache pool setup with --uncache before I found out about --splitcache, so I did not actually try --splitcache with a cache pool, but I think the procedure is actually pretty much identical as described above, except that you need to replace --cachevol with --cachepool in the last command.

For reference, here's what my volumes looked like when I was still using a cache pool:

# lvs -a
LV                 VG   Attr       LSize   Pool         Origin       Data%  Meta%  Move Log Cpy%Sync Convert
data               tika Cwi-aoC--- 260.00g [data-cache] [data_corig] 99.99  19.39           0.00
[data-cache]       tika Cwi---C---  20.00g                           99.99  19.39           0.00
[data-cache_cdata] tika Cwi-ao----  20.00g
[data-cache_cmeta] tika ewi-ao----  20.00m
[data_corig]       tika owi-aoC--- 260.00g

This is a data volume of type cache, that ties together the big data_corig LV that contains the bulk data and a data-cache LV of type cache-pool that ties together the data-cache_cdata LV with the actual cache data and data-cache_cmeta with the cache metadata.

Tagged as: Linux, LVM, Resize

Comments

Dominic Raferd wrote at 2023-12-19 17:40

Interesting, thanks for your post. --splitcache sounds very neat but as far as I can tell the main advantage is speed (vs --uncache). When you run lvconvert to restore the existing cache you are only allowed to proceed if you accept that the entire existing cache contents are wiped.

Matthijs Kooijman wrote at 2023-12-19 19:55

Yeah, I think the end result is the same, it's just easier to use --splitcache indeed.


Sat, 18 Feb 2023


Disabling (broken) sensors in Supermicro IPMI to prevent alarm beeps

A few months ago, I put up an old Atom-powered Supermicro server (SYS-5015A-PHF) again, to serve at De War to collect and display various sensor and energy data about our building.

The server turned out to have an annoying habit: every now and then it would start beeping (one continuous annoying beep), that would continue until the machine was rebooted. It happened sporadically, but kept coming back. When I used this machine before, it was located in a datacenter where nobody would care about a beep more or less (so maybe it has been beeping for years on end before I replaced the server), but now it was in a server cabinet inside our local Fablab, where there are plenty of people to become annoyed by a beeping server...

I eventually traced this back to faulty sensor readings and fixed this by disabling the faulty sensors completely in the server's IPMI unit, which will hopefully prevent the annoying beep. In this post, I'll share my steps, in case anyone needs to do the same.

Shorting fan pins

At first, I noticed that there was an alarm displayed in the IPMI webinterface for one of the fans. Of course it makes sense to be notified of a faulty fan, except that the system did not have any fans connected... It did show the fan speed as 0RPM (or -2560RPM depending on where you looked) as expected, so I suspected it would start up realizing there was no fan but then sporadically seeing a bit of electrical noise on the fan speed pin, causing it to mark the fan as present and immediately as not running, triggering the alarm. I tried to fix this by shorting the fan speed detection pins to the GND pins to make it more noise-resilient.

Temperatures also wonky

However, a couple of weeks later, the server started beeping again. This time I looked a bit more closely, and found that the problem was now caused by a too-high temperature reading. The IPMI system event log (queried using ipmi-sel) showed:

43   | Feb-17-2023 | 09:18:58 | CPU Temp         | Temperature       | Upper Non-critical - going high ; Sensor Reading = 125.00 C ; Threshold = 85.00 C
44   | Feb-17-2023 | 09:18:58 | CPU Temp         | Temperature       | Upper Critical - going high ; Sensor Reading = 125.00 C ; Threshold = 90.00 C
45   | Feb-17-2023 | 09:18:58 | CPU Temp         | Temperature       | Upper Non-recoverable - going high ; Sensor Reading = 125.00 C ; Threshold = 95.00 C
46   | Feb-17-2023 | 16:26:16 | CPU Temp         | Temperature       | Upper Non-recoverable - going high ; Sensor Reading = 41.00 C ; Threshold = 95.00 C
47   | Feb-17-2023 | 16:26:16 | CPU Temp         | Temperature       | Upper Critical - going high ; Sensor Reading = 41.00 C ; Threshold = 90.00 C
48   | Feb-17-2023 | 16:26:16 | CPU Temp         | Temperature       | Upper Non-critical - going high ; Sensor Reading = 41.00 C ; Threshold = 85.00 C

This is a bit opaque, but the events at 9:18 show the temperature was read as 125°C - clearly indicating a faulty sensor. These are (I presume) the "asserted" events for each of the thresholds that this sensor has. Then at 16:26, the server was rebooted and the sensor read 41°C again (which I believe is still higher than realistic) and each of the thresholds emits a "deasserted" event.
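To pick such events out of a longer log, a small filter can help. This is a minimal sketch that assumes the ipmi-sel output looks exactly like the lines above; the function name flag_implausible and the default limit of 100 C are made up for illustration.

```shell
#!/bin/sh
# Sketch only: flag_implausible and the default limit are illustrative.
# Reads ipmi-sel output on stdin and prints the events whose "Sensor Reading"
# is above the limit, as: date time sensor reading
flag_implausible() {
    awk -F'|' -v limit="${1:-100}" '
        match($0, /Sensor Reading = -?[0-9.]+/) {
            # Skip the 17 characters of "Sensor Reading = " to get the number.
            reading = substr($0, RSTART + 17, RLENGTH - 17) + 0
            if (reading > limit) {
                date = $2; time = $3; sensor = $4
                # Strip the padding around the pipe-separated columns.
                gsub(/^ +| +$/, "", date)
                gsub(/^ +| +$/, "", time)
                gsub(/^ +| +$/, "", sensor)
                printf "%s %s %s %.2f\n", date, time, sensor, reading
            }
        }'
}
```

It could be used as `sudo ipmi-sel | flag_implausible`, or with another limit as `flag_implausible 90`. Note that it only looks at the reading itself, not at whether an event is asserted or deasserted.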

Looking back, I noticed that the log showed events for both fans and both temperature sensors, so it seemed all of these sensors were really wonky. I could also see the incorrect temperatures clearly in the sensor data I had been collecting from the server (using telegraf, collected using lm-sensors from within the linux system itself, but clearly reading from the same sensor as IPMI):

Graph of erratic temperatures

Note that the graph above shows three sensors, while IPMI only reads two, so I am not sure what the third one is. The alarm from the IPMI log is shown clearly as a sudden jump of the temp2 purple line (jumping back down when the server was rebooted). But also note an unexplained second jump down a few hours later, and note that the next day temp1 dives down to -53°C for some reason, which also matches what IPMI reads:

$ sudo ipmitool sensor                                                
System Temp      | -53.000    | degrees C  | nr    | -9.000    | -7.000    | -5.000    | 75.000    | 77.000    | 79.000    
CPU Temp         | 27.000     | degrees C  | ok    | -11.000   | -8.000    | -5.000    | 85.000    | 90.000    | 95.000    
CPU FAN          | -2560.000  | RPM        | nr    | 400.000   | 585.000   | 770.000   | 29260.000 | 29815.000 | 30370.000 
SYS FAN          | -2560.000  | RPM        | nr    | 400.000   | 585.000   | 770.000   | 29260.000 | 29815.000 | 30370.000 
CPU Vcore        | 1.160      | Volts      | ok    | 0.640     | 0.664     | 0.688     | 1.344     | 1.408     | 1.472     
Vichcore         | 1.040      | Volts      | ok    | 0.808     | 0.824     | 0.840     | 1.160     | 1.176     | 1.192     
+3.3VCC          | 3.280      | Volts      | ok    | 2.816     | 2.880     | 2.944     | 3.584     | 3.648     | 3.712     
VDIMM            | 1.824      | Volts      | ok    | 1.448     | 1.480     | 1.512     | 1.960     | 1.992     | 2.024     
+5 V             | 5.056      | Volts      | ok    | 4.096     | 4.320     | 4.576     | 5.344     | 5.600     | 5.632     
+12 V            | 11.904     | Volts      | ok    | 10.368    | 10.496    | 10.752    | 12.928    | 13.056    | 13.312    
+3.3VSB          | 3.296      | Volts      | ok    | 2.816     | 2.880     | 2.944     | 3.584     | 3.648     | 3.712     
VBAT             | 2.912      | Volts      | ok    | 2.560     | 2.624     | 2.688     | 3.328     | 3.392     | 3.456     
Chassis Intru    | 0x0        | discrete   | 0x0000| na        | na        | na        | na        | na        | na        
PS Status        | 0x1        | discrete   | 0x01ff| na        | na        | na        | na        | na        | na

Note that the voltage sensors show readings that do make sense, and looking at the history, they show no sudden jumps, so those are probably still reliable (even though they are read from the same sensor chip according to lm-sensors).

Disabling the sensors

It seems you can disable generation of events when a threshold is crossed, and can even disable reading the sensor entirely. Hopefully this will also prevent the BMC from beeping on weird sensor values.

To disable things, I used ipmi-sensors-config (from the freeipmi-tools Debian package):

First I queried the current sensor configuration:

 sudo ipmi-sensors-config --checkout > ipmi-sensors-config.txt

Then I edited the generated file, setting Enable_All_Event_Messages and Enable_Scanning_On_This_Sensor to No. I also had to set the hysteresis values for the fans to None, since the -2375 value generated by --checkout was refused when writing back the values in the next step.
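The edit can also be scripted. The sketch below assumes the checked-out file consists of "Section NAME" ... "EndSection" blocks with one "Key Value" pair per line; the function name disable_sensors and the section names in the example are hypothetical, so check them against your own file. It does not touch the hysteresis values.

```shell
#!/bin/sh
# Sketch only: disable_sensors is an illustrative helper, and the assumed file
# layout (Section NAME / Key Value / EndSection) should be checked against the
# file that --checkout actually produced.
# Usage: disable_sensors SECTION... < ipmi-sensors-config.txt > edited.txt
disable_sensors() {
    awk -v names="$*" '
        BEGIN {
            # Remember which sections should be changed.
            n = split(names, list, " ")
            for (i = 1; i <= n; i++) want[list[i]] = 1
        }
        $1 == "Section"    { active = ($2 in want) }
        $1 == "EndSection" { active = 0 }
        active && ($1 == "Enable_All_Event_Messages" || $1 == "Enable_Scanning_On_This_Sensor") {
            # Flip a trailing Yes to No, leaving other values alone.
            sub(/Yes[ \t]*$/, "No")
        }
        { print }'
}
```

For example, `disable_sensors 4_System_Temp 5_CPU_Temp < ipmi-sensors-config.txt > edited.txt` (with made-up section names) writes a copy in which only those sections are changed. Comparing both files with diff before committing is a good idea.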

Then I committed the changes with:

sudo ipmi-sensors-config --commit --filename ipmi-sensors-config.txt

I suspect that modifying Enable_All_Event_Messages allows the sensor to be read, but prevents the threshold from being checked and generating events (especially since this setting seems to just clear the corresponding setting for each available threshold, so it seems you can also use this to disable some of the thresholds and keep some others). However, it is not entirely clear to me if this would just prevent these events from showing up in the event log, or if it would actually prevent the system from beeping (when does the system beep? On any event? Specific events? This is not clear to me).

For good measure, I decided to also modify Enable_Scanning_On_This_Sensor, which I believe prevents the sensor from being read at all by the BMC, so that should really prevent alarms. This also causes ipmitool sensor to display value and status as na for these sensors. The sensors command (from the lm-sensors package) can still read the sensor without issues (though the values are not very useful anyway...).

Note that apparently these settings are not always persistent across reboots and power cycles, so make sure you test that. For this particular server, the settings survive a reboot; I have not tested a hard power cycle yet.

I cannot yet tell for sure if this has fixed the problem (only applied the changes today), but I'm pretty confident that this will indeed keep the people in our Fablab happy (and if not - I'll just solder off the beeper from the motherboard, but let's hope I will not have to resort to such extreme measures...).

Tagged as: IPMI, Linux, Supermicro, Tika


Sat, 04 Feb 2023


Repurposing the "Ecobutton" to skip spotify songs using Linux udev/hwdb key remapping

The Ecobutton

When sorting out some stuff today I came across an "Ecobutton". When you attach it through USB to your computer and press the button, your computer goes to sleep (at least that is the intention).

The idea is that it makes things more sustainable because you can more easily put your computer to sleep when you walk away from it. As this tweakers poster (Dutch) eloquently argues, having more plastic and electronics produced in China, shipped to Europe and sold here for €18 or so probably does not have a net positive effect on the environment or your wallet, but well, given this button found its way to me, I might as well see if I can make it do something useful.

I had previously started a project to make a "Next" button for spotify that you could carry around and would (wirelessly - with an ESP8266 inside) skip to the next song using the Spotify API whenever you pressed it. I had a basic prototype working, but then the project got stalled on figuring out an enclosure and finding sufficiently low-power addressable RGB LEDs (documentation about this is lacking, so I resorted to testing two dozen different types of LEDs and creating a website to collect specs and test results for addressable LEDs, which then ended up with the big collection of other Yak-shaving projects waiting for this magical moment where I suddenly have a lot of free time).

In any case, it seemed interesting to see if this Ecobutton could be used as poor-man's spotify next button. Not super useful, but at least now I can keep the button around knowing I can actually use it for something in the future. I also produced some useful (and not readily available) documentation about remapping keys with hwdb in the process, so it was at least not a complete waste of time... Anyway, into the technical details...

How does it work?

I expected this to be a USB device that advertises itself as a keyboard, and then whenever you press the button, sends the "sleep" key to put the computer to sleep.

Turns out the first part was correct, but instead of sending a sleep keypress, it sends Meta-R, ecobutton (it types each of these letters after each other), enter. Seems you need to install a tool on your pc that is executed by using the windows-key+R shortcut. Pragmatic, but quite ugly, especially given a sleep key exists... But maybe Windows does not implement the key (or maybe this tool also changes some settings for deeper sleep, at least that's what was suggested in the tweakers post linked above).

Replacing the firmware?

I considered I could maybe replace the firmware to make the device send whatever keystroke I wanted, but writing firmware from scratch for existing hardware is not the easiest project (even for a simple device like this). After opening the device I decided this was not a feasible route.

The (I presume) microcontroller in there is hidden in a blob, so no indication as to its type, pin connections, programming ports (if it actually has flash and is not ROM only).

I did notice some solder jumpers that I figured could influence behavior (maybe the same PCB is used for differently branded buttons), but shorting S1 or S5 did not seem to change behavior (maybe I should have unsoldered S3, but well...).

Remap keys, then?

The next alternative is to remap keys on the computer. Running Linux, this should certainly be possible in a couple of dozen ways. This does need to be device-specific remapping, so my normal keyboard still works as normal, but if I can unmap all keys except for the meta key that it presses first, and map that to something like the KEY_NEXTSONG (which is handled by spotify and/or Gnome already), that might work.

I first saw some remapping solutions for X, but those probably will not work - I'm running wayland and I prefer something more low-level. I also found cool remapping daemons (like keyd) that grab events from a keyboard and then synthesise new events on a virtual keyboard, allowing cool things like multiple layers or tapping shift and then another key to get uppercase instead of having to hold shift and the key together, but that is way more complicated than what I need here.

Then I found that udev has a thing called "hwdb", which allows putting files in /etc/udev/hwdb.d that match specific input devices and can specify arbitrary scancode-to-keycode mappings for them. Exactly what I needed - works out of the box, just drop a file into /etc.

Understanding and taming udev and hwdb

The challenge turned out to be to figure out how to match against my specific keyboard identifier, what scancodes and keycodes to use, and in general figure out how the ecosystem around this works (in short: when plugging in a device, udev rules consult the hwdb for extra device properties, which a udev builtin keyboard command then uses to apply key remappings in the kernel using an ioctl on the /dev/input/eventxx device). In case you're wondering - this means you do not need to use hwdb, you can also apply this from udev rules directly, but then you need a bit more care.

I've written down everything I figured out about hwdb in a post on Unix stackexchange, so I'll not repeat everything here.

Using what I had learnt, getting the button to play nice was a matter of creating /etc/udev/hwdb.d/99-ecobutton.hwdb containing:

evdev:input:b????v3412p7856e*
  KEYBOARD_KEY_700e3=nextsong # LEFTMETA
  KEYBOARD_KEY_70015=reserved # R
  KEYBOARD_KEY_70008=reserved # E
  KEYBOARD_KEY_70006=reserved # C
  KEYBOARD_KEY_70012=reserved # O
  KEYBOARD_KEY_70005=reserved # B
  KEYBOARD_KEY_70018=reserved # U
  KEYBOARD_KEY_70017=reserved # T
  KEYBOARD_KEY_70011=reserved # N
  KEYBOARD_KEY_70028=reserved # ENTER

This matches the keyboard based on its usb vendor and product id (3412:7856) and then disables all keys that are used by the button, except for the first, and remaps that to KEY_NEXTSONG.
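The scancodes in this file are USB HID usage codes: the keyboard usage page (0x07) in the upper 16 bits and the key's usage ID in the lower 16 bits, so left meta (usage ID 0xe3) becomes 700e3. A small sketch that generates such lines, where the helper name hwdb_entry is made up for illustration:

```shell
#!/bin/sh
# Sketch only: hwdb_entry is an illustrative helper, not an existing tool.
# Usage: hwdb_entry USAGE_ID KEYCODE   (e.g. hwdb_entry 0xe3 nextsong)
hwdb_entry() {
    # Scancode = usage page 0x07 (keyboard) shifted left 16 bits, plus the usage ID.
    scancode=$(( (0x07 << 16) | $1 ))
    # hwdb property lines must be indented below the match line.
    printf '  KEYBOARD_KEY_%x=%s\n' "$scancode" "$2"
}
```

For example, `hwdb_entry 0x15 reserved` prints the line for the R key. The resulting scancode is the same value that evtest reports as MSC_SCAN for a key.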

To apply this new file, run sudo systemd-hwdb update to recompile the database and then replug the button to apply it (you can also re-apply with udevadm trigger, but it seems then Gnome does not pick up the change, I suspect because the gnome-settings media-keys module checks only once whether a keyboard supports media keys at all and ignores it otherwise).

With that done, it now produces KEY_NEXTSONG events as expected:

$ sudo evtest --grab /dev/input/by-id/usb-3412_7856-event-if00
Input driver version is 1.0.1
Input device ID: bus 0x3 vendor 0x3412 product 0x7856 version 0x100
Input device name: "HID 3412:7856"

[ ... Snip more output ...]

Event: time 1675514344.256255, type 4 (EV_MSC), code 4 (MSC_SCAN), value 700e3
Event: time 1675514344.256255, type 1 (EV_KEY), code 163 (KEY_NEXTSONG), value 1
Event: time 1675514344.256255, -------------- SYN_REPORT ------------
Event: time 1675514344.264251, type 4 (EV_MSC), code 4 (MSC_SCAN), value 700e3
Event: time 1675514344.264251, type 1 (EV_KEY), code 163 (KEY_NEXTSONG), value 0
Event: time 1675514344.264251, -------------- SYN_REPORT ------------

More importantly, I can now skip annoying songs (or duplicate songs - spotify really messes this up) with a quick button press!

Tagged as: Hack, hwdb, Keyboard, Linux, udev

Comments

L3P3 wrote at 2024-03-24 14:44

Maybe you missed the fact that it is also possible to keep the button pressed in order to open a website. This could be used for another function by mapping a key that is only included in that link like /.

L3P3 wrote at 2024-03-24 15:19

To map out the other url keys, my file now looks like this:

evdev:input:b????v3412p7856e*
  KEYBOARD_KEY_700e3=nextsong # LEFTMETA
  KEYBOARD_KEY_70015=reserved # R
  KEYBOARD_KEY_70008=reserved # E
  KEYBOARD_KEY_70006=reserved # C
  KEYBOARD_KEY_70012=reserved # O
  KEYBOARD_KEY_70005=reserved # B
  KEYBOARD_KEY_70018=reserved # U
  KEYBOARD_KEY_70017=reserved # T
  KEYBOARD_KEY_70011=reserved # N
  KEYBOARD_KEY_7000b=reserved # H
  KEYBOARD_KEY_70013=reserved # P
  KEYBOARD_KEY_700e1=reserved # LEFTSHIFT
  KEYBOARD_KEY_70054=reserved # /
  KEYBOARD_KEY_7001a=reserved # W
  KEYBOARD_KEY_70037=reserved # .
  KEYBOARD_KEY_7002d=reserved # -
  KEYBOARD_KEY_70010=reserved # M
  KEYBOARD_KEY_70016=reserved # S
  KEYBOARD_KEY_70033=reserved # ;
  KEYBOARD_KEY_70028=reserved # ENTER

When the button is pressed for 3 seconds, the same happens as when pressed shortly.

Matthijs Kooijman wrote at 2024-03-27 12:00

> Maybe you missed the fact that it is also possible to keep the button pressed in order to open a website. This could be used for another function by mapping a key that is only included in that link like /.

Ah, I indeed missed that. Thanks for pointing that out and the updated file :-)


Tue, 31 Jan 2023


Recovering data from a failing hard disk with HFS+

Recently, a customer asked me to have a look at an external hard disk he was using with his Macbook. It would show a file listing just fine, but when trying to open actual files, it would start failing. Of course there was no backup, but the files were very precious...

This started out as a small question, but ended up in an adventure that spanned a few days and took me deep into the ddrescue recovery tool, through the HFS+ filesystem and past USB power port control. I learned a lot, discovered some interesting things and produced a pile of scripts that might be helpful to others. Since the journey seems interesting as well as the end result, I will describe the steps I took here, "ter leering ende vermaeck" (for instruction and amusement).

I started out confirming the original problem. Plugging in the disk to my Linux laptop, it showed up as expected in dmesg. I could mount the disk without problems, see the directory listing and even open up an image file stored on the disk. Opening other files didn't seem to work.

SMART

As you do with bad disks, you try to get their SMART data. Since smartctl did not support this particular USB bridge (and I wasn't game to try random settings to see if it worked on a failing disk), I gave up on SMART initially. I later opened up the case to bypass the USB-to-SATA controller (in case the problem was there, and to make SMART work), but found that this particular hard drive had the converter built into the drive itself (so the USB part was directly attached to the drive). Even later, I found some page online (I have not saved the link) that showed the disk was indeed supported by smartctl and showed the option to pass to smartctl -d to make it work. SMART confirmed that the disk was indeed failing, based on the number of reallocated sectors (2805).

Fast-then-slow copying

Since opening up files didn't work so well, I prepared to make a sector-by-sector copy of the partition on the disk, using ddrescue. This tool has a good approach to salvaging data, where it tries to copy off as much data as possible quickly, skipping data when it comes to a bad area on disk. Since reading a bad sector on a disk often takes a lot of time (before returning failure), ddrescue tries to steer clear of these bad areas and focus on the good parts first. Later, it returns to these bad areas and, in a few passes, tries to get out as much data as possible.

At first, copying data seemed to work well, giving a decent read speed of some 70MB/s as well. But very quickly the speed dropped terribly and I suspected the disk ran into some bad sector and kept struggling with that. I reset the disk (by unplugging it) and made a few more attempts and quickly discovered something weird: the disk would work just fine after plugging it in, but after a while the speed would plummet to a whopping 64Kbyte/s or less. This happened every time. Even more, it happened pretty much exactly 30 seconds after I started copying data, regardless of what part of the disk I copied data from.

So I quickly wrote a one-liner script that would start ddrescue, kill it after 45 seconds, wait for the USB device to disappear and reappear, and then start over again. So I spent some time replugging the USB cable about once every minute, so I could at least back up some data while I was investigating other stuff.

Since the speed was originally 70MB/s, I could pull a few GB worth of data every time. Since it was a 2000GB disk, I "only" had to replug the USB connector around a thousand times. Not entirely infeasible, but not quite comfortable or efficient either.

So I investigated ways to further automate this process: using hdparm to spin down or shut down the disk, using USB powersaving to let the disk reset itself, disabling the USB subsystem completely, but nothing seemed to increase the speed again other than completely powering down the disk by removing the USB plug.

While I was trying these things, the speed during those first 30 seconds dropped, even below 10MB/s at some point. At that point, I could salvage around 200MB with each power cycle and was looking at pulling the USB plug around 10,000 times: no way that would be happening manually.
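These estimates are simple divisions: the disk size divided by the amount of data recovered per power cycle, rounded up. A small sketch of the arithmetic, where power_cycles is a made-up helper name and sizes are in MB:

```shell
#!/bin/sh
# Sketch only: power_cycles is an illustrative helper.
# Usage: power_cycles DISK_MB MB_PER_CYCLE
power_cycles() {
    # Integer division, rounded up, since a partial cycle is still a cycle.
    echo $(( ($1 + $2 - 1) / $2 ))
}

# 2000GB disk at 70MB/s for 30 seconds per cycle (2100MB per cycle).
power_cycles 2000000 $(( 70 * 30 ))
# Same disk at around 200MB per cycle.
power_cycles 2000000 200
```

The first call gives roughly a thousand cycles, the second ten thousand, matching the numbers above.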

Automatically pulling the plug

I resolved to further automate this unplugging and planned using an Arduino (or perhaps the GPIO of a Raspberry Pi) and something like a relay or transistor to interrupt the power line to the hard disk to "unplug" the hard disk.

For that, I needed my Current measuring board to easily interrupt the USB power lines, which I had to bring from home. In the meantime, I found uhubctl, a small tool that uses low-level USB commands to individually control the port power on some hubs. Most hubs don't support this (or advertise support, but simply don't have the electronics to actually switch power, apparently), but I noticed that the newer Raspberry Pis supported this (for port 2 only, but that would be enough).

Coming to the office the next day, I set up a Raspberry Pi and tried uhubctl. It did indeed toggle USB power, but the toggle would affect all USB ports at the same time, rather than just port 2. So I could switch power to the faulty drive, but that would also cut power to the good drive that I was storing the recovered data on, and I was not quite prepared to give the good drive 10,000 power cycles.

The next plan was to connect the recovery drive through the network, rather than directly to the Raspberry Pi. On Linux, setting up a network drive using SSHFS is easy, so that worked in a few minutes. However, somehow ddrescue insisted it could not write to the destination file and logfile, citing permission errors (but the permissions seemed just fine). I suspect it might be trying to mmap or something else that would not work across SSHFS....

The next plan was to find a powered hub - so the recovery drive could stay powered while the failing drive was powercycled. I rummaged around the office looking for USB hubs, and eventually came up with some USB-based docking station that was externally powered. When connecting it, I tried the uhubctl tool on it, and found that one of its six ports actually supported powertoggling. So I connected the failing drive to that port, and prepared to start the backup.

When trying to mount the recovery drive, I discovered that a Raspberry Pi only supports filesystems up to 2TB (probably because it uses a 32-bit architecture). My recovery drive was 3TB, so that would not work on the Pi.

Time for a new plan: do the recovery from a regular PC. I already had one ready that I used the previous day, but now I needed to boot a proper Linux on it (previously I used a minimal Linux image from UBCD, but that didn't have a compiler installed to allow using uhubctl). So I downloaded a Debian live image (over a mobile connection - we were still waiting for fiber to be connected) and 1.8GB and 40 minutes later, I finally had a working setup.

The run.sh script I used to run the backup basically does this:

1. Run ddrescue to pull off data
2. After 35 seconds, kill ddrescue
3. Tell the disk to sleep, so it can spin down gracefully before cutting the power.
4. Tell the disk to sleep again, since sometimes it doesn't work the first time.
5. Cycle the USB power on the port
6. Wait for the disk to re-appear
7. Repeat from 1.
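
The steps above can be sketched as follows. This is not the original run.sh: the device paths, hub location and port are placeholders, and using timeout to stop ddrescue and hdparm -Y to put the disk to sleep are assumptions about how those steps could be done. The run helper and DRY_RUN variable are made up so the commands can be printed instead of executed.

```shell
#!/bin/sh
# Sketch only, not the original run.sh. All values below are placeholders.
PART="${PART:-/dev/sdX2}"     # failing partition to copy (hypothetical)
DISK="${DISK:-/dev/sdX}"      # failing disk, target of the sleep command (hypothetical)
DEST="${DEST:-backup.img}"    # image file on the recovery drive
MAP="${MAP:-backup.map}"      # ddrescue mapfile (logfile)
HUB="${HUB:-1-1}"             # uhubctl hub location (hypothetical)
PORT="${PORT:-2}"             # hub port the failing disk is plugged into

# Print the command when DRY_RUN is set, run it otherwise.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

rescue_cycle() {
    # Steps 1 and 2: run ddrescue and let timeout stop it after 35 seconds.
    run timeout 35 ddrescue "$PART" "$DEST" "$MAP"
    # Steps 3 and 4: ask the disk to sleep, twice in case the first try fails.
    run hdparm -Y "$DISK"
    run hdparm -Y "$DISK"
    # Step 5: switch the port power off and on again.
    run uhubctl -l "$HUB" -p "$PORT" -a cycle
    # Step 6: wait until the partition device node exists again.
    if [ -z "${DRY_RUN:-}" ]; then
        until [ -b "$PART" ]; do sleep 1; done
    fi
}

# Step 7: repeat until interrupted.
rescue_loop() {
    while :; do rescue_cycle; done
}
```

Running `DRY_RUN=1 rescue_cycle` prints one round of commands; `rescue_loop` (as root) would run them until interrupted. Because ddrescue keeps its progress in the mapfile, each round continues where the previous one stopped.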

By now, the speed of recovery had been fluctuating a bit, but was between 10MB/s and 30MB/s. That meant I was looking at somewhere between thousands and tens of thousands of power cycles, and a few days up to a week, to back up the complete disk (and more if the speed would drop further).

Selectively backing up

Realizing that there would be a fair chance that the disk would indeed get slower, or even die completely due to all these power cycles, I had to assume I could not back up the complete disk.

Since I was making the backup sector by sector using ddrescue, this meant a risk of not getting any meaningful data at all. Files are typically fragmented, so they can be stored anywhere on the disk, possibly spread over multiple areas as well. If you just start copying at the start of the disk, but do not make it to the end, you will have backed up some data but the data could belong to all kinds of different files. That means that you might have some files in a directory, but not others. Also, a lot of files might only be partially recovered, the missing parts being read as zeroes. Finally, you will also end up backing up all unused space on the disk, which is rather pointless.

To prevent this, I had to figure out where all kinds of stuff was stored on the disk.

The catalog file

The first step was to make sure the backup file could be mounted (using a loopback device). On my first attempt, I got an error about an invalid catalog.

I looked around for some documentation about the HFS+ filesystems, and found a nice introduction by infosecaddicts.com and a more detailed description at dubeiko.com. The catalog is apparently the place where the directory structure, filenames, and other metadata are stored in a single place.

This catalog is not in a fixed location (since its size can vary), but its location is noted in the so-called volume header, a fixed-size datastructure located at 1024 bytes from the start of the partition. More details (including easier to read offsets within the volume header) are provided in this example.

Looking at the volume header inside the backup gives me:

root@debian:/mnt/recover/WD backup# dd if=backup.img bs=1024 skip=1 count=1 2> /dev/null | hd
00000000  48 2b 00 04 80 00 20 00  48 46 53 4a 00 00 3a 37  |H+.... .HFSJ..:7|
00000010  d4 49 7e 38 d8 05 f9 64  00 00 00 00 d4 49 1b c8  |.I~8...d.....I..|
00000020  00 01 24 7c 00 00 4a 36  00 00 10 00 1d 1a a8 f6  |..$|..J6........|
                                   ^^^^^^^^^^^ Block size: 4096 bytes
00000030  0e c6 f7 99 14 cd 63 da  00 01 00 00 00 01 00 00  |......c.........|
00000040  00 02 ed 79 00 6e 11 d4  00 00 00 00 00 00 00 01  |...y.n..........|
00000050  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000060  00 00 00 00 00 00 00 00  a7 f6 0c 33 80 0e fa 67  |...........3...g|
00000070  00 00 00 00 03 a3 60 00  03 a3 60 00 00 00 3a 36  |......`...`...:6|
00000080  00 00 00 01 00 00 3a 36  00 00 00 00 00 00 00 00  |......:6........|
00000090  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000000c0  00 00 00 00 00 e0 00 00  00 e0 00 00 00 00 0e 00  |................|
000000d0  00 00 d2 38 00 00 0e 00  00 00 00 00 00 00 00 00  |...8............|
000000e0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000110  00 00 00 00 12 60 00 00  12 60 00 00 00 01 26 00  |.....`...`....&.|
00000120  00 0d 82 38 00 01 26 00  00 00 00 00 00 00 00 00  |...8..&.........|
00000130  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000160  00 00 00 00 12 60 00 00  12 60 00 00 00 01 26 00  |.....`...`....&.|
00000170  00 00 e0 38 00 01 26 00  00 00 00 00 00 00 00 00  |...8..&.........|
00000180  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000400

00000110  00 00 00 00 12 60 00 00  12 60 00 00 00 01 26 00  |.....`...`....&.|
          ^^^^^^^^^^^^^^^^^^^^^^^ Catalog size, in bytes: 0x12600000

00000120  00 0d 82 38 00 01 26 00  00 00 00 00 00 00 00 00  |...8..&.........|
                      ^^^^^^^^^^^ First extent size, in 4k blocks: 0x12600
          ^^^^^^^^^^^ First extent offset, in 4k blocks: 0xd8238
00000130  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

I have annotated the parts that refer to the catalog. The content of the catalog (just like that of all other files) is stored in "extents". An extent is a single, contiguous block of storage that contains (a part of) the content of a file. Each file can consist of multiple extents, to prevent having to move file content around each time things change (i.e. to allow fragmentation).

In this case, the catalog is stored in just a single extent (since the subsequent extent descriptors contain only zeroes). All extent offsets and sizes are in blocks of 4k bytes, so this extent lives at 0xd8238 * 4k = byte 3626205184 (~3.4G) and is 0x12600 * 4k = 294MiB long. So I backed up the catalog by adding -i 3626205184 to ddrescue, making it skip ahead to the location of the catalog (and then power cycled a few times until it copied the needed 294MiB).
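
The same lookup can be sketched in Python. This is not one of the scripts I used, just an illustration: the first_extent helper is hypothetical, and the field offsets are my reading of the HFS+ volume header layout (block size at 0x28, then an 8-byte logical size, two 4-byte fields and a list of start/count pairs per special file), so double check them against your own header.

```python
import struct

def first_extent(volume_header, fork_offset):
    """Return (byte_offset, byte_length) of the first extent of a special file."""
    # The allocation block size is a big-endian 32-bit value at offset 0x28.
    (block_size,) = struct.unpack_from(">I", volume_header, 0x28)
    # Skip logicalSize (8 bytes), clumpSize (4) and totalBlocks (4); the first
    # extent descriptor (startBlock, blockCount) follows directly after them.
    start_block, block_count = struct.unpack_from(">II", volume_header, fork_offset + 16)
    return start_block * block_size, block_count * block_size

# Rebuild only the relevant bytes of the header dump shown above.
header = bytearray(0x200)
header[0x28:0x2c] = bytes.fromhex("00001000")
header[0x110:0x128] = bytes.fromhex(
    "0000000012600000" "12600000" "00012600" "000d8238" "00012600")

offset, length = first_extent(header, 0x110)  # 0x110 is the catalog entry
print(offset, length // 2**20)  # 3626205184 294
```

The first number is what goes into ddrescue's -i option, the second is the size in MiB to wait for.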

After backing up the catalog, I could mount the image file just fine, and navigate the directory structure. Trying to open files would mostly fail, since most files would only read as zeroes for now.

I did the same for the allocation file (which tracks free blocks), the extents file (which tracks the content of files that are more fragmented and whose extent list does not fit in the catalog) and the attributes file (not sure what that is, but for good measure).

Afterwards, I wanted to continue copying from where I previously left off, so I tried passing -i 0 to ddrescue, but it seems this can only be used to skip ahead, not back. In the end, I just edited the logfile, which is a plain text file, to set the current position to 0. ddrescue is smart enough to skip over blocks it already backed up (or marked as failed), so it then continued where it previously left off.

Where are my files?

With the catalog backed up, I needed to read it to figure out which files were stored where, so I could make sure the most important files were backed up first, followed by all other files, skipping any unused space on the disk.

I considered and tried some tools for reading the catalog directly, but none of them seemed workable. I looked at hfssh from hfsutils (which crashed), hfsdebug (which is discontinued and no longer available for download) and hfsinspect (which calls itself "quite buggy").

Instead, I found the filefrag commandline utility, which uses a Linux filesystem syscall to figure out where the contents of a particular file are stored on disk. To coax the output of that tool into a list of extents usable by ddrescue, I wrote a one-liner shell script called list-extents.sh:

sudo filefrag -e "$@"  | grep  '^   ' |sed 's/\.\./:/g' | awk -F: '{print $4, $6}'

Given any number of filenames, it produces a list of (start, size) pairs for each extent in the listed files (in 4k blocks, which is the Linux VFS native block size).
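
To make it clearer what the pipeline picks out, here is a rough Python equivalent. It is only a sketch: parse_filefrag is a hypothetical helper, and the sample text is made up to resemble filefrag -e output rather than copied from a real run.

```python
def parse_filefrag(output):
    """Yield (physical_start, length) in blocks for each extent row."""
    for line in output.splitlines():
        # Same filter as grep '^   ': only extent rows start with three spaces.
        if not line.startswith("   "):
            continue
        # Same trick as the sed call: turn ".." into ":" and split on ":".
        fields = line.replace("..", ":").split(":")
        # Columns are now: index, logical start, logical end,
        # physical start, physical end, length (awk's $4 and $6).
        yield int(fields[3]), int(fields[5])

# Made-up sample, shaped like filefrag -e output for a single 4-block file.
sample = """\
File size of example.txt is 16384 (4 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       3:      34816..     34819:      4:             last,eof
example.txt: 1 extent found
"""

print(list(parse_filefrag(sample)))  # [(34816, 4)]
```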

With the backup image loopback-mounted at /mnt/backup, I could then generate an extent list for a given subdirectory using:

sudo find /mnt/backup/SomeDir -type f -print0 | xargs -0 -n 100 ./list-extents.sh > SomeDir.list

To turn this plain list of extents into a logfile usable by ddrescue, I wrote another small script called post-process.sh, that adds the appropriate header, converts from 4k blocks to 512-byte sectors, converts to hexadecimal and sets the right device size (so if you want to use this script, edit it with the right size). It is called simply like this:

./post-process.sh SomeDir.list

This produces two new files: SomeDir.list.done, in which all of the selected files are marked as "finished" (and all other blocks as "non-tried") and SomeDir.list.notdone which is reversed (all selected files are marked as "non-tried" and all others are marked as "finished").
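
For illustration, the core of such a conversion could look like the Python sketch below. This is not the actual post-process.sh: build_blocks and format_mapfile are hypothetical names, the sketch goes straight from 4k blocks to byte positions (which is what I understand ddrescue mapfiles to use), and the header lines are an assumption that should be checked against the ddrescue manual.

```python
def build_blocks(extents, device_size, block_size=4096, selected="?", other="+"):
    """Turn (start, length) extents in blocks into contiguous
    (pos, size, status) byte ranges that cover the whole device."""
    # Convert to byte ranges and sort, so touching or overlapping extents merge.
    ranges = sorted((s * block_size, (s + n) * block_size) for s, n in extents)
    merged = []
    for start, end in ranges:
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    blocks, pos = [], 0
    for start, end in merged:
        if start > pos:
            blocks.append((pos, start - pos, other))    # gap before this extent
        blocks.append((start, end - start, selected))
        pos = end
    if pos < device_size:
        blocks.append((pos, device_size - pos, other))  # rest of the device
    return blocks

def format_mapfile(blocks):
    # Assumed layout: comment lines, one status line, then one line per block.
    lines = ["# current_pos  current_status", "0x00000000     ?",
             "#      pos        size  status"]
    lines += [f"0x{pos:08X}  0x{size:08X}  {status}" for pos, size, status in blocks]
    return "\n".join(lines) + "\n"

# ".notdone" flavour: selected files are non-tried, everything else finished.
blocks = build_blocks([(2, 1), (0, 1), (3, 2)], device_size=8 * 4096)
print(format_mapfile(blocks))
```

Swapping the selected and other arguments gives the ".done" flavour instead.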

Backing up specific files

Edit: Elmo pointed out that all the mapfile manipulation with ddrescuelog would not have been needed had I known about ddrescue's --domain-mapfile option, which passes a second mapfile to ddrescue and makes it only process blocks that are marked as finished in that mapfile, while presumably reading and updating the regular mapfile as normal.

Armed with a couple of these logfiles for the most important files on the disk and one for all files on the disk, I used the ddrescuelog tool to tell ddrescue what stuff to work on first. The basic idea is to mark everything that is not important as "finished", so ddrescue will skip over it and only work on the important files.

ddrescuelog backup.logfile --or-mapfile SomeDir.list.notdone | tee todo.original > todo

This uses the ddrescuelog --or-mapfile option, which takes my existing logfile (backup.logfile) and marks all bytes as finished that are marked as finished in the second file (SomeDir.list.notdone). In other words, it marks all bytes that are not part of SomeDir as done. This generates two copies (todo and todo.original) of the result; I'll explain why in a minute.

With the generated todo file, we can let ddrescue run (though I used the run.sh script instead):

# Then run on the todo file
sudo ddrescue -d /dev/sdd2 backup.img todo -v -v

Since the generation of the todo file effectively threw away information (we can no longer see from the todo file what parts of the non-important sectors were already copied, or had errors, etc.), we need to keep the original backup.logfile around too. Using the todo.original file, we can figure out what the last run did, and update backup.logfile accordingly:

ddrescuelog backup.logfile --or-mapfile <(ddrescuelog --xor-mapfile todo todo.original) > newbackup.logfile

Note that you could also use SomeDir.list.done here, but actually comparing todo and todo.original helps in case there were any errors in the last run (so the error sectors will not be marked as done and can be retried later).
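
The logic of this update is easier to see in a toy model. The Python sketch below is a simplification I made up for illustration: each mapfile is reduced to the set of block numbers marked "finished", whereas real mapfiles track more states than that.

```python
def or_mapfile(a, b):
    """Finished in the result if finished in either input."""
    return a | b

def xor_mapfile(a, b):
    """Finished in the result if finished in exactly one input."""
    return a ^ b

# Toy disk with blocks 0..7, where SomeDir occupies blocks 4 and 5.
backup_log = {0}                    # rescued in earlier runs
notdone = {0, 1, 2, 3, 6, 7}        # SomeDir.list.notdone: all but SomeDir
todo_original = or_mapfile(backup_log, notdone)

# Pretend ddrescue rescued block 4 in this run, but hit an error on block 5.
todo = todo_original | {4}

# Only blocks whose status changed during the run survive the XOR...
newly_done = xor_mapfile(todo, todo_original)
# ...and get merged back into the real logfile.
new_backup_log = or_mapfile(backup_log, newly_done)

print(sorted(newly_done), sorted(new_backup_log))  # [4] [0, 4]
```

Merging SomeDir.list.done (blocks 4 and 5 in this toy) instead would have marked the failed block 5 as done as well.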

With backup.logfile updated, I could move on to the next subdirectories, and once all of the important stuff was done, I did the same with a list of all file contents to make sure that all files were properly backed up.

But wait, there's more!

Now, I had the contents of all files backed up, so the data was nearly safe. I did however find that the disk contained a number of hardlinks, and/or symlinks, which did not work. I did not dive into the details, but it seems that some of the metadata and perhaps even file content is stored in a special "metadata directory", which is hidden by the Linux filesystem driver. So my filefrag-based "All files"-method above did not back up sufficient data to actually read these link files from the backup.

I could have figured out where on disk these metadata files were stored and done a backup of that, but then I still might have missed some other special blocks that are not part of the regular structure. I could of course back up every block, but then I would be copying around 1000GB of mostly unused space, of which only a few MB or GB would actually be relevant.

Instead, I found that HFS+ keeps an "allocation file". This file contains a single bit for each block in the filesystem, to store whether the block is allocated (1) or free (0). Simply looking at this bitmap and backing up all blocks that are allocated should make sure I had all data, and only left unused blocks behind.

The position of this allocation file is stored in the volume header, just like the catalog file. In my case, it was stored in a single extent, making it fairly easy to parse.

The volume header says:

00000070  00 00 00 00 03 a3 60 00  03 a3 60 00 00 00 3a 36  |......`...`...:6|
          ^^^^^^^^^^^^^^^^^^^^^^^ Allocation file size, in bytes: 0x3a36000

00000080  00 00 00 01 00 00 3a 36  00 00 00 00 00 00 00 00  |......:6........|
                      ^^^^^^^^^^^ First extent size, in 4k blocks: 0x3a36
          ^^^^^^^^^^^ First extent offset, in 4k blocks: 0x1

This means the allocation file takes up 0x3a36 blocks of 4096 bytes, with 8 bits per byte. It can therefore store the status of 0x3a36 * 4k * 8 = 0x1d1b0000 blocks, which is rounded up from the total size of 0x1d1aa8f6 blocks.
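
As a quick cross-check of that rounding, here is the same arithmetic in Python (the variable names are mine; the values are read from the volume header dump earlier in this post):

```python
block_size = 0x1000          # bytes per allocation block
total_blocks = 0x1d1aa8f6    # total number of blocks in the volume
bits_per_block = block_size * 8

# Blocks needed to store one bit per allocation block, rounded up.
bitmap_blocks = -(-total_blocks // bits_per_block)
capacity = bitmap_blocks * bits_per_block

print(hex(bitmap_blocks), hex(capacity))  # 0x3a36 0x1d1b0000
```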

First, I got the allocation file off the disk image (this uses bash arithmetic expansion to convert hex to decimal, you can also do this manually):

dd if=/dev/backup of=allocation bs=4096 skip=1 count=$((0x3a36))

Then, I wrote a small Python script parse-allocation-file.py to parse the allocation file and output a ddrescue mapfile. I started out in bash, but that got tricky with bit manipulation, so I quickly converted to Python.

The first attempt at this script would just output a single line for each block, to let ddrescuelog merge adjacent blocks, but that would produce such a large file that I stopped it and improved the script to do the merging directly.

cat allocation | ./parse-allocation-file.py > Allocated.notdone

This produces an Allocated.notdone mapfile, in which all free blocks are marked as "finished", and all allocated blocks are marked as "non-tried".
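
The heart of such a parser fits in a few lines. The sketch below is not the actual parse-allocation-file.py, just a minimal illustration with hypothetical function names: it assumes the most significant bit of each byte describes the lowest-numbered block, and it only prints the block lines, not the mapfile header.

```python
def bitmap_runs(bitmap, total_blocks):
    """Yield (first_block, block_count, allocated) runs from the bitmap."""
    run_start, run_state = 0, None
    for block in range(total_blocks):
        # Assumed bit order: MSB of each byte is the lowest-numbered block.
        allocated = bool(bitmap[block // 8] & (0x80 >> (block % 8)))
        if run_state is None:
            run_state = allocated
        elif allocated != run_state:
            # State flipped: emit the finished run and start a new one.
            yield run_start, block - run_start, run_state
            run_start, run_state = block, allocated
    if run_state is not None:
        yield run_start, total_blocks - run_start, run_state

def mapfile_lines(bitmap, total_blocks, block_size=4096):
    # Allocated blocks still need copying ("?"), free blocks count as done ("+").
    for start, count, allocated in bitmap_runs(bitmap, total_blocks):
        status = "?" if allocated else "+"
        yield f"0x{start * block_size:08X}  0x{count * block_size:08X}  {status}"

# Tiny example: blocks 0-2 and 12-13 allocated, blocks 3-11 free.
example = bytes([0b11100000, 0b00001111])
for line in mapfile_lines(example, total_blocks=14):
    print(line)
```

Merging adjacent blocks into runs while parsing is what keeps the output small. Looping over every single bit like this is slow in Python for a disk with hundreds of millions of blocks, so treat it as an explanation rather than a tool.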

As a sanity check, I verified that there was no overlap between the non-allocated areas and all files (i.e. the output of the following command showed no done/rescued blocks):

ddrescuelog AllFiles.list.done --and-mapfile Allocated.notdone | ddrescuelog --show-status -

Then, I looked at how much data was allocated, but not part of any file:

ddrescuelog AllFiles.list.done --or-mapfile Allocated.notdone | ddrescuelog --show-status -

This marked all non-allocated areas and all files as done, leaving a whopping 21GB of data that was somehow in use, but not part of any files. This size includes stuff like the volume header, catalog, the allocation file itself, but 21GB seemed a lot to me. It also includes the metadata file, so perhaps there's a bit of data in there for each file on disk, or perhaps the file content of hard linked data?

Nearing the end

Armed with my Allocated.notdone file, I used the same commands as before to let ddrescue back up all allocated sectors and made sure all data was safe.

For good measure, I then let ddrescue continue backing up the remainder of the disk (i.e. all unallocated sectors), but it seemed the disk was nearing its end now. The backup speed (even during the "fast" first 30 seconds) had dropped to under 300kB/s, so I was looking at a couple more weeks (and thousands of powercycles) for the rest of the data, assuming the speed did not drop further. Since the rest of the backup should only be unused space, I shut down the backup and focused on the recovered data instead.

What was interesting was that during all this time, the number of reallocated sectors (as reported by SMART) had not increased at all. So it seems unlikely that the slowness was caused by bad sectors (unless the disk firmware somehow tried to recover data from these reallocated sectors in the background and locked itself up in the process). The slowness also did not seem related to which sectors I had been reading. I'm happy that the data was recovered, but I honestly cannot tell why the disk was failing in this particular way...

In case you're in a similar position, the scripts I wrote are available for download.

So, with a few days of work, around a week of crunch time for the hard disk and about 4,000 powercycles, all 1000GB of files were safe again. Time to get back to some real work :-)

Tagged as: ddrescue, Filesystems, HDD, HFS+, Linux, Recovery, SMART, Software, Work

Comments

elmo wrote at 2023-01-22 23:03

You can use the ddrescue -m option to provide a 'domain' mapfile that tells ddrescue to only work on that part of the disk. This avoids all the mapfile manipulation you had to go through.

Matthijs Kooijman wrote at 2023-01-31 17:59

Thanks for the tip, had I realized that option existed, that would have saved quite a bit of fiddling with mapfiles. I've added a remark to my post about this!

Wed, 07 Sep 2022

USB, Thunderbolt, Displayport & docks

USB-C logos

After I recently ordered a new laptop, I have been looking for a USB-C-connected dock to be used with my new laptop. This turned out to be quite complex, given there are really a lot of different bits of technology that can be involved, with various (continuously changing, yes I'm looking at you, USB!) marketing names to further confuse things.

As I'm prone to do, rather than just picking something and seeing if it works, I dug in to figure out how things really work and interact. I learned a ton of stuff in a short time, so I really needed to write this stuff down, both for my own sanity and future self, as well as for others to benefit.

I originally posted my notes on the Framework community forum, but it seemed more appropriate to publish them on my own blog eventually (also because there's no 32,000 character limit here :-p).

There are still quite a few assumptions or unknowns below, so if you have any confirmations, corrections or additions, please let me know in a reply (either here, or in the Framework community forum topic).

Parts of this post are based on info and suggestions provided by others on the Framework community forum, so many thanks to them!

Getting started

First off, I can recommend this article with a bit of overview and history of the involved USB and Thunderbolt technologies.

Then, if you're looking for a dock, like I was, the Framework community forum has a good list of docks (focused on Framework operability), and Dan S. Charlton published an overview of Thunderbolt 4 docks and an overview of USB-C DP-altmode docks (both posts with important specs summarized, and occasional updates too).

Then, into the details...

Lines, lanes, channels, duplex, encoding and bandwidth

The transmission of high-speed signals almost exclusively uses "differential pairs" for encoding. This means two wires/pins are used to transmit a single signal in a single direction (at any one time). In this post, I will call one such pair a "line".
A single line can either be simplex (unidirectional, transmitting in the same direction all the time) or half-duplex (alternating between directions).
Two lines (one for each direction) can be combined to form a full-duplex channel, where data can be transmitted in both directions simultaneously. Sometimes (e.g. in USB, PCIe) such a pair of lines (so 4 wires in total) is referred to as a "(full duplex) lane", but since the term "lane" is also used to refer to a single unidirectional wire pair (e.g. in DisplayPort), I'll use "full duplex channel" for this instead.
I'll still sometimes use "lane" when referring to specific protocols, but I'll try to qualify what I mean exactly then.
A line will be operated at a specific bitrate, which is the same as the available bandwidth for that line. Multiple lines can be combined, leading to higher total bandwidth (but the same bitrate).
To really compare different technologies, it is often useful to look at the bitrate used on the individual lines, for which I'll use Gbps-per-line below. I'll use "total bandwidth" to refer to the sum of all the lines used in either direction, and "full duplex bandwidth" means the bandwidth that can be used in both directions at the same time.
Most of the numbers shown here are for bits on the wire. However, that does not represent the amount of actual data being sent. For example, SATA, PCIe 2.0, USB 3.0, HDMI, and DisplayPort 1.4 transmit 10 bits on the wire for every 8 bits of data, otherwise known as 8b/10b encoding. This means USB 3.0 is more like 4 Gbps instead of 5 Gbps. That data still includes protocol overhead, such as framing, headers, error correction, etc., so the effective data transfer rate (i.e. data written to disk) is lower still.
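
To make the encoding overhead concrete, here is a minimal Python sketch (payload_rate is just a helper name I picked for this illustration):

```python
def payload_rate(line_rate_gbps, data_bits=8, wire_bits=10):
    """Data rate left after line encoding, before any protocol overhead."""
    return line_rate_gbps * data_bits / wire_bits

print(payload_rate(5))  # USB 3.0 with 8b/10b encoding: 4.0
```

Other encodings can be plugged in through the data_bits and wire_bits parameters.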

USB1/2/3

USB1/2: single half-duplex pair up to 480Mbps half-duplex.
USB1/2 link setup happens using pullups/pulldowns on the datalines, always starting up at USB1 speeds (low-speed/full-speed) and then using USB data messages to coordinate the upgrade to USB2 speeds (hi-speed).
When a device using USB1 speeds ("low speed" or "full speed") is connected to a USB2 hub, the difference in speed is translated using a "Transaction Translator". Most hubs have a single TT for the entire hub, meaning that when multiple such devices are connected, only one of them can be transmitting at a time, and they have to share the 12Mbps (full speed) bandwidth amongst them. In practice, this can sometimes lead to issues, especially with devices that need guaranteed bandwidth, like USB audio devices. Note that this applies even to modern devices that support USB2, but only implement the slower transmission speeds (as audio devices, mice, keyboards, etc. often do). Some hubs have multiple TTs, meaning each downstream port has the full 12Mbps available, making multi-TT hubs a better choice (unfortunately, manufacturers usually do not specify this, but you can use lsusb -v to tell).
USB3.0: Adds two extra lines at 5Gbps-per-line (== 1 full-duplex channel, 10Gbps total bandwidth) using a new "USB3" connector with 4 extra data wires.
USB3 is a separate protocol from USB1/2 and works exclusively over the newly added lines. This means that a USB3 hub is essentially two fully separate devices in one (USB1/2 hub and USB3 hub) and USB3 traffic and USB1/2 traffic can happen at the same time over the same cable independently.
USB3.0 link setup uses "LFPS link training" pulses (where both sides just start to transmit on their TX line and listen on their RX line). [source]
USB3.0 over a USB-C connector only uses 2 of the 4 high-speed lines available in the connector and cable.
USB3.1 increases single line speed to 10Gbps (with appropriate cabling), for 20Gbps total bandwidth.
USB3.2 allows use of all 4 high-speed lines in a USB-C connection at the same 10Gbps-per-line speed of 3.1, for 40Gbps total bandwidth.
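
Since "total" and "full-duplex" bandwidth are easy to mix up, here is how the numbers above relate, as a small Python sketch (the table and helper are mine, filled in from the list above):

```python
USB_MODES = {
    # name: (high-speed lines in use, Gbps per line)
    "USB3.0": (2, 5),
    "USB3.1": (2, 10),
    "USB3.2": (4, 10),
}

def bandwidth(lines, gbps_per_line):
    """Return (total, full-duplex) bandwidth in Gbps."""
    total = lines * gbps_per_line
    full_duplex = total // 2  # half of the lines point in each direction
    return total, full_duplex

for name, (lines, rate) in USB_MODES.items():
    total, duplex = bandwidth(lines, rate)
    print(f"{name}: {total}Gbps total, {duplex}Gbps full-duplex")
```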

DisplayPort

[source]

Displayport is essentially a high-speed 4 line unidirectional connection (main link) for display and audio data, plus one lower speed half-duplex auxiliary channel.
The auxiliary channel is used for things like EDID (detecting supported resolution) and CEC (controlling playback and power state of devices). Deciding the link speed to use on the high-speed lines is done using "link training" on those lines themselves, not using the aux channel. [source]
Unlike VGA/DVI/HDMI, Displayport is based on data packets (similar to ethernet), rather than a continuous (clocked) stream of data on specific pins.
Displayport speed evolved over time: 4x1.62=6.48Gbps for DP1/RBR, 4x2.7=10.8Gbps for DP1/HBR, 4x5.4=21.6Gbps for DP1.2/HBR2, 4x8.1=32.4Gbps for DP1.3/HBR3, 4x10=40Gbps for DP2.0/UHBR10, 4x13.5=54Gbps for DP2.0/UHBR13.5, 4x20=80Gbps for DP2.0/UHBR20. All versions use the same 4-line connection, just driving more data over the same wires (needing better quality cables). [source]
Displayport 1.4 introduced optional Display Stream Compression (DSC), which is a "visually lossless" compression that can reduce bandwidth usage by a factor of 3. Starting with DisplayPort 2.0, supporting DSC is mandatory.
Displayport 1.2 introduced Multi-Stream Transport (MST), which allows sending data for multiple displays (up to 63 in theory) over a single DP link (multiplexed inside the datastream, so without having to dedicate lines to displays). Such an MST signal can then be split by an MST hub (often a 2-port hub inside a display to allow daisy-chaining displays). MST is also possible for DP inside a USB-C or thunderbolt connection. It can also use DSC (compression) on the input and uncompressed on the output, to drive displays that do not support compression. [source]
Displayport dualmode (aka DP++) allows sending out HDMI/DVI signals from a displayport connector. This essentially just repurposes pins on the DP connector to transfer HDMI/DVI signals, so can be used with a passive adapter or cable (though the adapter does need to do voltage level shifting). This only works to connect a DP source to a HDMI/DVI sink (display), not the other way around. This also repurposes two additional CONFIG1/2 pins from being grounded to carry HDMI/DVI signals, so dual-mode is not available on a USB-C DP alt mode connection. [source]
Some Displayport devices support Adaptive Sync, where the display refresh rate is synchronized with the GPU framerate. This seems to still work when DisplayPort is tunneled through Thunderbolt or USB4, but often breaks when going through an MST hub (though some explicitly document support for it and probably do work). [source]

Thunderbolt 1/2/3

[source]

Thunderbolt 1/2 is an Apple/Intel specific protocol that reuses the mini-displayport connector to transfer both PCIe (for e.g. fast storage or external GPUs) and DP signals using four 10Gbps high-speed lines (40Gbps total bandwidth) configured as two 10Gbps full-duplex channels (TB1) or a single 20Gbps full-duplex channel (TB2). [source].
It is a bit unclear to me how these signals are multiplexed. Apparently TB2 combines all four lines into a single full-duplex channel, suggesting that on TB1 there are two separate channels, but does that mean that one channel carries PCIe and one carries DP on TB1? Or each device connected is assigned to (a part of) either channel?
TB1/2 are backwards compatible with DP, so you can plug in a non-TB DP display into a TB1/2 port. This uses a multiplexer to connect the port to either the TB controller or the GPU. This mux is controlled by snooping the signal lines and detecting if the signals look like DP or TB. In early Light Ridge controllers, this was external circuitry, from Cactus Ridge all this (including the mux) was integrated in the TB controller. [source: comments below]
TB1/2/3 ports can be daisychained with up to 6 devices in the chain. Hubs (i.e. a split in the connection) are not supported until TB4. The last device in the chain can be a non-TB DP display as well. [source]
USB (1/2/3) over thunderbolt is not supported directly, but is implemented by putting a full USB-host controller in a thunderbolt hub/dock that is then controlled by the host through PCIe-over-thunderbolt.
Thunderbolt 3 is an upgrade of TB2 that uses a USB-C connector instead of mini-displayport. It uses 4 high-speed lines at 20Gbps-per-line (double that of TB2), to get 40Gbps full-duplex, or 80Gbps total bandwidth (and maybe also the sidechannel SBU line for link management or so, but I could not find this). [source]
Thunderbolt 4 seems to be essentially just USB4 with all optional items implemented (but USB4 is more like TB3 than USB3...).
TB1/2/3 are only implemented by Intel chips, and TB1/2 is used almost exclusively by Apple. TB4/USB4 can be implemented by other chip makers too, but this has not happened yet I believe.

USB-C

[source]

USB-C is a specification of a connector, intended to be used for USB by default, but also for other protocols (usually in addition to USB) and not tied to any particular USB version.
USB-C has a huge number of pins (24 pins), consisting of:
- 4 VBUS and 4 GND pins to allow delivering more current,
- 4 high-speed lines (8 pins), typically used as two full-duplex channels,
- USB2 D+/D- pins (duplicated, to allow reverse insertion without additional circuitry)
- Two CC/VCONN pins for low-speed connection setup, USB-PD negotiation, altmode setup, querying the cable, etc.
- Two SBU sidechannel pins as an additional lower speed line.
USB-C cables come in different flavors: [source]
- The simplest cables have just power and USB2 pins, and do not support anything other than USB1/2. Cables that do connect all pins are apparently called "full-featured", though these might still not support the higher (10/20Gbps-per-line) speeds.
- Cables can be passive (direct copper connections), or active (with electronics in the cable to repeat/amplify a signal to allow higher speeds and longer cables).
- Active cables typically seem to be geared towards (and/or limited to) a particular signalling scheme and/or direction. This means that they are usually not usable for protocols they are not explicitly designed for. This partially comes down to the bitrates used, but also the exact low-level signalling protocol (e.g. USB3 Gen2, USB4 Gen2 and DP UHBR 10 all use 10Gbps-per-line, but use different encodings, error correction and/or direction, meaning active cables might support one, but not the other).
- Passive cables can generally be used more flexibly, mainly limited to a certain bitrate based on cable quality and length.
- Cables can also contain a "e-marker" chip that can be queried using USB-PD messages on the CC pin, containing info on the cable's spec. Such a chip is required for 5A operation (all other cables are expected to allow up to 3A) and for all USB4 (even the lower-speed 10Gbps-per-line USB4) operation.
- Cables with "e-marker" chip but no signal conditioning are still considered passive cables.
- USB4/TB4 requires an e-marker chip in the cable (falling back to USB3 or TB3 without it). Passive cables (including passive TB3 cables) can always be used for USB4 (with the exact speed depending on the cable rating), while active cables can only be used when they are rated for USB4 (so active USB3 cables cannot be used for USB4, but active USB4 cables can be used for USB3). [source]
USB-C uses the CC pins for detecting a connection and deciding how to set it up. A host (or downstream facing port or sourcing port) has pullups on both CC pins (value depends on available current), a device (or upstream facing port or sinking port) has pulldowns on both pins (value is fixed). Together this forms a voltage divider that allows detecting a connection and maximum current. A cable connects only one of the CC pins (leaving the other floating, or pulling it down with a different resistor value) which allows detecting the orientation of the cable. [source]
After initial connection, the CC pin is also used for other management communication, such as USB-PD (including setting up alt modes and USB4), or vendor-specific extensions.
The other CC pin (pulled down inside the cable) is repurposed as VCONN and used to power any active circuitry (including the e-marker) inside the cable.
USB-C can provide up to 15W (5V×3A) on the VBUS pins with passive negotiation using resistors on the CC pins. USB-PD uses active negotiation using data messages on the CC pin and allows up to 100W (20V×5A) of power on the VBUS pins and allows reversing the power direction as well (i.e. allow a laptop USB-C port to receive power instead of supplying it).
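
To get a feeling for the CC voltage divider, here is a small Python sketch. The resistor values are not taken from the sources linked above: they are commonly cited values for a 5V pullup that I am assuming here, so check the USB-C specification before relying on them.

```python
RD_OHMS = 5_100  # pulldown in the device (assumed typical value)
RP_OHMS = {      # pullup to 5V in the host, per advertised current (assumed)
    "default USB": 56_000,
    "1.5A": 22_000,
    "3A": 10_000,
}

def cc_voltage(rp_ohms, rd_ohms=RD_OHMS, supply=5.0):
    """Voltage on the connected CC pin, as seen by the device."""
    return supply * rd_ohms / (rp_ohms + rd_ohms)

for label, rp in RP_OHMS.items():
    print(f"{label}: {cc_voltage(rp):.2f}V")
```

A stronger pullup (lower resistance) gives a higher voltage on the CC pin, which is how the device can tell how much current it may draw without any data messages being exchanged.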

USB-C alt modes

USB-C alt modes allow using the pins (in particular the 4 high-speed lines and the lower speed SBU line) for other purposes.
Alt modes are negotiated actively through the CC pin using USB-PD VDM messages. Once negotiated, this allows using these pins with different physical signalling, including different directions. This means that e.g. a USB-C-to-DP adapter using DP alt mode can be mostly a passive device, though it does need some logic for the USB-PD negotiation and might need some switches to disconnect the wires until the alt mode is enabled.
Alt modes are single-hop only (hubs might connect upstream and downstream alt mode channels, but there is no protocol to configure this AFAICT).
Alt modes are technically separate from the USB protocol (version) supported (i.e. they can be configured without having to do USB communication), but the bandwidth capabilities do introduce some correlation (i.e. DP alt mode 2.0 supports DP 2.0 and might need up to 20Gbps-per-line, so effectively needs a USB4-capable controller, cable and hub/dock).
DP alt mode (and probably others) can use 1, 2, or all 4 of the high-speed lines (plus the SBU line for the aux channel). When using only 1 or 2 lines, the 2 other lines can still be used for USB3 traffic (just using 1 line for USB is not possible, USB3/4 always needs a full-duplex channel, so 2 or 4 lines). In all cases, the USB2 line can still be used for USB1/2.
Because in DP alt mode all four lines can be used unidirectionally (unlike USB, which is always full-duplex), this means the effective bandwidth for driving a display can be twice as much in alt mode than when tunneling DP over USB4 or using DisplayLink over USB3.

In practice though, DP-altmode on devices usually supports only DP1.4 (HBR3 = 8.1Gbps-per-line) for 4x8.1 = 32.4Gbps of total and unidirectional bandwidth, which is less than TB3/4 with its 20Gbps-per-line for 2x20 = 40Gbps of full-duplex bandwidth. This will change once devices start supporting DP-altmode 2.0 (UHBR20 = 20Gbps-per-line) for 4x20=80Gbps of unidirectional bandwidth.
It seems that DP alt mode can only be combined with USB3, not TB3 or USB4 (USB4 and DP alt mode both need the SBU pins). Since DP can be tunneled over TB/USB4, there is not much need for DP alt mode when using TB/USB4, except that it could potentially have allowed more bandwidth (2x20Gbps unidirectional DP2.0 plus 1x20Gbps full-duplex TB/USB4).
HDMI alt mode exists, but is apparently not actually implemented by anyone. USB-C-to-HDMI adapters typically seem to use DP alt mode combined with an active DP-to-HDMI converter. I'm assuming this also holds for the Framework HDMI expansion cards. [source]
Thunderbolt 3 is also implemented as an alt mode.
Audio Adapter Accessory Mode allows sending analog audio data over a USB-C connector (using the USB2 and sideband pins). This looks like another alt mode, but instead of active negotiation over the CC pin, this mode is detected by just grounding both CC pins, so adapters for this mode can be completely passive. [source]
Ethernet or PCIe are sometimes mentioned as alt modes, but apparently do not actually exist. [source]

USB4 / Thunderbolt 4

[source] and [source] and [source]

USB4 is again really different from USB3 and looks more like Thunderbolt 3 than USB3.
USB4 does not specify any device types by itself (like USB1/2/3), but is a generic bulk data transport that allows tunneling of other protocols, like PCIe, DisplayPort and USB3.
USB4 uses the 4 USB-C high-speed lines at up to 20Gbps-per-line for 40Gbps full-duplex and 80Gbps total bandwidth.
USB4 link setup happens using USB-PD messages over the CC pins, falling back to USB3 setup if no USB4 link is detected within a second or so (so in that sense USB4 is also more like Thunderbolt 3 alt mode than USB3). [source]
USB4 also uses two additional pins (sideband channel) of the USB-C connector for further link initialization and management, so it needs USB-C and cannot be used over USB3-capable USB-A/USB-B/USB-micro connectors and cables. [source]
Like with TB3, this tunneling happens on a data level: The physical layer signalling is defined by USB4 and different tunneled protocols are mixed in the same bitstream (unlike USB-C alt mode, where individual lines are completely allocated to some alt mode, which also determines the physical signalling used). This also means that DP alt mode and USB4-tunneled-displayport are two different ways to transfer DP over a USB4-capable connection.
Unlike with TB3, USB2 is not tunneled but still runs in parallel over the USB2 pins.
Unlike with TB3, USB3 is tunneled directly over USB4 and uses the host USB3 controller with only USB3 hubs in docks (though I guess docks could still choose to include PCIe connected USB controllers). In some cases, this might result in lower total USB bandwidth (10 or 20Gbps shared by all devices) compared to the TB3 approach of (sometimes multiple) USB3 controllers in the dock that are accessed through tunneled PCIe.
USB4 end devices have most features optional. USB4 hosts require 10Gbps-per-line, tunneled 10Gbps USB3, tunneled DP, and DP altmode (unsure if this requires DP altmode 2.0), leaving 20Gbps-per-line, tunneled 20Gbps USB3, tunneled PCIe, TB3 and other altmodes optional. USB4 hubs require everything, except 20Gbps USB3 and other altmodes. [source]
Thunderbolt 4 is (unlike thunderbolt 3) not really a protocol by itself, but is essentially just USB4 with all or most optional features enabled, and with additional certification of compatibility and performance. In particular, TB4 ports must support 20Gbps-per-line, computer-to-computer networking, PCIe, charging through at least one port, wake-up from sleep, 2x4K@60 or 1x8K@60 display, up to 2m passive cables. I have not been able to find the actual specification to see what these things really mean technically (i.e. are these display resolutions required over USB4, or altmode (2.0), must they be uncompressed, from which states does this wakeup wake up, etc.). [source] and [source]
TB1/2/3 were proprietary protocols, implemented only on Intel chips. USB4 (and also TB4) are (somewhat) more open and can be implemented by multiple vendors (though I could find the USB4 spec easily, but not the TB4 spec, or the TB3 spec which is required for compatibility, or the DP alt mode spec, which is required for USB4 ports).
USB4 hubs must also support TB3 for compatibility; controllers and devices may support TB3. [source]
Different USB3/4 transfer modes have been renamed repeatedly (i.e. USB3.0 became USB3.1 gen1 and then USB3.2 gen1x1). Roughly, gen1 (aka SuperSpeed aka SuperSpeed USB) means 5Gbps-per-line, gen2 (aka SuperSpeed+ aka SuperSpeed USB 10Gbps, or SuperSpeed USB 20Gbps aka USB4 20Gbps for gen2x2) means 10Gbps-per-line, gen3 (aka USB4 40Gbps) means 20Gbps per line. No suffix or x1 means using only one pair of lines, x2 means two pairs of lines (so e.g. gen2x2 uses all four lines, at 10Gbps-per-line, for 20Gbps full-duplex and 40Gbps of total bandwidth). Also, you can mostly ignore the USB versions (i.e. USB3.1 gen1 and USB3.2 gen1x1 are the same transfer mode), except for USB4 gen2, which is also 10Gbps-per-line, but uses different encoding. [source]
A USB4 hub now effectively contains 3 separate USB hubs: A USB1/2 hub, a USB3 hub, and a USB4 hub (aka USB4 router). As with USB3 hubs, the USB1/2 hub is completely distinct, always connected to the USB-C connectors and its communication runs in parallel with the USB3/4 communication. The USB3 (superspeed) hub can either connect directly to the USB-C connectors (when a USB3 device or host is connected), or it can connect through the USB4 router to receive / send USB4-tunneled-USB3 traffic. USB3 traffic seems to be always tunneled over just one hop in the connection, so a host -> USB4 hub -> USB4 hub -> USB3 device connection is tunneled in USB4 from the host to the first hub, is then unpacked into the USB3 hub, then tunneled again for the trip to the second hub, unpacked again into the USB3 hub inside the second hub, and then to the USB3 device directly over the USB-C (or USB-A) connector without further tunneling. [source]
PCIe tunneling works similarly to USB3 tunneling: the tunneling is done over a single hop, unpacked into a PCIe switch, and then tunneled again if further hops are needed. [source]
DP tunneling works differently: there, a single tunnel can stretch over multiple hops, all the way from the DP source in the USB4 host towards the DP output in the last USB4 hub (or USB4-capable display). [source]
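
To make the difference between per-hop and end-to-end tunnels concrete, here is a minimal sketch of my understanding described above. The function name `tunnel_segments` and the router names are made up for illustration, and the model only covers the USB4 routers in the chain (not the final USB3 or DP device behind the last router).

```python
from typing import List, Tuple


def tunnel_segments(routers: List[str], protocol: str) -> List[Tuple[str, str]]:
    """Return the (start, end) pair of every tunnel along a chain of USB4 routers.

    `routers` lists the USB4 routers in order, host first. This is a toy model
    of the behaviour described in the text, not something taken from the spec.
    """
    if len(routers) < 2:
        return []  # no USB4 link, so nothing is tunneled
    if protocol == "DP":
        # A DP tunnel can span all hops: one tunnel from first to last router.
        return [(routers[0], routers[-1])]
    if protocol in ("USB3", "PCIe"):
        # USB3 and PCIe are unpacked (into a hub or switch) at every router,
        # so each hop gets its own tunnel.
        return list(zip(routers, routers[1:]))
    raise ValueError(f"unknown protocol: {protocol}")


chain = ["host", "hub1", "hub2"]
print(tunnel_segments(chain, "USB3"))
print(tunnel_segments(chain, "DP"))
```

For the host -> USB4 hub -> USB4 hub example, this gives two USB3 tunnels (one per hop) but only a single DP tunnel.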

Thunderbolt / USB4 hardware

Thunderbolt 1/2/3 is exclusively implemented by Intel controllers. USB4/TB4 (including TB3 compatibility) is open for others to implement too (but currently Intel controllers are still used mostly).
Intel Thunderbolt and USB4 controller families are all named "Something Ridge". Initially there were quite a few families implementing TB1, then Falcon Ridge implemented TB2, Alpine Ridge and Titan Ridge implemented TB3, and Maple Ridge and Goshen Ridge implement TB4 [source].
Most of these controllers can be used in hosts, devices and hubs, but some of them are exclusively for devices/hubs (such as Goshen Ridge) or hosts.
Originally, thunderbolt was implemented using discrete thunderbolt controller chips, which were connected to the CPU using high-speed PCIe and DP traces on the mainboard, making thunderbolt support a complex and expensive thing. With Intel Ice Lake and Tiger Lake, the thunderbolt controller is integrated in the processor, making the implementation cheaper and more power efficient. [source] and [source]

Understanding tunneling and bandwidth limitations

[source]

Looking at how Thunderbolt/USB4 hardware is constructed, you typically have a controller chip (router in USB4 terminology) that handles the TB/USB4 traffic (and usually also TB3/USB3 backward compatibility). Both the host and the dock (or real TB device) have such a controller (often the same controller can be used in hosts, docks/hubs and devices).
These controllers have one or more TB outputs, connected to TB (USB-C) connectors, forming the TB link between them.
These controllers also have other interfaces (adapters in USB4 terminology), such as DP inputs or outputs, PCIe upstream/downstream interfaces, or USB upstream/downstream interfaces.
- In the host these interfaces are usually connected to the CPU/GPU. USB links are usually internally connected to a USB controller integrated into the controller.
- In a dock/hub DP interface(s) are usually connected to output connectors (DP, HDMI, TB or USB-C). These are often multiplexed, so there might be 2 DP interfaces available that can be routed to any two of the available outputs, or they can be connected through an internal MST hub.
- In a dock/hub PCIe interface(s) are usually internally connected to devices (e.g. PCIe ethernet) or exposed (e.g. PCIe NVMe slot). If multiple PCIe lanes are available, they can be connected to the same device, or split among multiple.
- In a dock/hub the USB interface(s) are usually internally connected to USB2/3 hubs (often integrated into the TB/USB4 controller).
- Thunderbolt ports can usually be internally connected to the TB interface, a DP interface (for DP alt mode) or a USB3 controller (for non-thunderbolt USB compatibility).
Often these interfaces provide a transparent connection, where e.g. the GPU negotiates a certain DP bitrate with an attached display through the tunnel, just as if the display was directly connected.
On Linux, to get some basic info on present TB controllers and devices and the used link speed, you can use boltctl (e.g. boltctl list -a).
On Linux, to see what interfaces are present on e.g. a TB dock, run echo "module thunderbolt +p" > /sys/kernel/debug/dynamic_debug/control, plug in your dock and check dmesg (which will call the USB4 controller/router in the dock a "USB4 switch" and its interfaces "Ports").
When looking at what limits / determines the bandwidth available when a tunneled (i.e. TB3 or TB4) link is involved, there are mainly two kinds of limit:
1. The total bandwidth of (each hop of) the link itself. This bandwidth is shared between all tunneled connections and depends on the protocol and cable used (e.g. 20/40Gbps for TB3/TB4/USB4 depending on cable and controller capabilities, TB3/TB4 controllers always support 40Gbps).
2. The bandwidth (and number) of individual tunnels. This is mostly limited by the hardware interfaces present on the controllers/routers used (e.g. dual 4xHBR2 interfaces on Alpine Ridge, dual 4xHBR3 interfaces on Titan Ridge). These limitations are mostly linked to the hardware used, not so much to the protocol (though the protocols do seem to typically specify minimum bandwidth/features for each port).
The second limit is the most complex to figure out, because it involves limitations on both ends of the connection (and for USB3/PCIe also any intermediate steps).
The link bandwidth is shared between all tunnels on a single port, but the tunnel bandwidth is often shared between all ports on the same controller.
When multiple tunnels run over the same link, only the actual bandwidth usage counts (except that DP does reserve bandwidth based on the maximum bandwidth of the negotiated link, see below). For example, a USB3 10Gbps tunnel with only a thumb drive connected will use very little bandwidth, leaving the rest available for other tunnels. Only when the combined bandwidth usage of all tunnels exceeds the link bandwidth will the bandwidth of each tunnel be limited.
The number of tunnels over a single link seems to be practically unlimited by the protocols, but only limited by the number of interfaces on the used controllers (so we could in theory see future controllers that support three or four DP interfaces, though that probably does not make too much sense given MST exists and total bandwidth is still limited). Maybe TB1/2/3 do have protocol limitations, but since it is unlikely that new controllers will be made for these, it won't matter anyway.
For DP:
- TB1 controllers support one 4xHBR interface (DP1.0), TB2 controllers support one 4xHBR2 interface (DP1.2), Alpine Ridge (TB3) supports two 4xHBR2 interfaces (2xDP1.2), Titan Ridge (TB3), Goshen/Maple Ridge (TB4) and Tiger Lake (TB4) support two 4xHBR3 interfaces (2xDP1.4).
- Each DP interface consists of four lanes, which cannot be split.
- Each of the interfaces (four lanes each) on the host controller can be routed to a single dock or display (where they can still be split into multiple DP or DP-alt-mode outputs using MST, but they cannot be split into multiple TB-tunneled-DP-links).
- Goshen Ridge seems to have no dedicated DP output pins, but it can still internally multiplex its two DP interfaces to all three of its downstream TB ports for DP altmode. Docks that have a dedicated DP connector usually sacrifice one of these TB ports (leaving only two downstream TB ports) which is then tied to the DP connector (and probably permanently configured in DP "altmode" in firmware).
- DP links can be fully transparent (a single DP link between GPU and display), or the tunnel can be a "DP LTTPR" (like an active cable) resulting in two DP links (GPU to host TB controller, dock TB controller to display). [source, section 2.2.10.2]
- Note that the full bandwidth on the newest controllers (2x4xHBR3 == 51.84Gbps after encoding) cannot be sent over a single TB3/TB4 port (40Gbps after encoding). But when using two TB ports on the same controller, both can transmit a full 4xHBR3 signal, allowing to use the full DP bandwidth of the controller.
- Bandwidth allocation for DP links happens based on the maximum bandwidth for the negotiated link rate (e.g. HBR3) and seems to happen on a first-come first-served basis. For example, if the first display negotiates 4xHBR3, this takes up 25.92Gbps (after encoding) of bandwidth, leaving only 2xHBR3 or 4xHBR1 for a second display connected.
  
  This means that the order of connecting displays can be relevant to the supported resolutions on each (on multi-output hubs with displays already connected, it seems the hub typically initializes displays in a fixed order).
  
  If the actual bandwidth for the resolution used is less than the allocated bandwidth, the extra bandwidth is not lost, but can still be used for other traffic (like bulk USB traffic, which does not need allocated / reserved bandwidth). [source]
- In some cases, it seems bandwidth can still be over-allocated if the OS knows that the full bandwidth will not be used (e.g. allocate 2x4xHBR3, knowing that with the resolutions in use, the total bandwidth will fit in the available 40Gbps). [source]
- MST hubs often seem to allocate maximum supported bandwidth whenever the first display is connected (to prevent the need for renegotiation when another display is added), which might significantly over-allocate. [source]
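
The first-come first-served allocation described above boils down to simple arithmetic, sketched below. This is just my reading of it, assuming the full 40Gbps of a TB3/TB4 link is available to DP and using the post-encoding per-lane rates from the table in the next section; the helper names are made up, and real controllers may well reserve some bandwidth for other traffic.

```python
# Effective per-lane DP rates in Gbps (after 8b/10b encoding).
DP_LANE_GBPS = {"HBR1": 2.16, "HBR2": 4.32, "HBR3": 6.48}
LINK_GBPS = 40.0  # assumption: the whole TB3/TB4 link can be allocated to DP


def dp_allocation(lanes: int, rate: str) -> float:
    """Bandwidth reserved for a DP tunnel: lane count times the max lane rate."""
    return lanes * DP_LANE_GBPS[rate]


def options_that_fit(remaining_gbps: float) -> list:
    """List the lanes x rate combinations that still fit, largest first."""
    fits = []
    for rate in DP_LANE_GBPS:
        for lanes in (1, 2, 4):  # the lane counts a DP link can negotiate
            needed = dp_allocation(lanes, rate)
            if needed <= remaining_gbps:
                fits.append((needed, f"{lanes}x{rate}"))
    return [name for _, name in sorted(fits, reverse=True)]


first_display = dp_allocation(4, "HBR3")   # first display negotiates 4xHBR3
remaining = LINK_GBPS - first_display      # what is left for the second one
print(f"first: {first_display:.2f} Gbps, remaining: {remaining:.2f} Gbps")
print(options_that_fit(remaining))
```

The largest option that still fits is 2xHBR3, with 4xHBR1 also fitting and 4xHBR2 not, which matches the example in the list above.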
For USB3:
- When using USB4/TB4, the USB3 controller is in the host, with hubs in downstream devices. This means the total USB3 bandwidth is 5, 10 or 20Gbps, limited by the host's USB3 controller (often integrated into the TB/USB4 controller), as well as any hubs along the way. This bandwidth is probably also shared between all (both) ports connected to the same TB/USB4 controller.
- USB3 tunnels are single-hop, so on a USB4/TB4 dock, the TB4/USB4 controller has an upstream USB interface (tunneled to the upstream host or hub) and one downstream interface for each downstream TB4 port (tunneled to the downstream devices or hubs) and an (integrated) USB3 hub that connects all these USB3 interfaces.
- When using TB3, the USB3 controller (and usually also one or more hubs) is in the (each) dock, connected to the host through tunneled PCIe. This means that the USB3 bandwidth is limited by the controllers in the docks (each controller having its own chunk of 5/10/20Gbps). When there are multiple USB3 controllers involved (inside the same dock, in multiple daisy-chained docks, or connected to multiple ports using the same host TB controller), the total bandwidth can be higher but is still limited by the available PCIe bandwidth.

Bandwidths & Encoding

[source] and [source] and [source]

As mentioned, different protocols use different bitrates and different encodings. Here's an overview of these, with the effective transfer rate (additional protocol overhead like packet headers, error correction, etc. still needs to be accounted for, so e.g. the effective transfer rate for a mass-storage device will be lower still). Again, these are single-line bandwidths, so total bandwidth is x4 (unidirectional) for DP and x2 (full-duplex) for TB/USB3.2/USB4.

PCIe 3.0 8 Gbps using 128b/130b = 7.877 Gbps
PCIe 6.0 64 Gbps using 242b/256b = 60.5 Gbps
DP 1.0/1.1 RBR 1.62Gbps using 8b/10b = 1.296Gbps
DP 1.0/1.1 HBR 2.70Gbps using 8b/10b = 2.16Gbps
DP 1.2 HBR2 5.40Gbps using 8b/10b = 4.32Gbps
DP 1.3/1.4 HBR3 8.10Gbps using 8b/10b = 6.48Gbps
DP 2.0 UHBR10 10Gbps using 128b/132b = 9.697Gbps
DP 2.0 UHBR13.5 13.5Gbps using 128b/132b = 13.091Gbps
DP 2.0 UHBR20 20Gbps using 128b/132b = 19.394Gbps
USB 3.x 5 Gbps (gen 1) using 8b/10b = 4 Gbps
USB 3.x 10 Gbps (gen 2) using 128b/132b = 9.697 Gbps
USB4 10 Gbps (gen 2) uses 64b/66b = 9.697 Gbps
USB4 20 Gbps (gen 3) uses 128b/132b = 19.39 Gbps
Thunderbolt 3 10.3125Gbps uses 64b/66b = 10Gbps
Thunderbolt 3 20.625Gbps uses 64b/66b = 20Gbps

Note that for TB3, the rounded 10/20Gbps speeds are after encoding, so the raw bitrate is actually slightly higher. Thunderbolt 4 is not listed as (AFAICT) this just uses USB4 transfer modes.
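
The effective rates in the list above are just the raw bitrate multiplied by the encoding efficiency (payload bits divided by total bits). Here is a small sketch to reproduce a few of them; the function name is mine, and the rows are a hand-picked subset of the list.

```python
def effective_gbps(raw_gbps: float, payload_bits: int, total_bits: int) -> float:
    """Raw line rate times encoding efficiency, e.g. 8b/10b -> 8/10."""
    return raw_gbps * payload_bits / total_bits


# (name, raw Gbps per line, payload bits, total bits), values from the list above
rows = [
    ("PCIe 3.0", 8.0, 128, 130),
    ("DP 1.3/1.4 HBR3", 8.1, 8, 10),
    ("USB4 gen 3", 20.0, 128, 132),
    ("Thunderbolt 3", 20.625, 64, 66),
]

for name, raw, payload, total in rows:
    print(f"{name}: {effective_gbps(raw, payload, total):.3f} Gbps")
```

This also shows why TB3 ends up at a round 20Gbps: its raw bitrate of 20.625Gbps is chosen such that the 64b/66b overhead cancels out exactly.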

Displaylink

[source]

Displaylink is a technology (made by the company with the same name) that allows sending display data over USB2/3.
Some DisplayLink chips also support ethernet and analog audio, presumably by just presenting these as parts of a composite device.
Displaylink is used in docks as well as inside monitors directly, to allow sending video without needing any specific hardware support other than USB2/3. A driver for these devices is needed in the OS though.
It seems at least some DisplayLink devices are supported in Linux, though mostly using a closed-source binary blob. There is also basic support in open-source drivers, but these are apparently reverse-engineered, so I suspect DisplayLink is not forthcoming with documentation or code themselves. [source]
DisplayLink performance is limited, mainly because even though the laptop GPU can apparently be used for some acceleration, actually transferring the resulting display data is largely a software/driver affair, involving compression and wrapping data in USB packets. Performance can also be affected by other activity on the bus. DisplayLink seems to be mostly intended for office work, less for video and gaming (unlike other solutions, like DP alt mode or tunneling DP over Thunderbolt or USB4, where the GPU takes care of generating the DisplayPort data stream, which is delivered directly to the USB/Thunderbolt controller without involving software, and which allocates dedicated bus bandwidth for that stream). [source]

Docks / hubs

Docks come roughly in two flavors:
- "USB-C docks", without thunderbolt, that connect to the host using USB2, USB3 and/or DP-altmode. These are typically cheaper, but also run at lower bandwidth.
- "Thunderbolt 3/4" docks, which tunnel everything (including USB3) over Thunderbolt/USB4 (for TB3 even USB2.0 is tunneled, for TB4 it runs over its own wires). These typically have higher upstream bandwidth and more features.
- USB4 (non-Thunderbolt) docks could technically exist too, but since TB4 is really just USB4 with most optional features made mandatory (and some additional certification), and most of these optional features are already mandatory for USB4 hubs (including even TB3 compatibility), it seems likely that such hubs will just be made TB4-compatible anyway.
Thunderbolt docks often also support a fallback mode where they work like a USB-C dock using USB2/3 and/or DP-alt-mode to connect to the host, at reduced bandwidth and features (i.e. PCIe connected devices are not supported).
USB-A ports on docks are fairly uncomplicated and will just support USB devices (using either USB2 or USB3), though note the caveat about single-TT vs multi-TT above.
Downstream USB-C ports are more tricky: I expect these will always support USB2 and/or 3, but it seems that often only a selection also supports TB3, USB4/TB4, or DP alt mode.
I suspect that if a dock is TB4-certified, it must probably support all TB4 features on all of its downstream USB4 ports too (which are probably going to be labeled "TB4" ports anyway). USB4 downstream ports are at least required to support DP alt mode.
Video on docks is complex, see below.
Other things like ethernet, SD-card readers and audio are typically just implemented using USB devices inside the dock, connected to an integrated USB2/3 hub.
Some docks have their ethernet (especially 2.5G/10G) connected through PCIe instead of USB. This often allows better performance (because these chips often offload more processing, and this moves bandwidth usage from the USB3 tunnel to the PCIe tunnel, which can be helpful depending on what else is connected), but does require specific drivers for the ethernet chip used (often shipped with OS's, though). See the source for a list of docks and the ethernet chips they use. [source]

Displays on docks / hubs

Here the situation becomes more complicated. There seem to be at least 6 ways that display data can be transferred over USB-C to a dock: DP alt mode 1.0, DP alt mode 2.0, tunneled through USB4, tunneled through TB3, using a PCIe-connected eGPU (where PCIe can run over TB3 or USB4), or using DisplayLink or a similar USB2/3 connected device.
AFAICS, DisplayLink has significant downsides (so I would avoid it), a PCIe-connected eGPU is quite expensive (so only used when you really need that extra GPU power), and the other options (altmode and tunneling DisplayPort) are largely equivalent (except for available bandwidth and thus resolution).
It seems DP-over-USB4 and DP-over-Thunderbolt (also TB3) can always be sent through multiple hops (i.e. when using multiple hubs or docks, or a single hub or dock with a thunderbolt or USB4 capable display connected using USB-C to the dock), provided the hubs or docks have downstream ports supporting Thunderbolt or USB4.
For DP altmode, multiple hops can be supported in theory, but this depends on how the dock is constructed. I'm not sure if docks actually implement this (and if they do, if they properly document how it works, on which downstream ports this works, etc.).
HDMI (or DP++) outputs on docks typically just take a DP signal and add an active DP-to-HDMI converter. These converters have been seen to cause problems with some displays, so it might be better to avoid HDMI and use DP exclusively. [source]
How video is routed inside docks can apparently also be complex. For example the HP Thunderbolt dock G2 has a TB3 input, which contains two tunneled DP signals, one of which is split using MST into three outputs, two available on DP++ connectors and one available either as DP alt mode on a downstream USB-C connector, or converted into a VGA connector. The second DP signal is forwarded entirely to the downstream TB3 port. [source]
Since MacOS does not support MST, any docks that document support for > 2 display outputs on MacOS likely use DisplayLink, while docks that document support for > 2 display outputs on Windows/PC, but only 2 on MacOS likely use MST.
I also recall reading something (on this forum?) about a dock that uses DisplayLink for the first two monitor outputs, and DP alt mode only when three outputs are used, but I cannot seem to find the posting again (also, it seems like the decision of how to send display data to the dock is made by the OS, the dock can only decide or limit how it outputs the signals sent, I guess?).
Displays with a USB-C connector are usually internally similar to a USB-C or Thunderbolt dock supporting video input through DP-altmode and/or tunneling over Thunderbolt combined with USB over the remaining wires or also tunneled. Displays with integrated USB ports can sometimes be configured in 2-line USB3.0 + 2-line DP-alt-mode, or USB2.0-only mode, leaving all 4 lines available for a full-bandwidth 4-lane DP-alt-mode connection.
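
The MacOS-vs-Windows rule of thumb mentioned above can be written down as a tiny decision function. This is only a sketch of that heuristic (the function name and return strings are made up), so treat the outcome as an educated guess rather than a fact about any particular dock.

```python
def guess_multi_display_tech(max_displays_macos: int, max_displays_windows: int) -> str:
    """Guess how a dock drives its displays from the documented display counts.

    Heuristic from the text: MacOS does not support MST, so more than two
    displays on MacOS hints at DisplayLink, while more than two on Windows
    but only two on MacOS hints at MST.
    """
    if max_displays_macos > 2:
        return "likely DisplayLink"
    if max_displays_windows > 2 and max_displays_macos == 2:
        return "likely MST"
    return "cannot tell from display counts alone"


print(guess_multi_display_tech(max_displays_macos=3, max_displays_windows=3))
print(guess_multi_display_tech(max_displays_macos=2, max_displays_windows=3))
```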

Framework laptop

The Framework laptop supports USB4, and is intended to be TB4 certified, so I'm assuming that it already has the right hardware and software to support the entire TB4 featureset (though there might still be problems in either hardware or software, of course).
The Framework laptop GPU supports driving 4 simultaneous display outputs, including the builtin display (which can be disabled by closing the lid to allow driving 4 external monitors). [source]
All four USB-C ports can be used to output video, and multiple displays can be driven through one USB-C port with a DisplayPort MST hub in the dock or by separate DP-over-TB/USB4 tunnels. [source]
The USB-C ports are divided into two pairs, each pair sharing a single USB4 router / TB controller, and with two DP 4xHBR3 interfaces on each controller. This means at most two DP streams divided between the two ports in the pair (which includes both TB/USB4 tunnels and DP-alt-mode). [source, section 10, Framework uses UP3]
Using MST, multiple displays can be driven from a single DP stream, so this probably allows driving more than two displays from a single port (or pair of ports), but probably still limited to four displays in total (it is not entirely clear where in the Tiger Lake chip the MST-encoding happens).
The DisplayPort expansion card (and also the Tiger Lake TB controller) supports DisplayPort 1.4 (so not 2.0), I assume it is mostly a passive unit with DP altmode 1.0 then. [source]
Looking at the maximum uncompressed resolution supported by the Framework DP expansion card (5120x3200 60Hz): according to the wikipedia resolution list that needs 22.18Gbps total bandwidth, which is more than 4xHBR2 can carry and thus requires HBR3 (4 lines at 8.1Gbps-per-line), so it seems the expansion card supports all DP 1.4 bitrates (HBR3 is the highest available).
The expansion card also supports "8192x4320 60Hz uncompressed dual-port", which needs 49.65Gbps total bandwidth, which is indeed too much for HBR3, so I guess this means you'd need two DP expansion cards and two DP cables to a single monitor. [source] and [source]
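
As a quick sanity check of the two resolutions above, here is the arithmetic as a sketch. It assumes the Wikipedia bandwidth figures quoted above are data rates that can be compared against the post-encoding HBR3 rate (6.48Gbps per line, four lines per link); the names are made up.

```python
import math

HBR3_LANE_GBPS = 6.48  # 8.1 Gbps raw per line, after 8b/10b encoding
LANES_PER_LINK = 4
HBR3_LINK_GBPS = HBR3_LANE_GBPS * LANES_PER_LINK  # 25.92 Gbps per DP link


def dp_links_needed(required_gbps: float) -> int:
    """Number of full 4xHBR3 links needed to carry the given data rate."""
    return math.ceil(required_gbps / HBR3_LINK_GBPS)


print(dp_links_needed(22.18))  # 5120x3200 60Hz
print(dp_links_needed(49.65))  # 8192x4320 60Hz
```

So 5120x3200 fits in a single 4xHBR3 link, while 8192x4320 needs two of them, consistent with the "dual-port" wording.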

That's it for now. Kudos if you made it this far! As always, feel free to leave a comment if this has been helpful to you, or if you have any suggestions or questions!

Update 2022-03-29: Added "Understanding tunneling and bandwidth limitations" section and some other small updates.

Update 2022-06-10: Mention boltctl as a tool for getting device info.

Tagged as: Displayport, Dock, Framework, Protocol, Thunderbolt, USB

Comments

Rob Fisher wrote at 2022-03-26 10:46

This has dramatically improved my understanding of USB, which is even more complicated than I thought. Excellent research!

Matthijs Kooijman wrote at 2022-03-26 23:40

> This has dramatically improved my understanding of USB, which is even more complicated than I thought. Excellent research!

Good to hear! And then I did not even dive into the details of how USB itself works, with all its descriptors, packet types, transfer modes, and all other complex stuff that is USB (at least 1/2, I haven't looked at USB3 in that much detail yet...).

Dirk Ballay wrote at 2022-04-03 19:59

Hi there are 4 corrupted links with a slipped ")". Just search for ")]".

Matthijs Kooijman wrote at 2022-04-03 20:44

Ah, I copied the markdown from my original post on Discourse, and it seems my own markdown renderer is slightly different when there are closing parenthesis inside a URL. Urlencoding them fixed it. Thanks!

Jens wrote at 2022-04-23 11:08

Thank you for sharing! Which dock (hub?) did you end up buying?

Matthijs Kooijman wrote at 2022-04-23 15:21

Hey Jens, good question!

I ended up getting the Caldigit TS4, which is probably one of the most expensive ones, but I wanted to get at least TB4 (since USB4 has public specifications, and supports wake-from-sleep, unlike TB3) and this one has a good selection of ports, the host port on the back side (which I think ruled out 70% of the available products or so - I used this great list to make the initial selection).

See also this post for my initial experiences with that dock.

Lukas wrote at 2022-09-04 16:27

> It is a bit unclear to me how these signals are multiplexed. Apparently TB2 combines all four lines into a single full-duplex channel, suggesting that on TB1 there are two separate channels, but does that mean that one channel carries PCIe and one carries DP on TB1? Or each device connected is assigned to (a part of) either channel?

No. Each Thunderbolt controller contains a switch. Protocol adapters for PCIe and DP attach to the switch and translate between that protocol and the low-level Thunderbolt transport protocol. The protocols are tunneled transparently over the switching fabric and packets for different protocols may be transmitted on the same lane in parallel.

Someone wrote at 2022-09-04 16:42

> TB1/2 are backwards compatible with DP, so you can plug in a non-TB DP display into a TB1/2 port. I suspect these ports start out as DP ports and there is some negotiation over the aux channel to switch into TB mode, but I could not find any documentation about this.

When Apple integrated the very first Copper-based Thunderbolt controller, Light Ridge, into their products, they put a mux and redrivers on the mainboard which switches the plug's signals between the Thunderbolt controller and the DP-port on the GPU. They also added a custom microcontroller which snoops on the signal lines, autodetects whether the signals are DP or Thunderbolt, and drives the mux accordingly.

The next-generation chip Cactus Ridge (still Thunderbolt 1) integrated this functionality into the Thunderbolt controller. So the signals go from the DP plug to the Thunderbolt controller, and if it detects DP signals, it routes them through to the GPU via its DP-in pins.

(Source: I've worked on Linux Thunderbolt bringup on Macs and studied their schematics extensively.)

MikeNap wrote at 2022-09-04 20:20

Amazing write up. I spent so long trying to patch together an understanding of TB docks for displays and this has helped TREMENDOUSLY. Thanks again

Omar wrote at 2022-09-04 21:25

I just "Waybaked*" your page. It's such a gem. :) Thank you for your writeup Matthijs!

*(archive.org)

cheater wrote at 2022-09-04 23:01

Interesting read. After reading it, though, I am still none the wiser on a situation I experienced recently. I have a Moto G6 phone. I bought a USB C dock for it, a "Kapok 11-in-1-USB-C Laptop Dockingstation Dual HDMI Hub" on Amazon. It would charge, but the video out didn't work, and the Ethernet didn't work either. It did work for my Steam Deck, though. How come? I understand some USB C hubs work for android phones, some don't, and I don't know how that works. How does one find a dock that will work with Android?

Matthijs Kooijman wrote at 2022-09-05 18:22

@Lukas, you replied:

> No. Each Thunderbolt controller contains a switch. Protocol adapters for PCIe and DP attach to the switch and translate between that protocol and the low-level Thunderbolt transport protocol. The protocols are tunneled transparently over the switching fabric and packets for different protocols may be transmitted on the same lane in parallel.

I understand this, but the questions in my post that you are responding to were prompted because the wikipedia page on TB2 ( en.wikipedia.org/wiki/Thunderbolt(interface)#Thunderbolt2 ) says:

> The data-rate of 20 Gbit/s is made possible by joining the two existing 10 Gbit/s-channels, which does not change the maximum bandwidth, but makes using it more flexible.

Which is confusing to me. I could imagine this refers to the TB controller having one adapter that supports 20Gbps DP or PCIe instead of 2 adapters that support 10Gbps each, but TB2 only supports DP1.2 (so only 4.32Gbps) and PCIe bandwidths seem to be multiples of 8Gbps/GTps (but I do see PCIe 2.0 that has 5GTps, so maybe TB1 supported two PCIe 2.0 x2 ports for 10Gbps each and TB2 (also?) supports one PCIe 2.0 x4 port for 20Gbps?).

Matthijs Kooijman wrote at 2022-09-05 18:30

@Someone, thanks for clarifying how DP vs TB1/2 negotiation worked, I'll edit my post accordingly.

@MikeNap, I feel your pain, great to hear my post has been helpful :-D

@Omar, cool!

@cheater, my first thought about video is that your dock might be of the DisplayLink (or some other video-over-USB) kind and your Android phone might not have the needed support/drivers. A quick google suggests this might need an app from the Play Store.

Or maybe the reverse is true and the dock only supports DP alt mode (over USB C) to get the display signal and your Android does not support this (but given it is a dual HDMI dock without thunderbolt, this seems unlikely, since the only way to get multiple display outputs over DP alt mode is to use MST and I think not many docks support that - though if the dock specs state that dual display output is not supported on Mac, that might be a hint you have an MST dock after all, since MacOS does not support MST).

As for ethernet, that might also be a driver issue?

Laurent Lyaudet wrote at 2022-09-07 19:05

Hello,

Thanks for this instructive post :)

I spotted some typos if it may help:

"can be combied" -> "can be combined"

"USB3.2 allows use of a all" -> "USB3.2 allows use of all"

"but a also" -> " but also"

"up to 20Gps-per-line" -> "up to 20Gbps-per-line"

"an USB3 10Gpbs" -> "an USB3 10Gbps"

"leaving ony" -> "leaving only"

"reverse-engineerd" -> "reverse-engineered"

"a single hub or dock with and a thunderbolt or USB4" -> "a single hub or dock with a thunderbolt or USB4"

"certifified" -> "certified"

Best regards, Laurent Lyaudet

Matthijs Kooijman wrote at 2022-09-07 19:31

@Laurent, thanks for your comments, I've fixed all of the typos :-D Especially "certifified" is nice, that should have been the correct word, sounds much nicer ;-p

Laurent Lyaudet wrote at 2022-09-08 15:41

Hello :)

Google does understand (incomplete URL because of SPAM protection):

/search?q=certifified+meme&source=lnms&tbm=isch&sa=X

But there is not yet an exact match for "certifified meme" ;P

AL wrote at 2023-03-07 15:06

Thanks you very much for your article!

I am interested in buying the Caldigit TS4. But when I kindly emailed caldigit to find out ways to work around these following issues:

non support of vlan ID by the network card
No Multi-Stream Transport (MST) support (due to lack of mac support grrr)
Firmware proprietary orientation (black box firmware)

By taking care to inform them of the expectations regarding the docks of the users of the laptop framework. And giving them your blog link and others.

I got this response from caldigit: " Currently our products are not supported in a Linux environment and so we do not run any tests or able to offer any troubleshooting for Linux ."

Finally, Matthijs could you tell us, please, why you didn't choose the lenovo dock? It is MST compatible, and supports vlan ID at rj45 level and seems officially supported under linux?

Thanks in advance !

Matthijs Kooijman wrote at 2023-03-07 15:33

@Al, thanks for your response, and showing us Caldigits responses to your question.

As for why I chose the Caldigit instead of the Lenovo - I think it was mostly a slightly better selection of ports (and a card reader). As for Linux support - usually nobody officially supports linux, but I've read some successful reports of people using the TS4 under Linux (and I have been using it without problems under linux as well - everything just works).

As for MST and VLANs - I do not think I really realized that these were not supported, but I won't really miss them anyway (though VLAN support might have come in handy - good to know it is not supported. Is this a hardware limitation or driver? If the latter, is it really not supported under Linux, then? It's just an intel network card IIRC?).

As for black-box firmware - that's something I do care about, but I do not expect there will be any TB dock available that has open firmware, or does Lenovo publish sources?

AL wrote at 2023-03-07 15:56

Thank you for your reply ! For the vlan, it seems to be tested, therefore! because the site indicated in 2019: www.caldigit.com/do-caldigit-docks-support-vlan-feature

Indeed, I don't know any free firmware for lenovo, I was trying to find out their feeling to libre software things... Because it's always annoying to be secure everywhere except at one place in the chain: 4G stack, intel wifi firmware , and dock..

Steve wrote at 2023-04-02 23:59

Hi Matthijs;

Thanks for the amazing writeup, very useful. I just got my framework laptop and I bought the TS4 largely because of your recommendation (no pressure! :D ) but also because all the work you put in, it seemed the 'most documented'.

I have a question regarding usage of USB ports. I've got my framework laptop plugged into it with the thunderbolt cable. I have a 10-port StarTech USB 3 hub and I plug most everything into that, then that hub is plugged into one of the ports on the TS4 (the lower unpowered one, specifically -- The 10 port hub has its own power).

The 10 port hub is about half full. Keyboard (USB3), mouse (USB 1? 2? no idea, its reasonably new but cheap), a scanner which is usually turned off (USB1), a blueray player that takes two ports (USB2), and a printer that is usually turned off (USB2).

Most of the time, everything seems to work okay, but now and then I get a weirdness. Like, I left my computer sitting for awhile (30 min ~ 1 hour) and came back to it. The ethernet had stopped working out of the blue -- my laptop didn't sleep/suspend or anything, I have basically all the power saving stuff off because in my experience that just causes weird chaos in Linux.

I unplugged/replugged in the TS4, and ethernet came back, but now my mouse stopped working (everything else seemed okay). I unplugged the mouse from my 10 port hub and put it in the second unpowered port of the TS4, and now it's back.

Is there a power saving thing on the TS4 that I can turn off? Or is this something else? Or have you not encountered this one in your usage? So far, nothing has happened like this while I'm at the computer actively using it.

If you have any ideas, I'd really appreciate hearing them (though I understand totally if all this is just weird and you have no idea ...!)

Thanks again!!

Matthijs Kooijman wrote at 2023-04-30 12:22

@Steve, thanks for your comment!

As for your issues with ethernet or other USB devices that stop working after some time - I have no such experience - so far the dock has been rather flawless for me. I have seen some issues with USB devices, but I think those were only triggered by suspend/resume and/or plugging/unplugging the TS4, and those occurred very rarely (and I think haven't happened in the last few months at all).

For reference, I'm running Ubuntu 22.10 on my laptop. I also have a number of USB hubs behind the TS4 (one in my keyboard, one in my monitor and two more 7-port self-powered multi-TT hubs), so that is similar to your setup. I have not modified any (power save or other) settings on the TS4 - I'm not even sure if there are settings to tweak other than the Linux powersaving settings (which I haven't specifically looked at or changed).

So I don't think I have anything for you, except wishing you good luck with your issues. If you ever figure them out, I'd be curious to hear what you've found :-)

21 comments -:- permalink -:- 19:30

Fri, 25 Mar 2022

My next laptop: Framework

For a while, I've been considering replacing Grubby, my trusty workhorse laptop, a Thinkpad X201 that I've been using for the past 11 years. These thinkpads are known to last, and indeed mine still worked nicely, but over the years lost Bluetooth functionality, speaker output, one of its USB ports (I literally lost part of the connector), some small bits of the casing (dropped something heavy on it), the fan sometimes made a grinding noise, and it was getting a little slow at times (but still fine for work). I had been postponing getting a replacement, though, since having to figure out what to get, comparing models, reading reviews is always a hassle (especially for me...).

Then, when I first saw the Framework laptop last year, I was immediately sold. It's a laptop that aims to be modular, in the sense that it can be easily repaired and upgraded. To be honest, this did not seem all that special to me at first, but apparently in the 11 years since I last bought a laptop, manufacturers have been using more and more glue rather than screws, and solder rather than sockets, which is a trend that Framework hopes to turn.

In addition to the modularity, I like the fact they make repairability and upgradability an explicit goal, in an attempt to make the electronics ecosystem more sustainable (they remind me of Fairphone in that sense). On top of that, it seems that this is also a really well made laptop, with a lot of attention to details, explicit support for Linux, open-source where possible (e.g. code for the embedded controller is open), flexible expansion ports using replaceable modules, encouraging third parties to build and market their own expansion cards and addons (with open-source reference designs available), a mainboard that can be used standalone too (makes for a nice SBC after a mainboard upgrade), decent keyboard, etc.

The only things that I'm less enthusiastic about are the reflective screen (I had that on my previous laptop and I remember liking the switch to a matte screen, but I guess I'll get used to that), having just four expansion ports (the only fixed port is an audio jack, everything else - USB, displays, card reader - has to go through expansion modules, so we'll see if I can get by with four ports) and the lack of an ethernet port (apparently there is an ethernet expansion module in the works, but I'll probably have to get a USB-to-ethernet module in the meantime).

Unfortunately, when I found the Framework laptop a few months ago, they were not actually being sold yet, though they expected to open up pre-orders in December. I really hoped Grubby would last long enough so I could get a Framework laptop. Then pre-orders opened only for US and Canada, with shipping to the EU announced for Q1 this year. Then they opened up orders for Germany, France and the UK, and I still had to wait...

So when they opened up pre-orders in the Netherlands last month, I immediately placed my order. They are using a batched shipping system and my batch is supposed to ship "in March" (part of the batch has already been shipped), so I'm hoping to get the new laptop somewhere in the coming weeks.

I suspect that Grubby took notice, because last Friday, with a small sputter, he powered off unexpectedly and has refused to power back on. I've tried some CPR, but no luck so far, so I'm afraid it's the end for Grubby. I'm happy that I already got my Framework order in, since now I just borrowed another laptop as a temporary solution rather than having to panic and buy something else instead.

So, I'm eager for my Framework laptop to be delivered. Now, I just need to pick a new name, and figure out which Thunderbolt dock I want... (I had an old-skool docking station for my Thinkpad, which worked great, but with USB-C and Thunderbolt's single cable for power, display, usb and ethernet, there is now a lot more choice in docks, but more on that in my next post...).

Tagged as: Framework, Grubby, Hardware, Laptop, Repair, Sustainable, X201

Comments

No comments yet.

0 comments -:- permalink -:- 19:03

Mon, 13 Dec 2021

Script to generate pinout listings for STM32 MCUs

Recently, I've been working with STM32 chips for a few different projects and customers. These chips are quite flexible in their pin assignments: usually most peripherals (e.g. an SPI or UART block) can be mapped onto two or often even more pins. This gives great flexibility (both during board design for single-purpose boards and later for a more general purpose board), but also makes it harder to decide and document the pinout of a design.

ST offers STM32CubeMX, a software tool that helps designing around an STM32 MCU, including deciding on pinouts, and generating relevant code for the system as well. It is probably a powerful tool, but it is a bit heavy to install and AFAICS does not really support general purpose boards (where you would choose between different supported pinouts at runtime or compiletime) well.

So in the past, I've used a trusted tool to support this process: A spreadsheet that lists all pins and all their supported functions, where you can easily annotate each pin with all the data you want and use colors and formatting to mark functions as needed to create some structure in the complexity.

However, generating such a pinout spreadsheet wasn't particularly easy. The tables from the datasheet cannot be easily copy-pasted (and the datasheet has the alternate and additional functions in two separate tables), and the STM32CubeMX software only seems able to export a pinout table with alternate functions, not additional functions. So we previously ended up using the CubeMX-generated table and then adding the additional functions manually, which is annoying and error-prone.

So I dug around in the CubeMX data files a bit, and found that it has an XML file for each STM32 chip that lists all pins with all their functions (both alternate and additional). So I wrote a quick Python script that parses such an XML file and generates a CSV listing. The script just needs Python3 and has no additional dependencies.

To run this script, you will need the XML file for the MCU you are interested in from inside the CubeMX installation. Currently, these only seem to be distributed by ST as part of CubeMX (I did find one third-party github repo with the same data, but that wasn't updated in nearly two years). However, once you generate the pin listing and publish it (e.g. in a spreadsheet), others can of course work with it without needing CubeMX or this script anymore.

For example, you can run this script as follows:

$ ./stm32pinout.py /usr/local/cubemx/db/mcu/STM32F103CBUx.xml
name,pin,type
VBAT,1,Power
PC13-TAMPER-RTC,2,I/O,GPIO,EXTI,EVENTOUT,RTC_OUT,RTC_TAMPER
PC14-OSC32_IN,3,I/O,GPIO,EXTI,EVENTOUT,RCC_OSC32_IN
PC15-OSC32_OUT,4,I/O,GPIO,EXTI,ADC1_EXTI15,ADC2_EXTI15,EVENTOUT,RCC_OSC32_OUT
PD0-OSC_IN,5,I/O,GPIO,EXTI,RCC_OSC_IN
(... more output truncated ...)

The script is not perfect yet (it does not tell you which functions correspond to which AF numbers and the ordering of functions could be improved, see TODO comments in the code), but it gets the basic job done well.

You can find the script in my "scripts" repository on github.

Update: It seems the XML files are now also available separately on github: https://github.com/STMicroelectronics/STM32_open_pin_data, and some of the TODOs in my script might be solvable.

Comments

No comments yet.

0 comments -:- permalink -:- 16:27

Fri, 03 Sep 2021

Using MathJax math expressions in Markdown

For this blog, I wanted to include some nicely-formatted formulas. An easy way to do so is to use MathJax, a javascript-based math processor where you can write formulas using (among others) the often-used TeX math syntax.

However, I use Markdown to write my blogposts and including formulas directly in the text can be problematic because Markdown might interpret part of my math expressions as Markdown and transform them before MathJax has had a chance to look at them. In this post, I present a customized MathJax configuration that solves this problem in a reasonably elegant way.

An obvious solution is to put the math expression in Markdown code blocks (or inline code using backticks), but by default MathJax does not process these. MathJax can be reconfigured to also typeset the contents of <code> and/or <pre> elements, but since actual code will likely contain parts that look like math expressions, this will likely cause your code to be messed up.

This problem was described in more detail by Yihui Xie in a blogpost, along with a solution that preprocesses the DOM to look for <code> tags that start and end with a math expression start and end marker, and if so strips away the <code> tag so that MathJax will process the expression later. Additionally, he translates any expression contained in single dollar signs (which is the traditional TeX way to specify inline math) to an expression wrapped in \( and \), which is the only way to specify inline math in MathJax by default (single dollars are disabled since they would be too likely to cause false positives).

Improved solution

I considered using his solution, but it explicitly excludes code blocks (which are rendered as a <pre> tag containing a <code> tag in Markdown), and I wanted to use code blocks for centered math expressions (since that looks better without the backticks in my Markdown source). Also, I did not really like that the script modifies the DOM and has a bunch of regexes that hardcode what a math formula looks like.

So I made an alternative implementation that configures MathJax to behave as intended. This is done by overriding the normal automatic typesetting in the pageReady function and instead explicitly typesetting all code tags that contain exactly one math expression. Unlike the solution by Yihui Xie, this:

Lets MathJax decide what is and is not a math expression. This means that it will also work for other MathJax input plugins, or with non-standard tex input configuration.
Only typesets string-based input types (e.g. TeX but not MathML), since I did not try to figure out how the node-based inputs work.
Does not typeset anything except for these selected <code> elements (e.g. no formulas in normal text), because the default typesetting is replaced.
Also typesets formulas in <code> elements inside <pre> elements (but this can be easily changed using the parent tag check from Yihui Xie's code).
Enables typesetting of single-dollar inline math expressions by changing MathJax config instead of modifying the delimiters in the DOM. This will not produce false positive matches in regular text, since typesetting is only done on selected code tags anyway.
Runs from the MathJax pageReady event, so the script does not have to be at the end of the HTML page.

You can find the MathJax configuration for this inline at the end of this post. To use it, just put the script tag in your HTML before the MathJax script tag (or see the MathJax docs for other ways).

Examples

To use it, just use the normal TeX math syntax (using single or double $ signs) inside a code block (using backticks or an indented block) in any combination. Typically, you would use single $ delimiters together with backticks for inline math. You'll have to make sure that the code block contains exactly a single MathJax expression (and maybe some whitespace), but nothing else. E.g. this Markdown:

Formulas *can* be inline: `$z = x + y$`.

Renders as: Formulas can be inline: $z = x + y$ .

The double $$ delimiter produces a centered math expression. This works within backticks (like Yihui shows) but I think it looks better in the Markdown if you use an indented block (which Yihui's code does not support). So for example this Markdown (note the indent):

    $$a^2 + b^2 = c^2$$

Renders as:

$$a^2 + b^2 = c^2$$

Then you can also use more complex, multiline expressions. This indented block of Markdown:

    $$
    \begin{vmatrix}
      a & b\\
      c & d
    \end{vmatrix}
    =ad-bc
    $$

Renders as:

$$
\begin{vmatrix}
  a & b\\
  c & d
\end{vmatrix}
=ad-bc
$$

Note that to get Markdown to display the above example blocks, i.e. code blocks that start and end with $$, without having MathJax process them, I used some literal HTML in my Markdown source. For example, in my blog's markdown source, the first block above literally looks like this:

<pre><code><span></span>    $$a^2 + b^2 = c^2$$</code></pre>

Markdown leaves the HTML tags alone, and the empty span ensures that the script below does not process the contents of the code block (since it only processes code blocks where the full contents of the block are valid MathJax code).

The code

So, here is the script that I am now using on this blog:

<script type="text/javascript">
MathJax = {
  options: {
    // Remove <code> tags from the blacklist. Even though we pass an
    // explicit list of elements to process, this blacklist is still
    // applied.
    skipHtmlTags: { '[-]': ['code'] },
  },
  tex: {
    // By default, only \( is enabled for inline math, to prevent false
    // positives. Since we already only process code blocks that contain
    // exactly one math expression and nothing else, it is also fine to
    // use the nicer $...$ construct for inline math.
    inlineMath: { '[+]': [['$', '$']] },
  },
  startup: {
    // This is called on page ready and replaces the default MathJax
    // "typeset entire document" code.
    pageReady: function() {
      var codes = document.getElementsByTagName('code');
      var to_typeset = [];
      for (var i = 0; i < codes.length; i++) {
        var code = codes[i];
        // Only allow code elements that just contain text, no subelements
        if (code.childElementCount === 0) {
          var text = code.textContent.trim();
          var inputs = MathJax.startup.getInputJax();
          // For each of the configured input processors, see if the
          // text contains a single math expression that encompasses the
          // entire text. If so, typeset it.
          for (var j = 0; j < inputs.length; j++) {
            // Only use string input processors (e.g. tex, as opposed to
            // node processors e.g. mml that are more tricky to use).
            if (inputs[j].processStrings) {
              var matches = inputs[j].findMath([text]);
              if (matches.length == 1 && matches[0].start.n == 0 && matches[0].end.n == text.length) {
                // Trim off any trailing newline, which otherwise stays around, adding empty visual space below
                code.textContent = text;
                to_typeset.push(code);
                code.classList.add("math");
                if (code.parentNode.tagName == "PRE")
                  code.parentNode.classList.add("math");
                break;
              }
            }
          }
        }
      }
      // Code blocks to replace are collected and then typeset in one go, asynchronously in the background
      MathJax.typesetPromise(to_typeset);
    },
  },
};
</script>

Update 2020-08-05: Script updated to run typesetting only once, and use typesetPromise to run it asynchronously, as suggested by Raymond Zhao in the comments below.

Update 2020-08-20: Added some Markdown examples (the same ones Yihui Xie used), as suggested by Troy.

Update 2021-09-03: Clarified how the script decides which code blocks to process and which to leave alone.

Tagged as: Blog, Markdown, Math, MathJax, Script

Comments

Raymond Zhao wrote at 2020-07-29 22:37

Hey, this script works great! Just one thing: performance isn't the greatest. I noticed that upon every call to MathJax.typeset, MathJax renders the whole document. It's meant to be passed an array of all the elements, not called individually.

So what I did was I put all of the code elements into an array, and then called MathJax.typesetPromise (better than just typeset) on that array at the end. This runs much faster, especially with lots of LaTeX expressions on one page.

Matthijs Kooijman wrote at 2020-08-05 08:28

Hey Raymond, excellent suggestion. I've updated the script to make these changes, works perfect. Thanks!

Troy wrote at 2020-08-19 20:53

What a great article! Congratulations :)

Can you please add a typical math snippet from one of your .md files? (Maybe the same as the one Yihui Xie uses in his post.)

I would like to see how you handle inline/display math in your markdown.

Matthijs Kooijman wrote at 2020-08-20 16:47

Hey Troy, good point, examples would really clarify the post. I've added some (the ones from Yihui Xie indeed) that show how to use this from Markdown. Hope this helps!

Xiao wrote at 2021-09-03 04:09

Hi, this code looks pretty great! One thing I'm not sure about is how do you differentiate latex code block and normal code block so that they won't be rendered to the same style?

Matthijs Kooijman wrote at 2021-09-03 13:09

Hi Xiao, thanks for your comment. I'm not sure I understand your question completely, but what happens is that both the math/latex block and a regular code block are processed by markdown into a <pre><code>...</code></pre> block. Then the script shown above picks out all <code> blocks, and passes the content of each to MathJax for processing.

Normally MathJax finds any valid math expression (delimited by e.g. $$ or $) and processes it, but my script has some extra checks to only apply MathJax processing if the entire <code> block is a single MathJax block (in other words, if it starts and ends with $$ or $).

This means that regular code blocks will not be MathJax processed and stay regular code blocks. One exception is when a code block starts and ends with e.g. $$ but you still do not want it processed (like the Markdown-version of the examples I show above), but I applied a little hack with literal HTML tags and an empty <span> for that (see above, I've updated the post to show how I did this).

Or maybe your question is more about actually styling regular code blocks vs math blocks? For that, the script adds a math class to the <code> and <pre> tags, which I then use in my CSS to slightly modify the styling (just remove the grey background for math blocks, all other styling is handled by Mathjax already it seems).

Does that answer your question?

6 comments -:- permalink -:- 13:05

Fri, 06 Nov 2020

Forcing compiletime initialization of variables in C++ using constexpr

Every now and then I work on some complex C++ code (mostly stuff running on Arduino nowadays) so I can write up some code in a nice, concise and abstracted manner. This almost always involves classes, constructors and templates, which serve their purpose in the abstraction, but once you actually call them, the compiler should optimize all of them away as much as possible.

This usually works nicely, but there was one thing that kept bugging me. No matter how simple your constructors are, initializing using constructors always results in some code running at runtime.

In contrast, when you initialize a normal integer variable, or a struct variable using aggregate initialization, the compiler can completely do the initialization at compiletime. E.g. this code:

struct Foo {uint8_t a; bool b; uint16_t c;};
Foo x = {0x12, false, 0x3456};

Would result in four bytes (0x12, 0x00, 0x34, 0x56, assuming no padding and big-endian) in the data section of the resulting object file. This data section is loaded into memory using a simple loop, which is about as efficient as things get.

Now, if I write the above code using a constructor:

struct Foo {
    uint8_t a; bool b; uint16_t c;
    Foo(uint8_t a, bool b, uint16_t c) : a(a), b(b), c(c) {}
};
Foo x = Foo(0x12, false, 0x3456);

This will result in those four bytes being allocated in the bss section (which is zero-initialized), with the constructor code being executed at startup. The actual call to the constructor is inlined of course, but this still means there is code that loads every byte into a register, loads the address in a register, and stores the byte to memory (assuming an 8-bit architecture, other architectures will do more bytes at a time).

This doesn't matter much if it's just a few bytes, but for larger objects, or multiple small objects, having the loading code intermixed with the data like this easily requires 3 to 4 times as much code as having it loaded from the data section. I don't think CPU time will be much different (though first zeroing memory and then loading actual data is probably slower), but on embedded systems like Arduino, code size is often limited, so not having the compiler just resolve this at compiletime has always frustrated me.

Constant Initialization

Today I learned about a new feature in C++11: Constant initialization. This means that any global variables that are initialized to a constant expression will be resolved at compiletime and initialized before any (user) code (including constructors) starts to actually run.

A constant expression is essentially an expression that the compiler can guarantee can be evaluated at compiletime. They are required for e.g. array sizes and non-type template parameters. Originally, constant expressions included just simple (arithmetic) expressions, but since C++11 you can also use functions and even constructors as part of a constant expression. For this, you mark a function using the constexpr keyword, which essentially means that if all parameters to the function are compiletime constants, the result of the function will also be (additionally, there are some limitations on what a constexpr function can do).
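As a minimal illustration of this (the names make_word and Point are made up for this sketch and do not come from any library), a constexpr function and a constexpr constructor can be used in places that require a constant expression, such as an array size or a static_assert condition:

```cpp
#include <cstdint>

// Hypothetical helper, written in the restricted C++11 constexpr style
// (a single return statement).
constexpr std::uint16_t make_word(std::uint8_t hi, std::uint8_t lo) {
  return static_cast<std::uint16_t>((hi << 8) | lo);
}

// Hypothetical class with a constexpr constructor and member function.
struct Point {
  int x, y;
  constexpr Point(int x, int y) : x(x), y(y) {}
  constexpr int sum() const { return x + y; }
};

// Array sizes must be constant expressions, so this only compiles
// because make_word() is constexpr and its arguments are constants.
int buffer[make_word(0, 16)];

// The same holds for the condition of a static_assert.
static_assert(make_word(0x12, 0x34) == 0x1234, "unexpected word value");
static_assert(Point(1, 2).sum() == 3, "unexpected sum");
```

Removing constexpr from make_word() or from the Point constructor should turn each of these uses into a compile error.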

So essentially, this means that if you add constexpr to all constructors and functions involved in the initialization of a variable, the compiler will evaluate them all at compiletime.
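Applied to the constructor example from earlier, a sketch of this could look as follows (assuming a C++11 compiler; whether the bytes really end up in the data section still depends on the compiler and is worth checking in the generated binary):

```cpp
#include <cstdint>

struct Foo {
  std::uint8_t a; bool b; std::uint16_t c;
  // constexpr allows this constructor to be evaluated at compiletime
  // when all of its arguments are constant expressions.
  constexpr Foo(std::uint8_t a, bool b, std::uint16_t c) : a(a), b(b), c(c) {}
};

// The initializer is now a constant expression, so x qualifies for
// constant initialization. x itself is not const and can still be modified.
Foo x = Foo(0x12, false, 0x3456);

// The same expression is accepted where a constant expression is required.
static_assert(Foo(0x12, false, 0x3456).c == 0x3456,
              "Foo is not usable in a constant expression");
```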

(On a related note - I'm not sure why the compiler doesn't deduce constexpr automatically. If it can verify if it's allowed to use constexpr, why not add it? Might be too resource-intensive perhaps?)

Note that constant initialization does not mean the variable has to be declared const (i.e. immutable) - it's just that the initial value has to be a constant expression. These are really different concepts - it's perfectly possible for a const variable to have a non-constant expression as its value. This means that the value is set by normal constructor calls or whatnot at runtime, possibly with side-effects, without allowing any further changes to the value after that.
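The difference between the two concepts can be sketched like this (answer, counter, verbose and bump are hypothetical names, and the VERBOSE environment variable is just an arbitrary source of a runtime value):

```cpp
#include <cstdlib>

constexpr int answer() { return 6 * 7; }

// Constant-initialized (the initializer is a constant expression), yet
// not const: it can be modified freely at runtime.
int counter = answer();

// const, but NOT constant-initialized: std::getenv() can only run at
// runtime, so this value is set by dynamic initialization at startup
// and cannot be changed afterwards.
const bool verbose = std::getenv("VERBOSE") != nullptr;

void bump() { ++counter; }             // fine, counter is mutable
// void toggle() { verbose = false; }  // error if uncommented: verbose is const
```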

Enforcing constant initialization?

Anyway, so much for the introduction of this post, which turned out longer than I planned :-). I learned about this feature from this great post by Andrzej Krzemieński. He also writes that it is not really possible to enforce that a variable is constant-initialized:

It is difficult to assert that the initialization of globals really took place at compile-time. You can inspect the binary, but it only gives you the guarantee for this binary and is not a guarantee for the program, in case you target for multiple platforms, or use various compilation modes (like debug and retail). The compiler may not help you with that. There is no way (no syntax) to require a verification by the compiler that a given global is const-initialized.

If you accidentally forget constexpr on one function involved, or some other requirement is not fulfilled, the compiler will happily fall back to less efficient runtime initialization instead of notifying you so you can fix this.

This smelled like a challenge, so I set out to investigate if I could figure out some way to implement this anyway. I thought of using a non-type template argument (which are required to be constant expressions by C++), but those only allow a limited set of types to be passed. I tried using __builtin_constant_p, a non-standard gcc construct, but that doesn't seem to recognize class-typed constant expressions.
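To illustrate the non-type template argument idea and its limitation (require_constant, twice and runtime_twice are hypothetical names for this sketch): it works for integral values, but before C++20 a class type such as Foo cannot be used as the type of a template parameter at all.

```cpp
// Hypothetical helper: instantiating it requires N to be a constant expression.
template <int N>
struct require_constant {
  static constexpr int value = N;
};

constexpr int twice(int x) { return 2 * x; }
int runtime_twice(int x) { return 2 * x; }

// OK: twice(21) is a constant expression.
constexpr int checked = require_constant<twice(21)>::value;

// Error if uncommented: runtime_twice() is not constexpr, so its result
// cannot be used as a template argument.
// constexpr int unchecked = require_constant<runtime_twice(21)>::value;
```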

Using `static_assert`

It seems that using the (also introduced in C++11) static_assert statement is a reasonable (though not perfect) option. The first argument to static_assert is a boolean that must be a constant expression. So, if we pass it an expression that is not a constant expression, it triggers an error. For testing, I'm using this code:

class Foo {
public:
  constexpr Foo(int x) { }
  Foo(long x) { }
};

Foo a = Foo(1);
Foo b = Foo(1L);

We define a Foo class, which has two constructors: one accepts an int and is constexpr and one accepts a long and is not constexpr. Above, this means that a will be const-initialized, while b is not.

To use static_assert, we cannot just pass a or b as the condition, since the condition must return a bool type. Using the comma operator helps here (the comma accepts two operands, evaluates both and then discards the first to return the second):

static_assert((a, true), "a not const-initialized"); // OK
static_assert((b, true), "b not const-initialized"); // OK :-(

However, this doesn't quite work: neither of these results in an error. I was actually surprised here - I would have expected them both to fail, since neither a nor b is a constant expression. In any case, this doesn't work. What we can do, is simply copy the initializer used for both into the static_assert:

static_assert((Foo(1), true), "a not const-initialized"); // OK
static_assert((Foo(1L), true), "b not const-initialized"); // Error

This works as expected: The int version is ok, the long version throws an error. It doesn't trigger the assertion, but recent gcc versions show the line with the error, so it's good enough:

test.cpp:14:1: error: non-constant condition for static assertion
 static_assert((Foo(1L), true), "b not const-initialized"); // Error
 ^
test.cpp:14:1: error: call to non-constexpr function ‘Foo::Foo(long int)’

This isn't very pretty though - the comma operator doesn't make it very clear what we're doing here. Better is to use a simple inline function, to effectively do the same:

template <typename T>
constexpr bool ensure_const_init(T t) { return true; }

static_assert(ensure_const_init(Foo(1)), "a not const-initialized"); // OK
static_assert(ensure_const_init(Foo(1L)), "b not const-initialized"); // Error

This achieves the same result, but looks nicer (though the ensure_const_init function does not actually enforce anything, it's the context in which it's used, but that's a matter of documentation).

Note that I'm not sure if this will actually catch all cases. I'm not entirely sure if the stuff involved with passing an expression to static_assert (optionally through the ensure_const_init function) is exactly the same stuff that's involved with initializing a variable with that expression (e.g. similar to the copy constructor issue below).

The function itself isn't perfect either - It doesn't handle (const) (rvalue) references so I believe it might not work in all cases, so that might need some fixing.

Also, having to duplicate the initializer in the assert statement is a big downside - If I now change the variable initializer, but forget to update the assert statement, all bets are off...

Using `constexpr` constant

As Andrzej pointed out in his post, you can mark variables with constexpr, which requires them to be constant initialized. However, this also makes the variable const, meaning it cannot be changed after initialization, which we do not want. However, we can still leverage this using a two-step initialization:

constexpr Foo c_init = Foo(1); // OK
Foo c = c_init;

constexpr Foo d_init = Foo(1L); // Error
Foo d = d_init;

This isn't very pretty either, but at least the initializer is only defined once. This does introduce an extra copy of the object. With the default (implicit) copy constructor this copy will be optimized out and constant initialization still happens as expected, so no problem there.

However, with user-defined copy constructors, things are different:

class Foo2 {
public:
  constexpr Foo2(int x) { }
  Foo2(long x) { }
  Foo2(const Foo2&) { }
};

constexpr Foo2 e_init = Foo2(1); // OK
Foo2 e = e_init; // Not constant initialized but no error!

Here, a user-defined copy constructor is present that is not declared with constexpr. This results in e being not constant-initialized, even though e_init is (this is actually slightly weird - I would expect the initialization syntax I used to also call the copy constructor when initializing e_init, but perhaps that one is optimized out by gcc in an even earlier stage).

We can user our earlier ensure_const_init function here:

constexpr Foo f_init = Foo(1);
Foo f = f_init;
static_assert(ensure_const_init(f_init), "f not const-initialized"); // OK

constexpr Foo2 g_init = Foo2(1);
Foo2 g = g_init;
static_assert(ensure_const_init(g_init), "g not const-initialized"); // Error

This code is actually a bit silly - of course f_init and g_init are const-initialized, they are declared constexpr. I initially tried this separate init variable approach before I realized I could (need to, actually) add constexpr to the init variables. However, this silly code does catch our problem with the copy constructor. This is just a side effect of the fact that the copy constructor is called when the init variables are passed to the ensure_const_init function.
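If you control the class, one way around the copy constructor problem might be to simply declare the copy constructor constexpr as well. Here's a minimal sketch of that: Foo3 is a hypothetical variant of Foo2 (with an added value member), and the ensure_const_init below is a stand-in I'm assuming takes its argument by value, so the real one from earlier in this post may differ.

```cpp
// Stand-in for the helper from earlier in this post (assumed definition):
// taking the argument by value forces a copy inside the constant expression.
template <typename T>
constexpr bool ensure_const_init(T) { return true; }

// Hypothetical variant of Foo2 whose copy constructor is constexpr too.
class Foo3 {
public:
  constexpr Foo3(int x) : value(x) { }
  Foo3(long x) : value(static_cast<int>(x)) { }
  constexpr Foo3(const Foo3& other) : value(other.value) { }
  int value;
};

constexpr Foo3 i_init = Foo3(1);
Foo3 i = i_init;
// Compiles, because copying i_init is now allowed in a constant expression.
static_assert(ensure_const_init(i_init), "i not const-initialized");
```

As before, this only checks that the copy can be made in a constant expression, it is not a hard guarantee about how i itself ends up being initialized.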

Using two variables

One variant of the above would be to simply define two objects: the one you want, and an identical constexpr version:

Foo h = Foo(1);
constexpr Foo h_const = Foo(1);

It should be reasonable to assume that if h_const can be const-initialized, and h uses the same constructor and arguments, that h will be const-initialized as well (though again, no real guarantee).

This assumes that the h_const object, being unused, will be optimized away. Since it is constexpr, we can also be sure that there are no constructor side effects that will linger, so at worst this wastes a bit of memory if the compiler does not optimize it.

Again, this requires duplication of the constructor arguments, which can be error-prone.

Summary

There are two significant problems left:

None of these approaches actually guarantee that const-initialization happens. It seems they catch the most common problem: having a non-constexpr function or constructor involved. But inside the C++ minefield that is (copy) constructors, implicit conversions, half a dozen initialization methods, etc., I'm pretty confident that there are other caveats we're missing here.
None of these approaches are very pretty. Ideally, you'd just write something like:
```
constinit Foo f = Foo(1);
```
or, slightly worse:
```
Foo f = constinit(Foo(1));
```

Implementing the second syntax seems to be impossible using a function - function parameters cannot be used in a constant expression (they could be non-const). You can't mark parameters as constexpr either.

I considered using a preprocessor macro to implement this. A macro can easily take care of duplicating the initialization value (and since we're enforcing constant initialization, there are no side effects to worry about). It's tricky, though, since you can't just put a static_assert statement or an additional constexpr variable declaration inside a variable initialization. I considered using a C++11 lambda expression for that, but those can only contain a single return statement and nothing else (unless they return void or specify their return type explicitly) and cannot be declared constexpr...

Perhaps a macro that completely generates the variable declaration and initialization could work, but a single macro that generates multiple statements is still messy (and the usual do {...} while(0) approach doesn't work in global scope). It's also not very nice...
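To make that last idea a bit more concrete, here's a rough sketch of what such a macro could look like. CONST_INIT_VAR and the _const_init suffix are names made up for this example, and Foo is again a stand-in with an assumed shape. Since the macro expands to two declarations rather than statements, it does not need the do {...} while(0) trick, but it inherits the copy constructor caveat of the two-step approach.

```cpp
#include <type_traits>

// Stand-in for the Foo class used in this post (assumed shape).
class Foo {
public:
  constexpr Foo(int x) : value(x) { }
  Foo(long x) : value(static_cast<int>(x)) { }
  int value;
};

// Hypothetical macro: declares a constexpr helper object (which forces the
// initializer to be a constant expression) and then the real variable,
// copy-initialized from that helper. The initializer is written only once.
#define CONST_INIT_VAR(type, name, ...)            \
  constexpr type name##_const_init = __VA_ARGS__;  \
  type name = name##_const_init

CONST_INIT_VAR(Foo, j, Foo(1));      // OK
// CONST_INIT_VAR(Foo, k, Foo(1L));  // fails to compile: Foo(long) is not constexpr

// The generated variable itself is not const.
static_assert(!std::is_const<decltype(j)>::value, "j should stay modifiable");
```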

Any other suggestions?

Update 2020-11-06: It seems that C++20 has introduced a new keyword, constinit, to do exactly this: require that a variable is constant-initialized, without also making it const like constexpr does. See https://en.cppreference.com/w/cpp/language/constinit
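As a small sketch of how that looks in practice (again with a stand-in Foo of assumed shape and a hypothetical modify_f helper): the #if on the __cpp_constinit feature-test macro is only there so this still compiles on pre-C++20 compilers, where the check is simply skipped.

```cpp
#include <cassert>

// Stand-in for the Foo class used in this post (assumed shape).
class Foo {
public:
  constexpr Foo(int x) : value(x) { }
  Foo(long x) : value(static_cast<int>(x)) { }
  int value;
};

#if defined(__cpp_constinit)
constinit Foo f = Foo(1);      // OK: constant-initialized, but not const
// constinit Foo g = Foo(1L);  // error: Foo(long) is not constexpr
#else
Foo f = Foo(1);                // pre-C++20 fallback, no compile-time check
#endif

// Hypothetical helper: unlike a constexpr variable, f can be modified later.
void modify_f() {
  f.value = 2;
  assert(f.value == 2);
}
```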
