Summarizing Linux Kernel threads with LLMs (2024-03-04)

Willy Tarreau, over at the Linux kernel workflows mailing list, has prototyped an excellent mail summarization tool. I am an avid reader of the Linux Kernel Mailing List, but I often just skim the subject lines of threads and try to quickly grasp what is happening. There is not enough time to understand every single email. But that is an error-prone method: there is a lot of nuance, engineering wisdom and elegance that I miss.

With the tool Willy created, an LLM can summarize large threads and then point me to the parts of a thread that are most interesting to me.

Here is Willy’s proposal:

OK, if you're interested in giving it a try at home, here's what I've been
using:
  - github.com/ggerganov/llama.cpp
  - the mixtral LLM from:
    https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/tree/main
    I'm used to Q5_K_M quantization which generally provides the best
    compromise of accuracy/performance/speed, but for e-mail summaries,
    maybe using a smaller one would give good enough results faster.
  - ~35G of available RAM for the model above and plenty of cores (80 in
    my case)
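
(An aside of my own, not part of Willy's message: before downloading a model of this size, it may be worth checking whether the machine comes close to the requirements listed above. The sketch below assumes Linux, where /proc/meminfo exposes a MemAvailable field in kB. The available_gib helper and the NEEDED_GIB threshold are names I made up for illustration, and treating "35G" as 35 GiB is my assumption.)

```shell
#!/bin/sh
# Print MemAvailable from a meminfo-style file in whole GiB (rounded down).
# Defaults to /proc/meminfo, which is Linux-specific.
available_gib() {
    awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

# Hypothetical threshold, taken from the ~35G figure quoted above.
NEEDED_GIB=35

if [ -r /proc/meminfo ]; then
    have=$(available_gib)
    if [ "${have:-0}" -ge "$NEEDED_GIB" ]; then
        echo "OK: ${have} GiB available"
    else
        echo "Probably too little RAM: ${have:-0} GiB available, ~${NEEDED_GIB} GiB wanted"
    fi
    # nproc is part of GNU coreutils; skip the line quietly if it is missing.
    command -v nproc >/dev/null 2>&1 && echo "cores: $(nproc)"
fi
```

This is only a rough check: a smaller quantization would need less memory, so the threshold should be adjusted to the model file actually used.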

I downloaded the whole thread in mbox format from lore (message ID
20240223143833.1509961-1-guanyulin@google.com), passed it through
formail to drop useless headers that take a lot of context space
and time, and drop 2nd-level replies:

  $ (echo "<|im_start|>data"; zcat t.mbox.gz | \
    formail -I X- -I Received -I ARC- -I Authentication- -I DKIM- \
            -I List- -I Precedence -I Mime- -I Message-ID -s | \
    grep -v '^>[ ]*>'; echo "<|im_end|>") > q1.txt

The mbox roughly shrank in half (~4k words).
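
(Another aside of my own, not part of Willy's message: to see what the grep filter in the pipeline above does, here is a small sketch that applies the same pattern to a made-up four-line message body and counts words before and after. The sample text and the function names are hypothetical; only the grep pattern comes from the command above.)

```shell
#!/bin/sh
# Drop lines quoted two or more levels deep (e.g. "> > text" or ">> text"),
# using the same pattern as the pipeline above.
drop_nested_quotes() {
    grep -v '^>[ ]*>'
}

# Hypothetical message body, used only for this demonstration.
sample() {
    printf '%s\n' \
        'Rafael wrote:' \
        '> I do not see why this is needed.' \
        '> > Original patch description here.' \
        '>> Another nested line.'
}

# Word counts give a rough idea of how much context space is saved.
echo "words before: $(sample | wc -w)"
echo "words after:  $(sample | drop_nested_quotes | wc -w)"
```

First-level quotes ("> ...") survive, so each reply still shows what it is answering, while older text that has already appeared earlier in the thread is dropped.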

Then I ran the attempt below with the following command (the model
understands different prompt formats, this one works fine enough):

  $ ./main -c 0 --temp 0.3 -n -1 --threads 80 -tbd 40 \
  -m /mnt/models/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf \
  -p "<|im_start|>system
    You're an e-mail summarization bot, please follow the instructions below.<|im_end|>
   <|im_start|>user
   Please summarize the conversation in the DATA section below, focusing on
   the important points, objections and decisions taken, in three to four
   paragraphs. Please do not forget to mention who thinks what, and if
   anyone agrees or objects.<|im_end|>
   " -f q1.txt

Note that you can remove the debugging pollution using --log-disable.
Also, 80 and 40 are the number of CPUs to use when parsing and responding
respectively, they're optimal on this machine but normally you just use
--threads $(nproc) and don't specify -tbd.
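
(One more aside of my own, not part of Willy's message: the advice above can be folded into a small helper that prints the command line for a generic machine, using the number of online CPUs and --log-disable. It is a dry run that only prints the command; it does not invoke llama.cpp. The function names and the $PROMPT placeholder are hypothetical, and only flags that appear in the message above are used.)

```shell
#!/bin/sh
# Number of threads: all online CPUs, falling back to 1 if nproc is missing.
threads() {
    nproc 2>/dev/null || echo 1
}

# Print (do not run) the llama.cpp command line for a model and a data file.
# "$PROMPT" is left as a literal placeholder for the system/user prompt text.
summarize_cmd() {
    model=$1
    data=$2
    printf './main -c 0 --temp 0.3 -n -1 --threads %s --log-disable -m %s -p "$PROMPT" -f %s\n' \
        "$(threads)" "$model" "$data"
}

summarize_cmd /mnt/models/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf q1.txt
```

Printing the command first makes it easy to check the flags before committing to a run that, as noted below, can take several minutes.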

This time with the trimmed 2nd-level responses, it took 9m45, and
produced the following, which I found reasonably accurate based on some
visual checks I did in this thread:

| The conversation in the provided data section is centered around a patch
| submitted by Guan-Yu Lin to the Linux kernel mailing list. The patch
| introduces a new sysfs entry called "conditional_skip" that allows the
| system to selectively skip certain device power management state
| transitions. This feature is intended to address conflicts that can
| arise in systems with a main processor and a co-processor, where
| asynchronous controller management can lead to issues such as the main
| processor attempting to suspend a device while the co-processor is
| actively using it.
|
| Andy Shevchenko raises concerns about the use of an int for the ".event"
| member in struct pm_message, suggesting that it might give subtle errors
| related to promoted signdness. Guan-Yu Lin agrees and asks if they
| should refrain from using bitwise operation here or just change the type
| to u32.
|
| Florian Fainelli questions the use of a sysfs entry for this feature,
| suggesting that it would be more appropriate to use firmware's
| description for this purpose. Guan-Yu Lin explains that they anticipate
| that control of the peripheral will be shared between operating system
| kernels and that they need to store information about whether another
| operating system is actively using this peripheral. Florian Fainelli
| suggests unbinding and rebinding the device from its driver when it
| needs to be taken over by another operating system.
|
| Rafael J. Wysocki expresses concerns about the idea of conditionally
| skipping system-wide power management transitions for a device,
| questioning why a device would be skipped in one transition but not
| another and expressing doubts about the overall usefulness of this
| feature. Guan-Yu Lin attempts to clarify their reasoning, but Rafael J.
| Wysocki remains unconvinced.
|
| Overall, the conversation revolves around the design and implementation
| of a new feature for the Linux kernel that allows for more fine-grained
| control over device power management state transitions in systems with
| multiple processors or co-processors. There are concerns about the use
| of a sysfs entry for this purpose and doubts about the overall
| usefulness of the feature, but Guan-Yu Lin attempts to address these
| concerns and clarify their reasoning throughout the conversation.
| [end of text]