Institute of Operating Systems and Computer Networks

Cookie Counting

☃️

Git-Repository: Template Solution Solution-Diff (Solution is posted at 18:00 CET)
Workload: 68 lines of code
Important System-Calls: perf_event_open(2)

All cookies are rescued. With your help, the ELF search team was able to find the "borrowing" ELF and bring back all the cookies, which now form a huge, and I mean HUGE, pile of cookies in the Christmas village. But somehow, the pile looks a little bit too small. After some intense eye contact, the borrowing ELF admitted that he ate one or two or maybe a few more cookies on the way. To estimate the number of destroyed cookies, the ELFs came up with two different estimation strategies. As time to Christmas is running out, it is important to choose the faster cookie counting strategy. Can you help?

perf Events

Performing microbenchmarks is an important part of writing efficient software, as it helps us identify bottlenecks in our programs or rule out suspected ones. However, it is often not enough to know how long a given function or component executes; we need more detailed metrics to understand why the benchmark is so fast or slow. For example, with modern cache hierarchies, the cache-miss rate is a major indicator of performance. Likewise, on super-scalar CPUs, the number of finished instructions per cycle can be equally important because it indicates how well we utilize the CPU's resources.

Often, we measure those metrics with a separate benchmark program: the perf(1) tool. If you haven't heard of perf, you should immediately stop reading and look at the perf man page as well as Brendan Gregg's resources on flame graphs.

Are you back again? Ok, good... But sometimes, we cannot easily rip the function-under-test out of a larger program; instead, we have to measure it in its environment, which also makes the results more realistic. With perf_event_open(2), which is also the basis for the perf tool, we're able to dynamically set up performance monitoring in our programs.

With the perf subsystem, Linux provides us with a powerful performance measurement infrastructure, which we can use to record all kinds of metrics during our benchmark execution. This includes not only hardware counters (e.g. executed instructions, cycles, cache misses), but also software metrics, like CPU migrations and page faults.

   int perf_event_open(struct perf_event_attr *attr,
               pid_t pid, int cpu, int group_fd, unsigned long flags);

In a nutshell, for every metric that you want to measure, you'd need to create a new perf probe with perf_event_open(), which returns a file descriptor to the probe. With attr you configure the probe and with (pid, cpu) you describe its measurement scope (this thread vs. the whole system). By passing an already created probe as group_fd, you can group multiple probes into a single multi-metric probe. After creation, you have to use the generic ioctl(2) system call to reset/enable/disable the probe:

ioctl(probe_fd, PERF_EVENT_IOC_ENABLE, 0);

To access the measured data, you just read(2) the data from the probe descriptor and the kernel returns the metrics according to attr.read_format.
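The create/enable/read cycle described above can be sketched for a single counter as follows. This is a minimal sketch, not the template's code: sys_perf_event_open, busy_loop and count_event are hypothetical names chosen for illustration. It assumes a read_format of 0, in which case read(2) returns a single 64-bit counter value, and it uses pid = 0 with cpu = -1 to measure the calling thread on whichever CPU it runs.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* glibc ships no wrapper for perf_event_open(2), so call it via syscall(2). */
static int sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
                               int cpu, int group_fd, unsigned long flags)
{
    return (int)syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Hypothetical example workload: spin for *(int *)arg iterations. */
static void busy_loop(void *arg)
{
    volatile uint64_t sink = 0;
    for (int i = 0; i < *(int *)arg; i++)
        sink += (uint64_t)i;
}

/* Hypothetical helper: count one event (type/config) for the calling thread
 * while work(arg) runs. Returns 0 and stores the counter in *count on
 * success; returns -1 and leaves *count untouched on failure. */
static int count_event(uint32_t type, uint64_t config,
                       void (*work)(void *), void *arg, uint64_t *count)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = type;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;        /* start stopped; we enable explicitly below */
    attr.exclude_kernel = 1;  /* user space only; often required for
                                 unprivileged users */
    attr.exclude_hv = 1;

    /* pid = 0, cpu = -1: this thread, on any CPU; group_fd = -1: no group */
    int fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work(arg);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    /* With read_format == 0, the kernel returns just the 64-bit count. */
    uint64_t value = 0;
    ssize_t n = read(fd, &value, sizeof(value));
    close(fd);
    if (n != (ssize_t)sizeof(value))
        return -1;
    *count = value;
    return 0;
}
```

Calling count_event(PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS, busy_loop, &iters, &count) should yield the number of user-space instructions for the loop; passing PERF_TYPE_SOFTWARE with PERF_COUNT_SW_PAGE_FAULTS should count page faults instead. The open call can fail, for example inside virtual machines without a PMU or under a restrictive perf_event_paranoid setting, so the return value is worth checking.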

Task

In the template, you'll find two algorithms that perform matrix multiplication. It is your task to use perf_event_open(2) to compare both algorithms. For this, you should measure the number of executed instructions, the elapsed cycles, and the number of cache misses. With a reasonably sized matrix (2048x2048), I see the following numbers on my machine:

$ ./perf 2048
matrix_size: 32.00 MiB
drepper    19462.57M instr,     0.92 instr-per-cycle,     0.20 miss-per-instr
naive      77355.59M instr,     0.17 instr-per-cycle,     0.53 miss-per-instr

The "drepper" algorithm is the cache-optimized variant from "What Every Programmer Should Know About Memory" by Ulrich Drepper. From my results, we see that Drepper's variant requires fewer instructions (lower is better), has a higher instruction-per-cycle metric (higher is better), and a lower miss rate (lower is better).
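The three columns in the output above are ratios of the raw counters. As a rough sketch of how they could be derived (struct raw_counts, struct derived_metrics, derive_metrics and print_row are hypothetical names, and the "M" suffix is assumed to mean 10^6 instructions):

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for the three raw counter values. */
struct raw_counts {
    uint64_t instructions;
    uint64_t cycles;
    uint64_t cache_misses;
};

/* Hypothetical container for the derived columns. */
struct derived_metrics {
    double mega_instr;       /* instructions / 10^6 (assumed meaning of "M") */
    double instr_per_cycle;  /* higher is better */
    double miss_per_instr;   /* lower is better */
};

static struct derived_metrics derive_metrics(struct raw_counts c)
{
    struct derived_metrics d = {0.0, 0.0, 0.0};
    d.mega_instr = (double)c.instructions / 1e6;
    /* Guard against division by zero if a counter did not run. */
    if (c.cycles > 0)
        d.instr_per_cycle = (double)c.instructions / (double)c.cycles;
    if (c.instructions > 0)
        d.miss_per_instr = (double)c.cache_misses / (double)c.instructions;
    return d;
}

static void print_row(const char *name, struct raw_counts c)
{
    struct derived_metrics d = derive_metrics(c);
    printf("%-8s %10.2fM instr, %8.2f instr-per-cycle, %8.2f miss-per-instr\n",
           name, d.mega_instr, d.instr_per_cycle, d.miss_per_instr);
}
```

Comparing ratios rather than raw counts makes the two algorithms comparable even though they execute very different numbers of instructions.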

Hints

With struct perf_handle and struct read_format, we already hinted at one way you could design your APIs. In our implementation, a perf_handle is a multi-metric probe that contains enough information in the read_format to decode the read data.
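One possible shape for such a multi-metric probe is sketched below. It is not the template's definition: struct group_handle, struct group_reading and the group_* functions are hypothetical stand-ins. The sketch assumes a read_format of PERF_FORMAT_GROUP | PERF_FORMAT_ID, so that a single read(2) on the group leader returns all counters tagged with their kernel-assigned IDs, and it matches values to events by ID instead of relying on their order.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

enum { N_EVENTS = 3 };

/* Hypothetical handle (not the template's struct perf_handle). */
struct group_handle {
    int fds[N_EVENTS];       /* fds[0] is the group leader */
    uint64_t ids[N_EVENTS];  /* kernel-assigned event IDs */
};

/* Buffer layout for PERF_FORMAT_GROUP | PERF_FORMAT_ID. */
struct group_reading {
    uint64_t nr;
    struct { uint64_t value; uint64_t id; } values[N_EVENTS];
};

static const uint64_t event_configs[N_EVENTS] = {
    PERF_COUNT_HW_INSTRUCTIONS, PERF_COUNT_HW_CPU_CYCLES,
    PERF_COUNT_HW_CACHE_MISSES,
};

static void group_close(struct group_handle *h)
{
    for (int i = 0; i < N_EVENTS; i++)
        if (h->fds[i] >= 0) { close(h->fds[i]); h->fds[i] = -1; }
}

/* Open all events as one group for the calling thread; -1 on failure. */
static int group_open(struct group_handle *h)
{
    for (int i = 0; i < N_EVENTS; i++)
        h->fds[i] = -1;
    for (int i = 0; i < N_EVENTS; i++) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = event_configs[i];
        attr.disabled = (i == 0);  /* siblings follow the leader's state */
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;
        attr.read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID;
        int group_fd = (i == 0) ? -1 : h->fds[0];
        h->fds[i] = (int)syscall(SYS_perf_event_open, &attr, 0, -1,
                                 group_fd, 0UL);
        if (h->fds[i] < 0 ||
            ioctl(h->fds[i], PERF_EVENT_IOC_ID, &h->ids[i]) < 0) {
            group_close(h);
            return -1;
        }
    }
    return 0;
}

/* Map a raw group reading to out[i] for event i, matching by event ID. */
static int group_decode(const struct group_reading *r,
                        const uint64_t ids[N_EVENTS], uint64_t out[N_EVENTS])
{
    if (r->nr != N_EVENTS)
        return -1;
    for (int i = 0; i < N_EVENTS; i++) {
        int found = 0;
        for (int j = 0; j < N_EVENTS; j++)
            if (r->values[j].id == ids[i]) {
                out[i] = r->values[j].value;
                found = 1;
            }
        if (!found)
            return -1;
    }
    return 0;
}

/* Run work(arg) with the whole group enabled, then read all counters. */
static int group_measure(struct group_handle *h, void (*work)(void *),
                         void *arg, uint64_t out[N_EVENTS])
{
    struct group_reading r;
    ioctl(h->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
    ioctl(h->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);
    work(arg);
    ioctl(h->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
    if (read(h->fds[0], &r, sizeof(r)) != (ssize_t)sizeof(r))
        return -1;
    return group_decode(&r, h->ids, out);
}
```

After a successful group_open, one group_measure call per algorithm would fill out[0..2] with instructions, cycles and cache misses in the order of event_configs. Grouping has the added benefit that all three counters are scheduled onto the hardware together, so they cover the same stretch of execution.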

Last modified: 2023-12-01 15:52:27.723613, Last author: , Permalink: /p/advent-19-perf
