utils.samplers.multipack
Multipack Batch Sampler - An efficient batch sampler for packing variable-length sequences into fixed-capacity batches to optimize memory usage and training throughput.
Classes
| Name | Description |
|---|---|
| MultipackBatchSampler | Batch sampler class for efficient packing of variable-length sequences |
MultipackBatchSampler
utils.samplers.multipack.MultipackBatchSampler(
sampler,
batch_size,
batch_max_len,
lengths,
packing_efficiency_estimate=1.0,
drop_last=True,
num_count_samples=4,
sequential=False,
group_size=100000,
bin_size=200,
num_processes=None,
safe_mode=True,
mp_start_method='fork',
**kwargs,
)
Batch sampler class for efficient packing of variable-length sequences.
This sampler packs sequences into fixed-capacity bins (batches) to maximize GPU memory utilization and training throughput by reducing padding.
It supports both parallel packing (using the First-Fit Decreasing, or FFD, algorithm) and sequential packing (which preserves the original sequence order).
Methods
| Name | Description |
|---|---|
| efficiency | Calculate the packing efficiency (ratio of tokens used to total token slots). |
| gather_efficiency | Gather and synchronize packing efficiency estimates across all distributed ranks. |
| gather_len_batches | Gather and synchronize batch counts across all distributed ranks. Returns the minimum number of batches available on any rank. |
| generate_batches | Generate packed batches for training. |
| set_epoch | Set the epoch number, used for reproducible shuffling across epochs |
efficiency
utils.samplers.multipack.MultipackBatchSampler.efficiency()
Calculate the packing efficiency (ratio of tokens used to total token slots). Higher is better: 1.0 would mean perfect packing with no wasted space.
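As a rough illustration of the ratio described above, the sketch below computes efficiency for a hand-made packing. The `packing_efficiency` helper and its arguments are hypothetical and are not part of `utils.samplers.multipack`; the sampler tracks its own statistics internally.

```python
def packing_efficiency(bins, lengths, bin_capacity):
    """Ratio of tokens actually used to total token slots (hypothetical helper)."""
    tokens_used = sum(lengths[i] for b in bins for i in b)
    token_slots = len(bins) * bin_capacity
    # Guard against an empty packing to avoid dividing by zero.
    return tokens_used / token_slots if token_slots else 0.0


lengths = [6, 4, 5, 3]
bins = [[0, 1], [2, 3]]  # bin 0 holds 10 tokens, bin 1 holds 8 tokens
eff = packing_efficiency(bins, lengths, bin_capacity=10)
print(eff)  # 18 tokens used out of 20 slots
```

Here two slots of the second bin stay empty, so the ratio falls short of 1.0.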
gather_efficiency
utils.samplers.multipack.MultipackBatchSampler.gather_efficiency()
Gather and synchronize packing efficiency estimates across all distributed ranks.
Returns
| Name | Type | Description |
|---|---|---|
| | float | A conservative efficiency estimate based on the measurements. |
gather_len_batches
utils.samplers.multipack.MultipackBatchSampler.gather_len_batches(num)
Gather and synchronize batch counts across all distributed ranks. Returns the minimum number of batches available on any rank.
generate_batches
utils.samplers.multipack.MultipackBatchSampler.generate_batches(set_stats=False)
Generate packed batches for training.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| set_stats | bool | Whether to update efficiency statistics. | False |
Returns
| Name | Type | Description |
|---|---|---|
| | list[list[list[int]]] | List of batches, where each batch contains multiple bins, and each bin contains multiple sequence indices. |
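To make the nested return shape concrete, the following sketch walks a hand-written value of type `list[list[list[int]]]` and sums the token count of each bin. The `batches` and `lengths` values and the `tokens_per_bin` helper are made up for illustration; they are not output from the sampler.

```python
# Hypothetical value shaped like the documented return type.
batches = [
    [[0, 3], [1]],      # batch 0 contains two bins
    [[2, 4, 5], [6]],   # batch 1 contains two bins
]
lengths = [7, 9, 2, 3, 4, 1, 8]  # made-up length of each sequence index


def tokens_per_bin(batches, lengths):
    """Sum the sequence lengths inside every bin of every batch."""
    return [[sum(lengths[i] for i in bin_) for bin_ in batch] for batch in batches]


per_bin = tokens_per_bin(batches, lengths)
print(per_bin)
```

The outer list indexes batches, the middle list indexes bins within a batch, and the innermost list holds sequence indices.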
set_epoch
utils.samplers.multipack.MultipackBatchSampler.set_epoch(epoch)
Set the epoch number, used for reproducible shuffling across epochs.
Functions
| Name | Description |
|---|---|
| allocate_sequentially | Sequential allocator that preserves example order. |
| ffd_check | First-fit-decreasing bin packing algorithm check. |
| pack_group | Pack a group of sequences into bins using First-Fit Decreasing algorithm. |
| pack_parallel | Pack sequences into bins using parallel processing. |
allocate_sequentially
utils.samplers.multipack.allocate_sequentially(
sequence_lengths,
rank,
bin_capacity,
num_ranks,
)
Sequential allocator that preserves example order.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| sequence_lengths | np.ndarray | The lengths of all examples. | required |
| rank | int | The current rank (for distributed training). | required |
| bin_capacity | int | The capacity of each bin (maximum sequence length). | required |
| num_ranks | int | Number of ranks (processes / GPUs). | required |
Returns
| Name | Type | Description |
|---|---|---|
| rank_batches | list[list[int]] | List of batches for the current rank. |
| total_tokens_used | int | Number of actual example tokens. |
| total_token_slots | int | Maximum theoretical number of example tokens (number of bins * bin capacity). |
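The idea of order-preserving allocation can be sketched in a few lines of plain Python. This is not the library's implementation: `sequential_pack_sketch` is a hypothetical name, it takes a plain list rather than an `np.ndarray`, and the round-robin assignment of bins to ranks is an assumption made for this example. It also assumes no sequence is longer than `bin_capacity`.

```python
def sequential_pack_sketch(sequence_lengths, rank, bin_capacity, num_ranks):
    """Fill bins in the original order and keep every num_ranks-th bin for `rank`.

    The round-robin bin-to-rank assignment is an assumption of this sketch.
    """
    bins, current, remaining = [], [], bin_capacity
    for idx, length in enumerate(sequence_lengths):
        if length > remaining and current:
            bins.append(current)  # close the bin that cannot take this sequence
            current, remaining = [], bin_capacity
        current.append(idx)
        remaining -= length
    if current:
        bins.append(current)

    rank_bins = bins[rank::num_ranks]
    tokens_used = sum(sequence_lengths[i] for b in rank_bins for i in b)
    token_slots = len(rank_bins) * bin_capacity
    return rank_bins, tokens_used, token_slots


lengths = [4, 5, 3, 6, 2, 7]
rank0 = sequential_pack_sketch(lengths, rank=0, bin_capacity=10, num_ranks=2)
rank1 = sequential_pack_sketch(lengths, rank=1, bin_capacity=10, num_ranks=2)
print(rank0)
print(rank1)
```

Because a new bin is opened as soon as the next sequence does not fit, indices inside each bin stay contiguous and in their original order, at the cost of looser packing than a sorted strategy.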
ffd_check
utils.samplers.multipack.ffd_check(sequence_lengths, bin_capacity, num_bins)
First-fit-decreasing bin packing algorithm check.
Checks if sequences with the given lengths could fit in the specified number of bins.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| sequence_lengths | np.ndarray | Array of sequence lengths. | required |
| bin_capacity | int | Maximum capacity of each bin. | required |
| num_bins | int | Number of bins available. | required |
Returns
| Name | Type | Description |
|---|---|---|
| | bool | True if all sequences can be packed, False otherwise. |
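A minimal pure-Python sketch of a first-fit-decreasing feasibility check is shown below, assuming plain lists instead of `np.ndarray`. The `ffd_fits` name is hypothetical, and the code is not the library's implementation.

```python
def ffd_fits(sequence_lengths, bin_capacity, num_bins):
    """Return True if first-fit-decreasing places every sequence in num_bins bins."""
    free = [bin_capacity] * num_bins  # remaining capacity of each bin
    for length in sorted(sequence_lengths, reverse=True):
        for b in range(num_bins):
            if free[b] >= length:
                free[b] -= length
                break
        else:
            return False  # no bin had room for this sequence
    return True


ok = ffd_fits([5, 4, 3, 3, 2], bin_capacity=9, num_bins=2)
too_few = ffd_fits([5, 4, 3, 3, 2], bin_capacity=9, num_bins=1)
print(ok, too_few)
```

First-fit-decreasing is a greedy heuristic, so a `False` result from a check like this means the heuristic failed to place everything, not that no valid packing exists.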
pack_group
utils.samplers.multipack.pack_group(
sequence_lengths,
group_offset,
bin_capacity,
max_bins,
bin_size,
safe_mode=True,
)
Pack a group of sequences into bins using First-Fit Decreasing algorithm.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| sequence_lengths | np.ndarray | Array of sequence lengths. | required |
| group_offset | int | Offset to apply to indices when returning results. | required |
| bin_capacity | int | Maximum capacity of each bin. | required |
| max_bins | int | Maximum number of bins to use. | required |
| bin_size | int | Maximum number of sequences per bin. | required |
| safe_mode | bool | If True, use a more conservative packing approach. | True |
Returns
| Name | Type | Description |
|---|---|---|
| | list[list[int]] | List of bins, where each bin contains indices of sequences assigned to it. |
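The sketch below illustrates how first-fit-decreasing packing can respect both a token capacity and a per-bin sequence limit while returning offset indices. `ffd_pack_sketch` is a hypothetical helper, not the library's code; it omits `max_bins` and `safe_mode`, whose exact behavior is not described here, and it accepts a plain list.

```python
def ffd_pack_sketch(sequence_lengths, group_offset, bin_capacity, bin_size):
    """Greedy first-fit-decreasing packing (illustrative only)."""
    # Visit sequences from longest to shortest.
    order = sorted(range(len(sequence_lengths)),
                   key=lambda i: sequence_lengths[i], reverse=True)
    bins, free = [], []
    for i in order:
        length = sequence_lengths[i]
        for b, members in enumerate(bins):
            # First bin with enough free tokens and fewer than bin_size sequences.
            if free[b] >= length and len(members) < bin_size:
                members.append(i + group_offset)
                free[b] -= length
                break
        else:
            # Nothing fits: open a new bin for this sequence.
            bins.append([i + group_offset])
            free.append(bin_capacity - length)
    return bins


packed = ffd_pack_sketch([3, 8, 2, 5, 4], group_offset=100,
                         bin_capacity=10, bin_size=3)
print(packed)
```

Adding `group_offset` to each local index lets results from separate groups be concatenated without index collisions.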
pack_parallel
utils.samplers.multipack.pack_parallel(
sequence_lengths,
bin_capacity,
group_size,
bin_size,
num_processes=None,
safe_mode=True,
mp_start_method='fork',
)
Pack sequences into bins using parallel processing.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| sequence_lengths | np.ndarray | Array of sequence lengths. | required |
| bin_capacity | int | Maximum capacity of each bin as total number of tokens. | required |
| group_size | int | Number of sequences to process in each group. | required |
| bin_size | int | Maximum number of sequences per bin. | required |
| num_processes | int \| None | Number of parallel processes to use. | None |
| safe_mode | bool | If True, use a more conservative packing approach. | True |
| mp_start_method | str \| None | Multiprocessing start method ('fork', 'spawn', 'forkserver'). 'spawn' is often safer with Numba/PyTorch. Set to None to use system default. | 'fork' |
Returns: List of bins, where each bin contains indices of sequences assigned to it.
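The group-wise structure behind this kind of packing can be sketched as follows: split the lengths into groups of `group_size`, pack each group independently, and shift local indices by the group's offset. Everything here is an assumption for illustration. `pack_one_group` and `pack_in_groups` are hypothetical names, a thread pool stands in for the worker processes, and a simple next-fit packer stands in for First-Fit Decreasing to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor


def pack_one_group(lengths, offset, bin_capacity):
    """Next-fit packing of one group; returns global indices (hypothetical helper)."""
    bins, current, free = [], [], bin_capacity
    for i, length in enumerate(lengths):
        if length > free and current:
            bins.append(current)  # current bin is full, start a new one
            current, free = [], bin_capacity
        current.append(i + offset)  # shift the local index by the group offset
        free -= length
    if current:
        bins.append(current)
    return bins


def pack_in_groups(sequence_lengths, bin_capacity, group_size, workers=2):
    """Split into groups, pack each group independently, concatenate the bins."""
    offsets = range(0, len(sequence_lengths), group_size)
    groups = [sequence_lengths[o:o + group_size] for o in offsets]
    # Threads are a stand-in here; the documented function uses processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(pack_one_group, groups, offsets,
                           [bin_capacity] * len(groups))
        return [b for group_bins in results for b in group_bins]


all_bins = pack_in_groups([4, 5, 3, 6, 2, 7], bin_capacity=10, group_size=3)
print(all_bins)
```

Since no bin spans two groups, smaller groups give more independent work units but can leave more unused space at group boundaries.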