stereoplegic's Collections: LLM architecture
The Impact of Depth and Width on Transformer Language Model
Generalization
Paper
• 2310.19956
• Published
• 10
Retentive Network: A Successor to Transformer for Large Language Models
Paper
• 2307.08621
• Published
• 173
RWKV: Reinventing RNNs for the Transformer Era
Paper
• 2305.13048
• Published
• 21
Attention Is All You Need
Paper
• 1706.03762
• Published
• 115
READ: Recurrent Adaptation of Large Transformers
Paper
• 2305.15348
• Published
• 2
Emergent Mixture-of-Experts: Can Dense Pre-trained Transformers Benefit
from Emergent Modular Structures?
Paper
• 2310.10908
• Published
• 1
Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained
Language Models
Paper
• 2203.01104
• Published
• 2
Scaling Pre-trained Language Models to Deeper via Parameter-efficient
Architecture
Paper
• 2303.16753
• Published
• 1
GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling
Paper
• 2311.01927
• Published
• 1
White-Box Transformers via Sparse Rate Reduction
Paper
• 2306.01129
• Published
• 1
Improving Transformers with Probabilistic Attention Keys
Paper
• 2110.08678
• Published
• 1
Wide Attention Is The Way Forward For Transformers?
Paper
• 2210.00640
• Published
• 1
Architecture Matters in Continual Learning
Paper
• 2202.00275
• Published
• 1
Scaling TransNormer to 175 Billion Parameters
Paper
• 2307.14995
• Published
• 23
FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor
Cores
Paper
• 2311.05908
• Published
• 14
Hiformer: Heterogeneous Feature Interactions Learning with Transformers
for Recommender Systems
Paper
• 2311.05884
• Published
• 9
AutoML in the Age of Large Language Models: Current Challenges, Future
Opportunities and Risks
Paper
• 2306.08107
• Published
• 1
Continual Learning with Dependency Preserving Hypernetworks
Paper
• 2209.07712
• Published
• 1
Are We Falling in a Middle-Intelligence Trap? An Analysis and Mitigation
of the Reversal Curse
Paper
• 2311.07468
• Published
• 1
Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as
an Alternative to Attention Layers in Transformers
Paper
• 2311.10642
• Published
• 25
Trellis Networks for Sequence Modeling
Paper
• 1810.06682
• Published
• 1
ProSG: Using Prompt Synthetic Gradients to Alleviate Prompt Forgetting
of RNN-like Language Models
Paper
• 2311.01981
• Published
• 1
Exponentially Faster Language Modelling
Paper
• 2311.10770
• Published
• 119
Replacing softmax with ReLU in Vision Transformers
Paper
• 2309.08586
• Published
• 19
Transformer Language Models without Positional Encodings Still Learn
Positional Information
Paper
• 2203.16634
• Published
• 5
Maestro: Uncovering Low-Rank Structures via Trainable Decomposition
Paper
• 2308.14929
• Published
• 1
Robust low-rank training via approximate orthonormal constraints
Paper
• 2306.01485
• Published
• 1
Low Rank Optimization for Efficient Deep Learning: Making A Balance
between Compact Architecture and Fast Training
Paper
• 2303.13635
• Published
• 1
Cuttlefish: Low-Rank Model Training without All the Tuning
Paper
• 2305.02538
• Published
• 1
Relaxed Attention for Transformer Models
Paper
• 2209.09735
• Published
• 1
I3D: Transformer architectures with input-dependent dynamic depth for
speech recognition
Paper
• 2303.07624
• Published
• 1
Emergence of Segmentation with Minimalistic White-Box Transformers
Paper
• 2308.16271
• Published
• 17
White-Box Transformers via Sparse Rate Reduction: Compression Is All
There Is?
Paper
• 2311.13110
• Published
• 2
Linear Self-Attention Approximation via Trainable Feedforward Kernel
Paper
• 2211.04076
• Published
• 1
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Paper
• 2312.00752
• Published
• 150
Decoder-only Architecture for Speech Recognition with CTC Prompts and
Text Data Augmentation
Paper
• 2309.08876
• Published
• 1
Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models
Paper
• 2312.04410
• Published
• 15
HyperMixer: An MLP-based Low Cost Alternative to Transformers
Paper
• 2203.03691
• Published
• 1
Interpret Vision Transformers as ConvNets with Dynamic Convolutions
Paper
• 2309.10713
• Published
• 1
Laughing Hyena Distillery: Extracting Compact Recurrences From
Convolutions
Paper
• 2310.18780
• Published
• 3
LLM360: Towards Fully Transparent Open-Source LLMs
Paper
• 2312.06550
• Published
• 57
Text Generation with Diffusion Language Models: A Pre-training Approach
with Continuous Paragraph Denoise
Paper
• 2212.11685
• Published
• 2
AR-Diffusion: Auto-Regressive Diffusion Model for Text Generation
Paper
• 2305.09515
• Published
• 3
TESS: Text-to-Text Self-Conditioned Simplex Diffusion
Paper
• 2305.08379
• Published
• 3
DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models
Paper
• 2210.08933
• Published
• 6
DiffuSIA: A Spiral Interaction Architecture for Encoder-Decoder Text
Diffusion
Paper
• 2305.11517
• Published
• 1
Diffusion Language Models Can Perform Many Tasks with Scaling and
Instruction-Finetuning
Paper
• 2308.12219
• Published
• 1
Likelihood-Based Diffusion Language Models
Paper
• 2305.18619
• Published
• 1
ParaGuide: Guided Diffusion Paraphrasers for Plug-and-Play Textual Style
Transfer
Paper
• 2308.15459
• Published
• 1
SSD-LM: Semi-autoregressive Simplex-based Diffusion Language Model for
Text Generation and Modular Control
Paper
• 2210.17432
• Published
• 2
Self-conditioned Embedding Diffusion for Text Generation
Paper
• 2211.04236
• Published
• 1
Cached Transformers: Improving Transformers with Differentiable Memory
Cache
Paper
• 2312.12742
• Published
• 13
SwitchGPT: Adapting Large Language Models for Non-Text Outputs
Paper
• 2309.07623
• Published
• 1
Learning to Skip for Language Modeling
Paper
• 2311.15436
• Published
• 1
Zoology: Measuring and Improving Recall in Efficient Language Models
Paper
• 2312.04927
• Published
• 3
Beyond Surface: Probing LLaMA Across Scales and Layers
Paper
• 2312.04333
• Published
• 19
DeLighT: Deep and Light-weight Transformer
Paper
• 2008.00623
• Published
• 1
Leveraging Contextual Information for Effective Entity Salience
Detection
Paper
• 2309.07990
• Published
• 8
Block-State Transformers
Paper
• 2306.09539
• Published
• 10
Blockwise Parallel Transformer for Long Context Large Models
Paper
• 2305.19370
• Published
• 3
Block-Recurrent Transformers
Paper
• 2203.07852
• Published
• 1
Vision Mamba: Efficient Visual Representation Learning with
Bidirectional State Space Model
Paper
• 2401.09417
• Published
• 62
RNNs of RNNs: Recursive Construction of Stable Assemblies of Recurrent
Neural Networks
Paper
• 2106.08928
• Published
• 1
LKCA: Large Kernel Convolutional Attention
Paper
• 2401.05738
• Published
• 1
InfoDiffusion: Information Entropy Aware Diffusion Process for
Non-Autoregressive Text Generation
Paper
• 2310.11976
• Published
• 2
Enhancing Phrase Representation by Information Bottleneck Guided Text
Diffusion Process for Keyphrase Extraction
Paper
• 2308.08739
• Published
• 1
Gated Linear Attention Transformers with Hardware-Efficient Training
Paper
• 2312.06635
• Published
• 9
Hierarchically Gated Recurrent Neural Network for Sequence Modeling
Paper
• 2311.04823
• Published
• 2
Improving Natural Language Capability of Code Large Language Model
Paper
• 2401.14242
• Published
• 1
BlackMamba: Mixture of Experts for State-Space Models
Paper
• 2402.01771
• Published
• 25
Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning
Tasks
Paper
• 2402.04248
• Published
• 32
Simple Hardware-Efficient Long Convolutions for Sequence Modeling
Paper
• 2302.06646
• Published
• 2
A Quantitative Review on Language Model Efficiency Research
Paper
• 2306.01768
• Published
• 2
A Unified View of Long-Sequence Models towards Modeling Million-Scale
Dependencies
Paper
• 2302.06218
• Published
• 1
Accelerating Toeplitz Neural Network with Constant-time Inference
Complexity
Paper
• 2311.08756
• Published
• 1
Agent Attention: On the Integration of Softmax and Linear Attention
Paper
• 2312.08874
• Published
• 2
Linear Transformers with Learnable Kernel Functions are Better
In-Context Models
Paper
• 2402.10644
• Published
• 81
Enhancing Transformer RNNs with Multiple Temporal Perspectives
Paper
• 2402.02625
• Published
ByT5: Towards a token-free future with pre-trained byte-to-byte models
Paper
• 2105.13626
• Published
• 4
Griffin: Mixing Gated Linear Recurrences with Local Attention for
Efficient Language Models
Paper
• 2402.19427
• Published
• 56
Simple linear attention language models balance the recall-throughput
tradeoff
Paper
• 2402.18668
• Published
• 20
Linear Transformers are Versatile In-Context Learners
Paper
• 2402.14180
• Published
• 7
DenseMamba: State Space Models with Dense Hidden Connection for
Efficient Large Language Models
Paper
• 2403.00818
• Published
• 19
MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers
Paper
• 2305.07185
• Published
• 10
OpenELM: An Efficient Language Model Family with Open-source Training
and Inference Framework
Paper
• 2404.14619
• Published
• 126
Megalodon: Efficient LLM Pretraining and Inference with Unlimited
Context Length
Paper
• 2404.08801
• Published
• 66
Various Lengths, Constant Speed: Efficient Language Modeling with
Lightning Attention
Paper
• 2405.17381
• Published
The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax
Mimicry
Paper
• 2402.04347
• Published
• 15
SVD-LLM: Truncation-aware Singular Value Decomposition for Large
Language Model Compression
Paper
• 2403.07378
• Published
• 4
GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill
and Extreme KV-Cache Compression
Paper
• 2407.12077
• Published
• 57