Search | arXiv e-print repository

Showing 1–50 of 52 results for author: Neyshabur, B

  1. arXiv:2403.05530  [pdf, other]

    cs.CL cs.AI

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Authors: Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love, et al. (1092 additional authors not shown)

    Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February…

    Submitted 14 June, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

  2. arXiv:2312.11805  [pdf, other]

    cs.CL cs.AI cs.CV

    Gemini: A Family of Highly Capable Multimodal Models

    Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, et al. (1325 additional authors not shown)

    Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultr…

    Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

  3. arXiv:2312.06585  [pdf, other]

    cs.LG

    Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models

    Authors: Avi Singh, John D. Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J. Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi, Abhishek Kumar, Alex Alemi, Alex Rizkowsky, Azade Nova, Ben Adlam, Bernd Bohnet, Gamaleldin Elsayed, Hanie Sedghi, Igor Mordatch, Isabelle Simpson, Izzeddin Gur, Jasper Snoek, Jeffrey Pennington, Jiri Hron, et al. (16 additional authors not shown)

    Abstract: Fine-tuning language models (LMs) on human-generated data remains a prevalent practice. However, the performance of such models is often limited by the quantity and diversity of high-quality human data. In this paper, we explore whether we can go beyond human data on tasks where we have access to scalar feedback, for example, on math problems where one can verify correctness. To do so, we investig…

    Submitted 17 April, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

    Comments: Accepted to TMLR. Camera-ready version. First three authors contributed equally

  4. arXiv:2211.11052  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    Convexifying Transformers: Improving optimization and understanding of transformer networks

    Authors: Tolga Ergen, Behnam Neyshabur, Harsh Mehta

    Abstract: Understanding the fundamental mechanism behind the success of transformer networks is still an open problem in the deep learning literature. Although their remarkable performance has been mostly attributed to the self-attention mechanism, the literature still lacks a solid analysis of these networks and interpretation of the functions learned by them. To this end, we study the training problem of…

    Submitted 20 November, 2022; originally announced November 2022.

  5. arXiv:2211.10193  [pdf, other]

    cs.LG

    Layer-Stack Temperature Scaling

    Authors: Amr Khalifa, Michael C. Mozer, Hanie Sedghi, Behnam Neyshabur, Ibrahim Alabdulmohsin

    Abstract: Recent works demonstrate that early layers in a neural network contain useful information for prediction. Inspired by this, we show that extending temperature scaling across all layers improves both calibration and accuracy. We call this procedure "layer-stack temperature scaling" (LATES). Informally, LATES grants each layer a weighted vote during inference. We evaluate it on five popular convolut…

    Submitted 18 November, 2022; originally announced November 2022.

    Comments: 10 pages, 7 figures, 3 tables

    ACM Class: I.2.6; I.2.10

  6. arXiv:2211.09066  [pdf, other]

    cs.LG cs.AI cs.CL

    Teaching Algorithmic Reasoning via In-context Learning

    Authors: Hattie Zhou, Azade Nova, Hugo Larochelle, Aaron Courville, Behnam Neyshabur, Hanie Sedghi

    Abstract: Large language models (LLMs) have shown increasing in-context learning capabilities through scaling up model and data size. Despite this progress, LLMs are still unable to solve algorithmic reasoning problems. While providing a rationale with the final answer has led to further improvements in multi-step reasoning problems, Anil et al. 2022 showed that even simple algorithmic reasoning tasks such…

    Submitted 15 November, 2022; originally announced November 2022.

  7. arXiv:2211.08403  [pdf, other]

    cs.LG cs.AI cs.CV stat.ML

    REPAIR: REnormalizing Permuted Activations for Interpolation Repair

    Authors: Keller Jordan, Hanie Sedghi, Olga Saukh, Rahim Entezari, Behnam Neyshabur

    Abstract: In this paper we look into the conjecture of Entezari et al. (2021) which states that if the permutation invariance of neural networks is taken into account, then there is likely no loss barrier to the linear interpolation between SGD solutions. First, we observe that neuron alignment methods alone are insufficient to establish low-barrier linear connectivity between SGD solutions due to a phenome…

    Submitted 25 September, 2023; v1 submitted 15 November, 2022; originally announced November 2022.

  8. arXiv:2209.06640  [pdf, other]

    cs.LG cs.AI

    Revisiting Neural Scaling Laws in Language and Vision

    Authors: Ibrahim Alabdulmohsin, Behnam Neyshabur, Xiaohua Zhai

    Abstract: The remarkable progress in deep learning in recent years is largely driven by improvements in scale, where bigger models are trained on larger datasets for longer schedules. To predict the benefit of scale empirically, we argue for a more rigorous methodology based on the extrapolation loss, instead of reporting the best-fitting (interpolating) parameters. We then present a recipe for estimating s…

    Submitted 1 November, 2022; v1 submitted 13 September, 2022; originally announced September 2022.

    Journal ref: Neural Information Processing Systems (NeurIPS), 2022

  9. arXiv:2207.04901  [pdf, other]

    cs.CL cs.LG

    Exploring Length Generalization in Large Language Models

    Authors: Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, Behnam Neyshabur

    Abstract: The ability to extrapolate from short problem instances to longer ones is an important form of out-of-distribution generalization in reasoning tasks, and is crucial when learning from datasets where longer problem instances are rare. These include theorem proving, solving quantitative mathematics problems, and reading/summarizing novels. In this paper, we run careful empirical studies exploring th…

    Submitted 14 November, 2022; v1 submitted 11 July, 2022; originally announced July 2022.

  10. arXiv:2206.14858  [pdf, other]

    cs.CL cs.AI cs.LG

    Solving Quantitative Reasoning Problems with Language Models

    Authors: Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, Vedant Misra

    Abstract: Language models have achieved remarkable performance on a wide range of tasks that require natural language understanding. Nevertheless, state-of-the-art models have generally struggled with tasks that require quantitative reasoning, such as solving mathematics, science, and engineering problems at the college level. To help close this gap, we introduce Minerva, a large language model pretrained o…

    Submitted 30 June, 2022; v1 submitted 29 June, 2022; originally announced June 2022.

    Comments: 12 pages, 5 figures + references and appendices

  11. arXiv:2206.13947  [pdf, other]

    cs.LG cs.CL

    Long Range Language Modeling via Gated State Spaces

    Authors: Harsh Mehta, Ankit Gupta, Ashok Cutkosky, Behnam Neyshabur

    Abstract: State space models have been shown to be effective at modeling long range dependencies, especially on sequence classification tasks. In this work we focus on autoregressive sequence modeling over English books, GitHub source code and arXiv mathematics articles. Based on recent developments around the effectiveness of gated activation functions, we propose a new layer named Gated State Space (GSS) and sh…

    Submitted 2 July, 2022; v1 submitted 26 June, 2022; originally announced June 2022.

  12. arXiv:2206.10915  [pdf, other]

    cs.CV

    Understanding the effect of sparsity on neural networks robustness

    Authors: Lukas Timpl, Rahim Entezari, Hanie Sedghi, Behnam Neyshabur, Olga Saukh

    Abstract: This paper examines the impact of static sparsity on the robustness of a trained network to weight perturbations, data corruption, and adversarial examples. We show that, up to a certain sparsity achieved by increasing network width and depth while keeping the network capacity fixed, sparsified networks consistently match and often outperform their initially dense versions. Robustness and accuracy…

    Submitted 22 June, 2022; originally announced June 2022.

  13. arXiv:2206.04615  [pdf, other]

    cs.CL cs.AI cs.CY cs.LG stat.ML

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, et al. (426 additional authors not shown)

    Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-futur…

    Submitted 12 June, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

    Comments: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

    Journal ref: Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj

  14. arXiv:2203.07852  [pdf, other]

    cs.LG cs.AI cs.NE

    Block-Recurrent Transformers

    Authors: DeLesley Hutchins, Imanol Schlag, Yuhuai Wu, Ethan Dyer, Behnam Neyshabur

    Abstract: We introduce the Block-Recurrent Transformer, which applies a transformer layer in a recurrent fashion along a sequence, and has linear complexity with respect to sequence length. Our recurrent cell operates on blocks of tokens rather than single tokens during training, and leverages parallel computation within a block in order to make efficient use of accelerator hardware. The cell itself is stri…

    Submitted 1 November, 2022; v1 submitted 11 March, 2022; originally announced March 2022.

    Comments: Update to NeurIPS camera-ready version

  15. arXiv:2202.01994  [pdf, other]

    cs.LG cs.CL

    Data Scaling Laws in NMT: The Effect of Noise and Architecture

    Authors: Yamini Bansal, Behrooz Ghorbani, Ankush Garg, Biao Zhang, Maxim Krikun, Colin Cherry, Behnam Neyshabur, Orhan Firat

    Abstract: In this work, we study the effect of varying the architecture and training data quality on the data scaling properties of Neural Machine Translation (NMT). First, we establish that the test loss of encoder-decoder transformer models scales as a power law in the number of training samples, with a dependence on the model size. Then, we systematically vary aspects of the training setup to understand…

    Submitted 4 February, 2022; originally announced February 2022.

  16. arXiv:2201.04234  [pdf, other]

    cs.LG stat.ML

    Leveraging Unlabeled Data to Predict Out-of-Distribution Performance

    Authors: Saurabh Garg, Sivaraman Balakrishnan, Zachary C. Lipton, Behnam Neyshabur, Hanie Sedghi

    Abstract: Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions that may cause performance drops. In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data. We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on…

    Submitted 14 October, 2022; v1 submitted 11 January, 2022; originally announced January 2022.

    Comments: Accepted at ICLR 2022

  17. arXiv:2110.06296  [pdf, other]

    cs.LG

    The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks

    Authors: Rahim Entezari, Hanie Sedghi, Olga Saukh, Behnam Neyshabur

    Abstract: In this paper, we conjecture that if the permutation invariance of neural networks is taken into account, SGD solutions will likely have no barrier in the linear interpolation between them. Although it is a bold conjecture, we show how extensive empirical attempts fall short of refuting it. We further provide a preliminary theoretical result to support our conjecture. Our conjecture has implicatio…

    Submitted 5 July, 2022; v1 submitted 12 October, 2021; originally announced October 2021.

  18. arXiv:2110.04369  [pdf, other]

    cs.LG cs.AI

    A Loss Curvature Perspective on Training Instability in Deep Learning

    Authors: Justin Gilmer, Behrooz Ghorbani, Ankush Garg, Sneha Kudugunta, Behnam Neyshabur, David Cardoze, George Dahl, Zachary Nado, Orhan Firat

    Abstract: In this work, we study the evolution of the loss Hessian across many classification tasks in order to understand the effect the curvature of the loss has on the training dynamics. Whereas prior work has focused on how different learning rates affect the loss Hessian observed during training, we also analyze the effects of model initialization, architectural choices, and common training heuristics…

    Submitted 8 October, 2021; originally announced October 2021.

    Comments: 20 pages, 16 figures

  19. arXiv:2110.02095  [pdf, other]

    cs.LG cs.AI cs.CV stat.ML

    Exploring the Limits of Large Scale Pre-training

    Authors: Samira Abnar, Mostafa Dehghani, Behnam Neyshabur, Hanie Sedghi

    Abstract: Recent developments in large-scale machine learning suggest that by scaling up data, model size and training time properly, one might observe that improvements in pre-training would transfer favorably to most downstream tasks. In this work, we systematically study this phenomenon and establish that, as we increase the upstream accuracy, the performance of downstream tasks saturates. In particular,…

    Submitted 5 October, 2021; originally announced October 2021.

  20. arXiv:2106.15831  [pdf, other]

    cs.LG cs.AI cs.CV

    The Evolution of Out-of-Distribution Robustness Throughout Fine-Tuning

    Authors: Anders Andreassen, Yasaman Bahri, Behnam Neyshabur, Rebecca Roelofs

    Abstract: Although machine learning models typically experience a drop in performance on out-of-distribution data, accuracies on in- versus out-of-distribution data are widely observed to follow a single linear trend when evaluated across a testbed of models. Models that are more accurate on the out-of-distribution data relative to this baseline exhibit "effective robustness" and are exceedingly rare. Ident…

    Submitted 30 June, 2021; originally announced June 2021.

    Comments: 27 pages, 25 figures

  21. arXiv:2106.09647  [pdf, other]

    cs.LG stat.ML

    Deep Learning Through the Lens of Example Difficulty

    Authors: Robert J. N. Baldock, Hartmut Maennel, Behnam Neyshabur

    Abstract: Existing work on understanding deep learning often employs measures that compress all data-dependent information into a few numbers. In this work, we adopt a perspective based on the role of individual examples. We introduce a measure of the computational difficulty of making a prediction for a given input: the (effective) prediction depth. Our extensive investigation reveals surprising yet simple…

    Submitted 18 June, 2021; v1 submitted 17 June, 2021; originally announced June 2021.

    Comments: Main paper: 15 pages, 8 figures. Appendix: 31 pages, 40 figures

  22. arXiv:2012.07976  [pdf, other]

    cs.LG stat.ML

    NeurIPS 2020 Competition: Predicting Generalization in Deep Learning

    Authors: Yiding Jiang, Pierre Foret, Scott Yak, Daniel M. Roy, Hossein Mobahi, Gintare Karolina Dziugaite, Samy Bengio, Suriya Gunasekar, Isabelle Guyon, Behnam Neyshabur

    Abstract: Understanding generalization in deep learning is arguably one of the most important questions in deep learning. Deep learning has been successfully adopted to a large number of problems ranging from pattern recognition to complex decision making, but many recent researchers have raised many concerns about deep learning, among which the most important is generalization. Despite numerous attempts, c…

    Submitted 14 December, 2020; originally announced December 2020.

    Comments: 20 pages, 2 figures. Accepted for NeurIPS 2020 Competitions Track. Lead organizer: Yiding Jiang

  23. arXiv:2012.03107  [pdf, other]

    cs.LG cs.CV eess.IV stat.ML

    When Do Curricula Work?

    Authors: Xiaoxia Wu, Ethan Dyer, Behnam Neyshabur

    Abstract: Inspired by human learning, researchers have proposed ordering examples during training based on their difficulty. Both curriculum learning, exposing a network to easier examples early in training, and anti-curriculum learning, showing the most difficult examples first, have been suggested as improvements to the standard i.i.d. training. In this work, we set out to investigate the relative benefit…

    Submitted 9 February, 2021; v1 submitted 5 December, 2020; originally announced December 2020.

    Comments: ICLR 2021

  24. arXiv:2010.15775  [pdf, other]

    cs.LG cs.CV stat.ML

    Understanding the Failure Modes of Out-of-Distribution Generalization

    Authors: Vaishnavh Nagarajan, Anders Andreassen, Behnam Neyshabur

    Abstract: Empirical studies suggest that machine learning models often rely on features, such as the background, that may be spuriously correlated with the label only during training time, resulting in poor accuracy during test-time. In this work, we identify the fundamental factors that give rise to this behavior, by explaining why models fail this way even in easy-to-learn tasks where one would expe…

    Submitted 29 April, 2021; v1 submitted 29 October, 2020; originally announced October 2020.

  25. arXiv:2010.14495  [pdf, other]

    cs.LG stat.ML

    Are wider nets better given the same number of parameters?

    Authors: Anna Golubeva, Behnam Neyshabur, Guy Gur-Ari

    Abstract: Empirical studies demonstrate that the performance of neural networks improves with increasing number of parameters. In most of these studies, the number of parameters is increased by increasing the network width. This begs the question: Is the observed improvement due to the larger number of parameters, or is it due to the larger width itself? We compare different ways of increasing model width w…

    Submitted 30 April, 2021; v1 submitted 27 October, 2020; originally announced October 2020.

    Comments: 9 pages

  26. arXiv:2010.08127  [pdf, other]

    cs.LG cs.CV cs.NE math.ST stat.ML

    The Deep Bootstrap Framework: Good Online Learners are Good Offline Generalizers

    Authors: Preetum Nakkiran, Behnam Neyshabur, Hanie Sedghi

    Abstract: We propose a new framework for reasoning about generalization in deep learning. The core idea is to couple the Real World, where optimizers take stochastic gradient steps on the empirical loss, to an Ideal World, where optimizers take steps on the population loss. This leads to an alternate decomposition of test error into: (1) the Ideal World test error plus (2) the gap between the two worlds. If…

    Submitted 18 February, 2021; v1 submitted 15 October, 2020; originally announced October 2020.

    Comments: Accepted to ICLR 2021

  27. arXiv:2010.01412  [pdf, other]

    cs.LG stat.ML

    Sharpness-Aware Minimization for Efficiently Improving Generalization

    Authors: Pierre Foret, Ariel Kleiner, Hossein Mobahi, Behnam Neyshabur

    Abstract: In today's heavily overparameterized models, the value of the training loss provides few guarantees on model generalization ability. Indeed, optimizing only the training loss value, as is commonly done, can easily lead to suboptimal model quality. Motivated by prior work connecting the geometry of the loss landscape and generalization, we introduce a novel, effective procedure for instead simultan…

    Submitted 29 April, 2021; v1 submitted 3 October, 2020; originally announced October 2020.

  28. arXiv:2008.13363  [pdf, other]

    cs.LG cs.CV stat.ML

    Extreme Memorization via Scale of Initialization

    Authors: Harsh Mehta, Ashok Cutkosky, Behnam Neyshabur

    Abstract: We construct an experimental setup in which changing the scale of initialization strongly impacts the implicit regularization induced by SGD, interpolating from good generalization performance to completely memorizing the training set while making little progress on the test set. Moreover, we find that the extent and manner in which generalization ability is affected depends on the activation and…

    Submitted 1 May, 2021; v1 submitted 31 August, 2020; originally announced August 2020.

  29. arXiv:2008.11687  [pdf, other]

    cs.LG stat.ML

    What is being transferred in transfer learning?

    Authors: Behnam Neyshabur, Hanie Sedghi, Chiyuan Zhang

    Abstract: One desired capability for machines is the ability to transfer their knowledge of one domain to another where data is (usually) scarce. Despite ample adaptation of transfer learning in various deep learning applications, we do not yet understand what enables a successful transfer and which part of the network is responsible for that. In this paper, we provide new tools and analyses to address thes…

    Submitted 14 January, 2021; v1 submitted 26 August, 2020; originally announced August 2020.

    Comments: Equal contribution, authors ordered randomly

    Journal ref: NeurIPS 2020

  30. arXiv:2007.13657  [pdf, other]

    cs.LG cs.CV stat.ML

    Towards Learning Convolutions from Scratch

    Authors: Behnam Neyshabur

    Abstract: Convolution is one of the most essential components of architectures used in computer vision. As machine learning moves towards reducing the expert bias and learning it from data, a natural next step seems to be learning convolution-like structures from scratch. This, however, has proven elusive. For example, current state-of-the-art architecture search algorithms use convolution as one of the exi…

    Submitted 27 July, 2020; originally announced July 2020.

    Comments: 18 pages, 9 figures, 4 tables

  31. arXiv:1912.02975  [pdf, other]

    cs.LG cs.AI stat.ML

    Observational Overfitting in Reinforcement Learning

    Authors: Xingyou Song, Yiding Jiang, Stephen Tu, Yilun Du, Behnam Neyshabur

    Abstract: A major component of overfitting in model-free reinforcement learning (RL) involves the case where the agent may mistakenly correlate reward with certain spurious features from the observations generated by the Markov Decision Process (MDP). We provide a general framework for analyzing this scenario, which we use to design multiple synthetic benchmarks from only modifying the observation space of…

    Submitted 27 December, 2019; v1 submitted 5 December, 2019; originally announced December 2019.

    Comments: Published as a conference paper in ICLR 2020

  32. arXiv:1912.02178  [pdf, other]

    cs.LG stat.ML

    Fantastic Generalization Measures and Where to Find Them

    Authors: Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, Samy Bengio

    Abstract: Generalization of deep networks has been of great interest in recent years, resulting in a number of theoretically and empirically motivated complexity measures. However, most papers proposing such measures study only a small set of models, leaving open the question of whether the conclusion drawn from those experiments would remain valid in other settings. We present the first large scale study o…

    Submitted 4 December, 2019; originally announced December 2019.

  33. arXiv:1912.00528  [pdf, other]

    cs.LG stat.ML

    The intriguing role of module criticality in the generalization of deep networks

    Authors: Niladri S. Chatterji, Behnam Neyshabur, Hanie Sedghi

    Abstract: We study the phenomenon that some modules of deep neural networks (DNNs) are more critical than others, meaning that rewinding their parameter values back to initialization, while keeping other modules fixed at the trained parameters, results in a large drop in the network's performance. Our analysis reveals interesting properties of the loss landscape which leads us to propose a complexity measur…

    Submitted 14 February, 2020; v1 submitted 1 December, 2019; originally announced December 2019.

  34. arXiv:1805.12076  [pdf, other]

    cs.LG stat.ML

    Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks

    Authors: Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, Nathan Srebro

    Abstract: Despite existing work on ensuring generalization of neural networks in terms of scale sensitive complexity measures, such as norms, margin and sharpness, these complexity measures do not offer an explanation of why neural networks generalize better with over-parametrization. In this work we suggest a novel complexity measure based on unit-wise capacities resulting in a tighter generalization bound…

    Submitted 30 May, 2018; originally announced May 2018.

    Comments: 19 pages, 8 figures

  35. arXiv:1802.05296  [pdf, other]

    cs.LG

    Stronger generalization bounds for deep nets via a compression approach

    Authors: Sanjeev Arora, Rong Ge, Behnam Neyshabur, Yi Zhang

    Abstract: Deep nets generalize well despite having more parameters than the number of training samples. Recent works try to give an explanation using PAC-Bayes and Margin-based analyses, but do not as yet result in sample complexity bounds better than naive parameter counting. The current paper shows generalization bounds that are orders of magnitude better in practice. These rely upon new succinct reparamet…

    Submitted 26 November, 2018; v1 submitted 14 February, 2018; originally announced February 2018.

  36. arXiv:1709.01953  [pdf, other]

    cs.LG

    Implicit Regularization in Deep Learning

    Authors: Behnam Neyshabur

    Abstract: In an attempt to better understand generalization in deep learning, we study several possible explanations. We show that implicit regularization induced by the optimization method is playing a key role in generalization and success of deep learning models. Motivated by this view, we study how different complexity measures can ensure generalization and explain how optimization algorithms can implic…

    Submitted 7 September, 2017; v1 submitted 6 September, 2017; originally announced September 2017.

    Comments: PhD Thesis

  37. arXiv:1707.09564  [pdf, ps, other]

    cs.LG

    A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks

    Authors: Behnam Neyshabur, Srinadh Bhojanapalli, Nathan Srebro

    Abstract: We present a generalization bound for feedforward neural networks in terms of the product of the spectral norm of the layers and the Frobenius norm of the weights. The generalization bound is derived using a PAC-Bayes analysis.

    Submitted 23 February, 2018; v1 submitted 29 July, 2017; originally announced July 2017.

    Comments: Accepted to ICLR 2018

  38. arXiv:1706.08947  [pdf, other]

    cs.LG

    Exploring Generalization in Deep Learning

    Authors: Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, Nathan Srebro

    Abstract: With a goal of understanding what drives generalization in deep networks, we consider several recently suggested explanations, including norm-based control, sharpness and robustness. We study how these measures can ensure generalization, highlighting the importance of scale normalization, and making a connection between sharpness and PAC-Bayes theory. We then investigate how well the measures expl…

    Submitted 6 July, 2017; v1 submitted 27 June, 2017; originally announced June 2017.

    Comments: 19 pages, 8 figures

  39. arXiv:1705.09280  [pdf, other]

    stat.ML cs.LG

    Implicit Regularization in Matrix Factorization

    Authors: Suriya Gunasekar, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, Nathan Srebro

    Abstract: We study implicit regularization when optimizing an underdetermined quadratic objective over a matrix $X$ with gradient descent on a factorization of $X$. We conjecture and provide empirical and theoretical evidence that with small enough step sizes and initialization close enough to the origin, gradient descent on a full dimensional factorization converges to the minimum nuclear norm solution.

    Submitted 25 May, 2017; originally announced May 2017.

  40. arXiv:1705.07831  [pdf, other]

    cs.LG cs.CV

    Stabilizing GAN Training with Multiple Random Projections

    Authors: Behnam Neyshabur, Srinadh Bhojanapalli, Ayan Chakrabarti

    Abstract: Training generative adversarial networks is unstable in high-dimensions as the true data distribution tends to be concentrated in a small fraction of the ambient space. The discriminator is then quickly able to classify nearly all generated samples as fake, leaving the generator without meaningful gradients and causing it to deteriorate after a point in training. In this work, we propose training…

    Submitted 22 June, 2018; v1 submitted 22 May, 2017; originally announced May 2017.

  41. arXiv:1705.03071  [pdf, other]

    cs.LG

    Geometry of Optimization and Implicit Regularization in Deep Learning

    Authors: Behnam Neyshabur, Ryota Tomioka, Ruslan Salakhutdinov, Nathan Srebro

    Abstract: We argue that the optimization plays a crucial role in generalization of deep learning models through implicit regularization. We do this by demonstrating that generalization ability is not controlled by network size but rather by some other implicit control. We then demonstrate how changing the empirical optimization procedure can improve generalization, even if actual optimization quality is not…

    Submitted 8 May, 2017; originally announced May 2017.

    Comments: This survey chapter was done as a part of the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI) "Why & When Deep Learning works -- looking inside Deep Learning" compendium with the generous support of ICRI-CI. arXiv admin note: substantial text overlap with arXiv:1506.02617

  42. arXiv:1612.06246  [pdf, ps, other]

    cs.LG stat.ML

    Corralling a Band of Bandit Algorithms

    Authors: Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, Robert E. Schapire

    Abstract: We study the problem of combining multiple bandit algorithms (that is, online learning algorithms with partial feedback) with the goal of creating a master algorithm that performs almost as well as the best base algorithm if it were to be run on its own. The main challenge is that when run with a master, base algorithms unavoidably receive much less feedback and it is thus critical that the master…

    Submitted 5 June, 2017; v1 submitted 19 December, 2016; originally announced December 2016.

    Comments: Accepted to COLT 2017

  43. arXiv:1605.07221  [pdf, other]

    stat.ML cs.LG math.OC

    Global Optimality of Local Search for Low Rank Matrix Recovery

    Authors: Srinadh Bhojanapalli, Behnam Neyshabur, Nathan Srebro

    Abstract: We show that there are no spurious local minima in the non-convex factorized parametrization of low-rank matrix recovery from incoherent linear measurements. With noisy measurements we show all local minima are very close to a global optimum. Together with a curvature bound at saddle points, this yields a polynomial time global convergence guarantee for stochastic gradient descent from random…

    Submitted 26 May, 2016; v1 submitted 23 May, 2016; originally announced May 2016.

    Comments: 21 pages, 3 figures

  44. arXiv:1605.07154  [pdf, other]

    cs.LG cs.NE

    Path-Normalized Optimization of Recurrent Neural Networks with ReLU Activations

    Authors: Behnam Neyshabur, Yuhuai Wu, Ruslan Salakhutdinov, Nathan Srebro

    Abstract: We investigate the parameter-space geometry of recurrent neural networks (RNNs), and develop an adaptation of path-SGD optimization method, attuned to this geometry, that can learn plain RNNs with ReLU activations. On several datasets that require capturing long-term dependency structure, we show that path-SGD can significantly improve trainability of ReLU RNNs compared to RNNs trained with SGD, e…

    Submitted 23 May, 2016; originally announced May 2016.

    Comments: 15 pages

  45. arXiv:1511.06747  [pdf, other]

    cs.LG

    Data-Dependent Path Normalization in Neural Networks

    Authors: Behnam Neyshabur, Ryota Tomioka, Ruslan Salakhutdinov, Nathan Srebro

    Abstract: We propose a unified framework for neural net normalization, regularization and optimization, which includes Path-SGD and Batch-Normalization and interpolates between them across two different dimensions. Through this framework we investigate the issue of invariance of the optimization, data dependence and the connection with natural gradients.

    Submitted 19 January, 2016; v1 submitted 20 November, 2015; originally announced November 2015.

    Comments: 17 pages, 3 figures

  46. arXiv:1506.02617  [pdf, other]

    cs.LG cs.CV cs.NE stat.ML

    Path-SGD: Path-Normalized Optimization in Deep Neural Networks

    Authors: Behnam Neyshabur, Ruslan Salakhutdinov, Nathan Srebro

    Abstract: We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD…

    Submitted 8 June, 2015; originally announced June 2015.

    Comments: 12 pages, 5 figures

  47. arXiv:1503.00036  [pdf, ps, other]

    cs.LG cs.AI cs.NE stat.ML

    Norm-Based Capacity Control in Neural Networks

    Authors: Behnam Neyshabur, Ryota Tomioka, Nathan Srebro

    Abstract: We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.

    Submitted 14 April, 2015; v1 submitted 27 February, 2015; originally announced March 2015.

    Comments: 29 pages

  48. arXiv:1412.6614  [pdf, ps, other]

    cs.LG cs.AI cs.CV stat.ML

    In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning

    Authors: Behnam Neyshabur, Ryota Tomioka, Nathan Srebro

    Abstract: We present experiments demonstrating that some other form of capacity control, different from network size, plays a central role in learning multilayer feed-forward networks. We argue, partially through analogy to matrix factorization, that this is an inductive bias that can help shed light on deep learning.

    Submitted 16 April, 2015; v1 submitted 20 December, 2014; originally announced December 2014.

    Comments: 9 pages, 2 figures

  49. arXiv:1410.5518  [pdf, ps, other]

    stat.ML cs.DS cs.IR cs.LG

    On Symmetric and Asymmetric LSHs for Inner Product Search

    Authors: Behnam Neyshabur, Nathan Srebro

    Abstract: We consider the problem of designing locality sensitive hashes (LSH) for inner product similarity, and of the power of asymmetric hashes in this context. Shrivastava and Li argue that there is no symmetric LSH for the problem and propose an asymmetric LSH based on different mappings for query and database points. However, we show there does exist a simple symmetric LSH that enjoys stronger guarant…

    Submitted 8 June, 2015; v1 submitted 20 October, 2014; originally announced October 2014.

    Comments: 11 pages, 3 figures, In Proceedings of The 32nd International Conference on Machine Learning (ICML)

  50. arXiv:1405.3167  [pdf, ps, other]

    cs.LG

    Clustering, Hamming Embedding, Generalized LSH and the Max Norm

    Authors: Behnam Neyshabur, Yury Makarychev, Nathan Srebro

    Abstract: We study the convex relaxation of clustering and hamming embedding, focusing on the asymmetric case (co-clustering and asymmetric hamming embedding), understanding their relationship to LSH as studied by (Charikar 2002) and to the max-norm ball, and the differences between their symmetric and asymmetric versions.

    Submitted 13 May, 2014; originally announced May 2014.

    Comments: 17 pages