KingsmanVince

@KingsmanVince@kbin.social
This profile is from a federated server and may be incomplete. Browse more content on the original instance.

MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks (aclanthology.org) en

Vision and language models (VL) are known to exploit unrobust indicators in individual modalities (e.g., introduced by distributional biases) instead of focusing on relevant information in each modality. That a unimodal model achieves accuracy similar to a multimodal one on a VL task indicates that so-called unimodal collapse...

Demystifying CLIP Data (arxiv.org) en

Contrastive Language-Image Pre-training (CLIP) is an approach that has advanced research and applications in computer vision, fueling modern recognition systems and generative models. We believe that the main ingredient to the success of CLIP is its data and not the model architecture or pre-training objective. However, CLIP...

PaLI-3 Vision Language Models: Smaller, Faster, Stronger (arxiv.org) en

This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger. As part of arriving at this strong performance, we compare Vision Transformer (ViT) models pretrained using classification objectives to contrastively (SigLIP) pretrained ones. We...

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning (arxiv.org) en

Large language models have shown their remarkable capabilities as a general interface for various language-related applications. Motivated by this, we target to build a unified interface for completing many vision-language tasks including image description, visual question answering, and visual grounding, among others. The...

Finetune Like You Pretrain: Improved Finetuning of Zero-Shot Vision Models (openaccess.thecvf.com) en

Finetuning image-text models such as CLIP achieves state-of-the-art accuracies on a variety of benchmarks. However, recent works (Kumar et al., 2022; Wortsman et al., 2021) have shown that even subtle differences in the finetuning process can lead to surprisingly large differences in the final performance, both for...

CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No (arxiv.org) en

Out-of-distribution (OOD) detection refers to training the model on an in-distribution (ID) dataset to classify whether the input images come from unknown classes. Considerable effort has been invested in designing various OOD detection methods based on either convolutional neural networks or transformers. However, zero-shot OOD...

Scaling Vision-Language Models with Sparse Mixture of Experts (arxiv.org) en

The field of natural language processing (NLP) has made significant strides in recent years, particularly in the development of large-scale vision-language models (VLMs). These models aim to bridge the gap between text and visual information, enabling a more comprehensive understanding of multimedia data. However, as these...

Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks (arxiv.org) en

In recent times there has been a surge of multi-modal architectures based on Large Language Models, which leverage the zero shot generation capabilities of LLMs and project image embeddings into the text space and then use the auto-regressive capacity to solve tasks such as VQA, captioning, and image retrieval. We name these...
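The "project image embeddings into the text space" step is easy to picture in code. A minimal sketch of that bridge pattern, assuming a frozen vision encoder and hypothetical dimensions (1024 for the encoder, 4096 for the LLM), not any particular paper's architecture:

```python
import torch
import torch.nn as nn

class BridgeProjector(nn.Module):
    """Toy sketch of a bridge architecture: a learned projection that maps frozen
    image-encoder features into the LLM's token-embedding space, so the projected
    patches can be prepended to the text prompt and decoded auto-regressively."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from a frozen vision encoder
        return self.proj(image_features)  # (batch, num_patches, llm_dim)

# Hypothetical usage: prepend projected image tokens to the text embeddings
# prompt_embeds = torch.cat([bridge(img_feats), text_embeds], dim=1)  # feed to the LLM decoder
```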

Foundational Models Defining a New Era in Vision: A Survey and Outlook (arxiv.org) en

Vision systems to see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their locations, ambiguities, and variations in the real-world environment can be better described in human language, naturally governed by grammatical rules and...

itnewsbot, a random en
@itnewsbot@schleuss.online

OpenAI admits that AI writing detectors don’t work

Last week, OpenAI published tip... - https://arstechnica.com/?p=1966483

KingsmanVince,

@itnewsbot Good, now all teachers and professors (who rely solely on these AI detector tools) should be informed.

Unifying Cross-Lingual and Cross-Modal Modeling Towards Weakly Supervised Multilingual Vision-Language Pre-training (aclanthology.org) en

Multilingual Vision-Language Pre-training (VLP) is a promising but challenging topic due to the lack of large-scale multilingual image-text pairs. Existing works address the problem by translating English data into other languages, which is intuitive and the generated data is usually limited in form and scale. In this paper, we...

MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks (arxiv.org) en

The development of language models has moved from encoder-decoder to decoder-only designs. In addition, we observe that the two most popular multimodal tasks, the generative and contrastive tasks, are nontrivial to accommodate in one architecture and further need adaptations for downstream tasks. We propose a novel paradigm of...

Vision Language Transformers: A Survey (arxiv.org) en

Vision language tasks, such as answering questions about or generating captions that describe an image, are difficult tasks for computers to perform. A relatively recent body of research has adapted the pretrained transformer architecture introduced in Vaswani et al. (2017) to vision language modeling. Transformer models...

VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use (arxiv.org) en

We introduce VisIT-Bench (Visual InsTruction Benchmark), a benchmark for evaluation of instruction-following vision-language models for real-world use. Our starting point is curating 70 'instruction families' that we envision instruction tuned vision-language models should be able to address. Extending beyond evaluations like...

Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback (arxiv.org) en

A key technology for the development of large language models (LLMs) involves instruction tuning that helps align the models' responses with human expectations to realize impressive learning abilities. Two major approaches for instruction tuning characterize supervised fine-tuning (SFT) and reinforcement learning from human...

RecycleGPT: An Autoregressive Language Model with Recyclable Module (arxiv.org) en

Existing large language models have to run K times to generate a sequence of K tokens. In this paper, we present RecycleGPT, a generative language model with fast decoding speed by recycling pre-generated model states without running the whole model in multiple steps. Our approach relies on the observation that adjacent tokens...

Multitask Pretraining with Structured Knowledge for Text-to-SQL Generation (aclanthology.org) en

Many machine learning-based low-code or no-code applications involve generating code that interacts with structured knowledge. For example, one of the most studied tasks in this area is generating SQL code from a natural language statement. Prior work shows that incorporating context information from the database schema, such as...

Retentive Network: A Successor to Transformer for Large Language Models (arxiv.org) en

This is an exciting new paper that replaces attention in the Transformer architecture with a set of decomposable matrix operations that retain the modeling capacity of Transformer models, while allowing parallel training and efficient RNN-like inference without the use of attention (it doesn't use a softmax)....
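For intuition, here is a toy sketch of the recurrent form of retention as I read it from the paper: the state accumulates exponentially decayed key-value outer products and there is no softmax. This is a single-head simplification (multi-scale decays, xPos rotation, group norm and the output gate are omitted), not the reference implementation:

```python
import torch

def recurrent_retention(q, k, v, gamma: float = 0.96875):
    """Toy single-head recurrent retention: O(1) state per step, no attention softmax.
    q, k, v: (seq_len, d) tensors; S accumulates decayed key-value outer products."""
    seq_len, d = q.shape
    S = torch.zeros(d, d)
    outputs = []
    for n in range(seq_len):
        S = gamma * S + torch.outer(k[n], v[n])  # decay the old state, add k_n^T v_n
        outputs.append(q[n] @ S)                 # o_n = q_n S_n
    return torch.stack(outputs)
```

The same computation can be written as a parallel matrix form for training, which is what makes it a drop-in alternative to attention.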

mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs (arxiv.org) en

Modular vision-language models (Vision-LLMs) align pretrained image encoders with (pretrained) large language models (LLMs), representing a computationally much more efficient alternative to end-to-end training of large vision-language models from scratch, which is prohibitively expensive for most. Vision-LLMs instead post-hoc...

“Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors (aclanthology.org) en

Deep neural networks (DNNs) are often used for text classification due to their high accuracy. However, DNNs can be computationally intensive, requiring millions of parameters and large amounts of labeled data, which can make them expensive to use, to optimize, and to transfer to out-of-distribution (OOD) cases in practice. In...
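As far as I can tell, the method behind the title boils down to gzip plus a nearest-neighbour vote over normalized compression distance. A minimal sketch (the exact distance and the k=3 vote are my reading of the paper, not its code):

```python
import gzip

def ncd(x: str, y: str) -> float:
    """Normalized compression distance, with gzip as the parameter-free compressor."""
    cx = len(gzip.compress(x.encode()))
    cy = len(gzip.compress(y.encode()))
    cxy = len(gzip.compress((x + " " + y).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)

def classify(query: str, train_texts, train_labels, k: int = 3) -> str:
    """k-nearest-neighbour vote over compression distances to the training texts."""
    order = sorted(range(len(train_texts)), key=lambda i: ncd(query, train_texts[i]))
    votes = [train_labels[i] for i in order[:k]]
    return max(set(votes), key=votes.count)
```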

Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time (proceedings.mlr.press) en

The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large...
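The core recipe is literally weight averaging. A minimal sketch of the uniform-soup variant, assuming all checkpoints come from fine-tuning the same architecture (the paper's greedy soup additionally keeps only checkpoints that improve held-out accuracy):

```python
import torch

def uniform_soup(state_dicts):
    """Average the weights of several fine-tuned checkpoints of one architecture."""
    soup = {}
    for key in state_dicts[0]:
        # float() only for the average; a real implementation would handle integer buffers separately
        soup[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return soup

# Hypothetical usage:
# model.load_state_dict(uniform_soup([torch.load(p, map_location="cpu") for p in paths]))
```

Because only the weights change, inference cost stays that of a single model.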

NeurIPS 2023 Machine Unlearning Challenge (unlearning-challenge.github.io) en

Deep neural networks are at the center of rapid progress in AI, with applications to computer vision, natural language processing, speech recognition and others. While this progress offers many exciting opportunities, it also introduces new challenges, as we researchers bear the responsibility to understand and mitigate the...

Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages (arxiv.org) en

Vision-Language Pre-training (VLP) has advanced the performance of many vision-language tasks, such as image-text retrieval, visual entailment, and visual reasoning. The pre-training mostly utilizes lexical databases and image queries in English. Previous work has demonstrated that the pre-training in English does not transfer...

KingsmanVince, a random en
 _._     _,-'""`-._
(,-.`._,'(       |`-/|
    `-.-'  )-`( , o o)
          `-    `_`"'-

GitHub - PiotrNawrot/nanoT5: Fast & Simple repository for pre-training and fine-tuning T5-style models (github.com) en

This repository comprises the code to reproduce the pre-training of a "Large Language Model" (T5) under a limited budget (1xA100 GPU, < 24 hours) in PyTorch. We start from the randomly initialised T5-base-v1.1 (248M parameters) model, and we pre-train it on the English subset of the C4 dataset and then fine-tune it on...
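A minimal sketch of that starting point with Hugging Face transformers/datasets; the repo's own optimizer, schedule and span-corruption objective are not reproduced here, and the model/dataset identifiers are my assumptions:

```python
from datasets import load_dataset
from transformers import AutoTokenizer, T5Config, T5ForConditionalGeneration

# Architecture of t5-v1_1-base, but with randomly initialised weights (~248M parameters)
config = T5Config.from_pretrained("google/t5-v1_1-base")
model = T5ForConditionalGeneration(config)
tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-base")

# Stream the English subset of C4 so the corpus never has to fit on disk
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
```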

Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models (arxiv.org) en

Recently, growing interest has been aroused in extending the multimodal capability of large language models (LLMs), e.g., vision-language (VL) learning, which is regarded as the next milestone of artificial general intelligence. However, existing solutions are prohibitively expensive, which not only need to optimize excessive...

Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing (arxiv.org) en

Vision transformers (ViTs) have significantly changed the computer vision landscape and have periodically exhibited superior performance in vision tasks compared to convolutional neural networks (CNNs). Although the jury is still out on which model type is superior, each has unique inductive biases that shape their learning and...

KingsmanVince,

I know we are moving away from Reddit. However, if I don't link, I feel like we may miss out on good threads from r/machinelearning. Moreover, the authors don't only post arxiv links; they also post other stuff such as summaries, key points, ... (e.g. this).

So can I at least put them in the posts instead of posting in a comment?

KingsmanVince,

I will follow then.

A Survey of Vision-Language Pre-training from the Lens of Multimodal Machine Translation (arxiv.org) en

Large language models such as BERT and the GPT series started a paradigm shift that calls for building general-purpose models via pre-training on large datasets, followed by fine-tuning on task-specific datasets. There is now a plethora of large pre-trained models for Natural Language Processing and Computer Vision. Recently, we...

Vision-Language Models for Vision Tasks: A Survey (arxiv.org) en

Most visual recognition studies rely heavily on crowd-labelled data in deep neural networks (DNNs) training, and they usually train a DNN for each single visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address the two challenges, Vision-Language Models (VLMs) have been...

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks (arxiv.org) en

Large language models (LLMs) have notably accelerated progress towards artificial general intelligence (AGI), with their impressive zero-shot capacity for user-tailored tasks, endowing them with immense potential across a range of applications. However, in the field of computer vision, despite the availability of numerous...

KingsmanVince,

I also want to share some resources.
For PyTorch,

For TPU,
