I. Introduction
Protein engineering enables the modification and optimization of protein sequences or structures for diverse applications, revolutionizing our ability to manipulate biological systems at the molecular level [1], [2]. Deep learning approaches offer higher efficiency and better performance in protein engineering tasks by processing vast amounts of protein data, capturing complex patterns in sequences, predicting protein properties and structures with increasing accuracy, and facilitating rapid exploration of promising protein designs [3]–[5]. In protein engineering, protein sequences serve as the fundamental data format and are often referred to as the "language of life sciences" because of their role in encoding biological information [6], [7]. The sequential similarity between protein sequences and natural language has led to the parallel development of two families of foundation models: protein language models [8] and large language models (LLMs) [9].

Because LLMs have demonstrated strong capabilities in text understanding [10], recent research has begun exploring their potential for protein understanding through multi-modal large models [11], [12]. Previous attempts integrate protein sequences, or protein structures represented as graphs, with textual content through additional encoders [13], [14], as depicted in Fig. 1(a). However, these approaches fail to fully exploit the intrinsic connections between protein sequences and natural language, resulting in higher model complexity and sub-optimal performance. This limitation motivates a model that can directly understand and process protein sequences without relying on external encoders, potentially improving both efficiency and performance in protein engineering tasks.

To address this gap, we present TourSynbio-7B, the first multi-modal large model designed for protein engineering tasks without external protein encoders, as shown in Fig. 1(b). TourSynbio-7B is built on InternLM2-7B [15] through post-training and instruction fine-tuning on ProteinLM-Dataset [16], demonstrating that LLMs themselves can learn to understand proteins expressed as language.