New computer tool to investigate the complexity of the genome

Researchers develop a new computer tool to investigate the complexity of the genome

The computer program developed by I2SysBio makes it possible to discover new transcripts that were not in the genome databases. Credits: Pixabay

A team from the Institute of Integrative Systems Biology (UV-CSIC) has published in Nature Methods its own software for analysing data obtained by long-read sequencing. The software makes it possible to discover new RNA molecules and assign them a function in the formation of tissues, deepening knowledge of how the organism develops and of its diseases.

The complexity of an organism emerges from its genome, the book that contains the instructions for life written in its DNA. The method for reading this book, sequencing, has evolved towards reading increasingly longer fragments of the genome. In this field, a research group led by the Institute of Integrative Systems Biology (I2SysBio), a joint centre of the University of Valencia (UV) and the Spanish National Research Council (CSIC), has improved its own computer program, which can discover new transcripts (the RNA molecules used to synthesise proteins and build tissues) from long-read sequencing data and assign them a function in the formation of the organism. The work has been published in Nature Methods.

Long-read sequencing is the third generation of genome sequencing methods. Compared with short-read sequencing, which reads fragments of about 200 nucleotides, long-read methods can obtain reads 100 times longer, leaving fewer gaps in the genome information to be filled in with bioinformatics tools. This was one of the reasons why Nature Methods itself named it its 2022 'Method of the Year'.

A few years earlier, in 2018, researcher Ana Conesa, then at the University of Florida, developed a computer program called SQANTI to analyse the information extracted with these long-read methods. Now her research team at I2SysBio has published a substantial improvement to this software, which can be used freely with the two major commercial long-read sequencing systems: Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT).

"Long-read techniques better analyse the complexity of human transcripts and the transcriptome", says Conesa. The transcriptome is the portion of the genome that is read in each cell to give rise to tissues and organs. A single gene can give rise to a great diversity of transcripts through small changes in the structure of the RNA it encodes, and with them to proteins with different cellular functions. "Short-read sequencing cannot solve this puzzle. Long-read sequencing better reconstructs the functional complexity of the human transcriptome, and this is key to studying certain diseases, especially neurological diseases and cancer", says the CSIC researcher.

Better understanding the complexity of the body and diseases

The version published now, SQANTI3, solves some earlier problems caused by RNA degradation and introduces notable improvements. The program can discover new transcripts that were not in the genome databases used by these computer programs. Furthermore, using artificial intelligence techniques, the software can assign functional information to each new transcript, "something essential to understand the functional complexity of the organism and the diseases", highlights Conesa.

To develop this computer program, the team used I2SysBio's Garnatxa computing cluster, which has 15 computing nodes offering 950 parallel computing threads. In addition, the Gene Expression Genomics group led by Ana Conesa at I2SysBio participates in ELIXIR, one of the strategic infrastructures of the European Strategy Forum on Research Infrastructures (ESFRI), which allows life sciences laboratories across Europe to share and store their data.

The University of Florida and Pacific Biosciences have collaborated in the development of SQANTI3.


SQANTI3: curation of long-read transcriptomes for accurate identification of known and novel isoforms. Nature Methods (2024). Pardo-Palacios, F. J., Arzalluz-Luque, A., Kondratova, L. et al.­’024 -02229-2