The new frontline
13 Feb 2020
Using big data analysis and knowledge networks is the new frontier in combating infectious disease in crops, says Dr Kim Hammond-Kosack
The FAO, the UN’s Food and Agriculture Organization, estimates that agricultural production must rise by about 60% by 2050 to feed a larger and generally richer population, but around the planet food crops are increasingly under severe threat. Global warming, increased travel and rapid transportation of fresh produce are allowing many kinds of disease-causing organisms to move beyond their typical ranges, introducing diseases into new (and potentially unprepared) areas.
Collectively, disease-causing microbes and pests can reduce potential crop yields by up to 25%. Problems of particular concern incited by either fungi or bacteria include rice blast, late blight of potato, Fusarium head blight, the cereal rusts, soybean rust, black sigatoka and Panama diseases of banana, Xylella infection of olive trees and various bacterial blight diseases. However, with 2020 being the International Year of Plant Health, we have a once-in-a-lifetime opportunity to raise global awareness of how protecting plant health can help end hunger, reduce poverty, protect the environment, and boost economic development.
Whilst various control measures on the ground are fighting valiantly to stop or slow the spread of these diseases, the war will only be won in the lab. It is here that data on the genetics behind infection, immunity, and resistance to pesticides provide us with our only real chance of long-term success. A key challenge is the sheer volume of genomic data out there, with relevant studies spread widely across taxa and scientific disciplines – making it less likely that individual research groups will come across the key findings that would accelerate their own breakthroughs.
Data deluge
Since the completion of the first public-domain whole-genome sequencing projects for microbes and plants in the late 1980s and 1990s, the amount of data created from genome sequencing has grown so quickly that it now doubles every seven months. In 2020, sequencing data will be generated a million times faster than in 2010. Most bioscience disciplines have now transitioned to the ‘post-genomic era’, where analysing data at a huge scale is the new norm.
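To give a feel for what a seven-month doubling time means, the growth can be expressed as a factor of 2 raised to the power of (elapsed time / doubling time). The short Python sketch below is purely illustrative: the function name and the assumption of smooth, constant exponential growth are made for this example and are not taken from any sequencing dataset.

```python
def growth_factor(elapsed_months: float, doubling_months: float = 7.0) -> float:
    """Return how many times larger the data volume is after `elapsed_months`,
    assuming smooth exponential growth with a fixed doubling time."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return 2.0 ** (elapsed_months / doubling_months)


if __name__ == "__main__":
    # Print the growth after one doubling period, one year and two years.
    for months in (7, 12, 24):
        print(f"after {months} months: x{growth_factor(months):.1f}")
```

Under this simplifying assumption, the volume of data doubles after seven months and grows a little more than threefold in a single year.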
To benefit many of these bioscience disciplines, including plant, crop, soil and human health, it is imperative to ensure that the information being produced is Findable, Accessible, Interoperable, and Reusable (following the FAIR principles). Of equal importance is the development of novel ways to analyse data, such as using interactive networks and artificial intelligence to go from genome to genotype to phenotype.
All scientific communities are struggling with the genomic and post-genomic data deluge, as well as the increasing numbers of peer-reviewed articles reporting on the function of individual genes and proteins, individual members of specific gene families and even all genes with the same predicted ontologies. With a focus on infectious microbes and their disease-afflicted hosts, the globally unique multispecies data resource PHI-base – publicly available since 2005 at www.PHI-base.org – is successfully addressing both these major issues.
The PHI-base team manually curates the peer-reviewed literature on numerous pathogen-host interactions, and stores gold-standard information on the phenotypes of candidate and confirmed virulence genes implicated in the disease-causing ability of a pathogen. PHI-base also curates host gene products that are first targeted by pathogen-derived mobile molecules, called ‘effectors’, that enter the host and directly affect the interaction outcome. The PHI-base team is preparing to provide information on anti-infective chemistries, their target sites in each pathogenic organism, as well as the mutation types which confer increased resistance or sensitivity to one or more chemistries.
The current version of PHI-base contains 3,454 manually curated references and provides information on 6,780 genes from 268 pathogenic species, tested on 210 hosts in 13,801 interactions. Prokaryotic and eukaryotic pathogens are represented in almost equal numbers. Pathogen species consist of approximately 60% plant-infecting (split 50:50 between cereal and non-cereal plants), 35% of medical importance and 5% causing diseases on other host types. Currently, 37% of PHI-base interaction entries are for the most important food and feed crops worldwide: namely wheat, rice, maize, barley, tomato, potato and brassicas.
Phenotype and genotype
PHI-base is a primary information source for the global research community studying plant-pathogen and human-pathogen interactions. Researchers from a wide range of bioscience disciplines can easily familiarise themselves with relevant molecular and biological facts on pathogenicity, virulence, and effector genes, and their first host targets. To assist with the comparative analysis of different pathosystems across the entire tree of life, nine species-neutral, high-level phenotypes are used to describe the overall experimental interaction outcomes.
The phenotypic data can be directly traced to individual gene entries within the genomes of over 300 plant-pathogenic species available in specialist genome databases, including ENSEMBL Invertebrates and FungiDB. This allows researchers to change the scale and focus of their analysis within a few minutes: from single gene sequences, to gene function data sets and the experimental evidence coming from curated peer-reviewed literature in PHI-base, and then back to the pan-genome for the gene of interest, as well as the surrounding genes and other genetic elements within the local genomic landscape.
PHI-base joined the UK node of ELIXIR’s ‘Data for Life’ project as an agricultural–genomics data provider in 2017. ELIXIR, an intergovernmental European organisation, coordinates life science data resources in its member states with a mission to safeguard and integrate the increasing volume of research data being generated in publicly funded research. PHI-base adheres to the 15 FAIR data principles adopted by ELIXIR, ensuring that pathogen-host interaction data can be used long-term by humans and computers alike. The FAIR data principles require, for example, stable data record identifiers, rich metadata, use of standard computer communication protocols, links to other data resources and a clear data licensing policy.
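As a rough illustration of what those example requirements mean in practice, the Python sketch below checks a data record for each of the five features just listed. Everything in it is hypothetical: the `DataRecord` class, its field names, the list of protocols and the sample values are invented for this example and do not describe how PHI-base or ELIXIR actually store or validate their data.

```python
from dataclasses import dataclass, field


@dataclass
class DataRecord:
    """Hypothetical data record; all field names are illustrative only."""
    identifier: str = ""                                    # stable record identifier
    metadata: dict[str, str] = field(default_factory=dict)  # descriptive metadata
    protocol: str = ""                                      # how the record is served
    links: list[str] = field(default_factory=list)          # cross-references elsewhere
    licence: str = ""                                       # data licensing statement


# Assumed set of "standard" protocols, chosen only for this sketch.
STANDARD_PROTOCOLS = {"http", "https", "ftp"}


def missing_fair_features(record: DataRecord) -> list[str]:
    """Return the example FAIR features that the record lacks."""
    problems = []
    if not record.identifier:
        problems.append("stable identifier")
    if not record.metadata:
        problems.append("rich metadata")
    if record.protocol.lower() not in STANDARD_PROTOCOLS:
        problems.append("standard communication protocol")
    if not record.links:
        problems.append("links to other data resources")
    if not record.licence:
        problems.append("clear data licence")
    return problems


if __name__ == "__main__":
    record = DataRecord(
        identifier="EXAMPLE:0001",            # placeholder, not a real accession
        metadata={"pathogen": "example species"},
        protocol="https",
        links=["EXAMPLE-DB:42"],              # placeholder cross-reference
        licence="example-licence",
    )
    print(missing_fair_features(record))      # an empty list means nothing is missing
```

A real FAIR assessment covers all 15 principles and is considerably more nuanced; the point here is only that each requirement can be expressed as something a computer can check.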
Multiple uses
The curated phenotypes and genotypes in PHI-base can be used together with the sequenced genome information to increase the range of approaches used to control diseases in agriculture, horticulture, forestry, and human healthcare.
Generic new approaches include: tracking the genes, gene clusters, and horizontal gene transfer events that allow pathogens to cause disease in different geographical regions, environmental conditions, and new host species; identifying new pathogen genes and proteins that biochemical and biotechnology industries can target to suppress or kill pathogens; and undertaking pan-genome analyses between current and historic strains of each pathogen, to explore how these strains have evolved to overcome existing control strategies. To improve plant and crop health, commercial plant breeders can target the removal of host susceptibility genes, or use functionally characterised pathogen effectors to directly screen collections of plant germplasm to maintain the widest possible repertoire of plant disease resistance genes.
To try and make sense of the huge amount of biological data that is being produced, researchers are exploring new ways to visualise and combine data, with the ultimate aim of finding (and exploiting) new weaknesses in plant pathogens. One of the most powerful tools to be developed for plants and their associated pathogens is Knetminer.
Knetminer connects information from databases and the literature in an intelligent data model – known as a knowledge graph – that understands biological entities and their relationships to one another as concepts, rather than raw text. Knetminer enables scientists to search the knowledge graph for genes, phenotypes, stresses, molecules and other information – and instantly tells the stories of complex traits and diseases for both pathogens and their hosts using explainable artificial intelligence.
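The idea of a knowledge graph can be sketched in a few lines of Python. The example below is a toy model only, not Knetminer's own data model or interface: every gene, protein, phenotype and publication name is a made-up placeholder, the relation labels are invented, and edges are treated as one-directional for simplicity. It stores facts as (source, relation, target) triples and uses a breadth-first search to find the shortest chain of relationships linking a gene to a phenotype.

```python
from collections import deque

# Each edge is (source, relation, target). All entities are placeholders.
EDGES = [
    ("gene:G1", "encodes", "protein:P1"),
    ("protein:P1", "interacts_with", "protein:H1"),
    ("protein:H1", "associated_with", "phenotype:reduced_virulence"),
    ("gene:G1", "mentioned_in", "publication:PUB1"),
]


def build_graph(edges):
    """Index edges by source node so neighbours can be looked up quickly."""
    graph = {}
    for source, relation, target in edges:
        graph.setdefault(source, []).append((relation, target))
    return graph


def find_path(graph, start, goal):
    """Breadth-first search for the shortest chain of (source, relation, target)
    steps leading from `start` to `goal`; returns None if no chain exists."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, path + [(node, relation, target)]))
    return None


if __name__ == "__main__":
    graph = build_graph(EDGES)
    path = find_path(graph, "gene:G1", "phenotype:reduced_virulence")
    for source, relation, target in path or []:
        print(f"{source} --{relation}--> {target}")
```

Because each step in the returned chain names the entities and the relationship between them, the result can be read as a short evidence trail rather than a bare list of search hits.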
Knetminer allows researchers without specialist bioinformatics skills to explore a wealth of existing genomic information from multiple species and compare this with their own experimental data to permit rapid progress, new insights and new discoveries. The combination of knowledge graphs, artificial intelligence and digital technologies will transform the way researchers work and help to protect crops and humans from microbial pathogens more efficiently and sustainably.
Author:
Dr Kim Hammond-Kosack is a molecular plant pathologist at Rothamsted Research, the world’s oldest agricultural research institute, where she oversees the PHI-base database.