CarcinoPred-EL

Prediction of chemical carcinogenicity using ensemble learning methods

About CarcinoPred-EL Server

What is the CarcinoPred-EL server?

CarcinoPred-EL (carcinogenicity prediction using ensemble learning methods) is a carcinogenicity prediction web server that classifies compounds as Carcinogens or Non-Carcinogens using only their two-dimensional structures. The server integrates three novel ensemble learning models, namely Ensemble XGBoost, Ensemble SVM and Ensemble RF, to predict the carcinogenicity of chemicals.

How were the ensemble models integrated in the CarcinoPred-EL server formed?

These ensemble models were developed through the following process. First, seven types of molecular fingerprints (CDK, CDKExt, CDKGraph, MACCS, Pubchem, KR and KRC) were generated with the PaDEL-Descriptor software for a data set of 1003 diverse compounds with rat carcinogenicity data collected from the Carcinogenic Potency Database (CPDB). For each fingerprint type, three basic models were then trained, one with each of the machine learning algorithms XGBoost (eXtreme Gradient Boosting), SVM (Support Vector Machine) and RF (Random Forest). Finally, the seven basic models generated by each algorithm were fused into an ensemble model by averaging the probabilities predicted by the basic models.
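The fusion step described above can be sketched as a plain average over the seven per-fingerprint probabilities. This is only an illustration: the function name `fuse_probabilities` and the example probability values are assumptions, not part of the server's code or output.

```python
from statistics import mean

# Fingerprint names come from the text above; everything else is hypothetical.
FINGERPRINTS = ["CDK", "CDKExt", "CDKGraph", "MACCS", "Pubchem", "KR", "KRC"]


def fuse_probabilities(base_probs):
    """Average the carcinogen probabilities of the seven basic models.

    base_probs maps each fingerprint name to the probability predicted by
    the basic model trained on that fingerprint (same algorithm for all).
    """
    missing = [fp for fp in FINGERPRINTS if fp not in base_probs]
    if missing:
        raise ValueError(f"missing basic-model outputs: {missing}")
    return mean(base_probs[fp] for fp in FINGERPRINTS)


# Made-up probabilities for a single compound, for illustration only.
example = {
    "CDK": 0.62, "CDKExt": 0.58, "CDKGraph": 0.41, "MACCS": 0.70,
    "Pubchem": 0.55, "KR": 0.49, "KRC": 0.64,
}
ensemble_probability = fuse_probabilities(example)
```

Averaging gives every fingerprint the same weight, so a single basic model with an extreme probability cannot decide the ensemble result on its own.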

How well do the models integrated in the CarcinoPred-EL server perform?

The predictive performance of the basic models and the ensemble models was evaluated by 5-fold cross-validation with 100 repeats. The ensemble models outperformed all the basic models. The performance indicators for the three ensemble models are listed in the following table:

Model             Accuracy (%)  Sensitivity (%)  Specificity (%)  AUC (%)
Ensemble SVM      69.4          65.2             73.5             75.6
Ensemble RF       69.2          67.0             71.3             75.7
Ensemble XGBoost  70.1          67.0             73.1             76.5
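For readers unfamiliar with the indicators in the table, the sketch below shows how accuracy, sensitivity and specificity follow from true and predicted labels, assuming carcinogens are encoded as 1 and non-carcinogens as 0. The function name and the toy labels are hypothetical, and AUC is left out because it requires the predicted probabilities rather than the class labels.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, sensitivity and specificity as fractions (0 to 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # carcinogens found
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # carcinogens missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-carcinogens kept
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # fraction of carcinogens detected
        "specificity": tn / (tn + fp),  # fraction of non-carcinogens detected
    }


# Toy labels for illustration only; these are not data from the server.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

Multiplying each value by 100 gives percentages comparable to those in the table.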

How to use the CarcinoPred-EL server?

Users can draw their chemical structures in the Ketcher canvas or enter the SMILES strings of their chemicals in the text box. It is also possible to upload a file containing the compounds to be predicted. The file format can be SMILES (with the extension .smiles or .smi), sdf, mol, or mol2. Up to 1000 molecules can be processed at one time. Users can select one or more models to make predictions in a single run.
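Before uploading, a file can be checked locally against the limits stated above. The following is a minimal sketch of such a pre-check, not the server's own validation, which is not described here; the function name `check_submission` is hypothetical and the molecule count is assumed to be known to the caller.

```python
from pathlib import Path

# Limits taken from the text above.
ALLOWED_EXTENSIONS = {".smiles", ".smi", ".sdf", ".mol", ".mol2"}
MAX_MOLECULES = 1000


def check_submission(filename, n_molecules):
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported file extension: {extension or '(none)'}")
    if n_molecules < 1:
        problems.append("the file contains no molecules")
    elif n_molecules > MAX_MOLECULES:
        problems.append(f"{n_molecules} molecules exceed the limit of {MAX_MOLECULES}")
    return problems


ok = check_submission("library.smi", 250)
too_large = check_submission("library.sdf", 1200)
```

A larger library would need to be split into batches of at most 1000 molecules before submission.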

How to interpret the output from CarcinoPred-EL server?

The CarcinoPred-EL server only accepts compounds that contain more than 3 carbon atoms. No prediction is made for compounds that do not meet this requirement; these compounds are listed in the “Failure predictions” section of the output page. The ensemble models classify compounds as Carcinogens or Non-Carcinogens, and the probability values and classification labels are listed in the “Average” and “Class” columns of the output table. The probability values from each basic model are also provided. Probability values range from 0 to 1. If the probability is greater than 0.5, the compound is considered a carcinogen; otherwise, it is considered a non-carcinogen.
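The decision rule above can be written out as a short sketch. The function names are hypothetical, and the carbon count is assumed to be supplied by the caller (for example, from a cheminformatics toolkit); how the server itself counts carbon atoms is not described here.

```python
def is_accepted(carbon_count):
    """Compounds need more than 3 carbon atoms to receive a prediction."""
    return carbon_count > 3


def classify(average_probability, threshold=0.5):
    """Map the value in the "Average" column to the label in the "Class" column."""
    if not 0.0 <= average_probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    # Strictly greater than the threshold: exactly 0.5 is a Non-Carcinogen.
    return "Carcinogen" if average_probability > threshold else "Non-Carcinogen"


labels = [classify(p) for p in (0.57, 0.50, 0.12)]
```

Note that the comparison is strict, so a compound with an average probability of exactly 0.5 falls in the Non-Carcinogen class.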