Adam Fedorowicz1, Lingyi Zheng2, Harshinder Singh1,2, Eugene Demchuk1,3
1 National Institute for Occupational Safety and Health, Morgantown, WV
2 Department of Statistics, West Virginia University, Morgantown, WV
3 School of Pharmacy, West Virginia University, Morgantown, WV
Keywords: ACD, LLNA, QSAR, logistic regression, skin sensitization
Allergic Contact Dermatitis (ACD) is a common work-related skin disease that often develops as a result of repetitive skin exposures to a sensitizing chemical agent. A variety of experimental tests have been suggested to assess the skin sensitization potential. We applied the Quantitative Structure-Activity Relationship (QSAR) method to relate measured and calculated physical-chemical properties of chemical compounds to their sensitization potential. Using statistical methods, each of these properties, called molecular descriptors, was tested for its ability to predict the sensitization potential. A few of the most informative descriptors were subsequently selected to build a model of skin sensitization. In this work the murine Local Lymph Node Assay (LLNA) data were used. In principle, LLNA provides a standardized continuous scale suitable for quantitative assessment of skin sensitization. However, at present many LLNA results are still reported on a dichotomous scale, which is congruous with the scale of the guinea pig tests widely used in past years. Therefore, in this study only a dichotomous version of the LLNA data was used. For the statistical analysis we relied on logistic regression. This approach provides a statistical tool for investigating and predicting skin sensitization that is expressed only in categorical terms of activity and non-activity. Based on the compounds used in this study, our results suggest a QSAR model of ACD that is based on the following descriptors: nDB (number of double bonds), C-003 (number of CHR3 molecular subfragments), GATS6m (autocorrelation coefficient) and HATS6e (GETAWAY descriptor), although the relevance of the identified descriptors to the continuous ACD QSAR has yet to be shown.
The Bureau of Labor Statistics estimates that occupational skin diseases constitute the second largest group of occupational injuries in the U.S. [1]. Among them, Occupational Contact Dermatitis (OCD) is the most common cause of work-related skin illness, comprising up to 95% of registered cases. Allergic Contact Dermatitis (ACD) may lead to severe recurrent forms of OCD because of the long-lasting memory of the immune system. ACD usually develops as a result of repetitive skin exposures to a sensitizing chemical agent, and at least a single excessive exposure is usually essential for the development of the immune response. A variety of experimental tests have been suggested to assess the skin sensitization potential of a chemical [2]. Such information is an important factor in developing recommended skin exposure limits that would protect workers from sensitizing overexposures. Unfortunately, many experimental protocols result in a dichotomous conclusion, which is more appropriate for accept/reject decision-making in the design and manufacturing of new chemicals than for the preventive protection of workers who handle sensitizing chemical agents. The murine Local Lymph Node Assay (LLNA) has the capacity to provide a standardized continuous scale for the quantitative assessment of skin sensitization.
A combination of methods in statistics and computational chemistry, commonly referred to as Quantitative Structure-Activity Relationship (QSAR) modeling, complements the experimental approach. QSAR examines measured and calculated physical-chemical properties, called molecular descriptors, of chemical compounds with known biological activity (in this work, the sensitization potential) and then relates a few of the most informative descriptors to the target bioactivity. The structure-activity relationships constructed this way provide a means of investigating and predicting the sensitization potential of chemicals.
We rely on LLNA data to quantify the skin sensitization potential [3]. At present the LLNA data are (1) outnumbered by the long history of guinea pig assays, and (2) often reported on a dichotomous scale congruous with the guinea pig data. Therefore, the work has been started using a dichotomous version of the LLNA data to identify molecular descriptors that may be effective in the continuous-scale LLNA QSAR. The work began with building a database of chemical names, structures, properties and bioactivities, along with the design of appropriate software. Our immediate goal is to identify a pool of potentially informative molecular descriptors and chemical classes that are most appropriate for QSAR modeling to predict LLNA results. In the present work a QSAR based on a generalized linear model, logistic regression, is proposed. Logistic regression permits construction of standard QSAR equations in which the activity data are represented only as activity (1) or nonactivity (0) values. In order to evaluate molecular properties that can be associated with LLNA data on skin sensitization, 1204 molecular descriptors were calculated and tested for their significance in predicting the skin sensitization potential. Only a limited number of molecular descriptors were found to be statistically associated with skin sensitization.
These results suggest that a validated QSAR model of ACD may be built by using only a few appropriate parameters, although the relevance of the identified descriptors to the continuous-scale ACD QSAR has yet to be shown. Further work will be focused on populating the QSAR database with continuous-scale ACD data and extending the database so that it contains more LLNA-tested compounds.
In our QSAR studies we used a pool of 54 LLNA-tested compounds, of which 25 were active sensitizers and 29 were negative controls [4,5]. The molecular structures of these compounds were first encoded using the SMILES notation and were subsequently transformed into three-dimensional coordinates using Cerius2 from Accelrys, Inc. The Dragon 2.1 software developed by the Milano Chemometrics and QSAR Research Group was used to calculate a total of 1204 molecular descriptors for each of the studied compounds. The statistical analysis was carried out using the SAS 8.2 statistical package.
The linear probability model is inadequate for modeling the probability of a positive LLNA sensitization response, since it is heteroscedastic and can yield predicted probabilities outside the interval from 0 to 1. Depending on the choice of the cumulative distribution function F, the probability of a positive response in the LLNA sensitization test, P{S=1|X1, X2, …, XN} = F(X'b), can be represented by either the probit or the logistic regression model [6]. In the present study we used logistic regression, in which p(X) = P{S=1|X1, X2, …, XN}, which depends on the molecular descriptors X1, X2, …, XN, is modeled in the form

$$p(X) = \frac{\exp(b_0 + b_1 X_1 + \dots + b_N X_N)}{1 + \exp(b_0 + b_1 X_1 + \dots + b_N X_N)} \qquad \text{(EQ 1)}$$

or

$$\ln\frac{p(X)}{1 - p(X)} = b_0 + b_1 X_1 + \dots + b_N X_N$$

where b0, b1, …, bN are regression coefficients.
The logistic regression is a more appropriate statistical tool than linear probability models when the response variable is binary (dichotomous). The properties of the logistic function (EQ 1) ensure that whatever estimate of the response one obtains, it is always a number between 0 and 1 that can be easily translated into binary responses using an appropriate threshold value (usually 0.5). The S-shape of the logistic function is another important feature, which is particularly appealing in epidemiology studies when the single variable X is viewed as representing an index that combines contributions of several risk factors and p(X) represents the risk for a given value of X in single variable logistic regression models.
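The mapping from descriptors to a binary call described above can be illustrated with a minimal Python sketch. The coefficient and descriptor values below are hypothetical and chosen only for illustration; they are not the fitted values of this study.

```python
import math


def logistic_probability(descriptors, coefficients, intercept):
    """Evaluate p(X) = exp(z) / (1 + exp(z)) with z = b0 + b1*X1 + ... + bN*XN."""
    z = intercept + sum(b * x for b, x in zip(coefficients, descriptors))
    # Two algebraically equal branches keep exp() from overflowing.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def classify(probability, threshold=0.5):
    """Translate a probability into a binary response (1 = active, 0 = inactive)."""
    return 1 if probability >= threshold else 0


# Hypothetical two-descriptor model, for illustration only.
p = logistic_probability([2.0, 0.8], coefficients=[0.9, -1.5], intercept=-0.4)
print(round(p, 3), classify(p))
```

Whatever the value of the linear predictor, the returned probability stays between 0 and 1, which is the property that makes the threshold rule well defined.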
The validity of the logistic regression models was checked using leave-one-out cross validation. For each of the n observations in turn, the remaining n-1 observations are treated as the training set, the parameters of the model are reestimated, and the held-out observation is classified using the new parameter estimates. The misclassification rate for each group is the proportion of observations in that group that are misclassified. This method gives an almost unbiased estimate of the error rate, but with a relatively large variance.
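The cross-validation procedure can be sketched as follows. This is a minimal, self-contained Python illustration that fits the logistic model by plain gradient ascent, which is an assumption made to avoid external libraries; the study itself used SAS 8.2, whose fitting algorithm is not reproduced here. The descriptor values and labels are hypothetical.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(rows, labels, steps=2000, rate=0.1):
    """Estimate [b0, b1, ..., bN] by gradient ascent on the log-likelihood."""
    beta = [0.0] * (len(rows[0]) + 1)  # beta[0] is the intercept
    for _ in range(steps):
        gradient = [0.0] * len(beta)
        for x, y in zip(rows, labels):
            z = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
            error = y - sigmoid(z)
            gradient[0] += error
            for k, v in enumerate(x):
                gradient[k + 1] += error * v
        beta = [b + rate * g / len(rows) for b, g in zip(beta, gradient)]
    return beta


def predict(beta, x, threshold=0.5):
    z = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return 1 if sigmoid(z) >= threshold else 0


def leave_one_out(rows, labels):
    """Refit on n-1 observations and classify the held-out one, n times."""
    predictions = []
    for i in range(len(rows)):
        beta = fit_logistic(rows[:i] + rows[i + 1:], labels[:i] + labels[i + 1:])
        predictions.append(predict(beta, rows[i]))
    return predictions


def misclassification_rate(labels, predictions, group):
    """Proportion of observations in `group` (0 or 1) that are misclassified."""
    pairs = [(y, p) for y, p in zip(labels, predictions) if y == group]
    return sum(1 for y, p in pairs if y != p) / len(pairs)


# Hypothetical single-descriptor data, for illustration only.
rows = [[0.1], [0.4], [0.5], [0.9], [1.1], [1.3], [1.6], [2.0]]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
held_out = leave_one_out(rows, labels)
print(misclassification_rate(labels, held_out, 0),
      misclassification_rate(labels, held_out, 1))
```

Because every observation is classified by a model that never saw it, the per-group rates estimate how the model would behave on new compounds rather than on the training set.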
The most predictive molecular descriptors were identified in several stages. First, the statistical quality of a single-descriptor logistic model, measured by its P-value, was assessed for each of the descriptors. Descriptors with a P-value above 0.05 were omitted from further analysis. The remaining potentially predictive descriptors were subsequently used in an exhaustive search through all possible combinations of 1-, 2-, 3- and 4-descriptor models, along with a stepwise regression algorithm, which does not restrict the number of descriptors in the model. QSAR models that identified positive sensitizers with probability above 75% were analyzed in detail. The validity of these results was additionally verified using cross validation.
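The screening and exhaustive-search stages can be sketched in Python as below. The descriptor names and P-values are hypothetical placeholders, and the sketch only enumerates candidate models; fitting and scoring each candidate is left out.

```python
from itertools import combinations
from math import comb


def screen_descriptors(p_values, alpha=0.05):
    """Keep descriptors whose single-descriptor P-value is not above alpha."""
    return sorted(name for name, p in p_values.items() if p <= alpha)


def candidate_models(descriptors, max_size=4):
    """Yield every descriptor subset of size 1 to max_size."""
    for size in range(1, max_size + 1):
        yield from combinations(descriptors, size)


def count_models(n_descriptors, max_size=4):
    """Number of subsets the exhaustive search has to visit."""
    return sum(comb(n_descriptors, size) for size in range(1, max_size + 1))


# Hypothetical P-values, for illustration only.
p_values = {"d1": 0.004, "d2": 0.03, "d3": 0.0005,
            "d4": 0.049, "d5": 0.2, "d6": 0.07}
kept = screen_descriptors(p_values)
models = list(candidate_models(kept))
print(kept, len(models), count_models(len(kept)))
```

The count grows roughly with the fourth power of the number of retained descriptors, which is why the single-descriptor screen comes before the exhaustive search.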
Overall, 420 descriptors (out of 1204) were found to be statistically significant at the P-level of 0.05. Table 1 shows the top part of the list, descriptors with P-values below the 0.01 threshold.
Table 1. Descriptors

| No. | Symbol | Definition | Class of Descriptors | P-Value |
|---|---|---|---|---|
| 1 | GATS6m | Geary autocorrelation – lag 6 / weighted by atomic masses | 2D autocorrelations | 0.0042 |
| 2 | RTe+ | R maximal index / weighted by Sanderson electronegativities | GETAWAY | 0.0049 |
| 3 | RDF040p | Radial Distribution Function – 4.0 / weighted by atomic polarizabilities | RDF | 0.0024 |
| 4 | Rtu+ | R maximal index / unweighted | GETAWAY | 0.0045 |
| 5 | RDF040v | Radial Distribution Function – 4.0 / weighted by atomic van der Waals volumes | RDF | 0.0039 |
| 6 | X1v | Valence connectivity index chi-1 | Topological | 0.0074 |
| 7 | RDF050u | Radial Distribution Function – 5.0 / unweighted | RDF | 0.0095 |
| 8 | RDF050e | Radial Distribution Function – 5.0 / weighted by atomic Sanderson electronegativities | RDF | 0.0061 |
| 9 | RDF075v | Radial Distribution Function – 7.5 / weighted by atomic van der Waals volumes | RDF | 0.0089 |
| 10 | RDF075p | Radial Distribution Function – 7.5 / weighted by atomic polarizabilities | RDF | 0.0082 |
| 11 | X0v | Valence connectivity index chi-0 | Topological | 0.0085 |
| 12 | X3v | Valence connectivity index chi-3 | Topological | 0.0061 |
| 13 | RDF065p | Radial Distribution Function – 6.5 / weighted by atomic polarizabilities | RDF | 0.0072 |
| 14 | RDF065u | Radial Distribution Function – 6.5 / unweighted | RDF | 0.0092 |
| 15 | S2K | 2-path Kier alpha-modified shape index | Topological | 0.0070 |
| 16 | nDB | Number of double bonds | Constitutional | 0.0029 |
| 17 | C-003 | CHR3 | Atom-centered fragments | 0.0005 |
| 18 | E2m | 2nd component accessibility directional WHIM index / weighted by atomic masses | WHIM | 0.0078 |
| 19 | TI2 | Second Mohar index TI2 | Topological | 0.0040 |
| 20 | Htp | H total index / weighted by atomic polarizabilities | GETAWAY | 0.0082 |
| 21 | BEHp2 | Highest eigenvalue n. 2 of Burden matrix / weighted by atomic polarizabilities | BCUT | 0.0051 |
| 22 | BEHe2 | Highest eigenvalue n. 2 of Burden matrix / weighted by Sanderson electronegativities | BCUT | 0.0097 |
Most of the descriptors with P-value below 0.01 can be partitioned into four broad classes:
Radial Distribution Function descriptors [7], which are based on the distance distribution in the geometrical representation of the molecule.
Topological descriptors, which are based on molecular graphs as a source of different probability distributions to which information theory definitions are applied [8].
GETAWAY (GEometry, Topology and Atom-Weights AssemblY) descriptors, a recently proposed group based on a leverage matrix similar to that defined in statistics and usually used for regression diagnostics. These molecular descriptors try to match the three-dimensional molecular geometry provided by the molecular influence matrix and atom relatedness by molecular topology with chemical information by using different atomic weighting schemes [9, 10].
BCUT descriptors, a class of molecular descriptors defined as eigenvalues of a modified connectivity matrix, which is also called the Burden matrix B [8].
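For orientation, the two three-dimensional classes above rest on the following standard definitions, given here as commonly stated in [7] and [9]. The symbols are assumptions of this sketch: A is the number of atoms, r_ij the interatomic distance, w_i the atomic weighting property, f a scaling factor, β a smoothing parameter, and M the A×3 matrix of centered Cartesian coordinates. The exact parameter settings used by the Dragon software are not stated in this work.

```latex
% Radial distribution function code evaluated at radius R
g(R) = f \sum_{i=1}^{A-1} \sum_{j=i+1}^{A} w_i \, w_j \, \exp\!\left[-\beta \,(R - r_{ij})^2\right]

% Molecular influence (leverage) matrix underlying the GETAWAY descriptors
\mathbf{H} = \mathbf{M} \left(\mathbf{M}^{\mathsf{T}} \mathbf{M}\right)^{-1} \mathbf{M}^{\mathsf{T}}
```

A name such as RDF040p then reads as g(R) evaluated at R = 4.0 with atomic polarizabilities as the weights w_i.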
The selection of these classes of molecular descriptors seems to have a natural association with immunological activity measured by Local Lymph Node Assay, where the three dimensional structure recognition of a given antigen is responsible for the immunological response. However, the sophisticated representation of these descriptor classes impedes a simple interpretation of the mechanism of immunological response. Thus we can only rely on these QSAR models as an instrument of predicting the immunological activity.
Several of the tested QSAR models gave interesting results. The best classification results were achieved with 3- or 4-descriptor models, although we also identified several above-average models that include only 2 or even 1 descriptor. The best model identified so far consists of 4 descriptors:
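$$\ln\frac{p}{1 - p} = b_0 + b_1\,\mathrm{nDB} + b_2\,\mathrm{GATS6m} + b_3\,\mathrm{HATS6e} + b_4\,\mathrm{C\text{-}003} \qquad \text{(EQ 2)}$$

where b0, …, b4 are the fitted regression coefficients in the sense of EQ 1, and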
nDB is the number of double bonds. It can be related to the hydrophobicity and reactivity.
GATS6m is the mass-weighted Geary graph spatial autocorrelation coefficient of the sixth lag. The Geary coefficient is a distance-type function varying from zero to infinity. Strong autocorrelation produces low values of this index; moreover, positive autocorrelation translates into values between 0 and 1 whereas negative autocorrelation produces values larger than 1.
HATS6e is the GETAWAY descriptor weighted by the atomic Sanderson electronegativities.
C-003 is the atom-centered fragments descriptor, indicating the presence of the CHR3 molecular subfragment.
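The behavior of the Geary coefficient described above can be checked with a short Python sketch. It implements the classical Geary form on a hydrogen-depleted molecular graph, with the topological distance between atoms as the lag; treating the graph as hydrogen-depleted and returning None when no atom pair exists at the requested lag are assumptions of this sketch, and Dragon's implementation may differ in such details.

```python
from collections import deque


def topological_distances(adjacency):
    """All-pairs shortest path lengths (in bonds) by breadth-first search."""
    n = len(adjacency)
    table = []
    for source in range(n):
        dist = [None] * n
        dist[source] = 0
        queue = deque([source])
        while queue:
            atom = queue.popleft()
            for neighbor in adjacency[atom]:
                if dist[neighbor] is None:
                    dist[neighbor] = dist[atom] + 1
                    queue.append(neighbor)
        table.append(dist)
    return table


def geary_autocorrelation(adjacency, weights, lag):
    """Classical Geary coefficient for atom pairs at topological distance `lag`."""
    n = len(weights)
    mean = sum(weights) / n
    variance = sum((w - mean) ** 2 for w in weights) / (n - 1)
    distances = topological_distances(adjacency)
    pair_count = 0
    squared_differences = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if distances[i][j] == lag:
                pair_count += 1
                squared_differences += (weights[i] - weights[j]) ** 2
    if pair_count == 0 or variance == 0:
        return None  # undefined: no pairs at this lag or constant weights
    return (squared_differences / (2 * pair_count)) / variance


# Toy three-atom chain 0-1-2 with hypothetical weights 1, 2, 3.
chain = [[1], [0, 2], [1]]
print(geary_autocorrelation(chain, [1.0, 2.0, 3.0], lag=1))  # 0.5
print(geary_autocorrelation(chain, [1.0, 2.0, 3.0], lag=2))  # 2.0
```

In the toy chain, neighboring atoms carry similar weights, so the lag-1 value falls below 1, while the two end atoms differ most, so the lag-2 value rises above 1.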
The proposed QSAR model correctly predicts 83% of the responses on the training set of compounds, and in cross validation it correctly identifies 79% of the responses. The results of the proposed QSAR model are summarized in Table 2.
Table 2. Model Summary.

| | Percentage of correctly predicted responses | Percentage of correctly identified active compounds | Percentage of correctly identified inactive compounds |
|---|---|---|---|
| Model | 83% | 72% | 93% |
| Cross validation | 79% | 68% | 90% |
Table 3 presents the list of compounds tested in this study, together with their Local Lymph Node Assay activity data and the activity estimated by the proposed QSAR model.
Table 3. LLNA-tested compounds.

| No. | Compound | CAS | LLNA | Predicted skin sensitization |
|---|---|---|---|---|
| 1 | chlorobenzene | 108-90-7 | 0 | 0 |
| 2 | geraniol | 106-24-1 | 0 | 1 |
| 3 | phenol | 108-95-2 | 0 | 0 |
| 4 | 2-chloroethanol | 107-07-3 | 0 | 0 |
| 5 | benzaldehyde | 100-52-7 | 0 | 1 |
| 6 | 1-bromobutane | 109-65-9 | 0 | 0 |
| 7 | 1-butanol | 71-36-3 | 0 | 0 |
| 8 | 2,4-dichloronitrobenzene | 611-06-3 | 0 | 0 |
| 9 | isopropanol | 67-63-0 | 0 | 0 |
| 10 | glycerol | 56-81-5 | 0 | 0 |
| 11 | hexane | 110-54-3 | 0 | 0 |
| 12 | streptozotocin | 18883-66-4 | 0 | 0 |
| 13 | 4-aminobenzoic acid | 150-13-0 | 0 | 0 |
| 14 | 2-acetamidofluorene | 53-96-3 | 0 | 0 |
| 15 | benzalkonium chloride | 8001-54-5 | 0 | 0 |
| 16 | dimethyl-isophthalate | 1459-93-4 | 0 | 0 |
| 17 | ethyl-methanesulfonate | 62-50-0 | 0 | 0 |
| 18 | 4-hydroxybenzoic acid | 99-96-7 | 0 | 0 |
| 19 | lactic acid | 598-82-3 | 0 | 0 |
| 20 | 4-methoxyacetophenone | 100-06-1 | 0 | 0 |
| 21 | 6-methylcoumarin | 92-48-8 | 0 | 0 |
| 22 | methyl-4-hydroxybenzoate | 99-76-3 | 0 | 0 |
| 23 | methyl salicylate | 119-36-8 | 0 | 0 |
| 24 | 2-nitrofluorene | 607-57-8 | 0 | 0 |
| 25 | propylene glycol | 57-55-6 | 0 | 0 |
| 26 | propyl paraben | 94-13-3 | 0 | 0 |
| 27 | resorcinol | 108-46-3 | 0 | 0 |
| 28 | salicylic acid | 69-72-7 | 0 | 0 |
| 29 | di-2-furanylethanedione | 492-94-4 | 0 | 0 |
| 30 | 12-bromo-1-dodecanol | 3344-77-2 | 1 | 1 |
| 31 | 3-amino-5-mercapto-1,2,4-triazole | 16691-43-3 | 1 | 0 |
| 32 | chloramine-T | 127-65-1 | 1 | 1 |
| 33 | benzocaine | 94-09-7 | 1 | 0 |
| 34 | urushiol V | 53237-59-5 | 1 | 1 |
| 35 | 2-aminophenol | 95-55-6 | 1 | 0 |
| 36 | phthalic anhydride | 85-44-9 | 1 | 1 |
| 37 | cinnamic aldehyde | 104-55-2 | 1 | 1 |
| 38 | camphorquinone | 10373-78-1 | 1 | 1 |
| 39 | 2-hydroxyethyl-acrylate | 818-61-1 | 1 | 1 |
| 40 | N-nitroso-N-methylurea | 684-93-5 | 1 | 1 |
| 41 | diethyl-sulfate | 64-67-5 | 1 | 1 |
| 42 | 1,2-benzisothiazol-3(2H)-one | 2634-33-5 | 1 | 1 |
| 43 | butyl glycidyl ether | 2426-08-6 | 1 | 0 |
| 44 | methyl-2-nonynoate | 111-80-8 | 1 | 1 |
| 45 | 2-vinylpyridine | 100-69-6 | 1 | 1 |
| 46 | propyl gallate | 121-79-9 | 1 | 0 |
| 47 | ethylene-glycol-dimethacrylate | 97-90-5 | 1 | 0 |
| 48 | imidazolidinyl urea | 39236-46-9 | 1 | 1 |
| 49 | tetrachlorosalicylanilide | 1154-59-2 | 1 | 0 |
| 50 | oxazolone | 1564-29-0 | 1 | 1 |
| 51 | acetyl-isovaleryl | 13706-86-0 | 1 | 1 |
| 52 | hydroxycitronellal | 107-75-5 | 1 | 1 |
| 53 | methylene diphenyl diisocyanate | 101-68-8 | 1 | 1 |
| 54 | dodecyl methanesulphonate | 51323-71-8 | 1 | 1 |
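As a consistency check, the training-set percentages in the "Model" row of Table 2 can be recomputed from the LLNA and predicted columns of Table 3. The short Python sketch below encodes only the compound numbers at which the two columns disagree; variable names are illustrative.

```python
# Compounds 1-29 are LLNA-negative (0) and compounds 30-54 are LLNA-positive (1).
llna = [0] * 29 + [1] * 25

# Compound numbers in Table 3 where the predicted value differs from LLNA.
false_positives = {2, 5}                        # LLNA 0, predicted 1
false_negatives = {31, 33, 35, 43, 46, 47, 49}  # LLNA 1, predicted 0

predicted = list(llna)
for number in false_positives | false_negatives:
    predicted[number - 1] = 1 - predicted[number - 1]

correct = sum(1 for y, p in zip(llna, predicted) if y == p)
true_positives = sum(1 for y, p in zip(llna, predicted) if y == 1 and p == 1)
true_negatives = sum(1 for y, p in zip(llna, predicted) if y == 0 and p == 0)

accuracy = 100 * correct / len(llna)
sensitivity = 100 * true_positives / llna.count(1)
specificity = 100 * true_negatives / llna.count(0)
print(round(accuracy), round(sensitivity), round(specificity))  # 83 72 93
```

The counts behind the percentages are 45 of 54 responses, 18 of 25 active compounds and 27 of 29 inactive compounds.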
The main goal of the present study was to evaluate classes of molecular descriptors that can later be used in a comprehensive QSAR model of LLNA based on a large set of compounds. Our preliminary results demonstrate that the most promising molecular descriptors are derived from three- or two-dimensional molecular structure indices, which are based on radial distribution functions, topological indices, or autocorrelation functions. These classes of descriptors seem to be naturally related to the LLNA activity, as they associate the immunological response with the three-dimensional structure and shape of the sensitizing agents. These results suggest that a comprehensive QSAR model of ACD may be built by using only a few appropriate parameters, although the relevance of the identified descriptors to the continuous-scale ACD QSAR has yet to be shown. Further work will be focused on populating the QSAR database with continuous-scale ACD data and on expanding the database. New predictive QSARs are expected to be useful in screening larger sets of compounds for their potential impact on the skin, and thus may suggest a useful order of priorities in experimental testing.
This research was supported by the National Occupational Research Agenda Dermal Exposure Research Program.
[1] Worker Health Chartbook, 2000. Nonfatal Illness. DHHS (NIOSH) Publication No. 2002-120, April 2002.
[2] Hewitt, P. & Maibach, H.I. Dermatotoxicology. In: Handbook of Occupational Dermatology (Kanerva, L., Elsner, P., Wahlberg, J.E., Maibach, H.I. eds), Springer, Berlin, 2000.
[3] The Murine Local Lymph Node Assay: A Test Method for Assessing the Allergic Contact Dermatitis Potential of Chemicals/Compounds, NIH Publication No. 99-4494, February 1999.
[4] J. Ashby, D.A. Basketter, D. Paton, I. Kimber, Structure-activity relationships in skin sensitization using murine local lymph node assay. Toxicology, 102 (1995) 177-194.
[5] K.E. Haneke, R.R. Tice, B.L. Carson, B.H. Margolin, W.S. Stokes, ICCVAM evaluation of the murine local lymph node assay. III. Data analyses completed by the national toxicology program interagency center for the evaluation of alternative toxicological methods. Regulatory Toxicology and Pharmacology, 34 (2001) 274-286.
[6] Agresti, A. Categorical Data Analysis, John Wiley & Sons, New York, 1990.
[7] Hemmer, M.C., Steinhauer, V. & Gasteiger, J. Vibrat. Spectr. 19 151-164, 1999.
[8] Todeschini, R. & Consonni, V. Handbook of molecular descriptors. Wiley-VCH, Weinheim, Germany, 2000.
[9] V. Consonni, R. Todeschini, M. Pavan, Structure/response correlations and similarity/diversity analysis by GETAWAY descriptors. 1. Theory of the novel 3D molecular descriptors. Journal of Chemical Information and Computer Sciences, 42 (2002) 682-692.
[10] V. Consonni, R. Todeschini, M. Pavan, P. Gramatica, Structure/response correlations and similarity/diversity analysis by GETAWAY descriptors. 2. Application of the novel 3D molecular descriptors to QSAR/QSPR studies. Journal of Chemical Information and Computer Sciences, 42 (2002) 693-705.