This exemplary new book highlights applications of multivariate techniques in the areas of drug therapy and toxicology, cancer, obesity and diabetes, as well as outlining applications to cardiovascular, infectious, inflammatory and oral diseases in detail.
After graduating in Chemistry/Statistical Analysis at Birkbeck College, University of London in 1981, Prof. Grootveld completed his Ph.D on bioanalytical chemistry and metallodrugs in 1985 at the same institution and then conducted post-doctoral work on the analysis of 'markers' of free radical activity in biofluids at King's College, University of London. He then spent 2 1/2 years lecturing and conducting research work at the Polytechnic of North London prior to taking up a Lectureship in Clinical Chemistry at St. Bartholomews and the Royal London School of Medicine and Dentistry in 1989, where he subsequently became Senior Lecturer and then Reader in Chemical Pathology. Later, he transferred to London South Bank University where he was also Reader in Chemical Pathology, and Director of their M.Sc Forensic Science course. He is now Professor of Chemical Pathology and Biomedical Materials at the University of Bolton where he has established and now directs a Master's course in Medical and Healthcare Devices which is the first of its kind available in the UK. He was Visiting Professor of Clinical Chemistry at Queen's University Belfast from 2001-2005. Prof. Grootveld is the author of almost 100 full, refereed research publications in reputable international scientific and/or clinical journals, 20 reviews and more than 160 refereed conference contributions.
Multivariate analysis of the multi-component analytical profiles of carefully collected biofluid and/or tissue biopsy specimens can provide a 'fingerprint' of their biomolecular/metabolic status. Therefore, if applied correctly, valuable information regarding disease indicators, disease strata and sub-strata, and disease activities can be obtained.
This exemplary new book highlights applications of these techniques in the areas of drug therapy and toxicology and cancer, as well as outlining applications to, for example, thyroid, inflammatory and oral diseases in detail. The book gives particular reference to cautionary measures that must be applied to the diagnosis and classification of these conditions or physiological criteria. Comprehensively covering a wide range of topics, of particular interest is the focus on experimental design and 'rights and wrongs' of the techniques commonly applied by researchers, and the very recent development of powerful 'Pattern Recognition' and 'Computational Intelligence' techniques.
The book provides a detailed introduction to the area, applications and common pitfalls of the techniques discussed before moving into detailed coverage of specific research areas (including bioenergetics and chemogenomics), each highlighted in individual chapters. This title will provide an invaluable resource to Medicinal Chemists, Biochemists and Toxicologists working in industry and academia.
Chapter 1. Introduction to the Applications of Chemometric Techniques in 'Omics' Research: Common Pitfalls, Misconceptions and 'Rights and Wrongs'. Martin Grootveld
Chapter 2. Experimental Design: Sample Collection, Sample Size, Power Calculations, Essential Assumptions and Univariate Approaches to Metabolomics Analysis. Martin Grootveld and Victor Ruiz Rodado
Chapter 3. Recent Developments in Exploratory Data Analysis and Pattern Recognition Techniques. Martin Grootveld
Chapter 4. Analysis of High-dimensional Data from Designed Metabolomics Studies. Johan A. Westerhuis, Ewoud J. J. van Velzen, Jeroen J. Jansen, Huub C. J. Hoefsloot and Age K. Smilde
Chapter 5. Current Trends in Multivariate Biomarker Discovery. Darius M. Dziuda
Chapter 6. Discovery-based Studies of Mammalian Metabolomes with the Application of Mass Spectrometry Platforms. Warwick B. Dunn, Catherine L. Winder and Kathleen M. Carroll
Chapter 7. Recent Advances in the Multivariate Chemometric Analysis of Cancer Metabolic Profiling. Kenichi Yoshida and Martin Grootveld
Chapter 8. Group-specific Internal Standard Technology (GSIST) for Mass Spectrometry-based Metabolite Profiling. Jiri Adamec
Chapter 9. 18O-assisted 31P NMR and Mass Spectrometry for Phosphometabolomic Fingerprinting and Metabolic Monitoring. Emirhan Nemutlu, Song Zhang, Andre Terzic and Petras Dzeja
Chapter 10. Investigations of the Mechanisms of Action of Oral Healthcare Products using 1H NMR-based Chemometric Techniques. C. J. L. Silwood and Martin Grootveld
Chapter 11. Metabolomics Investigations of Drug-induced Hepatotoxicity. Wei Tang and Qiuwei Xu
Chapter 12. Chemogenomics. Virendra S. Gomase, Akshay N. Parundekar and Archana B. Khade
Subject Index
Introduction to the Applications of Chemometric Techniques in 'Omics' Research: Common Pitfalls, Misconceptions and 'Rights and Wrongs'
MARTIN GROOTVELD
Leicester School of Pharmacy, Faculty of Health and Life Sciences, De Montfort University, The Gateway, Leicester LE1 9BH, UK
Email: mgrootveld@dmu.ac.uk
1.1 Introduction
In this first chapter, I shall focus mainly on the two most widely employed multivariate (MV) assessment systems available in practice, specifically Principal Component Analysis (PCA) and Partial Least Squares methods, particularly Partial Least Squares-Discriminatory Analysis (PLS-DA), the first of which is an unsupervised exploratory dataset analysis (EDA) method, the second being a supervised pattern recognition technique (PRT). I have chosen to concentrate on these particular MV analysis methods here since there are numerous documented examples of the applications of these in the scientific, biomedical and/or clinical research areas in which they have sometimes been employed inappropriately, to say the least! Further details regarding the principles and modular applications of these two MV analysis approaches are provided in Appendices I and II.
1.2 Principal Component Analysis (PCA)
The applications of Principal Component Analysis (PCA) to the interpretation of MV metabolomic or chemometric datasets are manifold, and this is, perhaps, one of the most extensively applied techniques (examples are provided in refs 3–7); it is sometimes employed in the first instance, if only for the detection and removal of statistical 'outlier' samples. The principles of this method involve the reduction of a large MV dataset (such as that arising from the 'bucketed' 1H NMR analysis of, say, a collection of biofluid samples, tissue biopsies or their extracts, or otherwise) to a much smaller number of 'artificial' variables known as Principal Components (PCs), which represent linear combinations of the primary (raw) dataset 'predictor' variables and, hopefully, will account for at least some, if not most, of their variance. These PCs can then, at least in principle, be employed as 'predictor' or criterion (X') variables in subsequent forms of analyses. It is clearly a valuable technique to apply when at least some level of 'redundancy' is suspected in the dataset, i.e. when some of the X variables are correlated or highly correlated (either positively or negatively) with one another. In metabolomics experiments, it is often the case that one or more (perhaps many) biofluid metabolite concentrations (or proportionately related parameters such as a resonance, signal or peak intensity) will be significantly correlated with one (or more) others, either positively or negatively. Obviously, in such situations, many of the predictor (X) variables can be rendered redundant, and this forms the basis of the PCA technique in terms of its dimensionality reduction strategy.
PCA is a procedure that converts a very large number of 'independent' variables (more realistically described as 'interdependent' variables in view of their multicorrelational status), i.e. 0.02–0.06 ppm 1H NMR spectral 'buckets' (which have variable frequency ranges if 'intelligently selected', and constant, uniform ones if not, the latter often being a pre-selected size of 0.04 or 0.05 ppm), many of which are correlated, into a smaller number of uncorrelated PCs. Hence, a major objective of this form of multivariate analysis is to reduce the dimensionality (i.e. the number of independent, possible 'predictor' variables) of the dataset whilst retaining as much of the original variance as possible. Accordingly, the first (primary) principal component is that which explains as much of the total variance as possible, the second as much of the remaining variance as possible, and so on with each succeeding PC until one with little or no contribution to variance is encountered; all components are, of course, orthogonal to (i.e. uncorrelated with) each other.
PCA can effectively delineate differing classifications within MV metabolomics datasets, and this is conducted according to the following procedure:
The data matrix is reduced to the much smaller number of PCs describing maximum variance within the dataset through decomposition of the X predictor variable matrix (containing the integral NMR buckets) into a T scores matrix (containing class information as projections of the sample data onto each principal component through displacement from the origin) and a P loadings matrix (describing the variables that influence the scores), such that $X = t_1 p_1^{\mathrm{T}} + \cdots + t_A p_A^{\mathrm{T}} + E$, where the subscript $A$ represents the total number of PCs, and the residual information is included in the residual matrix $E$. The first PC should contain the maximum level of variance in the X matrix, such that the resulting deflated X matrix is then employed to seek a second component, orthogonal to the first, with the second highest variance contribution, and so on. PCA loadings with large values correspond to variables that have particularly high variance contributions towards them, and therefore they impart more to the total variance of the model system investigated.
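The decomposition described above can be sketched numerically. The following is a minimal illustration, not the procedure used in the book: it assumes that the X matrix is standardised (mean-centred and scaled to unit variance) and that the scores and loadings are obtained from a singular value decomposition. The function name `pca_decompose` and the simulated 'bucket' data are hypothetical.

```python
import numpy as np

def pca_decompose(X, n_components):
    """Return the standardised matrix Z, scores T, loadings P and residual E."""
    # Standardise: mean-centre each column and scale it to unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Singular value decomposition of the standardised matrix: Z = U S Vt.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    T = U[:, :n_components] * S[:n_components]  # scores (samples x A)
    P = Vt[:n_components].T                     # loadings (variables x A)
    E = Z - T @ P.T                             # residual matrix
    return Z, T, P, E

rng = np.random.default_rng(0)
# Hypothetical data: 30 samples and 6 'buckets' driven by 2 latent sources plus noise.
latent = rng.normal(size=(30, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.1 * rng.normal(size=(30, 6))

Z, T, P, E = pca_decompose(X, n_components=2)

# Variance carried by each retained PC (the first should be the largest).
print("score variances:", T.var(axis=0, ddof=1))
# Size of the residual left unexplained by the two retained PCs.
print("residual sum of squares:", float((E ** 2).sum()))
```

Here each column of `T` holds the scores of the samples on one PC and each column of `P` holds the corresponding loadings, so `T @ P.T + E` reproduces the standardised matrix.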
However, there still remains much confusion regarding differences between the PCA and exploratory Factor Analysis (FA) techniques. Although similar in many respects (many of the stages followed are virtually identical), one of the most important conceptual differences between the two methods lies with the assumption of an underlying causal structure with FA (but not with PCA). Indeed, the FA technique relies on the assumption that covariation in the observed X variables is ascribable to the presence of one or several latent variables (or factors) that can (or do) exert a causal influence on the X variable dataset. Indeed, researchers often use FA when they are perhaps aware of a causal influence of latent factors on the dataset (for example, the clear influence of thyroid disease status on blood plasma thyroxine levels, or a type 1/type 2 diabetes disease classification on blood plasma glucose and, where appropriate, ketone body concentrations), and this technique has been much more extensively employed in, for example, the social and environmental science areas rather than in metabolomics research; hence, an exploratory FA permits researchers to identify the nature, total number and relative influence of these latent factors. Similarly, for sufficiently large MV datasets, the multiple FA (MFA) method serves to determine underlying relationships or 'signatures' between a series of causal latent variables and the MV dataset attained. In FA or MFA, we may also add the 'diagnostic' or other variables as supplementary ones rather than as latent causal factors.
For PCA, however, no prior assumptions regarding potential underlying causal latent variables are made; indeed, it is simply a dimensional alleviation technique that gives rise to a (relatively) much smaller number of (uncorrelated) PCs which account for as much of the MV dataset as possible (although the influence of or differences between such latent or explanatory variables are, of course, frequently investigated in a metabolomics sense).
Since PCs are defined as linear combinations of optimally weighted predictor (X) variables, it is possible to determine the 'scores' vectors of each sample on each PC that is considered significant (significance commonly being determined via a Scree plot). For example, the first PC may be primarily ascribable to selected metabolic differences between two (or more) disease classification groups, whereas the second may arise from a second series of perhaps unknown, unrelated metabolic perturbations, or alternatively a further influential (perhaps latent) variable such as dietary habit or history, or further differences between sample donors, for example those regarding gender, age, family, ethnicity status, etc. Figure 1.1 shows typical Scree plots arising from the metabolomic PCA of intelligently bucketed datasets arising from the 1H NMR analysis of (a) human salivary supernatants (with 209 predictor variables, 480 samples and 2 oral health disease classifications) and (b) human urine (with only 22 predictor variables, 60 samples and again 2 disease classifications). For this latter example, we selected the most important bucket predictor variables via the prior performance of (1) model directed repetitions (>60 times) of the logistic model of correlated component regression (CCR, as outlined in Chapter 3) with corresponding validation, cross-validation (CV) and permutation testing, and (2) selected computational intelligence techniques, again with accompanying validation, cross-validation and permutation testing. For the Scree plots displayed in Figures 1.1(a) and (b), Tables 1.1(a) and (b), respectively, list the number of PCs with eigenvalues >1 and their corresponding eigenvalues (i.e. the variance accounted for by each PC, expressed in units of the variance of one standardised predictor X variable), together with the percentage of total variance accounted for by these PCs (both individually and cumulatively).
From Figure 1.1(a) and Table 1.1(a), it can be observed that 14 PCs had eigenvalues >1, the first (PC1) with an eigenvalue of 121.66 (i.e. it accounts for as much variance as 121.66 of the 209 standardised, positively and/or negatively correlated predictor variables), the second 27 or so, the third 11 and the fourth 10, etc.; these first four PCs account for 58.2%, 12.85%, 5.3% and 4.9% of the total variance, respectively (total 81.2%). In Figure 1.1(b), however, only 8 PCs had eigenvalues >1, the first five accounting for only ca. 60% of the total variance. It should also be noted from Figure 1.1(b) that the Scree plot appears to have more than one simple break-point, the first after PC6, the second after PC12 (although PCs 9–12 are considered irrelevant since their eigenvalues are all <1). Therefore, for this latter example, it would appear that only PCs 1–6 should be considered as providing valuable MV information.
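The quantities tabulated for a Scree plot (eigenvalues, the number of PCs with eigenvalues >1, and individual and cumulative percentages of variance) can be sketched as follows. This is an illustration only: the data are simulated, with the 60 x 22 dimensions borrowed from the urinary example purely for scale, and the function name `scree_summary` is hypothetical. Because PCA is performed here on the correlation matrix, the eigenvalues sum to the number of predictor variables, which is why an eigenvalue of 121.66 among 209 variables corresponds to 58.2% of the total variance.

```python
import numpy as np

def scree_summary(X):
    """Eigenvalues of the correlation matrix with variance percentages."""
    R = np.corrcoef(X, rowvar=False)          # Pearson correlation matrix
    eigvals = np.linalg.eigvalsh(R)[::-1]     # sorted in descending order
    pct = 100.0 * eigvals / eigvals.sum()     # % of total variance per PC
    n_retained = int(np.sum(eigvals > 1.0))   # PCs with eigenvalues > 1
    return eigvals, pct, np.cumsum(pct), n_retained

rng = np.random.default_rng(0)
# Simulated (hypothetical) dataset: 60 samples, 22 correlated predictor variables.
latent = rng.normal(size=(60, 3))
X = latent @ rng.normal(size=(3, 22)) + rng.normal(size=(60, 22))

eigvals, pct, cum_pct, n_retained = scree_summary(X)
print("PCs with eigenvalue > 1:", n_retained)
print("cumulative % variance of retained PCs:", cum_pct[:n_retained])

# Arithmetic check against the salivary example quoted in the text:
# eigenvalue 121.66 out of 209 standardised variables.
pc1_share = 100.0 * 121.66 / 209
print("PC1 share of total variance (%):", round(pc1_share, 1))
```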
1.2.1 Critical Assumptions Underlying PCA
Now here's the difficult part! Indeed, this is where a lot of PCA applications to the analysis of metabolomics/chemometric datasets fall down, and hence partially or completely fail to provide satisfactory models for the diagnosis of human diseases, determinations of their severities, or responses to treatment, etc.
As with many alternative MV analysis techniques, the satisfactory application of PCA to the recognition of patterns or 'signatures' of metabolic biomarkers in metabolomics datasets (1H NMR-derived or otherwise) is critically dependent on the satisfaction of a series of assumptions. Unfortunately, such assumptions are rarely checked, evaluated or monitored prior to the performance of PCA, and hence results acquired can hardly be considered as having a sound basis. However, as noted below, some of these assumptions are of much more importance than others, and the technique is relatively robust to violations of some of the criteria required.
These assumptions are:
(1) Primarily, since PCA is conducted on the analysis of a matrix of Pearson correlation coefficients, datasets acquired should satisfy all the relevant assumptions required for this statistic.
(2) A random sampling design should be employed, and hence each biofluid, tissue or alternative sample should contribute one, and only one, value (specifically, metabolite concentration or related measure, normalised and/or standardised) towards each observed 'predictor' (X) variable; these values should ideally represent those from a random sample drawn from the population(s) investigated.
(3) All biomolecule predictor (X) variables should be evaluated on suitable concentration (or directly proportional spectroscopic or chromatographic intensity measures), concentration interval or concentration ratio measurement levels.
(4) Each predictor variable measurement (for example, concentration or signal intensity) should be distributed normally, and those that deviate from this (i.e. those that demonstrate a limited level of kurtosis or skewness) can, at least in principle, be appropriately transformed in order to satisfy this assumption.
(5) Each pair of predictor (X) variables in the plethora of those available in an MV dataset should conform to a bivariate normal distribution; specifically, plots derived therefrom should form an elliptical scattergram. Notwithstanding, Pearson correlation coefficients are remarkably robust against deviations from this assumption when the sample size is large (although this is often not the case in metabolomics experiments!). However, selected MV analysis techniques such as independent component analysis (ICA), which is covered in Chapter 3, also allow for quadratic or higher order polynomial relationships between the explanatory variables (although selected transformations of the dataset acquired may serve to convert such non-linear relationships to linear or approximately linear ones). An example which describes the application of a series of four such tests of normality for a large number of predictor X variables within a 1H NMR multivariate 'intelligently bucketed' urinary dataset is provided in Chapter 2.
Appropriate transformations for the conversion of such non-normally distributed X variable datasets include the logarithmic (log10 or loge) transformation for variables in which the standard deviation is proportional to the mean value (in this case, the distribution is positively skewed); the square root transformation for variables in which the estimated variance (s²) is proportional to the mean (which frequently occurs in cases where the variables represent counts such as the number of abnormal cells within a microscopic field, etc.); the reciprocal transformation for variables with standard deviations proportional to the square of the mean (this is usually applied to highly variable predictors such as blood serum creatinine concentrations); and the arcsine (%)^1/2 (arcsine square-root) transformation for variables expressed as percentages, which tend to be binomially distributed (this transformation is likely to have some application to MV metabolomic datasets which have been normalised to a constant sum (say 100%) both with and without their subjection to the subsequent standardisation preprocessing step, details of which are provided in Chapter 2). Of course, the standardisation process (involving mean-centring and unit-variance scaling) will provide variables with mean values of zero and standard deviations and variance values of unity, and hence the performance of such transformations may be considered inappropriate. However, this standardisation process will certainly not achieve the conversion of a significantly skewed distribution into a non-skewed, symmetrical and perfectly normally distributed one!
(6) Watch out for outliers! The presence of even just one outlying data point can sometimes give rise to a strong (but overall false!) apparent correlation between, say, two metabolite levels, even if the complete dataset has been subjected to normalisation (row operation) and standardisation (column operation) procedures. Figure 1.2 shows an example of how this might arise. In addition to checking for outlying biofluid or tissue samples, which can easily be achieved by examinations of two- or three-dimensional PCA scores plots (such samples may occur from their collection from study participants taking or receiving project- or clinical trial-unauthorised medication, or further programme-prohibited agents such as alcoholic beverages, for example), researchers should also endeavour to check all the predictor variables individually for such outlying data points, and perhaps remove them if proven necessary. In this manner, we can at least be confident that each predictor variable (column) dataset is outlier-free and will not be violating the 'no-outlier' assumption.
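Two of the points made in the assumptions above lend themselves to a brief numerical sketch: the effect of the suggested transformations on skewness (and the fact that standardisation alone leaves skewness unchanged), and the way in which a single outlying point can create a strong apparent Pearson correlation between two otherwise unrelated variables. All data below are simulated and the variable names are hypothetical; the arcsine square-root transformation is applied here to percentages rescaled to proportions (0 to 1), which is an assumption of this sketch.

```python
import numpy as np

def skewness(x):
    """Sample skewness (third standardised moment)."""
    d = x - x.mean()
    return (d ** 3).mean() / (d ** 2).mean() ** 1.5

rng = np.random.default_rng(1)

# Positively skewed 'concentration' variable (log-normal, simulated).
conc = rng.lognormal(mean=0.0, sigma=1.0, size=500)
skew_raw = skewness(conc)
skew_log = skewness(np.log10(conc))        # logarithmic transformation

# Standardisation (mean-centring, unit-variance scaling) keeps the skewness.
z = (conc - conc.mean()) / conc.std(ddof=1)
skew_z = skewness(z)

# Square root transformation for counts; arcsine square-root for percentages.
counts = rng.poisson(4, size=500)
sqrt_counts = np.sqrt(counts)
pct = rng.uniform(0, 100, size=500)
arcsine_pct = np.arcsin(np.sqrt(pct / 100.0))  # result lies in [0, pi/2]

# A single outlier inflating the correlation between two unrelated variables.
a = rng.normal(size=20)
b = rng.normal(size=20)
r_clean = np.corrcoef(a, b)[0, 1]
r_outlier = np.corrcoef(np.append(a, 25.0), np.append(b, 25.0))[0, 1]

print("skewness raw / log10 / standardised:", skew_raw, skew_log, skew_z)
print("Pearson r without and with one outlier:", r_clean, r_outlier)
```

Checking each predictor column in this way before PCA is one simple means of flagging variables that may need transformation or outlier inspection.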
1.2.2 Number and Significance of Explanatory Variables Loading on a PC
When one or more explanatory variables, biomolecular or otherwise, load on a principal component, it is highly desirable for researchers to have an absolute minimum of three or so of these X variables per component; indeed, it is generally considered good practice to retain five or more of these variables per component, since some of these may be subsequently removed from the diagnostic criteria developed. However, in metabolomics datasets consisting of perhaps 200 or more of such variables (such as those generated from the high-resolution 1H NMR or LC-MS analysis of selected biofluids), it is not uncommon to encounter PCs that contain as many as 100–1000 or more of these X variables, which are all correlated (positively and/or negatively), and hence have autonomy and perhaps independence regarding their contributions to successive PCs, i.e. those which account for less and less of the total variance encountered in the dataset.
Excerpted from Metabolic Profiling by Martin Grootveld. Copyright © 2015 The Royal Society of Chemistry. Excerpted by permission of The Royal Society of Chemistry.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.