How to use RM for this Paper
Dear RM community,
Is somebody able to help me get a bit closer? I know data mining approaches sometimes differ from the way a researcher needs to present results. This paper uses data from the MIMIC II database, a clinical database of 40,000 ICU patients (https://mimic.physionet.org/). I think the authors have done a nice job and I would like to use this approach for the analysis of other attributes. My data is preprocessed, but I can't find how to use a variance inflation factor, the lowess smooth technique, and finally how to have the odds ratios calculated and presented in the results.
Hoping someone can help me.
Cheers
Sven
This article is the subject of my question:
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0095204
In the methods I read: Continuous variables were tested for normality by using Kolmogorov–Smirnov test. Data of normal distribution were expressed as mean±SD and compared using t test. Otherwise, Wilcoxon rank-sum test was used for comparison. Categorical variables were expressed as percentage and compared using Chi square test or Fisher's exact test as appropriate. ICU mortality was used as the study endpoint. To exclude confounding factors that may influence the association of iCa and mortality, logistic regression model was used to adjust for the odds ratios (OR). We built two models separately for Ca0 and Camean during ICU stay. The full model included all variables listed in Table 1.[8] Covariate selection was performed by using stepwise forward selection and backward elimination technique, with Ca0 and Camean remaining in the model. The significance level for selection was predefined as 0.15 and that for elimination was 0.2. After this step the main effect model was built. Lowess smooth technique was used to examine the relationship between iCa and mortality in logit.[9] To facilitate clinical interpretation of our results and to meet the interests of subject-matter audience, we planned to use linear spline function for model building.[10] The knots were chosen according to conventional classification of iCa ranges: relative to the normal range of 1.15–1.25 mmol/L, we defined hypocalcemia as mild, moderate and severe as 0.9–1.15, 0.8–0.9 and <0.8 mmol/L, respectively. Hypercalcemia was divided into mild, moderate and severe as 1.25–1.35, 1.35–1.45 and >1.45 mmol/L, respectively.[11], [12] Potential multicollinearity between covariates in the model were quantified by using variance inflation factor (VIF) which provided an index that measures how much the variance of an estimated regression coefficient is increased because of collinearity.[13] As a common rule of thumb, a VIF>5 was considered for the existence of multicollinearity. 
Furthermore, iCa was categorized into intervals and incorporated into regression models as design variable. Design variable, also known as dummy variable, is one that takes the value of 0 or 1 to indicate the presence or absence of some categorical effect that is expected to shift the outcome. It is frequent used for categorical variables with more than two categories. Normal range between 1.15 and 1.25 mmol/l was used as reference and ORs were reported for other intervals. Receiver operating characteristic curve (ROC) was depicted to show the diagnostic performance of fitted logistic regression models.
Answers
I read the paper and it is really a different way of thinking.
Let me sum up: what the author does is create features and then run a logistic regression with and without one attribute. I see nowhere which validation he uses. Without a validation, this approach is simply wrong.
I guess I need to think about this a bit more. It is definitely not a predictive task.
Best,
Martin
Dortmund, Germany
So the key question is: does the mortality depend on the iCa level?
What a data scientist might do is run two analyses, one with and one without iCa, and afterwards compare the ROC AUCs using a cross-validation and a t-test. Then we can answer the question "Does it help to know the iCa to predict mortality?", which might be related to the question above. Sadly this all depends on preprocessing etc., so I do not know how much to trust those p-values.
I am further not sure whether this paper is a good data mining task. If the question were "how high is the mortality for this person?", then RapidMiner would be the way to go. This sounds more like traditional statistics combined with multivariate methods to get more significance.
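The with/without-iCa comparison over cross-validation folds can be sketched like this in Python/scikit-learn. The data is synthetic and the effect of the extra attribute is simulated, so the numbers mean nothing clinically.

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)
n = 3000
base = rng.normal(size=(n, 3))   # baseline covariates (synthetic)
ica = rng.normal(size=n)         # hypothetical extra attribute
logit_p = base @ np.array([0.5, -0.5, 0.3]) - 1.0 * ica
y = (rng.random(n) < 1 / (1 + np.exp(-logit_p))).astype(int)

# Same folds for both models so the fold-wise AUCs are paired.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
clf = LogisticRegression(max_iter=1000)
auc_without = cross_val_score(clf, base, y, cv=cv, scoring="roc_auc")
auc_with = cross_val_score(clf, np.column_stack([base, ica]), y,
                           cv=cv, scoring="roc_auc")

# Paired t-test over the fold-wise AUCs. Folds share training data, so
# this p-value is optimistic; treat it as a rough screen, which matches
# the skepticism about p-values above.
t_stat, p_val = ttest_rel(auc_with, auc_without)
```

If the AUC with iCa is consistently higher across folds, that is evidence that knowing iCa helps predict mortality.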
Dortmund, Germany
Any comments?
Sven
I thought about this a bit more. You can of course define a standard analysis to predict mortality. Afterwards you can use a better technique using more information (like iCa) and check whether it becomes better. Usually the error of the cross-validation should affect your p-values. I think this was forgotten in the mentioned paper.
All those p-values are calculated like P(better than before | condition preprocessing, condition learner, ...), so I am not sure how useful the p-values are.
What might be way more useful is to use the model for advice. You can design a model calculating mortality from all given measurements. Then you get a function you can minimize for your patient. The confusing point for me is: the advice would be "lower the blood pressure", but not how. But this might be way more beneficial.
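A toy version of that "model as advisor" idea, with an invented modifiable measurement; nothing here comes from the paper, it only illustrates minimizing a fitted risk function over one input.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2000
bp = rng.normal(120, 15, n)      # hypothetical modifiable measurement
other = rng.normal(size=n)       # everything else, lumped together
risk = 1 / (1 + np.exp(-(0.03 * (bp - 120) + 0.5 * other - 1)))
y = (rng.random(n) < risk).astype(int)

model = LogisticRegression(max_iter=1000).fit(np.column_stack([bp, other]), y)

# For one patient, scan the modifiable input over a plausible range and
# pick the value that minimizes predicted mortality, holding the rest fixed.
grid = np.linspace(90, 160, 71)
patient_other = 0.2
preds = model.predict_proba(
    np.column_stack([grid, np.full_like(grid, patient_other)]))[:, 1]
best_bp = grid[np.argmin(preds)]
```

As noted above, this only says *where* to move the measurement, not *how* to get the patient there.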
Martin
Dortmund, Germany
And do you have a RapidMiner process for reading in the data once I have credentials?
Dortmund, Germany
I think full access is only possible if medically certified.
Some more info: database description: http://mimic.physionet.org/UserGuide/node18.html
Once you get access:
You may download the database from this page, or you may explore it online (see MIMIC II Explorer below). The flat files should be compatible with PostgreSQL version 8.4.8 or later.
What's New (Changes from 2.5):
Added Patients:
5,880 new subjects
6,556 new hospital admissions
8,058 new ICU admissions
Added Data Types:
Demographics: religion, ethnicity, marital status, insurance type, admission source
Procedure (CPT) codes
Diagnosis-related groups (DRGs)
Elixhauser comorbidity scores
Microbiology test results
LOINC coding for lab tests
Note that in previous releases, timestamps were a mixture of standard and daylight savings times. Starting with version 2.6, timestamps are uniformly expressed in EST (Eastern Standard Time), so that the interval between any two timestamps in a given record is simply the difference between them, even if a daylight savings time change occurred during the interval.
Added documentation:
MIMIC II SQL Cookbook: a collection of about 20 "recipes" for useful queries, including calculation of Elixhauser comorbidity scores from DRGs and ICD-9 codes (contributed by Joon Lee).
Virtual Machine:
We also provide a virtual machine hosting a complete copy of the MIMIC II database. The virtual machine image contains a bootable Linux system which has been pre-configured to download and import the MIMIC II database. It is particularly suited to researchers who would like to perform intensive processing of the data and require more flexible access than that provided by the MIMIC II Explorer (Query Builder).
Downloads:
All downloads are in the form of gzip-compressed tar archives ("tarballs"). See How can I unpack a .tar.gz archive? in the PhysioNet FAQ if you are unfamiliar with this format. The individual flat files, once unpacked, are in CSV format; within each line (table row), fields (columns) are separated by commas, and text strings are surrounded by double quotes. A Linux script is available for downloading all the files from the command line using the wget command.
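The quoting convention matters because some text fields contain commas; any standard CSV reader handles it. A quick check in Python (the sample rows below are invented, not real MIMIC entries):

```python
import csv
import io

# Hypothetical rows in the flat-file format described above:
# comma-separated fields, text strings in double quotes.
sample = io.StringIO('3,"SODIUM","mEq/L"\n'
                     '4,"POTASSIUM, WHOLE BLOOD","mEq/L"\n')
rows = list(csv.reader(sample))
# The embedded comma in the second row stays inside one field.
```

So a naive split on commas would break on quoted fields, but `csv.reader` (or any RFC 4180-style parser) reads them correctly.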
Definitions: The definition tables contain information needed to interpret elements of the subject-specific data tables (As well as a folder regarding the database schema in PostgreSQL syntax). They consist of 11 files that can be extracted from mimic2cdb-2.6-Definitions.tar.gz.
Subject-specific data: All data for a given patient are contained in a set of 33 flat files for that patient. The data archives contain the flat files for about 1000 subjects each. These archives are typically 75-90 Mb each, and expand when decompressed to roughly ten times their size. The decompressed flat files occupy about 31 GB in all.
mimic2cdb-2.6-00.tar.gz (00001-00999)
mimic2cdb-2.6-01.tar.gz (01000-01999)
mimic2cdb-2.6-02.tar.gz (02000-02999)
mimic2cdb-2.6-03.tar.gz (03000-03999)
mimic2cdb-2.6-04.tar.gz (04000-04999)
mimic2cdb-2.6-05.tar.gz (05000-05999)
mimic2cdb-2.6-06.tar.gz (06000-06999)
mimic2cdb-2.6-07.tar.gz (07000-07999)
mimic2cdb-2.6-08.tar.gz (08000-08999)
mimic2cdb-2.6-09.tar.gz (09000-09999)
mimic2cdb-2.6-10.tar.gz (10000-10999)
mimic2cdb-2.6-11.tar.gz (11000-11999)
mimic2cdb-2.6-12.tar.gz (12000-12999)
mimic2cdb-2.6-13.tar.gz (13000-13999)
mimic2cdb-2.6-14.tar.gz (14000-14999)
mimic2cdb-2.6-15.tar.gz (15000-15999)
mimic2cdb-2.6-16.tar.gz (16000-16999)
mimic2cdb-2.6-17.tar.gz (17000-17999)
mimic2cdb-2.6-18.tar.gz (18000-18999)
mimic2cdb-2.6-19.tar.gz (19000-19999)
mimic2cdb-2.6-20.tar.gz (20000-20999)
mimic2cdb-2.6-21.tar.gz (21000-21999)
mimic2cdb-2.6-22.tar.gz (22000-22999)
mimic2cdb-2.6-23.tar.gz (23000-23999)
mimic2cdb-2.6-24.tar.gz (24000-24999)
mimic2cdb-2.6-25.tar.gz (25000-25999)
mimic2cdb-2.6-26.tar.gz (26000-26999)
mimic2cdb-2.6-27.tar.gz (27000-27999)
mimic2cdb-2.6-28.tar.gz (28000-28999)
mimic2cdb-2.6-29.tar.gz (29000-29999)
mimic2cdb-2.6-30.tar.gz (30000-30999)
mimic2cdb-2.6-31.tar.gz (31000-31999)
mimic2cdb-2.6-32.tar.gz (32000-32809)
The MIMIC Importer: Software for automatically creating a PostgreSQL database from the flat files above is available. Download and unpack MIMIC-Importer-2.6.tar.gz first, then download the definitions and subject-specific tarballs into the MIMIC-Importer-2.6 directory created by unpacking the MIMIC Importer tarball. Detailed instructions for using the software are available (a copy of the README included in the tarball). (Note: MIMIC II user Andrea Bravi has developed a Python version of the MIMIC Importer that Windows users may find simpler to run; find it at Andrea's GitHub page.)
Definition tables and maps
The definition tables are:
D_CAREGIVERS
D_CAREUNITS
D_CHARTITEMS
D_CODEDITEMS
D_CHARTITEMS_DETAIL
D_IOITEMS
D_LABITEMS
D_DEMOGRAPHICITEMS
D_MEDITEMS
D_PARAMMAP_ITEMS
PARAMETER_MAPPING
* D_WAVEFORM_SIG
* The D_WAVEFORM_SIG definitions table is not used in this release.
Subject data tables
The data archives unpack into directories for each subject. Each subject's directory contains 32 tables (flat files):
A_CHARTDURATIONS
ADDITIVES
ADMISSIONS
A_IODURATIONS
A_MEDDURATIONS
CENSUSEVENTS
CHARTEVENTS
COMORBIDITY_SCORES
DELIVERIES
DEMOGRAPHIC_DETAIL
DEMOGRAPHICEVENTS
D_PATIENTS
DRGEVENTS
ICD9
ICUSTAY_DAYS
ICUSTAY_DETAIL
ICUSTAYEVENTS
IOEVENTS
LABEVENTS
MEDEVENTS
MICROBIOLOGYEVENTS
NOTEEVENTS
POE_MED
POE_ORDER
PROCEDUREEVENTS
TOTALBALEVENTS
* WAVEFORM_METADATA
* WAVEFORM_SEGMENTS
* WAVEFORM_SEG_SIG
* WAVEFORM_SIGNALS
* WAVEFORM_TRENDS
* WAVEFORM_TREND_SIGNALS
* The WAVEFORM_* tables are not included in these flat files, although they are present in the on-line MIMIC II Explorer (see below).
An empty flat file indicates that patient's record does not include data of the corresponding type.
MIMIC II Explorer (Query Builder)
The MIMIC project provides the MIMIC II Explorer, a direct SQL interface to the MIMIC II Clinical Database, hosted on its secure web site.
To access the MIMIC II Explorer, you must use a MIMIC user name and password. Your PhysioNetWorks user name and password will not work on the MIMIC project's web site.
First-time users: Please note that your user name and a temporary password for the MIMIC portal were sent to you in two emails from mimic-support@physionet.org with the subject lines Your MIMIC-II User Account and Your MIMIC-II Password. Follow the instructions in the emails to change your MIMIC password (you may change it to match your PhysioNetWorks password if you wish). If you did not receive these emails, your spam filter may have rejected them; please check before writing to mimic-support@physionet.org to request that they be sent again.
The MIMIC project web site currently uses a self-signed SSL certificate. Your browser will warn you that it does not recognize the certificate the first time you visit; accept it in order to enter the site.
Go to the MIMIC II Explorer (Query Builder) [link opens in another window].
Cheers and thanks
Sven
Dortmund, Germany