Background Research

I2B2 background

Please visit:

for I2B2 background, it's mission and resources.

After successful installation of i2b2 Server and Client using VMWare image and source build including various i2b2 cells, and after exploring i2b2 features using i2b2 client tools for demo project provided, although we are in the process of exploring opportunities in detail for OpenI plug-in for i2b2, looking at the I2B2 data warehouse schema and demo data set available, at a minimum, our aim is to put together a demo of a OLAP reporting instance soon.

To be able to fully develop OLAP reporting infrastructure for I2B2 (for example, ETL, EAV and Metadata model and mapping, etc), following resources collected from various sources would be helpful:

Broad outlines of various components of I2B2

Also please see attached:


Addition details of I2B2 data warehouse star schema and EAV design

As described in the I2B2 FAQ, a central Observations fact table is surrounded by dimension tables for Patient, Provider, Encounter (Visit) and Clinical
Concept. (The primary key of the fact table is a composite of the four IDs related to these dimensions, plus a Start Date/time, indicating the date/time
of the Observation. Remember that the same observation can be recorded multiple times for the same patient.) Like all star schemas, I2B2 is an optimization
for a particular purpose – here, to identify anonymized patients who match a particular complex Boolean combination of clinical findings. The idea is that investigators,
before planning a research project, need to perform a feasibility analysis in terms of the number of patients in the local healthcare system matching a particular
clinical/disease profile. Only if the cohort identified is large enough is it worth making the effort to actually write up the proposal and ask for funding.

The Patient and Encounter dimensions are anonymized. The idea is that researchers should be able to ask research related questions of the data without the local Institutional
Review Board worrying about compromising patient confidentiality. If the proposal is approved and funding obtained, a separate linkage between surrogate Patient IDs and the
actual (identifiable) Patient IDs and demographic data can be utilized to identify real patients and contact them regarding participation in the study.

Clinical facts have different data types: numbers, short strings, long text (e.g., clinical text) and XML (clinical text marked up by NLP software). The I2B2 fact table uses an
EAV design with multiple data-type-specific columns in a single table, with an indicator column indicating the data type of the current column. (Thus, if the data type for a
particular row was Numeric, the numeric value column would be consulted.) In addition to the value column, additional (optional) columns provide extra information: end date/time (if applicable), units of measurement, modifiers (e.g., route of administration, for drugs).

All tables (fact + dimension) contain lineage information; the source from where it originated, and the date/times a particular row was updated in the source system, are downloaded from the source, and imported into the schema.

The Importance of Metadata to Large Biomedical Systems

Large-scale biomedical software systems, however, have certain aspects that make them particularly suitable to the application of metadata-based software design.
Clinical Study Data Management Systems (CSDMSs) share numerous similarities with EMRs, but serve the purpose of supporting research primarily, and patient care
secondarily or coincidentally. Given that many data elements originate in the leading-edge research of individual investigators using the CSDMS and may not be known
outside a single research group, the need to allow customers to define their own data elements becomes even more important. Metadata consequently plays a very large role in such systems.

In brief, a metadata-driven design facilitates greatly in flexibility and adaptability in Biomedical Systems. How I2B2 employs Metadata is still under investigation but it might be either
of these two approaches or both: 

  • A means of representing highly variable clinical data that faithfully reflects its sparseness:
    that is, while hundreds of thousands of past history items, clinical findings, diagnoses, and
    investigations (including laboratory tests) exist across all of medicine, only a very modest
    subset of these will apply to a given patient. One technique here is to use extensible, multidimensional
    hash tables that can be stored on disk; another technique employs the Entity-
    Attribute-Value (EAV) data model.
  • The use of a large sub-schema with numerous tables containing metadata. The metadata
    is used to support the presentation, capture, interpretation and export of the clinical
    data: without the metadata, the clinical data would essentially be unusable.

Requirements to extend i2b2 data model to provide a OLAP solution

After having an understanding of various components of i2b2 architecture and how they interact/communicate with each other, our primary goal is to identify structural changes and additions required in the i2b2 data model in order to provide a OLAP cube solution and query interface similar to i2b2 workbench.

NOTE: The i2b2 Workbench is not a query tool alone to query i2b2 star schema data model, it is a collection of client-side components that communicate with i2b2 Cells and help aggregate their functionality in the Hive. So the Workbench utilizes various database tables of the cells and metadata tables for it's functionality. These various database tables constitute i2b2 data warehouse.

In order to come up with a OLAP cube solution and query interface, we need to identify relevant tables and columns from the i2b2 data warehouse. Among the most
relevant are 'Observation_Fact', 'concept_dimension' and metadata tables. Below is the list of the 'Observation_Fact', 'concept_dimension' and metadata columns and their purpose. We have provided NOTE on these columns which we have found working with i2b2 data model and which will be important factors to be considered while extending i2b2 data model.

TABLE [Observation_Fact](
          [Encounter_Num] [int] NOT NULL,
          [Patient_Num] [int] NOT NULL,
          [Concept_Cd] [varchar](50) NOT NULL,
          [Provider_Id] [varchar](50) NOT NULL,
          [Start_Date] [datetime] NOT NULL,
          [Modifier_Cd] [varchar](100) NOT NULL,
          [ValType_Cd] [varchar](50) NULL,
          [TVal_Char] [varchar](255) NULL,
          [NVal_Num] [decimal](18, 5) NULL,
          [INSTANCE_NUM] [int] NULL,
          [ValueFlag_Cd] [varchar](50) NULL,
          [Quantity_Num] [decimal](18, 5) NULL,
          [Units_Cd] [varchar](50) NULL,
          [End_Date] [datetime] NULL,
          [Location_Cd] [varchar](50) NULL,
          [Observation_Blob] [text] NULL,
          [Confidence_Num] [decimal](18, 5) NULL,
          [Update_Date] [datetime] NULL,
          [Download_Date] [datetime] NULL,
          [Import_Date] [datetime] NULL,
          [Sourcesystem_Cd] [varchar](50) NULL,
          [UPLOAD_ID] [int] NULL

[Encounter_Num],[Patient_Num],[Concept_Cd],[Provider_Id]: FK to dimension table concept_dimension, Patient_Dimension, Provider_Dimension and Visit_Dimension.
[Modifier_CD]: The use of this column allows a single fact to be modified with an unlimited number of codes. For example a drug can be modified with dose, route, and frequency and a CPT procedure can be modified with attached codes such as a two surgeon team with anesthesia by anesthesiologist (62 and AA).

NOTE: The structure of the table itself does not reveal what the slowly changing dimension strategy is but Modifier_CD column suggest the strategy should be Type 2 SCD. But opposed to standard practice of assigning new dimension record is to a fresh surrogate primary key, and using that key from that moment forward in all fact tables that have that dimension as a foreign key; in the 'Observation_Fact' the modifier_cd column is used in conjunction with the instance_num column which is able to separate various facts into clusters, allowing several modifiers to be attached to the same fact and the primary key for the dimension table is concept_path + modifier_cd.

[Start_Date]: Starting date-time of observation.

[ValType_Cd]: The type of value object, for example, either N for numeric or T for text.
[TVal_Char]: If valtype_cd = 'T', then the text value goes here. If valtype_cd = 'N', then tval_char can be 'E' for equals, G for greater than, L for less than.
[NVal_Num]: If valtype_cd = 'N', then the numeric value goes here.
[ValueFlag_Cd]: Flag (for high or low values, for example).
[Quantity_Num]: Quantity of nval.
[Units_Cd]: Units of measurement of nval.

NOTE: ValType_Cd, TVal_Char, NVal_Num, ValueFlag_Cd, Quantity_Num and units_cd are confusing and hard to grasp columns of the fact table. These are associated with a particular concept and not all concept have value for these comulns. For instance, diagnosis of Acute Myocardial Infarction does not have a value but the tests used to make the diagnosis do have values (high, low, numeric, etc.). These are known as value constraint in i2b2 workbench. Please see screen capture below:

and Using SQL:

select o.concept_cd, name_char, valtype_cd, tval_char, nval_num, valueflag_cd
from observation_fact o
join concept_dimension c
on o.concept_cd = c.concept_cd
where valtype_cd = 'N'

[INSTANCE_NUM]: Encoded instance number that allows more that one modifier to be provided for each concept_cd. Each row will have a different modifier_cd but a similar instance_num.

NOTE: See notes for Modifier_CD.

[End_Date]: The end date-time for the observation.

NOTE: For Temporal criteria such as ranges of dates are relatively easy to model in the queries using the i2b2 workbench using Start_Date and End_Date columns. Data requests that included more complex temporal conditions such as data from the first week of every month for a period of one year, which would require multiple query conditions to be included, including dynamically determining the first week of a given month is not currently possible with i2b2 workbench. But it is relatively easy in a cube using a date dimension and calculating interval, [End_Date] - [Start_Date]. 

Rest of the columns of the Observation_Fact are for administrative purpose or are not used.

TABLE [concept_dimension](
          [concept_path] [varchar](700) NOT NULL,
          [concept_cd] [varchar](50) NULL,
          [name_char] [varchar](2000) NULL,
          [concept_blob] [text] NULL,
          [update_date] [datetime] NULL,
          [download_date] [datetime] NULL,
          [import_date] [datetime] NULL,
          [sourcesystem_cd] [varchar](50) NULL,
          [UPLOAD_ID] [int] NULL
[concept_path]: PK

NOTE: From the point of view of dimensional modeling design most limiting factor in the concept_dimension is that a concept is hierarchical information stored in concept_path column and the hierarchical path that leads to a term; and most of the workbench query usages concept_path. For example, to find all the patients that were diagnosed with migraines:

Select distinct(patient_num)
From observation_fact
Where concept_cd in
(select concept_cd
from concept_dimension
where concept_path like '%Neurologic Disorders (320-389)\(346) Migraine\%')

and if we wanted to get all the ages for patients having a Cholesterol lab, we could run the following query:

Select concept_cd
From observation_fact
where concept_cd like 'DEM|Age%'
and patient_num in
(select patient_num from observation_fact
where concept_cd in
(select concept_cd
from concept_dimension
where concept_path like '%LAB\(LLB16) Chemistry\(LLB17) Lipid Tests\CHOL\%'))

Notice how the path of the concept is used to query all concept ids that fall into the Cholesterol group. If we only wanted to query for patients with Plasma Cholesterol only, we would use the same query with the following path joined against concept_path:
'%LAB\(LLB16) Chemistry\(LLB17) Lipid Tests\CHOL\MCSQ-PCHOL\%'

[concept_cd]: Unique Key to join to Observation_Fact.concept_cd.

[name_char]: A string name that represents this concept, idea or person.

NOTE: Although a concept is hierarchical information stored in concept_path column, from dimensional modeling perspective we should be able to use this
column as it is a name that represents this concept (not a hierarchical path) but again limiting factor here is that diagnoses, procedures, medications and lab tests all are represented as concepts and there is no way to distinguish which are diagnoses, procedures, etc with out utilizing concept_path column. So we have to find a way to break down the concept_path. Metadata column C_BASECODE maps to the concept_cd in the star schema tables. Currently working on whether we can use C_BASECODE to break down the concept_path into separate elements or parts.

Metadata Tables:

NOTE: Currently (in the i2b2 demo project) concepts are defined in a single metadata table. The metadata table 'Table_Access' is used to obtain a list of "Metadata" tables within the i2b2 data model. Each Metadata table within the data model is represented by a single row in this table. The primary identifier of each Metadata table in the Table_Access table is in the "c_table_cd" column. See screen capture below:

As seen in the screen capture there are 3 metadata tables, 'BIRN' for Clinical Trials, 'CUST' for Custom Metadata and 'i2b2' for Ontology. This single metadata table approach is employed by i2b2 to allow the user to create a hierarchical categorization of the different concepts within the database. Query terms are selected from these categorizations. This is where we have to start applying dimension modeling concept in order to come up with a OLAP solution. In other words we have to define one metadata table for each data type. Examples of data types could be: diagnoses, procedures, demographics, lab tests, health history, transfusion data, microbiology data and various types of genetics data.

The structure of the metadata would be integral to the visualization of concepts as well as for querying the data and all metadata tables must have the same basic structure. After we define one metadata table for each data type, the 'Table_Access' would look like as below and there would be a one-to-one mapping between the entries of the table_access table and the root level nodes for a given project.

And all Metadata tables will have below structure, although some of the columns can be omitted, we are keeping them for future use. 

TABLE [I2B2_xx](
          [C_HLEVEL] [int] NOT NULL,
          [C_FULLNAME] [varchar](700) NOT NULL,
          [C_SYNONYM_CD] [char](1) NOT NULL,
          [C_VISUALATTRIBUTES] [char](3) NOT NULL,
          [C_TOTALNUM] [int] NULL,
          [C_NAME] [varchar](2000) NOT NULL,
          [C_BASECODE] [varchar](50) NULL,
          [C_METADATAXML] [text] NULL,
          [C_FACTTABLECOLUMN] [varchar](50) NOT NULL,
          [C_TABLENAME] [varchar](50) NOT NULL,
          [C_COLUMNNAME] [varchar](50) NOT NULL,
          [C_COLUMNDATATYPE] [varchar](50) NOT NULL,
          [C_OPERATOR] [varchar](10) NOT NULL,
          [C_DIMCODE] [varchar](700) NOT NULL,
          [C_COMMENT] [text] NULL,
          [C_TOOLTIP] [varchar](900) NULL,
          [UPDATE_DATE] [datetime] NULL,
          [DOWNLOAD_DATE] [datetime] NULL,
          [IMPORT_DATE] [datetime] NULL,
          [SOURCESYSTEM_CD] [varchar](50) NULL,
          [VALUETYPE_CD] [varchar](50) NULL

[C_HLEVEL]: c_hlevel is the hierarchical level of the term. The term at the highest level of a hierarchy has a value of 0, the next level has a value of 1 and so on.    

NOTE: After we define one metadata table for each data type C_HLEVEL would always be 0.

[C_FULLNAME]: c_fullname is the hierarchical path that leads to the term.
[C_NAME]: c_name is the descriptive text value for the term. It is what is displayed in the user interface.

NOTE: After we define one metadata table for each data type we can use either C_FULLNAME or C_NAME for descriptive text value for the term.

[C_TOTALNUM]: c_totalnum is not used in i2b2.

[C_BASECODE]: c_basecode this is the term that describes the ontological concept. This may be an ICD9 code (for diagnoses), or an NDC code (for medications) or a LOINC code (for lab tests). Or it may be any number of other coding systems, even home-grown ones.

[C_FACTTABLECOLUMN]: c_facttablecolumn is the name of the column in the fact table (observation_fact) that holds the concept code.

[C_TABLENAME]: c_tablename is the name of the lookup table that holds the concept code and concept path.

NOTE: There is also a lookup table within i2b2 metadata model. How we can utilize this table after we define one metadata table for each data type is under investigation.

[C_COLUMNNAME]: c_columnname is the name of the field that holds the concept path.

NOTE: The field could be either C_FULLNAME or C_NAME.

[C_COLUMNDATATYPE]: c_columndatatype is either 'T' for text or 'N' for numeric and describes the datatype of the concept.

NOTE: Have to rework on this column as it affects [ValType_Cd], [TVal_Char], [NVal_Num] and [ValueFlag_Cd] columns of the Observation_Fact.

We have built the i2b2 source in the local machine starting with PMService. We can now directly observe implications of the changes made and react/correct accordingly.

Notes on the Meta-Metadata framework from DCRI

The Meta-Metadata framework/tool created by Swati and team at Duke Clinical Research Institute is to facilitate metadata management and expose the data dictionary and data for review through generic reports before migrating to a common data repository. As such, currently the tool does not help us with the research ETL and OLAP/Cube work we are doing for i2b2. We can, however, help them to create the tool in open source platform.

We haven't found much documentation on the tool in the NET, except for a MIDR doc which is attached (6-2010 HDWA Conference_Chakraborty.pdf).

NCBO Ontology Extraction Workflow (2-stage process) using NCBO Extraction Tool

As outlined in 'i2b2+Ontology+Tutorial.pdf', NCBO Extraction Tool can be used to extracts ontology from NCBO bioportal.
Using the tool we have found that NCBOOntologyExtractAll component of the tool downloads file for an ontology concept URL
(for, e.g.,,
extracts ontology concept attributes and stores in file system to be loaded to staging database table.

Loading of ontology concept attributes is manual process and after we load the file, NCBOOntologyProcessAll component of the tool usages
the staging database table to prepare concept attributes and store in file system to be loaded in the final metadata table. Again the metadata
file will have to be loaded in the final metadata table manually.

This process is refered to as '2-stage process'. We are currently working on how we can extend this to load individual ontology concept dimension
and how synchronization of metadata and concept_dimension works. As mentioned in the section 3.3 of the 'NCBO_Extraction_Users_Guide_1-0.pdf', we
need to synchronize concept_dimension table so it contains the terms in new metadata table using synchronize icon in the workbench, but how
synchronization actually works, i.e, message exchange between cells, SQL used are under investigation.

Below are screen capture of table_access and reflection of the metadata for ontology id 45221 (ICD-9 we have downloaded using the tool) in the schema workbench.

Notes on Mapping and BioPortal

Mapping among different terminologies appears to be a complex process as cited here and in other
But BioPortal provides a mapping service: 

Seems like we can design a mapping metadata and data tables populating data from NCBO in advance 
(or on the fly) and use these table in our ETL process.  

Mapping in NCBO is based on the UMLS (Unified Medical Language System) so there might not exists mapping between some codes
for example, between ICD9 and SNOMED.

BTW - in the link below:

Paul had replied that 'Mapping data is stored in the bioportal_ui database in a table called 'mappings'. The column definitions are pretty obvious if you take a look.'

Just curious -- how Sina has access to bioportal_ui database, is she a member of the BioPortal, a privileged user?