Overview
This report was put together by another student working with DoseNet, Richie Woo. Richie first worked with our group as a volunteer during the 2019 BASF at Oracle Park, and expressed an interest in becoming one of our summer interns following that introduction. Richie performed data analysis of the CO2 and air quality (PM 2.5 and PM 10) sensors that we use at several of our locations. DoseNet has sensors placed around the world, and Richie’s analysis of the data verified the continuing functionality (or lack thereof) of some of our sensors. Our in-house data was compared to publicly available data from BEACO2N (the BErkeley Atmospheric CO2 Observation Network) and PurpleAir, similar projects that collect CO2 and air quality data respectively. Here is his summary of this work!
Introduction
My project was centered around data and error analysis of Dosenet’s air quality and CO2 sensors as compared to the sensor networks of PurpleAir and UC Berkeley’s BEACO2N respectively, using Python. Data from sensors were to be compared and cleaned, then analyzed for trends in the data to find inconsistencies and errors.
Data Acquisition and Preparation
Air quality data was retrieved from Dosenet and PurpleAir’s respective download pages. Valid Dosenet sensors included Campolindo High School, the Etcheverry Hall Roof, Exploratorium, Harbor Bay, Miramonte High School, Pinewood School, and the University of Washington. PurpleAir sensors within ½ mile of each Dosenet location were identified, which typically amounted to 1-2 PurpleAir comparison sensors per Dosenet sensor. Specific sensor names can be found in Table 1.
For each sensor location, a timeframe was selected by first converting the timestamps from string format to Unix time for speed of computation, then finding the time range in which the two datasets overlapped. Any rows with null values were cut from the data. The time data was then converted to DateTime format and two merged datasets were created: one averaged hourly and one averaged daily. For the hourly dataset, two .csv files were saved, one with absolute differences and one with raw differences; the daily averages do not include a raw-difference file. Time ranges typically encompassed several months’ worth of data. The data is stored in “processed-air-quality-data”.
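The preparation steps above can be sketched in a few lines of Python. This is only an illustration using the standard library, not the code from the project notebooks: the function names, the `(timestamp, value)` row layout, and the timestamp format are all assumptions, and each input list is assumed to contain at least one non-null reading.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean

# Assumed timestamp format; the real download files may use a different one.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_unix(ts):
    """Parse a UTC timestamp string into Unix seconds."""
    return datetime.strptime(ts, TIME_FORMAT).replace(tzinfo=timezone.utc).timestamp()


def clean(rows):
    """Drop rows with null values and convert timestamps to Unix time."""
    return [(to_unix(ts), value) for ts, value in rows if value is not None]


def hourly_means(rows, start, end):
    """Average the readings inside [start, end], grouped by the hour they fall in."""
    buckets = defaultdict(list)
    for t, value in rows:
        if start <= t <= end:
            buckets[int(t // 3600) * 3600].append(value)
    return {hour: mean(values) for hour, values in buckets.items()}


def merge_hourly(dosenet_rows, purpleair_rows):
    """Return (hour, raw difference, absolute difference) for hours both sensors cover."""
    dosenet, purpleair = clean(dosenet_rows), clean(purpleair_rows)
    # The overlapping time range starts at the later first reading
    # and ends at the earlier last reading.
    start = max(min(t for t, _ in dosenet), min(t for t, _ in purpleair))
    end = min(max(t for t, _ in dosenet), max(t for t, _ in purpleair))
    d = hourly_means(dosenet, start, end)
    p = hourly_means(purpleair, start, end)
    merged = []
    for hour in sorted(d.keys() & p.keys()):
        raw = d[hour] - p[hour]  # negative when Dosenet reads lower than PurpleAir
        merged.append((hour, raw, abs(raw)))
    return merged
```

A daily version would use the same grouping with 86400 seconds in place of 3600.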
Two key locations for CO2 data comparison were Etcheverry Hall and the Exploratorium. Etcheverry Hall was chosen for its proximity to Dosenet’s home base in UC Berkeley, and the Exploratorium was chosen for its proximity to the ocean, where meaningful statistics on sensor degradation and data drift could be found due to the corrosive nature of the moist and salty environment. CO2 sensor data was retrieved from Dosenet and BEACO2N’s respective download pages. Only one BEACO2N sensor was needed for data comparison and analysis per Dosenet sensor.
The method for data preparation was very similar to that of the air quality data. However, no .csv files were generated (they can be easily generated from the source code) and the differences are raw rather than absolute. In addition, BEACO2N data had to be combed for values under 0, which indicated a sensor failure at the time. Time ranges for both datasets contained almost a year’s worth of data.
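As a rough illustration of the extra BEACO2N cleaning step, the filter below keeps only readings that are present and not below 0. The function name and the `(timestamp, co2_ppm)` row layout are hypothetical and are not taken from the project code.

```python
def drop_failed_readings(rows):
    """Keep only (timestamp, co2_ppm) rows whose value is present and not negative.

    Values under 0 are treated as sensor failures, as described in the text.
    """
    return [(ts, ppm) for ts, ppm in rows if ppm is not None and ppm >= 0]
```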
Analysis Procedure and Results
The first method of analysis was visual analysis using Plotly graphs. For each dataset, a graph of the data and a graph of the differences for each particulate matter category were generated. For the most part, Dosenet’s sensors detected fewer particles than PurpleAir’s sensors and were less sensitive. In addition, large spikes in particulate matter concentrations could sometimes overwhelm the sensor and give an inaccurate reading. Graphs can be generated on-demand from the “graphing” notebook. More graphs are available in RichieWooHoo’s fork of the dosenet-analysis repository on Github.
Histograms were produced from the raw hourly PM 2.5 averages using Plotly. Each histogram also includes the average of the data in its title. These histograms are useful for visually checking the distribution of the data. For the most part, the distributions of the differences between Dosenet and PurpleAir data were Gaussian and did not deviate far from 0, but Dosenet picked up fewer particles on average than PurpleAir. Generally, the Dosenet averages were no more than 5 µg/m³ below PurpleAir’s, showing that these sensors are relatively accurate. The most notable exception was the first Exploratorium sensor, where the Dosenet sensor picked up more particles than its PurpleAir counterpart.
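The numbers behind such a histogram can be sketched without any plotting library. The snippet below is an assumed, simplified stand-in for the Plotly histograms: it bins a list of Dosenet-minus-PurpleAir differences and returns their mean, which is the figure shown in each graph title. The function name and the 1 µg/m³ default bin width are illustrative choices.

```python
from collections import Counter
from math import floor
from statistics import mean


def difference_histogram(diffs, bin_width=1.0):
    """Count differences per bin (keyed by the bin's lower edge) and return their mean."""
    counts = Counter(floor(d / bin_width) * bin_width for d in diffs)
    return dict(sorted(counts.items())), mean(diffs)
```

A mean below 0 corresponds to the Dosenet sensor reading lower than PurpleAir on average, and a roughly symmetric set of counts around that mean is what a Gaussian-looking histogram would show.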
Comparison scatterplots were made, also using Plotly, to find the strength of the linear correlation between the two air quality sensor systems. For most sensors, r values were high, indicating a strong positive correlation, which is the ideal outcome. Etcheverry roof 2 and Harbor Bay had lower r values than the average, r = 0.69 and 0.777 respectively. Miramonte HS and Pinewood School did not have enough overlapping data to compare, and therefore resulted in an inaccurate r value. The second PurpleAir sensor for the University of Washington did not have any PM 2.5 data, so an r value could not be obtained.
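For reference, the r value discussed here is the Pearson correlation coefficient. A minimal sketch of how it can be computed from paired hourly averages is shown below; the function name is hypothetical, and this is not necessarily how the original notebooks computed it.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists of readings."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two points")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when one series is constant")
    return cov / sqrt(var_x * var_y)
```

With only a handful of overlapping points, r can swing widely from one extra reading, which is why the values for sensors with little overlapping data are not reliable.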
BEACO2N data and Dosenet CO2 data were compared only for the two points of interest listed in the Data Acquisition and Preparation section.
For the Exploratorium, comparison graphs were created to visually identify errors before other more detailed statistics were done. Most likely due to the nature of the location in which the sensor was placed, the Dosenet CO2 Sensor for the Exploratorium did not provide accurate or meaningful data, nor was a drift in the data present. The sensor should be replaced or recalibrated.
For the sensor at Etcheverry Roof, visual analysis from graphs showed significantly higher values from Dosenet compared to BEACO2N. Histograms show that Dosenet data is Gaussian with an average of 529 ppm whereas the more accurate BEACO2N data was heavily skewed left, at around 400 ppm. Scatterplots show a weak positive correlation with r = 0.230.
Discussion and Conclusion
- For air quality locations with enough data to make a meaningful analysis (which are Etcheverry roof, the Exploratorium, and the University of Washington) the Dosenet sensors picked up fewer particles on average, but the accuracy of the sensor is excellent. These sensors may need small adjustments over time, but provide an accurate assessment of air quality in the region.
- For Miramonte HS, Pinewood School, Campolindo HS, and Harbor Bay, more data must be used to provide an accurate analysis. Dosenet data had many holes, which made it difficult to find suitable time ranges or enough data points to analyze. In addition, the number of usable comparison datasets was reduced by a lack of available PurpleAir sensors (for instance at Harbor Bay) or by questionable sensors (as at Miramonte HS). These sensors must be analyzed at a future date.
- The CO2 sensor at the Exploratorium must be replaced or recalibrated. The data does not represent the CO2 levels in the area and should not be used for analysis on radiation levels. The CO2 sensor at Etcheverry Roof is slightly closer to the actual concentration, but is not accurate and should be recalibrated.
Overall, formatting and fitting the data was a larger challenge than expected. Many unexpected variables came up that made data comparison difficult. Initially, the belief was that Dosenet data was inaccurate and had a lot of inconsistencies, but after further analysis, the air quality data was more accurate than expected. It was also expected that most datasets would have many comparable data points, but after data cleaning and formatting that was not the case. In the future, it will be necessary to find more comparison data in order to build complete datasets spanning months or years.
We were very lucky to have Richie work with us! There is still plenty of work to be done, as Richie’s analysis showed us where there are some holes in our system, and provided some visualizations as to why it is integral to maintain our systems! Our DoseNet systems have many environmental sensors in them, and are made by students and researchers from the ground up! Analysis like this helps us evaluate the functioning of our network in relation to other local systems, and complements our focus on radiation data for a more complete, publicly available picture of the world around us. Thanks Richie!