Table of contents
- Nalys solution
- Project challenges
- Algorithms and technologies
- Key facts & figures
This article describes a TaaS project delivered to a customer. For this project a core team of Nalys engineers was created, supported by additional engineers who were available when needed. The team built a Qt application that performs video processing and includes an AI component that decides which actions to take.
This first section describes the mission: what the goal was, which problems the customer had, what the requirements were, and so on. Next, a summary of the project implementation is given. Then some of the challenges and difficulties encountered are discussed. Finally, the algorithms, technologies, and methodologies used are discussed in more depth.
The customer had a process in which an operator needed to install up to 100 devices at predefined positions. After a device was positioned, it needed to be configured. Because of the large number of devices, this was a very time-consuming task, and the sheer number of devices to install also made it likely that the operator would misconfigure one.
The customer therefore asked whether it was possible to automate this process, with the main objectives of reducing human error and eliminating the most time-consuming parts of the process.
About the project
This project started as a proof of concept in which Nalys had to demonstrate it could deliver a satisfying result. After this proof of concept, the project continued on a sprint basis.
The project had multiple requirements:
- Detection of the devices with a camera
- The application should run on a tablet
- Preferably, the built-in camera of the tablet should be used
- Automatic configuration of the devices
- The application should compensate for the operator slightly moving the camera while standing
The very first step was attaching a LED to each device. This LED could be triggered to blink when a "blink" command was received, giving a very visual way to detect the position of a device.
Two commands were implemented to blink the LED of a device: a command to blink the LED of one specific device, and a command to blink the LEDs of all devices. When the latter command is sent, the devices blink their LEDs one by one with a period of 47 ms, so that no more than one device illuminates its LED at any moment.
The second step performed in this project was research into technologies that could be used to achieve a solution. This included a deeper look into the OpenCV library and how its C++ interfaces are constructed.
As a third step, a test application was created to detect possible challenges. This first demo ran on a regular laptop with a Linux distribution and on a Raspberry Pi. The demo consisted of basic video capturing and a color detection algorithm, which gave an idea of the minimal hardware requirements for acceptable performance.
The fourth step was analyzing the detected challenges and creating a strategy to achieve the best performance results.
The solution that was implemented is as follows. The application is a cross-platform Qt application that can run on a tablet or a laptop. It has a main thread for the GUI, a thread that communicates with the devices, a thread that records frames received from the camera, two processing threads that apply algorithms to the received frames, and a mapping thread that maps the detected LEDs to a map containing the positions of the devices.
There are three reasons why threading is implemented in this application:
- If the processing and recording of the frames ran in the main thread, the GUI would be unresponsive while frames were processed, because recording is a blocking operation.
- The LEDs blink in a sequence with only 47 ms between each LED. This means the application should capture, process, and visualize each frame within 47 ms, which is not achievable in a single thread on a portable device.
- The target that runs the application has multiple CPU cores; by splitting the work across threads, the application can make use of all available cores.
The main thread contains the GUI and manages all other actions.
When the operator wants to detect all devices, a video stream is started from the camera, on which algorithms are applied to detect the blinking LEDs. The recorder thread is responsible for capturing all the frames from the camera.
When a frame is captured, the recorder sends it to one of the two processing threads for further processing.
There are two processing threads because processing the frames is the most computationally expensive part of the pipeline; dividing the load over two threads reduces the total processing time.
The tasks performed in the processing threads are:
- Initializing and updating the object tracker
- Frame optimization
- Determining the camera calibration
- Applying a brightness threshold on the gray image to keep only the light of the LED
- Emitting the position of the detected LED, the original frame, and the processed frame to the main thread for further processing
The position of the devices/LEDs is determined with the built-in camera of a tablet. The main advantage of a tablet is its mobility: you can take it anywhere. But this is also a big disadvantage, because determining precise LED positions requires a stable video feed. Therefore object tracking is needed to compensate for the movement of the tablet; without it, the position of the detected LEDs can no longer be guaranteed when the camera moves slightly.
Before actual LED detection on a video can be done, each frame needs some preprocessing. This is mainly to improve the performance of the application. Tasks included in these frame conversions are:
- Setting the resolution of the camera (only done once)
- Convert frame from RGB to HSV color space
- Convert frame to gray image (by taking brightness channel of HSV)
The camera of the tablet will be used in many different environments, which impacts the effectiveness of the detection algorithms. To compensate for this varying environment (dark, sunny, ...), the camera automatically calibrates for the current environment.
This calibration phase is the first thing done when the tablet starts recording. When the calibration is completed, the LEDs are triggered to start blinking.
The calibration includes configuring the exposure of the camera for the best frame quality. It also includes detecting fixed light sources such as lamps and windows, and creating a mask to subtract these bright spots from the received frames. Finally, it includes determining the brightness threshold above which a pixel is considered part of a flashing LED.
After the calibration is finished, the object tracker is up-to-date and the frame conversions are applied, the LED detection can start. This is done by removing the “bright spot mask” from the frame and applying the brightness threshold in order to exclusively keep pixels considered to be a LED.
From these pixels a position is extracted and emitted to the main thread for further processing.
The operator can upload a file containing a list of points that represent the theoretical positions of the devices. When all LEDs are detected, a mapping therefore needs to be made from the detected LEDs to these theoretical positions.
This mapping should take into account that:
- Devices may have been removed
- Devices may have been added
- Devices may have been moved
The mapping algorithm used is ICP (Iterative Closest Point), which tries to map one set of points onto another by minimizing the total distance between the sets.
Controller communication thread
This thread is responsible for all communication with the devices. One command is to blink a LED again, which is useful when not all devices are detected on the first try.
Another command is the one that actually configures the device.
For this project a custom layout was designed, so no default buttons, labels, and so on are used.
To achieve a solution for all project requirements a couple of challenges needed to be tackled.
Hardware choice: A thorough investigation was performed into the different kinds of tablets. The tablet should be robust enough for an industrial environment, the CPU must be able to perform all operations in real time, the camera should be of decent quality, ...
OS choice: The preferred OS is Linux, but a lot of tablets run on Windows. This is solved by making the application cross-platform.
False positives: When light sources other than the LEDs were in the camera's field of view, these were also detected as device LEDs. This was solved by adding a camera calibration step that creates a "bright spots" mask, which is subtracted from each frame.
Camera calibration: Depending on the environment in which the tablet is used the camera needs to be re-calibrated to have the optimal detection rate of the LEDs. A dark environment needs a longer camera exposure than a bright environment.
Platform dependencies: Depending on the platform the application runs on, some algorithms are faster than others. This especially influenced the decision to implement two processing threads and the choice of object tracking algorithm.
Algorithms and technologies
In the solution a couple of computer vision algorithms and methodologies are mentioned, such as the conversion from RGB to HSV and the MOSSE and KCF object tracking algorithms. This section goes over each methodology in more depth: how each algorithm works and why a certain type of algorithm was chosen.
In video and image processing, preprocessing your input is very important. Without preprocessing, the applied computer vision algorithms will be far less precise.
Video processing is very similar to image processing, because a video is composed of a sequence of images. Such an image is constructed out of pixels, and each pixel is represented by an amount of bytes.
A first preprocessing step that is often performed is downsizing the frame resolution. For computer vision algorithms there is no need to have frames with a resolution of 1080p.
There are different kinds of color spaces which define what the bytes of a pixel represent. Such color spaces are for example RGB (Red Green Blue), HSV (Hue Saturation Value), CMYK (Cyan Magenta Yellow Black), YUV, …
Typical preprocessing steps are:
- Downsizing the frame
- Calibrating the camera (when the camera is accessible)
- Converting to the most useful color space
- Converting to gray image
- Applying filters (Erosion, Dilation, …)
Each color space has its own properties, advantages, and disadvantages, and depending on the use case it can be better to convert between two color spaces. For example, in image analysis pixels are often converted from the typical RGB color model to the HSV color model. The most commonly used color spaces are RGB, HSV, and CMYK.
The RGB (Red Green Blue) color space is the most commonly used. The RGB model represents its colors in terms of the primary spectral color components R = 700 nm, G = 546.1 nm, and B = 435.8 nm. In this color space, 24 bits are needed to represent one pixel: 8 bits each for the amount of red, green, and blue, so each component is a value between 0 and 255.
The RGB color space is the standard in internet applications. It can represent 16,777,216 colors, but only 216 of those are considered "safe" colors: colors that can be displayed reasonably reliably, independent of the viewer's hardware.
A common variant of the RGB color space is the BGR color space. It is exactly the same, but the positions of the red and blue components are swapped.
The HSV (Hue Saturation Value) color space is an alternative representation of the RGB color space, designed to align more closely with how the human eye perceives color. It is a cylindrical (polar) color model in which the angle represents the hue, the radius the saturation, and the height the value:
- Hue: rotation on the circular color wheel
- Saturation: distance to the color white
- Value: the brightness or intensity of the color
In this color model the components do not all fit in a range of 0 to 255: the hue is represented by a value between 0 and 360, and the saturation and value by values between 0 and 100.
CMYK, Cyan Magenta Yellow Key (black), is used for printers. This color model refers to the four ink plates used in most color printers.
This color space is rarely, if ever, used in image analysis, but if you want to know more about it you can click on the link in the references.
Convert RGB to HSV
In the RGB color space, all components are correlated with the amount of light hitting the object: if the lighting changes, the values for red, green, and blue all change. The lighting and color information are thus intertwined. In computer vision you often want to separate the color information from the lighting information, which is why the RGB color space is very often converted to HSV.
Convert to gray image
In cases where color adds no extra value, a conversion to a gray image is often performed. If the frame is already in the HSV color space, this is very easy: remove all channels except the V channel, which represents the brightness.
When it's possible to calibrate the camera, the frame quality can be improved a lot. A camera can be used in many environments, and factors such as the lighting conditions impact the image quality. In a dark environment a higher camera exposure is useful, to let more light in; in a very bright environment it's better to have a low camera exposure.
The steps taken to calibrate the camera in this project are as follows:
- Convert frame to HSV color space
- Keep only the brightness channel of the frame
- Create a histogram of the brightness values
- Set exposure of the camera depending on the result of the histogram
The target camera captures frames in BGR format. The subsequent algorithms need to separate the color from the lighting information, so a conversion from BGR to HSV is required. For this project the color of the LEDs was irrelevant, so only the brightness channel was kept. A big advantage is that only 8 bits per pixel remain of the original 24 bits, which results in faster processing of the frames.
For each frame a histogram is taken from these brightness values. If there are too many bright pixels, a shorter camera exposure is set. This process is repeated until a good histogram is acquired.
When all previous steps are done, the resulting frame needs a final preprocessing step: some morphological operations. These can be used to remove noise, dilate detected blobs, ...
The output of the previous steps is a gray image, but for the final LED detection it's necessary to decide whether a pixel belongs to a LED, which is a binary decision. Therefore a threshold is applied on the gray image: all pixels with a value below the threshold are mapped to 0, and all pixels above it to 255.
After thresholding, some noise may remain; to remove it, filters such as erosion, dilation, opening, and closing can be used.
Choosing a correct threshold is a lot of trial and error, but having a good look at the brightness distribution in the histogram can already give a good indication what the value should be. In this example the threshold is set at 240.
To remove small, isolated bright spots, an erosion filter can be used. The inputs of an erosion filter are the source image and a "kernel": a matrix of a certain dimension filled with ones. The kernel is moved across the source image and centered on each pixel in turn. If all pixels beneath the kernel have the value 255, the center pixel is kept at 255; otherwise its value is changed to 0.
As a result, detected bright spots become smaller, and bright spots that are too small are completely removed.
After the erosion filter, the LEDs can already be detected, but because of the erosion they are very small. To compensate, a dilation filter is applied, which has the opposite effect of erosion.
A dilation filter also uses a matrix filled with ones, but when the kernel is moved across the image, the center pixel is set to 255 if any pixel beneath the kernel is 255.
Detection of the LEDs
After this preprocessing, the output is a binary frame in which the pixels belonging to a LED are white and all other pixels are black. From this binary frame, a couple of algorithms can detect these groups of white pixels and determine their size.
A first approach is the find contours method from OpenCV. This method results in a set of points that make up the contour of a detected blob. From this set of points, the x, y, width, and height of a bounding rectangle can be determined.
A second approach is the connected components method from OpenCV. This method results in a set of centroids and a set of stats: the centroids contain a point for each detected blob, and the stats contain more information about the size of each detected blob.
Both methods result in the same detection image:
What is object tracking
The simplest definition of object tracking is: locating an object in successive frames of a video. In computer vision, the term is very broad and covers ideas that are conceptually similar but technically different.
All of the following are studied under object tracking:
- Dense Optical Flow: Estimation of the motion vector for each pixel in a frame.
- Sparse Optical Flow: Tracking of the location of a few feature points in an image.
- Kalman Filtering: Predict location of a moving object, based on prior motion information.
- Meanshift and Camshift: Locating the maxima of a density function, e.g. tracking a color histogram in successive frames based on a confidence map.
- Object trackers: In this case multiple objects can be tracked in successive frames. These trackers are mostly used in conjunction with an object detector.
Why do object tracking
Why is tracking needed at all? It's possible to detect objects in each frame anew, so why track instead of simply running detection on every frame?
In this section we cover the following reasons:
- Tracking is faster than Detection
- Tracking can help when detection fails
- Tracking preserves identity
Where “tracking preserves identity” is the main reason for using object tracking in this project.
Tracking is faster than detection
This seems counterintuitive, but most tracking algorithms are faster than detection algorithms. Unlike detection algorithms, tracking algorithms have extra information available: a previous location, a motion model, an appearance model, ... All this information can be used to search only a specific section of the next frame, whereas detection starts from scratch for each frame.
Often these tracking algorithms work in conjunction with detection algorithms. An object can change over time, it can disappear behind an obstacle, and tracking algorithms can accumulate errors that need to be corrected. For all of those reasons it's useful to update the tracking algorithm with a new detection from time to time.
Tracking can help when detection fails
When the object of interest gets occluded, detection will fail. Tracking has information about the previous location, which can be used to handle some level of occlusion.
Tracking preserves identity
If only detection were used, there would be no correlation between the object detected in the previous frame and the one in the current frame. The object has no identity and could be at a completely different position in a following frame. Tracking provides a way to couple an identity to an object, so that an object "detected" in two frames can be recognized as the same one.
Investigated tracking algorithms
There are many different tracking algorithms, such as BOOSTING, MIL, KCF, MOSSE, and TLD, but in this project two were investigated and experimented with: KCF (Kernelized Correlation Filters) and MOSSE.
The first tracker discussed is KCF, one of the most accurate trackers available. It is very fast on desktop CPUs such as the Intel i5 and i7, reaching 300 frames per second or higher, but it is a factor of 10 to 20 slower on embedded platforms. Therefore the MOSSE algorithm was also investigated, which is faster, at 450 frames per second or higher, but less accurate.
Instead of having one sample of the object to track, KCF makes use of "bags" of positive and negative examples. The samples in the positive bag contain the tracked object, but also samples from its immediate neighborhood, in which the object of interest is no longer centered. In addition, KCF adds some transformations and translations of the object to the bag of positive samples.
This idea builds on the MIL algorithm, but KCF extends it by exploiting the fact that multiple samples in the positive bag overlap. This overlapping data has a couple of mathematical properties that the KCF tracker exploits to increase tracking speed and accuracy.
The MOSSE, Minimum Output Sum of Squared Errors, algorithm is based on adaptive correlation for its tracking. The MOSSE tracker needs a single frame to be initialized and is robust to lighting, scale, pose, and non-rigid deformations. It can also detect occlusion, which makes it able to recover when the object reappears.
The target's appearance is modeled by adaptive correlation filters, and tracking is performed via convolution.
Matching detected LEDs with ICP
This project results in two sets of points: a set of predefined LED positions and a set of detected LED positions. The second set can contain the same number of points as the predefined set, or more, or fewer. The goal is to match each detected LED to its predefined position in the first set, and preferably also to show which LEDs were not detected or which detections had no predefined position.
The algorithm used to perform this mapping is ICP (Iterative Closest Point). ICP searches for a translation and rotation that minimize the sum of squared errors between the source cloud and the target cloud. Because it minimizes the overall difference between all points, it can be used when the source and target clouds differ in size and when we don't know which point maps to which.
The complete mapping process consists of the following steps:
- Scaling the detected LED coordinates to the range of the predefined coordinates
- Aligning the centers of both sets
- Running ICP (Iterative Closest Point)
The first step makes sure the coordinate system is the same for both sets: the distance between the minimum and maximum x values is made equal for both, and likewise for the y-axis.
Second, the centers are aligned so that the center of the first set overlaps the center of the second set.
Finally, the ICP mapping algorithm is applied to match each point from set 1 to its corresponding point in set 2.
In this article we described a project in which LEDs needed to be detected with the camera of a tablet. A couple of computer vision algorithms were applied to achieve this, such as camera calibration, object tracking, color space conversions, and morphological operations. The application was created entirely in C++ with the Qt framework and the OpenCV library.
The different algorithms and technologies used in the project were also described, such as the various preprocessing steps as well as the object tracking and mapping algorithms.
Key facts and figures
- 50 to 100 LEDs need to be detected
- LEDs can be detected at distances of up to 15 meters
- This project was implemented over a period of 5 months