Mobile devices, including smartphones, have become extremely popular. They are ubiquitous tools in our daily lives, used for communication and computing. Furthermore, they have come to play an important role in sensing and monitoring various physical data using their built-in sensors, such as Global Positioning System (GPS) receivers and acceleration sensors. Indeed, many papers have reported automatic monitoring systems for physical data using mobile devices. For instance, Tamminen et al. recognized user actions using acceleration data. Roggen et al. implemented a system to recognize both the surrounding situation and the user's situation. Gyorbiro et al. recognized user activities by combining body sensors with mobile devices.
Turning our attention to sound as input data, we can readily perceive several noteworthy features. First, sound can provide a tremendous amount of information for understanding the context and circumstances of the real world. A captured picture or video might include surrounding scenery and objects such as trees, buildings, and tables. However, because images and videos describe only visible information, they are limited in their ability to convey higher-level semantics of the surrounding context. A captured audio segment, in contrast, can provide invisible information about our surroundings, such as “this place is crowded with many people,” “the wind is very strong,” or “there are many insects here.” Consequently, recognizing such sounds can lead to a higher level of context understanding. Second, although we often do not notice it, sounds exist around us at any time and in any place, and they can be captured automatically through a device's microphone. It is therefore much easier to collect surrounding sound data continuously without forcing users to perform any special actions.
From these viewpoints, we specifically aim at sound-based context recognition. The sounds that we address in this paper are those that exist widely in the real world around us: birdsongs, the chirping of insects, school chimes, traffic noise from cars and motorcycles, footsteps and announcements at a train station, the sound of children's laughter, and so on. For the purposes of this discussion, we define such sounds as “environmental sounds.” Some earlier reports [4, 5] specifically examine these sounds and describe high classification rates in various environments. These proposals used the Mel Frequency Cepstral Coefficient (MFCC) as their feature extraction method. MFCC is a traditional frequency-domain feature, but it presents some drawbacks. For example, it is difficult to extract correct feature values from sound sources that include noise. This problem markedly lowers recognition and classification performance when attempting to recognize sound mixtures such as environmental sounds. One means to improve recognition is to use a stereo microphone, as conventional studies have done. This might increase recognition capability because stereo recordings carry more information than monaural recordings. It is nevertheless unrealistic to rely on stereo microphones because almost all mobile devices are equipped only with monaural microphones.
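To make the MFCC pipeline used by these earlier systems concrete, the following is a minimal NumPy-only sketch of the standard computation (pre-emphasis, framing, power spectrum, mel filterbank, log, and DCT-II). The frame length, hop size, and filter counts below are common illustrative defaults, not the parameters of the cited studies.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters with centers evenly spaced on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):            # rising slope
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):           # falling slope
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc(signal, sr, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_coeffs=13):
    # Pre-emphasis boosts high frequencies attenuated in speech/audio capture.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Split into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    frames *= np.hamming(frame_len)
    # Per-frame power spectrum.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Log mel filterbank energies (small epsilon avoids log(0)).
    energies = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    # DCT-II decorrelates the log energies; keep the lower coefficients.
    dct_basis = np.cos(np.pi / n_filters *
                       (np.arange(n_filters) + 0.5)[None, :] *
                       np.arange(n_coeffs)[:, None])
    return energies @ dct_basis.T  # shape: (n_frames, n_coeffs)
```

With a 16 kHz input, the defaults above correspond to 25 ms frames with a 10 ms hop; the noise sensitivity noted in the text stems from the log filterbank step, where additive noise perturbs low-energy bands disproportionately.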
(Authors: Reona Mogi, Hiroyuki Kasai)